Difference between revisions of "Classify.seqs"

From mothur
(New page: Classify sequences according to a taxonomy corresponding to a reference taxonomy using one of several methods ==Options== * candidate (fasta file for input sequences) * template (fasta fi...)
 
Line 2: Line 2:
  
 
==Options==
 
==Options==
* candidate (fasta file for input sequences)
+
* fasta (fasta file for input sequences)
* template (fasta file for template datbase corresponding to sequences provided in the hierarchy report)
+
* template (fasta file for template database)
* alignreport (the align.report file from align.seqs)
+
 
* taxonomy (the taxonomy file for the database)
 
* taxonomy (the taxonomy file for the database)
 
* search (values=blast, kmer, suffix; kmer default)
 
* search (values=blast, kmer, suffix; kmer default)
* ksize (takes an int between 5 and 10; 8 default)
+
* method (values=knn and bayesian; default bayesian)
* hits (takes an int; default=1)
+
* ksize (default = 8)
* consensus (takes a float; default=0.50)
+
* match, mismatch, gapopen, gapextend (options for blast search)
* confidence (takes a float; default=0.00)
+
* cutoff (confidence cutoff; default=0.00)
 
+
* numwanted (number of sequences to use in knn method; default = 10)
 
+
* processors (number of processors to use; default = 1)
==Required==
+
see below
+
  
  
 
==Output==
 
==Output==
 
* taxonomy (each sequence is followed by a string indicating its taxonomy)
 
* taxonomy (each sequence is followed by a string indicating its taxonomy)
* heirarchy (a textual hierarchy tree indicating the number of sequences that belong to each level of the hierarchy)
+
* tax.summary (a textual hierarchy tree indicating the number of sequences that belong to each level of the hierarchy)
 
+
 
+
==Algorithm==
+
 
+
===Option 1===
+
alignreport and hierarchy provided; no other options are meaningful:
+
 
+
# Read the taxonomy file into a map<string,string> where the key is the template sequence name and the value is the taxonomy string.
+
#Open the alignreport file and for each sequence:
+
#* Get the template name (third column)
+
#* Find the template name in the taxonomy map
+
#* Output the candidate sequence name and the taxonomy string
+
# Close both taxonomy files.
+
# Generate the database hierarchy from the hierarchy file and store it in RAM.
+
# Read in the new taxonomy file and count the number of times a sequence corresponds to each level of the hierarchy.
+
# Print out the hierarchy so that only non-zero counts are reported
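The lookup at the core of Option 1 can be sketched in Python as follows. This is only an illustration, not mothur's implementation. It assumes the taxonomy file holds one whitespace-separated name and taxonomy string per line, and that the align report is tab-delimited with a header row, the candidate name in the first column, and the template name in the third column. The function names and the sample data are hypothetical.

```python
import csv
import io


def load_taxonomy(handle):
    """Map each template sequence name to its taxonomy string."""
    taxonomy = {}
    for line in handle:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, tax = parts
        taxonomy[name] = tax
    return taxonomy


def classify_from_report(report_handle, taxonomy, template_column=2):
    """Yield (candidate name, taxonomy string) for each report row.

    Assumes a tab-delimited report with one header row; the template
    name is read from the third column (index 2).
    """
    rows = csv.reader(report_handle, delimiter="\t")
    next(rows, None)  # skip the assumed header row
    for row in rows:
        if len(row) <= template_column:
            continue
        candidate, template = row[0], row[template_column]
        yield candidate, taxonomy.get(template, "unknown")


# Hypothetical in-memory stand-ins for the two input files.
tax_file = io.StringIO(
    "tmplA\tBacteria;Firmicutes;\n"
    "tmplB\tBacteria;Proteobacteria;\n"
)
report = io.StringIO(
    "QueryName\tQueryLength\tTemplateName\n"
    "seq1\t250\ttmplB\n"
    "seq2\t248\ttmplA\n"
)

taxonomy = load_taxonomy(tax_file)
results = list(classify_from_report(report, taxonomy))
for candidate, tax in results:
    print(candidate, tax, sep="\t")
```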
+
 
+
 
+
===Option 2===
+
candidate, template, and hierarchy provided; search and ksize may be provided:
+
 
+
# Depending on the search method selected generate search database.
+
# Read the taxonomy file into a map<string,string> where the key is the template sequence name and the value is the taxonomy string.
+
# Read through the candidate sequences without pulling them into RAM and…
+
#* For each sequence use database method to find closest match.
+
#* Find the template name in the taxonomy map
+
#* Output the candidate sequence name and the taxonomy string
+
# Close both taxonomy files and the candidate file, clear out the search database.
+
# Generate the database hierarchy from the hierarchy file and store it in RAM.
+
# Read in the new taxonomy file and count the number of times a sequence corresponds to each level of the hierarchy.
+
# Print out the hierarchy so that only non-zero counts are reported
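For the "find closest match" step of Option 2, a k-mer search could look roughly like the Python sketch below. It is a simple stand-in that picks the template sharing the most distinct k-mers with the candidate; it is not mothur's actual kmer, suffix, or blast search. The function names, the tie-breaking rule, and the toy sequences are assumptions, and the demo uses ksize=4 only to keep the example short.

```python
from collections import Counter


def kmers(seq, ksize=8):
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + ksize] for i in range(len(seq) - ksize + 1)}


def build_kmer_index(templates, ksize=8):
    """Build the search database: k-mer -> names of templates containing it."""
    index = {}
    for name, seq in templates.items():
        for word in kmers(seq, ksize):
            index.setdefault(word, set()).add(name)
    return index


def closest_template(candidate, index, ksize=8):
    """Return the template sharing the most k-mers with the candidate."""
    counts = Counter()
    for word in kmers(candidate, ksize):
        for name in index.get(word, ()):
            counts[name] += 1
    if not counts:
        return None  # no k-mer in common with any template
    # Highest shared count wins; ties are broken by name (an assumption).
    return min(counts, key=lambda name: (-counts[name], name))


# Hypothetical toy data.
templates = {"tmplA": "ACGTACGTGGCC", "tmplB": "TTTTGGGGCCCCAAAA"}
taxonomy = {"tmplA": "Bacteria;Firmicutes;", "tmplB": "Bacteria;Proteobacteria;"}

index = build_kmer_index(templates, ksize=4)
match = closest_template("GGGGCCCCAA", index, ksize=4)
print("seq1", taxonomy.get(match, "unknown"), sep="\t")
```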
+
 
+
 
+
===Option 3===
+
candidate, template, and hierarchy provided and hits>1; search, ksize, and consensus may be provided:
+
 
+
# Depending on the search method selected generate search database.
+
# Read the taxonomy file into a map<string,string> where the key is the template sequence name and the value is the taxonomy string.
+
# Read through the candidate sequences without pulling them into RAM and for each sequence:
+
#* Use database method to find closest N matches (hits=N) – will need to modify the search command.
+
#* Find the N template names and their taxonomy strings
+
#* Based on consensus generate a consensus taxonomy string for the N matches
+
# Close both taxonomy files and the candidate file, clear out the search database.
+
# Generate the database hierarchy from the hierarchy file and store it in RAM.
+
# Read in the new taxonomy file and count the number of times a sequence corresponds to each level of the hierarchy.
+
# Print out the hierarchy so that only non-zero counts are reported
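The consensus step in Option 3 is not specified in detail here, so the Python sketch below shows one plausible reading. It walks down the hierarchy level by level and keeps a taxon only while the fraction of the N hits supporting it is at least the consensus value. The ">=" comparison, the use of N as the denominator at every level, and the "unknown;" fallback are all assumptions, as is the function name.

```python
from collections import Counter


def consensus_taxonomy(tax_strings, consensus=0.50):
    """Build a consensus taxonomy string from the taxonomy strings of N hits."""
    lineages = [[t for t in s.split(";") if t] for s in tax_strings]
    total = len(lineages)
    result = []
    level = 0
    while lineages:
        names = [lin[level] for lin in lineages if len(lin) > level]
        if not names:
            break  # no hit is annotated this deep
        taxon, count = Counter(names).most_common(1)[0]
        if count / total < consensus:
            break  # support too weak; stop at the previous level
        result.append(taxon)
        # Only hits that agree so far are considered at the next level.
        lineages = [
            lin for lin in lineages
            if len(lin) > level and lin[level] == taxon
        ]
        level += 1
    return ";".join(result) + ";" if result else "unknown;"


# Hypothetical taxonomy strings for three hits (hits=3).
hits = [
    "Bacteria;Firmicutes;Bacilli;",
    "Bacteria;Firmicutes;Clostridia;",
    "Bacteria;Proteobacteria;Gammaproteobacteria;",
]
print(consensus_taxonomy(hits, consensus=0.50))
```

In this example all three hits agree on Bacteria and two of three agree on Firmicutes, but no class reaches half of the hits, so the consensus stops at the phylum level.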
+
 
+
  
===Option 4===
+
KNN and Bayesian methods modeled from algorithms in
candidate, template, and hierarchy provided and default!=0; ksize and confidence may be provided
+
Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences
 +
into the New Bacterial Taxonomy
 +
Qiong Wang, George M. Garrity, James M. Tiedje, and James R. Cole
 +
Center for Microbial Ecology and Department of Microbiology and Molecular Genetics, Michigan State University,
 +
East Lansing, Michigan 48824
 +
Received 10 January 2007/Accepted 18 June 2007

Revision as of 15:02, 17 November 2009

Classify sequences against a reference taxonomy using one of several methods

Options

  • fasta (fasta file for input sequences)
  • template (fasta file for template database)
  • taxonomy (the taxonomy file for the database)
  • search (values=blast, kmer, suffix; default=kmer)
  • method (values=knn, bayesian; default=bayesian)
  • ksize (default=8)
  • match, mismatch, gapopen, gapextend (options for blast search)
  • cutoff (confidence cutoff; default=0.00)
  • numwanted (number of sequences to use in knn method; default=10)
  • processors (number of processors to use; default=1)


Output

  • taxonomy (each sequence is followed by a string indicating its taxonomy)
  • tax.summary (a textual hierarchy tree indicating the number of sequences that belong to each level of the hierarchy)
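
To illustrate how the two outputs relate, the following Python sketch derives per-level counts from taxonomy lines. It assumes each taxonomy line is a sequence name, whitespace, then a semicolon-delimited taxonomy string. The indented layout printed here is only illustrative and is not claimed to match the real tax.summary format; the function names and sample lines are hypothetical.

```python
from collections import Counter


def count_levels(taxonomy_lines):
    """Count how many sequences fall under each node of the hierarchy."""
    counts = Counter()
    for line in taxonomy_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        taxa = [t for t in parts[1].strip().split(";") if t]
        # A sequence counts once for every level of its lineage.
        for depth in range(1, len(taxa) + 1):
            counts[tuple(taxa[:depth])] += 1
    return counts


def format_summary(counts):
    """Render the counts as an indented tree, one node per line."""
    lines = []
    for path in sorted(counts):  # a parent path sorts before its children
        indent = "  " * (len(path) - 1)
        lines.append(f"{indent}{path[-1]}\t{counts[path]}")
    return "\n".join(lines)


# Hypothetical taxonomy output for three sequences.
taxonomy_lines = [
    "seq1\tBacteria;Firmicutes;Bacilli;",
    "seq2\tBacteria;Firmicutes;",
    "seq3\tBacteria;Proteobacteria;",
]
counts = count_levels(taxonomy_lines)
summary = format_summary(counts)
print(summary)
```

Because only observed lineages are counted, nodes with zero sequences never appear in the summary.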

The KNN and Bayesian methods are modeled on the algorithms in:

Qiong Wang, George M. Garrity, James M. Tiedje, and James R. Cole. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Center for Microbial Ecology and Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan 48824. Received 10 January 2007; accepted 18 June 2007.
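
As a rough guide to what the bayesian method computes, here is a Python sketch of the word-based scoring described in the cited paper: a word prior (n(w) + 0.5) / (N + 1), a per-taxon conditional probability (m(w) + prior) / (M + 1), and a bootstrap confidence estimated from random subsets of one-eighth of the query words. It is a simplified illustration, not mothur's code. Grouping by the full taxonomy string rather than by genus, sampling with replacement, the seed, and all function names are assumptions, and the demo uses ksize=4 on toy sequences.

```python
import math
import random
from collections import defaultdict


def words(seq, ksize=8):
    """Return the set of distinct k-mers (words) in a sequence."""
    seq = seq.upper()
    return {seq[i:i + ksize] for i in range(len(seq) - ksize + 1)}


def train(templates, taxonomy, ksize=8):
    """Collect word counts over all templates and within each taxon."""
    seqs_with_word = defaultdict(int)                    # n(w)
    taxon_size = defaultdict(int)                        # M per taxon
    taxon_word = defaultdict(lambda: defaultdict(int))   # m(w) per taxon
    for name, seq in templates.items():
        taxon = taxonomy[name]
        taxon_size[taxon] += 1
        for w in words(seq, ksize):
            seqs_with_word[w] += 1
            taxon_word[taxon][w] += 1
    return len(templates), seqs_with_word, taxon_size, taxon_word


def log_score(query_words, taxon, model):
    """Sum of log P(w | taxon) over the query words."""
    n_total, seqs_with_word, taxon_size, taxon_word = model
    score = 0.0
    for w in query_words:
        prior = (seqs_with_word.get(w, 0) + 0.5) / (n_total + 1)
        cond = (taxon_word[taxon].get(w, 0) + prior) / (taxon_size[taxon] + 1)
        score += math.log(cond)
    return score


def classify(seq, model, ksize=8, iters=100, seed=0):
    """Return (best taxon, bootstrap confidence between 0 and 1)."""
    query = sorted(words(seq, ksize))
    if not query:
        return None, 0.0
    taxa = sorted(model[2])

    def pick(word_list):
        return max(taxa, key=lambda t: log_score(word_list, t, model))

    best = pick(query)
    rng = random.Random(seed)
    sample_size = max(1, len(query) // 8)
    # Count how often a random word subset still selects the same taxon.
    hits = sum(
        pick(rng.choices(query, k=sample_size)) == best for _ in range(iters)
    )
    return best, hits / iters


# Hypothetical toy reference data.
templates = {
    "t1": "ACGTACGTACGTAAAA",
    "t2": "ACGTACGTACGTCCCC",
    "t3": "GGGGTTTTGGGGTTTT",
}
taxonomy = {
    "t1": "Bacteria;Firmicutes;",
    "t2": "Bacteria;Firmicutes;",
    "t3": "Bacteria;Proteobacteria;",
}

model = train(templates, taxonomy, ksize=4)
best, confidence = classify("TTTTGGGGTTTT", model, ksize=4)
print(best, confidence)
```

A caller could then compare the returned confidence with the cutoff option and report the assignment only when the confidence meets it; how mothur applies cutoff internally is not described on this page.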