Classify.seqs

Revision as of 13:08, 24 November 2009 by Westcott

The classify.seqs command assigns each input sequence a taxonomy, based on a reference taxonomy, using one of several methods.

Options

  • fasta (fasta file for input sequences)
  • template (fasta file for template database)
  • taxonomy (the taxonomy file for the database)
  • search (values=blast, kmer, suffix; kmer default)
  • method (values=knn and bayesian; default bayesian)
  • ksize (default = 8)
  • match, mismatch, gapopen, gapextend (options for blast search)
  • cutoff (confidence cutoff; default=0.00)
  • numwanted (number of sequences to use in knn method; default = 10)
  • processors (number of processors to use; default = 1)

Output

  • taxonomy (each sequence is followed by a string indicating its taxonomy)
  • tax.summary (a textual hierarchy tree indicating the number of sequences that belong to each level of the hierarchy)
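The counting behind a summary of this kind can be sketched in Python. This is only an illustration: the function name `summarize`, the sample assignments, and the indented print layout are hypothetical and are not mothur's actual tax.summary file format.

```python
from collections import Counter


def summarize(taxonomies):
    """Count how many sequences fall under each node of the hierarchy."""
    counts = Counter()
    for tax in taxonomies:
        levels = [level for level in tax.split(";") if level]
        # Every prefix of the lineage is a node that this sequence belongs to.
        for depth in range(1, len(levels) + 1):
            counts[tuple(levels[:depth])] += 1
    return counts


# Made-up taxonomy strings standing in for the contents of a taxonomy file.
assignments = [
    "Bacteria;Bacteroidetes;Bacteroides;",
    "Bacteria;Bacteroidetes;Prevotella;",
    "Bacteria;Firmicutes;Clostridium;",
]

# Sorting the lineage tuples lists each parent before its children.
for node, n in sorted(summarize(assignments).items()):
    print("  " * (len(node) - 1) + f"{node[-1]}\t{n}")
```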

Methods

knn

When finding the taxonomy of a given query sequence in the fasta file, the knn method finds the closest matches in the template using the search method provided. It looks at the taxonomy of those closest matches and assigns the query sequence the taxonomy that all the closest sequences have in common.

mothur > classify.seqs(fasta=abrecovery.fasta, taxonomy=silva.full.taxonomy, template=silva.bacteria.fasta, method=knn)
         Reading in the silva.full.taxonomy taxonomy...	DONE.
         Generating search database...    DONE.
         It took 43 seconds generate search database. 
         Classifying sequence 100
         Classifying sequence 200
         It took 2 secs to classify 242 sequences.

numwanted

You can choose the number of closest matches to consider using the numwanted parameter.

mothur > classify.seqs(fasta=abrecovery.fasta, taxonomy=silva.full.taxonomy, template=silva.bacteria.fasta, method=knn, numwanted=5)
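
The consensus step of the knn method can be sketched in Python as follows. This is a minimal illustration and not mothur's code: the function name `consensus_taxonomy` and the neighbor data are hypothetical, the search step that finds the closest matches is assumed to have already run, and returning "unclassified;" when nothing is shared is an assumption of this sketch.

```python
def consensus_taxonomy(neighbor_taxonomies):
    """Return the taxonomy prefix shared by all neighbors (hypothetical helper)."""
    # Split each semicolon-delimited taxonomy string into its levels.
    split = [[level for level in tax.split(";") if level]
             for tax in neighbor_taxonomies]
    if not split:
        return "unclassified;"  # assumption: placeholder when there are no matches
    shared = []
    # Walk down the hierarchy level by level and stop at the first disagreement.
    for levels in zip(*split):
        if all(level == levels[0] for level in levels):
            shared.append(levels[0])
        else:
            break
    return ";".join(shared) + ";" if shared else "unclassified;"


# Taxonomies of the three closest template matches (made-up data, as if numwanted=3).
neighbors = [
    "Bacteria;Bacteroidetes;Bacteroides;Bacteroides_uniformis;",
    "Bacteria;Bacteroidetes;Bacteroides;Bacteroides_fragilis;",
    "Bacteria;Bacteroidetes;Bacteroides;Bacteroides_uniformis;",
]
print(consensus_taxonomy(neighbors))
```

Because the matches disagree at the last level, the sketch keeps only the levels they all share. A larger numwanted makes agreement at deep levels less likely, so the resulting taxonomy tends to be shorter.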

bayesian

When finding the taxonomy of a given query sequence in the fasta file, the bayesian method looks at the query sequence kmer by kmer. The only valid search option with the bayesian method is kmer, and by default mothur uses a kmer size of 8. The method looks at all taxonomies represented in the template and calculates the probability that a sequence from a given taxonomy would contain a specific kmer. It then calculates the probability that a query sequence belongs to a given taxonomy based on the kmers it contains, and assigns the query sequence to the taxonomy with the highest probability. This method also runs a bootstrapping algorithm to find the confidence limit of the assignment by randomly choosing, with replacement, 1/8 of the kmers in the query and then finding the taxonomy.


mothur > classify.seqs(fasta=abrecovery.fasta, taxonomy=silva.full.taxonomy, template=silva.bacteria.fasta, method=bayesian)
         Reading in the silva.full.taxonomy taxonomy...	DONE.
         Generating search database...    DONE.
         It took 43 seconds generate search database. 
         Reading template probabilities...     DONE.
         It took 21 seconds get probabilities. 
         Classifying sequence 100
         Classifying sequence 200
         It took 123 secs to classify 242 sequences.
Example output:
AY457906	Bacteria(100);Bacteroidetes-Chlorobi(100);Bacteroidetes(100);Bacteroides-Prevotella(100);Bacteroides(100);Bacteroides_uniformis(63);
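
The idea behind the bayesian method can be sketched in Python. This is a simplified illustration under stated assumptions, not mothur's implementation: all function names are hypothetical, each taxonomy string is treated as one flat class (so no per-level confidences are computed), the constant pseudocount of 0.5 is a simplification, and the demo uses a kmer size of 3 on made-up sequences so that it stays readable.

```python
import math
import random
from collections import defaultdict


def kmers(seq, ksize):
    """Return the set of distinct kmers in a sequence."""
    return {seq[i:i + ksize] for i in range(len(seq) - ksize + 1)}


def train(templates, ksize):
    """templates: list of (sequence, taxonomy) pairs."""
    counts = defaultdict(lambda: defaultdict(int))  # taxonomy -> kmer -> sequences containing it
    sizes = defaultdict(int)                        # taxonomy -> number of template sequences
    for seq, tax in templates:
        sizes[tax] += 1
        for kmer in kmers(seq, ksize):
            counts[tax][kmer] += 1
    return counts, sizes


def best_taxonomy(query_kmers, counts, sizes):
    """Pick the taxonomy with the highest summed log-probability of the kmers."""
    def score(tax):
        # Pseudocount of 0.5 (an assumption) keeps unseen kmers from giving log(0).
        return sum(math.log((counts[tax].get(k, 0) + 0.5) / (sizes[tax] + 1))
                   for k in query_kmers)
    return max(sizes, key=score)


def classify(query, counts, sizes, ksize=8, iters=100, seed=0):
    """Return (taxonomy, bootstrap confidence in percent)."""
    query_kmers = sorted(kmers(query, ksize))
    if not query_kmers:
        raise ValueError("query is shorter than ksize")
    assignment = best_taxonomy(query_kmers, counts, sizes)
    # Bootstrap: resample 1/8 of the query kmers with replacement and reclassify.
    rng = random.Random(seed)
    sample_size = max(1, len(query_kmers) // 8)
    hits = sum(
        best_taxonomy(rng.choices(query_kmers, k=sample_size), counts, sizes) == assignment
        for _ in range(iters)
    )
    return assignment, 100 * hits // iters


# Made-up template sequences and taxonomies.
templates = [
    ("AAAAAAAACCCC", "Bacteria;TaxonA;"),
    ("GGGGGGGGTTTT", "Bacteria;TaxonB;"),
]
counts, sizes = train(templates, ksize=3)
print(classify("AAAAAACC", counts, sizes, ksize=3))
```

The confidence is the percentage of bootstrap replicates that agree with the assignment made from the full set of kmers, which corresponds to the numbers in parentheses in the example output.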

cutoff

The cutoff parameter truncates the taxonomy at the point where the confidence for an assignment falls below the cutoff. By default, the cutoff is 0.

mothur > classify.seqs(fasta=abrecovery.fasta, taxonomy=silva.full.taxonomy, template=silva.bacteria.fasta, method=bayesian, cutoff=80)

With this cutoff, the above example output becomes:

AY457906	Bacteria(100);Bacteroidetes-Chlorobi(100);Bacteroidetes(100);Bacteroides-Prevotella(100);Bacteroides(100);
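
The truncation can be reproduced on the example output with a short Python sketch. The function name `apply_cutoff` is hypothetical, and the sketch assumes every level is written as Name(confidence); as in the example line.

```python
import re


def apply_cutoff(taxonomy, cutoff):
    """Truncate a 'Name(conf);' taxonomy string at the first level below cutoff."""
    kept = []
    for name, conf in re.findall(r"([^;()]+)\((\d+)\);", taxonomy):
        if int(conf) < cutoff:
            break  # drop this level and everything beneath it
        kept.append(f"{name}({conf});")
    return "".join(kept)


line = ("Bacteria(100);Bacteroidetes-Chlorobi(100);Bacteroidetes(100);"
        "Bacteroides-Prevotella(100);Bacteroides(100);Bacteroides_uniformis(63);")
print(apply_cutoff(line, 80))
```

With a cutoff of 80, the Bacteroides_uniformis(63) level is dropped; with the default cutoff of 0, the line is returned unchanged.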


processors

The processors parameter allows you to run the command with multiple processors. By default, processors is 1. Use of multiple processors is not available for Windows users.

mothur > classify.seqs(fasta=abrecovery.fasta, taxonomy=silva.full.taxonomy, template=silva.bacteria.fasta, method=bayesian, processors=2)
         Reading in the silva.full.taxonomy taxonomy...	DONE.
         Generating search database...    DONE.
         It took 42 seconds generate search database. 
         Reading template probabilities...     DONE.
         It took 21 seconds get probabilities. 
         Classifying sequence 100
         Classifying sequence 100
         It took 69 secs to classify 242 sequences.


The knn and bayesian methods are modeled on algorithms in: Qiong Wang, George M. Garrity, James M. Tiedje, and James R. Cole. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Center for Microbial Ecology and Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan 48824. Received 10 January 2007/Accepted 18 June 2007.