We will be offering mothur and R workshops throughout 2019. Learn more.

Difference between revisions of "Classify.seqs"

From mothur
Jump to: navigation, search
(method=bayesian)
Line 233: Line 233:
  
 
Note that with one processor it took about 79 sec to do the actual classification.
 
Note that with one processor it took about 79 sec to do the actual classification.
 +
 +
==name==
 +
 +
The name parameter allows you add a names file with your fasta file, if you enter multiple fasta files, you must enter matching names files for them.
  
 
[[Category:Commands]]
 
[[Category:Commands]]

Revision as of 20:46, 26 January 2010

The classify.seqs command allows the user to use several different methods to assign their sequences to the taxonomy outline of their choice. Current methods include using a k-nearest neighbor consensus and Bayesian approach. Taxonomy outlines and reference sequences can be obtained from the taxonomy outline page. The command requires that you provide a fasta-formatted input and database sequence file and a taxonomy file for the reference sequences. To complete this tutorial, you are encouraged to obtain the AbRecovery dataset.

Output

mothur will output two files from the classify.seqs command: a *.taxonomy file which contains a taxonomy string for each sequence and a *.tax.summary file which contains a taxonomic outline indicating the number of sequences that were found for your collection at each level. For example, abrecovery.taxonomy may look like:

AY457915	Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(91);Eubacterium_eligens_et_rel.(89);Lachnospira_pectinoschiza(80);
AY457914	Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(95);Eubacterium_eligens_et_rel.(92);Eubacterium_eligens(84);Eubacterium_eligens(81);
AY457913	Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Roseoburia_et_rel.(97);Roseoburia_et_rel.(97);Eubacterium_ramulus_et_rel.(90);uncultured(90);
AY457912	Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(99);
AY457911	Bacteria(100);Firmicutes(99);Clostridiales(98);Ruminococcus_et_rel.(96);Anaerofilum-Faecalibacterium(92);Faecalibacterium(92);Faecalibacterium_prausnitzii(90);

or

AY457915	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Lachnospira_pectinoschiza;
AY457914	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Eubacterium_eligens;Eubacterium_eligens;
AY457913	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;Eubacterium_ramulus_et_rel.;uncultured;
AY457912	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;
AY457911	Bacteria;Firmicutes;Clostridiales;Ruminococcus_et_rel.;Anaerofilum-Faecalibacterium;Faecalibacterium;Faecalibacterium_prausnitzii;

This output indicates the sequence identifier in the first column as well as it's taxonomy. In the former case, bootstrap values are provided. In the latter case, they are either not calculated (method=knn) or suppressed (method=bayesian, probs=F).

The second output file, abrecovery.tax.summary may look something like the following:

0	0	Root	1	242
1	0.1	Bacteria	5	242
2	0.1.1	Actinobacteria	1	13
3	0.1.1.1	Actinomycetaceae-Bifidobacteriaceae	1	13
4	0.1.1.1.1	Bifidobacteriaceae	1	13
5	0.1.1.1.1.1	Bifidobacterium_choerinum_et_rel.	0	13
2	0.1.2	Bacteroidetes-Chlorobi	1	81
3	0.1.2.1	Bacteroidetes	1	81
4	0.1.2.1.1	Bacteroides-Prevotella	1	55
5	0.1.2.1.1.1	Bacteroides	2	55
6	0.1.2.1.1.1.1	Bacteroides_acidofaciens	0	10
6	0.1.2.1.1.1.2	Bacteroides_vulgatus	0	28
2	0.1.3	Beta_Gammaproteobacteria	2	41
3	0.1.3.1	Betaproteobacteria	1	12
4	0.1.3.1.1	Burkholderiales	1	9
5	0.1.3.1.1.1	Comamonas_et_rel.	0	1
3	0.1.3.2	Gammaproteobacteria_1	1	29
4	0.1.3.2.1	Enterobacteriales	2	29
5	0.1.3.2.1.1	Escherichia_et_rel.	1	27
6	0.1.3.2.1.1.1	Escherichia_et_rel.	2	25
7	0.1.3.2.1.1.1.1	Citrobacter-Escherichia_-Salmonella-Shigella	0	1
7	0.1.3.2.1.1.1.2	Klebsiella_pneumoniae_et_rel.	0	2
5	0.1.3.2.1.2	Pasteurellales_et_rel.	1	2
6	0.1.3.2.1.2.1	Pasteurellales	0	2
2	0.1.4	Firmicutes	3	106
3	0.1.4.1	Bacillales_Mollicutes	0	1
3	0.1.4.2	Clostridia	1	1
4	0.1.4.2.1	Acidaminococcaceae	0	1
3	0.1.4.3	Clostridiales	3	103
4	0.1.4.3.1	Acetobacterium_et_rel.	0	2
4	0.1.4.3.2	Johnsonella_et_rel.	1	41
5	0.1.4.3.2.1	Johnsonella_et_rel.	2	41
6	0.1.4.3.2.1.1	Roseoburia_et_rel.	1	6
7	0.1.4.3.2.1.1.1	Roseoburia_et_rel.	0	6
6	0.1.4.3.2.1.2	Ruminocoocus	0	2
4	0.1.4.3.3	Ruminococcus_et_rel.	1	55
5	0.1.4.3.3.1	Anaerofilum-Faecalibacterium	1	55
6	0.1.4.3.3.1.1	Faecalibacterium	1	55
7	0.1.4.3.3.1.1.1	Faecalibacterium_prausnitzii	0	53
2	0.1.5	uncultured_OD1-OP11-WS6-TM7	1	1
3	0.1.5.1	TM7	1	1
4	0.1.5.1.1	uncultured	0	1

The first column indicates the taxonomic level in the outline. Obviously, the Root is the highest one can go. In this case the deepest any of the sequences go is to level 7. The second column indicates the "pedigree" for each lineage. The third column is the name of the lineage. Column four indicates the number of children lineages that the current lineage has. Finally, the last column indicates the number of sequences that were found in that lineage.


method=bayesian

When finding the taxonomy of a given query sequence in the fasta file, the bayesian method looks at the query sequence kmer by kmer. The method looks at all taxonomies represented in the template, and calculates the probability a sequence from a given taxonomy would contain a specific kmer. Then calculates the probability a query sequence would be in a given taxonomy based on the kmers it contains, and assign the query sequence to the taxonomy with the highest probability. This method also runs a bootstrapping algorithmn to find the confidence limit of the assignment by randomly choosing with replacement 1/8 of the kmers in the query and then finding the taxonomy. This is the method that is implemented by the RDP and is described by Want et al. This is the default method in classify.seqs.

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy)

or

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=bayesian) 

Reading in the silva.slv.taxonomy taxonomy...	DONE.
Generating search database...    DONE.
It took 5 seconds generate search database. 
Reading template probabilities...     DONE.
It took 3 seconds get probabilities. 
Classifying sequence 100
Classifying sequence 200

It took 75 secs to classify 242 sequences.


ksize

The only valid search option with the bayesian method is kmer and by default mothur uses kmer size 8. If you would like to change the kmer size you can use the ksize option:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, ksize=6)


iters

The iters option allows you to specify how many iterations to do when calculating the bootstrap confidence score for your taxonomy. The default is 100.

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, iters=1000)


cutoff

By default, the classify.seqs will return a full taxonomy for every sequence, regardless of the bootstrap value for that taxonomic assignment. For example, running the default parameters will yield the following output:

AY457915	Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(98);Johnsonella_et_rel.(98);Johnsonella_et_rel.(92);Eubacterium_eligens_et_rel.(89);Lachnospira_pectinoschiza(82);
AY457914	Bacteria(100);Firmicutes(99);Clostridiales(97);Johnsonella_et_rel.(97);Johnsonella_et_rel.(97);Johnsonella_et_rel.(94);Eubacterium_eligens_et_rel.(91);Eubacterium_eligens(82);Eubacterium_eligens(79);
AY457913	Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Roseoburia_et_rel.(100);Roseoburia_et_rel.(100);Eubacterium_ramulus_et_rel.(89);uncultured(89);
AY457912	Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(92);Eubacterium_eligens_et_rel.(81);Lachnospira_pectinoschiza(37);
AY457911	Bacteria(100);Firmicutes(99);Clostridiales(98);Ruminococcus_et_rel.(93);Anaerofilum-Faecalibacterium(91);Faecalibacterium(91);Faecalibacterium_prausnitzii(90);

You will notice that sequence AY457912 has a bootstrap value of 37% for the assignment to the Lachnospira pectinoschiza. This isn't much of a vote of confidence for this assignment. The current thinking seems to be to use a minimum cutoff of 60%. Let's go crazy and set a value of 80%:


mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, cutoff=80)
AY457915	Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(91);Eubacterium_eligens_et_rel.(89);Lachnospira_pectinoschiza(80);
AY457914	Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(95);Eubacterium_eligens_et_rel.(92);Eubacterium_eligens(84);Eubacterium_eligens(81);
AY457913	Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Roseoburia_et_rel.(97);Roseoburia_et_rel.(97);Eubacterium_ramulus_et_rel.(90);uncultured(90);
AY457912	Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(99);
AY457911	Bacteria(100);Firmicutes(99);Clostridiales(98);Ruminococcus_et_rel.(96);Anaerofilum-Faecalibacterium(92);Faecalibacterium(92);Faecalibacterium_prausnitzii(90);

You should notice two things. First, there are no bootstrap values below 80 for any of the taxonomy assignments. Second, the bootstrap values may change slightly. This is acceptable as the bootstrapping is a randomized process. The default number of iterations is 100.

probs

Sometimes you may find the output of bootstrap values with your taxonomy to be tedious. To get around this you can use the probs option to have the probabilities excluded from the output:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, cutoff=80, probs=F)
AY457915	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Lachnospira_pectinoschiza;
AY457914	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Eubacterium_eligens;Eubacterium_eligens;
AY457913	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;Eubacterium_ramulus_et_rel.;uncultured;
AY457912	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;
AY457911	Bacteria;Firmicutes;Clostridiales;Ruminococcus_et_rel.;Anaerofilum-Faecalibacterium;Faecalibacterium;Faecalibacterium_prausnitzii;

method=knn

The k-Nearest Neighbor algorithm involves identifying the k-most similar sequences in a database that are similar to your sequence. By default, mothur will find the 10 most similar sequences in the database. Once mothur has identified the k-most similar sequences, she will use the taxonomy information for each sequence to determine the consensus taxonomy. mothur gives you the ability to determine the method that is used to find the closest matches, the value of k This classification method can be implemented by the following.

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn)
Reading in the silva.slv.taxonomy taxonomy...	DONE.
Generating search database...    DONE.
It took 6 seconds generate search database. 
Classifying sequence 100
Classifying sequence 200

It took 2 secs to classify 242 sequences.

This will output the following in the abrecovery.taxonomy file:

AY457915	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;
AY457914	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;
AY457913	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;
...

numwanted

If you instead only want the value of k to be 3, the following command would be used:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, numwanted=3)

Reading in the silva.slv.taxonomy taxonomy...	DONE.
Generating search database...    DONE.
It took 5 seconds generate search database. 
Classifying sequence 100
Classifying sequence 200

It took 2 secs to classify 242 sequences.

This would produce the following output in the abrecovery.taxonomy file:

AY457915	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;
AY457914	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;
AY457913	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;
...


If you are using the phylotype command as a down stream analysis, you probably only want to consider 1 nearest neighbor:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, numwanted=1)


This would produce the following in the abrecovery.taxonomy file:

AY457915	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Lachnospira_pectinoschiza;
AY457914	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Eubacterium_eligens;Eubacterium_eligens;
AY457913	Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;Eubacterium_ramulus_et_rel.;uncultured;

As you should be able to see, these taxonomy lines are considerably longer and probably should not be as trustworthy as those when you are considering more neighbors.


search=kmer and ksize

By default, the k-nearest neighbor approach searches for nearest neighbors by kmer searching as is done in the align.seqs command. The default size of kmers is 8, which seems to be a fairly decent choice regardless of which part of the 16S rRNA gene you are interested in. As we pointed out in the development of the align.seqs command, kmer searching is superior in accuracy and speed compared to blast or suffix tree searching methods.

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn)

or

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=kmer)

If you would like to change the kmer size you use the ksize option:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=kmer, ksize=6)


search=blast, match, mismatch, gapopen, and gapextend

Assuming that you have put the blast binaries into your PATH variable, it is possible to use blastn to find nearest neighbors. It can be implemented as:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=blast)

You can also change the various blast-related options for match, mismatch, gapopen, and gapextend values:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=blast, gapopen=-5, gapextend=-1)


search=suffix

An alternative method for finding the k-nearest neighbors is to use a suffix tree to perform the search. Again, this is the same method that is available within the align.seqs command. It can be implemented as:

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=suffix)


processors

The processors parameter allows you to run the command with multiple processors. By default processors is 1, and use of multiple processors is not available for windows users.

mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, processors=2)
 

Reading in the silva.slv.taxonomy taxonomy... DONE. Generating search database... DONE. It took 5 seconds generate search database. Reading template probabilities... DONE. It took 3 seconds get probabilities. Classifying sequence 100 Classifying sequence 100

It took 42 secs to classify 242 sequences.

Note that with one processor it took about 79 sec to do the actual classification.

name

The name parameter allows you add a names file with your fasta file, if you enter multiple fasta files, you must enter matching names files for them.