Classify.seqs
From mothur
The classify.seqs command allows the user to use several different methods to assign their sequences to the taxonomy outline of their choice. Current methods include using a k-nearest neighbor consensus and Bayesian approach. Taxonomy outlines and reference sequences can be obtained from the taxonomy outline page. The command requires that you provide a fasta-formatted input and database sequence file and a taxonomy file for the reference sequences. To complete this tutorial, you are encouraged to obtain the AbRecovery dataset.
Contents |
Output
mothur will output two files from the classify.seqs command: a *.taxonomy file which contains a taxonomy string for each sequence and a *.tax.summary file which contains a taxonomic outline indicating the number of sequences that were found for your collection at each level. For example, abrecovery.taxonomy may look like:
AY457915 Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(91);Eubacterium_eligens_et_rel.(89);Lachnospira_pectinoschiza(80); AY457914 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(95);Eubacterium_eligens_et_rel.(92);Eubacterium_eligens(84);Eubacterium_eligens(81); AY457913 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Roseoburia_et_rel.(97);Roseoburia_et_rel.(97);Eubacterium_ramulus_et_rel.(90);uncultured(90); AY457912 Bacteria(100);Firmicutes(99);Clostridiales(99);Johnsonella_et_rel.(99);Johnsonella_et_rel.(99); AY457911 Bacteria(100);Firmicutes(99);Clostridiales(98);Ruminococcus_et_rel.(96);Anaerofilum-Faecalibacterium(92);Faecalibacterium(92);Faecalibacterium_prausnitzii(90);
or
AY457915 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Lachnospira_pectinoschiza; AY457914 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Eubacterium_eligens;Eubacterium_eligens; AY457913 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;Eubacterium_ramulus_et_rel.;uncultured; AY457912 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.; AY457911 Bacteria;Firmicutes;Clostridiales;Ruminococcus_et_rel.;Anaerofilum-Faecalibacterium;Faecalibacterium;Faecalibacterium_prausnitzii;
This output indicates the sequence identifier in the first column as well as it's taxonomy. In the former case, bootstrap values are provided. In the latter case, they are either not calculated (method=knn) or suppressed (method=bayesian, probs=F).
The second output file, abrecovery.tax.summary may look something like the following:
taxlevel rankID taxon daughterlevels total 0 0 Root 2 242 1 0.1 Bacteria 50 242 2 0.1.2 Actinobacteria 38 13 3 0.1.2.3 Actinomycetaceae-Bifidobacteriaceae 10 13 4 0.1.2.3.7 Bifidobacteriaceae 6 13 5 0.1.2.3.7.2 Bifidobacterium_choerinum_et_rel. 8 13 6 0.1.2.3.7.2.1 Bifidobacterium_angulatum_et_rel. 1 11 7 0.1.2.3.7.2.1.1 unclassified 1 11 8 0.1.2.3.7.2.1.1.1 unclassified 1 11 9 0.1.2.3.7.2.1.1.1.1 unclassified 1 11 10 0.1.2.3.7.2.1.1.1.1.1 unclassified 1 11 11 0.1.2.3.7.2.1.1.1.1.1.1 unclassified 1 11 12 0.1.2.3.7.2.1.1.1.1.1.1.1 unclassified 1 11 6 0.1.2.3.7.2.5 Bifidobacterium_longum_et_rel. 1 2 ...
The first column indicates the taxonomic level in the outline. Obviously, the Root is the highest one can go. In this case the deepest any of the sequences go is to level 12. The second column indicates the "pedigree" for each lineage. The third column is the name of the lineage. Column four indicates the number of children lineages that the current lineage has. Finally, the last column indicates the number of sequences that were found in that lineage.
method=bayesian
When finding the taxonomy of a given query sequence in the fasta file, the bayesian method looks at the query sequence kmer by kmer. The method looks at all taxonomies represented in the template, and calculates the probability a sequence from a given taxonomy would contain a specific kmer. Then calculates the probability a query sequence would be in a given taxonomy based on the kmers it contains, and assign the query sequence to the taxonomy with the highest probability. This method also runs a bootstrapping algorithmn to find the confidence limit of the assignment by randomly choosing with replacement 1/8 of the kmers in the query and then finding the taxonomy. This is the method that is implemented by the RDP and is described by Want et al. This is the default method in classify.seqs.
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.full.taxonomy)
or
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.full.taxonomy, method=bayesian)
Reading template taxonomy... DONE. Reading template probabilities... DONE. It took 23 seconds get probabilities. Classifying sequences from abrecovery.fasta ... Classifying sequence 100 Classifying sequence 200 It took 19 secs to classify 242 sequences.
ksize
The only valid search option with the bayesian method is kmer and by default mothur uses kmer size 8. If you would like to change the kmer size you can use the ksize option:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, ksize=6)
iters
The iters option allows you to specify how many iterations to do when calculating the bootstrap confidence score for your taxonomy. The default is 100.
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, iters=1000)
cutoff
By default, the classify.seqs will return a full taxonomy for every sequence, regardless of the bootstrap value for that taxonomic assignment. For example, running the default parameters will yield the following output:
AY457915 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Eubacterium_eligens_et_rel.(100);Lachnospira_pectinoschiza(100);unclassified;unclassified;unclassified;unclassified;unclassified; AY457914 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Eubacterium_eligens_et_rel.(100);Eubacterium_eligens(99);Eubacterium_eligens(99);unclassified;unclassified;unclassified;unclassified; AY457913 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Roseoburia_et_rel.(100);Roseoburia_et_rel.(100);Eubacterium_ramulus_et_rel.(100);uncultured(100);unclassified;unclassified;unclassified;unclassified; AY457912 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(92);Eubacterium_eligens_et_rel.(92);Lachnospira_pectinoschiza(54);unclassified;unclassified;unclassified;unclassified;unclassified; AY457911 Bacteria(100);Firmicutes(100);Clostridiales(100);Ruminococcus_et_rel.(100);Anaerofilum-Faecalibacterium(100);Faecalibacterium(100);Faecalibacterium_prausnitzii(100);unclassified;unclassified;unclassified;unclassified;unclassified;unclassified; AY457910 Bacteria(100);Firmicutes(100);Clostridiales(100);Ruminococcus_et_rel.(100);Anaerofilum-Faecalibacterium(100);Faecalibacterium(100);Faecalibacterium_prausnitzii(100);unclassified;unclassified;unclassified;unclassified;unclassified;unclassified;
You will notice that sequence AY457912 has a bootstrap value of 54% for the assignment to the Lachnospira pectinoschiza. This isn't much of a vote of confidence for this assignment. The current thinking seems to be to use a minimum cutoff of 60%. Let's go crazy and set a value of 80%:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, cutoff=80)
AY457915 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Eubacterium_eligens_et_rel.(100);Lachnospira_pectinoschiza(100);unclassified;unclassified;unclassified;unclassified;unclassified; AY457914 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Eubacterium_eligens_et_rel.(100);Eubacterium_eligens(99);Eubacterium_eligens(99);unclassified;unclassified;unclassified;unclassified; AY457913 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Roseoburia_et_rel.(100);Roseoburia_et_rel.(100);Eubacterium_ramulus_et_rel.(100);uncultured(100);unclassified;unclassified;unclassified;unclassified; AY457912 Bacteria(100);Firmicutes(100);Clostridiales(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(100);Johnsonella_et_rel.(92);Eubacterium_eligens_et_rel.(92);unclassified;unclassified;unclassified;unclassified;unclassified;unclassified;
You should notice two things. First, there are no bootstrap values below 80 for any of the taxonomy assignments. Second, the bootstrap values may change slightly. This is acceptable as the bootstrapping is a randomized process. The default number of iterations is 100.
probs
Sometimes you may find the output of bootstrap values with your taxonomy to be tedious. To get around this you can use the probs option to have the probabilities excluded from the output:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, cutoff=80, probs=F)
AY457915 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Lachnospira_pectinoschiza;unclassified;unclassified;unclassified;unclassified;unclassified; AY457914 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Eubacterium_eligens;Eubacterium_eligens;unclassified;unclassified;unclassified;unclassified; AY457913 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;Eubacterium_ramulus_et_rel.;uncultured;unclassified;unclassified;unclassified;unclassified; AY457912 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;unclassified;unclassified;unclassified;unclassified;unclassified;unclassified;
method=knn
The k-Nearest Neighbor algorithm involves identifying the k-most similar sequences in a database that are similar to your sequence. By default, mothur will find the 10 most similar sequences in the database. Once mothur has identified the k-most similar sequences, she will use the taxonomy information for each sequence to determine the consensus taxonomy. mothur gives you the ability to determine the method that is used to find the closest matches, the value of k This classification method can be implemented by the following.
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn)
Reading in the silva.slv.taxonomy taxonomy... DONE. Generating search database... DONE. It took 6 seconds generate search database. Classifying sequence 100 Classifying sequence 200 It took 2 secs to classify 242 sequences.
This will output the following in the abrecovery.taxonomy file:
AY457915 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.; AY457914 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.; AY457913 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.; ...
numwanted
If you instead only want the value of k to be 3, the following command would be used:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, numwanted=3) Reading in the silva.slv.taxonomy taxonomy... DONE. Generating search database... DONE. It took 5 seconds generate search database. Classifying sequence 100 Classifying sequence 200 It took 2 secs to classify 242 sequences.
This would produce the following output in the abrecovery.taxonomy file:
AY457915 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.; AY457914 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.; AY457913 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.; ...
If you are using the phylotype command as a down stream analysis, you probably only want to consider 1 nearest neighbor:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, numwanted=1)
This would produce the following in the abrecovery.taxonomy file:
AY457915 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Lachnospira_pectinoschiza; AY457914 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Johnsonella_et_rel.;Eubacterium_eligens_et_rel.;Eubacterium_eligens;Eubacterium_eligens; AY457913 Bacteria;Firmicutes;Clostridiales;Johnsonella_et_rel.;Johnsonella_et_rel.;Roseoburia_et_rel.;Roseoburia_et_rel.;Eubacterium_ramulus_et_rel.;uncultured;
As you should be able to see, these taxonomy lines are considerably longer and probably should not be as trustworthy as those when you are considering more neighbors.
search=kmer and ksize
By default, the k-nearest neighbor approach searches for nearest neighbors by kmer searching as is done in the align.seqs command. The default size of kmers is 8, which seems to be a fairly decent choice regardless of which part of the 16S rRNA gene you are interested in. As we pointed out in the development of the align.seqs command, kmer searching is superior in accuracy and speed compared to blast or suffix tree searching methods.
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn)
or
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=kmer)
If you would like to change the kmer size you use the ksize option:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=kmer, ksize=6)
search=blast, match, mismatch, gapopen, and gapextend
Assuming that you have put the blast binaries into your PATH variable, it is possible to use blastn to find nearest neighbors. It can be implemented as:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=blast)
You can also change the various blast-related options for match, mismatch, gapopen, and gapextend values:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=blast, gapopen=-5, gapextend=-1)
search=suffix
An alternative method for finding the k-nearest neighbors is to use a suffix tree to perform the search. Again, this is the same method that is available within the align.seqs command. It can be implemented as:
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=suffix)
search=distance
An alternative method for finding the k-nearest neighbors is to find the distance from the query sequence to each sequence in the template.
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.slv.taxonomy, method=knn, search=distance)
processors
The processors parameter allows you to run the command with multiple processors. By default processors is 1, and use of multiple processors is not available for windows users. If you are using the mpi-enabled version, processors is set to the number of processes you have running.
mothur > classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.full.taxonomy, processors=2) Reading template taxonomy... DONE. Reading template probabilities... DONE. It took 22 seconds get probabilities. Classifying sequences from abrecovery.fasta ... Classifying sequence 100 Classifying sequence 100
It took 11 secs to classify 242 sequences.
Note that with one processor it took about 19 sec to do the actual classification.
name
The name parameter allows you add a names file with your fasta file, if you enter multiple fasta files, you must enter matching names files for them.
group
The group parameter allows you add a group file to be used to generate group totals in the .summary file. For example, if you ran the command classify.seqs(fasta=abrecovery.fasta, template=silva.bacteria.fasta, taxonomy=silva.full.taxonomy, group=abrecovery.groups) the summary file would look like:
taxlevel rankID taxon daughterlevels total A B C 0 0 Root 2 242 84 84 74 1 0.1 Bacteria 50 242 84 84 74 2 0.1.2 Actinobacteria 38 13 0 13 0 3 0.1.2.3 Actinomycetaceae-Bifidobacteriaceae 10 13 0 13 0 4 0.1.2.3.7 Bifidobacteriaceae 6 13 0 13 0 5 0.1.2.3.7.2 Bifidobacterium_choerinum_et_rel. 8 13 0 13 0 6 0.1.2.3.7.2.1 Bifidobacterium_angulatum_et_rel. 1 11 0 11 0 7 0.1.2.3.7.2.1.1 unclassified 1 11 0 11 0 8 0.1.2.3.7.2.1.1.1 unclassified 1 11 0 11 0 9 0.1.2.3.7.2.1.1.1.1 unclassified 1 11 0 11 0 10 0.1.2.3.7.2.1.1.1.1.1 unclassified 1 11 0 11 0 11 0.1.2.3.7.2.1.1.1.1.1.1 unclassified 1 11 0 11 0 12 0.1.2.3.7.2.1.1.1.1.1.1.1 unclassified 1 11 0 11 0 6 0.1.2.3.7.2.5 Bifidobacterium_longum_et_rel. 1 2 0 2 0 7 0.1.2.3.7.2.5.1 unclassified 1 2 0 2 0 8 0.1.2.3.7.2.5.1.1 unclassified 1 2 0 2 0 9 0.1.2.3.7.2.5.1.1.1 unclassified 1 2 0 2 0 ...
so you can see that all the sequences classified to taxon Actinobacteria are from sample B.