We will be hosting a mothur workshop in December. Learn more.

Pre.cluster

From mothur

The pre.cluster command implements a pseudo-single linkage algorithm with the goal of removing sequences that are likely due to pyrosequencing errors. A version of this algorithm was developed by Sue Huse and will be published in a forthcoming paper in Environmental Microbiology. The basic idea is that abundant sequences are more likely to generate erroneous sequences than rare sequences. With that in mind, the algorithm proceeds by ranking sequences in order of their abundance. Then we walk through the list of sequences looking for rarer sequences that are within some threshold of the original sequence. Those that are within the threshold are merged with the larger sequence. The original Huse method performs this task on a distance matrix, whereas we do it based on the original sequences. The advantage of our approach is that the algorithm works on aligned sequences instead of a distance matrix. This is advantageous because by pre-clustering you remove a large number of sequences making the distance calculation much faster.


Contents

Default settings

The pre.cluster command expects a fasta-formatted file and a names or count file and that the sequences are in the same order in both files. Both of these files can be generated by the unique.seqs command. For example, if you are following along with the Sogin data analysis example and have aligned, filtered, and unique'd your sequences, then enter the following to perform the pre.clustering command:

mothur > unique.seqs(fasta=sogin.unique.filter.fasta, name=sogin.names)
mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names)

or

mothur > unique.seqs(fasta=sogin.unique.filter.fasta, count=sogin.count_table)
mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, count=sogin.unique.filter.count_table)


Will result in the following output:

0	21821	86
100	20286	1621
200	19824	2083
...
21700	16380	5527
21800	16377	5530
21900	16376	5531

Total number of sequences before precluster was 21907. pre.cluster removed 5531 sequences.


This output indicates, by column, the number of sequences processed, the number of sequences that will be found in the final dataset, and the number of sequences that have been clustered away. This should accelerate as the function runs. In this example, this step merged 5,531 sequences with other sequences, leaving you with a set of 16,376 sequences to work with. As an additional benefit to removing potentially erroneous sequences, the reduced dataset will run about 1.8 times faster through dist.seqs than the original and should cluster much faster as well. Two files are created - a *.precluster.fasta and a *.precluster.names file containing the new sequence and names file or *.precluster.count_table for further processing.

Options

group

If you provide a groupfile or your count file contains group information, mothur will pre.cluster sample by sample.

mothur > pre.cluster(fasta=stool.trim.unique.good.filter.unique.fasta, name=stool.trim.unique.good.filter.names, group=stool.good.groups)

or

mothur > pre.cluster(fasta=stool.trim.unique.good.filter.unique.fasta, count=stool.trim.unique.good.filter.count_table)


The screen output will look like:

Processing group F11Fcsw:
0	355	5
100	292	68
200	284	76
300	281	79
360	281	79
Total number of sequences before pre.cluster was 360.
pre.cluster removed 79 sequences.
...


diffs

By default the pre.cluster command will look for sequences that are within 1 mismatch of the sequence being considered. With the diffs option you can change this threshold. For example:

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2)

0	21777	130
100	19165	2742
200	18419	3488
...
21700	13273	8634
21800	13270	8637
21900	13267	8640

Total number of sequences before precluster was 21907. pre.cluster removed 8640 sequences.

topdown

The topdown parameter allows you to specify whether to cluster from largest abundance to smallest or smallest to largest. Default=T, meaning largest to smallest.

fasta only

As shown above, pre.cluster expects you to provide a name file so that it can acquire the abundance information from each sequence. If you do not provide the name file the command will automatically run your data through unique.seqs to generate to get the information it needs.


Caveat emptor

Something to keep in mind is that when you set the number of mismatches to 2, you are allowing that the maximum difference between sequences within a cluster to be 4 (2 from the dominant sequence in one direction, and 2 in any other direction). This difference of 4 bases, could your ability to distinguish signal from noise. For example, with this Sogin datasets, the sequences are ~60 bp V6 pyrotags. A difference of 4 bases is 6.7%! Alternatively, when using diffs=1, the difference of 2 bases is 3.3%. The assumption of the algorithm is that these mismatches are noise; however, it doesn't make sense to then analyze your data at 3% by either level of diffs. Remember that this method does not actually remove the noise, it just clusters sequences that are likely to be noisy. To remove the noise you would need to use a program like Chris Quince's Pyronoise. Considering using PyroNoise may not be practical for many people, the pre.cluster option may be your best bet.

Revisions

  • 1.22.0 Added group option.
  • 1.23.0 Added processors option for by group processing.
  • 1.23.0 Added map file to output.
  • 1.28.0 Added count option
  • 1.30.0 Added topdown parameter
  • 1.33.0 Improved work balance load between processors.
Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox