Cluster

Revision as of 17:51, 21 May 2009 by Westcott (Talk | contribs) (cluster())

Once a distance matrix has been read into mothur, several algorithms are available for clustering sequences. mothur currently implements three clustering methods:

  • Nearest neighbor: Each of the sequences within an OTU is at most X% distant from the most similar sequence in the OTU.
  • Furthest neighbor: All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.
  • Average neighbor: This method is a middle ground between the other two algorithms.
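
To make the three criteria concrete, here is a minimal Python sketch of how the distance between two OTUs could be scored under each method. It assumes the standard single-, complete-, and average-linkage definitions; the function name linkage_distance and the dictionary layout are hypothetical illustrations, not mothur's internal code.

```python
from itertools import product
from statistics import mean


def linkage_distance(otu_a, otu_b, dist, method="furthest"):
    """Score the distance between two OTUs (illustrative, not mothur's code).

    dist maps frozenset({name1, name2}) to a pairwise distance.
    """
    pairwise = [dist[frozenset((a, b))] for a, b in product(otu_a, otu_b)]
    if method == "nearest":
        # Smallest distance between any member of one OTU and any of the other.
        return min(pairwise)
    if method == "furthest":
        # Largest such distance, so every pair in the merged OTU is within it.
        return max(pairwise)
    if method == "average":
        # Mean of all between-OTU distances, a middle ground between the two.
        return mean(pairwise)
    raise ValueError(f"unknown method: {method}")


# Hypothetical three-sequence example.
example_dist = {
    frozenset(("A", "C")): 0.01,
    frozenset(("B", "C")): 0.03,
}
for m in ("nearest", "furthest", "average"):
    print(m, linkage_distance(["A", "B"], ["C"], example_dist, m))
```

Two OTUs would be merged at a given threshold only if this score is within the threshold, which is why nearest neighbor tends to produce larger OTUs than furthest neighbor at the same distance.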

If there is an algorithm that you would like to see implemented, please consider either contributing to the mothur project or contacting the developers and we'll see what we can do. The furthest neighbor algorithm is the default option. For this tutorial you should download the AmazonData.zip file and decompress it.

cluster()

The cluster() command requires a distance matrix to be stored in memory. To save memory, mothur deletes the distance matrix at the end of each cluster command, so to proceed with this tutorial you must run the read.dist command before each cluster command. If you haven't already, run the reading distance matrices tutorial and load the distance matrix into memory by executing the command:

mothur > read.dist(phylip=98_sq_phylip_amazon.dist)

By default, cluster() executes the furthest neighbor clustering algorithm. For a detailed description of this and the other algorithms, check out the example clustering calculations page. Next, let's run the cluster() command:

mothur > cluster()

This command will generate the following output:

unique	2	94	2	
0.00	2	92	3	
0.01	2	88	5	
0.02	4	84	2	2	1	
0.03	4	75	6	1	2	
0.04	4	69	9	1	2	
0.05	4	55	13	3	2	
0.06	4	48	14	2	4	
0.07	4	44	16	2	4	
0.08	7	36	15	4	2	1	0	1	
0.09	7	36	12	4	3	0	0	2	
0.10	7	35	12	2	3	0	0	3
0.11	14	30	9	3	5	0	0	1	0	0	0	0	0	0	1	
...

Each line printed to the screen contains a label describing the distance cutoff used to form OTUs, the number of sequences in the largest OTU, and then the number of OTUs containing one sequence, two sequences, and so on. Running the cluster() command generates three output files whose names end in sabund, rabund, and list. The data printed to the screen are the same as those in the sabund file. You will notice that the sample rabund, sabund, and list files each have a ".fn." tag inserted after the name of the distance matrix. The tag identifies the algorithm that was used to perform the clustering, in this case furthest neighbor (fn). Other possibilities are "an" for average neighbor and "nn" for nearest neighbor.
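
The line layout described above can be unpacked with a few lines of Python. This is a minimal sketch that assumes whitespace-separated fields exactly as shown in the screen output; parse_sabund_line is a hypothetical helper name, not part of mothur.

```python
def parse_sabund_line(line):
    """Split one sabund-style line into its parts (illustrative helper)."""
    fields = line.split()
    label = fields[0]                      # "unique", "0.00", "0.01", ...
    largest = int(fields[1])               # sequences in the largest OTU
    counts = [int(x) for x in fields[2:]]  # counts[i] = OTUs with i+1 sequences
    if len(counts) != largest:
        raise ValueError("expected one count per OTU size up to the largest")
    n_otus = sum(counts)
    n_seqs = sum(size * n for size, n in enumerate(counts, start=1))
    return {"label": label, "largest": largest, "counts": counts,
            "n_otus": n_otus, "n_seqs": n_seqs}


row = parse_sabund_line("0.03\t4\t75\t6\t1\t2\t")
print(row["label"], row["n_otus"], row["n_seqs"])
```

For the 0.03 line this gives 84 OTUs covering 98 sequences (75 + 2×6 + 3×1 + 4×2), which matches the total on every other line of the output.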


The method option

By default cluster() uses the furthest neighbor algorithm; this can be changed with the method option. By running the following command you will get the same output as just running cluster():

mothur > cluster(method=furthest)

To obtain a nearest neighbor clustering of the data, set the method option to nearest, which produces the following output:

mothur > cluster(method=nearest)
unique	2	94	2	
0.00	2	92	3	
0.01	4	86	4	0	1	
0.02	4	83	2	1	2	
0.03	4	75	6	1	2	
0.04	4	68	8	2	2	
0.05	5	53	13	2	2	1	
0.06	13	47	12	2	2	0	0	0	0	0	0	0	0	1	
0.07	16	41	10	2	2	0	0	1	0	0	0	0	0	0	0	0	1	
...

To obtain an average neighbor clustering of the data, set the method option to average, which produces the following output:

mothur > cluster(method=average)
unique	2	94	2	 
0.00	2	92	3	
0.01	3	87	4	1	
0.02	4	83	2	1	2	
0.03	4	75	6	1	2	
0.04	4	69	9	1	2	
0.05	4	55	13	3	2	
0.06	4	48	14	2	4	
0.07	7	42	15	2	2	1	0	1	
...


The cutoff option

As when reading in the distance matrix, you can set a cutoff value for the clustering operation. This will provide a similar boost in speed if you didn't set the cutoff when reading in the matrix. If you already set the cutoff value when reading in the matrix, you don't need to set it again for clustering unless you want an even smaller distance. The cutoff can be set for the cluster command as follows:

mothur > cluster(cutoff=0.05) 
unique	2	94	2	
0.00	2	92	3	
0.01	2	88	5	
0.02	4	84	2	2	1	
0.03	4	75	6	1	2	
0.04	4	69	9	1	2	
0.05	4	55	13	3	2	
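
The speed-up from a cutoff comes from having fewer distances to consider. The sketch below illustrates that idea only: it assumes distances are held as a dictionary of pairs and that a pair is kept when its distance is at or below the cutoff. How mothur treats values exactly at the cutoff, and how rounding interacts with it, is not stated here, so treat both as assumptions; apply_cutoff is a hypothetical name.

```python
def apply_cutoff(pairs, cutoff):
    """Drop pairwise distances above the cutoff (illustrative only)."""
    # Assumption: a distance exactly equal to the cutoff is kept.
    return {pair: d for pair, d in pairs.items() if d <= cutoff}


# Hypothetical distances between four sequences.
pairs = {
    ("s1", "s2"): 0.01,
    ("s1", "s3"): 0.04,
    ("s2", "s3"): 0.07,
    ("s3", "s4"): 0.20,
}
kept = apply_cutoff(pairs, 0.05)
print(len(pairs), "pairs before,", len(kept), "after")
```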


The precision option

Perhaps the most commonly asked question is why the cluster command produces data for both the "unique" and "0.00" lines. Aren't they the same? No. The "unique" line represents data for the situation where all of the sequences in an OTU are identical; the "0.00" line represents data for the situation where all of the sequences in an OTU have pairwise distances less than 0.0049. We made the decision that because there is error in everything, we should round these distances as well and not apply a hard cutoff at 0.01, 0.02, etc. If you want greater precision, there is a precision option in the read.dist() and cluster() commands:

mothur > cluster(cutoff=0.02, precision=1000)
unique	2	94	2	
0.003	2	92	3	 
0.006	2	90	4	
0.008	2	88	5	
0.017	3	87	4	1	
0.018	3	86	3	2	
0.020	4	84	2	2	1

Remember that the 16S rRNA gene is roughly 1,500 bp long, so a single-base difference corresponds to a distance of about 0.0007. It would therefore seem silly to have a precision greater than 1,000. Just because you can calculate a number to 20 digits doesn't mean they're all significant.
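
The rounding behind the labels can be sketched in Python as follows. This assumes that precision is a power of ten, that the default is 100 (inferred from the two-decimal labels in the default output), and that distances are rounded half up; the function distance_label is a hypothetical illustration, not mothur's implementation.

```python
import math


def distance_label(distance, precision=100):
    """Round a distance to the given precision and format it as a label."""
    # Assumption: precision is a power of ten (100 -> 2 decimals, 1000 -> 3).
    digits = len(str(precision)) - 1
    # Round half up explicitly; Python's round() rounds halves to even.
    rounded = math.floor(distance * precision + 0.5) / precision
    return f"{rounded:.{digits}f}"


print(distance_label(0.004))         # falls on the "0.00" line
print(distance_label(0.006))         # falls on the "0.01" line
print(distance_label(0.0034, 1000))  # three decimals with precision=1000
```

Under these assumptions, a non-zero distance such as 0.004 is grouped under the "0.00" label, which is why that line differs from the "unique" line.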

Missing distances

Perhaps the second most commonly asked question is why there isn't a line for distance 0.XX. Notice that in the previous example the distances jump from 0.003 to 0.006. Where are 0.004 and 0.005? mothur only outputs data if the clustering has been updated for a distance. So if you don't have data at your favorite distance, that means that nothing changed between the previous distance and the next one. Therefore, if you want OTU data for a distance of 0.005 in this case, you would use the data from 0.003.
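
The lookup rule described above (use the closest reported label at or below the distance you want) can be written as a small helper. This is a sketch that assumes the labels are listed in ascending order, as in the screen output, and that "unique" sorts before every numeric label; label_for_distance is a hypothetical name.

```python
def label_for_distance(labels, target):
    """Return the last reported label that does not exceed the target."""
    best = None
    for label in labels:
        # Assumption: treat "unique" as distance 0 for ordering purposes.
        value = 0.0 if label == "unique" else float(label)
        if value > target:
            break  # labels are ascending, so later ones are larger still
        best = label
    return best


labels = ["unique", "0.003", "0.006", "0.008", "0.017", "0.018", "0.020"]
print(label_for_distance(labels, 0.005))
```

With the labels from the precision=1000 example, a request for 0.005 resolves to the 0.003 line.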


Variability

You may notice that if you run the same command multiple times on the same dataset, you might get slightly different output for some distances:

mothur > cluster()
unique	2	94	2	
0.00	2	92	3	
0.01	2	88	5	
0.02	4	84	2	2	1	
0.03	4	75	6	1	2	
0.04	4	69	9	1	2	
0.05	4	55	13	3	2	
0.06	4	48	14	2	4	
0.07	4	44	16	2	4	
0.08	7	35	17	3	2	1	0	1	
...
mothur > cluster()
unique	2	94	2	
0.00	2	92	3	
0.01	2	88	5	
0.02	4	84	2	2	1	
0.03	4	75	6	1	2	
0.04	4	69	9	1	2	
0.05	4	55	13	3	2	
0.06	4	48	14	2	4	
0.07	4	44	16	2	4	
0.08	7	36	15	4	2	1	0	1	
...

At a distance of 0.08 these two executions diverge from one another. This is because there was a tie: a sequence could have joined more than one pre-existing OTU. mothur is programmed to randomly select the OTU that it should join, so it is possible to get differences between runs. This is just a byproduct of using an algorithm-based approach to clustering.
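
The effect of random tie-breaking can be illustrated with a short Python sketch. It is not mothur's code: pick_otu, the OTU names, and the rng parameter are hypothetical, and the example only shows why equally close candidates can lead to different results from run to run.

```python
import random


def pick_otu(distances_to_otus, rng=random):
    """Choose the closest OTU, breaking ties at random (illustrative only)."""
    best = min(distances_to_otus.values())
    # Every OTU at the minimum distance is an equally valid choice.
    tied = sorted(name for name, d in distances_to_otus.items() if d == best)
    return rng.choice(tied)


# Hypothetical case: one sequence is equally close to two existing OTUs.
candidates = {"otu1": 0.08, "otu2": 0.08, "otu3": 0.09}
print(pick_otu(candidates))  # may print otu1 or otu2 on different runs
```

Passing a seeded generator such as random.Random(1) makes this sketch repeatable, which is a convenient way to see that the variation comes only from the tie and not from the data.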