Hcluster
From mothur
The hcluster command can be used to assign sequences to OTUs and outputs a .list, .rabund, .sabund and .sorted.dist files. It does not store the distance matrix in RAM like the cluster command, allowing large distance files to be processed. It is slower than cluster on small files, but becomes more competitive on large files. Presently, hcluster implements one clustering method:
- Furthest neighbor: All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.
If there is an algorithm that you would like to see implemented, please consider either contributing to the mothur project or contacting the developers and we'll see what we can do. The furthest neighbor algorithm is the default option. For this tutorial you should download the AmazonData.zip file and decompress it.
Contents |
Default settings
In order for the hcluster() command to work, a distance matrix needs to be provide by using either the phylip or column and name parameters. By default hcluster() executes the furthest neighbor clustering algorithm. For a detailed description of this and the other algorithms check out the example clustering calculations page.
mothur > hcluster(phylip=98_sq_phylip_amazon.dist)
or
mothur > hcluster(column=96_lt_column_11_amazon.dist, name=amazon1.names)
This command will generate the following output:
mothur > hcluster(phylip=98_sq_phylip_amazon.dist) It took 1 seconds to sort. unique 2 94 2 0.00 2 92 3 0.01 2 88 5 0.02 4 84 2 2 1 0.03 4 75 6 1 2 0.04 4 69 9 1 2 0.05 4 55 13 3 2 0.06 4 48 14 2 4 0.07 4 44 16 2 4 ...
Outputted to the screen is a label describing the distance cutoff used to form OTUs, the number of sequences in the largest OTU, the number of OTUs with only one sequence, with two, etc. Running the hcluster() command generates four output files whose names end in sabund, rabund, list and .sorted.dist. The data outputted to the screen is the same as that in the sabund file. You will notice that the sample rabund, sabund, and list files each have a ".fn." tag inserted after the name of the distance matrix. fn corresponds to the algorithm that was used to perform the clustering. In this case furthest neighbor (fn) was used.
Options
method
By default hcluster() uses the furthest neighbor algorithm. We hope to add nearest neighbor and average neighbor soon.
cutoff
Similar to reading in the distance matrix, you can set a cutoff value for performing the clustering operation. This will provide a similar boost in speed. The cutoff can be set for the cluster command as follows:
mothur > hcluster(phylip=98_sq_phylip_amazon.dist, cutoff=0.05) It took 0 seconds to sort. unique 2 94 2 0.00 2 92 3 0.01 2 88 5 0.02 4 84 2 2 1 0.03 4 75 6 1 2 0.04 4 69 9 1 2 0.05 4 55 13 3 2 It took 0 seconds to cluster.
hard
By default the cutoff parameter is set to cutoff + (5 / (precision * 10.0)). So if you set the cutoff to 0.03 with precision of 100, the cutoff would be set to 0.035. If you want a hard cutoff of 0.03, then you need to set the hard parameter to true.
mothur > hcluster(cutoff=0.03, hard=t)
precision
Perhaps the most commonly asked question is why the hcluster command produces data for both the "unique" and "0.00" lines. Aren't they the same? No. The "unique" line represents data for the situation where all of the sequences in an OTU are identical; the "0.00" line represents data for the situation where all of the sequences in an OTU have pairwise distances less than 0.0049. We made the decision that because there is error in everything, we should round these distances as well and not apply a hard cutoff at 0.01, 0.02, etc. If you want greater precision, there is a precision option hcluster() command:
mothur > hcluster(phylip=98_sq_phylip_amazon.dist, cutoff=0.02, precision=1000) It took 0 seconds to sort. unique 2 94 2 0.003 2 92 3 0.006 2 90 4 0.008 2 88 5 0.010 2 88 5 0.012 2 88 5 0.017 3 87 4 1 0.018 3 86 3 2 0.020 4 84 2 2 1 It took 0 seconds to cluster.
Remember that the 16S rRNA gene is roughly 1,500 bp long. So it would seem silly to have a precision greater than 1,000. Just because you can calculate a number to 20 digits, doesn't mean they're all significant.
sorted
The hcluster command is able to read one distance at a time because it sorts the distances first. For large files this sort can take a long time. The sorted parameter allows you to indicate that your column distance file is already sorted. By default the sorted parameter is false, but you can change that by entering:
mothur > hcluster(column=96_lt_column_11_amazon.sorted.dist, name=amazon1.names, sorted=t)
showabund
The showabund parameter allows you to shut off the screen output of the hcluster command. You can do this by entering:
mothur > hcluster(column=96_lt_column_11_amazon.dist, name=amazon1.names, showabund=f) It took 0 seconds to sort. It took 0 seconds to cluster.
Finer points
Variability
You may notice that if you run the same command multiple times for the same dataset you might get slightly different out for some distances:
mothur > hcluster(phylip=98_lt_phylip_amazon.dist) It took 0 seconds to sort. unique 2 94 2 0.00 2 92 3 0.01 2 88 5 0.02 4 84 2 2 1 0.03 4 75 6 1 2 0.04 4 69 9 1 2 0.05 4 55 13 3 2 0.06 4 48 14 2 4 0.07 4 44 16 2 4 0.08 7 36 15 4 2 1 0 1 0.09 7 36 12 4 3 0 0 2 0.10 7 35 12 2 3 0 0 3 0.11 14 31 8 2 6 0 0 1 0 1 ...
mothur > hcluster(phylip=98_lt_phylip_amazon.dist) It took 0 seconds to sort. unique 2 94 2 0.00 2 92 3 0.01 2 88 5 0.02 4 84 2 2 1 0.03 4 75 6 1 2 0.04 4 69 9 1 2 0.05 4 55 13 3 2 0.06 4 48 14 2 4 0.07 4 44 16 2 4 0.08 7 35 17 3 2 1 0 1 0.09 7 35 14 3 3 0 0 2 0.10 7 34 13 3 2 0 0 3 0.11 14 30 9 3 5 0 0 1 0 1 ...
At a distance of 0.11 these two executions diverge from one another. This is because there was a tie. A sequence could have joined more than one pre-existing OTU. mothur is programmed to randomly select the OTU that it should join. Because of this, it is possible to get differences between runs. This is just a byproduct of using an algorithm-based approach to clustering.
This command is an implementation of the HCluster algorythmn described in
ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences by
Yijun Sun1,2,*, Yunpeng Cai2, Li Liu1, Fahong Yu1, Michael L. Farrell3, William McKendree3 and William Farmerie1 1
Interdisciplinary Center for Biotechnology Research, 2Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32610-3622 and 3Materials Technology Directorate, Air Force Technical Applications Center, 1030 S. Highway A1A, Patrick AFB, FL 32925-3002, USA Received January 28, 2009; Revised April 14, 2009; Accepted April 15, 2009