
Difference between revisions of "Cluster"

Line 6: Line 6:
 
* [[AGC]]:
 
* [[AGC]]:
 
* [[DGC]]:
 
* [[DGC]]:
 +
* [[Opti]]: OTUs are assembled using metrics to determine the quality of clustering.
  
If there is an algorithm that you would like to see implemented, please consider either contributing to the mothur project or contacting the developers and we'll see what we can do.  The average neighbor algorithm is the default option.  For this tutorial you should download the [[Media:AmazonData.zip|AmazonData.zip]] file and decompress it.
+
If there is an algorithm that you would like to see implemented, please consider either contributing to the mothur project or contacting the developers and we'll see what we can do.  The average neighbor algorithm is the default option.  For this tutorial you should download the [[Media:Final.zip|Final.zip]] file and decompress it.
  
 
__TOC__
 
__TOC__
Line 17: Line 18:
 
To read in a [[phylip-formatted distance matrix]] you need to use the phylip option:
 
To read in a [[phylip-formatted distance matrix]] you need to use the phylip option:
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist)
+
  mothur > cluster(phylip=final.phylip.dist)
 
+
or
+
 
+
mothur > cluster(phylip=98_lt_phylip_amazon.dist)
+
  
 
Whereas dotur required you to indicate whether the matrix was square or lower-triangular, mothur is able to figure this out for you.
 
Whereas dotur required you to indicate whether the matrix was square or lower-triangular, mothur is able to figure this out for you.
Line 27: Line 24:
 
Once you execute the command, mothur reads in the matrix and generates a progress bar:
 
Once you execute the command, mothur reads in the matrix and generates a progress bar:
  
  mothur > cluster(phylip=98_lt_phylip_amazon.dist)
+
  mothur > cluster(phylip=final.phylip.dist)
 
  *******************#****#****#****#****#****#****#****#****#****#****#
 
  *******************#****#****#****#****#****#****#****#****#****#****#
 
  Reading matrix:    |||||||||||||||||||||||||||||||||||||||||||||||||||
 
  Reading matrix:    |||||||||||||||||||||||||||||||||||||||||||||||||||
Line 33: Line 30:
  
  
===column & name===
+
===column & name or count ===
To read in a [[column-formatted distance matrix]] you must provide a filename for the name option.  The .names file was generated during the [[unique.seqs]] command.
+
To read in a [[column-formatted distance matrix]] you must provide a filename for the name or count option.  The .names file was generated during the [[unique.seqs]] command.
  
  mothur > cluster(column=96_sq_column_amazon.dist, name=amazon.names)
+
  mothur > cluster(column=final.dist, name=final.names)
 
or
 
or
  mothur > cluster(column=96_lt_column_amazon.dist, name=amazon.names)
+
  mothur > cluster(column=final.dist, count=final.count_table)
  
 
Again, the column-formatted distance matrix can be square or lower-triangle and mothur will figure it out.
 
Again, the column-formatted distance matrix can be square or lower-triangle and mothur will figure it out.
Line 47: Line 44:
 
There are several reasons to be interested in providing a name file with your distance matrix.  First, as sequencing collections increase in size, the number of duplicate sequences is increasing.  This is especially the case with sequences generated via pyrosequencing.  Sogin and colleagues [http://www.pnas.org/content/103/32/12115.full] found that less than 50% of their sequences were unique.  Because the alignments and distances for the duplicate sequences are the same, re-processing each duplicate sequence takes a considerable amount of computing time and memory.   
 
There are several reasons to be interested in providing a name file with your distance matrix.  First, as sequencing collections increase in size, the number of duplicate sequences is increasing.  This is especially the case with sequences generated via pyrosequencing.  Sogin and colleagues [http://www.pnas.org/content/103/32/12115.full] found that less than 50% of their sequences were unique.  Because the alignments and distances for the duplicate sequences are the same, re-processing each duplicate sequence takes a considerable amount of computing time and memory.   
  
Example from amazon.names:
+
Example from final.names:
 
  ...
 
  ...
  U68616 U68616
+
  GQY1XT001EYE6M GQY1XT001EYE6M,GQY1XT001D69D7,GQY1XT001A1LWJ
  U68617 U68617
+
  GQY1XT001EXZXC GQY1XT001EXZXC
  U68618 U68618,U68620
+
  GQY1XT001EXZLY GQY1XT001EXZLY
  U68619 U68619
+
GQY1XT001EXOOM GQY1XT001EXOOM
  U68621 U68621
+
GQY1XT001EX24Z GQY1XT001EX24Z,GQY1XT001AMCGM
 +
  GQY1XT001EWUBU GQY1XT001EWUBU,GQY1XT001DJLCH,GQY1XT001B50B7
 +
  GQY1XT001EWJBM GQY1XT001EWJBM
 
  ...
 
  ...
  
Line 66: Line 65:
 
  ...
 
  ...
  
A names file is not required (unless you are using the column= option), but depending on the data set to be analyzed, could significantly accelerate the processing time of downstream calculations.  Although this is a simple example, the 98 sequence amazon data set has two pairs of duplicate sequences (U68618 and U68620) and (U68667 and U68641).  The distance matrix in the file 96_lt_phylip_amazon.dist is a lower triangle matrix for the 96 unique sequences.  While you could just read the matrix in and analyze the set of 96 unqiue sequences, this would give a considerably different analysis than if you used the entire 98 sequence data set.  Considering the frequency of sequences is critical for pretty much every analysis in mothur, we want to use the name file to artificially inflate the matrix to its full size.  In this case we use the namefile option:
+
A count or name file is not required (unless you are using the column= option), but depending on the data set to be analyzed, could significantly accelerate the processing time of downstream calculations.  In this simple example, the final dataset contains 51474 sequences.  The distance matrix in the file final.phylip.dist is a lower triangle matrix for the 3772 unique sequences.  While you could just read the matrix in and analyze the set of 3772 unqiue sequences, this would give a considerably different analysis than if you used the entire 51474 sequence data set.  Considering the frequency of sequences is critical for pretty much every analysis in mothur, we want to use the name or count file to artificially inflate the matrix to its full size.  In this case we use the namefile option:
  
  mothur > cluster(phylip=96_lt_phylip_amazon.dist, name=amazon.names)
+
  mothur > cluster(phylip=final.phylip.dist, name=final.names)
  
 
mothur remembers that the distances for the reference sequence also apply to all of the sequences listed in the second column.  Using a name file can considerably accelerate the amount of processing time required to analyze some data sets.
 
mothur remembers that the distances for the reference sequence also apply to all of the sequences listed in the second column.  Using a name file can considerably accelerate the amount of processing time required to analyze some data sets.
  
By default cluster() executes the average neighbor clustering algorithm.  For a detailed description of this and the other algorithms check out the [[example clustering calculations]] page.  Next lets run the cluster() command:
+
By default cluster() executes the opticlust clustering algorithm.  For a detailed description of this and the other algorithms check out the [[example clustering calculations]] page.  Next lets run the cluster() command:
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist)
+
  mothur > cluster(phylip=final.phylip.dist, name=final.names)
  
 
This command will generate the following output:
 
This command will generate the following output:
  
  unique 2 94 2  
+
  Clustering /Users/sarahwestcott/desktop/release/final.phylip.dist
  0.00 2 92 3
+
   
  0.01 3 87 4 1
+
   
  0.02 4 83 2 1 2
+
  iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score
0.03 4 75 6 1 2
+
  0 0 0.03 3772 0.03 0 7059666 0 52440 0 1 0 0.992627 0 0.992627 0 0
  0.04 4 69 9 1 2
+
  1 0 0.03 1261 0.03 27541 7053368 6298 24899 0.525191 0.999108 0.813883 0.996482 0.186117 0.995614 0.651823 0.638417
  0.05 4 55 13 3 2
+
  2 0 0.03 1184 0.03 30143 7052434 7232 22297 0.574809 0.998976 0.806502 0.996848 0.193498 0.995848 0.678933 0.671224
  0.06 4 48 14 2 4
+
  3 0 0.03 1176 0.03 30254 7052676 6990 22186 0.576926 0.99901 0.812319 0.996864 0.187681 0.995898 0.682669 0.67468
  0.07 7 42 15 2 2 1 0 1
+
  4 0 0.03 1175 0.03 30283 7052713 6953 22157 0.577479 0.999015 0.813272 0.996868 0.186728 0.995907 0.683404 0.675387
  ...
+
  5 0 0.03 1176 0.03 30256 7052761 6905 22184 0.576964 0.999022 0.814187 0.996864 0.185813 0.99591 0.683487 0.67535
  
Outputted to the screen is a label describing the distance cutoff used to form OTUs, the number of sequences in the largest OTU, the number of OTUs with only one sequence, with two, etc.  Running the cluster() command generates three output files whose names end in [[sabund file|sabund]], [[rabund file|rabund]], and [[list file|list]].  The data outputted to the screen is the same as that in the sabund file.  You will notice that the sample rabund, sabund, and list files each have a ".an." tag inserted after the name of the distance matrix.  an corresponds to the algorithm that was used to perform the clustering.  In this case average neighbor (an) was used.  Other possibilities include "fn" for furthest neighbor and "nn" for nearest neighbor.
+
Running the cluster() command generates three output files whose names end in [[sabund file|sabund]], [[rabund file|rabund]], and [[list file|list]].  The data outputted to the screen is the same as that in the sabund file.  You will notice that the sample rabund, sabund, and list files each have a ".opti." tag inserted after the name of the distance matrix.  opti corresponds to the algorithm that was used to perform the clustering.  In this case opticlust (opti) was used.  Other possibilities include "an" for average neighbor, "fn" for furthest neighbor, "nn" for nearest neighbor.  Vsearch clustering algorithms include: "agc" and "dgc".
  
 
===count===
 
===count===
 
The [[Count_File | count]] file is similar to the name file in that it is used to represent the number of duplicate sequences for a given representative sequence.  Mothur will use this information to form the correct OTU's.  Unlike, when you use a names file the list file generated will contain only the unique names, so be sure to include the count file in downstream analysis with the list file.
 
The [[Count_File | count]] file is similar to the name file in that it is used to represent the number of duplicate sequences for a given representative sequence.  Mothur will use this information to form the correct OTU's.  Unlike, when you use a names file the list file generated will contain only the unique names, so be sure to include the count file in downstream analysis with the list file.
  
  mothur > make.table(name=amazon.names)
+
  mothur > make.table(name=final.names)
 
   
 
   
  Example from amazon.count_table:
+
  Example from final.count_table:
   ...
+
    
   U68616 1
+
   Representative_Sequence total
  U68617 1
+
GQY1XT001CFHYQ 467
  U68618 2
+
GQY1XT001C44N8 3677
  U68619 1
+
GQY1XT001C296C 4652
  U68621 1
+
GQY1XT001ARCB1 2202
 +
GQY1XT001CFWVZ 1967
 +
GQY1XT001DHF2X 2137
 +
GQY1XT001AEGCJ 2140
 +
GQY1XT001CPCVN 2837
 
   ...
 
   ...
 
   
 
   
  mothur > cluster(phylip=96_lt_phylip_amazon.dist, count=amazon.count_table)
+
  mothur > cluster(phylip=final.phylip.dist, count=amazon.count_table)
  
 
==Options==
 
==Options==
 
===method===
 
===method===
By default cluster() uses the average neighbor algorithm; this can be changed with the method option.  By running the following command you will get the same output as just running cluster():
+
The methods available in mothur include opticlust (opti), average neighbor (average), furthest neighbor (furthest), nearest neighbor (nearest), Vsearch agc (agc), Vsearch dgc (dgc).  By default cluster() uses the opticlust algorithm; this can be changed with the method option.   
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist, method=average)
+
  mothur > cluster(column=final.dist, count=final.count_table, method=opti)
  
To obtain a nearest neighbor clustering of the data use the method option to produce the subsequent output:
+
To obtain a average neighbor clustering of the data use the method option to produce the subsequent output:
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist, method=nearest)
+
  mothur > cluster(column=final.dist, count=final.count_table, method=average, cutoff=0.15)
unique 2 94 2
+
0.00 2 92 3
+
0.01 4 86 4 0 1
+
0.02 4 83 2 1 2
+
0.03 4 75 6 1 2
+
0.04 4 68 8 2 2
+
0.05 5 53 13 2 2 1
+
0.06 13 47 12 2 2 0 0 0 0 0 0 0 0 1
+
0.07 16 41 10 2 2 0 0 1 0 0 0 0 0 0 0 0 1
+
...
+
  
To obtain an furthest neighbor clustering of the data again use the method option to produce the subsequent output:
+
  unique 4652 2678 442 124 69 59 35 32 20 17 16 18 12 9 7 6 6 3 2 9 9 7 9 6 5 6 3 21
 
+
  0.01 4964 1896 316 115 65 40 24 20 19 13 10 12 9 11 4 9 5 7 4 2 3 7 7 7 4 6 8 51
mothur > cluster(phylip=98_sq_phylip_amazon.dist, method=furthest)
+
  0.02 5451 986 184 84 49 28 17 14 17 7 10 7 9 3 6 2 6 4 2 1 3 2 1 3 3 2 3 01
 
+
  0.03 6129 624 144 49 39 17 14 11 8 10 7 5 8 6 5 1 4 5 0 2 1 0 3 2 1 0 1 11
  unique 2 94 2
+
  0.04 6159 411 108 44 22 11 9 6 7 7 8 5 7 3 4 1 4 3 2 1 1 1 2 1 2 0 1 11
  0.01 2 88 5
+
  0.05 7023 258 92 36 23 11 2 5 2 6 8 4 6 2 5 1 3 3 1 1 0 1 2 1 0 0 1 01
  0.02 4 85 3 1 1
+
changed cutoff to 0.0533794
  0.03 4 83 2 1 2
+
  0.04 4 73 7 1 2
+
  0.05 4 64 10 2 2
+
0.06 4 50 13 2 4
+
0.07 7 45 14 2 3 0 0 1
+
  
 
===cutoff===
 
===cutoff===
Similar to reading in the distance matrix, you can set a cutoff value for performing the clustering operation.  This will provide a similar boost in speed if you didn't set the cutoff for reading in the matrix.  If you already set the cutoff value when reading in the matrix, then don't worry about it for clustering unless you want an even smaller distance.  The cutoff can be set for the cluster command as follows:
+
With the opticlust method the list file is created for the cutoff you set.  The default cutoff is 0.03.  With the average neighbor, furthest neighbor and nearest neighbor methods the cutoff should be significantly higher than the desired distance in the list file.  We suggest cutoff=0.20.  This will provide a boost in speed and less RAM will be required than if you didn't set the cutoff for reading in the matrix.  The cutoff can be set for the cluster command as follows:
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist, cutoff=0.05)  
+
  mothur > cluster(column=final.dist, count=final.count_table, cutoff=0.05)  
  unique 2 94 2
+
   
  0.00 2 92 3
+
  iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score
  0.01 2 88 5
+
  0 0 0.05 3772 0.05 0 6918676 0 193430 0 1 0 0.972803 0 0.972803 0 0
  0.02 4 84 2 2 1
+
  1 0 0.05 699 0.05 126630 6892041 26635 66800 0.654655 0.99615 0.826216 0.990401 0.173784 0.986863 0.729012 0.730498
  0.03 4 75 6 1 2
+
2 1 0.05 625 0.05 133759 6887952 30724 59671 0.691511 0.995559 0.813209 0.991411 0.186791 0.98729 0.743526 0.747439
  0.04 4 69 9 1 2
+
  3 0 0.05 620 0.05 134033 6887845 30831 59397 0.692928 0.995544 0.812991 0.99145 0.187009 0.987313 0.744201 0.748173
  0.05 4 55 13 3 2
+
  4 0 0.05 621 0.05 133952 6888036 30640 59478 0.692509 0.995571 0.813843 0.991439 0.186157 0.987329 0.744378 0.748289
 +
  5 0 0.05 621 0.05 133952 6888036 30640 59478 0.692509 0.995571 0.813843 0.991439 0.186157 0.987329 0.744378 0.748289
  
 
===precision===
 
===precision===
Perhaps the most commonly asked question is why the cluster command produces data for both the "unique" and "0.00" lines.  Aren't they the same?  No.  The "unique" line represents data for the situation where all of the sequences in an OTU are identical; the "0.00" line represents data for the situation where all of the sequences in an OTU have pairwise distances less than 0.0049.  We made the decision that because there is error in everything, we should round these distances as well and not apply a hard cutoff at 0.01, 0.02, etc.  If you want greater precision, there is a precision option in the read.dist() and cluster() commands:
+
If you want greater precision, there is a precision option in the cluster() command:
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist, cutoff=0.02, precision=1000)
+
  mothur > cluster(column=final.dist, count=final.count_table, method=average,  precision=1000, cutoff=0.10)
  unique 2 94 2
+
  unique 4652 2678 442 124 69 59 35 32 20 17 16 18 12 9 7 6 6 3 2 9 9 7 9 6 5 6 3 21
  0.003 2 92 3  
+
  0.004 4964 2300 411 112 72 38 31 23 21 10 16 15 12 14 12 4 3 8 4 5 6 7 8 4 7 5 5 31
  0.006 2 90 4
+
  0.005 4964 2276 396 103 70 40 29 23 19 10 15 16 13 12 9 6 5 6 4 7 6 6 8 4 6 6 6 41
  0.008 2 88 5
+
  0.006 4964 2259 371 110 71 36 28 25 21 8 15 16 12 11 5 7 6 6 5 7 4 6 9 4 4 5 7 51
  0.017 3 87 4 1
+
0.007 4964 2253 361 104 65 38 27 25 21 8 15 15 10 11 5 7 7 7 5 6 4 6 9 6 4 6 7 51
  0.018 3 86 3 2
+
  0.008 4964 1970 361 121 69 39 23 24 20 12 14 12 13 11 7 9 6 7 3 5 4 8 6 5 5 8 5 71
  0.020 4 84 2 2 1
+
0.009 4964 1937 353 111 68 36 27 22 20 11 13 13 9 10 4 10 7 8 4 4 3 7 4 6 4 5 8 71
 +
0.010 4964 1901 313 114 68 39 25 20 20 13 10 12 8 11 2 8 6 7 4 2 3 7 7 7 6 5 8 51
 +
  0.011 4983 1603 296 127 64 33 27 20 17 14 9 11 10 7 3 6 8 7 5 4 3 6 7 7 3 4 6 61
 +
0.012 5011 1500 282 118 57 27 27 19 14 18 7 9 7 6 5 5 8 4 4 5 3 4 4 6 5 3 3 61
 +
0.013 5371 1453 258 113 55 25 26 16 13 14 7 9 6 6 7 5 7 4 3 4 3 3 4 5 5 4 4 51
 +
0.014 5380 1419 247 100 55 19 22 19 12 12 8 8 8 6 5 5 6 4 3 1 6 3 2 6 3 4 5 61
 +
0.015 5447 1249 236 111 54 22 21 17 13 13 8 6 7 5 7 3 8 2 3 1 5 3 3 4 4 3 5 31
 +
0.016 5450 1207 223 100 54 22 19 16 14 11 9 7 8 4 6 5 7 2 2 2 4 2 2 3 4 3 4 31
 +
0.017 5450 1173 200 97 56 25 21 14 15 10 9 7 8 4 6 5 7 1 1 1 4 2 2 3 3 3 4 31
 +
  0.018 5450 1154 192 89 52 26 19 14 15 11 9 7 8 3 4 5 7 2 1 1 3 3 2 3 3 3 3 21
 +
0.019 5450 1020 207 85 53 29 19 15 15 13 8 6 7 2 6 5 5 2 2 1 3 3 1 4 3 2 3 11
 +
  0.020 5451 987 186 80 49 29 16 16 16 9 9 7 8 3 7 2 6 4 2 1 3 2 1 3 3 2 3 01
 +
0.021 5451 968 172 70 44 29 16 19 14 9 9 8 9 3 6 2 6 5 2 1 2 2 1 3 3 1 2 11
 +
0.022 6114 884 185 69 39 29 13 18 14 9 8 7 9 4 5 2 7 5 2 2 1 1 2 3 3 1 2 11
 +
0.023 6114 846 180 66 39 25 14 17 12 8 10 6 9 4 6 2 6 5 2 2 1 1 2 2 2 2 2 11
 +
0.024 6114 817 166 69 41 23 11 16 13 10 10 5 8 4 5 0 8 6 2 2 2 1 2 2 2 1 3 11
 +
0.025 6115 802 162 62 40 23 11 14 12 9 12 6 7 4 5 0 8 6 1 1 1 1 3 2 1 1 2 01
 +
0.026 6115 744 168 62 37 20 13 11 11 10 11 7 7 5 3 1 5 6 0 2 2 1 3 2 1 1 2 01
 +
0.027 6115 725 163 57 38 18 15 10 11 10 11 7 7 4 4 1 4 6 0 2 2 1 3 1 1 0 3 01
 +
0.028 6115 705 146 52 40 17 15 12 12 9 8 7 9 3 5 1 4 5 0 2 2 1 2 2 1 0 2 01
 +
0.029 6118 671 147 51 39 16 15 13 10 7 7 8 9 5 5 1 4 5 0 2 2 1 2 2 1 0 2 01
 +
0.030 6129 622 148 50 39 17 14 12 9 8 6 7 9 6 5 1 4 5 0 2 1 0 3 2 1 0 1 11
 +
0.031 6129 603 133 54 37 17 15 10 9 9 7 5 7 8 6 1 4 5 0 2 1 0 3 2 1 0 1 11
 +
0.032 6130 583 127 46 38 15 16 7 9 8 7 6 6 6 6 2 4 5 0 2 0 1 3 2 1 0 0 11
 +
0.033 6136 533 130 50 37 14 14 6 8 7 6 7 7 5 5 2 4 4 0 2 0 1 3 2 1 0 0 11
 +
0.034 6153 525 125 47 35 15 13 6 8 8 6 7 7 4 3 2 4 4 0 2 1 1 3 2 1 0 0 11
 +
changed cutoff to 0.0342406
  
 
Remember that the 16S rRNA gene is roughly 1,500 bp long.  So it would seem silly to have a precision greater than 1,000.  Just because you can calculate a number to 20 digits, doesn't mean they're all significant.
 
Remember that the 16S rRNA gene is roughly 1,500 bp long.  So it would seem silly to have a precision greater than 1,000.  Just because you can calculate a number to 20 digits, doesn't mean they're all significant.
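
As a rough illustration of how the precision option changes the labels, the sketch below rounds a distance to the nearest 1/precision. The function name distance_label is hypothetical and the rounding rule is an assumption made for illustration; it is not taken from the mothur source code.

```python
def distance_label(distance, precision=100):
    """Round a distance to the nearest 1/precision and format it as a label.

    Assumption: simple round-half-up on distance * precision.
    """
    # Number of decimal places implied by the precision (100 -> 2, 1000 -> 3).
    digits = len(str(precision)) - 1
    rounded = int(distance * precision + 0.5) / precision
    return f"{rounded:.{digits}f}"


if __name__ == "__main__":
    for d in (0.0049, 0.0342406):
        print(d, distance_label(d, 100), distance_label(d, 1000))
```

With precision=100 a distance of 0.0049 falls under the "0.00" label, whereas precision=1000 separates it into its own "0.005" label.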
Line 168: Line 183:
 
===sim===
 
===sim===
 
The sim parameter is used to indicate that your input file contains similarity values instead of distance values. The default is false, if sim=true then mothur will convert the similarity values to distances.
 
The sim parameter is used to indicate that your input file contains similarity values instead of distance values. The default is false, if sim=true then mothur will convert the similarity values to distances.
 
mothur > cluster(column=96_lt_column_11_amazon.dist, name=amazon.names, sim=t)
 
  
 
==Clustering with vsearch==
 
==Clustering with vsearch==
The vsearch program is written by [https://github.com/torognes/vsearch the vsearch team].  You can now use vsearch clustering methods through mothur.   NOTE: vsearch is not available for Windows.
+
The vsearch program is written by [https://github.com/torognes/vsearch the vsearch team].  You can now use vsearch clustering methods through mothur.  
  
 
===fasta===
 
===fasta===
 
Vsearch requires a fasta file to cluster.
 
Vsearch requires a fasta file to cluster.
  
  mothur > cluster(fasta=amazon.fasta, method=agc)
+
  mothur > cluster(fasta=final.fasta, count=final.count_table, method=agc)
  
===name===
+
===Vsearch methods===
The name parameter allows you to enter the name file associated with your fasta file.
+
 
+
mothur > cluster(fasta=amazon.fasta, name=amazon.names, method=agc)
+
 
+
===count===
+
The count parameter allows you to enter the count file associated with your fasta file.
+
 
+
mothur > cluster(fasta=amazon.fasta, count=amazon.count_table, method=agc)
+
 
+
===method===
+
 
The available clustering methods are agc and dgc.  
 
The available clustering methods are agc and dgc.  
  
  mothur > cluster(fasta=amazon.fasta, method=dgc)
+
  mothur > cluster(fasta=final.fasta, count=final.count_table, method=dgc)
  
 
==Finer points==
 
==Finer points==
Line 202: Line 205:
 
You may notice that if you run the same command multiple times for the same dataset you might get slightly different out for some distances:
 
You may notice that if you run the same command multiple times for the same dataset you might get slightly different out for some distances:
  
  mothur > cluster(phylip=98_sq_phylip_amazon.dist, method=furthest)
+
  mothur > cluster(column=final.dist, count=final.count_table)
  unique 2 94 2
+
 
  0.00 2 92 3
+
  iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score
  0.01 2 88 5
+
  0 0 0.03 3772 0.03 0 7059666 0 52440 0 1 0 0.992627 0 0.992627 0 0
  0.02 4 84 2 2 1
+
  1 0 0.03 1249 0.03 27469 7053405 6261 24971 0.523818 0.999113 0.814379 0.996472 0.185621 0.995609 0.651167 0.637554
  0.03 4 75 6 1 2
+
  2 0 0.03 1178 0.03 30311 7052546 7120 22129 0.578013 0.998991 0.809783 0.996872 0.190217 0.995887 0.682234 0.674545
  0.04 4 69 9 1 2
+
  3 0 0.03 1174 0.03 30877 7052133 7533 21563 0.588806 0.998933 0.803879 0.996952 0.196121 0.995909 0.686061 0.679736
  0.05 4 55 13 3 2
+
  4 0 0.03 1172 0.03 31138 7051940 7726 21302 0.593783 0.998906 0.801204 0.996988 0.198796 0.995919 0.687808 0.682073
  0.06 4 48 14 2 4
+
  5 0 0.03 1173 0.03 31237 7051921 7745 21203 0.595671 0.998903 0.801319 0.997002 0.198681 0.99593 0.688956 0.683358
  0.07 4 44 16 2 4
+
  6 0 0.03 1173 0.03 31268 7051953 7713 21172 0.596262 0.998907 0.802134 0.997007 0.197866 0.995939 0.689655 0.684044
  0.08 7 35 17 3 2 1 0 1
+
  7 0 0.03 1172 0.03 31336 7051924 7742 21104 0.597559 0.998903 0.801883 0.997016 0.198117 0.995944 0.6903 0.684805
  ...
+
  8 0 0.03 1172 0.03 31394 7051880 7786 21046 0.598665 0.998897 0.801276 0.997024 0.198724 0.995946 0.690677 0.685309
 +
  9 0 0.03 1173 0.03 31367 7051922 7744 21073 0.59815 0.998903 0.801999 0.997021 0.198001 0.995948 0.690694 0.685236
  
mothur > cluster(phylip=98_sq_phylip_amazon.dist, method=furthest)
 
unique 2 94 2
 
0.00 2 92 3
 
0.01 2 88 5
 
0.02 4 84 2 2 1
 
0.03 4 75 6 1 2
 
0.04 4 69 9 1 2
 
0.05 4 55 13 3 2
 
0.06 4 48 14 2 4
 
0.07 4 44 16 2 4
 
0.08 7 36 15 4 2 1 0 1
 
...
 
  
At a distance of 0.08 these two executions diverge from one another. This is because there was a tie. A sequence could have joined more than one pre-existing OTU. mothur is programmed to randomly select the OTU that it should join. Because of this, it is possible to get differences between runs. This is just a byproduct of using an algorithm-based approach to clustering.
+
mothur > cluster(column=final.dist, count=final.count_table)
 +
 
 +
iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score
 +
0 0 0.03 3772 0.03 0 7059666 0 52440 0 1 0 0.992627 0 0.992627 0 0
 +
  1 0 0.03 1250 0.03 29483 7051666 8000 22957 0.562223 0.998867 0.78657 0.996755 0.21343 0.995647 0.66296 0.655739
 +
  2 0 0.03 1165 0.03 31938 7050687 8979 20502 0.609039 0.998728 0.780556 0.997101 0.219444 0.995855 0.687484 0.684212
 +
  3 0 0.03 1167 0.03 32087 7050748 8918 20353 0.61188 0.998737 0.782514 0.997122 0.217486 0.995884 0.68997 0.686757
 +
  4 0 0.03 1169 0.03 31986 7051007 8659 20454 0.609954 0.998773 0.78696 0.997108 0.21304 0.995907 0.690857 0.687243
 +
  5 0 0.03 1168 0.03 31948 7051085 8581 20492 0.60923 0.998785 0.788275 0.997102 0.211725 0.995912 0.691029 0.687283
 +
6 0 0.03 1170 0.03 31966 7051075 8591 20474 0.609573 0.998783 0.788175 0.997105 0.211825 0.995913 0.69118 0.687463
 +
7 0 0.03 1170 0.03 31932 7051142 8524 20508 0.608924 0.998793 0.789302 0.9971 0.210698 0.995918 0.69131 0.687478
 +
8 0 0.03 1170 0.03 31905 7051185 8481 20535 0.60841 0.998799 0.790001 0.997096 0.209999 0.99592 0.691326 0.687415
 +
 
 +
The variability is caused by the randomization of the sequences.
  
 
==Revisions==
 
==Revisions==
Line 238: Line 243:
 
* 1.38.0 - Fixes bug with age method.
 
* 1.38.0 - Fixes bug with age method.
 
* 1.38.1 - Removes hard parameter.
 
* 1.38.1 - Removes hard parameter.
 +
* 1.39.0 - Adds opticlust method.
 +
* 1.39.1 - Corrects printing issues with opticlust method.
  
 
[[Category:Commands]]
 
[[Category:Commands]]

Revision as of 19:52, 3 February 2017

Once a distance matrix gets read into mothur, the cluster command can be used to assign sequences to OTUs. Presently, mothur implements the following clustering methods:

  • Nearest neighbor: Each of the sequences within an OTU is at most X% distant from the most similar sequence in the OTU.
  • Furthest neighbor: All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU.
  • Average neighbor: This method is a middle ground between the other two algorithms.
  • AGC:
  • DGC:
  • Opti: OTUs are assembled using metrics to determine the quality of clustering.

If there is an algorithm that you would like to see implemented, please consider either contributing to the mothur project or contacting the developers and we'll see what we can do. The opticlust algorithm is the default option. For this tutorial you should download the Final.zip file and decompress it.

Default settings

Either a phylip-formatted distance matrix or a column-formatted distance matrix must be provided for cluster to be successful; the default output of the dist.seqs command is the column format. If you have a favorite format, please let us know and we can work with you to incorporate that feature into mothur. Because the phylip format is so popular, most software can generate this format for you.

phylip

To read in a phylip-formatted distance matrix you need to use the phylip option:

mothur > cluster(phylip=final.phylip.dist)

Whereas dotur required you to indicate whether the matrix was square or lower-triangular, mothur is able to figure this out for you.
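
To make the distinction concrete, here is a minimal sketch of how the two phylip layouts could be told apart. It assumes the usual phylip conventions (first line holds the number of sequences; in a lower-triangle matrix the first row has a name and no distances, while in a square matrix every row has one distance per sequence). The function name phylip_matrix_shape is hypothetical, and this is not how mothur itself is implemented.

```python
def phylip_matrix_shape(text):
    """Classify a phylip distance matrix as 'square' or 'lower-triangle'.

    Illustrative only: inspects the first data row and compares the number
    of distances on it with the sequence count from the header line.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    if n < 2 or len(lines) < 2:
        raise ValueError("need at least two sequences to decide")
    distances_in_first_row = len(lines[1].split()) - 1  # drop the name
    if distances_in_first_row == 0:
        return "lower-triangle"
    if distances_in_first_row == n:
        return "square"
    raise ValueError("first row matches neither layout")


if __name__ == "__main__":
    square = "3\nA 0.00 0.10 0.20\nB 0.10 0.00 0.30\nC 0.20 0.30 0.00\n"
    lower = "3\nA\nB 0.10\nC 0.20 0.30\n"
    print(phylip_matrix_shape(square))
    print(phylip_matrix_shape(lower))
```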

Once you execute the command, mothur reads in the matrix and generates a progress bar:

mothur > cluster(phylip=final.phylip.dist)
*******************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:    |||||||||||||||||||||||||||||||||||||||||||||||||||
**********************************************************************


column & name or count

To read in a column-formatted distance matrix you must provide a filename for the name or count option. The .names file was generated during the unique.seqs command.

mothur > cluster(column=final.dist, name=final.names)

or

mothur > cluster(column=final.dist, count=final.count_table)

Again, the column-formatted distance matrix can be square or lower-triangle and mothur will figure it out.
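
For readers who want to inspect a column-formatted file outside of mothur, the following sketch reads one. It assumes each line holds two sequence names and a distance separated by whitespace; the function name read_column_distances and the sample names are hypothetical. Storing each pair as an unordered key means a square listing (both A B and B A) and a lower-triangle listing produce the same result.

```python
def read_column_distances(lines, cutoff=None):
    """Read 'nameA nameB distance' lines into a dict keyed by unordered pair.

    Pairs with a distance above the optional cutoff are skipped, as are
    self-comparisons and lines that do not have exactly three fields.
    """
    pairs = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        first, second, distance = fields[0], fields[1], float(fields[2])
        if first == second:
            continue
        if cutoff is not None and distance > cutoff:
            continue
        pairs[frozenset((first, second))] = distance
    return pairs


if __name__ == "__main__":
    sample = ["seqA seqB 0.02", "seqB seqA 0.02", "seqA seqC 0.12"]
    print(read_column_distances(sample, cutoff=0.05))
```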

name

A names file contains two columns. The first column contains the name of a reference sequence that is in a distance matrix and the second column contains the names of the sequences (separated by commas) that the reference sequence represents. The list of names in the second column should always contain at least the reference sequence name.

There are several reasons to be interested in providing a name file with your distance matrix. First, as sequencing collections increase in size, the number of duplicate sequences is increasing. This is especially the case with sequences generated via pyrosequencing. Sogin and colleagues [1] found that less than 50% of their sequences were unique. Because the alignments and distances for the duplicate sequences are the same, re-processing each duplicate sequence takes a considerable amount of computing time and memory.

Example from final.names:

...
GQY1XT001EYE6M	GQY1XT001EYE6M,GQY1XT001D69D7,GQY1XT001A1LWJ
GQY1XT001EXZXC	GQY1XT001EXZXC
GQY1XT001EXZLY	GQY1XT001EXZLY
GQY1XT001EXOOM	GQY1XT001EXOOM
GQY1XT001EX24Z	GQY1XT001EX24Z,GQY1XT001AMCGM
GQY1XT001EWUBU	GQY1XT001EWUBU,GQY1XT001DJLCH,GQY1XT001B50B7
GQY1XT001EWJBM	GQY1XT001EWJBM
...

Second, if you pre-screen a clone library using ARDRA then you may only have a sequence for a handful of clones, but you know the number of times that you have seen a sequence like it. In such a case the second column of the names file would contain the sequence name as well as dummy sequence names:

...
AA1234	AA1234,AA1234.1,AA1234.2
AA1235	AA1235
AA1236	AA1236,AA1236.1
AA1237	AA1237,AA1237.1,AA1237.2,AA1237.3
AA1238	AA1238,AA1238.1
...
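
The two-column layout described above is easy to parse. The sketch below is an illustration only (the function name parse_names is hypothetical): it reads names-file lines into a mapping from each reference sequence to the sequences it represents, then reports how many unique and total sequences the excerpt from final.names covers.

```python
def parse_names(lines):
    """Map each reference sequence to the list of sequences it represents."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skips blank lines and the "..." placeholders
        reference, members = fields[0], fields[1].split(",")
        groups[reference] = members
    return groups


if __name__ == "__main__":
    excerpt = [
        "...",
        "GQY1XT001EYE6M\tGQY1XT001EYE6M,GQY1XT001D69D7,GQY1XT001A1LWJ",
        "GQY1XT001EXZXC\tGQY1XT001EXZXC",
        "GQY1XT001EXZLY\tGQY1XT001EXZLY",
        "GQY1XT001EXOOM\tGQY1XT001EXOOM",
        "GQY1XT001EX24Z\tGQY1XT001EX24Z,GQY1XT001AMCGM",
        "GQY1XT001EWUBU\tGQY1XT001EWUBU,GQY1XT001DJLCH,GQY1XT001B50B7",
        "GQY1XT001EWJBM\tGQY1XT001EWJBM",
        "...",
    ]
    groups = parse_names(excerpt)
    total = sum(len(members) for members in groups.values())
    print(f"{len(groups)} unique sequences represent {total} sequences")
```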

A count or name file is not required (unless you are using the column= option), but depending on the data set to be analyzed, it could significantly reduce the processing time of downstream calculations. In this simple example, the final dataset contains 51474 sequences. The distance matrix in the file final.phylip.dist is a lower triangle matrix for the 3772 unique sequences. While you could just read the matrix in and analyze the set of 3772 unique sequences, this would give a considerably different analysis than if you used the entire 51474 sequence data set. Because the frequency of sequences is critical for pretty much every analysis in mothur, we want to use the name or count file to artificially inflate the matrix to its full size. In this case we use the name option:

mothur > cluster(phylip=final.phylip.dist, name=final.names)

mothur remembers that the distances for the reference sequence also apply to all of the sequences listed in the second column. Using a name file can considerably reduce the processing time required to analyze some data sets.

By default cluster() executes the opticlust clustering algorithm. For a detailed description of this and the other algorithms check out the example clustering calculations page. Next, let's run the cluster() command:

mothur > cluster(phylip=final.phylip.dist, name=final.names)

This command will generate the following output:

Clustering /Users/sarahwestcott/desktop/release/final.phylip.dist


iter	time	label	num_otus	cutoff	tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
0	0	0.03	3772	0.03	0	7059666	0	52440	0	1	0	0.992627	0	0.992627	0	0	
1	0	0.03	1261	0.03	27541	7053368	6298	24899	0.525191	0.999108	0.813883	0.996482	0.186117	0.995614	0.651823	0.638417	
2	0	0.03	1184	0.03	30143	7052434	7232	22297	0.574809	0.998976	0.806502	0.996848	0.193498	0.995848	0.678933	0.671224	
3	0	0.03	1176	0.03	30254	7052676	6990	22186	0.576926	0.99901	0.812319	0.996864	0.187681	0.995898	0.682669	0.67468	
4	0	0.03	1175	0.03	30283	7052713	6953	22157	0.577479	0.999015	0.813272	0.996868	0.186728	0.995907	0.683404	0.675387	
5	0	0.03	1176	0.03	30256	7052761	6905	22184	0.576964	0.999022	0.814187	0.996864	0.185813	0.99591	0.683487	0.67535	

Running the cluster() command generates three output files whose names end in sabund, rabund, and list. The data output to the screen is the same as that in the sabund file. You will notice that the sample rabund, sabund, and list files each have a ".opti." tag inserted after the name of the distance matrix. The tag corresponds to the algorithm that was used to perform the clustering; in this case opticlust (opti) was used. Other possibilities include "an" for average neighbor, "fn" for furthest neighbor, and "nn" for nearest neighbor. The vsearch clustering algorithms use "agc" and "dgc".
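
The statistics in the opticlust table follow from the four counts tp, tn, fp, and fn. The sketch below recomputes them with the standard confusion-matrix formulas, which reproduce the values printed in the iteration 1 row above. It assumes that the four counts are tallied over pairs of unique sequences (they sum to 7112106, which is 3772 × 3771 / 2) and it reports 0 whenever a denominator is 0, as in the iteration 0 row. The function names are hypothetical.

```python
import math


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def clustering_metrics(tp, tn, fp, fn):
    """Recompute the summary columns of the opticlust table from the four counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "fdr": ratio(fp, tp + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
        "f1score": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Counts taken from the iteration 1 row of the table above.
row = clustering_metrics(tp=27541, tn=7053368, fp=6298, fn=24899)
for name, value in row.items():
    print(f"{name}\t{value:.6f}")
```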

count

The count file is similar to the name file in that it is used to represent the number of duplicate sequences for a given representative sequence. mothur will use this information to form the correct OTUs. Unlike when you use a name file, the list file generated will contain only the unique names, so be sure to include the count file with the list file in downstream analysis.

mothur > make.table(name=final.names)

Example from final.count_table:
 
 Representative_Sequence	total
GQY1XT001CFHYQ	467
GQY1XT001C44N8	3677
GQY1XT001C296C	4652
GQY1XT001ARCB1	2202
GQY1XT001CFWVZ	1967
GQY1XT001DHF2X	2137
GQY1XT001AEGCJ	2140
GQY1XT001CPCVN	2837
 ...

mothur > cluster(phylip=final.phylip.dist, count=final.count_table)
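
To check how many sequences a count file represents, the total column can simply be summed. The sketch below does this for a tab-separated table with the two columns shown in the example above (Representative_Sequence and total); the function name is hypothetical, and only the first three rows of the example are used.

```python
import csv
import io


def total_sequences(count_table_text):
    """Sum the 'total' column of a tab-separated count table."""
    reader = csv.DictReader(io.StringIO(count_table_text), delimiter="\t")
    return sum(int(row["total"]) for row in reader)


# First three rows of the final.count_table example.
example = (
    "Representative_Sequence\ttotal\n"
    "GQY1XT001CFHYQ\t467\n"
    "GQY1XT001C44N8\t3677\n"
    "GQY1XT001C296C\t4652\n"
)
print(total_sequences(example))  # 8796
```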

Options

method

The methods available in mothur include opticlust (opti), average neighbor (average), furthest neighbor (furthest), nearest neighbor (nearest), Vsearch agc (agc), and Vsearch dgc (dgc). By default cluster() uses the opticlust algorithm; this can be changed with the method option.

mothur > cluster(column=final.dist, count=final.count_table, method=opti)

To obtain an average neighbor clustering of the data, use the method option. This produces the following output:

mothur > cluster(column=final.dist, count=final.count_table, method=average, cutoff=0.15)
unique	4652	2678	442	124	69	59	35	32	20	17	16	18	12	9	7	6	6	3	2	9	9	7	9	6	5	6	3	21
0.01	4964	1896	316	115	65	40	24	20	19	13	10	12	9	11	4	9	5	7	4	2	3	7	7	7	4	6	8	51
0.02	5451	986	184	84	49	28	17	14	17	7	10	7	9	3	6	2	6	4	2	1	3	2	1	3	3	2	3	01
0.03	6129	624	144	49	39	17	14	11	8	10	7	5	8	6	5	1	4	5	0	2	1	0	3	2	1	0	1	11
0.04	6159	411	108	44	22	11	9	6	7	7	8	5	7	3	4	1	4	3	2	1	1	1	2	1	2	0	1	11
0.05	7023	258	92	36	23	11	2	5	2	6	8	4	6	2	5	1	3	3	1	1	0	1	2	1	0	0	1	01
changed cutoff to 0.0533794

cutoff

With the opticlust method the list file is created for the cutoff you set. The default cutoff is 0.03. With the average neighbor, furthest neighbor and nearest neighbor methods the cutoff should be significantly higher than the desired distance in the list file. We suggest cutoff=0.20. This will provide a boost in speed and less RAM will be required than if you didn't set the cutoff for reading in the matrix. The cutoff can be set for the cluster command as follows:

mothur > cluster(column=final.dist, count=final.count_table, cutoff=0.05) 

iter	time	label	num_otus	cutoff	tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
0	0	0.05	3772	0.05	0	6918676	0	193430	0	1	0	0.972803	0	0.972803	0	0	
1	0	0.05	699	0.05	126630	6892041	26635	66800	0.654655	0.99615	0.826216	0.990401	0.173784	0.986863	0.729012	0.730498	
2	1	0.05	625	0.05	133759	6887952	30724	59671	0.691511	0.995559	0.813209	0.991411	0.186791	0.98729	0.743526	0.747439	
3	0	0.05	620	0.05	134033	6887845	30831	59397	0.692928	0.995544	0.812991	0.99145	0.187009	0.987313	0.744201	0.748173	
4	0	0.05	621	0.05	133952	6888036	30640	59478	0.692509	0.995571	0.813843	0.991439	0.186157	0.987329	0.744378	0.748289	
5	0	0.05	621	0.05	133952	6888036	30640	59478	0.692509	0.995571	0.813843	0.991439	0.186157	0.987329	0.744378	0.748289	
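
The memory saving from a cutoff can be pictured as follows: distances above the cutoff never need to be kept. This is only a sketch of the idea, not mothur's implementation. It assumes a column-formatted file with three whitespace-separated fields per line (two sequence names and a distance) and assumes distances equal to the cutoff are kept; the function name and sequence names are hypothetical.

```python
def read_column_distances(lines, cutoff):
    """Keep only the pairwise distances that do not exceed the cutoff."""
    kept = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        seq_a, seq_b, distance = fields[0], fields[1], float(fields[2])
        if distance <= cutoff:
            kept[(seq_a, seq_b)] = distance
    return kept


# Hypothetical distances: only the first two fall within a cutoff of 0.05.
example = ["seqA seqB 0.01", "seqA seqC 0.04", "seqB seqC 0.25"]
print(read_column_distances(example, cutoff=0.05))
```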

precision

If you want greater precision, there is a precision option in the cluster() command:

mothur > cluster(column=final.dist, count=final.count_table, method=average, precision=1000, cutoff=0.10)
unique	4652	2678	442	124	69	59	35	32	20	17	16	18	12	9	7	6	6	3	2	9	9	7	9	6	5	6	3	21
0.004	4964	2300	411	112	72	38	31	23	21	10	16	15	12	14	12	4	3	8	4	5	6	7	8	4	7	5	5	31
0.005	4964	2276	396	103	70	40	29	23	19	10	15	16	13	12	9	6	5	6	4	7	6	6	8	4	6	6	6	41
0.006	4964	2259	371	110	71	36	28	25	21	8	15	16	12	11	5	7	6	6	5	7	4	6	9	4	4	5	7	51
0.007	4964	2253	361	104	65	38	27	25	21	8	15	15	10	11	5	7	7	7	5	6	4	6	9	6	4	6	7	51
0.008	4964	1970	361	121	69	39	23	24	20	12	14	12	13	11	7	9	6	7	3	5	4	8	6	5	5	8	5	71
0.009	4964	1937	353	111	68	36	27	22	20	11	13	13	9	10	4	10	7	8	4	4	3	7	4	6	4	5	8	71
0.010	4964	1901	313	114	68	39	25	20	20	13	10	12	8	11	2	8	6	7	4	2	3	7	7	7	6	5	8	51
0.011	4983	1603	296	127	64	33	27	20	17	14	9	11	10	7	3	6	8	7	5	4	3	6	7	7	3	4	6	61
0.012	5011	1500	282	118	57	27	27	19	14	18	7	9	7	6	5	5	8	4	4	5	3	4	4	6	5	3	3	61
0.013	5371	1453	258	113	55	25	26	16	13	14	7	9	6	6	7	5	7	4	3	4	3	3	4	5	5	4	4	51
0.014	5380	1419	247	100	55	19	22	19	12	12	8	8	8	6	5	5	6	4	3	1	6	3	2	6	3	4	5	61
0.015	5447	1249	236	111	54	22	21	17	13	13	8	6	7	5	7	3	8	2	3	1	5	3	3	4	4	3	5	31
0.016	5450	1207	223	100	54	22	19	16	14	11	9	7	8	4	6	5	7	2	2	2	4	2	2	3	4	3	4	31
0.017	5450	1173	200	97	56	25	21	14	15	10	9	7	8	4	6	5	7	1	1	1	4	2	2	3	3	3	4	31
0.018	5450	1154	192	89	52	26	19	14	15	11	9	7	8	3	4	5	7	2	1	1	3	3	2	3	3	3	3	21
0.019	5450	1020	207	85	53	29	19	15	15	13	8	6	7	2	6	5	5	2	2	1	3	3	1	4	3	2	3	11
0.020	5451	987	186	80	49	29	16	16	16	9	9	7	8	3	7	2	6	4	2	1	3	2	1	3	3	2	3	01
0.021	5451	968	172	70	44	29	16	19	14	9	9	8	9	3	6	2	6	5	2	1	2	2	1	3	3	1	2	11
0.022	6114	884	185	69	39	29	13	18	14	9	8	7	9	4	5	2	7	5	2	2	1	1	2	3	3	1	2	11
0.023	6114	846	180	66	39	25	14	17	12	8	10	6	9	4	6	2	6	5	2	2	1	1	2	2	2	2	2	11
0.024	6114	817	166	69	41	23	11	16	13	10	10	5	8	4	5	0	8	6	2	2	2	1	2	2	2	1	3	11
0.025	6115	802	162	62	40	23	11	14	12	9	12	6	7	4	5	0	8	6	1	1	1	1	3	2	1	1	2	01
0.026	6115	744	168	62	37	20	13	11	11	10	11	7	7	5	3	1	5	6	0	2	2	1	3	2	1	1	2	01
0.027	6115	725	163	57	38	18	15	10	11	10	11	7	7	4	4	1	4	6	0	2	2	1	3	1	1	0	3	01
0.028	6115	705	146	52	40	17	15	12	12	9	8	7	9	3	5	1	4	5	0	2	2	1	2	2	1	0	2	01
0.029	6118	671	147	51	39	16	15	13	10	7	7	8	9	5	5	1	4	5	0	2	2	1	2	2	1	0	2	01
0.030	6129	622	148	50	39	17	14	12	9	8	6	7	9	6	5	1	4	5	0	2	1	0	3	2	1	0	1	11
0.031	6129	603	133	54	37	17	15	10	9	9	7	5	7	8	6	1	4	5	0	2	1	0	3	2	1	0	1	11
0.032	6130	583	127	46	38	15	16	7	9	8	7	6	6	6	6	2	4	5	0	2	0	1	3	2	1	0	0	11
0.033	6136	533	130	50	37	14	14	6	8	7	6	7	7	5	5	2	4	4	0	2	0	1	3	2	1	0	0	11
0.034	6153	525	125	47	35	15	13	6	8	8	6	7	7	4	3	2	4	4	0	2	1	1	3	2	1	0	0	11
changed cutoff to 0.0342406

Remember that the 16S rRNA gene is roughly 1,500 bp long, so it would seem silly to have a precision greater than 1,000. Just because you can calculate a number to 20 digits doesn't mean that all of them are significant.
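
The arithmetic behind this advice can be sketched as follows. A single mismatch over a 1,500 bp gene changes the distance by about 0.00067, which is close to the 0.001 step that precision=1000 provides. The rounding function is an assumption made for illustration (round to the nearest 1/precision); mothur's internal rounding may differ.

```python
import math


def round_to_precision(distance, precision):
    """Round a distance to the nearest multiple of 1/precision (assumed behavior)."""
    return math.floor(distance * precision + 0.5) / precision


gene_length = 1500                 # approximate length of the 16S rRNA gene in bp
one_mismatch = 1 / gene_length     # distance contributed by a single mismatch
print(one_mismatch)

# The same distance reported at two precisions.
print(round_to_precision(0.0342406, 1000))  # 0.034
print(round_to_precision(0.0342406, 100))   # 0.03
```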

sim

The sim parameter is used to indicate that your input file contains similarity values instead of distance values. The default is false. If sim=true, mothur will convert the similarity values to distances.
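
A minimal sketch of such a conversion is shown below. It assumes that similarities lie between 0 and 1 and that the distance is taken as 1 minus the similarity; both the formula and the function names are assumptions for illustration rather than a description of mothur's code.

```python
def similarity_to_distance(similarity):
    """Convert a similarity in [0, 1] to a distance (assumed: 1 - similarity)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    return 1.0 - similarity


def convert_column_line(line):
    """Convert one 'seqA seqB similarity' line to (seqA, seqB, distance)."""
    seq_a, seq_b, similarity = line.split()
    return seq_a, seq_b, similarity_to_distance(float(similarity))


# Hypothetical pair that is 97% similar, i.e. roughly 0.03 apart.
print(convert_column_line("seqA seqB 0.97"))
```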

Clustering with vsearch

The vsearch program is written by the vsearch team. You can now use vsearch clustering methods through mothur.

fasta

Vsearch requires a fasta file to cluster.

mothur > cluster(fasta=final.fasta, count=final.count_table, method=agc)

Vsearch methods

The available clustering methods are agc and dgc.

mothur > cluster(fasta=final.fasta, count=final.count_table, method=dgc)

Finer points

Missing distances

Perhaps the second most commonly asked question is why there isn't a line for distance 0.XX. If you look at the precision example above, the output jumps from unique to 0.004. Where are 0.001, 0.002, and 0.003? mothur only outputs data if the clustering has been updated for a distance. So if you don't have data at your favorite distance, that means that nothing changed between the previous distance in the output and your distance. Therefore, if you want OTU data for a distance of 0.003 in this case, you would use the data from unique.
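
The rule "use the closest preceding distance" can be written down as a small lookup. The sketch below assumes the labels are listed in increasing order and treats the unique label as distance 0; the function name and the label list are hypothetical.

```python
import bisect


def label_for_distance(labels, wanted):
    """Return the last label whose distance does not exceed the wanted distance."""
    values = [0.0 if label == "unique" else float(label) for label in labels]
    index = bisect.bisect_right(values, wanted) - 1
    if index < 0:
        raise ValueError("no label at or below the requested distance")
    return labels[index]


# Hypothetical set of labels with no line for 0.03.
labels = ["unique", "0.01", "0.02", "0.04"]
print(label_for_distance(labels, 0.03))   # 0.02
print(label_for_distance(labels, 0.005))  # unique
```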


Variability

You may notice that if you run the same command multiple times on the same data set, you might get slightly different output for some distances:

mothur > cluster(column=final.dist, count=final.count_table)
 
iter	time	label	num_otus	cutoff	tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
0	0	0.03	3772	0.03	0	7059666	0	52440	0	1	0	0.992627	0	0.992627	0	0	
1	0	0.03	1249	0.03	27469	7053405	6261	24971	0.523818	0.999113	0.814379	0.996472	0.185621	0.995609	0.651167	0.637554	
2	0	0.03	1178	0.03	30311	7052546	7120	22129	0.578013	0.998991	0.809783	0.996872	0.190217	0.995887	0.682234	0.674545	
3	0	0.03	1174	0.03	30877	7052133	7533	21563	0.588806	0.998933	0.803879	0.996952	0.196121	0.995909	0.686061	0.679736	
4	0	0.03	1172	0.03	31138	7051940	7726	21302	0.593783	0.998906	0.801204	0.996988	0.198796	0.995919	0.687808	0.682073	
5	0	0.03	1173	0.03	31237	7051921	7745	21203	0.595671	0.998903	0.801319	0.997002	0.198681	0.99593	0.688956	0.683358	
6	0	0.03	1173	0.03	31268	7051953	7713	21172	0.596262	0.998907	0.802134	0.997007	0.197866	0.995939	0.689655	0.684044	
7	0	0.03	1172	0.03	31336	7051924	7742	21104	0.597559	0.998903	0.801883	0.997016	0.198117	0.995944	0.6903	0.684805	
8	0	0.03	1172	0.03	31394	7051880	7786	21046	0.598665	0.998897	0.801276	0.997024	0.198724	0.995946	0.690677	0.685309	
9	0	0.03	1173	0.03	31367	7051922	7744	21073	0.59815	0.998903	0.801999	0.997021	0.198001	0.995948	0.690694	0.685236	


mothur > cluster(column=final.dist, count=final.count_table)
 
iter	time	label	num_otus	cutoff	tp	tn	fp	fn	sensitivity	specificity	ppv	npv	fdr	accuracy	mcc	f1score
0	0	0.03	3772	0.03	0	7059666	0	52440	0	1	0	0.992627	0	0.992627	0	0	
1	0	0.03	1250	0.03	29483	7051666	8000	22957	0.562223	0.998867	0.78657	0.996755	0.21343	0.995647	0.66296	0.655739	
2	0	0.03	1165	0.03	31938	7050687	8979	20502	0.609039	0.998728	0.780556	0.997101	0.219444	0.995855	0.687484	0.684212	
3	0	0.03	1167	0.03	32087	7050748	8918	20353	0.61188	0.998737	0.782514	0.997122	0.217486	0.995884	0.68997	0.686757	
4	0	0.03	1169	0.03	31986	7051007	8659	20454	0.609954	0.998773	0.78696	0.997108	0.21304	0.995907	0.690857	0.687243	
5	0	0.03	1168	0.03	31948	7051085	8581	20492	0.60923	0.998785	0.788275	0.997102	0.211725	0.995912	0.691029	0.687283	
6	0	0.03	1170	0.03	31966	7051075	8591	20474	0.609573	0.998783	0.788175	0.997105	0.211825	0.995913	0.69118	0.687463	
7	0	0.03	1170	0.03	31932	7051142	8524	20508	0.608924	0.998793	0.789302	0.9971	0.210698	0.995918	0.69131	0.687478	
8	0	0.03	1170	0.03	31905	7051185	8481	20535	0.60841	0.998799	0.790001	0.997096	0.209999	0.99592	0.691326	0.687415	

The variability is caused by the randomization of the order in which the sequences are processed.
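
To see how far apart two runs ended up, the last row of each table can be compared. The sketch below parses the tab-separated output and reports the difference in num_otus and mcc between the final iterations of the two runs shown above. The function name is hypothetical, and the trailing tab at the end of each output row is omitted here.

```python
import csv
import io


def final_iteration(output_text):
    """Return the last row of a tab-separated opticlust table as a dict."""
    rows = list(csv.DictReader(io.StringIO(output_text), delimiter="\t"))
    return rows[-1]


header = ("iter\ttime\tlabel\tnum_otus\tcutoff\ttp\ttn\tfp\tfn\tsensitivity\t"
          "specificity\tppv\tnpv\tfdr\taccuracy\tmcc\tf1score\n")
# Final rows of the first and second runs above.
run_a = header + ("9\t0\t0.03\t1173\t0.03\t31367\t7051922\t7744\t21073\t0.59815\t"
                  "0.998903\t0.801999\t0.997021\t0.198001\t0.995948\t0.690694\t0.685236\n")
run_b = header + ("8\t0\t0.03\t1170\t0.03\t31905\t7051185\t8481\t20535\t0.60841\t"
                  "0.998799\t0.790001\t0.997096\t0.209999\t0.99592\t0.691326\t0.687415\n")

last_a, last_b = final_iteration(run_a), final_iteration(run_b)
otu_difference = int(last_a["num_otus"]) - int(last_b["num_otus"])
mcc_difference = float(last_b["mcc"]) - float(last_a["mcc"])
print(otu_difference, round(mcc_difference, 6))
```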

Revisions

  • 1.27.0 - reduced memory by 50% and increased speed by 55%.
  • 1.28.0 - added count parameter
  • 1.34.0 - Bug Fix: nearest method caused crash.
  • 1.35.0 - Clustering commands did not include the count file info when printing list file OTU order. Only affects clustering commands. *.pick commands must preserve otuLabels order. - http://www.mothur.org/forum/viewtopic.php?f=3&t=3460&p=10483#p10483.
  • 1.37.0 - Adds vsearch clustering methods: agc and dgc. #169
  • 1.38.0 - Fixes bug with agc method.
  • 1.38.1 - Removes hard parameter.
  • 1.39.0 - Adds opticlust method.
  • 1.39.1 - Corrects printing issues with opticlust method.