
Rarefaction

Revision as of 14:38, 14 January 2009 by Westcott (Talk | contribs)


Validate output by making calculations by hand

Example Calculations

*.collect and *.rarefaction

These are the collector's curve and rarefaction curve data for the number of observed OTUs as a function of distance between sequences and the number of sequences sampled. This is merely a count of the number of OTUs observed at any given point in the sampling process.

In theory, the rarefaction curve should match the following expression:


<math>S_n = S_t - \left ( \frac{\sum_{i=1}^{S_t}{N - N_i \choose n} }{ {N \choose n} } \right ) </math>


where,

<math>S_n</math> = Average number of OTUs observed after drawing n individuals.

<math>S_t</math> = Total number of OTUs in a sample of N total individuals.

<math>N_i</math> = Number of individuals in the ith OTU.
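As a minimal sketch of how this expression can be evaluated by hand, the Python snippet below computes <math>S_n</math> from a list of OTU abundances. The function name `expected_otus` and the variable `amazon_003` are illustrative and are not part of mothur. The abundance list is the Amazonian example at a distance of 0.03 (75 singletons, 6 doubletons, 1 tripleton, and 2 quadrupletons).

```python
from math import comb


def expected_otus(abundances, n):
    """Expected number of OTUs in a random subsample of n individuals.

    abundances: list with one entry per OTU giving its number of individuals.
    Hypothetical helper for illustration; not a mothur function.
    """
    total = sum(abundances)   # N, total number of individuals
    s_t = len(abundances)     # S_t, total number of OTUs
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and N")
    # comb(N - N_i, n) counts the subsamples that miss OTU i entirely;
    # it is 0 when n > N - N_i, because the OTU can no longer be missed.
    missed = sum(comb(total - n_i, n) for n_i in abundances)
    return s_t - missed / comb(total, n)


# Amazonian dataset at distance 0.03: 98 sequences in 84 OTUs.
amazon_003 = [1] * 75 + [2] * 6 + [3] * 1 + [4] * 2

for n in (1, 2, 97, 98):
    print(n, round(expected_otus(amazon_003, n), 3))
```

For this abundance list the function gives 1.996 at n = 2 and 83.235 at n = 97, which can be checked by hand: at n = 97 only a singleton can be missed, so the result is 84 - 75/98.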

Below is a comparison of the mothur output and the theoretical values at a distance of 0.03 for the Amazonian dataset, which contains 98 sequences in 84 OTUs: 75 singletons, 6 doubletons, 1 tripleton, and 2 quadrupletons. The mothur output was obtained using 10,000 random iterations, and the absolute error between the two approaches is very small. Using fewer iterations increases the percent error, and using more iterations decreases it.

n 	Theory  DOTUR Diff. %Error  	n 	Theory  DOTUR   Diff. %Error
1 	1.000   1.000 0.000  0.000  	50 	45.620  45.634 -0.014 	0.031
2 	1.996   1.995 0.001  0.049  	51 	46.461  46.477 -0.016 	0.035
3 	2.987   2.986 0.001  0.024  	52 	47.299  47.317 -0.017 	0.037
4 	3.974   3.971 0.003  0.069  	53 	48.136  48.154 -0.019 	0.038
5 	4.956   4.954 0.002  0.045  	54 	48.970  48.995 -0.025 	0.051
6 	5.935   5.934 0.001  0.010  	55 	49.802  49.822 -0.019 	0.038
7 	6.909   6.909 0.000  0.002  	56 	50.633  50.647 -0.014 	0.028
8 	7.880   7.881 -0.001 0.014  	57 	51.461  51.478 -0.016 	0.032
9 	8.846   8.847 -0.001 0.013  	58 	52.288  52.305 -0.018 	0.034
10 	9.808   9.809 -0.001 0.011  	59 	53.112  53.126 -0.014 	0.027
11 	10.767 10.768 -0.001 0.012  	60 	53.935  53.946 -0.011 	0.020
12 	11.721 11.721  0.000 0.001  	61 	54.755  54.767 -0.012 	0.022
13 	12.672 12.675 -0.003 0.028  	62 	55.574  55.586 -0.012 	0.021
14 	13.619 13.623 -0.004 0.031  	63 	56.391  56.401 -0.009 	0.017
15 	14.562 14.566 -0.004 0.027  	64 	57.206  57.213 -0.007 	0.012
16 	15.502 15.510 -0.008 0.051  	65 	58.020  58.026 -0.006 	0.010
17 	16.438 16.446 -0.008 0.046  	66 	58.832  58.840 -0.008 	0.014
18 	17.371 17.378 -0.007 0.040  	67 	59.642  59.647 -0.005 	0.009
19    18.300 18.305 -0.005 0.027      68      60.450  60.448  0.002   0.003
20 	19.225 19.233 -0.007 0.037  	69 	61.256  61.260 -0.004   0.006
21 	20.148 20.158 -0.010 0.049  	70 	62.061  62.057  0.004   0.007
22 	21.066 21.079 -0.013 0.061  	71 	62.865  62.861  0.004   0.006
23 	21.982 21.995 -0.012 0.057  	72 	63.666  63.667  0.000   0.000
24 	22.894 22.904 -0.010 0.042  	73 	64.467  64.469 -0.003   0.004
25 	23.804 23.814 -0.010 0.042  	74 	65.265  65.273 -0.008   0.013
26 	24.710 24.719 -0.010 0.039  	75 	66.062  66.070 -0.008   0.012
27 	25.613 25.623 -0.011 0.041  	76 	66.857  66.862 -0.005   0.007
28 	26.512 26.521 -0.008 0.032  	77 	67.651  67.654 -0.002   0.004
30 	28.303 28.313 -0.010 0.034  	79 	69.235  69.241 -0.007   0.010
31 	29.194 29.204 -0.010 0.033  	80 	70.024  70.035 -0.011   0.015
32 	30.082 30.093 -0.010 0.034  	81 	70.812  70.824 -0.011   0.016
33 	30.967 30.984 -0.016 0.052  	82 	71.599  71.612 -0.013   0.018
34 	31.850 31.863 -0.013 0.041  	83 	72.384  72.391 -0.007   0.010
35 	32.729 32.745 -0.015 0.047  	84 	73.168  73.175 -0.007   0.009
36 	33.606 33.618 -0.012 0.035  	85 	73.950  73.958 -0.008   0.010
37 	34.481 34.498 -0.017 0.050  	86 	74.731  74.746 -0.015   0.020
38 	35.352 35.371 -0.019 0.052  	87 	75.511  75.530 -0.019   0.026
39 	36.221 36.241 -0.020 0.055  	88 	76.289  76.302 -0.013   0.017
40 	37.088 37.109 -0.021 0.056  	89 	77.066  77.081 -0.015   0.020
41 	37.952 37.973 -0.022 0.057  	90 	77.842  77.856 -0.014   0.018
42 	38.813 38.835 -0.021 0.055  	91 	78.616  78.619 -0.003   0.004
43 	39.672 39.687 -0.014 0.036  	92 	79.389  79.391 -0.002   0.003
44 	40.529 40.542 -0.013 0.032  	93 	80.161  80.161  0.000   0.000
45 	41.383 41.391 -0.008 0.020  	94 	80.931  80.930  0.002   0.002
46 	42.235 42.245 -0.010 0.023  	95 	81.700  81.700  0.000   0.000
47 	43.085 43.093 -0.009 0.020  	96 	82.468  82.473 -0.005   0.006
48 	43.932 43.945 -0.013 0.029  	97 	83.235  83.237 -0.003   0.003
49 	44.777 44.792 -0.015 0.034  	98 	84.000  84.000  0.000   0.000
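The comparison in the table above can be imitated with a small Monte Carlo sketch. This is not mothur's implementation: it simply assumes that each iteration draws n sequences at random without replacement and counts the distinct OTUs, then averages over iterations. The names `simulate_rarefaction` and `theoretical` are hypothetical, and the percent error is computed as |Theory - estimate| / Theory x 100, which is an assumption that agrees with the tabulated values (for example, 0.014 / 45.620 gives 0.031 at n = 50).

```python
import random
from math import comb


def simulate_rarefaction(abundances, n, iterations, seed=0):
    """Average OTU count over random subsamples of n individuals.

    Illustrative only; the sampling scheme is an assumption, not mothur's code.
    """
    rng = random.Random(seed)
    # One entry per individual, labelled with the index of its OTU.
    pool = [otu for otu, count in enumerate(abundances) for _ in range(count)]
    total = 0
    for _ in range(iterations):
        # rng.sample draws without replacement; the set keeps distinct OTUs.
        total += len(set(rng.sample(pool, n)))
    return total / iterations


def theoretical(abundances, n):
    """Analytic rarefaction value from the expression given earlier."""
    size = sum(abundances)
    missed = sum(comb(size - a, n) for a in abundances)
    return len(abundances) - missed / comb(size, n)


# Amazonian dataset at distance 0.03: 98 sequences in 84 OTUs.
amazon_003 = [1] * 75 + [2] * 6 + [3] * 1 + [4] * 2

theory = theoretical(amazon_003, 50)
for iterations in (100, 1000, 10000):
    estimate = simulate_rarefaction(amazon_003, 50, iterations)
    pct_error = abs(theory - estimate) / theory * 100
    print(iterations, round(estimate, 3), round(pct_error, 3))
```

Because the estimate is an average of random draws, its error tends to shrink as the number of iterations grows, although any single run with a given seed may not decrease monotonically.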

File Samples on the Amazonian Dataset

  • .collect

The first line contains the column labels. The first label, numsequences, gives the number of sequences sampled at each calculation. The sampling frequency was set to the default of 100, so the observed number of OTUs is calculated at each distance after every 100 sequences sampled, with a final calculation after all sequences have been sampled. Because this dataset contains only 98 sequences, the file has rows for 1 and 98 sequences only. The remaining labels in the first line are the distances at which the calculations were made. Each additional line starts with the number of sequences sampled, followed by the observed number of OTUs at each column's distance. For instance, at a distance of 0.03, 84.00 OTUs were observed after 98 sequences were sampled.

numsequences	unique	0.01	0.02	0.03	0.04	0.05	0.06	0.07	0.08	0.09	0.1
1	        1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
98	        96.00	93.00	89.00	84.00	81.00	73.00	68.00	66.00	59.00	57.00	55.00
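A file with this layout can be read with a few lines of Python. The sketch below is only an illustration: `read_collect` is a hypothetical helper, not part of mothur, and it assumes whitespace-delimited columns exactly as in the sample above.

```python
import io


def read_collect(handle):
    """Parse a .collect-style table into {numsequences: {label: value}}.

    Hypothetical helper; assumes whitespace-delimited columns.
    """
    header = handle.readline().split()
    labels = header[1:]  # 'unique', '0.01', '0.02', ...
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        values = [float(x) for x in fields[1:]]
        table[int(fields[0])] = dict(zip(labels, values))
    return table


# The two data rows shown above for the Amazonian dataset.
sample = io.StringIO(
    "numsequences\tunique\t0.01\t0.02\t0.03\t0.04\t0.05\t0.06\t0.07\t0.08\t0.09\t0.1\n"
    "1\t1.00\t1.00\t1.00\t1.00\t1.00\t1.00\t1.00\t1.00\t1.00\t1.00\t1.00\n"
    "98\t96.00\t93.00\t89.00\t84.00\t81.00\t73.00\t68.00\t66.00\t59.00\t57.00\t55.00\n"
)

curve = read_collect(sample)
print(curve[98]["0.03"])  # observed OTUs at distance 0.03 after 98 sequences
```

Keying the rows by the number of sequences sampled keeps lookups such as `curve[98]["0.03"]` close to how the file is described in the text.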


  • .rarefaction

The first line contains the column labels. The first label, numsampled, gives the number of sequences sampled at each calculation. The sampling frequency was set to 10, so the observed number of OTUs is calculated at each distance after every 10 sequences sampled, with a final calculation after all sequences have been sampled. The remaining labels in the first line are the distances at which the calculations were made, each followed by lci (the lower bound of the confidence interval) and hci (the upper bound of the confidence interval). Note: the entire file is not shown below. Each additional line starts with the number of sequences sampled, followed by the average observed number of OTUs at each column's distance and its confidence interval. For instance, at a distance of 0.01, after 80 sequences were sampled over the default of 1000 iterations, the average observed was 76.74, the lci was 74.65, and the hci was 78.84.

numsampled	0.01	lci	hci	0.02	lci	hci	0.03	lci	hci	
1		1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	
10		9.95	9.53	10.37	9.87	9.14	10.59	9.81	8.95	10.66	
20		19.79	18.90	20.68	19.47	18.07	20.86	19.22	17.52	20.92	
30		29.54	28.29	30.80	28.85	26.83	30.87	28.30	25.88	30.72	
40		39.20	37.64	40.77	38.01	35.55	40.47	37.10	34.23	39.97	
50		48.77	46.87	50.67	47.09	44.30	49.89	45.60	42.37	48.83	
60		58.23	56.14	60.31	56.03	53.05	59.01	53.91	50.52	57.30	
70		67.54	65.42	69.65	64.78	61.92	67.65	62.03	58.75	65.30	
80		76.74	74.65	78.84	73.49	71.06	75.91	69.96	67.00	72.92	
90		85.83	84.20	87.45	82.14	80.33	83.94	77.84	75.50	80.17	
98		93.00	93.00	93.00	89.00	89.00	89.00	84.00	84.00	84.00
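Because the lci and hci labels repeat for every distance, it helps to group the columns into triples when reading this file. The sketch below shows one way to do so. `read_rarefaction` is a hypothetical helper, not a mothur function, and it assumes whitespace-delimited columns arranged as (distance, lci, hci) after the first column, as in the sample above.

```python
import io


def read_rarefaction(handle):
    """Parse a .rarefaction-style table.

    Returns {numsampled: {distance_label: (mean, lci, hci)}}.
    Hypothetical helper; assumes (label, lci, hci) column triples.
    """
    header = handle.readline().split()
    labels = header[1::3]  # every third label after numsampled is a distance
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        values = [float(x) for x in fields[1:]]
        table[int(fields[0])] = {
            label: tuple(values[3 * i:3 * i + 3])
            for i, label in enumerate(labels)
        }
    return table


# Header and two rows copied from the sample above.
sample = io.StringIO(
    "numsampled\t0.01\tlci\thci\t0.02\tlci\thci\t0.03\tlci\thci\t\n"
    "80\t\t76.74\t74.65\t78.84\t73.49\t71.06\t75.91\t69.96\t67.00\t72.92\t\n"
    "98\t\t93.00\t93.00\t93.00\t89.00\t89.00\t89.00\t84.00\t84.00\t84.00\n"
)

rare = read_rarefaction(sample)
mean, lci, hci = rare[80]["0.01"]
print(mean, lci, hci)
```

Splitting on any whitespace, rather than strictly on tabs, tolerates the doubled and trailing tabs seen in the sample.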