Costello stool analysis

From mothur

Jump to: navigation, search

Recently, Costello and colleagues published a paper in Science where they sampled 9 people at four time points at 27 locations on their body. They also did some licking experiments, but we won't go there. Unfortunately, they used a somewhat odd sampling strategy where they sampled on days one and two and then on consecutive days three months later. I say "unfortunately" because this could have been a great longitudinal analysis if they would have sampled with a more even distribution. So it goes, I suppose. This paper comes out of Robin Knight's lab at the U of Colorado and his lab has an obvious bias towards "phylogenetic" approaches. Fortunately, for us, they posted their data on the Short Read Archive. In this tutorial, I'll use mothur to analyze the stool samples from 6 of these individuals (3 women and 3 men) showing how I would analyze the data using the methods available in mothur. Because the original sff file is not available, I've created mock fasta and qual files complete with the forward primer and the barcode. Feel free to download these files and follow along...


Contents

Getting sense of the data and preprocessing

The fasta file with all 24 libraries is found in the stool.fasta dataset. Let's go ahead and fire up mothur and get a sense of what the sequences look like using the summary.seqs command:

mothur > summary.seqs(fasta=stool.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1	183	183	0	3
2.5%-tile:	1	243	243	0	4
25%-tile:	1	259	259	0	5
Median: 	1	267	267	0	5
75%-tile:	1	274	274	0	5
97.5%-tile:	1	287	287	0	6
Maximum:	1	373	373	0	6
# of Seqs:	37126


There is also a file that contains the 24 barcodes that they used as well as the forward and reverse primer. Note that their approach is to amplify the V1-V2 region and sequence from the 3' end of the amplicon back towards the 27f primer. So we need to do a couple of things - i) remove the forward primer, ii) remove the barcodes, iii) remove the low quality bases, iv) create a groups file, and v) get the reverse complement of the sequences. Based on mock community experiments where known sequences are resequenced, we have found that 1 mismatch to the barcode and 2 to the primer do not adversely affect sequence quality. We have also found good correlation between the quality scores and sequencing errors between 30 and 35. There are several options for trimming sequences based on quality scores in the trim.seqs command, but the one we trust the most is to use a moving window that is 50 bp long and to require that the average quality score over the region not drop below 35. Once it does, we trim to the end of the last window with an average over 35. In mock community experiments we find that this drops the error rate by 10-fold over not trimming the sequences. In addition, we will remove any sequence where the longest homopolymer is greater than 8 nt and if it contains an ambiguous base call (i.e. an "N").

mothur > trim.seqs(fasta=stool.fasta, oligos=stool.oligos, qfile=stool.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50)

This will generate several files - stool.trim.fasta, stool.scrap.fasta, and stool.groups. We are interested in the *trim* and *groups file for our downstream processing. Let's see what the stool.trim.fasta and stool.scrap.fasta files look like:

mothur > summary.seqs(fasta=stool.trim.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1	7	7	0	2
2.5%-tile:	1	27	27	0	2
25%-tile:	1	187	187	0	5
Median: 	1	213	213	0	5
75%-tile:	1	220	220	0	5
97.5%-tile:	1	235	235	0	6
Maximum:	1	237	237	0	6
# of Seqs:	37124


There are some things to keep in mind with this analysis. Because this is a somewhat simulated dataset it has about 10-fold fewer sequences than you would normally find. Also, when you get the sequences off the sequencer, there are typically ambiguous bases and long homopolymer runs in your sequences and you will want to remove those sequence - Costello already did this for us.

Next, we want to simplify the dataset. It is likely that a number of the 37,124 sequences are redundant. To obtain a non-redundant set of sequences, run the unique.seqs command and generate stool.trim.unique.fasta and stool.trim.names:

mothur > unique.seqs(fasta=stool.trim.fasta)

mothur > summary.seqs(fasta=stool.trim.unique.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1	7	7	0	2
2.5%-tile:	1	67	67	0	3
25%-tile:	1	197	197	0	5
Median: 	1	213	213	0	5
75%-tile:	1	220	220	0	5
97.5%-tile:	1	235	235	0	6
Maximum:	1	237	237	0	6
# of Seqs:	15247


We see that we have basically reduced the number of sequences that we have to align, calculate distances for, and cluster by more than 2-fold. This is a significant reduction.

Next, let's use the SILVA-compatible alignment database and we will align the sequences using the align.seqs command and I'll make use of the two processors on my laptop (Macs rock!). I prefer the silva reference alignment, but if your computer has less than 2 GB of RAM you should probably stick with the greengenes reference alignment and tell your PI to order you some more RAM.

mothur > align.seqs(candidate=stool.trim.unique.fasta, template=silva.bacteria.fasta, processors=2)

Reading in the silva.bacteria.fasta template sequences...	DONE.
Aligning sequences...
It took 120 secs to align 15247 sequences.

That's right, it took 2 minutes to align 15,247 16S rRNA pyrotags. The alignment is stored as a fasta-formatted file called stool.trim.unique.align and a report file that describes the alignment is given in a file called stool.trim.unique.align.report. If you wanted to check the quality of the alignment you could do so using the align.check command. For now, we'll skip that step and you can come back and do it later. Next, we want to see what the aligned sequences look like:

mothur > summary.seqs(fasta=stool.trim.unique.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1215	3150	7	0	2
2.5%-tile:	1791	6333	67	0	3
25%-tile:	2049	6333	197	0	5
Median: 	2067	6333	213	0	5
75%-tile:	2524	6333	220	0	5
97.5%-tile:	5416	6334	235	0	6
Maximum:	43012	43026	237	0	6
# of Seqs:	15247

Based on this output, we want to maximize the number of sequences that overlap over the longest span. I'll require that all of the sequences end by position 6333 and be at least 150 bp in length. Using the screen.seqs command we will remove any sequence that does not fit these parameters from our fasta, name, and group files.

mothur > screen.seqs(fasta=stool.trim.unique.align, name=stool.trim.names, group=stool.groups, minlength=150, end=6333)

This produces several files - stool.trim.unique.good.align, stool.trim.unique.bad.align, stool.trim.good.names, stool.trim.bad.names, stool.good.groups, and stool.bad.groups. Running summary.seqs on the good and bad align files we get:

mothur > summary.seqs(fasta=stool.trim.unique.good.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1215	6333	155	0	3
2.5%-tile:	1789	6333	177	0	4
25%-tile:	2042	6333	207	0	5
Median: 	2062	6333	214	0	5
75%-tile:	2082	6333	222	0	5
97.5%-tile:	3135	6334	235	0	6
Maximum:	4072	6334	237	0	6
# of Seqs:	13548
 

Because this data was preprocessed before being deposited in the short read archive, many of hte bad reads have already been removed. But, typically you will find that some sequences do not actually map to the correct region of the alignment.

Next, we need to make sure that we are comparing the sequences over the same alignment coordinates and so we will use the filter.seqs command to remove any columns that only contain a gap character and any column that has a '.' in any sequence:

mothur > filter.seqs(fasta=stool.trim.unique.good.align, vertical=T, trump=., processors=2)

Length of filtered alignment: 281
Number of columns removed: 49719
Length of the original alignment: 50000
Number of sequences used to construct filter: 13548


This will generate the filter file (stool.filter) and the filtered fasta-formatted file (stool.trim.unique.good.filter.fasta). You should be able to see that the alignment length shrunk from 50000 positions to 259. That's a substantial reduction. Now let's see how long the sequences are:

mothur > summary.seqs(fasta=stool.trim.unique.good.filter.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1	280	123	0	3
2.5%-tile:	1	281	138	0	3
25%-tile:	1	281	140	0	4
Median: 	1	281	140	0	5
75%-tile:	1	281	141	0	5
97.5%-tile:	1	281	149	0	6
Maximum:	27	281	160	0	6
# of Seqs:	13548


I have found that trimming sequences so that they overlap over the same region generates new duplicate sequences that weren't detected the first go around. So let's re-run the unique.seqs command making sure to use the name option so that our previous names file is included:

mothur > unique.seqs(fasta=stool.trim.unique.good.filter.fasta, name=stool.trim.good.names)

mothur > summary.seqs(fasta=stool.trim.unique.good.filter.unique.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1	280	123	0	3
2.5%-tile:	1	281	138	0	3
25%-tile:	1	281	140	0	4
Median: 	1	281	140	0	5
75%-tile:	1	281	141	0	5
97.5%-tile:	1	281	153	0	6
Maximum:	27	281	160	0	6
# of Seqs:	7458


Now we are left with two new files - stool.trim.unique.good.filter.unique.fasta and stool.trim.unique.good.filter.names. We have reduced our dataset by about 25%. In a published paper called "Ironing out the wrinkles in the rare biosphere through improved OTU clustering" in Environmental Microbiology, Sue Huse, Mitch Sogin, and colleagues suggest using a preclustering step to reduce sequencing noise from pyrosequencing data. The basic idea is that abundant sequences are likely to generate sequences that are less abundant and differ from the dominant sequence type by about one base every 100 bases. We have implemented our version of this algorithm in the pre.cluster command. We will use it here...

mothur > pre.cluster(fasta=stool.trim.unique.good.filter.unique.fasta, name=stool.trim.unique.good.filter.names, diffs=1)
 
Total number of sequences before precluster was 7458.
pre.cluster removed 2608 sequences.

mothur > summary.seqs(fasta=stool.trim.unique.good.filter.unique.precluster.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	1	280	123	0	3
2.5%-tile:	1	281	138	0	3
25%-tile:	1	281	140	0	4
Median: 	1	281	140	0	5
75%-tile:	1	281	141	0	5
97.5%-tile:	1	281	152	0	6
Maximum:	27	281	160	0	6
# of Seqs:	4850


This makes it clear that we have merged 2,608 sequences with the more abundant sequence types and have reduced the dataset to 4,850 unique sequences.

Now we want to remove chimeric sequences from our analysis. Note: this is not as well-developed an area in the field of pyrosequencing analysis as we would hope for. Many programs (e.g. Bellerophon, Pintail, etc.) are available for identifying chimeras, but the either do not scale well or do a poor job. The Broad Institute has developed a program called ChimeraSlayer, which we have modified as chimera.slayer to be faster. It also calls more chimeras and makes more sense to us. The developers at the Broad institute suggest using their Gold sequence database. We provide a silva-compatible alignment of this sequence collection with the silva reference files.

Before we can run chimera.slayer, we need to apply the same filter to the reference alignment so we can use that to detect chimeras. So we need to use the hard option in filter.seqs:

mothur > filter.seqs(fasta=silva.gold.align, hard=stool.filter)

Length of filtered alignment: 281
Number of columns removed: 49719
Length of the original alignment: 50000
Number of sequences used to construct filter: 5181

Now we want to run chimera.slayer on stool.trim.unique.good.filter.unique.precluster.fasta:

mothur > chimera.slayer(fasta=stool.trim.unique.good.filter.unique.precluster.fasta, template=silva.gold.filter.fasta, minsnp=100, processors=2)

This will generate two files - stool.trim.unique.good.filter.unique.precluster.slayer.chimeras and stool.trim.unique.good.filter.unique.precluster.slayer.accnos. If you open stool.trim.unique.good.filter.unique.precluster.slayer.accnos you will see that about 1,715 sequences were predicted to be chimeric. This number may vary a bit from run to run since there is a randomization procedure involved. Now we want to use the remove.seqs command to remove the chimeric sequences from our orignial fasta, names, and groups files:

mothur > remove.seqs(accnos=stool.trim.unique.good.filter.unique.precluster.slayer.accnos, fasta=stool.trim.unique.good.filter.unique.precluster.fasta, name=stool.trim.unique.good.filter.unique.precluster.names, group=stool.good.groups, dups=T)

This command will generate stool.trim.unique.good.filter.unique.precluster.pick.fasta, stool.trim.unique.good.filter.unique.precluster.pick.names, and stool.good.pick.groups, respectively. If you compare stool.good.pick.groups and stool.good.groups you will notice that the total number of sequences being analyzed dropped from 30,657 to 26,813 or by about 13%. The Broad developers report that this is within the range of what is reasonable to expect. Although we are still processing 26,813 sequences, there are only 2,469 unique sequences.

Now we want to use our high quality sequences to generate a distance matrix. To do this we will use the dist.seqs command. Because I will be demonsrating how to do phylogenetic approaches later in this tutorial we want the full distance matrix

mothur > dist.seqs(fasta=stool.trim.unique.good.filter.unique.precluster.pick.fasta, output=lt, processors=2)

Alternatively, if you aren't a fan of that method, you would use the following command to save only those distances smaller than 0.20:

mothur > dist.seqs(fasta=stool.trim.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.20, processors=2)

Either command option will take about 10 seconds. We will work with the output of the first command called stool.trim.unique.good.filter.unique.precluster.pick.phylip.dist.

At this point we have four files that we will need for all of our downstream processing: stool.good.pick.groups , stool.trim.unique.good.filter.unique.precluster.pick.fasta, stool.trim.unique.good.filter.unique.precluster.pick.names, and stool.trim.unique.good.filter.unique.precluster.pick.phylip.dist. These are some pretty gangly file names. Let's rename these files to something simpler using mothur's system command:

mothur > system(cp stool.good.pick.groups stool.final.groups)

mothur > system(cp stool.trim.unique.good.filter.unique.precluster.pick.phylip.dist stool.final.dist)

mothur > system(cp stool.trim.unique.good.filter.unique.precluster.pick.names stool.final.names)

mothur > system(cp stool.trim.unique.good.filter.unique.precluster.pick.fasta stool.final.fasta)

Because distance files tend to be rather large, mothur does not store them in RAM. So if we want to use the distance matrix, we need to re-read it into RAM. We will do this with the read.dist command:

mothur > read.dist(phylip=stool.final.dist, name=stool.final.names, cutoff=0.25)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************


Note that if you were using a column-formatted matrix you would instead use the following command:

mothur > read.dist(column=stool.final.dist, name=stool.final.names)

OTU-based analysis

Now we are ready to assign our sequences to OTUs using the cluster command with the average neighbor algorithm:

mothur > cluster(method=average)

This command will take about a minute to run and will provide some output as it goes along. This output is also found in the stool.final.an.sabund file. In addition a stool.final.an. rabund and stool.final.an.list file are generated as well. If the distance matrix is larger than the amount of RAM your computer has, you can also use the hcluster command. This has a small memory footprint, but can take considerably longer to run than the cluster command.

mothur > hcluster(phylip=stool.final.dist, name=stool.final.names, cutoff=0.25, method=average)

The output should be more or less the same as you get with the cluster command.


Alpha diversity

Now that we have assigned all of the sequences to OTUs we are interested in calculating the coverage, richness, and diversity of each sample. The first step is to use our group file (stool.final.groups) to parse the list file (stool.final.an.list). I like to do my analysis at a cutoff level of 0.10, but you are free to use whatever you want. To do the parsing, we will use the read.otu command with the list and group options:

mothur > read.otu(list=stool.final.an.list, group=stool.final.groups, label=0.03)

This will generate 24 rabund files (one for each of the samples) and a shared file. As an example of performing alpha diversity measurements, let's use the F11Fcsw sample. Data for this sample can be found in the stool.final.an.F11Fcsw.rabund file. First, we need to read the rabund file in using the read.otu command:

mothur > read.otu(rabund=stool.final.an.F11Fcsw.rabund)

To generate a collector's curve of the chao richness estimator we will use the collect.single command. Also, because there are 704 sequences in this sample, let's only spit data out every 5 sequences:

mothur > collect.single(calc=chao, freq=5)

This command will generate a file called stool.final.an.F11Fcsw.chao, which can be plotted in your favorite graphing software package.

Next, we'd like to generate a rarefaction curve describing how much of the biodiversity Costello was able to sample. We'll do this with the rarefaction.single command:

mothur > rarefaction.single(freq=5)

This will generate a file called stool.final.an.F11Fcsw.rarefaction, which again can be plotted in your favorite graphing software package.

Finally, let's get a printout of the number of sequences, the sample coverage, the npshannon estimator, the simpson diversity estimate, the number of observed OTUs, and the chao richness estimate using the summary.single command:

mothur > summary.single(calc=nseqs-coverage-npshannon-simpson-sobs-chao)

These data will be outputted to a table in a file called stool.final.an.F11Fcsw.summary. Interestingly, the sample coverage was only about 87%; so it seems that they could have obtained a few more sequences from this sample. An alternative way to run these last several commands is by reading in the shared file as follows:

mothur > read.otu(shared=stool.final.an.shared)
mothur > collect.single(calc=chao, freq=5)
mothur > rarefaction.single(freq=5)
mothur > summary.single(calc=nseqs-coverage-npshannon-simpson-sobs-chao)

This will give you the Chao1 collector's curve, rarefaction curve, and summary files for all 24 samples.


Beta diversity measurements

Now we'd like to compare the membership and structure of the 24 stool communities. To get started we need to read in the stool.final.an.shared file using the read.otu command:

mothur > read.otu(shared=stool.final.an.shared)

Let's start by generating a heatmap of the relative abundance of each OTU across the 24 samples using the heatmap.bin command and log2 scaling the relative abundance values:

mothur > heatmap.bin(scale=log2) 

This will generate an SVG-formatted file that can be visualized in Safari or manipulated in graphics software such as Adobe Illustrator. Needless to say these heatmaps can be a bit of Rorshock. A legend can be found at the bottom left corner of the heat map.

Now let's calculate the similarity of the membership and structure found in the 24 communities and visualize those similarities in a heatmap with the Jaccard and thetayc coefficients. We will do this with the heatmap.sim command:

mothur > heatmap.sim(calc=jclass-thetayc)

The output will be in two SVG-formatted files called stool.final.an.0.03jclass.heatmap.sim.svg and stool.final.an.0.10thetayc.heatmap.sim.svg. In all of these heatmaps the red colors indicate communities that are more similar than those with black colors.

When generating Venn diagrams we are limited by the number of samples that we can analyze simultaneously. Let's take a look at the Venn diagrams for the first female subject using the venn command:

mothur > venn(groups=F11Fcsw-F12Fcsw-F13Fcsw-F14Fcsw)

This generates an interesting looking 4-way Venn diagram in the stool.final.fn.0.03.venn.sharedsobs.svg file. This shows that there were a total of 501 OTUs observed between the 4 time points. Only 47 of those OTUs were shared by all four time points. Let's use two non-parametric estimators to see what the predicted number of overlapping OTUs is for subject 1 using the summary.shared command:

mothur > summary.shared(calc=sharedchao, groups=F11Fcsw-F12Fcsw-F13Fcsw-F14Fcsw, all=T)

Opening the stool.final.an.sharedmultiple.summary file we see a prediction that female subject 1's core microbiome contained at least 75 OTUs.

Next, let's generate a dendrogram to describe the similarity of the samples to each other. We will generate a dendrogram using the jclass and thetayc calculators within the tree.shared command:

mothur > tree.shared(calc=thetayc-jclass)

This command generates two newick-formatted tree files - stool.final.an.thetayc.0.03.tre and stool.final.an.jclass.0.03.tre - that can be visualized in software like TreeView or FigTree. Inspection of the both trees shows that individuals' communities cluster with themselves to the exclusion of the others.


Characterizing the OTUs

Finally, we might like to know what types of organisms are represented by our OTUs. One method of doing this is to first get representative sequences from each of the OTUs using the get.oturep command as follows:

mothur > get.oturep(fasta=stool.final.fasta, phylip=stool.final.dist, list=stool.final.an.list, label=0.03)

Because there were a total of 1,024 OTUs across all 24 samples, we now have a file called stool.final.an.0.03.rep.fasta, which contains 1,024 sequences, which are representative sequences from each of those OTUs. We can now classify those sequences using the classify.seqs command. By default this uses the Bayesian classifier:

mothur > classify.seqs(fasta=stool.final.an.0.03.rep.fasta, taxonomy=silva.bacteria.rdp6.tax, template=nogap.bacteria.fasta)

Reading in the silva.bacteria.rdp6.tax taxonomy...	DONE.
Generating search database...    DONE.
It took 21 seconds generate search database. 
Calculating template taxonomy tree...     DONE.
Calculating template probabilities...     DONE.
It took 74 seconds get probabilities. 
Classifying sequences from stool.final.an.0.03.rep.fasta ...
Classifying sequence 100
Classifying sequence 200
Classifying sequence 300
Classifying sequence 400
Classifying sequence 500
Classifying sequence 600
Classifying sequence 700
Classifying sequence 800
Classifying sequence 900
Classifying sequence 1000

It took 8 secs to classify 1024 sequences.
It took 0 secs to create the summary file for 1024 sequences.

This will generate the files stool.final.an.0.03.rep.silva.bacteria.rdp6.taxonomy and stool.final.an.0.03.rep.silva.bacteria.rdp6.tax.summary. The *taxonomy file contains the taxonomy information for each sequence and the *summary file contains the taxonomic outline with the number of sequences in each group. For instance, in the M4 samples, the first OTU was shared in all of his time points and it was also the dominant OTU. Looking at the stool.final.an.0.03.rep.silva.bacteria.rdp6.taxonomy file we see the following for the representative sequence of this OTU:

F32Fcsw_279159|1|2101	Bacteria(100);Bacteroidetes(99);Bacteroidia(99);Bacteroidales(99);Bacteroidaceae(98);Bacteroides(98);

This sequence is a member of the Bacteroides genus. This makes biological sense as Bacteroides are known to be abundant in stool. One possible problem with this approach is that a single OTU may represent multiple taxonomic lineages and picking one sequence may not be as "representative" as we think. In the next section we will see another method of finding the taxonomy for our OTUs.


Phylotype-based analyses

[ a work in progress ]




Phylogenetic-based analyses

The so-called phylogenetic-based methods are based on a phylogenetic tree to calculate measures of alpha and beta-diversity. As mentioned in the preamble, these are the metrics that Costello and her colleagues employed in their analyses. In the original Costello manuscript and pretty much every other paper that uses these metrics a Lane mask is applied. Here, we'll repeat that analysis without using a Lane mask. First, we will use clearcut within the mothur environment. For now, you will need a copy of clearcut, which you can obtain from our clearcut wiki page. Run the following:

mothur > clearcut(phylip=stool.final.dist)

This command will produce a file called stool.final.tre.

Alpha diversity

When using phylogenetic methods, alpha diversity is calculated as the total of the unique branch length in the tree. This is done using the phylo.diversity command:

mothur > read.tree(tree=stool.final.tre, name=stool.final.names, group=stool.final.groups)
mothur > phylo.diversity()

This will generate a file called stool.final.1.phylo.diversity. Reading the last line of the file you can ascertain the amount of branch length that is contributed by each group. Ideally, this number would probably be divided by the number of sequences or OTUs in each group.


Beta diversity

The unifrac-based metrics are used to assess the similarity between two communities membership (unifrac.unweighted) and structure (unifrac.weighted). We will use these metrics and generate PCoA plots to compare our 24 stool samples.

mothur > read.tree(tree=stool.final.tre, name=stool.final.names, group=stool.final.groups)
mothur > unifrac.unweighted(distance=T, random=F)
mohtur > pcoa(phylip=stool.final.tre1.unweighted.dist)

This series of commands will generate two files that are useful for generating PCoA plots. The first, stool.final.tre1.unweighted.pca, contains the 24 samples in separate rows and each column represents a different dimension that can be used to describe the data. One could plot axes 1 and 2 to obtain a 2-dimensional representation of this very multi-dimensional data. As an aside, PCoA analysis can be done with any type of analysis involving a distance matrix. By the same token, the construction of dendrograms as described above using the tree.shared command could be done with output from the unifrac commands. The second file that is produced is stool.final.tre1.unweighted.pca.loadings. This file will indicate what percentage of the variance each axis represents. In this case, the first two axes represent 13.1 and 10.4% of the variance. Here we run the unifrac.weighted analysis using PCoA and tree.shared:

mothur > read.tree(tree=stool.final.tre, name=stool.final.names, group=stool.final.groups)
mothur > unifrac.weighted(distance=T, random=F)
mohtur > pcoa(phylip=stool.final.tre1.weighted.dist)

In this case, the first two axes from the unifrac.weighted analysis together represent more than 50% of the variance.


Conclusion

There are many other ways that one could analyze the data generated by Costello and her colleagues. I encourage you to go back and change the settings, use different calculators, come up with a hypothesis using the data and test it. If you think of an analysis that you wish mothur would do, please let us know and we'll see about adding it to the package. There is a certain "pipeline" aspect to this analysis; however, it is also very much an art of working with sequences. If you want to basically do everything that was described above, you can use the stool.batch file and use mothur in the batch mode as follows:

$ ./mothur stool.batch
Personal tools