We will be offering mothur and R workshops throughout 2019. Learn more.

Esophageal community analysis

From mothur
Revision as of 16:56, 31 July 2009 by Westcott (Talk | contribs) (Single-sample analyses)

Jump to: navigation, search

In this tutorial we will complete a fairly comprehensive esophageal community analysis using greengenes, ARB, and the tools available in mothur. To complete this analysis, you need to download the folder compressed in the Esophagus.zip archive. These sequences were originally published by Pei et al. (2004).

Getting started

We need to generate a distance matrix and phylogenetic tree to perform most of the analyses we would like to do in mothur. First, we need to generate an alignment of the 710 16S rRNA gene sequences. For this tutorial we will use mothur. Be sure that you are in the "Esophagus" folder and get a copy of your favorite alignment database and put it in the same folder as well. Go ahead and fire up mothur. First, we will align the sequences, with one processor this should take about 40 seconds:

mothur > align.seqs(candidate=esophagus.fasta, template=core_set_aligned.imputed.fasta)


mothur > align.seqs(candidate=esophagus.fasta, template=core_set_aligned.imputed.fasta, processors=2)

Next, we want to make sure that our sequences overlap over the same region of the 16S rRNA gene. First let's take a look at a summary basic statistics of our sequence collection:

mothur > summary.seqs(fasta=esophagus.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	140	4456	831	0	4
2.5%-tile:	187	4477	840	0	4
25%-tile:	187	4512	855	0	5
Median: 	187	4532	865	0	5
75%-tile:	187	4554	870	0	5
97.5%-tile:	204	4610	900	5	7
Maximum:	486	6720	1378	20	8
# of Seqs:	710

Being sticklers for good quality sequence data, we shouldn't be satisfied with sequences that have any ambiguous bases, much less 5! So let's remove any sequences that have more than 5 ambiguous bases. While we're at it, let's remove any sequences where the alignment doesn't start by position 204 or end by position 4456 using the screen.seqs command. These correspond to E. coli positions 67 and 873. We also want to remove the problematic sequences from the group file:

mothur > screen.seqs(fasta=esophagus.align, group=esophagus.groups, start=204, end=4456, maxambig=5)

Again summarizing the output of screen.seqs shows us:

mothur > summary.seqs(fasta=esophagus.good.align)

		Start	End	NBases	Ambigs	Polymer
Minimum:	140	4470	840	0	4
2.5%-tile:	187	4477	840	0	4
25%-tile:	187	4512	857	0	5
Median: 	187	4532	866	0	5
75%-tile:	187	4554	870	0	5
97.5%-tile:	189	4609	900	2	7
Maximum:	204	4664	907	5	8
# of Seqs:	677

We see that we've lost about 33 sequences from our dataset with that screen. Now we want to make sure that we are only considering the same region of the alignment (i.e. positions 204 through 4470). We also want to remove any columns from the alignment that only contain gap characters. To do this we will use the filter.seqs command:

mothur > filter.seqs(fasta=esophagus.good.align, trump=., vertical=T)

This will yield the following output:

Length of filtered alignment: 1085 Number of columns removed: 6597 Length of the original alignment: 7682 Number of sequences used to construct filter: 677

Now we need to generate a phylip-formatted distance matrix so we can run some of the hypothesis testing procedures. We will do this using the dist.seqs command:

mothur > dist.seqs(fasta=esophagus.good.filter.fasta, phylip=T)


mothur > dist.seqs(fasta=esophagus.good.filter.fasta, phylip=T, processors=2)

Copy esophagus.good.filter.phylip.dist to a new file named esophagus.dist.

OTU-based Analyses

We are now ready to analyze our sequence collection in mothur. You are encouraged to look through the mothur manual to get a better idea of the options available for each command. Go ahead and open mothur in your terminal. Load the distance matrix using the read.dist command. Our analysis will focus on OTUs defined at distances less than or equal to 0.10:

mothur > read.dist(phylip=esophagus.dist, cutoff=0.10)

Clustering sequences

We will now cluster the sequences using the furthest neighbor algorithm:

mothur > cluster()

This will generate three files: esophagus.fn.sabund, esophagus.fn.list, and esophagus.fn.rabund. The sequences in esophagus.fasta come from three patients. To parse esophagus.fn.list into these three groups you need to make use of the esophagus.groups file using the read.otu command. We will be interested in OTUs defined as a group of unique sequences and those that are no more than 0.03, 0.05 and 0.10 apart from each other:

mothur > read.otu(list=esophagus.fn.list, group=esophagus.good.groups, label=unique-0.03-0.05-0.10)

This command will generate four files including esophagus.fn.shared, esophagus.fn.B.list, esophagus.fn.D.list, and esophagus.fn.C.list.

Single-sample analyses

We would like to know the richness of each of these list files as well as the richness that is shared between the three communities. As an example, let's analyze patient B. First, read in the list file:

mothur > read.otu(list=esophagus.fn.B.list)

To generate data for a rarefaction curve use the rarefaction.single command:

mothur > rarefaction.single()

This will generate esophagus.fn.B.rarefaction.

To generate a collector's curve for the Chao1 estimator and non-parametric Shannon Index use the collect.single command:

mothur > collect.single(calc=chao-npshannon)

This will generate the files esophagus.fn.B.chao and esophagus.fn.B.np_shannon. By plotting these data one can evaluate whether the parameters are sensitive to sampling.

For a quick summary of various parameters for each patient you can execute the following commands:

mothur > read.otu(rabund=esophagus.fn.B.rabund)
mothur > summary.single()
mothur > read.otu(rabund=esophagus.fn.C.rabund)
mothur > summary.single()
mothur > read.otu(rabund=esophagus.fn.D.rabund)
mothur > summary.single()

These commands will result in the generation of three files: esophagus.fn.B.summary, esophagus.fn.C.summary, and esophagus.fn.D.summary. The predicted Chao1 richness of these three communities at a cutoff of 0.03 was 90, 46, and 100. To get a summary of the overall enter the following commands:

mothur > read.otu(list=esophagus.fn.list, label=unique-0.03-0.05-0.10)
mothur > summary.single()

This generates the file esophagus.fn.summary. The Chao1 richness for a cutoff of 0.03 was 136.

Multiple-sample analyses

Obviously many sequences are shared between the three patients. To estimate the overlap you can use collect.shared and collect.summary. Here we will read in the shared file and calculate the shared richness, Jaccard coefficient, and thetaYC value between the three patients:

mothur > read.otu(shared=esophagus.fn.shared)
mothur > summary.shared()

This generates the file esophagus.fn.shared.summary. Using a shared Chao estimator, communities B & C share 30 OTUs, B & D share 61, and C & D share 23. The Jaccard Index for the same comparisons were 0.33, 0.63, and 0.22 when using the richness estimates were used to calculate the Jaccard Index. The thetaYC values were 0.33, 0.61, and 0.11. The Jaccard Index measures similarity in community membership while thetaYC measures similarity in community structure.

As a method of visualizing the data you can run the venn command using the observed and estimated richness:

mothur > venn(label=0.03, calc=sharedsobs-sharedchao)

With this SVG file, you can manually scale the diagram so that the regions are proportional to the richness represented in the region. If you had more groups, it would be interesting to try out the tree.shared command.

Hypothesis testing approaches

We have previously shown [1] that it is not appropriate to apply every possible hypothesis testing approach. Pick one and stick with it. Otherwise you will run the risk of identifying differences as being statistically significant that are not.


To execute the libshuff command you will need to read in the distance matrix with the group file:

mothur > read.dist(phylip=esophagus.dist, group=esophagus.good.groups) 
mothur > libshuff()

This will generate the following output:

Comparison          	dCXYScore	Significance
B-C                 	0.00198108	<0.0001
C-B                 	0.00302058	<0.0001
B-D                 	0.00123778	<0.0001
D-B                 	0.00044060	0.0240
C-D                 	0.00553260	<0.0001
D-C                 	0.00155345	<0.0001

These significance values are all below the critical threshold (i.e. 0.05/6 = 0.0083333). Therefore, the three communities have different structures.

Parsimony test

Ideally, you would take the time to generate a "good" phylogeny of your sequences. We have provided a neighbor-joining tree generated from esophagus.dist, which is named esophagus.tree. First we will use the parsimony test. This test was previously implemented in the TreeClimber programmer and is now available as the parsimony command. First you need to read in the tree with the group file:

mothur > read.tree(tree=esophagus.tree, group=esophagus.good.groups)
mothur > parsimony()

This will generate the following output:

Tree#	Groups	ParsScore	ParsSig
1	B-C-D	127		<0.001

These results indicate that at least one of the three patients harbored a different community structure. Now we want to know which of the communities are different from the others:

mothur > parsimony(groups=all)

This will yield the following output:

Tree#	Groups	ParsScore	ParsSig
1	B-C	66		<0.001
1	B-D	56		<0.001
1	C-D	53		<0.001

Needless to say, the three patients harbored distinctively different community structures.

Weighted UniFrac Test

The weighted UniFrac test measures the fraction of a tree's branch length that can be ascribed to each community. A similar analysis is available through the Knight Lab website [2]. Similar to the parsimony analysis, you first need to read in a tree and group file:

mothur > read.tree(tree=esophagus.tree, group=esophagus.good.groups)
mothur > unifrac.weighted()

This will yield the following output:

Tree#	Groups	WScore		WSig
1	B-C	0.200186	<0.001
1	B-D	0.234404	<0.001
1	C-D	0.227453	<0.001

Again, all of these significance values are less than the critical threshold (i.e. 0.05/3 = 0.016667). Therefore the three communities are significantly different from each other.

Unweighted UniFrac Test

Similar to the weighted UniFrac test, the unweighted unifrac test measures the fraction of a tree's branch length that is unique to each group represented in the tree. A similar analysis is available through the Knight Lab website [3]. Similar to the parsimony analysis, you first need to read in a tree and group file:

mothur > read.tree(tree=esophagus.tree, group=esophagus.good.groups)
mothur > unifrac.unweighted()

This yields the following output:

Tree#	Groups	UWScore		UWSig
1	B-C-D	0.625019	<0.001

Again, at least one of the three communities harbors a significantly different community. To dissect the relationship between the communities the following command will look at the pairwise comparisons:

mothur > unifrac.unweighted(groups=all)
Tree#	Groups	UWScore		UWSig
1	B-C	0.638774	<0.001
1	B-D	0.622414	<0.001
1	C-D	0.679816	<0.001

Again, these three communities clearly harbor distinct community structures.