We will be offering an R workshop December 18-20, 2019. Learn more.

Dist.seqs

From mothur
Revision as of 20:47, 22 May 2009 by Pschloss (Talk | contribs) (New page: The pairwise distance command will calculate pairwise distances between aligned sequences. This approach is better than the commonly used DNADIST because the distances are not stored ...)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

The pairwise distance command will calculate pairwise distances between aligned sequences. This approach is better than the commonly used DNADIST because the distances are not stored in RAM, rather they are printed directly to a file. Furthermore, it is possible to ignore "large" distances that one might not be interested in. The command will generate a column-formatted distance matrix that is compatible with the column option in the read.dist() command. There are several options for how to handle gap comparisons and terminal gaps. This tutorial uses the data files in AmazonData.zip.


dist.seqs()

To run dist.seqs() an alignment file must be provided in either fasta, nexus, clustal, or phylip format. By default an gap is only penalized once, terminal gaps are penalized, all distances are calculated, and only one processor is used. Running the following command will generate amazon.unique.dist

mothur > dist.seqs(fasta=amazon.unique.align)


The calc option

mothur has several ways of calculating a distance based on how gaps are treated. First, the default option - onegap - only counts a string of gaps as a single gap. For example:

SequenceA  ATGCATGCATGC
SequenceB  ACGC---CATCC

Would have two mismatches and one gap. Therefore the distance would be 3/10 or 0.30. This is the distance calculating method employed by Sogin et al. (1995). The logic behind this type of penalty is that a gap represents an insertion and it is likely that a gap of any length represents a single insertion. This can be called within mothur by the command:

mothur > dist.seqs(fasta=amazon.unique.align, calc=onegap)


DNADIST actually ignores gaps. For example the two sequences above would have a distance of 2/9 or 0.2222. This type of distance calculation does not make much sense for sequences that are known to have a significant number of insertions. This can be used in mothur using the nogaps option. mothur can use this distance calculating method with the command:

mothur > dist.seqs(fasta=amazon.unique.align, calc=nogaps)


A final option is to penalize each gap. For the two sequences above, the distance would be 5/12 or 0.4167. This can be used in mothur using the eachgap option. mothur can use this distance calculating method with the command:

mothur > dist.seqs(fasta=amazon.unique.align, calc=eachgap)


The ends option

There is some discussion over whether to penalize gaps that occur at the end of sequences. For example, consider the following sequences:

Sequence1 ATGCATGCATGC
Sequence2 ---CAAGTA---

If terminal gaps are penalized (and we are using the calc=onegap option) then the distance would be 4/8 or 0.5000. If they are not penalized, then the distance would be 2/6 or 0.3333. Ideally, all sequences would be aligned over the same region; however, if this is not possible or desired for some reason the ends option can be employed.

mothur > dist.seqs(fasta=amazon.unique.align, ends=T)

The default is for ends to equal F. It is possible to combine the calc and ends options.


The cutoff option

If you know that you are not going to form OTUs with distances larger than 0.10, you can tell mothur to not save any distances larger than 0.10. This will significantly cut down on the amount of hard drive space required to store the matrix. This can be done as follows:

mothur > dist.seqs(fasta=amazon.unique.align, cutoff=0.10)

Without setting cutoff to 0.10 this command would have generated 4560 distances (i.e. 96x95/2 = 4560). With the cutoff only 46 distances are saved. The savings can be substantial when there are a large number of distances. The actual cutoff used by the command is 0.005 higher than the value that is set to allow for rounding in the clustering steps.