Read.dist
From mothur
The read.dist command will allow you to read a distance matrix into memory for downstream processing. There are a large number of ways to format a distance matrix. If mothur does not cover your preferred format, please either consider contributing your own read command or asking us to add it to the feature list of the next mothur version. The files that we discuss in this tutorial can be obtained by downloading the AmazonData.zip file and decompressing it.
Contents |
Default settings
Either a phylip-formatted distance matrix or a column-formatted distance matrix must be inputted for read.dist to be successful, the default output of the dist.seqs command is the column-format. If you have a favorite format, please let us know and we can work with you to incorporate that feature into mothur. Because the phylip format is so popular most software can generate this format for you.
Options
phylip
To read in a phylip-formatted distance matrix you need to use the phylip option:
mothur > read.dist(phylip=98_sq_phylip_amazon.dist)
or
mothur > read.dist(phylip=98_lt_phylip_amazon.dist)
Whereas dotur required you to indicate whether the matrix was square or lower-triangular, mothur is able to figure this out for you.
Once you execute the command, mothur reads in the matrix and generates a progress bar:
mothur > read.dist(phylip=98_lt_phylip_amazon.dist) *******************#****#****#****#****#****#****#****#****#****#****# Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||| **********************************************************************
At this point the distance matrix is stored into memory for downstream analyses.
column & name
To read in a column-formatted distance matrix you must provide a filename for the name option. The .names file was generated during the unique.seqs command.
mothur > read.dist(column=96_sq_column_amazon.dist, name=amazon.names)
or
mothur > read.dist(column=96_lt_column_amazon.dist, name=amazon.names)
Again, the column-formatted distance matrix can be square or lower-triangle and mothur will figure it out.
name
A names file contains two columns. The first column contains the name of a reference sequence that is in a distance matrix and the second column contains the names of the sequences (separated by commas) that the reference sequence represents. The list of names in the second column should always contain at least the reference sequence name.
There are several reasons to be interested in providing a name file with your distance matrix. First, as sequencing collections increase in size, the number of duplicate sequences is increasing. This is especially the case with sequences generated via pyrosequencing. Sogin and colleagues [1] found that less than 50% of their sequences were unique. Because the alignments and distances for the duplicate sequences are the same, re-processing each duplicate sequence takes a considerable amount of computing time and memory.
Example from amazon.names:
... U68616 U68616 U68617 U68617 U68618 U68618,U68620 U68619 U68619 U68621 U68621 ...
Second, if you pre-screen a clone library using ARDRA then you may only have a sequence for a handful of clones, but you know the number of times that you have seen a sequence like it. In such a case the second column of the names file would contain the sequence name as well as dummy sequence names
... AA1234 AA1234,AA1234.1,AA1234.2 AA1235 AA1235 AA1236 AA1236,AA1236.1 AA1237 AA1237,AA1237.1,AA1237.2,AA1237.3 AA1238 AA1238,AA1238.1 ...
A names file is not required (unless you are using the column= option), but depending on the data set to be analyzed, could significantly accelerate the processing time of downstream calculations. Although this is a simple example, the 98 sequence amazon data set has two pairs of duplicate sequences (U68618 and U68620) and (U68667 and U68641). The distance matrix in the file 96_lt_phylip_amazon.dist is a lower triangle matrix for the 96 unique sequences. While you could just read the matrix in and analyze the set of 96 unqiue sequences, this would give a considerably different analysis than if you used the entire 98 sequence data set. Considering the frequency of sequences is critical for pretty much every analysis in mothur, we want to use the name file to artificially inflate the matrix to its full size. In this case we use the namefile option:
mothur > read.dist(phylip=96_lt_phylip_amazon.dist, name=amazon.names)
mothur remembers that the distances for the reference sequence also apply to all of the sequences listed in the second column. Using a name file can considerably accelerate the amount of processing time required to analyze some data sets.
cutoff
An additional advantage to using a namefile is the ability to ignore certain pairwise distances that are large. For example, if you know that you are only interested in OTUs formed at a distance below 0.10, why keep a distance that is greater than 0.10? Because of rounding you might be interested in saving all of the distances smaller than 0.11. You could then generate a column-representation of a distance matrix using only those distances below 0.11. In addition to using a namefile, this step adds a significant boost to processing time. For example, the 96 unique sequences in the amazon dataset would require 4,560 distances to represent a lower triangle matrix, but if you only keep the distances below 0.11 as was done in the file 96_lt_column_11_amazon.dist, then there would only be 184 distances to save and process. That's a reduction of 96%, which can be substantial for analyzing large matrices for pyrosequencing analyses. In the future there will hopefully be distance matrix calculation tools that provide these parsed matrices. If your distance calculation software doesn't do this for you, you can have mothur do it for you using the cutoff option with any of the distance matrix reading commands. Each of these commands should provide roughly the same downstream results:
mothur > read.dist(phylip=98_lt_phylip_amazon.dist, cutoff=0.10)
mothur > read.dist(phylip=96_lt_phylip_amazon.dist, name=amazon.names)
mothur > read.dist(phylip=96_lt_phylip_amazon.dist, name=amazon.names, cutoff=0.10)
mothur > read.dist(column=96_lt_column_amazon.dist, name=amazon.names, cutoff=0.10)
mothur > read.dist(column=96_lt_column_11_amazon.dist, name=amazon.names)
hard
By default the cutoff parameter is set to cutoff + (5 / (precision * 10.0)). So if you set the cutoff to 0.03 with precision of 100, the cutoff would be set to 0.035. If you want a hard cutoff of 0.03, then you need to set the hard parameter to true.
mothur > read.dist(phylip=98_lt_phylip_amazon.dist, cutoff=0.03, hard=t)
precision
Another option that is available with each of the read commands is the ability to set the precision of the distances in the matrix. dnadist can provide up to 6 places right of the decimal point (10-6). While this may be an appropriate level of precision for comparing genomes, it is overkill for a gene that is 1500 bp long. The default option is a precision of 100 so that reporting of output values is rounded to the nearest 0.01. This is important for the read commands because if you set a precision of 100 and a cutoff of 0.10, then mothur will keep any distance less than 0.1049. The precision can be set using the precision flag.
mothur > read.dist(phylip=98_lt_phylip_amazon.dist, cutoff=0.10, precision=100)
sim
The sim parameter is used to indicate that your input file contains similarity values instead of distance values. The default is false, if sim=true then mothur will convert the similarity values to distances.
group
When using the libshuff command, you need to tell mothur what sequences belong to which groups. You can do this in the read.dist command using the phylip and group options. Previously, you would need to sort your distance matrix in a specific way and then tell s-libshuff how many groups you had and how many sequences there were in each group. This is no longer necessary:
mothur > read.dist(phylip=98_lt_phylip_amazon.dist, group=amazon.groups)
The groups option will not work with the column or names options.