Phylip-formatted distance matrix

Phylip [1] is a widely used collection of programs developed by Joseph Felsenstein at the University of Washington; it includes a tool called dnadist [2] for calculating distances between DNA sequences. The phylip format is a standard way of representing a distance matrix and can be generated by MEGA, ARB, and most other phylogenetics software. The first line of a phylip-formatted distance matrix gives the number of sequences described by the matrix. Each subsequent entry gives a sequence name followed by the pairwise distances from that sequence to the other sequences. These matrices come in two flavors: a square matrix and a lower-triangle matrix.

Square matrices

In the square matrix format, the entry for each sequence lists the pairwise distances between it and every sequence in the collection, including itself. If you were to look at the matrix, it would look like a square: the values along the diagonal from the top-left corner to the bottom-right corner are zeros, and the triangles above and below the diagonal mirror each other.

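The file 98_sq_phylip_amazon.dist, which can be obtained from the AmazonData.zip archive, is an example of a square matrix. The beginning of the file is shown here; note that the distances for each sequence wrap across several lines: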

  98
U68589      0.0000  0.3371  0.3610  0.4155  0.2873  0.2971  0.3922  0.3093
 0.3201  0.3286  0.2572  0.3372  0.3018  0.3071  0.4777  0.3648  0.3204
 0.3407  0.6434  0.4331  0.3762  0.3515  0.3733  0.3251  0.2890  0.3103
 0.3070  0.3086  0.3250  0.3681  0.3250  0.2597  0.4092  0.2733  0.3488
 0.3388  0.2899  0.4602  0.2873  0.3049  0.2936  0.3987  0.3235  0.3439
 0.2703  0.3286  0.3607  0.3176  0.3073  0.3182  0.3152  0.3345  0.2594
 0.5133  0.3212  0.3821  0.3390  0.3115  0.3411  0.3062  0.3437  0.3485
 0.2439  0.3016  0.3715  0.3771  0.4282  0.3436  0.3635  0.3206  0.3771
 0.3346  0.3287  0.4388  0.3095  0.3944  0.3345  0.3387  0.2283  0.3761
 0.3489  0.2828  0.2816  0.2895  0.4601  0.2903  0.2259  0.3539  0.3016
 0.3183  0.2369  0.3874  0.3297  0.3066  0.3654  0.3229  0.4123  0.3236
U68590      0.3371  0.0000  0.3783  0.3198  0.1690  0.3293  0.2732  0.3127
 0.2668  0.2063  0.3287  0.3411  0.3238  0.2204  0.3626  0.2744  0.2012
 0.3175  0.5300  0.3734  0.2947  0.2866  0.2931  0.2144  0.3049  0.2940
 0.2963  0.3189  0.2282  0.2156  0.2282  0.2853  0.3831  0.2878  0.2485
 0.3304  0.3313  0.3592  0.3143  0.2825  0.3082  0.2735  0.2652  0.2726
 0.3239  0.2165  0.2818  0.2416  0.2304  0.2282  0.2555  0.2197  0.3572
 0.5077  0.1781  0.2154  0.1337  0.2925  0.2802  0.2065  0.2254  0.3009
 0.2699  0.2161  0.2744  0.3121  0.3651  0.2441  0.2394  0.3227  0.3310
 0.3473  0.3072  0.4101  0.3002  0.2786  0.2197  0.3049  0.2742  0.3049
 0.3049  0.2788  0.1909  0.2788  0.3798  0.3109  0.2812  0.2159  0.3104
 0.2972  0.2872  0.3864  0.3073  0.1999  0.3135  0.2232  0.3135  0.3167
U68591      0.3610  0.3783  0.0000  0.4148  0.3362  0.3564  0.4275  0.3227
 0.3521  0.3450  0.3697  0.1228  0.3730  0.3696  0.4679  0.3771  0.3241
 0.3658  0.5488  0.4901  0.4507  0.3637  0.3966  0.3635  0.3113  0.3294
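
Because the distances for one sequence can wrap across several lines, a reader cannot assume one row per line. The following is a minimal Python sketch of one way to read such a file, not mothur's own implementation. It assumes that sequence names contain no whitespace, and the function name read_square_phylip is made up for this illustration. The toy input is a 3-sequence matrix built from the first three values of the rows shown above, not the real 98-sequence file.

```python
def read_square_phylip(text):
    """Parse a square phylip distance matrix held in a string.

    Assumes sequence names contain no whitespace, so the file can be
    treated as a flat stream of whitespace-separated tokens.
    """
    tokens = text.split()
    n = int(tokens[0])  # first token: number of sequences
    if len(tokens) != 1 + n * (n + 1):
        raise ValueError("token count does not match a square matrix")
    names, matrix = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])  # sequence name
        # the next n tokens are this sequence's distances,
        # no matter how many lines they were wrapped over
        matrix.append([float(t) for t in tokens[pos + 1 : pos + 1 + n]])
        pos += 1 + n
    return names, matrix


# Toy 3-sequence example; the first row is wrapped onto a second line.
example = """    3
U68589      0.0000  0.3371
 0.3610
U68590      0.3371  0.0000  0.3783
U68591      0.3610  0.3783  0.0000
"""

names, matrix = read_square_phylip(example)
n = len(names)
# A well-formed square matrix mirrors itself across a diagonal of zeros.
symmetric = all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
zero_diagonal = all(matrix[i][i] == 0.0 for i in range(n))
print(names, symmetric, zero_diagonal)
```

Reading by token count rather than by line is what lets the same function handle both wrapped and unwrapped files.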


Lower-triangle matrices

Alternatively, to save disk space, a lower-triangle representation is possible. In this case the first line again gives the number of sequences in the matrix. The difference is that each subsequent line gives only the distances between the sequence represented by that line and each sequence listed before it, so the first sequence has no distances at all. When you look at the matrix in a text editor, you'll notice that it contains only the distances below the diagonal of zeros that you would have in a square matrix.

The file 98_lt_phylip_amazon.dist is an example of a lower-triangle matrix. The beginning of the file is shown here:

98
U68589
U68590	0.3371
U68591	0.3609	0.3782
U68592	0.4155	0.3197	0.4148
U68593	0.2872	0.1690	0.3361	0.2842
U68594	0.2970	0.3293	0.3563	0.3325	0.2768
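
A lower-triangle file holds the same information as a square one, so it can be expanded back into a full symmetric matrix. The following Python sketch illustrates the idea; it is not mothur's own code. It assumes that sequence names contain no whitespace, and the function name read_lower_triangle_phylip is made up for this illustration. The toy input uses the first four rows shown above with the count changed to 4.

```python
def read_lower_triangle_phylip(text):
    """Parse a lower-triangle phylip matrix into a full symmetric matrix.

    Assumes sequence names contain no whitespace.
    """
    tokens = text.split()
    n = int(tokens[0])  # first token: number of sequences
    # n names plus 0 + 1 + ... + (n - 1) distances
    if len(tokens) != 1 + n + n * (n - 1) // 2:
        raise ValueError("token count does not match a lower-triangle matrix")
    names = []
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays at zero
    pos = 1
    for i in range(n):
        names.append(tokens[pos])
        for j in range(i):  # row i carries i distances
            d = float(tokens[pos + 1 + j])
            matrix[i][j] = d
            matrix[j][i] = d  # mirror into the upper triangle
        pos += 1 + i
    return names, matrix


# Toy 4-sequence example taken from the first rows of the excerpt.
example = """4
U68589
U68590\t0.3371
U68591\t0.3609\t0.3782
U68592\t0.4155\t0.3197\t0.4148
"""

names, matrix = read_lower_triangle_phylip(example)
for name, row in zip(names, matrix):
    print(name, row)
```

For n sequences the lower-triangle form stores n(n-1)/2 distances instead of n*n, which is where the space saving comes from.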

ARB exclusively generates lower-triangle matrices. Although we are unaware of any software that produces upper-triangle matrices, such a matrix would be the mirror of a lower-triangle matrix. In practice the distinction rarely matters: as long as mothur knows that you are providing a phylip-formatted matrix, it can identify the shape of the matrix.
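
How mothur identifies the shape is not described here. As an illustration only, one simple way to tell the two shapes apart is to count the whitespace-separated tokens after the sequence count: a square matrix of n sequences has n names plus n*n distances, while a lower-triangle matrix has n names plus n(n-1)/2 distances. The Python sketch below assumes that sequence names contain no whitespace; the function name phylip_shape is made up for this example.

```python
def phylip_shape(text):
    """Guess whether a phylip distance matrix is square or lower-triangle.

    Illustrative only; assumes sequence names contain no whitespace.
    """
    tokens = text.split()
    n = int(tokens[0])        # number of sequences
    body = len(tokens) - 1    # names and distances after the count
    if body == n * (n + 1):   # n names + n * n distances
        return "square"
    if body == n + n * (n - 1) // 2:  # n names + n(n-1)/2 distances
        return "lower-triangle"
    raise ValueError("token count matches neither shape")


square_example = """3
U68589  0.0000  0.3371  0.3610
U68590  0.3371  0.0000  0.3783
U68591  0.3610  0.3783  0.0000
"""

lower_example = """3
U68589
U68590  0.3371
U68591  0.3609  0.3782
"""

print(phylip_shape(square_example))
print(phylip_shape(lower_example))
```

For any n of at least 1 the two counts differ, so the test is unambiguous for well-formed files.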