
Summary.seqs


The summary.seqs command will summarize the quality of sequences in an unaligned or aligned fasta-formatted sequence file.

Default Setting

Fasta is the only required parameter:

mothur > summary.seqs(fasta=amazon.fasta)


		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	422	422	0	4	1
2.5%-tile:	1	436	436	0	4	3
25%-tile:	1	507	507	1	5	25
Median: 	1	530	530	3	5	50
75%-tile:	1	961	961	6	6	74
97.5%-tile:	1	973	973	15	8	96
Maximum:	1	978	978	20	9	98
Mean:	1	678.235	678.235	4.54082	5.44898
# of Seqs:	98


For this unaligned fasta file, every sequence starts at position 1 (as expected for unaligned data) and contains between 422 and 978 bases. The median length is 530 bases. At least 75% of the sequences have at least one ambiguous base, and at least one sequence has 20. The Polymer column reports the length of the longest homopolymer in each sequence: 95% of the sequences have a longest homopolymer between 4 and 8 bases long.
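As a rough illustration of what the NBases, Ambigs and Polymer values mean for a single unaligned sequence, here is a minimal Python sketch. It is not mothur's implementation: the function name `sequence_stats` is hypothetical, and treating every character other than A, C, G, T or U as ambiguous is an assumption made for the example.

```python
import re

# Assumption for this sketch: any character other than A, C, G, T or U is ambiguous.
AMBIGUOUS = re.compile(r"[^ACGTU]")

def sequence_stats(seq):
    """Return (nbases, ambigs, polymer) for one unaligned sequence."""
    seq = seq.upper()
    nbases = len(seq)
    ambigs = len(AMBIGUOUS.findall(seq))
    # Longest run of one repeated character, i.e. the longest homopolymer.
    runs = (len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    polymer = max(runs, default=0)
    return nbases, ambigs, polymer

print(sequence_stats("ACGTTTTNACGRA"))  # (13, 2, 4)
```

The example sequence has 13 bases, two ambiguous characters (N and R) and a longest homopolymer of four T's.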

An output file (e.g. amazon.fasta.summary) lists each of these parameters for each sequence. For example:

seqname	start	end	nbases	ambigs	polymer	numSeqs
U68589	1	943	943	10	5	1
U68590	1	497	497	0	6	1
U68591	1	930	930	1	4	1
...

So we see that sequence U68589 was 943 bases long, had 10 ambiguous bases in it and the longest homopolymer in the sequence was 5 bases long. The numSeqs column is 1 because no name file was provided so we assume this sequence is unique.
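If you want to work with the per-sequence summary outside of mothur, the file can be read with a few lines of Python. This is only a sketch: it assumes the file is whitespace-delimited with the header shown above, and `read_summary` is a hypothetical helper name. The sample text reuses the three rows from the example.

```python
import io

def read_summary(handle):
    """Yield one dict per sequence from a whitespace-delimited summary file."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or truncated lines
        row = dict(zip(header, fields))
        # Every column except the sequence name holds an integer.
        for key in header[1:]:
            row[key] = int(row[key])
        yield row

sample = (
    "seqname\tstart\tend\tnbases\tambigs\tpolymer\tnumSeqs\n"
    "U68589\t1\t943\t943\t10\t5\t1\n"
    "U68590\t1\t497\t497\t0\t6\t1\n"
    "U68591\t1\t930\t930\t1\t4\t1\n"
)
rows = list(read_summary(io.StringIO(sample)))
# Names of sequences with no ambiguous bases.
clean = [row["seqname"] for row in rows if row["ambigs"] == 0]
print(clean)  # ['U68590']
```

For a real file, replace the `io.StringIO(sample)` handle with `open("amazon.fasta.summary")`.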


If we had instead analyzed an aligned sequence file such as the greengenes alignment database:

mothur > summary.seqs(fasta=core_set_aligned.imputed.fasta)

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	69	6849	1423	0	4	1
2.5%-tile:	97	6849	1469	0	5	124
25%-tile:	109	6849	1507	0	5	1235
Median: 	109	6849	1524	0	6	2470
75%-tile:	109	6849	1538	0	6	3704
97.5%-tile:	109	6857	1563	3	8	4815
Maximum:	109	6885	1609	30	11	4938
Mean:	107.857	6849.47	1521.43	0.222155	5.67294
# of Seqs:	4938


Now we see that all of the sequences are at least 1,423 bases long, very few have any ambiguous bases, and most sequences start by position 109 and end at or after position 6,849. These data are useful when running the screen.seqs command to remove sequences that don't overlap or that have features indicating poor quality.
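To make the Start, End and NBases columns concrete for aligned data, the following Python sketch finds the first and last non-gap positions in one aligned sequence. It is an illustration under stated assumptions, not mothur's code: `aligned_span` is a hypothetical name, positions are taken to be 1-based, and both `.` and `-` are treated as gap characters.

```python
def aligned_span(aligned_seq, gap_chars=".-"):
    """Return (start, end, nbases) for one aligned sequence.

    start and end are the 1-based alignment columns of the first and
    last non-gap characters; nbases counts the non-gap characters.
    """
    positions = [
        i for i, ch in enumerate(aligned_seq, start=1) if ch not in gap_chars
    ]
    if not positions:
        return 0, 0, 0  # sequence contains only gaps
    return positions[0], positions[-1], len(positions)

print(aligned_span("..--ACG-TA-.."))  # (5, 10, 5)
```

Note that End minus Start can be much larger than NBases in an aligned file, as in the greengenes example where sequences of roughly 1,500 bases span several thousand alignment columns.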

name

You can also use a name file with the summary.seqs command. Without a name file, each sequence in the fasta file is counted once:

mothur > summary.seqs(fasta=stool.unique.fasta)
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	183	183	0	3	1
2.5%-tile:	1	242	242	0	4	507
25%-tile:	1	258	258	0	5	5061
Median: 	1	267	267	0	5	10121
75%-tile:	1	274	274	0	5	15181
97.5%-tile:	1	287	287	0	6	19735
Maximum:	1	373	373	0	6	20241
Mean:	1	265.647	265.647	0	4.9541
# of Seqs:	20241

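With a name file, the statistics are weighted by the number of sequences that each unique sequence represents: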

mothur > summary.seqs(fasta=stool.unique.fasta, name=stool.names)
		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	183	183	0	3	1
2.5%-tile:	1	243	243	0	4	929
25%-tile:	1	259	259	0	5	9282
Median: 	1	267	267	0	5	18564
75%-tile:	1	274	274	0	5	27845
97.5%-tile:	1	287	287	0	6	36198
Maximum:	1	373	373	0	6	37126
Mean:	1	265.768	265.768	0	4.96205
# of unique seqs:	20241
total # of seqs:	37126
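The difference between the two outputs comes from weighting each unique sequence by the number of sequences it represents. The Python sketch below shows that idea. It assumes the name file has two whitespace-separated columns, a representative name followed by a comma-separated list of the names it represents; the function names and the tiny data set are hypothetical, and the exact percentile method mothur uses is not reproduced here.

```python
import io

def read_name_counts(handle):
    """Map each representative sequence name to the number of names it represents."""
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        representative, members = fields
        counts[representative] = len(members.split(","))
    return counts

def weighted_values(values, counts):
    """Repeat each unique sequence's value once per represented sequence, sorted."""
    expanded = []
    for name, value in values.items():
        expanded.extend([value] * counts.get(name, 1))
    return sorted(expanded)

# Hypothetical example: seqA stands for three identical reads, seqD for one.
names = io.StringIO("seqA\tseqA,seqB,seqC\nseqD\tseqD\n")
counts = read_name_counts(names)
nbases = {"seqA": 250, "seqD": 300}

expanded = weighted_values(nbases, counts)
print(len(nbases), len(expanded))  # 2 4
print(expanded)                    # [250, 250, 250, 300]
```

With the weights applied, the length 250 appears three times, so summary statistics shift toward the abundant sequence, just as the percentiles and mean change slightly between the two stool examples.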

count

The count file is similar to the name file in that it is used to represent the number of duplicate sequences for a given representative sequence.

mothur > summary.seqs(fasta=stool.unique.fasta, count=stool.count_table)
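A count file can supply the same weights directly. The sketch below assumes a tab- or space-delimited count table whose header contains a `total` column after the representative sequence name; the helper name `read_count_totals`, the group names and the sample values are hypothetical, and other count table layouts are not handled.

```python
import io

def read_count_totals(handle):
    """Map each representative sequence name to its value in the 'total' column."""
    header = handle.readline().split()
    total_column = header.index("total")
    totals = {}
    for line in handle:
        fields = line.split()
        if len(fields) <= total_column:
            continue  # skip blank or truncated lines
        totals[fields[0]] = int(fields[total_column])
    return totals

# Hypothetical count table with two groups.
sample = io.StringIO(
    "Representative_Sequence\ttotal\tgroupA\tgroupB\n"
    "seqA\t3\t2\t1\n"
    "seqD\t1\t0\t1\n"
)
totals = read_count_totals(sample)
print(totals)                # {'seqA': 3, 'seqD': 1}
print(sum(totals.values()))  # 4
```

The sum of the totals corresponds to the "total # of seqs" line in the summary, and the number of rows to the "# of unique seqs" line.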

processors

The processors option lets you speed up the summary by using multiple processors, and you can use as many processors as your computer has. The command is parallelized for Windows users as well. If you are using the MPI-enabled version, processors is set to the number of processes you have running. For example:

mothur > summary.seqs(fasta=stool.unique.fasta, name=stool.names, processors=2)


Revisions

  • 1.22.0 - Added processors option for Windows users.
  • 1.22.0 - Added NumSeqs and mean values to output.
  • 1.28.0 - Added count option.
  • 1.31.0 - Bug Fix: 32-bit machines processing a file larger than 4G could hang.
  • 1.39.1 - Added file mismatch check.