Summary.seqs

The summary.seqs command summarizes the quality of the sequences in an unaligned or aligned fasta-formatted sequence file. Its only parameter, fasta, is required:

mothur > summary.seqs(fasta=amazon.fasta)


		Start	End	NBases	Ambigs	Polymer
Minimum:	1	422	422	0	4
2.5%-tile:	1	436	436	0	4
25%-tile:	1	507	507	1	5
Median:		1	530	530	3	5
75%-tile:	1	961	961	6	6
97.5%-tile:	1	973	973	15	8
Maximum:	1	978	978	20	9
# of Seqs:	98

For this unaligned fasta file, we see that all of the sequences started at position 1 (they're unaligned) and had between 422 and 978 bases. The median length was 530 bases. We can also see that at least 75% of the sequences had at least one ambiguous base and that at least one sequence had 20. The final column gives the length of the longest homopolymer in each sequence: in 95% of the sequences, the longest homopolymer was between 4 and 8 bases long.
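To make the NBases, Ambigs and Polymer columns concrete, here is a minimal Python sketch of how such values could be computed for a single unaligned sequence. It is not mothur's implementation. The function name sequence_stats is hypothetical, and the sketch assumes that any character other than A, C, G, T or U counts as ambiguous and that a homopolymer is any run of one repeated character; mothur's exact rules may differ.

```python
import re

def sequence_stats(seq):
    """Return (nbases, ambigs, polymer) for one unaligned sequence string."""
    seq = seq.upper()
    nbases = len(seq)
    # Assumption: anything other than A, C, G, T, U is an ambiguous base.
    ambigs = sum(1 for base in seq if base not in "ACGTU")
    # Length of the longest run of a single repeated character (0 if empty).
    runs = (len(match.group(0)) for match in re.finditer(r"(.)\1*", seq))
    polymer = max(runs, default=0)
    return nbases, ambigs, polymer

# A short made-up sequence: 13 bases, one N, longest run is TTTT.
print(sequence_stats("ACGTTTTNACGGA"))
```

Running the same function over every sequence in a file and sorting each column would give the kind of percentile table shown above.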

An output file (e.g. amazon.fasta.summary) lists each of these values for every sequence. For example:

seqname	start	end	nbases	ambigs	polymer
U68589	1	943	943	10	5
U68590	1	497	497	0	6
U68591	1	930	930	1	4
...

So we see that sequence U68589 was 943 bases long, had 10 ambiguous bases in it, and that its longest homopolymer was 5 bases long.
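Because the summary file is tab-delimited with a header row, it is easy to load into a script for further checks. The following Python sketch assumes the column names shown in the example above and that every column other than seqname holds an integer; the function name read_summary is hypothetical. The three rows are copied from the example listing.

```python
import csv
import io

def read_summary(handle):
    """Parse a tab-delimited summary into a dict keyed by sequence name."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = {}
    for row in reader:
        name = row["seqname"]
        # Assumption: all remaining columns are integers.
        rows[name] = {key: int(value) for key, value in row.items()
                      if key != "seqname"}
    return rows

example = (
    "seqname\tstart\tend\tnbases\tambigs\tpolymer\n"
    "U68589\t1\t943\t943\t10\t5\n"
    "U68590\t1\t497\t497\t0\t6\n"
    "U68591\t1\t930\t930\t1\t4\n"
)
summary = read_summary(io.StringIO(example))
print(summary["U68589"]["ambigs"])
```

For a real file, the same function can be called with an open file handle, e.g. read_summary(open("amazon.fasta.summary")).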


If we had instead analyzed an aligned sequence file, such as the greengenes alignment database, we would see:

mothur > summary.seqs(fasta=core_set_aligned.imputed.fasta)

		Start	End	NBases	Ambigs	Polymer
Minimum:	69	6849	1423	0	4
2.5%-tile:	97	6849	1469	0	5
25%-tile:	109	6849	1507	0	5
Median: 	109	6849	1524	0	6
75%-tile:	109	6849	1538	0	6
97.5%-tile:	109	6857	1563	3	8
Maximum:	109	6885	1609	30	11


Now we see that all of the sequences are at least 1,423 bases long, very few have any ambiguous bases, and every sequence starts by position 109 and ends at or after position 6,849. These data are useful for identifying sequences that don't overlap or that show signs of poor quality, which can then be removed with the screen.seqs command.
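The overlap check described here can be sketched in Python as a simple filter over per-sequence start, end and ambigs values. This only illustrates the idea and is not how screen.seqs is implemented. The function name keep_overlapping and the three records are made up for the example; the thresholds 109 and 6849 are taken from the table above.

```python
def keep_overlapping(rows, start, end, max_ambigs=0):
    """Return names of sequences that span [start, end] with few ambiguous bases."""
    kept = []
    for name, values in rows.items():
        spans_region = values["start"] <= start and values["end"] >= end
        if spans_region and values["ambigs"] <= max_ambigs:
            kept.append(name)
    return kept

# Hypothetical records, not taken from any real summary file.
rows = {
    "seqA": {"start": 109, "end": 6849, "ambigs": 0},
    "seqB": {"start": 109, "end": 6500, "ambigs": 0},  # ends too early
    "seqC": {"start": 97, "end": 6857, "ambigs": 3},   # too many ambiguous bases
}
print(keep_overlapping(rows, start=109, end=6849))
```

Only seqA passes: seqB does not reach position 6849, and seqC exceeds the allowed number of ambiguous bases.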
