Sorry for the delayed response @Nicholas_Bokulich - I was traveling to and from a conference. It was a great opportunity to talk a bit about the new tools in QIIME2 with other users, and now I’m back to thinking about my reference database construction. I really appreciate your comments above and have already employed the exclude-seqs argument with a bat reference database (to filter out bat sequences). However, I’m now thinking about the arthropod-only database. This comes from BOLD, but I’ve been applying a few other filtering criteria for BOLD.
Perhaps this is a question QIIME already has an answer for: I’m wondering how to preserve as much taxonomic information as possible for a sequence that appears in redundant records. For instance, if I run qiime vsearch dereplicate-sequences (info here), it’s not clear to me how to retain the record (sequence + taxonomic info in header) with the most complete taxonomic information, or what would happen if two identical sequences carried equally complete but distinct taxonomies.
A post back in April on the developers forum here suggested that the --p-derep-prefix option can resolve issues with pseudo-replication of sequence variants, but it’s my understanding that this doesn’t address taxonomic information at all.
Consider the information in the following two files:
>seq1
AATTCCGG
>seq2
AATTCCGG
>seq3
TAGTAGTA
>seq4
TAGTAGTA

seq1 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
seq2 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
seq3 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis
seq4 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus
We can see from refseqs.fasta that there are two pairs of identical sequences (seq1 and seq2, as well as seq3 and seq4). If I were to dereplicate these data, in a perfect world, the outcome would be:
- seq1 and seq2 are collapsed into a single representative sequence. However, because seq2 contains the more complete taxonomic information and no other identical sequence contains any opposing information at equivalent levels, I’d preserve the full taxonomic identity of seq2.
  - My concern is that if the program simply retains the first taxonomic entry, then it will lose out on the potential information contained in redundant sequences.
- seq3 and seq4 contain identical sequences, and both contain full taxonomic records through to the species level. However, they disagree at both the species and genus ranks. In this case, I’d like the dereplicated record to keep only the ranks down to their least common ancestor - the family in this case.
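To make the two rules above concrete, here is a minimal Python sketch of one possible consensus rule. It is not a QIIME 2 or vsearch feature, and the names `parse_taxonomy` and `consensus_taxonomy` are hypothetical. It assumes seven-rank lineage strings with `k__` to `s__` prefixes, treats an empty rank as "no information" rather than as a disagreement, and blanks every rank from the first real conflict downward.

```python
RANKS = ["k", "p", "c", "o", "f", "g", "s"]


def parse_taxonomy(tax):
    """Split 'k__A;p__B;...' into a list of names, one per rank ('' if empty)."""
    names = []
    for field in tax.split(";"):
        _prefix, _sep, name = field.strip().partition("__")
        names.append(name.strip())
    return names


def consensus_taxonomy(taxonomies):
    """Merge the lineages of identical sequences (hypothetical helper).

    An empty rank never counts as a conflict, so a more complete lineage
    fills the gaps of a less complete one. Once two records disagree at a
    rank, that rank and all lower ranks are left empty.
    """
    parsed = [parse_taxonomy(t) for t in taxonomies]
    consensus = []
    conflict = False
    for i, rank in enumerate(RANKS):
        # Distinct non-empty names observed at this rank.
        names = {p[i] for p in parsed if i < len(p) and p[i]}
        if len(names) > 1:
            conflict = True
        if conflict or not names:
            consensus.append(f"{rank}__")
        else:
            consensus.append(f"{rank}__{next(iter(names))}")
    return ";".join(consensus)


lepidoptera = consensus_taxonomy([
    "k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__",
])
coleoptera = consensus_taxonomy([
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus",
])
print(lepidoptera)
print(coleoptera)
```

With the four example records, the first call keeps the Oecophoridae/Chezala lineage and the second stops at f__Carabidae. A stricter rule (for example, requiring a majority rather than unanimity) would need a different conflict test.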
Thus the desired output for a dereplicated fasta would be:
>derep-seq1
AATTCCGG
>derep-seq2
TAGTAGTA
and the associated dereplicated taxonomic file would be:
derep-seq1 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
derep-seq2 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__;s__
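For the sequence side of that output, a rough sketch of the grouping step in plain Python might look like the following. It is only an illustration under stated assumptions: `read_fasta` and `dereplicate` are hypothetical helper names, matching is exact and full-length (not prefix-based), and the `derep-seqN` labels simply follow the order in which each distinct sequence first appears. The point is that keeping the list of member IDs for each group leaves room to merge their taxonomy strings afterwards with whatever rule is preferred.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (id, sequence) tuples."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def dereplicate(records):
    """Group identical sequences, remembering which IDs were merged."""
    groups = {}  # sequence -> list of original IDs, in first-seen order
    for seq_id, seq in records:
        groups.setdefault(seq, []).append(seq_id)
    return [
        (f"derep-seq{n}", seq, members)
        for n, (seq, members) in enumerate(groups.items(), start=1)
    ]


refseqs = """>seq1
AATTCCGG
>seq2
AATTCCGG
>seq3
TAGTAGTA
>seq4
TAGTAGTA
"""

derep = dereplicate(read_fasta(refseqs))
for new_id, seq, members in derep:
    print(f">{new_id} members={','.join(members)}")
    print(seq)
```

Each tuple carries the new ID, the sequence, and the original IDs it replaces, so the member list can be used to look up and reconcile the corresponding rows of the taxonomy file.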
Hopefully that makes sense. I’m sure you smart microbiologists have been thinking about these things already. Unfortunately the QIIME docs I’ve come across are sparse on considerations when building your own database - I can see why, after starting to have to deal with the web of complications I’m hitting now!
Thanks for your help,