Filtering sequences without taxonomic annotation

Continuing the discussion from MOCK sample and PCR blanks:

I'm also interested in removing sequences from a seq.qza file after successfully removing samples from its corresponding table.qza file using metadata-based filtering. I see that it may be possible to do this using taxonomy-based filtering, but I was wondering whether there is another way of achieving this without having to complete the taxonomic annotation process?
Nsa

Yes! You could use q2-quality-control to exclude sequences that don't match your reference sequences at some percent identity (or alternatively, to find seqs that match a set of "bad seqs" that you want to filter out, e.g., known contaminants or non-target DNA, with a high degree of similarity).

That is about the best you can do to remove specific contaminants (or include only certain clades) without first assigning taxonomy. You need to have some criteria for actually identifying the sequences to remove. If your concern is that assigning taxonomy to all sequences will take way too much time, keep in mind that this reference-matching step will still add time to your workflow.
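As a rough sketch of the q2-quality-control approach described above (not a tested recipe for any particular dataset): the file names are placeholders, reference-seqs.qza is a hypothetical reference or "bad seqs" artifact, and the 0.97 identity threshold is only illustrative. The guard simply skips the call when qiime or the input artifacts are not available.

```shell
# Placeholders: swap in your own artifacts and threshold.
QUERY=sequences.qza
REFERENCE=reference-seqs.qza   # hypothetical reference (or "bad seqs") artifact
PERC_IDENTITY=0.97             # illustrative threshold, not a recommendation

if command -v qiime >/dev/null 2>&1 && [ -f "$QUERY" ] && [ -f "$REFERENCE" ]; then
    # Split the query seqs into those that match the reference (hits)
    # and those that do not (misses).
    qiime quality-control exclude-seqs \
        --i-query-sequences "$QUERY" \
        --i-reference-sequences "$REFERENCE" \
        --p-method vsearch \
        --p-perc-identity "$PERC_IDENTITY" \
        --o-sequence-hits hits.qza \
        --o-sequence-misses misses.qza
else
    echo "qiime or input artifacts not found; skipping exclude-seqs"
fi
```

If the reference is your target database, hits.qza holds the seqs to keep; if the reference is a set of contaminants, misses.qza is the one to carry forward.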

Another option to build this into your workflow in a streamlined way would be to use deblur (which essentially performs this same step as a pre-filter to remove non-target DNA) for denoising your sequences. But that probably doesn't help you in your current analysis.

I hope that helps!

Thank you for getting back to me, @Nicholas_Bokulich. I should have been clearer about what I am trying to do, which is to remove the samples (with their corresponding sequences) that have been successfully filtered out of a table.qza file using the --p-where parameter of the feature-table filter-samples command. In this case, I don't think that would do the job. It may take out sequences from samples that were retained in the table.qza file as well as from those that were not (and not necessarily remove the samples that I am interested in filtering out), if these sequences belong to bacteria found across both the filtered-out and the retained samples. Does this make sense? Nevertheless, I think that this suggestion will work for removing sequences that were, say, found in our no-template controls, which is something I will be looking into next.
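For anyone following along, the metadata-based sample filtering mentioned here might look roughly like the sketch below. The metadata file name, the sample-type column and its value are all hypothetical; only the filter-samples options themselves come from QIIME 2.

```shell
# Placeholders: the metadata file, column name and value are hypothetical.
TABLE=table.qza
METADATA=sample-metadata.tsv
WHERE="[sample-type]='sample'"   # keep only samples whose sample-type is 'sample'

if command -v qiime >/dev/null 2>&1 && [ -f "$TABLE" ] && [ -f "$METADATA" ]; then
    # Retain only the samples whose metadata satisfy the --p-where clause.
    qiime feature-table filter-samples \
        --i-table "$TABLE" \
        --m-metadata-file "$METADATA" \
        --p-where "$WHERE" \
        --o-filtered-table filtered-table.qza
else
    echo "qiime or input files not found; skipping filter-samples"
fi
```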

In the meantime, is it even necessary to try to remove seqs from the repseqs.qza file? I haven't tried this yet, but I think that passing the repseqs.qza file as is, along with a filtered table.qza file, to downstream analyses should compute stats (e.g. matrices) for only the seqs with corresponding IDs in the table.qza file, correct?

Sorry, mea culpa, I wasn't reading closely enough the first time around. Got it now.

That's easy. You can filter a sequence artifact to match only features found in table X by using feature-table filter-seqs. Just do something like:

qiime feature-table filter-seqs \
    --i-sequences sequences.qza \
    --i-table filtered-table.qza \
    --o-filtered-sequences filtered-seqs.qza

That should do the trick.

In theory, no, you do not really need to filter the seqs. For all downstream statistical analyses that I can think of, you don't even use the sequences file itself, so having extra seqs does not really matter.

Where it does matter is for steps that do take a sequence file: taxonomy classification (which you've already done), alignment/tree building (which you've probably already done, too), and some other non-core analyses (like some methods in q2-quality-control). Having these extra sequences will just slow down those steps, that's all. So probably not important, but sometimes it's best to cut out dead wood and not have to worry about the details (at least when the method exists and is easy).

I hope that helps!

Thank you for getting back to me, @Nicholas_Bokulich. The feature-table filter-seqs command was what I needed, and it worked! :blush:

P.S. just in case someone finds and proceeds to use this:

In the command above, --i-sequences should be --i-data. It took me a few tries to figure out what I was doing wrong.
Cheers,
Nsa
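With that correction applied, the command would look something like the sketch below. The file names are the ones used earlier in the thread. Note that, as far as I know, the output option in current QIIME 2 releases is --o-filtered-data rather than --o-filtered-sequences; it is worth confirming against qiime feature-table filter-seqs --help for your installed version.

```shell
# File names follow the earlier example in this thread.
SEQS=sequences.qza
TABLE=filtered-table.qza
OUT=filtered-seqs.qza

if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ] && [ -f "$TABLE" ]; then
    # Keep only the sequences whose feature IDs are present in the filtered table.
    qiime feature-table filter-seqs \
        --i-data "$SEQS" \
        --i-table "$TABLE" \
        --o-filtered-data "$OUT"
else
    echo "qiime or input artifacts not found; nothing was filtered"
fi
```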
