Filtering host sequence from rep_seq.qza

Hi
Thank you for the detailed post on removing host sequences from the feature table. I followed it up to this step:

qiime quality-control exclude-seqs --i-query-sequences rep-seqs.qza --i-reference-sequences gg_13_5_otu_97.qza --p-method vsearch --p-perc-identity 0.97 --p-perc-query-aligned 0.95 --p-threads 4 --o-sequence-hits hits.qza --o-sequence-misses misses.qza --verbose

I got the following output:

142290491 nt in 99322 seqs, min 1254, max 2353, avg 1433
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 1234 of 6428 (19.20%)

The matching query sequence percentage is only 19.20%. Does that mean that most of the sequences (nearly 80%) are host sequences? Is it due to sequencing error? Can I proceed with this data for taxonomic classification? Will it be meaningful?

My second question is: how can I remove host sequences from rep_seq.qza too, so that I can use the filtered rep_seq.qza in the taxonomic classification step and improve my taxonomic identification? I plan to use the following command:

qiime feature-classifier classify-sklearn --i-classifier gg_13_5_otu_97.qza --i-reads hits.qza --o-classification taxonomy_filtered.qza

Thank you in advance

Not necessarily — this means that 80% of the sequences do not have at least 97% similarity to at least one reference sequence with at least 95% of the query sequence aligned.
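To make the arithmetic explicit: the percentage in the vsearch log is just hits divided by query sequences, so everything else counts as a "miss" regardless of what it really is. A small sketch using the counts from the log above (the helper name `pct` is made up for illustration):

```shell
# pct HITS TOTAL: print a percentage with two decimals (hypothetical helper).
pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

hits=1234     # "Matching query sequences" from the log
total=6428    # number of query sequences in rep-seqs.qza
misses=$((total - hits))

echo "hits:   $hits of $total ($(pct "$hits" "$total")%)"
echo "misses: $misses of $total ($(pct "$misses" "$total")%)"
```

With these counts, 5194 sequences (80.80%) fall below the thresholds; that number says nothing by itself about whether they are host DNA.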

I think you are filtering too stringently. If using a "positive filter" like this to find sequences that resemble bacterial sequences (as opposed to host sequences, which should be quite dissimilar), you should probably reduce the percent similarity setting a bit... I do not know how low you can go, but probably much lower (80%??? It all depends on how dissimilar the host is!).
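As a rough sketch of that suggestion, the command from the question can be wrapped in a small function so the identity threshold is easy to vary. The function name `positive_filter` and the output file names are made up here; the flags are the ones already used in the question. Note that `--p-perc-identity` takes a fraction, so 80% is written as 0.80.

```shell
# positive_filter IDENTITY: rerun the positive filter from the question
# at a different identity threshold (hypothetical helper). Outputs are
# named after the threshold so earlier runs are not overwritten.
positive_filter() {
  identity="$1"
  qiime quality-control exclude-seqs \
    --i-query-sequences rep-seqs.qza \
    --i-reference-sequences gg_13_5_otu_97.qza \
    --p-method vsearch \
    --p-perc-identity "$identity" \
    --p-perc-query-aligned 0.95 \
    --p-threads 4 \
    --o-sequence-hits "hits-${identity}.qza" \
    --o-sequence-misses "misses-${identity}.qza" \
    --verbose
}

# Example: positive_filter 0.80
```

Comparing the "Matching query sequences" line across a few thresholds should show how sensitive the yield is to this setting.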

I am confused. That is what you are doing above.

The filtered rep_seqs.qza is the hits.qza in your example (I assume).
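If the feature table also needs to be restricted to those hits, something like the following might work. This is only a sketch: `table.qza` and the output name are assumed, since the thread does not name the table, and the function name is made up.

```shell
# keep_hits_in_table: keep only the features whose IDs appear in hits.qza
# (hypothetical helper; table.qza is an assumed file name).
keep_hits_in_table() {
  qiime feature-table filter-features \
    --i-table table.qza \
    --m-metadata-file hits.qza \
    --o-filtered-table table-hits.qza
}
```

Here hits.qza is passed as metadata, so its feature IDs decide which rows of the table are kept.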

You can also just proceed with taxonomic classification and remove any sequences that fail to classify (as these are probably host DNA or other non-target DNA unless they are quite similar to target sequences).
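A sketch of that alternative, assuming Greengenes-style taxonomy strings (where a phylum-level assignment contains `p__`) and a trained classifier artifact. `classifier.qza`, the output names, and the function name are placeholders, not files from this thread:

```shell
# classify_and_filter: classify all representative sequences, then keep
# only those annotated at least to phylum level (hypothetical helper).
classify_and_filter() {
  qiime feature-classifier classify-sklearn \
    --i-classifier classifier.qza \
    --i-reads rep-seqs.qza \
    --o-classification taxonomy.qza &&
  qiime taxa filter-seqs \
    --i-sequences rep-seqs.qza \
    --i-taxonomy taxonomy.qza \
    --p-include p__ \
    --o-filtered-sequences rep-seqs-classified.qza
}
```

Sequences left unassigned, or assigned only at kingdom level, carry no `p__` label and are dropped by the second command.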

I hope that helps!


Is this the percentage everyone normally gets, or is this due to sequencing error?

Is 80% an acceptable similarity in terms of publication? In my case the host is human.

Yeah, sorry. I was confused too. I got a better taxonomic classification.

Thank you for your time.

No, this is not normal, and 80% is too high. This is not due to sequencing error, it is due to bad parameter settings (don't use the defaults in your case!). See this comment from above:

Set --p-perc-identity lower — maybe 80%? — and you should get a better yield. Since you are doing this filter to remove non-bacterial (mostly host) DNA, that should be adequate. That will not remove mitochondrial DNA, but you can do an additional filter with qiime taxa filter-table to remove mitochondria as shown here.
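For the mitochondria step, a sketch of what such a filter might look like. `table.qza` and `taxonomy.qza` are assumed names for the feature table and the classify-sklearn output, and the function name is made up:

```shell
# remove_mitochondria: drop features whose taxonomy string contains
# "mitochondria" (hypothetical helper; input names are assumptions).
remove_mitochondria() {
  qiime taxa filter-table \
    --i-table table.qza \
    --i-taxonomy taxonomy.qza \
    --p-exclude mitochondria \
    --o-filtered-table table-no-mitochondria.qza
}
```

`--p-exclude` accepts a comma-separated list, so other unwanted labels could be added in the same call.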

Good luck!

Thank you for the clarification.

As you suggested, I tried again with --p-perc-identity 80%. It improved by only 2.74 percentage points: I got 21.94% matching sequences. I am now trying an even lower identity percentage.

Thank you for your help

Hi @steffi,
You could also try reducing the --p-perc-query-aligned parameter, though 0.95 seems fairly reasonable. If you still have barcodes or adapters in your sequences, those would cause issues with this and should be trimmed if they have not been already (just checking). 80% is surprising but not unheard of! (I used to do work with microbiomes on plant tissue where we’d sometimes lose 90% of reads to host DNA.)

You could take some of the filtered reads and NCBI BLAST them just to confirm that they are host DNA — if not, I would suspect there are barcodes in the sequences.
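One way to pull a handful of sequences for BLAST, sketched under the assumption that the filtered-out reads are in misses.qza. The helper name `first_records` and the output file names are made up:

```shell
# first_records N FASTA: print the first N records of a FASTA file
# (hypothetical helper; also handles sequences wrapped over several lines).
first_records() {
  awk -v n="$1" '/^>/ { if (++seen > n) exit } { print }' "$2"
}

# Usage, assuming the export writes dna-sequences.fasta into the output directory:
#   qiime tools export --input-path misses.qza --output-path misses-export
#   first_records 10 misses-export/dna-sequences.fasta > blast-sample.fasta
```

The resulting blast-sample.fasta can then be pasted into the NCBI BLAST web form.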

Good luck!
