142290491 nt in 99322 seqs, min 1254, max 2353, avg 1433
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 1234 of 6428 (19.20%)
The percentage of matching query sequences is only 19.20%. Does that mean that most of the sequences (nearly 80%) are host sequences? Is it due to sequencing error? Can I proceed with this data for taxonomic classification? Will it be meaningful?
My second question is: how can I remove host sequences from rep_seq.qza too, so that I can use the filtered rep_seq.qza in the taxonomic classification step and improve my taxonomic identification? I plan to use the following command:
Not necessarily. It means that about 80% of the query sequences do not have at least 97% similarity to at least one reference sequence with at least 95% of the query sequence aligned.
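To make the two thresholds concrete, here is a minimal sketch of the match criterion described above. The function name `is_hit` is hypothetical, the defaults of 0.97 and 0.95 are the values quoted in this reply, and treating the thresholds as inclusive is an assumption; it illustrates the logic and is not how vsearch is implemented.

```python
def is_hit(identity, query_aligned, perc_identity=0.97, perc_query_aligned=0.95):
    """Return True if an alignment passes both thresholds (hypothetical helper)."""
    return identity >= perc_identity and query_aligned >= perc_query_aligned


# A read that is 90% identical to its best reference over its full length
# is counted as a non-match at 0.97, but as a match at 0.80.
print(is_hit(0.90, 1.00))                      # False
print(is_hit(0.90, 1.00, perc_identity=0.80))  # True
```

So a sequence can fail the filter simply because its closest reference is a little too distant, not because it is host DNA.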
I think you are filtering too stringently. If using a "positive filter" like this to find sequences that resemble bacterial sequences (as opposed to host sequences, which should be quite dissimilar), you should probably reduce the percent similarity setting a bit... I do not know how low you can go, but probably much lower (80%??? It all depends on how dissimilar the host is!).
I am confused. That is what you are doing above.
The filtered rep_seqs.qza is the hits.qza in your example (I assume).
You can also just proceed with taxonomic classification and remove any sequences that fail to classify (these are probably host DNA or other non-target DNA, unless they are quite similar to target sequences).
No, this is not normal, and 80% is too high. This is not due to sequencing error; it is due to bad parameter settings (don't use the defaults in your case!). See this comment from above:
Set --p-perc-identity lower — maybe 80%? — and you should get a better yield. Since you are doing this filter to remove non-bacterial (mostly host) DNA, that should be adequate. That will not remove mitochondrial DNA, but you can do an additional filter with qiime taxa filter-table to remove mitochondria as shown here.
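As a rough sketch of the two steps suggested here, the snippet below only assembles and prints the command lines; it does not run them. The file names (reference-seqs.qza, misses.qza, table.qza, taxonomy.qza and the output names) are placeholders rather than files from this thread, and the 0.80 / 0.95 values are just the starting points discussed above.

```python
import shlex


def exclude_seqs_cmd(query, reference, hits, misses,
                     perc_identity=0.80, perc_query_aligned=0.95):
    """Build a `qiime quality-control exclude-seqs` command (vsearch method)."""
    return [
        "qiime", "quality-control", "exclude-seqs",
        "--i-query-sequences", query,
        "--i-reference-sequences", reference,
        "--p-method", "vsearch",
        "--p-perc-identity", str(perc_identity),
        "--p-perc-query-aligned", str(perc_query_aligned),
        "--o-sequence-hits", hits,
        "--o-sequence-misses", misses,
    ]


def filter_mitochondria_cmd(table, taxonomy, filtered_table):
    """Build a `qiime taxa filter-table` command that drops mitochondria."""
    return [
        "qiime", "taxa", "filter-table",
        "--i-table", table,
        "--i-taxonomy", taxonomy,
        "--p-exclude", "mitochondria",
        "--o-filtered-table", filtered_table,
    ]


# Placeholder file names; substitute your own artifacts.
commands = [
    exclude_seqs_cmd("rep_seq.qza", "reference-seqs.qza", "hits.qza", "misses.qza"),
    filter_mitochondria_cmd("table.qza", "taxonomy.qza", "table-no-mito.qza"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

The printed lines can be pasted into a shell where QIIME 2 is active, or each list can be passed to `subprocess.run(cmd, check=True)`.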
As you suggested, I tried again with --p-perc-identity set to 80%. It improved by only 2.74 percentage points: I got 21.94% matching sequences. I am now trying an even lower identity percentage.
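For reference, the numbers can be checked with a few lines of Python. The log line is copied from the first run; the hit count for the 80% run is an estimate back-calculated from the reported 21.94%, not a value taken from the log.

```python
import re

# Line copied from the vsearch output of the first run.
log_line = "Matching query sequences: 1234 of 6428 (19.20%)"
match = re.search(r"Matching query sequences: (\d+) of (\d+)", log_line)
hits, total = int(match.group(1)), int(match.group(2))

pct_first_run = 100 * hits / total
print(f"{pct_first_run:.2f}%")  # 19.20%

# Estimated from the reported percentage for the 80% identity run.
pct_second_run = 21.94
est_hits_second_run = round(total * pct_second_run / 100)
print(est_hits_second_run, est_hits_second_run - hits)  # 1410 176
```

In absolute terms, lowering the identity threshold recovered only about 176 additional sequences out of 6428.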
Hi @steffi,
You could also try reducing the --p-perc-query-aligned parameter, though 0.95 seems fairly reasonable. If you still have barcodes or adapters in your sequences, those would cause issues with this and should be trimmed if they have not been already (just checking). 80% is surprising but not unheard of! (I used to do work with microbiomes on plant tissue where we'd sometimes lose 90% of reads to host DNA.)
You could take some of the filtered reads and NCBI BLAST them just to confirm that they are host DNA — if not, I would suspect there are barcodes in the sequences.
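One way to pick a handful of sequences for a BLAST check is sketched below. It assumes the sequences that were filtered out have been exported to FASTA (for example with `qiime tools export`); the helper names are hypothetical and the inline FASTA text is a toy example, not real data.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, n, seed=0):
    """Pick up to n records at random (seeded so the choice is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def to_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Toy input; in practice, read the exported FASTA file instead.
example = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
subset = sample_records(read_fasta(example), 2)
print(to_fasta(subset))
```

The printed FASTA can be pasted directly into the NCBI BLAST web form.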