exclude-seqs: Exclude sequences by alignment

Hello,

I have a handful of ASVs that is making my unweighted-unifrac PCOA plot have a huge hole in the middle of the cluster which is not biologically reasonable. The weighted unifrac PCOA clusters according to biologically sound ways. I tried the following code to exclude sequences by alignment:

qiime quality-control exclude-seqs
–i-query-sequences filtered-rep-seq.qza
–i-reference-sequences silva-138-99-seqs-515-806.qza
–p-method blast
–p-perc-identity 0.70
–p-perc-query-aligned 0.10
–o-sequence-hits hits.qza
–o-sequence-misses misses.qza

I chose perc-identity 0.70 because i also used sklearn-classifier perc-identity 0.70 (the default) and did not want to filter by perc-identity twice. My question is: how do I decide what perc-query-aligned value to use? the default of 0.97 seems far too stringent for microbiome analysis.

In case it is relevant, I am working on gut microbiome of marine animals and sequenced the 16s rRNA gene.

Thank you so much.

Hi @Melissa_Soh ,

What are these ASVs? is it biologically reasonable to exclude them? E.g., can you confirm that they are contaminants etc?

classify-sklearn does not perform alignment, it is a naive Bayes classifier based on kmer frequencies and the 0.7 is the confidence score, not a percent identity. So it is not relevant to consider in this context.

the 0.97 percent-identity default is based on the scenario where you might be filtering out something really specific so, e.g., looking for matches to a set of host sequences. I agree it is too strict for inclusion criteria in your case.

the perc-query-aligned and perc-identity should be defined based on evidence and expectations. How dissimilar are the sequences that you are trying to exclude? There are various publications out there describing the similarity of 16S rRNA gene sequences within different taxonomic groups, which you could consult to answer this.

perc-query-aligned 0.10 is very low and does not make sense unless if you expect the query to be a very poor match to the reference for some reason (unlikely for 16S rRNA genes)

Any chance the “bad ASVs” are host mitochondrial sequences or other non-target DNA? Instead of using q2-quality-control you could just use q2-taxa and exclude based on taxonomic label (e.g., classification to mitochondria).

Good luck!

1 Like

@Melissa_Soh I’d echo the suggestion that you have a close look at the bad ASVs, anecdotally if it is not off target DNA I have also encountered on-target DNA, but where the forward and reverse reads were flipped so a few ASVs was the reverse complement of what they should have been. I believe blast is fine with the reverse complement, but it will ruin the phylogeny.

2 Likes

quick side-note: RESCRIPt (a relatively new plugin) can be used to fix read orientation prior to classification or alignment/tree-building:
https://library.qiime2.org/plugins/rescript/27/

2 Likes

Hello Nicholas_Bokulich and dwt,

Thank you so much for helping me.

I would first like to ask something generic before moving on to my main question:

I tried running the following two scripts:

qiime quality-control exclude-seqs
–i-query-sequences filtered-rep-seq.qza
–i-reference-sequences silva-138-99-seqs-515-806.qza
–p-method blast
–p-perc-identity 0.70
–p-perc-query-aligned 0.97
–o-sequence-hits hits.qza
–o-sequence-misses misses.qza

and

qiime quality-control exclude-seqs
–i-query-sequences filtered-rep-seq.qza
–i-reference-sequences silva-138-99-seqs-515-806.qza
–p-method blast
–p-perc-identity 0.0
–p-perc-query-aligned 0.97
–o-sequence-hits hits.qza
–o-sequence-misses misses.qza

Both scripts gave the same results. I thought that is was confirmation that the sklearn-classifier --p-confidance is the same as --p-perc-identity and I now understand that I misunderstood. The fact that both scripts resulted in the same output could be because none of the ASVs had a perc-identity of <0.7. Is this correct?

my main concern is:

to give more context, I sequenced by gut microbial samples using single end EMP primers targetting 16s rRNA gene. The sequencing provider demultiplexed the reads and I imported them into qiime. I then use dada2 to denoise the sequences based on length (i check the quality by length and chose a cutoff). Next, I used classify-sklearn and input a trained silva.qza file. I then created a qzv of the taxonomy and viewed it on qiime 2 view. I downloaded the csv and made a new csv with only bacterial ASVs (I removed eukaryotic, unclassified, archaea ASVs and ASVs that were only classified as bacteria and at <85% confidence). I then used feature-table filter-features to filter the ASV table, keeping only the ASVs I indicated in my csv. I then used feature-table filter-seqs to filter my representative sequences according to my filtered ASV table. Next, I classified with classify-sklearn and did align-to-tree-mafft-fasttree. Finally, i did core-metrics-phylogenetic. This was when I observed the gap in the unweighted unifrac PCOA.

My question is: Is there a way to find out which ASVs are the cause? I looked at the aligned sequences (after filtering by taxonomy) and noticed that a handful do not align as well as the rest. I assumed that these are the ones that are causing a gap in the PCOA. As such, I tried filtering by feature-abundance. Removing features that only have 14 counts removed the large gap in my PCOA. I arbitrarily chose 14 as the cutoff just to see if it works. I also tried the exclude-seqs method, where all parameters that I tested removed the large gap. I tried to look for similar literature to no avail.

I used NCBI blast to check the sequences that are not aligned well and found that some are eukaryotes. I am not sure why they are there when I have already filtered out the eukaryotes and archaea etc in the feature-table filter-features step.

from my understanding, there shouldn’t be a problem with the direction of the ASVs as the sequences have been generated using single end emp primers and demultiplexed accordingly. I hope this makes sense :confused:

Thank you once again for your kind help.

Sincerely,
Melissa

1 Like

Correct

I recommend using q2-taxa to filter based on taxonomy — it will save you time and preserve your data provenance. See the tutorials at docs.qiime2.org for examples of removing non-bacteria.

You also did not mention mitochondrial sequences — I wonder if these could be the issue.

Could your tree be the issue? See the documentation for q2-fragment-insertion and give that a try… the error you are describing sounds a bit like the example given in the documentation for that method. At the very least, the tree will need to be remade after removing non-bacterial sequences.

Sounds like you ran classify-sklearn twice? No need to do that… once is enough, you will not get a different result by running it the second time (and if you really want to remove the non-bacterial taxonomies from your FeatureData[Taxonomy] Artifact, you can use RESCRIPt to do so).

This sounds like a good approach. It is unclear from your description whether this could be due to ASVs present in a particular group, or in all samples. You could just try removing these poorly aligned sequences and see what it does.

I am not sure why either — you could check the feature IDs to see what they were classified to with classify-sklearn.

You should not filter based on the confidence score. The confidence score reported is for the terminal rank only, and this is how the classify-sklearn method chooses the depth of classification. Use the --p-confidence parameter instead to adjust the threshold you want.

That’s correct. If you used the EMP primers and protocol, the reads should all be in the same orientation.

Good luck!

2 Likes

Hello Nicholas_Bokulich,

Thank you so much for your explanations and suggestions.

I did it this way this time with the following script:
qiime taxa filter-table
–i-table table.qza
–i-taxonomy silva-taxonomy.qza
–p-exclude archaea,eukaryota,mitochondria,chloroplast,unassigned
–o-filtered-table filtered-table.qza

I reviewed the parkinsons mouse tutorial and I did fragment-insertion sepp and fragment-insertion filter-features. I then used feature-table filter-seqs to generate a new rep-seqs.qza. I aligned once more using align-to-tree-mafft-fasttree and subsequently diversity core-matrics-phylogenetic. When I viewed the unweighted unifrac PCOA, the gap is still there. When i compared the PCOA i generated without sepp and with sepp filtering, the PCOA looks identical. I then exported the aligned rep-seq.qza as a fasta file and took a look. Adding a sepp filtering step did not decrease the number of ASVs.

do you have suggestions as to how to do this? I thought that fragment-insertion sepp and the subsequent fragment-insertion filter-features would do the trick but it did not. I also tried exclude-seqs (as described in the first post) and it did close up the unexplained gap in the clusters but i am unsure about choosing the parameters. I also tried to remove the low abundance reads and it also helped close up the unexplained gap in the clusters. However, I read about filtering by abundance and found out that it is not encouraged because dada2 already got rid of spurious reads.

I tried this, and found that some of the ASVs were classified only by the domain of bacteria.

When i tried quality-control exclude-seqs at a very low threshold of perc-query-aligned 0.01 (i tried 1% just to see how it turns out) the cluster is closed up, indicating to me that the sequences removed by quality-control exclude-seqs were very misaligned (is that the correct inference?). Yet, when I use fragment-insertion sepp and filter-features, the cluster does not, indicating to me that there are no sequences that are very misaligned.

I am unsure of how to proceed. Thank you once again for your help.

Sincerely,
Melissa

Wow… it is surprising that fragment insertion is not dropping these. exclude-seqs searches in both directions, so this is not a directionality issue.

exclude-seqs as you described sounds like a fine way to filter out these non-target sequences that are failing to align and are causing this gap. You could also use qiime taxa filter-table to include only sequences that are classified to at least phylum level — this is quite routine, and you can see docs.qiime2.org for examples.

Those features that fail to align/classify: you should spot check a few with NCBI BLAST to see what they are… chances are they are non-target (maybe host DNA).

1 Like

Hello Nicholas_Bokulich,

This time, I filtered out ASVs that are not classified at least to the phylum level (together with excluding archaea,eukaryota,mitochondria,chloroplast,unassigned) as you suggested and the PCOA gap closed up. It is good to know that the tutorial suggested it and that it is routine.

which extract-reads command are you referring to?

Thank you, I will try this out. So far for the few I tried, they matched to bacterial sequences but with relatively low identity/ partial matches only.

Thank you so much for your patience!

Cheers,
Melissa

Very glad to hear that fixed your issue!

sorry, I miswrote — I meant exclude-seqs :grin: I was referring to this comment from you:

I will edit above for clarity, thanks for catching!

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.