>90% of sequences are unidentified?!? WHY?

NPK · May 29, 2019, 12:00pm

Hello all q2 experts,

Could someone please give me some advice on my situation?
Here is a summary of my project:

ITS1F/ITS2 PE300 MiSeq
demultiplexed input
dada2 denoise-paired ( --p-trim-left-f 22 --p-trim-left-r 20 --p-trunc-len-f 280 --p-trunc-len-r 180 )
unite-ver8-99-classifier-02.02.2019

I also tried the ITSxpress plugin, and almost everything got filtered out.

I am wondering why there is such a large percentage of non-target amplification or error in all of my samples.

Regards,
Namie

Nicholas_Bokulich · May 29, 2019, 12:27pm

Welcome @NPK!

What sorts of samples are you analyzing? You could be amplifying e.g., plant DNA... I used to run into this problem frequently when using similar primers to examine plant-associated microbial communities.

What taxonomy classification method did you use? Another possibility is that your reads are in the wrong orientation, which will confuse classify-sklearn. Try the BLAST- or vsearch-based classifiers in QIIME 2. If you get the same result, this is probably non-target amplification.
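
A minimal sketch of the vsearch-based alternative, in case it helps. The file names (rep-seqs.qza, unite-ref-seqs.qza, unite-ref-taxonomy.qza) are placeholders, not files from this thread. Unlike classify-sklearn, this method takes the reference sequences and reference taxonomy as separate artifacts rather than a trained classifier. The script only prints the command unless RUN=1 is set, and newer QIIME 2 releases may require additional outputs (check `--help` for your version).

```shell
# Hypothetical file names; substitute your own artifacts.
QUERY=rep-seqs.qza
REF_SEQS=unite-ref-seqs.qza
REF_TAX=unite-ref-taxonomy.qza

# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Consensus taxonomy assignment from vsearch alignments.
classify_vsearch() {
  run qiime feature-classifier classify-consensus-vsearch \
    --i-query "$QUERY" \
    --i-reference-reads "$REF_SEQS" \
    --i-reference-taxonomy "$REF_TAX" \
    --o-classification taxonomy-vsearch.qza
}

classify_vsearch
```

If the vsearch result still shows mostly unassigned or plant hits, read orientation is unlikely to be the cause.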

The ITSxpress results you see could indicate either of these problems, but maybe @Adam_Rivers has some more ideas?

NPK · May 29, 2019, 12:52pm

Hi Nicholas,
Yes, you are correct. I am analyzing the fungal community from eucalyptus leaves. I thought ITS1F was supposed to prevent amplification of plant DNA?

Yes, I used classify-sklearn. Thanks for your recommendation, I will try other classifiers and see how it goes.

Adam_Rivers · May 29, 2019, 1:03pm

ITSxpress is most likely not merging most of your reads because it has pretty high quality thresholds, but it is a bit hard to tell from the information I can see in the post.

I'd second the suggestion to BLAST a subset of reads and try to get a better sense of what's happening. Plant contamination seems like the most likely culprit.

NPK · May 29, 2019, 3:29pm

I tried again with the UNITE database that covers all eukaryotes, and it is true: all those bastards are plant contamination. Thank you @Nicholas_Bokulich @Adam_Rivers

Now, can I have some advice on how to avoid this situation? How can I lower the chance of amplifying plant DNA? Does it all come down to the library prep step?

Nicholas_Bokulich · May 29, 2019, 3:47pm

You have already used the best method: choose primers that do not amplify plant DNA. ITS1F is supposed to do that, but it is obviously not doing its job!

Library prep is where most of this should happen, e.g., by removing plant matter from your samples prior to DNA extraction, perhaps by rinsing leaves and then filtering.

When I have done plant-associated microbiome work I have just attempted to increase the sequencing depth (i.e., put fewer samples on a single sequencing run) so that I can afford to lose some of my sequences to non-target hits. In some samples I would lose 90% of my sequences! And some samples could not be recovered. But if you have enough non-plant sequences left over you can just proceed with the leftovers.
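
As a rough back-of-the-envelope sketch of that trade-off: if a given percentage of reads is lost to non-target hits, the raw depth needed per sample is the target depth divided by the fraction that survives. The function name and the numbers below (5000 target reads, 90% loss) are illustrative assumptions, not values from any real run.

```shell
# raw_reads_needed TARGET_READS NONTARGET_PERCENT
# Prints how many raw reads per sample are needed so that TARGET_READS
# remain after NONTARGET_PERCENT of them are discarded.
raw_reads_needed() {
  target=$1
  pct=$2
  if [ "$pct" -ge 100 ]; then
    echo "non-target percentage must be below 100" >&2
    return 1
  fi
  keep=$((100 - pct))
  # Ceiling division so the estimate never falls short of the target.
  echo $(( (target * 100 + keep - 1) / keep ))
}

# Hypothetical example: keep 5000 reads when 90% are plant.
raw_reads_needed 5000 90   # prints 50000
```

In other words, losing 90% of reads means sequencing about ten times deeper per sample, which is why fewer samples fit on a run.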

ben · May 29, 2019, 4:59pm

@NPK

This may be analogous: I have the same problem with low-biomass lung samples. To exclude reads from eukaryotic sources, you can add a quality-filter step in which you essentially BLAST/vsearch your reads against a taxonomic file (99_otus.txt) from your training set. Once you have generated the hits and misses .qza files, you can filter ALL of the "misses" out of your table/sequences.

I use this code:

qiime quality-control exclude-seqs \
  --i-query-sequences ~/id-filtered-seqs.qza \
  --i-reference-sequences ~/greengenes/trained.v4/99_otus/99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --p-threads 4 \
  --o-sequence-hits ~/99_hits.qza \
  --o-sequence-misses ~/99_misses.qza

Then filter your sequences file by excluding the misses and, once that is done, filter your table.

Ben

Nicholas_Bokulich · May 29, 2019, 5:14pm

That method is great for miscellaneous non-target DNA, but I would actually discourage this for ITS data, just because your non-target plant hits are still ITS sequences and you would need to figure out a reasonable threshold of sequence similarity (i.e., how dissimilar plant ITS is from fungal ITS) to use the exclude-seqs method.

Instead, ITSxpress should do a good job of removing most plant reads. Any plant reads that pass can be filtered out after taxonomy classification, using qiime taxa filter-table as shown in this tutorial. Something like this:

qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude k__Viridiplantae \
  --o-filtered-table table-no-plants.qza

or better yet (in case you hit multiple non-fungal kingdoms):

qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include k__Fungi \
  --o-filtered-table table-no-plants.qza
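
The representative sequences usually need the same filter so that they stay in sync with the table. A sketch using qiime taxa filter-seqs follows; rep-seqs.qza and rep-seqs-fungi.qza are placeholder names, and the script only prints the command unless RUN=1 is set.

```shell
# Print the command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Keep only sequences whose taxonomy annotation contains k__Fungi.
filter_fungal_seqs() {
  run qiime taxa filter-seqs \
    --i-sequences rep-seqs.qza \
    --i-taxonomy taxonomy.qza \
    --p-include k__Fungi \
    --o-filtered-sequences rep-seqs-fungi.qza
}

filter_fungal_seqs
```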

ben · May 29, 2019, 5:16pm

@NPK

Thanks Nick, I figured there should be caveats with ITS. I am not an expert on ITS.

NPK · May 30, 2019, 11:23am

Thanks everyone for all the advice. I really appreciate it.