16S V3-V4 length: what to do with sequences shorter than the expected length after DADA2?

Liang_Cheng · November 3, 2020, 11:35pm

Hi All,

I'm working on my 16S V3-V4 paired-end sequences, for which the amplicon length should be around 460 nt. However, after joining them with DADA2, the Sequence Length summary looks like the following:

Sequence Count 21452
Min Length 271
Max Length 457
Mean Length 412.53
Range 186
Standard Deviation 26.46

I understand that V3-V4 length varies, but I have a number of sequences that are too short (less than 300 nt). I blasted the shortest one (271 nt) after OTU picking and the best match was fungal.
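A minimal sketch of how a length summary like the one above can be reproduced and how the short sequences can be counted. The feature IDs and sequences are hypothetical placeholders (only their lengths matter), and the 300 nt cutoff is simply the figure mentioned in the post, not a recommended threshold.

```python
from statistics import mean

def length_summary(seqs, short_cutoff=300):
    """Summarise sequence lengths for a {feature_id: sequence} mapping."""
    lengths = [len(s) for s in seqs.values()]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 2),
        "range": max(lengths) - min(lengths),
        # Number of sequences strictly shorter than the cutoff.
        "n_short": sum(1 for n in lengths if n < short_cutoff),
    }

# Hypothetical placeholder sequences; only their lengths matter here.
toy_seqs = {
    "asv1": "A" * 271,
    "asv2": "A" * 402,
    "asv3": "A" * 427,
    "asv4": "A" * 457,
}
summary = length_summary(toy_seqs)
print(summary)
```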

Questions:

1. Would you suggest that I get rid of them, and if so, how could I do this in QIIME 2?
2. I will run vsearch uchime-denovo after picking OTUs. Would uchime-denovo remove these weird short sequences if I don't remove them at this step?

Thank you very much. I really appreciate your help.

Mehrbod_Estaki · November 6, 2020, 5:52am

Hi @Liang_Cheng,
The expected 460 nt amplicon length includes the primer sites. However, you have likely removed those (as you should) prior to DADA2, so part of the difference in length is due to that.
The V3-V4 region can hit some non-specific targets, as you have seen. My experience is that it can hit quite a few mouse host genes if the sample is high in host cells. Either way, their removal is important and I would recommend doing this.
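The primer arithmetic can be sketched as follows. The thread does not say which primers were used, so the commonly used 341F/805R pair is assumed here purely for illustration; substitute the actual primer sequences to get the right number.

```python
# Assumption: the widely used 341F/805R V3-V4 primer pair. The thread does
# not state which primers were actually used.
FWD_PRIMER = "CCTACGGGNGGCWGCAG"       # 341F (assumed)
REV_PRIMER = "GACTACHVGGGTATCTAATCC"   # 805R (assumed)

def expected_length_without_primers(amplicon_len, fwd_primer, rev_primer):
    """Expected amplicon length once both primer sites have been trimmed."""
    return amplicon_len - len(fwd_primer) - len(rev_primer)

expected = expected_length_without_primers(460, FWD_PRIMER, REV_PRIMER)
print(expected)  # 460 - 17 - 21 = 422
```

Under these assumptions the expected length after primer removal is closer to 420 nt than 460 nt, so any length cutoff should be shifted down by the same amount.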

You have a couple of options here. To get rid of non-16S reads you can use a permissive positive filter like the one implemented in Deblur by default, which I believe uses the 88% clustered Greengenes OTUs. You can then exclude sequences using quality-control exclude-seqs with a very permissive threshold, like 65% identity with 50% coverage. This will basically toss away any reads that look weird and nothing like bacteria. I've found this method works really well for me and is very fast.

Alternatively, you can build your taxonomy file first and then use taxonomy-based filtering to discard reads that are not assigned to at least the phylum level against a bacterial database. I prefer the first approach but I don't have any benchmarking data to recommend one over the other. See what works best for you.
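The decision rule behind the permissive positive filter can be sketched in plain Python. This is not how quality-control exclude-seqs is implemented; it only illustrates the thresholds discussed above. The `Hit` class, the feature IDs, and the alignment values are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    identity: float        # fraction of identical positions in the alignment
    query_coverage: float  # fraction of the query covered by the alignment

def split_by_positive_filter(feature_ids, best_hits,
                             min_identity=0.65, min_coverage=0.50):
    """Split features into (kept, discarded) using their best reference hit.

    Features with no hit at all, or with a hit below either threshold,
    are discarded as unlikely to be 16S.
    """
    kept, discarded = [], []
    for fid in feature_ids:
        hit = best_hits.get(fid)
        if (hit is not None
                and hit.identity >= min_identity
                and hit.query_coverage >= min_coverage):
            kept.append(fid)
        else:
            discarded.append(fid)
    return kept, discarded

# Hypothetical best hits against a 16S reference set.
best_hits = {
    "asv1": Hit(identity=0.98, query_coverage=1.00),  # clearly 16S
    "asv2": Hit(identity=0.70, query_coverage=0.60),  # divergent but plausible
    "asv3": Hit(identity=0.90, query_coverage=0.20),  # only a short local match
}
kept, discarded = split_by_positive_filter(
    ["asv1", "asv2", "asv3", "asv4"], best_hits
)
print(kept, discarded)
```

Because the thresholds are so loose, only sequences that barely resemble any reference 16S sequence (or match none at all, like `asv4` here) end up discarded.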

As for uchime-denovo removing them: I doubt it, since these are real targets hit by the primers and not chimeras. Also note that DADA2 already has a chimera removal step built in, so you don't need to do this again separately.

I would advise you to read the literature and the various posts on this forum as to why you should (or, more likely, shouldn't) use OTU picking rather than sticking with your ASVs.

Hope this helps!

Liang_Cheng · November 6, 2020, 8:12pm

Hi @Mehrbod_Estaki, Thank you so much for your thorough explanation! I have a few more questions regarding your suggestions if you don't mind:

I saw that quality-control exclude-seqs needs --i-reference-sequences, for which you suggested the filter implemented in Deblur. I have not used Deblur yet; how could I get the filter? Would a Greengenes database (say 88%, if that's what Deblur uses) work the same way?

Is the threshold of 65% identity with 50% coverage fairly arbitrary, as long as it is very permissive?

I'm using the open-reference OTU picking method. Is it possible that I would discard unassigned bacterial OTUs with this approach if I keep only 'Bacteria' at the phylum level? Or should I not worry about this, because bacterial OTUs that do not hit the database would at least be assigned as 'Bacteria' at the phylum level?

Besides these two approaches you suggested, I also saw suggestions to get rid of sequences less than 490 nts. What's your thought on this? Is there a way to do this in Qiime2?

Thank you again and I appreciate your help.

Mehrbod_Estaki · November 6, 2020, 9:20pm

Hi @Liang_Cheng,

Yes, this is exactly what I meant, and what Deblur uses. You can download the Greengenes files from the data resource page.

I believe there was some benchmarking done with these parameters; you can look at those details here.

With open-reference picking (which I don't recommend) you get a mix: reads that hit your reference database will have taxonomy names, and anything that doesn't hit will get 'de novo' naming and thus no taxonomy name. So taxonomy-based filtering will only work on the portion that was hit. You could assign taxonomy to those de novo reads, but that just sounds like a lot of extra unnecessary steps. I would just skip OTU clustering altogether unless you have a specific reason to do it. If you really need OTU picking for some reason, I would do DADA2, then de novo picking, then assign taxonomy using a naive Bayes classifier, and then filter based on taxonomy.
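A small sketch of why taxonomy-based filtering only acts on features that actually carry a taxonomy label. It assumes Greengenes-style label strings (`k__...; p__...`); the feature IDs and labels are hypothetical, and features without any label stand in for 'de novo' OTUs that never received taxonomy.

```python
def has_phylum(taxon):
    """True if a Greengenes-style label has a non-empty phylum (p__) rank."""
    ranks = [rank.strip() for rank in taxon.split(";")]
    return any(rank.startswith("p__") and len(rank) > len("p__")
               for rank in ranks)

def filter_by_phylum(feature_ids, taxonomy):
    """Keep only features annotated to at least the phylum level.

    Features missing from `taxonomy` are treated as unannotated and dropped.
    """
    return [fid for fid in feature_ids if has_phylum(taxonomy.get(fid, ""))]

# Hypothetical labels; "denovo1" has no entry at all.
taxonomy = {
    "otu1": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "otu2": "k__Bacteria; p__",   # kingdom only, empty phylum
    "otu3": "Unassigned",
}
retained = filter_by_phylum(["otu1", "otu2", "otu3", "denovo1"], taxonomy)
print(retained)
```

Here `denovo1` is dropped only because it was never classified, which is the reason for assigning taxonomy to every feature before filtering on it.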

I believe you mean 390 nt, and even then you need to account for having removed your primers, so ~350 nt is more likely the right cutoff in your situation. If these reads pass your positive filtering and/or taxonomy-based filtering, you may want to blast a few of them to see what they actually are before discarding them. If you do end up wanting to remove them, you can use this nifty little hack within QIIME 2 to do that.
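The link to the QIIME 2 trick is not preserved in this thread, so the following is not that trick. It is only a plain-Python sketch of what a length filter has to do: drop short sequences and remove the same features from the count table. The data structures, feature IDs, and counts are hypothetical; the 350 nt cutoff is the figure suggested above.

```python
def filter_by_length(seqs, table, min_length):
    """Drop features shorter than `min_length` from sequences and table.

    seqs:  {feature_id: sequence}
    table: {sample_id: {feature_id: count}}
    """
    keep = {fid for fid, seq in seqs.items() if len(seq) >= min_length}
    filtered_seqs = {fid: seq for fid, seq in seqs.items() if fid in keep}
    filtered_table = {
        sample: {fid: n for fid, n in counts.items() if fid in keep}
        for sample, counts in table.items()
    }
    return filtered_seqs, filtered_table

# Hypothetical placeholder data; only the sequence lengths matter.
seqs = {"asv1": "A" * 271, "asv2": "A" * 402, "asv3": "A" * 427}
table = {
    "sampleA": {"asv1": 12, "asv2": 300},
    "sampleB": {"asv2": 150, "asv3": 80},
}
long_seqs, long_table = filter_by_length(seqs, table, min_length=350)
print(sorted(long_seqs), long_table)
```

Filtering the table together with the sequences keeps the two in sync, so downstream steps never see a feature ID that has counts but no sequence.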

Liang_Cheng · November 6, 2020, 10:31pm

Hi @Mehrbod_Estaki,

Thank you for your fast reply! I understand your suggestions for my original question now, but I guess I'm getting more questions about my pipeline (thank you for bringing it up).

I see your suggestion of sticking with ASVs rather than diving into OTUs. Our lab has been doing OTU picking and I guess I've never asked why. I will definitely read more literature on this. But say I'm working with OTUs: why is open reference a bad idea? Isn't it something between the two extremes (closed reference and de novo)? I have soil samples, and I checked one of them, which got about one third de novo OTUs after open-reference picking. These de novo OTUs could still be assigned to some level of taxonomy using feature-classifier classify-consensus-vsearch.

Sorry I did mean 390 nts.

What is the little hack here? Greatly appreciated!

Mehrbod_Estaki · November 6, 2020, 10:52pm

Oops, sorry I forgot to embed the link. Here is the trick for the length filtering.

Better late than never! Certainly worth challenging that notion in most cases.

Not really; to me it's basically picking the worst of both worlds, but that is another topic on its own and unrelated to this one. For your interest, several reading materials are collected in this thread.

Liang_Cheng · November 6, 2020, 10:57pm

Thank you so much! This really helps!

Now I really need to read about ASVs vs. OTU picking. (I just checked your slides on this from the last workshop, and you said "You can do both".)

Mehrbod_Estaki · November 7, 2020, 12:49pm

Yes, and you certainly can (and should) do both IF you need to do OTU picking. In that presentation I explain that in most cases using the ASVs is the best approach, but there are some cases where your biological question is best answered by OTU picking. In those rare instances we recommend that you still start with ASVs (as the denoisers have far superior quality control methods) and then collapse your ASVs down to OTUs as needed.
Hopefully that makes a bit more sense!
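A toy sketch of what "collapsing ASVs down to OTUs" means: greedy, abundance-ordered clustering in which each ASV joins the first centroid it is at least 97% similar to, and counts are summed per cluster. This is not the vsearch algorithm used by QIIME 2. The similarity measure here is `difflib.SequenceMatcher.ratio()`, a rough stand-in for alignment-based percent identity, and the sequences, IDs, and counts are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1]; a stand-in for percent identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def collapse_asvs(asv_seqs, asv_counts, threshold=0.97):
    """Greedily cluster ASVs into OTUs, most abundant ASVs first.

    Returns (membership, otu_counts), where membership maps each ASV to the
    ID of its centroid and otu_counts sums the ASV counts per centroid.
    """
    order = sorted(asv_seqs, key=lambda fid: (-asv_counts[fid], fid))
    centroids, membership = [], {}
    for fid in order:
        for centroid in centroids:
            if similarity(asv_seqs[fid], asv_seqs[centroid]) >= threshold:
                membership[fid] = centroid
                break
        else:
            # No centroid is similar enough, so this ASV starts a new OTU.
            centroids.append(fid)
            membership[fid] = fid
    otu_counts = {}
    for fid, centroid in membership.items():
        otu_counts[centroid] = otu_counts.get(centroid, 0) + asv_counts[fid]
    return membership, otu_counts

# Hypothetical ASVs: asv2 differs from asv1 by one substitution in 100 nt.
base = "ACGT" * 25
asv_seqs = {
    "asv1": base,
    "asv2": base[:50] + "A" + base[51:],
    "asv3": "GGCC" * 20,
}
asv_counts = {"asv1": 100, "asv2": 10, "asv3": 5}
membership, otu_counts = collapse_asvs(asv_seqs, asv_counts)
print(membership, otu_counts)
```

Starting from denoised ASVs means the clustering step only merges real sequence variants, rather than also having to absorb sequencing errors.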
