The expected 460 nt amplicon length includes the primer sites; however, you have likely removed those (as you should) prior to DADA2, so the difference in length is due to that.
The V3-V4 region can hit some nonspecific targets, as you have seen. In my experience it can pick up quite a few mouse host genes if the sample is high in host cells. Either way, removing these reads is important and I would recommend doing so.
You have a couple of options here. To get rid of non-16S reads you can use a permissive positive filter like the one implemented in Deblur by default (I believe it uses Greengenes OTUs clustered at 88% identity). You can then exclude sequences using quality-control exclude-seqs with a very permissive threshold, for example 65% identity with 50% coverage. This will basically toss away any reads that look weird and nothing like bacterial 16S. I've found this method works very well and is fast. Alternatively, you can build your taxonomy file first and then use taxonomy-based filtering to discard reads that don't hit at least at the phylum level in a bacterial database. I prefer the first approach, but I don't have any benchmarking data to recommend one over the other. See what works best for you.
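Roughly, the positive-filter route could look like this (a sketch only; the file names are placeholders for your own artifacts, and the thresholds are the permissive ones mentioned above):

```shell
# Permissive positive filter: compare rep seqs against a reference
# (e.g. 88%-clustered Greengenes OTUs, as Deblur uses by default).
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences gg-88-otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.50 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

# Then keep only the features that hit the reference in your table.
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file hits.qza \
  --o-filtered-table table-filtered.qza
```

It is worth peeking at misses.qza before discarding, just to confirm those reads really are junk.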
I doubt it, since these are real targets hit by the primers rather than chimeras. Also note that DADA2 already includes a chimera-removal step, so you don't need to do this again separately.
I would advise you to read the literature and the various posts on this forum on why you should (or, more likely, shouldn't) use OTU picking, and to stick with your ASVs.
Hi @Mehrbod_Estaki, Thank you so much for your thorough explanation! I have a few more questions regarding your suggestions if you don’t mind:
I saw that quality-control exclude-seqs needs --i-reference-sequences, for which you suggested the one implemented in Deblur. I have not used Deblur yet; how could I get that filter? Would a Greengenes database (say the 88% OTUs, if that's what Deblur uses) work the same way?
Is the 65% identity / 50% coverage threshold somewhat arbitrary, as long as it is very permissive?
I'm using the open-reference OTU picking method. If I keep only 'Bacteria' features at the phylum level with this approach, could I end up discarding unassigned bacterial OTUs? Or should I not worry about this, because bacterial OTUs that don't hit the database would at least be assigned as 'Bacteria' at the phylum level?
Besides these two approaches you suggested, I also saw suggestions to get rid of sequences shorter than 490 nt. What are your thoughts on this? Is there a way to do this in QIIME 2?
Yes, this is exactly what I meant, and what Deblur uses. You can download the Greengenes files from the data resource page.
I believe some benchmarking was done with these parameters; you can look at the details here.
With open-reference picking (which I don't recommend) you get a mix: reads that hit your reference database inherit its taxonomy names, while anything that doesn't hit gets a 'de novo' label and thus no taxonomy name. So taxonomy-based filtering will only work on the portion that was hit. You could assign taxonomy to those de novo reads, but that just sounds like a lot of extra, unnecessary steps. I would skip OTU clustering altogether unless you have a specific reason for it. If you really do need OTU picking, I would do DADA2 + de novo picking + taxonomy assignment with a naive Bayes classifier + then filter based on taxonomy.
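For the taxonomy-based filtering step, a minimal sketch might look like this (file names are placeholders; --p-include p__ keeps only features whose assignment contains a phylum label):

```shell
# Taxonomy-based filter: keep only features assigned at least
# to a phylum (i.e. whose taxonomy string contains "p__").
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --o-filtered-table table-with-phyla.qza
```

The matching qiime taxa filter-seqs action applies the same logic to your representative sequences.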
I believe you mean 390 nt, and even then you need to account for having removed your primers, so ~350 is more likely in your situation. If these reads pass your positive filter and/or taxonomy-based filter, you may want to BLAST a few of them to see what they actually are before discarding them. If you do end up wanting to remove them, you can use this nifty little hack within QIIME 2 to do that.
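If I remember right, the hack is along these lines (a sketch, not the exact linked post; file names and the 349 cutoff are placeholders you would adjust): the representative sequences are passed in as their own metadata, so the --p-where clause can filter on sequence length.

```shell
# Length-based filter: use the rep seqs artifact as metadata so
# that SQLite's length() can be applied to the sequence column.
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file rep-seqs.qza \
  --p-where 'length(sequence) > 349' \
  --o-filtered-data rep-seqs-length-filtered.qza
```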
Thank you for your fast reply! I understand your suggestions to my original question now, but I guess I'm getting more questions about my pipeline (thank you for bringing it up).
I see your suggestion of sticking with ASVs rather than diving into OTUs. Our lab has been doing OTU picking and I guess I've never asked why. I will definitely read more of the literature on this. But say I am working with OTUs: why is open reference a bad idea? Isn't it something between the two extremes (closed reference and de novo)? I have soil samples, and in one sample I checked, about one third of the OTUs were de novo after open-reference picking. These de novo OTUs could still be assigned taxonomy to a certain level using feature-classifier classify-consensus-vsearch.
Sorry, I did mean 390 nt.
What is the little hack here? Greatly appreciated!
Yes, and you certainly can (and should) do both IF you need to do OTU picking. In that presentation I explain that in most cases ASVs are the best approach, but there are some cases where your biological question is best answered by OTU picking. In those rare instances we recommend that you still start with ASVs (the denoisers have much superior quality-control methods) and then collapse your ASVs down to OTUs as needed.
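Collapsing ASVs down to OTUs might look something like this (a sketch with placeholder file names and an example 97% threshold):

```shell
# Start from DADA2 ASVs, then cluster them de novo into 97% OTUs.
qiime vsearch cluster-features-de-novo \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-97.qza \
  --o-clustered-sequences rep-seqs-97.qza
```

This way you keep the denoising benefits of DADA2 and only trade resolution at the clustering step.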
Hopefully that makes a bit more sense!