Hi @joaomiranda,
I don't see cutadapt anywhere in the provenance of your table-single2.qzv. Are you sure you used the right input file there?
A few different things could be going on that leave you with even fewer reads in the second run. For one, is it possible you started with fewer reads after cutadapt? Compare your starting read counts between the two runs right after the cutadapt step.
Something I like to do with cutadapt, if I know for a fact that my primers are still intact, is to set the --p-discard-untrimmed flag. This discards any read that didn't actually contain the primers. I justify this by saying that anything in there without the primers is not what I targeted. This will toss out a bunch of those potentially unannotated features you're seeing.
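As a rough sketch for single-end reads, that could look something like the following. The primer sequence here is the standard 515F used only as a placeholder, and the filenames are made up, so swap in your own:

```
# Hypothetical example: single-end reads, 515F primer as a placeholder.
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux.qza \
  --p-front GTGYCAGCMGCCGCGGTAA \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-demux.qza \
  --verbose
```

The --verbose output tells you how many reads were discarded per sample, which doubles as a sanity check that your primers really are intact.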
Next, in your first run you trimmed 65 nt from the 5' end and truncated at 240 on the 3' end. In the second run you actually trimmed less from the 5' end, 40 nt, and truncated at the same 240 position. Looking at your quality plots, I'm not surprised you're seeing such a massive drop at the initial filtering step of DADA2: unfortunately your quality scores are quite low at the 5' end. If I were you I would trim at least 51 nt to get past that initial dip in quality scores, and I'd also try a second run trimming 80+ nt to see if you get past that second odd dip too. Luckily, even if you trim that much from the 5' end you should still capture a good chunk of the V4 region, which should still give you decent resolution.
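For example, the first suggestion would look roughly like this (single-end denoising, with hypothetical filenames):

```
# Hypothetical example: trim 51 nt from the 5' end, truncate at 240.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed-demux.qza \
  --p-trim-left 51 \
  --p-trunc-len 240 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```

Then compare denoising-stats.qza between the 51 nt and 80+ nt runs to see which one carries more reads through the initial filter.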
Now moving to your classification issues, again two things come to mind. First, did you see @Keegan-Evans's recommendation about filtering out host reads?
There's a previous discussion on this if you want to see an example of what to do here. You can use a much lower % identity and % alignment than what I did a few years ago in that post. The idea is just to filter your reads and toss away anything that doesn't look like 16S; we're not looking for perfect matches at that step.
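A minimal sketch of that kind of filtering is below. The reference file is a placeholder for whatever 16S reference you use, and the permissive thresholds are illustrative values you'd tune yourself, not recommendations:

```
# Hypothetical example: keep only ASVs with even a loose BLAST hit to 16S.
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences reference-16S-seqs.qza \
  --p-method blast \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.5 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

# Keep only the hits in your feature table.
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file hits.qza \
  --o-filtered-table table-16S-only.qza
```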
The second thing I noticed is that when you extract reads for training the feature-classifier, you truncate at 200 with no trimming, while your DADA2 parameters have you doing a 40/240 trim/truncate. What's essentially happening is that the region you extracted and trained your classifier on is different from the region your ASVs represent (see the sketch below for how to line them up). So, as @Keegan-Evans already mentioned:
I'm not sure if that will resolve all your classification issues, but it should certainly improve them! I'm of the opinion that a lot of your reads are coming from host cells and are not true bacterial reads. One easy way to check is to find some of those ASV sequences that aren't hitting past the Domain/Phylum level and BLAST a few of them to see what they actually hit. The rep-seqs visualizer already provides hyperlinks to BLASTn, so it's super easy to do.
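On the classifier side, here's a rough sketch of extracting and training so the region matches your DADA2 settings. The primers (515F/806R), reference filenames, and the 51/240 values (matching the trim suggestion above) are all placeholders you'd adapt to your own setup:

```
# Hypothetical example: extract the amplicon region, then trim/truncate
# to mirror the DADA2 parameters so classifier and ASVs cover the same span.
qiime feature-classifier extract-reads \
  --i-sequences reference-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --p-trim-left 51 \
  --p-trunc-len 240 \
  --o-reads ref-seqs-extracted.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-extracted.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza
```

And if you need to regenerate the rep-seqs visualizer with those BLASTn links, that's just `qiime feature-table tabulate-seqs --i-data rep-seqs.qza --o-visualization rep-seqs.qzv`.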
Try these out and get back to us, hopefully we'll see some improvements!