This isn’t technically what is happening here. The reverse reads only look “better” because the majority of them are now super short, so the reads that weren’t trimmed as heavily appear to be of better quality.
I’ll use your new trimmed-paired-end-demux.qzv as an example here. Hover your mouse over the q-score plot. You’ll see a line in the description that says:
The plot at position 23 was generated using a random sampling of 6060 out of 10000 sequences without replacement[…]
Since by default these plots sample 10,000 random reads for display, we know that about 40% of your reads following cutadapt are now shorter than 23 bp.
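Quick sanity check on that number (assuming the default 10,000-read subsample):

```shell
# Only 6060 of the 10,000 subsampled reads were still >= 23 bp
# at position 23, so ~39% had already been trimmed shorter than that.
awk 'BEGIN { printf "%.1f%%\n", (1 - 6060/10000) * 100 }'
# prints 39.4%
```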
So something weird has happened during your cutadapt step, in particular to your reverse reads because we don’t really see this issue with your forward reads.
These do not look good at all, with the majority of your reads being unassigned. The forward reads are particularly problematic because there isn’t even an option to view beyond level 1 in the taxonomy levels.
The taxonomy issues might be (a) related to how you created your classifier, or (b) you truly have that much host contamination. This is worth raising as its own issue with the RESCRIPt folks once we sort out the feature-table stuff here, to make sure the problem isn’t just with the processing.
I strongly doubt that; we get much better classification with even 100 bp reads than what you’re seeing here. This is a separate issue that is only worth troubleshooting once we sort out the feature-table issue.
You have a few options here; this is what I would do:
- A good starting place is to reprocess with just your forward reads for now
- Instead of removing adaptors, try using your primer sequence in cutadapt. This will look for your primer sequence and remove everything before it, including the adaptors and anything else. You can also set the --p-discard-untrimmed parameter, which adds an extra level of filtering: any read that doesn’t contain the primer gets discarded.
- Denoise, truncating to 100 bp. If you use Deblur, it has a nice negative-filtering step built in that gets rid of crazy-looking reads, which are often host contamination. If you use DADA2, I would recommend doing a similar negative-filtering step afterwards. I find this step super useful when I’m expecting a lot of host contaminants. This thread explains the process in a bit more detail (but keep in mind that the % identity and coverage I used in that thread were too high; I would use something like 60-65% instead now).
- Re-run your feature-classifier and assign taxonomy.
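The forward-reads-only pipeline above might look roughly like this in the QIIME 2 CLI. This is just a sketch: the filenames, the primer sequence, the reference artifact, and the exact parameter values are placeholders you’d swap for your own (I’ve used a 515F primer and the 65% identity/coverage thresholds mentioned above purely as examples).

```shell
# 1. Trim at the primer instead of the adaptor; drop any read with no
#    primer hit. GTGYCAGCMGCCGCGGTAA is an example 515F primer -- use yours.
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux.qza \
  --p-front GTGYCAGCMGCCGCGGTAA \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed.qza

# 2. Denoise forward reads with DADA2, truncating to 100 bp.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed.qza \
  --p-trunc-len 100 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats stats.qza

# 3. Positive filter against a reference (the DADA2 analogue of Deblur's
#    built-in negative filtering); reads that don't hit the reference at
#    ~65% identity/coverage land in misses.qza and can be dropped.
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.65 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

# 4. Re-assign taxonomy with your trained classifier.
qiime feature-classifier classify-sklearn \
  --i-reads hits.qza \
  --i-classifier classifier.qza \
  --o-classification taxonomy.qza
```

After step 3 you’d also want to filter the feature table down to the retained sequences (e.g. with `qiime feature-table filter-features`) so the table and rep-seqs stay in sync.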
This should be your baseline: what you get here is essentially the maximum number of reads you should expect. Once you get a decent output from this step, you can go back and fine-tune further to recover longer reads if you want. As you increase read length, the # of reads you retain will go down. I’m not sure, with your current data, whether you’ll ultimately be able to use the reverse reads and merge with a decent # of reads, but I will say that the gain in taxonomic resolution from going from, say, 100 bp to 150 or 250 bp is not that great. So ultimately, it’s not a big deal if you can’t merge them, or even get much longer reads.
From Wang et al. 2007:
Hope this helps!