I have the following issue. I am analysing 16S rRNA 250bp paired end mouse gut microbiome sequencing data (V3-V4 region) that have been obtained using Nextera tagmentation technology. Both forward and reverse read quality looks ok to me (see
, where you can see that adapter content throughout my reads is high (varies from 20 to 50% depending on sample, on average probably around 30%). Naturally, after running cutadapt a good fraction of my reads become shorter than 250bp and thus after de-noising with dada2 I get a lot of losses if I try to use trunc-len=250 for both forward and reverse reads (
, where I trimmed the first 10bp of each read; I also increased –-p-max-ee to 5 to try and keep as many reads as possible). My first question is: does anybody have experience with Nextera adapters? Is this usual behavior? I have seen a previous post about this (nextera transposase contamination) but the fraction of reads carrying adapters appeared to be much lower than what I find.
Independent on the Nextera issue I do think my data are still usable, however, I also believe that I need to find a way of keeping more of the reads after filtering and denoising. Given that I have very little wiggle room for cutting if I want my forward and reverse reads to merge I concluded that my best option would be to use only the forward reads and cut them to a length that allows me to keep for most samples >50% of them. This seems to work nicely in terms of sheer volume of reads that are retained (
), however, I wonder how short is too short for cutting my reads. I know there is no absolute answer to that but I would like to get a feel for what sounds ok and what is definitely too short. Is 200bp ok? Could I potentially go shorter? I guess it’s a balancing act between keeping as many reads as I can and keeping the reads as long as possible to get enough resolution for my taxonomic analysis.
I will welcome any comments, ideas and suggestions!
I agree. (Thanks for posting all your graphs, btw)
Yes, and yes, but like you said…
Well said! Trying to classify short reads is discussed in Liu 2008 and the classic paper Wang 2008, and while those discuss pyrosequencing, I imagine this would transfer to Illumina. I guess you could see how much taxa resolution you lose when you trim down to 150 or 100 bp.
Thank you Colin for your answers and all the references!
Can I ask how you would measure taxa resolution? Would you simply look at percentage of assigned species, genus and all the way up the taxonomy for e.g. 150 vs 100 bp?
And also what would you consider a satisfactory % at for example genus level?
Exactly. The classifiers in q2-feature-classifier are meant to (by default) only classify to a depth at which the classifier is confident about the result (based on the reference data). This can be adjusted by changing the confidence threshold (classify-sklearn) or the consensus threshold (classify-consensus-* methods), e.g., to allow less confident classification (not necessarily a good thing!).
There is a QIIME 2 plugin, RESCRIPt, which can create a nice visualization of taxonomic resolution, e.g., number of unclassified taxa at each rank. See here for details:
The degree of resolution for any given taxon is going to depend on the length of the sequences and the marker-gene region. I suggest you just let yourself be guided by the results — unclassified sequences at phylum level are a sign of technical issues, but at genus level lack of classification is usually just lack of resolving power, about which there is little you can do (except choose a different marker gene or longer sequencing read length!), so there is not a satisfactory % for accepting/rejecting a result, only to accept/reject the technology! (and that threshold is decided by you )