I’m trying to do some analysis using SRA reads from a mock community that was published here.
The paper said that using their pipeline (mostly Mothur with SILVA database) they could observe all the bacteria that were actually present in the sample.
I tried several runs of QIIME2 with CutAdapt to remove primers, then DADA2 with different trimming lengths, but none of them managed to detect all of the bacteria in the mock sample.
Any advice on which parameters to play with?
Many thanks for your help!
That publication appears to note that the authors could detect all bacterial genera, but not species. Are you looking at genus level or species level for your assessments?
Since you are working with mock communities, you can change any parameter you like to try to optimize the results. Unless the results look really bad, I’d suggest starting with the taxonomy classifiers and working backwards. E.g., lowering the confidence settings will most likely lead to higher recall (detection of expected species) at the cost of precision (i.e., detection of more false positives!). See here for more details:
Thank you for your kind reply and suggestion!
To answer your question: yes, I'm looking at the genus level, like what the authors did.
I also have an update:
I tried using lower trimming lengths in DADA2, and it worked.
The raw reads are 251 bp, and they sequenced the V4 region of the 16S rRNA gene (about 254 bp).
Previously I used CutAdapt then DADA2 with 220 and 240 bp as trimming length values, but could not detect all genera (one genus was missing).
I then tried DADA2 with 140, 160, 180, and 200 bp as trimming length values (without CutAdapt) and these worked, but I'm not sure why. Also, even though the genus in question can be detected by lowering the trimming length values, its feature frequency is very small (below 10).
Do you think this is because the raw data (the reverse reads) do not have enough good-quality bases after 200 bp?
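One quick sanity check on those trimming lengths is whether the trimmed forward and reverse reads can still overlap enough to be merged. The sketch below does that arithmetic, assuming the 254 bp amplicon length quoted above, the same length applied to both reads, and a minimum merge overlap of 12 bp (an assumption here, not something stated in the thread). The function name `merge_overlap` is made up for illustration.

```python
def merge_overlap(trunc_f, trunc_r, amplicon_len=254):
    """Bases shared by the forward and reverse reads after truncation."""
    return trunc_f + trunc_r - amplicon_len


MIN_OVERLAP = 12  # assumed minimum overlap needed to merge a read pair

# The trimming lengths tried in this thread, applied to both reads.
for trunc in (140, 160, 180, 200, 220, 240):
    overlap = merge_overlap(trunc, trunc)
    status = "ok" if overlap >= MIN_OVERLAP else "too short to merge"
    print(f"trim length {trunc}/{trunc}: overlap {overlap} bp ({status})")
```

Under these assumptions every setting leaves plenty of overlap, so merging alone would not explain the missing genus at 220 and 240 bp. That would be consistent with read quality being the issue, but it is only a rough check.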
Please find attached the demux file below:
Mock_QMINI_demux.qzv (291.7 KB)
I wonder if this finding would be applicable to my actual samples (whose actual genera and composition are unknown)?
Many thanks for your time and kind help!
Based on my understanding as a QIIME2 user:
DADA2 corrects nucleotide sequence errors and generates ASVs. When specific trim lengths are set, DADA2 will trim all longer sequences to that specified length, and “throw away” any sequences that are shorter than that specified trim length. Setting the trim length longer might omit slightly shorter sequences, while setting shorter trim lengths may cut into the variable region.
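The length rule described above can be sketched in a few lines of Python. This is only a toy illustration of "cut longer reads, drop shorter ones", not DADA2's actual code; it ignores quality filtering and error correction, and the function name and reads are invented for the example.

```python
def truncate_reads(reads, trunc_len):
    """Cut reads to trunc_len bases and drop reads shorter than trunc_len."""
    kept = []
    for read in reads:
        if len(read) >= trunc_len:
            kept.append(read[:trunc_len])
    return kept


reads = ["ACGTACGTAC", "ACGTACG", "ACGTACGTACGT"]  # toy reads: 10, 7, and 12 bases

print(truncate_reads(reads, 8))   # the 7-base read is dropped, the others are cut to 8
print(truncate_reads(reads, 12))  # only the 12-base read is long enough
```

A longer setting keeps more bases per read but discards every read that does not reach it, which is the trade-off described above.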
The QIIME2 “feature-classifier” plugin then takes the ASVs (denoised by DADA2) and classifies them against a reference database using a trained classifier. If I’m not mistaken, @Nicholas_Bokulich’s comment refers to the feature-classifier parameters rather than the DADA2 parameters.
I recently read the feature-classifier paper, and if I remember correctly, a lot of the suggested parameters were based on V4 region data. There is a table with k-mer size and confidence parameter suggestions based on the user’s analytical needs. The “recall” settings should classify more of the ASVs, but some of those classifications may include false-positive taxonomic assignments.
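To make the recall versus precision trade-off concrete for a mock community, here is a minimal sketch in Python. The genus names, confidence values, and thresholds are all invented for illustration (they are not from the paper or from any real classifier output), and the helper functions are hypothetical rather than part of QIIME2.

```python
def genus_calls(assignments, min_confidence):
    """Genera whose assignment confidence meets the threshold."""
    return {genus for genus, conf in assignments if conf >= min_confidence}


def recall_precision(observed, expected):
    """Recall and precision of observed genera against the expected set."""
    true_pos = observed & expected
    recall = len(true_pos) / len(expected)
    precision = len(true_pos) / len(observed) if observed else 0.0
    return recall, precision


# Hypothetical mock community and classifier output as (genus, confidence).
expected = {"Bacillus", "Escherichia", "Listeria", "Staphylococcus"}
assignments = [
    ("Bacillus", 0.99),
    ("Escherichia", 0.95),
    ("Listeria", 0.72),
    ("Staphylococcus", 0.55),
    ("Salmonella", 0.60),  # not in the mock community: a false positive
]

for threshold in (0.7, 0.5):
    observed = genus_calls(assignments, threshold)
    recall, precision = recall_precision(observed, expected)
    print(f"confidence >= {threshold}: recall {recall:.2f}, precision {precision:.2f}")
```

In this toy data, lowering the threshold from 0.7 to 0.5 recovers the last expected genus (recall rises from 0.75 to 1.00) but also lets in a genus that is not in the mock community (precision falls from 1.00 to 0.80).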