I’m working on 5 different studies with 16S amplicon (3 studies: V3-V4, 1 study: V1-V2, 1 study: V4-V5).
Consequently speaking, only 2 studies are not assigned taxonomy with silva-full-length-pre-trained classifier, so I tried to trace back. And I found out that I got too low features in OTU clustering and denoised feature-table in the 2 studies (study with 180 samples, but only 1~2 features are contained in few of samples (50 or 60 at maximum), and the rest features are contained in 1~2 samples).
To troubleshoot what I did wrong on the analysis procedure, I briefly write down below what I’ve done:
Import fastq (individual study, respectively)
cut primer sequences with qiime plugin referring to their papers
as one of studies are single ended, I decided to use only forward read for other studies
quality control with qiime plugin
denoise with dada2 single with same trunc length parameter (qiime dada2 denoise-single) without --p-max-ee parameter
OTU clustering (closed reference) with silva-SSU-99 reference with --p-perc-identity 0.97 parameter
visualise feature-table of individual study and figure out that too low number of features are involved in samples in 2 studies (2 studies are V3-V4 amplicon)
No problem found in other studies (e.g several features are included in all samples/almost all of samples)
After taxonomy analysis with OTU data, taxonomy is not assigned on the 2 studies while other studies are well assigned with silva-full-length-pre-trained classifier
Taxonomy analysis is done with OTU data with OTU clustering, right? I searched a lot on here, but I can’t find any clues on my case. I will really appreciate if you give me an opinion.
You pipeline mostly makes sense, but I think there are a few checks you may want to revisit
Are you discarding untrimmed reads when you do primer trimming? Have you checked how many reads you're losing at this step? I would recommend making sure you're not discarding everything if the primers have already been trimmed.
I would recommend this for a primer pair, but not for all the studies overall.
You should not do quality control before you run DADA2. DADA2 assumes that your reads not quality filtered, so I would recommend either using Deblur if you want to quality filter, or not doing that step. I would recommend the same parameters within the same hypervariable region, but would suggest that you can do variable regions specific primers.
I think the fact that you're tracing back to the ASV table is smart, and would recommend checking your read statistics in each step.
I again think this is smart it lets you scaffold all the sequences using consistent identifiers.
I don't understand this step. Why are you doing taxonomic assignment after you did closed refernece OTU clustering? One of the beauties of closed reference clustering is that the taxonomy is already assigned and the tree is already built for you. You just need to import them into QIIME 2 and you're ready to go!
So, I think I'd suggest the following modifications
Check your sequences after primer trimming for each study to make sure that reads have not been discarded
Use paired end reads for regions where you have paired end reads (except V13, that needs to be single end if its Illumina)
Either perform quality filtering and denoise with deblur or use DADA2 alone
Don't try to do taxonomic classification
Keep checking the number of reads retained at each step
Keep using consistent parameters within a single region
Hi @jwdebelius ! Thank you so much for your super detailed and kind comments. They made me think all process again and more in depth.
Nop, I didn't state the option, "--p-discard-untrimmed" on my command, so I didn't discard untrimmed reads.
I haven't checked this.. So you recommend comparing raw-demux.qzv file and primer-filtered-demux.qzv files and check how many reads are filtered out, right? And the smaller/none of reads are filtered out would be the best?
Sorry, I don't understand this. So you recommend using paired-end as paired-end (not using only forward) and single-end as single-end? this post said I should use forward read only when I deal with single-end and paired-end studies.
One more question on here: When I denoise them using dada2, I found out samples from one study lose most of their reads on merge steps and I couldn't make them be merged with changing trim/trunc length many times. Would it be fine to use just forward read on this study and use paired-end with other studies?
This is my mistake. I'll try again referring to this comment.
Is it fine to trim/trunc different parameter if studies are different hypervariable region data?
Maybe what I understand on this process is wrong. After OTU clustering, what I can get is OTU files (file which cluster reads analyzed as being from same species somehow; e.g OTU_1, OTU_2, OTU_3...), table.qza which is containing which feature/species(shown like "GQ448970.1.1284" on the table.qzv) exist in how many samples, and rep-seqs.qza which is containing representative sequence(shown like "AB506202.1.1522" on the rep-seqs.qzv) of specific feature. We don't know what "GQ448970.1.1284" is, we don't know whose "AB506202.1.1522" is.
This is why classifier needs and it assign taxonomy on top of those code-like names so that we can finally see through bar-plots with friendly(?) names like E.coli, S.epidermidis. Am I on wrong direction..?
This is why I love QIIME2 and this forum. I'm not alone
I'd recommend explicitly specifying that you don't want the behavior with the --p-no-discard-untrimmed. I never remember which is the default behavior, so I try to specify with things like that.
Yes! (I say this after weeks of frustration around losing reads and backtracking on various projects.) Checking at each step is really helpful to see where you're losing samples and letting you fix your parameters
I think the key for denoising it's externally valid (the ASVs are hte same) for the same primer pair, algorithm, and parameters. So, you treat each region/primer pair as unique. If you have single end reads for V4, then you treat all the V4 as forward and trim them to the shortest length. If you have all paired for V12, maybe you treat those as paired. You want to be consistent within a region, incause you want to compare those ASVs directly. (Although other may have other thoughts on this, so maybe some more mods will -in.
The database you used (Silva? GTDB?) has a sequence file with those names connected to the IDs. Its probably the file you used to build your classifier, if you built a classifier. Otherwise, you should be able to download it where you got your representative sequences.
Default of it is "False", but I'll try with specifying to be sure.
This is really good tip for my situation of now. I've checked that there are almost no (or completely no) reads loss after cutadapt phase now. I'll keep tracking the status after denoise and afterward processes
Ah, now I get it. So if my single-end data is V3-V4 data, I should treat paired-end data with V3-V4 region as single-end by using only forward, and it would be fine to deal with other paired-end with different region (e.g V4, V1-V2) as paired-end because V3-V4, V1-V2, and V4 regional data are all unique and get unique ASV without bias (like batch effect..?)!
And if I deal with same regional data, I should use same parameter (e.g trim/trunc length) within them, but it doesn't matter between different regional data, right?
I've used SILVA reference data. And I figure out that I've used incorrect way on here (using otu clustering with denoised-table, which is weird way..).. Thank you for correcting here
The ASV sequence you get depends on your primers and parameters. So, if you have a different primer, you're going to have a different ASV. (You have to think about how you want to combine the primers). Using consistent parameters means that all your features within a given region will have the same ASVs (and the same ASV IDs). It does not mean that you won't have other batch effects (i.e. reagent, collection kit, etc); it just means that for a region, the sample IDs will all be the same.