We sent 59 samples to a service provider for sequencing of the V1-V9 region using MiSeq. With the demultiplexed data, we have started the QIIME 2 pipeline.
Total length of each amplicon:
V1-V3: 500 ± 10 bp
V4-V6: 550 ± 10 bp
V7-V9: 440 ± 10 bp
I followed the following workflow for paired-end reads.
There’s a lot here, so hopefully I can address at least some of it.
Okay, so, first, how much do your sequence counts differ between the paired-end and single-end picking? Because it sounds like you may have a quality problem overall, rather than simply an issue with the read joining. I would recommend reconsidering your filtering parameters, as well as the read length and overlap for each region.
Okay, next, you’re dealing with multiple hypervariable regions. Do your counts take this into account? That is, are there 30,000 sequences when you combine the V1-3, V4-6, and V7-9 reads from a single sample, or are they separated by region?
The hypervariable region used for sequencing has a big effect on many aspects of the data, larger than a lot of the biological effects you’re likely to see when dealing with a single body site in humans, for example.
My recommendation would be to separate your table by hypervariable region, and then perform parallel analyses targeting what you’re interested in. However, if you want to combine the data, you need to use closed-reference picking rather than a de novo technique.
Without this information, it’s hard to answer the additional questions.
Thanks so much for the information. I definitely agree: single end was the way to go, although I worry about the quality of your reads if you’re losing so many.
The answer to this question depends a lot on the experimental design and how you barcoded and sequenced the data. Did you do the PCR in-house, or did the company do it for you?
You’ll be able to separate the regions if either (a) the samples were assigned the same barcode and sequenced on different runs or (b) the samples were assigned different barcodes. In either case, after demultiplexing, you should have 3 times the number of samples (S1-V13, S1-V46, S1-V79; S2-V13, S2-V46, S2-V79…). In case (a), you’ll need to load each region as its own demultiplexed file, denoise/cluster, and then combine the data. In case (b) you can load all the samples with their unique barcodes, and then process them.
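For case (a), a sketch of what that per-region denoise-then-merge might look like on the command line (the filenames and the truncation length here are placeholders I've made up, not values from this thread):

```shell
# Denoise each region's demultiplexed reads separately (repeat for V4-6, V7-9)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs v13-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 400 \
  --o-table v13-table.qza \
  --o-representative-sequences v13-rep-seqs.qza \
  --o-denoising-stats v13-stats.qza

# Then combine the per-region feature tables into one
qiime feature-table merge \
  --i-tables v13-table.qza \
  --i-tables v46-table.qza \
  --i-tables v79-table.qza \
  --o-merged-table merged-table.qza
```

You'd pick the trim/truncation values per region off each region's own quality plots, since the three amplicons have different lengths.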
If you don’t have that, but you haven’t trimmed the primers, you might be able to tease the demultiplexed files apart based on the primer sequence.
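As a toy illustration of that primer-based splitting (the filenames and the choice of the 27F primer are my assumptions, and real data would need cutadapt-style error tolerance rather than an exact prefix match):

```shell
# Toy sketch only: pull out reads whose sequence line starts with the V1
# forward primer 27F (AGAGTTTGATCMTGGCTCAG), writing the degenerate base M
# as the character class [AC] for grep -E.
printf '>r1\nAGAGTTTGATCATGGCTCAGTTTT\n>r2\nGTGCCAGCAGCCGCGGTAAAAAA\n' > reads.fasta
grep -B1 -E '^AGAGTTTGATC[AC]TGGCTCAG' reads.fasta | grep -v '^--$' > v13-reads.fasta
grep -c '^>' v13-reads.fasta   # prints 1: only r1 carries the V1-V3 primer
```

In practice cutadapt (or the qiime cutadapt plugin) does this properly, with mismatch tolerance and trimming, but the idea is the same: the primer is the only region label you have left.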
I guess my other question, which is sort of high-level and methodological, was about the motivation for sequencing multiple hypervariable regions?
It may be wrong. It will definitely have some biases that make it hard to compare with other datasets. (Or, it will have different biases that make the comparison hard.) It will also likely alter your results for compositionality compared to a single region. This is because there are organism-specific primer biases, and because your ability to detect differences in closely related organisms varies region-to-region.
So, for instance, your ANCOM results might be different. You also (potentially) have multiple, very different ASVs representing the same organism that can’t be combined into a phylogenetic tree. So… I guess, based on that, it is wrong. I think to get something quasi-functional from the tree-building perspective, you might want to run closed-reference picking against a full-length reference database.
I don’t know if there’s been a comparison about how combining multiple regions affects your data (although it’s an interesting technical question.)
Also, I guess the idea of running multiple regions without a specific motivation kind of squicks me.
Unless there’s a specific motivation for running multiple regions (like you want to be able to access specific databases or perform a meta-analysis), I guess I’m wondering why. Like, I don’t think you can assemble reads off the 16S gene to get better taxonomic resolution, which might otherwise be a motivation. And, even in those cases, I think you’d keep the regions separate?
So, I’ve never done this, so I have no idea whether or not it will actually work, but I think I might try using qiime cutadapt demux-single, where the barcode you use is the primer sequence, if it’s still in your data. This should let you split your files into three primer-specific files.
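Something along these lines, where barcodes.tsv would map a made-up region name to its forward primer sequence (all the filenames and the column name here are hypothetical, and I haven't tested this):

```shell
# barcodes.tsv (hypothetical) has a 'BarcodeSequence' column holding each
# region's forward primer; cutadapt then treats it as an in-sequence barcode.
qiime cutadapt demux-single \
  --i-seqs multiplexed-seqs.qza \
  --m-barcodes-file barcodes.tsv \
  --m-barcodes-column BarcodeSequence \
  --p-error-rate 0.1 \
  --o-per-sample-sequences per-region-demux.qza \
  --o-untrimmed-sequences untrimmed.qza
```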
Thank you for your time. I just got the report from my sequence provider. They said that after PCR amplification with the three primer sets, they pooled all three amplicons together and then loaded them into the sequencer. They did not use separate barcodes for the regions. I am left with no choice here.
Is it possible to align my reads at 80% identity against the V1-V3 and V4-V6 regions from 16S rRNA databases, retrieve the mapped reads, and continue with the QIIME 2 analysis?
Coming back to my original query: generally, what is the lowest acceptable sequence count that can be taken for further analysis?
I think based on that, I’d trim to the same length and then do closed-reference picking against your favorite reference database. The database you choose will depend on your sample type: Silva may be better for environmental samples; a lot of people used (and are still using) Greengenes 13_8 for their human guts. So, one of those. At least the identifiers of your sequences will then be shared across regions. (You’re essentially doing a weird sort of mixed meta-analysis.) I think I’d still align at 97% or 99% identity, so you have confidence in what’s there.
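A closed-reference run of that sort might look like this (97% identity shown; the filenames, including the reference, are placeholders for whichever database you settle on):

```shell
qiime vsearch cluster-features-closed-reference \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --i-reference-sequences ref-seqs.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table cr-table.qza \
  --o-clustered-sequences cr-seqs.qza \
  --o-unmatched-sequences unmatched.qza
```

Reads from different regions that hit the same full-length reference sequence then collapse to the same feature ID, which is what makes the regions combinable at all.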
This will let you use a phylogenetic metric, which will be more robust.
But, overall, your results are likely to be influenced by the mix of regions in each sample as much as by anything else.
As far as sequencing depth goes, it’s again an optimisation problem. I tend to work in the 5,000-10,000 sequences/sample range for most of my analyses. I find 1,000 to be really unstable and too low, but 5K-10K works well for me. At your sample size, you need to keep as many high-quality samples as possible, and a 20K rarefaction will let you do that.
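If it helps, the depth check and rarefaction step might be sketched like this (the filenames and the 5,000 depth are placeholders; you'd pick the depth off your own table summary):

```shell
# Summarize per-sample depths to see where a given cutoff would drop samples
qiime feature-table summarize \
  --i-table merged-table.qza \
  --o-visualization table-summary.qzv

# Rarefy at the chosen depth
qiime feature-table rarefy \
  --i-table merged-table.qza \
  --p-sampling-depth 5000 \
  --o-rarefied-table rarefied-table.qza
```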