I want to use 16S V3-V4 case-control data to train a 16S V4 classifier.
So I need to extract the V4 region from the V3-V4 reads. Is that a good idea in general?
Also, I guess there are two different approaches to the extraction:
1. Extract V4 from each read, and then run the QIIME pipeline (DADA2, feature classifier, etc.).
2. Run the DADA2 step first, and then extract V4 from the resulting ASVs before the feature classifier step.
It looks like the second one should be more reliable. What do you think?
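To make the second approach concrete, here is a rough Python sketch of what would have to happen to the feature table after trimming. Distinct V3-V4 ASVs can become identical once only V4 is kept, so their counts need to be summed. The function name `collapse_trimmed_asvs` and the dict-based table layout are just assumptions for illustration, not anything from QIIME 2.

```python
from collections import defaultdict


def collapse_trimmed_asvs(table, trimmed):
    """Sum per-sample counts of ASVs that share the same trimmed sequence.

    table:   {asv_id: {sample_id: count}}  (hypothetical layout)
    trimmed: {asv_id: trimmed V4 sequence, or None if trimming failed}
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for asv_id, counts in table.items():
        v4 = trimmed.get(asv_id)
        if v4 is None:
            # No V4 sequence could be extracted, so this ASV is dropped.
            continue
        for sample_id, count in counts.items():
            collapsed[v4][sample_id] += count
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {seq: dict(per_sample) for seq, per_sample in collapsed.items()}
```

Any ASV for which the V4 primer sites are not found is silently lost here, which is one way this approach could change the apparent composition of a sample.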
P.S. I assumed single-end reads for simplicity. For paired-end reads, I guess the forward and reverse reads should be merged first, and then the same method we choose for single-end reads can be applied.
I am not sure why you'd do this. Perhaps I am misunderstanding...
Why not simply extract separate V4 and V3V4 amplicon regions from the initial full-length marker gene sequence data, i.e. the available SILVA and GreenGenes files on the Data resources page, or via RESCRIPt?
This would not be a good approach. PCR primer amplification bias would be an issue, especially if you are extracting V4 sequences from data generated with V3V4 primers. That is, V3V4 primers will have different amplification biases than V4 primers. In fact, even different primer sets that target the same region can be biased relative to one another.
The same goes for in silico extraction of V3V4 (e.g. from the full-length SILVA reference sequences) and then using that extracted region to extract V4. You'll bias the V4 output relative to the V3V4 output, depending on how successfully the in silico primer-pair search works across different taxa with different primer-site sequences.
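As a minimal sketch of why an in silico primer search can drop some taxa, the Python below expands degenerate IUPAC primers into a regular expression and returns the region between the two primer sites. The primer sequences in the usage note (515F/806R) are an assumption, since the thread does not say which V4 primers are in use, and `extract_region` is a hypothetical helper. It requires an exact match to the degenerate primer, whereas real tools (for example `qiime feature-classifier extract-reads`) tolerate some mismatches.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_region(seq, fwd_primer, rev_primer):
    """Return the sequence between the primer sites (primers excluded).

    Returns None when either primer site is not matched exactly,
    which is how sequences from some taxa end up being dropped.
    """
    pattern = re.compile(
        primer_to_regex(fwd_primer)
        + "(.+?)"
        + primer_to_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None
```

For example, `extract_region(seq, "GTGYCAGCMGCCGCGGTAA", "GGACTACNVGGGTWTCTAAT")` returns `None` for any sequence whose primer site differs from the degenerate primer by even a single base, so which taxa survive depends on the primer pair chosen.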
If you are trying to merge data from a V3V4 study and a V4 study for some combined analyses, then a word of caution... Just because you can extract the V4 sequences does not make it easier to compare your sequence data across studies, mainly because of the inherent PCR amplification biases between different primer sets. Even closed-reference OTU picking will not help much in this case. Unless you have a way of controlling or minimizing this effect, you may artificially inflate differences among samples simply because of the biases of the different variable regions and primer sets used.
I am not sure if I was able to answer your questions, as it is not clear what you are trying to do.
Hi, Mike. Thank you for the detailed answer.
Yes, that's the case. I want to build a health-disease classifier, but since there are not enough studies that sequenced only the V4 region, I also want to use studies of other regions that overlap V4 (like V3V4). I need exactly V4 in the end, since our labs use V4 primers and the point is to apply the classifier to data from our labs.