Demultiplex 16S hypervariable regions

apc · June 24, 2021, 4:36pm

Hi,

While looking at the de-multiplexing tutorial for the cutadapt plug-in, I was wondering: what about fastq files that have already been de-multiplexed by sample barcode but still contain reads from different 16S hypervariable regions in the same file?

In other words, I have a fastq file that contains reads with 180-210 bp length (which corresponds to the V3 16S rRNA H-region) and reads with 250-280 bp length (which corresponds to the V4 16S rRNA H-region). Therefore, before filtering and de-noising the files I would need to split them into one file for each hypervariable region.

One solution could be to split them by primer sequence, searching each read for the primer that corresponds to each H-region. But what about reads in which the primer/adapter sequences have already been removed, or cases in which the primer sequences are not available (private information)?

For example: the Ion Torrent 16S metagenomics Kit utilises two mixes of primers to amplify different combinations of H-regions but the primer sequences are not publicly available.

To overcome this problem I have used my own Python scripts to split them by length, but this is not best practice, since read length is a noisy measure. A better solution would be to align the reads to an appropriate reference and assign an H-region to each read. Is there any Standard Operating Procedure to perform this task with QIIME?
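For readers who want to try the length-based split described above, here is a minimal sketch. It is not the script used in this post: the function names are hypothetical, the length windows are simply the 180-210 bp and 250-280 bp ranges quoted above, and it assumes plain four-line FASTQ records (no wrapped sequences, no gzip).

```python
from itertools import islice

# Length windows taken from the ranges quoted in this post; adjust for your data.
REGION_WINDOWS = {"V3": (180, 210), "V4": (250, 280)}


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of FASTQ lines.

    Assumes exactly four lines per record; a trailing partial record is ignored.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def assign_region(length, windows=REGION_WINDOWS):
    """Return the region whose window contains the length, else 'unassigned'."""
    for region, (low, high) in windows.items():
        if low <= length <= high:
            return region
    return "unassigned"


def split_by_length(lines, windows=REGION_WINDOWS):
    """Group FASTQ records into per-region lists according to sequence length."""
    bins = {region: [] for region in windows}
    bins["unassigned"] = []
    for record in read_fastq(lines):
        bins[assign_region(len(record[1]), windows)].append(record)
    return bins
```

Calling `split_by_length(open("sample.fastq"))` (the file name is a placeholder) would return a dictionary with one list of records per region, plus an "unassigned" list for reads that fall outside every window, which is worth inspecting before discarding.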

Thank you very much for your attention.

Have a great day!

AP

colinbrislawn · June 25, 2021, 1:41pm

Good morning,

Not yet. How to handle multiple hypervariable regions at once is still an open question.

I noticed you mentioned the Ion Torrent 16S kit and its secret primers. I know other members of the forum have tackled this before with reasonable success. Are you working with this kit too, or a similar kit from a different manufacturer?

Splitting by length could be a pretty good solution! When you plot the distribution of read lengths, do they truly look random, or do you see a bimodal distribution with peaks around the V3 and V4 lengths? If so, I think it would be defensible to split your reads based on these lengths, especially if you had paired-end reads that could be joined to prevent overlap between regions.

16S ------V3====V3--V4=====V4-----
R1        >----------> overlapping regions :-(
R2    <----------<
joined    >------< just V3 :-)
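One quick way to check for the bimodal pattern mentioned above is to tabulate read lengths before splitting. The sketch below is only an illustration: the function names and the 10 bp bin width are assumptions, and it expects plain four-line FASTQ records.

```python
from collections import Counter


def length_histogram(lines, bin_width=10):
    """Count reads per length bin, keyed by the lower edge of each bin."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the sequence is the second line of each record
            length = len(line.strip())
            counts[(length // bin_width) * bin_width] += 1
    return counts


def print_histogram(counts, bin_width=10, max_bar=50):
    """Print a text histogram with one row per length bin, including empty bins."""
    if not counts:
        return
    peak = max(counts.values())
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        n = counts.get(start, 0)
        bar = "#" * round(max_bar * n / peak)
        print(f"{start:>4}-{start + bin_width - 1:<4} {n:>8} {bar}")
```

Two clearly separated peaks with near-empty bins between them would support a length-based split; a broad or continuous distribution would suggest that length alone is not enough.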

I agree that aligning the reads to a reference would be better, but I don't know of a QIIME 2 plugin that does this 'out of the box'.

Colin

apc · June 28, 2021, 10:12am

Good morning @colinbrislawn!

"Are you working with this kit too, or a similar kit from a different manufacturer?"

Yes, we have used it once. The situation we had to deal with was exactly the same as the one described in the post you pasted. Thank you!

"do they truly look random, or do you see a bimodal distribution with peaks around the V3 and V4 lengths?"

Since Thermo Fisher sent us only the V3 and V4 amplicons, the read length distribution looks bimodal, so, as you mentioned, it was easy to split them by length. I was thinking about a more complex situation, in which all the amplicons are in the same fastq.

It might be useful to have a plug-in to deal with this situation, especially for people who have little experience with microbiomes. In any case, this post provides a wide range of possible solutions.

Thank you very much for the discussion!