How to count occurrences of V1V2 haplotypes in a large set of V1V3 paired-end sequencing samples

I have a set of 23 V1V2 region sequences (my ‘haplotypes’) that I want to screen for and identify in a set of 100+ samples of V1V3 16S amplicon sequence data. I’m a novice user of Qiime2, and my first thought was that I could use DADA2 denoise to assemble the V1V3 regions from each paired-end set in my samples, and then use some external mapper to look for perfect matches to my 23 ‘haplotypes’.

But now I’m wondering if there is a way to do all of this in Qiime2. Is there some way I can read in my 23 V1V2 region sequences (each 322 bp long) and use them as a database against which I can screen my 100+ samples’ worth of V1V3 region amplicon data?

John Martin

I think you can try the following, though it would be better if someone more experienced could weigh in.

First, analyze your “haplotypes”. DADA2 would output representative sequences. I don’t know if you can use them directly as FeatureData[Sequence]. If not, extract the sequences into a FASTA file and import that.

Next, you will need a taxonomy file for the sequences. I’m not sure how to do it with q2. I would try extracting the taxa from the large database used to analyze the “haplotypes”. Import it with the appropriate import type (FeatureData[Taxonomy]).

You will then have the desired database, consisting of one file containing the sequences and one file containing the taxonomic information.
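
A minimal sketch of what those two files could look like, written in plain Python rather than with any QIIME 2 command. The haplotype IDs, sequences, and taxonomy strings are made-up placeholders, and the function name `build_reference` is hypothetical. It assumes the common layout of a FASTA file plus a tab-separated taxonomy file with one `ID<TAB>taxonomy` row per sequence; check which import formats your QIIME 2 version expects before relying on it.

```python
def build_reference(haplotypes):
    """Return (fasta_text, taxonomy_text) for a dict of
    haplotype_id -> (sequence, taxonomy_string)."""
    fasta_lines = []
    taxonomy_lines = []
    for hap_id, (seq, taxonomy) in sorted(haplotypes.items()):
        seq = seq.strip().upper()
        # Reject empty sequences and anything that is not a plain DNA letter.
        if not seq or set(seq) - set("ACGTN"):
            raise ValueError(f"unexpected characters in {hap_id}")
        fasta_lines.append(f">{hap_id}")
        fasta_lines.append(seq)
        # The same ID must appear in both files so they can be joined.
        taxonomy_lines.append(f"{hap_id}\t{taxonomy}")
    return "\n".join(fasta_lines) + "\n", "\n".join(taxonomy_lines) + "\n"


# Placeholder data (assumption): not real haplotypes.
example = {
    "hap01": ("ACGTACGTAC", "g__Staphylococcus; s__epidermidis; strain_A"),
    "hap02": ("ACGTTCGTAC", "g__Staphylococcus; s__aureus; strain_B"),
}
fasta_text, taxonomy_text = build_reference(example)
```

The two returned strings can be written to disk (for example as `haplotypes.fasta` and `haplotypes_taxonomy.tsv`, both hypothetical file names) and then imported.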

Lastly, you might need some programming skills; I don’t think there is a way to accomplish this purely with q2.

Please let us know how it goes; I expect to face a similar problem myself later on.

I do know the species of all the haplotypes; they are Staph species. I know them down to species level, and they are all putative strains of these species. Each haplotype was derived from a clonal source. I will try this out. It sounds like DADA2 will assemble my reads, so the trick will be getting my haplotypes into a taxonomy file I can import.

I am still very interested if anyone can recommend an explicit qiime2 workflow for me to accomplish my goal (assigning assembled V1V3 amplicons to my haplotypes). I have hundreds of files of 16S region specific amplicons I have to push through this analysis, so it would be very nice if this could be done entirely inside qiime2.

Hi @jmartin,
Sounds like you are trying to assign taxonomy essentially, but at the haplotype level. I’d recommend:

  1. Format your haplotypes into a reference database consisting of separate sequence and taxonomy files. See the training a classifier tutorial for examples of these file formats and other relevant info.
  2. Your haplotypes are V1V2, though, so your query sequences should be trimmed to the same region. You can use extract-reads for that; an example is given in that tutorial (though it shows the more common practice of trimming the reference seqs, whereas you want to trim the query seqs).
  3. Classify your query seqs to haplotypes, using your haplotype seqs and taxonomy as the reference. You could use a few different options:
     1) Train a naive-bayes classifier (as shown in that tutorial) for a general-purpose option that will find the nearest haplotype (increase the --p-confidence setting for a more specific match).
     2) Use classify-consensus-vsearch with the exact match mode switched on for an exact match to haplotypes.
     3) Use the hybrid classifier in q2-feature-classifier. It’s currently an experimental method, but it runs option 2 and then option 1 in succession, so you get the “best of both”.
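
For the exact-match case, the logic of steps 2 and 3 can be sketched in plain Python. This is not how extract-reads or classify-consensus-vsearch work internally; it is only an illustration. It assumes the V1V3 query sequences are in the same orientation as the haplotypes and start at the same position (primers already removed), so that a matching query contains the V1V2 haplotype as a prefix. The function name and the toy sequences are hypothetical.

```python
def assign_exact(queries, haplotypes):
    """Map each query ID to the haplotype ID whose sequence exactly
    matches the start of the query, or None if no haplotype matches.

    queries:    dict of query_id -> sequence (e.g. denoised V1V3 reads)
    haplotypes: dict of haplotype_id -> sequence (shorter V1V2 region)
    """
    # Index haplotypes by sequence so each lookup is a dictionary hit.
    by_seq = {seq.upper(): hap_id for hap_id, seq in haplotypes.items()}
    # Try the longest haplotype length first so the longest match wins.
    lengths = sorted({len(seq) for seq in by_seq}, reverse=True)
    assignments = {}
    for query_id, seq in queries.items():
        seq = seq.upper()
        hit = None
        for n in lengths:
            # Trim the query to the haplotype length, then require identity.
            hit = by_seq.get(seq[:n])
            if hit is not None:
                break
        assignments[query_id] = hit
    return assignments


# Toy data (assumption): real haplotypes would be about 322 bp long.
toy_haplotypes = {"hap01": "ACGTAC", "hap02": "ACGTTC"}
toy_queries = {
    "asv1": "ACGTACGGGT",  # starts with hap01
    "asv2": "ACGTTCGGGT",  # starts with hap02
    "asv3": "ACGAACGGGT",  # one mismatch, so no exact match
}
assignments = assign_exact(toy_queries, toy_haplotypes)
```

If the forward primers or read orientation differ between the haplotypes and the samples, the prefix assumption breaks and the queries need to be trimmed to the V1V2 region by primer site instead.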

I hope that helps!

Thank you for the suggestion! I’ll review the training a classifier tutorial and see how I can fit my experiment into a similar workflow. Regarding classifying the sequences, I specifically need exact matches only, so I think your 2nd classifier option is what I’ll try.

My primary goal is to set up a workflow that I can push large numbers of samples through. I appreciate the help!
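
Once each feature has a haplotype label, counting occurrences per sample comes down to summing feature counts by label. A small sketch in plain Python, assuming a feature table exported as nested dictionaries (sample -> feature -> read count) and a feature-to-haplotype mapping; both structures, the IDs, and the function name are hypothetical stand-ins for whatever the pipeline exports.

```python
from collections import Counter


def count_haplotypes(table, feature_to_haplotype):
    """Sum read counts per haplotype within each sample.

    table:                dict of sample_id -> {feature_id: read_count}
    feature_to_haplotype: dict of feature_id -> haplotype_id or None
    """
    per_sample = {}
    for sample_id, feature_counts in table.items():
        counts = Counter()
        for feature_id, n in feature_counts.items():
            hap = feature_to_haplotype.get(feature_id)
            if hap is not None:  # skip features with no haplotype match
                counts[hap] += n
        per_sample[sample_id] = dict(counts)
    return per_sample


# Toy data (assumption): made-up samples, features, and counts.
table = {
    "sampleA": {"asv1": 120, "asv2": 30, "asv3": 5},
    "sampleB": {"asv1": 10, "asv4": 7},
}
mapping = {"asv1": "hap01", "asv2": "hap02", "asv3": None, "asv4": "hap01"}
totals = count_haplotypes(table, mapping)
```

Because the function only needs two dictionaries, the same call works unchanged whether the table holds a handful of samples or several hundred.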
