My libraries contain reads from _two_ variable regions - how can I proceed with the analysis?

bmb22 · May 28, 2018, 1:13pm

Hi everyone, I tried to find a similar post on here but didn't see one, so my apologies if this is a duplicate.
I have microbiome data from honeybee guts (38 samples). I am trying to understand the community structure, so I am looking for taxonomy bar charts, alpha and beta diversity, etc., similar to the moving pictures tutorial.
Initially, I amplified the variable regions, both the V1-V3 and the V4, using NextFlex kits from Bioo Scientific. Everything was pooled for sequencing on MiSeq, 300 cycles, paired ends. This was my first time doing this sort of experiment, and I didn't consider that the indexes for both kits were the same, so, for example, all my day 1, hive 1 reads have the same index, whether they are V1-V3 or V4 regions.
I am new to Qiime (and Qiime2) and I would like to try to replicate the data types from the moving pictures tutorial, but I am not sure what the best steps are for classification. The regions are different enough that they shouldn't align to the same area, so I can run them through Silva, but I worry that some species will be greatly over/underrepresented if I analyze them all together, depending on the reference databases used. Is this a reasonable concern, or should it not matter? I know that taxonomic resolution for different species can be better/worse depending on the region.
Is there a way to align sequences to the V1-V3 region, throw out all "unaligned" and then use that file and realign to the V4 region? In my brain, this feels like a logical approach but I am unsure how to execute it.

So far I have run DADA2 and generated summary tables from that. Any suggestions on how to proceed would be amazing!

Thank you so much
Brittany

Nicholas_Bokulich · May 28, 2018, 1:53pm

Hi @bmb22!

So you want to analyze data from both regions together? Some notes:

Have you seen this topic thread?
You basically have three options.
See also this thread.
Sounds like you have a job for q2-fragment-insertion.

Or do you just want to separate out the different variable regions to analyze separately, since they were combined by mistake? If so:

Use extract-reads to trim your reference sequences to the different variable regions
Use exclude-seqs to select the genes that align to each variable region within some % similarity (I am not sure what is a good setting for this, but the different variable regions are probably dissimilar enough that you can be fairly flexible with this).

That should split them out sufficiently.
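In case it helps, here is a rough sketch of what those two steps could look like. It is a small Python script that only assembles and prints the QIIME 2 command lines rather than running them. The file names, the primer pair (the widely used 515F/806R V4 primers, which may not match your kit), and the 0.8 thresholds are all placeholders I picked for illustration, so please swap in the primers from your kit's documentation and experiment with the similarity settings.

```python
import shlex

# Placeholder primers: the commonly used 515F/806R V4 pair.
# Assumption: replace these with the primers documented for your own kit.
V4_FWD = "GTGCCAGCMGCCGCGGTAA"
V4_REV = "GGACTACHVGGGTWTCTAAT"


def extract_reads_cmd(ref_seqs, f_primer, r_primer, out_reads):
    """Build the command that trims reference sequences to one primer-delimited region."""
    return [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", ref_seqs,
        "--p-f-primer", f_primer,
        "--p-r-primer", r_primer,
        "--o-reads", out_reads,
    ]


def exclude_seqs_cmd(query, reference, hits, misses, identity=0.8, aligned=0.8):
    """Build the command that splits query sequences into reference hits and misses.

    The 0.8 defaults are arbitrary placeholders, not recommended settings.
    """
    return [
        "qiime", "quality-control", "exclude-seqs",
        "--i-query-sequences", query,
        "--i-reference-sequences", reference,
        "--p-method", "vsearch",
        "--p-perc-identity", str(identity),
        "--p-perc-query-aligned", str(aligned),
        "--o-sequence-hits", hits,
        "--o-sequence-misses", misses,
    ]


# Hypothetical file names, for illustration only.
commands = [
    extract_reads_cmd("ref-seqs.qza", V4_FWD, V4_REV, "ref-seqs-v4.qza"),
    exclude_seqs_cmd("rep-seqs.qza", "ref-seqs-v4.qza",
                     "rep-seqs-v4.qza", "rep-seqs-not-v4.qza"),
]

for cmd in commands:
    # Print each command in a copy-pasteable, shell-quoted form.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

In this sketch the "hits" output would hold the sequences that look like V4 and the "misses" output everything else. Repeating the same two steps with the V1-V3 primers would let you check that the leftovers really are V1-V3 reads.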

I hope that helps! Let us know if you have any more questions.

bmb22 · May 30, 2018, 12:52am

Thank you for your reply! I think it would have been better if I had designed it so the different regions had different indexes and could have pulled each out during demultiplexing.
Now, given what I have, I am running into a number of problems and questions. I apologize for my naivety on the topic; I literally have no bioinformatics background except for the last few weeks!

The V1-V3 region from the kit is ~650 bp, but I did the 300 cycles PE (and only the reverse primer has the index). Therefore, I am not sure these can overlap for joining, and I am not sure how to use Qiime here.
The V4 region is 450 bp and thus should overlap fine and work in downstream analysis. I really want to analyze it all using the same approach to be consistent though, do you have any suggestions?
If I were to do the extract-reads and exclude-seqs, is this before or after quality filtering? Right now I took my demultiplexed reads, joined paired ends (of course not sure if this worked for the V1-V3 region though), then quality filtered and checked for/removed chimeras. I am not exactly sure if this order is appropriate or when I should be doing each step.
From my understanding, the next step I should take is OTU picking, but I don't know how to do that for the V1-V3 (even if I have extracted those sequences).

Any suggestions would be greatly appreciated! Thanks for your continued help!!!

Nicholas_Bokulich · May 30, 2018, 1:04pm

Hi @bmb22,

I think I have a better understanding of the problem now.

Using different indexes for each region would definitely have been much easier.

Yep, sounds like you do not have enough overlap to use paired-end for V1-V3. You will need to analyze just the forward or reverse reads as single-end data. See q2-cutadapt for details on demultiplexing with barcodes inside the reads.
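To make the overlap arithmetic concrete, here is a quick back-of-the-envelope sketch. It assumes each read is 300 nt long (my reading of "300 cycles PE"; if the run was actually 2x150, neither region would overlap) and uses 12 nt as the minimum overlap needed for joining, which is an assumed threshold you should check against whatever joining tool you use. Quality trimming shortens reads further, so real overlaps will be smaller than these numbers.

```python
READ_LENGTH = 300   # assumption: 300 nt per read in each direction
MIN_OVERLAP = 12    # assumption: minimum overlap required to join a read pair

# Approximate amplicon lengths as described in this thread.
AMPLICON_LENGTHS = {"V1-V3": 650, "V4": 450}


def expected_overlap(read_length, amplicon_length):
    """Overlap between forward and reverse reads; negative means a gap."""
    return 2 * read_length - amplicon_length


overlaps = {
    region: expected_overlap(READ_LENGTH, length)
    for region, length in AMPLICON_LENGTHS.items()
}

for region, overlap in overlaps.items():
    status = "can be joined" if overlap >= MIN_OVERLAP else "cannot be joined"
    print(f"{region}: expected overlap {overlap} nt, {status}")
```

Under those assumptions the V1-V3 reads leave a gap of about 50 nt between the forward and reverse read, while the V4 reads overlap by about 150 nt.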

Thanks for that information. That actually makes this much easier, because now you can split out these regions during demultiplexing. Your reverse reads probably look like this:

barcode linker primer read
bbbbbbbbbLLppppppppppACTGACTGATCGATCGTAGC

So in your sample metadata file (where barcodes are listed), just add the linker and part of the primer sequence (any conserved section at the 5' end; cut it off before any degenerate bases) to your barcode sequences. Since those linkers/primers will be different for each of your primer sets, the two regions will demultiplex into separate samples. Voila.
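Here is a minimal sketch of how those extended barcodes could be built. Everything specific in it is hypothetical: the sample names, the 9 nt barcodes, the 2 nt linker, and the two reverse-primer sequences are stand-ins for the values in your kit's documentation, and the column names are just ones I chose for the example.

```python
UNAMBIGUOUS_BASES = set("ACGT")


def conserved_prefix(primer):
    """Return the leading part of a primer, stopping before the first degenerate base."""
    prefix = []
    for base in primer.upper():
        if base not in UNAMBIGUOUS_BASES:
            break
        prefix.append(base)
    return "".join(prefix)


def composite_barcode(barcode, linker, primer):
    """Concatenate barcode, linker, and the conserved 5' section of the primer."""
    return barcode + linker + conserved_prefix(primer)


# All values below are hypothetical placeholders, not taken from any kit.
LINKER = "GT"
REGION_PRIMERS = {
    "V4": "GGACTACHVGGGTWTCTAAT",   # contains degenerate bases (H, V, W)
    "V1V3": "ATTACCGCGGCTGCTGG",    # no degenerate bases, so it is used in full
}
SAMPLE_BARCODES = {
    "day1-hive1": "AAGGTTCCA",
    "day1-hive2": "CCTTGGAAC",
}

# One metadata row per sample and region, tab-separated.
rows = ["sample-id\tbarcode-sequence\tregion"]
for sample, barcode in SAMPLE_BARCODES.items():
    for region, primer in REGION_PRIMERS.items():
        extended = composite_barcode(barcode, LINKER, primer)
        rows.append(f"{sample}-{region}\t{extended}\t{region}")

print("\n".join(rows))
```

Each original sample ends up with two rows, one per region, so the same index now maps to two distinct extended barcodes. It is worth checking that the conserved primer sections really do differ between your two primer sets, and that your demultiplexing step accepts barcodes of different lengths.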

Note that you will also need to use q2-cutadapt to trim out any remaining primer segments from your reads following the tutorial linked above.
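As a rough guide, the demultiplexing and trimming steps might look something like the commands assembled below. This is only a sketch that prints the command lines: the file names, the metadata column name, and the leftover primer fragment are hypothetical placeholders, and the error rate is just an example value to adjust.

```python
import shlex


def demux_single_cmd(seqs, metadata, column, demuxed, untrimmed, error_rate=0.1):
    """Build the command that demultiplexes single-end reads with in-read barcodes."""
    return [
        "qiime", "cutadapt", "demux-single",
        "--i-seqs", seqs,
        "--m-barcodes-file", metadata,
        "--m-barcodes-column", column,
        "--p-error-rate", str(error_rate),
        "--o-per-sample-sequences", demuxed,
        "--o-untrimmed-sequences", untrimmed,
    ]


def trim_single_cmd(demuxed, front, trimmed):
    """Build the command that trims a remaining 5' primer fragment from each read."""
    return [
        "qiime", "cutadapt", "trim-single",
        "--i-demultiplexed-sequences", demuxed,
        "--p-front", front,
        "--o-trimmed-sequences", trimmed,
    ]


# Hypothetical file names and a hypothetical leftover primer fragment.
commands = [
    demux_single_cmd("multiplexed-reverse-reads.qza", "sample-metadata.tsv",
                     "barcode-sequence", "demux.qza", "untrimmed.qza"),
    trim_single_cmd("demux.qza", "HVGGGTWTCTAAT", "demux-trimmed.qza"),
]

for cmd in commands:
    # Print each command in a copy-pasteable, shell-quoted form.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The fragment passed to the trimming step should be whatever part of the primer is left after the section you appended to the barcode, so it will differ between your two primer sets.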

Don't worry about extract-reads and exclude-seqs anymore; splitting via demultiplexing will be easier. But otherwise this would actually occur after denoising/OTU picking, so much later in the workflow.

Have you had a chance to check out our tutorials? In particular, the moving pictures tutorial could give you a good sense of what to do after demultiplexing. We recommend denoising methods (dada2 or deblur) as described in that tutorial as a replacement for OTU picking — however, if OTU picking is your preference, see this tutorial for some options.

I hope that helps!