Separating two different amplicons from demultiplexed data

Hello QIIME2 world!

I recently received my Illumina MiSeq paired-end sequencing files and, although new to bioinformatics, I’m excited to learn QIIME2 and see where it takes me.
I was able to successfully import my fastq.gz files using the Casava 1.8 paired-end demultiplexed fastq format, but before removing primers and denoising, I might need to separate my files into two (if only working with F reads) or four (if working with F & R). I designed my sequencing run by pooling indexed ITS1 and SSU amplicons from the same sample, so both amplicons of a given sample carry the same Nextera XT indexes.

I found this post that (I think) deals with a similar approach, but I would be using two different databases to classify. I assume I could use the second approach (Excluding sequences by alignment) mentioned in the post by @Nicholas_Bokulich, but I wonder where I would feed in the primer sequences of the two different regions.

I would appreciate your thoughts!
Jean

Hello Jean,

Welcome to the QIIME 2 forum. I'm glad you found that post, as it gives a really good overview of your options.

but I wonder where would I feed the primer sequences of the two different regions.

Because you want to get reads from ITS1 and SSU, you basically do each filtering step twice. Say you filter your full data set once to get all your ITS1 reads, then filter your full data set again to get all your SSU reads. So while you can't filter for two regions in one step, you can filter twice, once for each region you want.

You probably already found this, but this page has lots of filtering examples:
https://docs.qiime2.org/2018.4/tutorials/filtering/

Let me know if that helps,
Colin

Hi Colin,

Thanks for the reply :slight_smile:
I did read through the Filtering tutorials but couldn’t find one that stated you could filter using primer sequences. Could you please advise me on which filtering method would help me separate the two different amplicons contained in the same fastq.gz file? It seems like most of them filter based on a metadata table, which wouldn’t work in my case.

Also, would you recommend filtering before or after DADA2?

Regards,
Jean

This topic might be useful for you.

If primers are contained in the sequences, you may be able to use q2-cutadapt to separate out different marker regions prior to dada2, adding part of the primer sequences to the barcode sequence listed in your mapping file, as described in that topic. Note that you will want to trim out primers before running dada2, anyway, so this may be the way to go.

Alternatively, use exclude-seqs after dada2. There is no need to use primer sequences (though you could if primers are still in your samples); you would just use two different reference databases for filtering.

I hope that helps!

Hi Nicholas,

Would this approach still work if my sequences were dual indexed for sequencing and were demultiplexed by the sequencing facility?
Primers are contained in the sequences, but each sample’s R1 and R2 files contain sequences belonging to the two different amplicons and thus two different sets of primers. I’m a newbie in the topic and might be missing the point, but it seems like all the options in q2-cutadapt are for single barcodes.

If q2-cutadapt is not an option, then exclude-seqs might be the way to go after dada2, using the alignment as the excluding factor. In this case, can I run dada2 with the sequences containing primers?

Thanks,
Jean

Sorry, missed that point yesterday — if the data are already demultiplexed, you cannot go the first route. Would need to run dada2 and then use exclude-seqs later. Or get the raw, multiplexed data from the sequencing center if that's even possible.

q2-cutadapt can still be used to trim the primers from each end on dual-indexed reads with the trim-paired method. You are correct, though — demux-paired cannot handle dual-indexed reads for demultiplexing.

:+1:

You can run dada2 on sequences that still contain primers, but you should not if you are using degenerate primers. Instead:

  1. Use q2-cutadapt trim-paired to trim primers twice (once for each of your two primer sets)
  2. Run dada2
  3. Use exclude-seqs to exclude by alignment against each reference database. No primers required.
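
A minimal sketch of step 1, assuming both primer pairs sit at the 5' ends of the reads. The file names and the `SSU_FWD`/`SSU_REV` values are placeholders, not values from this thread; the ITS1 pair is the one quoted further down in the thread. The `run` helper is a hypothetical convenience that only prints each command unless `RUN=1` is set, so the script can be previewed first.

```shell
#!/usr/bin/env bash
# Preview helper (hypothetical): print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Trim one primer pair from the 5' ends of paired-end reads.
# Args: input.qza forward-primer reverse-primer output.qza
trim_primer_pair() {
  run qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$1" \
    --p-front-f "$2" \
    --p-front-r "$3" \
    --o-trimmed-sequences "$4"
}

# Placeholders (assumption): replace with your real SSU primers before running.
SSU_FWD="${SSU_FWD:-REPLACE_WITH_SSU_FORWARD_PRIMER}"
SSU_REV="${SSU_REV:-REPLACE_WITH_SSU_REVERSE_PRIMER}"

# Pass 1 trims the ITS1 primers.
trim_primer_pair demux.qza CTTGGTCATTTAGAGGAAGTAA GCTGCGTTCTTCATCGATGC trimmed-its.qza
# Pass 2 trims the SSU primers from the output of pass 1.
trim_primer_pair trimmed-its.qza "$SSU_FWD" "$SSU_REV" trimmed-both.qza
```

The second pass takes the output of the first as its input, so reads from either amplicon should end up primer-free in `trimmed-both.qza` before dada2.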

Hi Nicholas,

In my case, none of my primers are degenerate, so I ran dada2 with sequences containing primers. However, since both regions are targeting fungi, having them contained in the same files will likely result in abundance overestimation. If I trim both primers, then I’ll lose the way in which I can separate them. Is there a command that would separate all sequences carrying non-degenerate primer X from those carrying primer Y?

Thanks

Or is that exactly what qiime quality-control exclude-seqs does with the blastn-short method? If it is, sorry for the misunderstanding :see_no_evil:

You can separate them with quality-control exclude-seqs, aligning against each separate reference database. You can do this whether or not you trim out the primers. Because the SSU and ITS amplicons should be very different, it should be no problem to separate using this method.
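
Roughly, that could look like the sketch below: one exclude-seqs run per reference database, with the hits from each run being the amplicons of that region. The file names (`rep-seqs.qza`, `its-ref-seqs.qza`, `ssu-ref-seqs.qza`) are assumptions for illustration, and the `run` helper is a hypothetical wrapper that only prints the commands unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Preview helper (hypothetical): print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Align representative sequences against one reference; write hits and misses.
# Args: rep-seqs.qza reference.qza output-label
split_by_reference() {
  run qiime quality-control exclude-seqs \
    --i-query-sequences "$1" \
    --i-reference-sequences "$2" \
    --p-method blast \
    --o-sequence-hits "$3-hits.qza" \
    --o-sequence-misses "$3-misses.qza"
}

# ITS1 amplicons are expected in its-hits.qza.
split_by_reference rep-seqs.qza its-ref-seqs.qza its
# SSU amplicons are expected in ssu-hits.qza.
split_by_reference rep-seqs.qza ssu-ref-seqs.qza ssu
```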

You figured out the answer:

Right!

Since your primers are not degenerate, it does not matter that you kept them in. So you can use exclude-seqs either aligning against your primer sequences (faster) or against the reference databases (slower but possibly more reliable).
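
As a quick sanity check outside QIIME 2 (this is not what exclude-seqs does internally, just exact prefix matching), you could count how many exported representative sequences start with each forward primer. The sketch assumes the sequences were exported to plain FASTA, still carry the forward primer at the 5' end, and that primers are given in uppercase; `count_with_primer` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count FASTA records whose sequence starts with the given primer (exact match).
# Args: primer fasta-file
count_with_primer() {
  awk -v primer="$1" '
    function score() { if (seq != "" && index(seq, primer) == 1) n++ }
    /^>/ { score(); seq = ""; next }   # header line: score the previous record
    { seq = seq toupper($0) }          # join wrapped sequence lines
    END { score(); print n + 0 }
  ' "$2"
}

# Usage, once the representative sequences have been exported to FASTA
# (the file name and SSU_FWD are assumptions):
#   count_with_primer CTTGGTCATTTAGAGGAAGTAA dna-sequences.fasta
#   count_with_primer "$SSU_FWD" dna-sequences.fasta
```

If the two counts add up to roughly the total number of sequences, the primers are a usable handle for separating the amplicons.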

I hope that helps!

Thanks a lot, Nicholas! All this info is very helpful.

In the exclude-seqs method, would the query-seqs be my dada2 output file and the reference-seqs a FeatureData[Sequence] artifact built from sequences downloaded from UNITE and MaarjAM for ITS and SSU, respectively?

:+1:

:+1:

good luck!

Hi!

I’m having some trouble getting exclude-seqs to filter my sequences. To summarize what I have done upstream: I imported the data in Casava 1.8 paired-end format, then used

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end-subset.qza \
  --p-front-f TATAAGAGACAG \
  --p-front-r GTATAAGAGACAG \
  --o-trimmed-sequences trim-demux-paired-end-subset.qza \
  --verbose

to trim the Illumina overhangs that were in front of the primers.

Then, I ran DADA2:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trim-demux-paired-end-subset.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 277 \
  --p-trunc-len-r 242 \
  --o-table trim-subset-table.qza \
  --o-representative-sequences trimmed-subset-rep-seqs.qza \
  --o-denoising-stats trimmed-subset-denoising-stats.qza \
  --verbose

Then I imported the UNITE FASTA file:

qiime tools import \
  --input-path sh_refs_qiime_ver7_97_01.12.2017.fasta \
  --output-path unite-ref-seqs.qza \
  --type 'FeatureData[Sequence]'

Now, I’m trying the exclude-seqs command:

qiime quality-control exclude-seqs \
  --i-query-sequences trimmed-subset-rep-seqs.qza \
  --i-reference-sequences unite-ref-seqs.qza \
  --p-method blast \
  --o-sequence-hits trimmed-unite-hits.qza \
  --o-sequence-misses trim-unite-misses.qza \
  --verbose

but my hits file is empty; all sequences are going to misses.

Is there something I’m missing? I suspect overhangs might still be in the sequences even though I ran cutadapt. When I separate them using primer sequences:

qiime feature-classifier extract-reads \
  --i-sequences trim-subset-rep-seqs.qza \
  --p-f-primer CTTGGTCATTTAGAGGAAGTAA \
  --p-r-primer GCTGCGTTCTTCATCGATGC \
  --o-reads extr-ITS-seqs.qza \
  --verbose

I do get sequences in my output files, but I still want to try using the reference database.

Side question: Does the last command (extract-reads based on primer sequence) also trim primers from sequences?

Thanks!

I'd recommend exporting and checking out some of your sequences. Are they very short? Are there overhangs or other non-biological sequences present?
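
One way to do that check, as a rough sketch: export the representative sequences to FASTA and tabulate their lengths. `length_histogram` is a hypothetical helper, and the paths in the usage comments are assumptions.

```shell
#!/usr/bin/env bash
# Print "length<TAB>count" for each sequence length in a FASTA file, shortest first.
# Args: fasta-file
length_histogram() {
  awk '
    function tally() { if (len > 0) counts[len]++ }
    /^>/ { tally(); len = 0; next }   # header line: close the previous record
    { len += length($0) }             # add up wrapped sequence lines
    END { tally(); for (l in counts) print l "\t" counts[l] }
  ' "$1" | sort -n
}

# Usage (older QIIME 2 releases may expect --output-dir instead of --output-path):
#   qiime tools export --input-path trimmed-subset-rep-seqs.qza --output-path exported
#   length_histogram exported/dna-sequences.fasta
#   # Count lines that still contain the forward overhang:
#   grep -c TATAAGAGACAG exported/dna-sequences.fasta
```

Many very short sequences, or a non-zero overhang count, would suggest that trimming or merging did not behave as expected.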

You could also try fiddling around with parameters like perc-identity or perc-query-aligned. :violin:
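
For example, a small sweep like the one below would show at which thresholds hits start to appear. The threshold values are arbitrary examples, the output names are made up, and the `run` helper is a hypothetical wrapper that only prints the commands unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Preview helper (hypothetical): print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Run exclude-seqs with explicit identity and coverage thresholds (fractions of 1).
# Args: perc-identity perc-query-aligned
exclude_with_thresholds() {
  run qiime quality-control exclude-seqs \
    --i-query-sequences trimmed-subset-rep-seqs.qza \
    --i-reference-sequences unite-ref-seqs.qza \
    --p-method blast \
    --p-perc-identity "$1" \
    --p-perc-query-aligned "$2" \
    --o-sequence-hits "unite-hits-id$1-aln$2.qza" \
    --o-sequence-misses "unite-misses-id$1-aln$2.qza"
}

# Loosen both thresholds step by step.
for threshold in 0.97 0.90 0.80; do
  exclude_with_thresholds "$threshold" "$threshold"
done
```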

Yes, I believe primers are trimmed out by that command.

Does that help? Let us know if you have more questions!
