qiime phylogeny error + DADA2/other questions

Hello! I'm re-doing an analysis on mixed amplicons (16S and ITS) and ran into an error during the qiime phylogeny step (see the first screenshot) that I didn't get previously. I've been following the Moving Pictures tutorial with Casava 1.8 paired-end demultiplexed fastq data, and my samples are also dual indexed. Comparing my notes from last time, the change I made was to the trimming/truncating parameters.

Previous analysis:

--p-trim-left-f 54
--p-trim-left-r 54
--p-trunc-len-f 301
--p-trunc-len-r 280

Current analysis:

--p-trim-left-f 61
--p-trim-left-r 64
--p-trunc-len-f 300
--p-trunc-len-r 220

My rationale for changing the trimming parameters: last time I forgot to account for the indexes, so I added them in the current analysis. The lengths of my primers + indexes are below:

  • 16S F + index = 53 + 8 = 61
  • 16S R + index = 56 + 8 = 64
  • ITS F + index = 57 + 8 = 65
  • ITS R + index = 58 + 8 = 66

I'm still learning and I have multiple questions:

  1. My samples are paired-end de-multiplexed (Casava1.8 paired-end), dual indexed, and mixed amplicons. Should I be using "qiime cutadapt demux-paired" to remove my indexes? If so, would this be before DADA2?
  2. If I use qiime cutadapt to remove my barcodes (or indexes) before DADA2, does this mean it was wrong of me to include the indexes when I was trimming? (My gut feeling is saying yes)
  3. Because I have mixed amplicons, should I be separating ITS and 16S before DADA2? If so, how and what resources would be a good starting point?
  4. Related to Question 3, I saw a previous post which said that DADA2 is fine with mixed amplicons and suggested using "qiime feature-classifier extract-reads". Would others recommend this approach? If I were to use this approach, which primer length would I use when trimming (16S or ITS)?
  5. Do my current truncating parameters seem reasonable (forward: 300, reverse: 220)? I based them on the sequence quality plots, but for the reverse I was torn between 220, 240, and 260.
  6. For the current qiime phylogeny error I am seeing, I've looked at previous posts and I believe the error is because of RAM issues. I'm not able to increase my RAM on my laptop without seriously overworking it, so is this issue even solvable? (I have a feeling I messed up earlier in my workflow and I'm hoping if I resolve the previous issues, that this error won't occur :crossed_fingers:)

I understand I'm asking a lot of questions in one post and I have read other posts with similar issues, but it was difficult to keep track of all the information and I felt that the type of data I am working with (paired end, dual indexed, and mixed amplicons) confused me more when reading other posts. Thank you in advance for any help and suggestions!

Hello @MBugay,

Thank you for this detailed question! There is a lot to cover here, so hopefully I can help answer some of your questions.

First, this data set is unconventional in several ways, i.e. mixing regions and sequencing using Illumina primers instead of amplicon primers. Someone gave you a challenge! :stuck_out_tongue_winking_eye:


If you want to or need to use cutadapt, then you should run it before DADA2. But, you may be able to remove your primers just as easily by using the --p-trim-left-* settings.
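
If it helps, a minimal sketch of that cutadapt step could look like the following. The artifact name is a placeholder and the 16S primer region (515F/926R) is just an example; what you pass to --p-front-* depends on what is actually sitting at the 5' end of your reads:

# placeholder artifact names; 515F/926R shown only as an example primer pair
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end-mix.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r CCGYCAATTYMTTTRAGTTT \
  --o-trimmed-sequences primer-trimmed-demux.qza \
  --verbose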

Yes, you only need to trim off barcodes and primers once, so do it either in DADA2 or in cutadapt, not both.

I think dada2 should work well either way, but splitting your amplicons into 16S and ITS lets you follow a more conventional pipeline as shown in the tutorials, which may be helpful.

Yep, use feature-classifier extract-reads. Run it twice, once with primers for ITS and again with primers for 16S.

Sure, those are a good place to start. You may want to adjust them so more of your reads join later on, but you will find out once you get to that step!


The big one: :deciduous_tree: :evergreen_tree: :palm_tree:

Good news: you should be running the MSA and tree building separately for 16S and ITS, and that might solve your RAM issue! Try splitting up your data set and see if it works with the RAM you have.
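
The per-amplicon tree building could look something like this (file names are placeholders for your split rep-seqs; repeat for ITS if you need an ITS tree):

# placeholder file names; build the tree from the 16S-only rep-seqs
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs-16S.qza \
  --o-alignment aligned-rep-seqs-16S.qza \
  --o-masked-alignment masked-aligned-rep-seqs-16S.qza \
  --o-tree unrooted-tree-16S.qza \
  --o-rooted-tree rooted-tree-16S.qza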

Keep in touch! :whale2:

2 Likes

Hello @colinbrislawn

Thank you for answering! Your responses helped a lot and I have some follow-up questions.

  1. Would you recommend I run DADA2 twice (using the respective 16S and ITS primers when trimming) and then run feature-classifier extract-reads to separate the amplicons?
  2. When using feature-classifier extract-reads, it asks for --p-f-primer TEXT and --p-r-primer TEXT. I added overhang adapters to my primers, so should I include those overhangs in the primer text or no?
  3. Since I would run feature-classifier extract-reads twice for 16S and ITS, should I have a separate metadata file for both or is it okay if I have one metadata file (with two columns for 16S and ITS primers)?

Again, thank you so much for answering my questions! I will probably ask more questions in the future too! :whale2:

Sure thing!

I may have messed up slightly. I'm recommending starting by splitting up your data set into 16S and ITS as a 'step zero.' I thought the extract-reads plugin would be a great way to do that, but I'm not sure it's the best method if your data does not include the full forward and reverse primers.

(It might work fine, but you will have to try it and see.)

  1. DADA2. Yes, I would run dada2 separately on my 16S and ITS data, because I can pick different trimming settings for the two different amplicon lengths.
  2. --p-f-primer and --p-r-primer: I'm not sure what should be included... Try it and see!
  3. Metadata. For downstream diversity analysis, you could totally use a single sheet because the ITS and 16S are coming from the same sample.
1 Like

I see what you mean about splitting the data set as a 'step zero'. My data does include the full forward and reverse primers.

I'll try with DADA2 --> extract-reads first and see how it goes!

1 Like

I actually had another follow-up question I forgot to ask earlier.

When running extract-reads, it asks for trimming and truncating parameters. Since I plan on using DADA2 beforehand and trimming/truncating during DADA2, I would need to set the trimming/truncating parameters to zero, correct?

Thank you so much for answering my questions!

Interesting! :thinking:

From the docs:

Reads are extracted, trimmed, and filtered in the following order:
1. reads are extracted in specified orientation;
2. primers are removed;
3. reads longer than `max_length` are removed;
4. reads are trimmed with `trim_right`;
5. reads are truncated to `trunc_len`;
6. reads are trimmed with `trim_left`;
7. reads shorter than `min_length` are removed.

This trimming is a post-processing step within the read-extraction step, which we are using as step zero.

0.0. pull amplicons with matching primers
0.1 trim and truncate those amplicons
1.0 quality filter
2.0 dada2
2.1 trim and truncate primers

So, the same data is being pushed through this pipeline, and the trimming steps take place almost back to back, so I guess it does not matter if you trim & truncate during the extract-reads step or the dada2 step.

As with all trim & truncate steps, we can adjust later if we discover issues.

2 Likes

Thanks for the quick response.

I had a feeling it wouldn't matter if I trimmed & truncated during the extract-reads step or DADA2 step, so long as I didn't trim & truncate twice (i.e., trim & truncate in both steps).

I'll make sure to update my progress!

1 Like

Here is a small update on my progress and a question.

I've been able to run DADA2 without any issues and I am reading more about extract-reads as this is my next step. I may be overthinking this and confusing myself, but should I train my feature-classifiers prior to extract-reads? Since taxonomic analysis will occur later on, it seems like a good idea to train the classifier now as I will need to download reference databases for extract-reads anyway.

I read the Training feature classifiers with q2-feature-classifier doc and noticed that it has a section titled "Extract reference reads", which seems to match the extract-reads doc. It seemed that the outputs are the same, but I could be wrong and would appreciate clarity.

For 16S, I am using 515F/926R for the V4-V5 region and planned on using SILVA for the database. For ITS, I am using ITS86F/ITS4R for the ITS2 region and planned on using UNITE.

Thank you! :hatching_chick:

Awesome!

Sure, you could train a classifier now, or you could just get one here.

Just like you are using extract-reads as a preprocessing step to get the region you want, the pipeline for training a classifier also uses extract-reads in the same way for the same reason.

But you may not need to build your own classifier if the premade ones work fine.
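
If you do end up training your own, the general shape is roughly this (reference file names are placeholders for whichever SILVA/UNITE release you download; 515F/926R shown as the 16S example):

# placeholder reference file names; trim the reference to your amplicon region,
# then train a Naive Bayes classifier on the trimmed reference
qiime feature-classifier extract-reads \
  --i-sequences silva-ref-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --o-reads silva-515f-926r-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-515f-926r-seqs.qza \
  --i-reference-taxonomy silva-ref-taxonomy.qza \
  --o-classifier silva-515f-926r-classifier.qza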

Lovely! Let me know how it goes.

1 Like

Thank you for your patience. I have another update.

I actually attempted extract-reads -> dada2, but it failed and I received this error saying that my input was wrong (I provided SampleData[PairedEndSequencesWithQuality] when it should have been FeatureData[Sequence]). I wasn't sure if there was another way to go from SampleData to FeatureData[Sequence], so I re-did dada2.

Please let me know if I misunderstood extract-reads and if there actually is another way to go from SampleData to FeatureData[Sequence].

1) Attempt of extract-reads --> dada2 (failed)
I planned on just extracting reads with the primers then trimming and truncating during dada2, but I feel that I should have set up the trimming parameters at this step.

qiime feature-classifier extract-reads \
  --i-sequences demux-paired-end-mix.qza \
  --p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGYCAGCMGCCGCGGTAA \
  --p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGYCAATTYMTTTRAGTTT \
  --p-trim-right 0 \
  --p-trunc-len 0 \
  --p-trim-left 0 \
  --p-min-length 0 \
  --p-max-length 0 \
  --p-identity 0.8 \
  --o-reads demux-paired-end-16S_v1.qza \
  --verbose
Usage: qiime feature-classifier extract-reads [OPTIONS]
(1/1) Invalid value for '--i-sequences': Expected an artifact of at least
  type FeatureData[Sequence]. An artifact of type
  SampleData[PairedEndSequencesWithQuality] was provided.

2) Attempt of dada2 --> extract-reads
For dada2, I set the trunc parameters but left the trim parameters at 0; then, during extract-reads, I set the trim parameters to match the primer lengths (without adapters). At the time, truncating in dada2 and trimming in extract-reads made sense, but now I'm worried that I should have done the truncating and trimming in one step, and that the adapters are still in the sequences. The generated rep-seqs with the extracted reads have shorter sequence lengths than I expected, and the BLAST results are dubious.
rep-seqs-16S_v1.qzv (212.7 KB)

Should I have set up different parameters at each step?

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end-mix.qza \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 280 \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --o-table table-dada2-mix_v1.qza \
  --o-representative-sequences rep-seqs-dada2-mix_v1.qza \
  --o-denoising-stats stats-dada2-mix_v1.qza \
  --verbose
Saved FeatureTable[Frequency] to: table-dada2-mix_v1.qza
Saved FeatureData[Sequence] to: rep-seqs-dada2-mix_v1.qza
Saved SampleData[DADA2Stats] to: stats-dada2-mix_v1.qza

qiime feature-classifier extract-reads \
  --i-sequences rep-seqs-dada2-mix_v1.qza \
  --p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGYCAGCMGCCGCGGTAA \
  --p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGYCAATTYMTTTRAGTTT \
  --p-trim-right 56 \
  --p-trunc-len 0 \
  --p-trim-left 53 \
  --o-reads rep-seqs-16S_v1.qza \
  --verbose
Saved FeatureData[Sequence] to: rep-seqs-16S_v1.qza

qiime metadata tabulate \
  --m-input-file stats-dada2-mix_v1.qza \
  --o-visualization stats-dada2-mix_v1.qzv
Saved Visualization to: stats-dada2-mix_v1.qzv

qiime feature-table summarize \
  --i-table table-dada2-mix_v1.qza \
  --o-visualization table-dada2-mix_v1.qzv \
  --m-sample-metadata-file FileName.txt
Saved Visualization to: table-dada2-mix_v1.qzv

qiime feature-table tabulate-seqs \
  --i-data rep-seqs-16S_v1.qza \
  --o-visualization rep-seqs-16S_v1.qzv
Saved Visualization to: rep-seqs-16S_v1.qzv

Thank you!

Hello again!

I'm not sure what's best. My understanding is that these trim and truncation settings can happen equally well in either step, because they happen back-to-back in the pipeline.

However, extract-reads does not work with paired-end reads, so it will only work after dada2, as you discovered. Thus, Attempt 2 is the only option supported right now.

You will have to align and inspect your reads after dada2 and after extract-reads to see if you are getting the region you expect / want. Let me know if you need a hand setting up this MSA to inspect the alignments.
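
In case it saves you a step, here is the kind of quick MSA I had in mind (the export directory name is a placeholder):

# align the extracted rep-seqs so you can eyeball where the reads start and end
qiime alignment mafft \
  --i-sequences rep-seqs-16S_v1.qza \
  --o-alignment aligned-rep-seqs-16S_v1.qza

# export to FASTA (placeholder output directory) and open it in any alignment viewer
qiime tools export \
  --input-path aligned-rep-seqs-16S_v1.qza \
  --output-path aligned-rep-seqs-16S_v1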


Now is also a good time to zoom out. :telescope:

I most often see the extract-reads plugin used to pull amplified regions from full-length genes. I'm not sure it's the best way to split your reads into 16S and ITS.

Maybe a different method should be considered as a step 0.

SortMeRNA (https://bioinfo.lifl.fr/RNA/sortmerna/) was initially designed to split 16S reads from transcriptomic reads, and vsearch gives lots of options for fast alignments, so maybe another program could do this job faster and better.

EDIT: Also consider these q2-plugins:

  • q2-cutadapt (if primers are in the reads; this splits the reads while also removing the primers)
  • q2-quality-control, using exclude-seqs (post-denoising) or filter-reads (for demuxed seqs pre-denoising). This is via alignment to a reference, so primers do not need to be present.

These options are especially advantageous because they remain within Q2 and track all provenance. :point_left:
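
For the exclude-seqs route, a rough sketch might look like this (the reference artifact name is a placeholder, and the method/identity settings would need benchmarking):

# placeholder reference artifact; ASVs that align to the 16S reference go to "hits",
# everything else (presumably ITS) goes to "misses"
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs-dada2-mix_v1.qza \
  --i-reference-sequences silva-ref-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.8 \
  --o-sequence-hits rep-seqs-16S.qza \
  --o-sequence-misses rep-seqs-not-16S.qza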

Some of my mod friends (hey, @lizgehret @SoilRotifer @jwdebelius !) were talking about how we don't have a full tutorial for splitting up multi-amplicon Illumina libraries. I would love to benchmark some methods for doing this, pick a good one, then write it up into a tutorial.

Thank you for venturing into this uncharted undocumented territory and helping us understand what works. :trophy: :100:

1 Like

Nick had a good idea about those primers you listed:

theoretically it should also work just fine as long as primers are present (again, not a position statement, just that I would hold out for a benchmark before saying that it will not work for this purpose — it is not intended for full-length seqs, only assumes that the primers are present in the read). In this case, one issue with using this method is that the primer seqs are super long and I suspect contain non-biological seqs, maybe linker seq or something like that.

That assumption (that the primers themselves are present in the reads) does not quite square with the very long adapter + primer strings you listed, though.

Try using just the primer region, not the full linker-primer-barcode thing.

Thanks for the feedback!

Yes, I would appreciate help with setting up this MSA.

I'll try the other suggested step 0 methods, and I'll try using just the primer region. I have some follow-up questions.

  1. To clarify, cut-adapt or exclude-seqs would have to occur post-dada2 while filter-reads would be pre-dada2, correct?
  2. When re-doing dada2 --> extract reads with just the primer, would you recommend I then follow up with cutadapt or another plugin to remove the linkers and barcodes?

I briefly considered this and I doubt this is the most logical option, but would it be possible to do dada2 (set trunc and trim to 0) --> extract-reads --> dada2 (with appropriate trunc and trim parameters)? I was envisioning the first dada2 step as a means to join the reads. However, after further reading and thinking, I realized this probably didn't make sense as you need to trim then merge pairs for dada2.

Although I do want to continue working with qiime2, this process is taking much longer than I expected and hindering my research. I'm interested in looking further into any R tutorials that also deal with separating multi-amplicon libraries. Do you have any suggestions?

Thank you for assisting me in this undocumented territory!

I would suggest the other way around, but I think multiple options are possible.

I'm not sure that's needed, as the extract reads plugin should cut out the middle section of the read, leaving out the primers on the outside edge.

I agree, I don't think that's the best use of the pipeline...

I don't right now. As I mentioned, this is on my todo list precisely because I don't know of a comprehensive method to do this and a benchmark showing how well it works.

This would make a great paper! We just need someone to write it... :sleeping:

2 Likes

Here is another update! I was able to meet with a new post-doc in one of my collaborator's labs and she described her process for splitting mixed amplicon libraries (see below), which was very helpful.

  1. fastqc to check Illumina adapters
  2. Trimmomatic to remove any adapters/barcodes and/or overhangs
  3. cut-adapt to trim primer sequences
  4. fastqc to check that Illumina adapters have been removed
  5. demux summary to check if there is overlap for pairing
  6. dada2 (setting trimming parameters to zero)

I'm currently trying her process out, so I will send another update once I have some outputs.

1 Like

Thanks for keeping me updated!

In that six-step process, which step splits the amplicons? As far as I can tell, it looks like all the different regions get passed right into dada2. This is not a problem, but I think the dada2 results will have ASVs from all the different regions, still un-split...

Sorry for any confusion!

To clarify, they suggested two options for splitting the amplicons:

  1. Run cutadapt with the primer sequence for each amplicon and use --p-discard-untrimmed, so that I am left with only the reads containing the respective 16S or ITS primers; then denoise each set separately and continue with downstream analysis (rough sketch after this list)
  2. Cut all the adapters, don't discard anything, and continue with downstream analysis (not recommended, as I would later have to merge taxa, filter reads, etc.)
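
As I understand it, option 1 for the 16S pass would look roughly like this (artifact names are placeholders, and I would do a second pass with the ITS primers):

# placeholder artifact names; keep only read pairs where the 16S primers were found
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end-mix.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r CCGYCAATTYMTTTRAGTTT \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux-16S-only.qza \
  --verbose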

Reading through their process again and reading more about trimmomatic, the process seems similar to your previous suggestion of cut-adapt prior to denoising. I wonder if I could also do:

  1. fastqc to check Illumina adapters
  2. cut-adapt to remove adapters
  3. fastqc to check that adapters have been removed
  4. cut-adapt to trim primer sequences (and possibly linkers) and discard the untrimmed
  5. demux summary to check if there is overlap for pairing
  6. dada2 (setting trimming parameters to zero)

Got it! That makes sense to me.

I think it is!

Keep me posted. If you find something that works well (or doesn't! :grimacing:), that provides a starting point for a paper.

2 Likes

I have a good update!

So far, I have found success with the following:

  1. fastqc to check for adapters
  2. trimgalore to remove Illumina adapters
  3. check with fastqc to ensure Illumina adapters are removed
  4. (after importing trimmed sequences to qiime2) cut-adapt of respective primer sequences (16S or ITS) and using --p-discard-untrimmed to remove the untrimmed
  5. demux summarize
  6. dada2 with trimming parameters set to zero
  7. continue with downstream analysis (taxonomy, etc)

I was able to get through generating the tree for phylogenetic diversity analyses (this is where I was getting stuck before), as well as the alpha and beta diversity analyses. I am currently at the taxonomic analysis part. I was able to run all of this locally up until the taxonomy step (not enough RAM), but I have access to an HPC that I am familiarizing myself with.
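
My rough plan for the taxonomy step on the HPC is something like this (the classifier name is a placeholder for whichever premade or self-trained classifier I end up using):

# placeholder classifier; classify the 16S rep-seqs, then repeat for ITS with a UNITE-based classifier
qiime feature-classifier classify-sklearn \
  --i-classifier silva-515f-926r-classifier.qza \
  --i-reads rep-seqs-16S.qza \
  --o-classification taxonomy-16S.qza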

1 Like