Can I create one OTU table from mixed ITS1, ITS2, and full ITS fungal datasets?

Hi QIIME 2 community,

I am working on a fungal ITS meta-analysis using data from 77 BioProjects. The data include Illumina paired-end reads and Ion Torrent, GridION, and PacBio single-end reads. The datasets target different ITS regions: ITS1, ITS2, and full ITS / ALL.

Because some datasets have abnormal or low-information quality scores, DADA2 is not suitable. For example, one dataset has quality scores almost entirely at Q3 and Q30, so DADA2 cannot learn a reliable error model. I am therefore considering a VSEARCH-based workflow:

1. ITSxpress trimming to ITS1, ITS2, or ALL
2. dereplication
3. chimera removal
4. de novo OTU clustering at 97%
5. taxonomy assignment using UNITE

My goal is to obtain one final table for downstream comparison across all BioProjects and platforms.

I am unsure whether I should:

1. Combine ITS1, ITS2, and full ITS reads after ITSxpress trimming and perform one global 97% OTU clustering, producing one mixed-region OTU table; or
2. Process ITS1, ITS2, and full ITS separately, generate region-specific OTU tables, assign taxonomy with the same UNITE database, collapse to a shared taxonomic level such as genus or species, and then merge the taxonomy tables.

I understand that ITS1, ITS2, and full ITS are different regions, so a single mixed-region OTU table may not be biologically comparable even if it is technically possible.

What would be the recommended QIIME 2 approach for this type of heterogeneous fungal ITS meta-analysis? Is a single mixed-region OTU97 table acceptable, or should I merge only after taxonomy assignment/collapsing?

Thank you!

Hi @budixin36, this is definitely a bit non-standard, but I'll share my thoughts. I would join the tables after taxonomy assignment, and all of that taxonomy assignment should be done against the same reference database and database version (it sounds like that's your plan, but I wanted to state it explicitly). My reasoning: in a mixed-region OTU table, the same organism would always show up as different OTUs when it is detected through different regions, whereas if you join the tables after taxonomy assignment there is at least a chance that those reads end up in the same feature (the taxonomy label, in this case).
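
To make the "join after taxonomy assignment" idea concrete, here is a minimal Python sketch of the logic only. It assumes each region has already been processed separately and collapsed to genus-level labels from the same UNITE release. The function name, the nested-dict table layout, and the sample and genus labels are all made up for illustration; in practice you would use QIIME 2's own table collapsing and merging actions rather than doing this by hand.

```python
from collections import defaultdict


def merge_collapsed_tables(tables):
    """Merge tables shaped like {sample_id: {taxon_label: count}}.

    Hypothetical helper: counts are summed whenever the same taxon label
    appears for the same sample in more than one input table.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample_id, counts in table.items():
            for taxon, count in counts.items():
                merged[sample_id][taxon] += count

    # Fill absent taxa with zero so every sample shares one set of features.
    all_taxa = sorted({t for counts in merged.values() for t in counts})
    return {s: {t: merged[s].get(t, 0) for t in all_taxa} for s in merged}


# Toy genus-level tables, one per target region (illustrative values only).
its1_table = {"S1": {"g__Aspergillus": 10, "g__Candida": 5}}
its2_table = {"S2": {"g__Aspergillus": 7, "g__Malassezia": 3}}

merged = merge_collapsed_tables([its1_table, its2_table])
for sample_id, counts in merged.items():
    print(sample_id, counts)
```

The point of the sketch is that `g__Aspergillus` becomes a single shared feature for both samples, which could not happen with region-specific de novo OTU identifiers.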

For this analysis, it's going to be important to keep in mind the different sources of systematic bias. The target regions and the sequencing technologies will each introduce their own biases, and I suspect those will often produce effects that rival the size of the biological effects you're interested in studying. So I would record that information in the master metadata file I compile for the project, and be sure to investigate the effects of those variables (e.g., color your ordination plots by sequencing technology, by target marker region, and so on) so you can see where they arise.

And just so you're aware of it, be sure to check out the relatively new qiime boots kmer-diversity command - this might be helpful if you want to do diversity calculations at the OTU table level, and want to approximate phylogenetic diversity metrics without using a tree. You can see this illustrated in the gut-to-soil tutorial.

Hope this helps!

One follow-up here: another moderator pointed out that kmer-diversity might not work well in this case, because there may be little k-mer overlap between datasets that target different regions, which would give misleading results. So I'm updating my recommendation: it is probably best to stick with analysis of the taxonomy tables.
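
As a rough illustration of the overlap concern, here is a small Python sketch that compares the k-mer sets of two sequences with Jaccard similarity. The two sequences are invented toy strings standing in for reads from different regions, and the helper names and `k=4` are arbitrary choices for the example; this is not how q2-boots or q2-kmerizer compute their metrics.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented toy sequences, NOT real ITS1/ITS2 data.
region_a_read = "ACGTACGTGGCA"
region_b_read = "TTTTCCCCAAAG"

same = jaccard(kmers(region_a_read), kmers(region_a_read))
different = jaccard(kmers(region_a_read), kmers(region_b_read))
print(f"same region: {same:.2f}, different regions: {different:.2f}")
```

When two samples share almost no k-mers simply because they were amplified from different regions, a k-mer-based distance between them reflects the marker choice rather than the community composition.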

Thank you, this is very helpful. I have one follow-up question.

Would mapping my representative sequences to UNITE Species Hypotheses be a reasonable alternative to creating a mixed-region OTU table?

My goal is to obtain one shared feature table that is somewhat analogous to an OTU table, but where the features are UNITE SH IDs rather than de novo OTUs. I want to avoid implying that ITS1, ITS2, and full ITS sequences are directly comparable before assignment.

In other words, would a sample × UNITE SH table be a defensible post-assignment integration strategy for mixed ITS1, ITS2, and full ITS fungal datasets, assuming all sequences are assigned against the same UNITE release/version?

Hi @budixin36,
(jumping in, hope you don't mind, @gregcaporaso)

Yes, definitely. What you are describing is a closed-reference OTU clustering/mapping approach rather than de novo clustering; both are possible with the q2-vsearch plugin. This comes with two caveats: (1) your amplicons could theoretically map to multiple references, and when there is a tie for the top hit, one reference is taken arbitrarily as the match; and, with this in mind, (2) the reference taxonomy will not be fully reliable. So I would still recommend performing taxonomy classification (if nothing else, it would give you a sense of how reliable each SH match is).
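
One way to gauge how often ties could matter is to look for queries whose best identity is shared by more than one reference. The Python sketch below assumes you have search hits available as (query, reference, percent identity) tuples; that layout, the function name, and the placeholder IDs such as `SH_A` are hypothetical and do not correspond to a specific q2-vsearch output format.

```python
from collections import defaultdict


def find_tied_top_hits(hits):
    """Return {query: [references]} for queries with 2+ equally good top hits.

    `hits` is an iterable of (query_id, reference_id, percent_identity)
    tuples; this input layout is an assumption made for the example.
    """
    by_query = defaultdict(list)
    for query, ref, identity in hits:
        by_query[query].append((ref, identity))

    tied = {}
    for query, refs in by_query.items():
        best = max(identity for _, identity in refs)
        top = sorted(ref for ref, identity in refs if identity == best)
        if len(top) > 1:
            tied[query] = top
    return tied


# Placeholder hits (illustrative IDs and identities only).
example_hits = [
    ("seq1", "SH_A", 99.0),
    ("seq1", "SH_B", 99.0),  # ties with SH_A for the top hit
    ("seq1", "SH_C", 97.5),
    ("seq2", "SH_D", 98.2),
]
print(find_tied_top_hits(example_hits))
```

Queries flagged this way are the ones where comparing the SH match against an independent taxonomy classification is most informative.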

There is also the caveat that, as far as I know, the UNITE database contains a mix of ITS1, ITS2, and complete ITS sequences, which makes the closed-reference approach a little messy, depending on what you plan to do downstream. For example, your ITS2 reads could match ITS2 references while your ITS1 reads will not, so reads from the same species might still obtain different reference matches, or obtain an incorrect match simply because the technically correct species reference covers the wrong subregion. As you also have full-length ITS data, you might want to compare against the eukaryome database, which provides full-length ITS references. (By the way, UNITE and eukaryome are both available via the RESCRIPt plugin if you want to test different options.) I have never tried eukaryome, as it is quite new, but it could be worth a shot.

If you take a closed-reference approach against full-length ITS references, Greg's suggestion to use kmer-diversity-based metrics (available in q2-boots and q2-kmerizer) would also be an option, since all reads would then be mapped to the same reference molecule.

I hope that helps!

Thank you, this is very helpful. I understand that the UNITE SH approach would be a closed-reference mapping strategy rather than de novo OTU clustering, and that the resulting sample × SH table should be interpreted with those caveats.

I will keep your suggestion about performing taxonomy classification in mind as a reliability check for the SH matches. Since my project timeline is quite tight, I may not be able to fully explore all validation options, but I will at least examine mapping rates and the effects of target region and sequencing platform in the metadata.
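
A minimal sketch of that mapping-rate check, assuming per-sample read counts before and after closed-reference mapping are stored next to the metadata. The field names (`region`, `platform`, `mapped_reads`, `total_reads`) and the numbers are hypothetical placeholders, not values from this project.

```python
from collections import defaultdict


def mapping_rate_by_group(samples, keys=("region", "platform")):
    """Pool read counts per metadata group and return mapped/total fractions.

    `samples` is a list of dicts; the field names are assumptions made
    for this example.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [mapped, total]
    for sample in samples:
        group = tuple(sample[k] for k in keys)
        totals[group][0] += sample["mapped_reads"]
        totals[group][1] += sample["total_reads"]
    # Skip groups with no reads to avoid dividing by zero.
    return {g: mapped / total for g, (mapped, total) in totals.items() if total > 0}


# Placeholder records (illustrative numbers only).
samples = [
    {"region": "ITS1", "platform": "Illumina", "mapped_reads": 800, "total_reads": 1000},
    {"region": "ITS1", "platform": "Illumina", "mapped_reads": 900, "total_reads": 1000},
    {"region": "ITS2", "platform": "IonTorrent", "mapped_reads": 300, "total_reads": 1000},
]
for group, rate in sorted(mapping_rate_by_group(samples).items()):
    print(group, f"{rate:.2f}")
```

A group with a much lower mapping rate than the others would suggest that the reference set covers that region or platform poorly, which is worth knowing before comparing those samples with the rest.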

Thanks again for the detailed guidance.
