Optimizing taxonomy assignment to reduce unassigned ITS reads

Sonja · February 21, 2022, 12:46pm

Hello everyone,
I'm analysing microbial communities in a fjord and got stuck with the taxonomy assignment of fungi, I get 25-90% unassigned reads in my samples. The amplicons were generated using ITS3 and ITS4 as primers and the reads are 301bp long.

I read about the problem of read-through in the forum, so I used the cutadapt tool to remove the reverse complement of the reverse primer in forward reads and vice versa (side question: does cutadapt also find partial matches at the 3' end? The help function only says "partial matches allowed" for the 5' end - i.e. something like sequencesequencePRIM instead of sequencesequencePRIMER). Primers at the start were removed in the next step, I thought this is easier and safer to do, as they are always the first 20 bases and misread bases don't affect the trim-left parameter in dada2 while they could produce problems with cutadapt.

After that, I ran Dada2 for denoising, tried several settings to keep as many reads as possible while also not truncate too much to not lose species with particularly long ITS regions and managed to get ~80-90% through quality filtering, denoising, merging and chimera-removal, which was surprising as I couldn't get such high percentages with my other dataset of 16S sequences.
These were the settings I finally used:
qiime dada2 denoise-paired --i-demultiplexed-seqs ITS_cutadapt.qza --p-chimera-method consensus --p-trim-left-f 20 --p-trim-left-r 20 --p-trunc-len-f 300 --p-trunc-len-r 290 --p-max-ee-f 3 --p-max-ee-r 5 --p-n-threads 0 --o-table FeatureTable_Frequency --o-representative-sequences FeatureData_Sequence --o-denoising-stats SampleData_DADA2Stats

Then, I tried feature-classifier classify-consensus-vsearch with the dynamic dataset of the latest UNITE database (4h on my computer) and created a taxa barplot:
taxa_barplot.qzv (482.5 KB)
As there are a lot of unassigned sequences, I searched for the problem in this forum and found two things to improve, using the untrimmed developer dataset of the database and also use the dataset that also includes non-fungi eukaryotes (12h). This resulted in the following:
taxa_barplot2.qzv (528.0 KB)
So it appears that many of my reads are actually non-target taxa, but there are still many unassigned sequences.
Also, now I get a lot that are "k_unidentified". Does it make sense to remove sequences with k_unidentified from the reference database to get more informative results?

Do you think I could identify more sequences if I used a trained classifier instead of vsearch? Roughly, how much longer would that take to run?

Blasting some of the unidentified sequences in NCBI gave results like "uncultured fungus", but some also more specific results (both fungi and other kingdoms). But I guess combining UNITE and NCBI databases as reference would lead to impracticable runtimes.

Could the relatively high error tolerance in Dada2 cause these problems?

Also, I stumbled over ITSxpress, but I'm unsure if this only makes sense if you use whole metagenomes instead of metabarcoding. As I understand, it trims sequences to the "interesting" part of the ITS region, correct? So if the sample DNA was amplified using ITS primers, do I still need to do this step?

This is the first time I'm doing Data Analysis of Metabarcoding Data, so I'm still trying to piece together how everything works, but as I got very few unassigned sequences (1-2%) in my other dataset of 16S sequences, I think I'm not doing it totally wrong.
I hope you can point me in the right direction what I can try with my ITS data to improve the results.

Please tell me if you need more information (e.g. code I used or result files).

Best Regards
Sonja

colinbrislawn · February 22, 2022, 2:17am

Hello Sonja,

What a great first post! There's a lot of discuss here, so let's dive in!

I think you are doing all the right things. From choosing pairing settings to preserve as many reads as possible, to reviewing the docs and looking for past questions on the forums. Validating results is essential, and I think you are doing a great job.

This depends on what combination of settings you use. If you want to post your command, we can take a look at exactly what it's doing. You already found the q2-cutadapt docs, so also check out the full Cutadapt docs that have a section about using anywhere to allow partial matches on boths ends.

Nice!

Could you post the full classify-consensus-vsearch command that you ran? As discussed here, there's a bunch of settings within that pipeline that could change/improve results.

It's worth a shot.

Training the classifier takes the most time, but running them is pretty fast. I made some pre-trained UNITE classifiers if you want to running them. Those have not been tested at all, so I don't know if they would perform better or worse than classify-consensus-vsearch.

You should totally try ITSxpress! It is designed with amplicons / metabarcoding in mind, so it should be a good fit for your data. If there are issues with trimming / region-extraction or joining, it could also help.

Or maybe, these results are good to go!

Oh, I've see than with Fungi too! While having a sample with 50% unassigned reads is unexpected with a classic 16S V4 dataset, your samples look pretty good for fungi, given that the ITS region varies more in length and the databases tend to be less complete.

My main concern when seeing a lot of unassigned reads is that something (extra barcodes, reversed reads, bad joining, etc.) is messing up my whole data set, and I don't think that's happening here. It might even be something as simple as those pretty high max Expected Error rates ( --p-max-ee-f 3 --p-max-ee-r 5) that let some extra noise through your dada2 step.

A blast search is a great way to check on the classification results! That sounds like those reads may just be 'unclassifiable.'

Without a mock community of known composition to use as a positive control, I'm not sure if this is a technical problem or just a limit of our labels for fungi.

(Do you have a positive control of known composition? == )

Colin

P.S. Welcome to the forums!

P.P.S.

Building a database is a sisyphean task. Probably best to avoid that

Sonja · February 22, 2022, 11:21am

Thanks Colin!

This was my command for cutadapt:
qiime cutadapt trim-paired --i-demultiplexed-sequences ITS_untrimmed.qza --p-adapter-f GCATATCAATAAGCGGAGGA --p-adapter-r GCTGCGTTCTTCATCGATGC --o-trimmed-sequences ITS_cutadapt.qza
The sequences are the reverse complement of the opposite primer. I did a quick search for them in Text-Editor with the original fastq-files, read-through doesn't happen very often in my samples (400-500 hits in ~100.000 reads searching for the complete primer sequence), so after this step less then 2% of the sequences were shorter than 301 bases, which is what I would expect. As I mentioned, I removed the 5' primers later by using trim-left with dada2, because I saw that the first base wasn't matching in many reads and I wasn't sure how cutadapt would handle this.

Oh, maybe I wasn't clear enough, I think that's not what I need. As far as I understand, 'anywhere' means that cutadapt will search for adapters/primers which are expected at the 3' end also at the 5' end and vice versa (I suppose this helps when the direction of the reads is not consistent).
I found this page as well, and somewhere it says that it finds partial matches at both ends per default, I was just confused if it was implemented in Qiime2 in the same way since the help text specifically mentioned partial matches for the 5' end when using p-front but not the 3' end when using p-adapter.

Sure, I used the default settings (apart from p-threads):
qiime feature-classifier classify-consensus-vsearch --i-query FeatureData_Sequence.qza --i-reference-reads UNITE_sequences.qza --p-threads 20 --i-reference-taxonomy UNITE_taxonomy.qza --o-classification taxonomy.qza

By the way, it would be awesome if the parameters were named consistently, in dada2 it's p-n-threads, here it's p-threads, also you can't set it to 0 as in dada2 to use all available cores (for hardware noobs like me who have no idea what to chose for their computer ).

Ah very cool, thank you. I only found pre-trained classifiers for older versions of Qiime2 and UNITE. I'll try that and look more into ITSxpress as well.

And no, unfortunately I don't have a mock community.

colinbrislawn · February 22, 2022, 11:40pm

Hello again,

Let's take a look at how q2-cutadapt implements cutadapt. All these settings combine in slightly different ways, which makes a direct comparison tricky.

github.com

qiime2/q2-cutadapt/blob/dev/q2_cutadapt/_trim.py#L90-L102


      
              '--overlap', str(overlap),
              '--minimum-length', str(minimum_length),
              '-q', ','.join([str(quality_cutoff_5end), str(quality_cutoff_3end)]),
              '--quality-base', str(quality_base),
              '--cores', str(cores),
              '-o', str(trimmed_seqs.path / os.path.basename(f_read)),
          ]
          
          if r_read is not None:
              cmd += ['-p', str(trimmed_seqs.path / os.path.basename(r_read))]
          
          if adapter_f is not None:
              for adapter in adapter_f:

I also searched for 'are allowed' and found some more examples on the cutadapt docs. I'm not sure that fully answers your question, but I hope that helps.

OK, I was just checking there was not bigger changes like to --p-strand. Looks good!

Part of the discrepancy is what the various programs call their parts / threads setting, but to your point, we might be able to standardize that within their :qiime2: plugins.
EDIT But then the qiime2 settings would not match the setting from the original program, making it hard to refer to the original docs. The parameter names are consistent with the underlying tool, so at least that matches.

system · March 26, 2022, 12:57pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.