Optimizing taxonomy assignment to reduce unassigned ITS reads

colinbrislawn · February 22, 2022, 2:17am

Hello Sonja,

What a great first post! There's a lot to discuss here, so let's dive in!

I think you are doing all the right things, from choosing pairing settings that preserve as many reads as possible to reviewing the docs and looking for past questions on the forums. Validating results is essential, and I think you are doing a great job.

This depends on what combination of settings you use. If you want to post your command, we can take a look at exactly what it's doing. You already found the q2-cutadapt docs, so also check out the full Cutadapt docs, which have a section about using anywhere to allow partial matches on both ends.

Nice!

Could you post the full classify-consensus-vsearch command that you ran? As discussed here, there's a bunch of settings within that pipeline that could change/improve results.

It's worth a shot.

Training a classifier takes the most time, but running one is pretty fast. I made some pre-trained UNITE classifiers if you want to try running them. Those have not been tested at all, so I don't know if they would perform better or worse than classify-consensus-vsearch.

You should totally try ITSxpress! It is designed with amplicons / metabarcoding in mind, so it should be a good fit for your data. If there are issues with trimming / region-extraction or joining, it could also help.

Or maybe, these results are good to go!

Oh, I've seen that with fungi too! While having a sample with 50% unassigned reads is unexpected with a classic 16S V4 dataset, your samples look pretty good for fungi, given that the ITS region varies more in length and the databases tend to be less complete.

My main concern when seeing a lot of unassigned reads is that something (extra barcodes, reversed reads, bad joining, etc.) is messing up my whole data set, and I don't think that's happening here. It might even be something as simple as those pretty high max Expected Error settings (--p-max-ee-f 3 --p-max-ee-r 5) letting some extra noise through your dada2 step.
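To make the Expected Error idea concrete, here is a minimal sketch of how that filter behaves. It assumes the standard definition (the sum of per-base error probabilities, 10^(-Q/10) for Phred score Q) and that a read is dropped when its total exceeds the max EE setting. The function names and the example read are made up for illustration; this is not the dada2 code itself.

```python
def expected_errors(quality_scores):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quality_scores)


def passes_max_ee(quality_scores, max_ee):
    """True if the read would be kept under a given max EE threshold."""
    return expected_errors(quality_scores) <= max_ee


# Hypothetical 250 bp read: 200 bases at Q30, then a 50-base tail at Q13.
read = [30] * 200 + [13] * 50
ee = expected_errors(read)

for threshold in (2, 3, 5):
    status = "kept" if passes_max_ee(read, threshold) else "dropped"
    print(f"max EE {threshold}: {status} (EE = {ee:.2f})")
```

A read like this one, with a noisy tail, is dropped at a threshold of 2 but kept at 3 or 5, which is how looser settings can let noisier reads into denoising.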

A blast search is a great way to check on the classification results! That sounds like those reads may just be 'unclassifiable.'

Without a mock community of known composition to use as a positive control, I'm not sure if this is a technical problem or just a limit of our labels for fungi.
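If you do have a mock community, the check can be as simple as comparing the labels you expect against the labels you got. This is only a rough sketch: the function name and the genus lists are hypothetical placeholders, not real results.

```python
def compare_to_mock(expected, observed):
    """Split labels into recovered, missed, and unexpected relative to the mock."""
    expected_set, observed_set = set(expected), set(observed)
    return {
        "recovered": sorted(expected_set & observed_set),
        "missed": sorted(expected_set - observed_set),
        "unexpected": sorted(observed_set - expected_set),
    }


# Hypothetical genus-level labels, for illustration only.
expected = ["Saccharomyces", "Aspergillus", "Penicillium"]
observed = ["Saccharomyces", "Aspergillus", "Unassigned"]

result = compare_to_mock(expected, observed)
for category, taxa in result.items():
    print(f"{category}: {', '.join(taxa) if taxa else '(none)'}")
```

If known members show up as missed while Unassigned shows up as unexpected, that points toward a technical or database problem rather than truly novel fungi.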

(Do you have a positive control of known composition?)

Colin

P.S. Welcome to the forums!

P.P.S.

Building a database is a Sisyphean task. Probably best to avoid that.