Fungal ITS classification with UNITE database

Hillary_Smith · November 28, 2017, 4:25pm

Hi qiime2 forum,

I am attempting to analyze a dataset generated from fungal ITS sequences and have run into issues similar to those described in two earlier threads, but I haven't been able to solve my problem with the advice given there.

I have tried the following different ways of classifying my sequences and all of them result in kingdom and phylum level classification only. The parameters I have tried are:

training the classifier with extracted and truncated reference reads
training the classifier with extracted (but not truncated) reference reads
training the classifier without extracting reference reads

I have also attempted the above using both the developer and non-developer UNITE database.

Any advice? Why am I still only able to get Kingdom-level classification?!

colinbrislawn · November 28, 2017, 4:52pm

Great question! I would also like to hear what the QIIME devs recommend, as supporting many databases is important.

For reference, here is the current documentation on retraining the feature classifier.

Nicholas_Bokulich · November 28, 2017, 4:58pm

Hi @Hillary_Smith,
Thanks for posting! And many thanks for checking out the other forum posts related to this issue first. I have a few questions to get started troubleshooting:

What primers are you using?
What is the length of the sequences that you are classifying?
What types of samples are you analyzing? (just curious but it may be relevant here)

One way to test your classifier would be to classify a subset (10 would be enough) of the same sequences that you used to train the classifier. If you are still getting kingdom-level classifications, something is wrong with the classifier (or possibly the taxonomy format — I assume that you are using the QIIME-formatted UNITE dataset).
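
A minimal sketch of how such a test subset could be drawn, assuming the reference sequences are available as plain FASTA text. The helper names `parse_fasta` and `subsample_fasta` are made up for illustration and are not part of QIIME 2. The resulting FASTA would still need to be imported and classified with the trained classifier in the usual way.

```python
import random


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample_fasta(text, n=10, seed=0):
    """Pick up to n records at random (reproducibly) and return FASTA text."""
    records = parse_fasta(text)
    rng = random.Random(seed)
    picked = rng.sample(records, min(n, len(records)))
    return "".join(f">{header}\n{seq}\n" for header, seq in picked)
```

If the subset classifies down to genus or species, the classifier itself is probably fine and the query sequences deserve a closer look.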

The best performance for UNITE would be achieved by training the classifier without extracting reference reads, using the "developer" version of the UNITE sequences. Thanks for already trying these different approaches! It gets us a bit closer to solving the problem...

Would you mind sending your classifier and query sequences so that I could examine these files? You can send as a direct message to me if you do not want to post these publicly on this forum.

Hillary_Smith · November 29, 2017, 1:38pm

Hi Nicholas, Thanks for the quick reply!!

I am using primers ITS1F (forward; Gardes and Bruns 1993) and ITS4 (reverse; White et al 1990) following this protocol.
The sequences are Illumina HiSeq, so 2x300bp.
The samples were isolated from 3 different species of Acropora corals.

I tried classifying the same sequences used to train the classifier, as you suggested, and I was able to retrieve genus- and species-level classifications (assuming I did it correctly!).

My classifier and representative sequences are available here. Thanks for your offer to help! Fingers crossed for an easy fix

Nicholas_Bokulich · November 29, 2017, 3:55pm

Hi @Hillary_Smith,
It sounds like your classifier is working fine (and we/other qiime2 users have used the UNITE classifier quite a bit so there should not be compatibility issues). So the bad news is this is not an easy fix, but the good news is that you can try to extract the useful sequences and proceed with that subset of the data (see below).

It seems like the sequences themselves are probably the issue — as a quick test I BLASTed the first five query seqs; 4/5 had no significant hits (but had short segments of similarity to non-fungal sequences) and 1/5 hit Acropora. So in spite of the fact that the primers are supposed to be fungi-specific, they appear to be hitting at least some non-target (e.g., host) sequences.

Here's my advice for the next steps:

1. As a sanity check, you could test out classify-consensus-blast on the UNITE database to get a "second opinion" on sequence classification. Make sure max_accepts is set to 10 (so that you aren't just relying on the top hit).
2. You will want to remove any non-target sequences. You could do this by aligning all sequences to the Acropora pan-genome, but the easier way (in QIIME 2) would be to use filter-features to include only sequences that align to the UNITE database within some % identity (maybe 90%? 80%? Off the top of my head I cannot think of a good number). Too low and you pick up more non-target seqs; too high and you might lose poorly characterized but truly fungal species. Keep in mind that below a certain % identity you probably won't get a good classification anyway.
2A. All hits should be re-classified with your method of choice. Filter all non-hits (non-target sequences) out of your feature table before proceeding with downstream analyses.
2B. You can use NCBI BLAST or similar to try and determine what the non-hits are related to. It seems that at least some are Acropora and perhaps other non-fungi. There may be interesting/useful information in there! (both for experimental and trouble-shooting purposes.)
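
One way to get a feel for the threshold trade-off in step 2 is to count how many features would survive at several candidate cutoffs before committing to one. The sketch below is plain Python, not a QIIME 2 call, and assumes you already have each feature's best identity to any UNITE reference as a fraction between 0 and 1 (for example, parsed from BLAST output). The function name and the example values are hypothetical.

```python
def count_retained(best_identity, thresholds):
    """Count features whose best reference identity meets each threshold.

    best_identity: dict mapping feature ID -> best fraction identity (0-1).
    thresholds: iterable of candidate cutoffs, also as fractions.
    """
    return {
        t: sum(1 for identity in best_identity.values() if identity >= t)
        for t in thresholds
    }


# Hypothetical identities, for illustration only.
example = {"feat1": 0.99, "feat2": 0.91, "feat3": 0.85, "feat4": 0.40}
for cutoff, kept in count_retained(example, [0.80, 0.90, 0.97]).items():
    print(f"cutoff {cutoff:.2f}: {kept} of {len(example)} features retained")
```

A sharp drop between two neighbouring cutoffs suggests that a group of sequences sits right in that identity range, and those are worth inspecting by hand before they are discarded.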

The other good news is that I might just be wrong and the 5 seqs I blasted don't represent the rest of the pack! If a large number of seqs align to the UNITE database with 90%+ similarity and still don't classify well with either method, let us know!

I'm sorry I can't be the bearer of better news! Please let us know if you need any more help with these downstream steps (which should probably be opened in a new thread).

Hillary_Smith · December 21, 2017, 12:57pm

Hi @Nicholas_Bokulich -
Thanks so much for your advice. I realized I had done something stupid in an earlier step, and figure it might help people if I share it here.

We used 2x300bp paired-end sequencing, so I thought naturally I should join the paired-end reads. However, due to variation in the ITS sequence length (which can be upwards of 800bp), the reads actually have a pretty high chance of not joining. I re-did the analysis using only the forward reads, and got way better results. Some sample communities were still classified only at kingdom level, but a BLAST of those sequences shows that there is not a high percentage match to any one phylum (or lower level), so the conservative kingdom-level classification is probably best.
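
The arithmetic behind this can be sketched as follows. Two reads can only be joined if their combined length, minus the overlap the joiner needs, covers the whole amplicon. The minimum overlap of 20 bp is an assumed value for illustration, not a setting taken from this thread, and quality trimming would shorten the usable read length further.

```python
def max_joinable_length(read_len=300, min_overlap=20):
    """Longest amplicon two reads of read_len can span with min_overlap shared."""
    return 2 * read_len - min_overlap


def can_join(amplicon_len, read_len=300, min_overlap=20):
    """True if forward and reverse reads overlap enough to be merged."""
    return amplicon_len <= max_joinable_length(read_len, min_overlap)


for length in (450, 580, 800):
    status = "joinable" if can_join(length) else "too long to join"
    print(f"{length} bp amplicon: {status}")
```

With these assumptions, anything longer than 580 bp cannot be merged, so an 800 bp ITS amplicon would be lost at the joining step.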

Hope this helps someone out there

Nicholas_Bokulich · December 21, 2017, 1:24pm

Thanks for the update, @Hillary_Smith!

Hillary_Smith · January 10, 2018, 9:10pm

Hi @Nicholas_Bokulich -
Trying to follow your recommended steps above, and can't figure out how to filter features based on % identity match to the ref database... any help there?!

Nicholas_Bokulich · January 10, 2018, 9:25pm

Hi @Hillary_Smith,
You can use exclude-seqs to filter out features based on their % identity. (sorry — I linked to the correct tutorial above, but gave the wrong method name. filter-features would be used to subsequently remove these features from your feature table).

You can use the perc-identity parameter to set the similarity threshold for filtering. Two outputs will be produced: features that exceed this % similarity, and those that do not.
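
To make the logic of the two steps concrete, here is a small plain-Python sketch. It does not call QIIME 2; it only mimics the idea of splitting features into hits and misses by identity, then dropping the misses from a feature table. The function names, the use of a 0-1 fraction for the threshold, the "at or above" comparison, and the example data are all assumptions for illustration.

```python
def partition_by_identity(best_identity, perc_identity=0.9):
    """Split feature IDs into (hits, misses) by best identity to the reference.

    best_identity: dict mapping feature ID -> best fraction identity (0-1).
    """
    hits = {f for f, identity in best_identity.items() if identity >= perc_identity}
    misses = set(best_identity) - hits
    return hits, misses


def filter_table(table, keep_ids):
    """Keep only the features in keep_ids.

    table: dict mapping sample ID -> dict of feature ID -> count.
    """
    return {
        sample: {f: count for f, count in counts.items() if f in keep_ids}
        for sample, counts in table.items()
    }


# Hypothetical data, for illustration only.
identities = {"feat1": 0.98, "feat2": 0.62, "feat3": 0.91}
table = {"coralA": {"feat1": 10, "feat2": 300}, "coralB": {"feat2": 5, "feat3": 7}}

hits, misses = partition_by_identity(identities, perc_identity=0.9)
fungal_table = filter_table(table, hits)
```

Here `hits` would go on to re-classification, while `misses` could be BLASTed separately to see what the non-target sequences are.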

I hope that helps!