I am attempting to analyze a dataset generated from fungal ITS sequences and have run into issues similar to those in the following two threads, but I haven’t been able to solve my problem with the advice therein:
I have tried the following different ways of classifying my sequences and all of them result in kingdom and phylum level classification only. The parameters I have tried are:
training the classifier with extracted and truncated reference reads
training the classifier with extracted (but not truncated) reference reads
training the classifier without extracting reference reads
I have also attempted the above using both the developer and non-developer UNITE database.
Any advice? Why am I still only able to get Kingdom-level classification?!
Thanks for posting! And many thanks for checking out the other forum posts related to this issue first. I have a few questions to get started troubleshooting:
What primers are you using?
What is the length of the sequences that you are classifying?
What types of samples are you analyzing? (just curious but it may be relevant here)
One way to test your classifier would be to classify a subset (10 would be enough) of the same sequences that you used to train the classifier. If you are still getting kingdom-level classifications, something is wrong with the classifier (or possibly the taxonomy format — I assume that you are using the QIIME-formatted UNITE dataset).
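To make that test concrete, here is a minimal sketch (plain Python, not a QIIME 2 command) of pulling the first 10 records out of the reference FASTA that was used for training. The function names and the demo records are made up for illustration; the resulting subset would still need to be imported into QIIME 2 before it can be classified.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def first_n_records(lines: Iterable[str], n: int = 10) -> List[Tuple[str, str]]:
    """Return the first n FASTA records (fewer if the input is shorter)."""
    return list(islice(read_fasta(lines), n))


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (header, sequence) pairs back into FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Hypothetical demo input; in practice, pass an open file handle instead.
demo_lines = [">ref1", "ACGTAC", "GGTT", ">ref2", "TTGACA", ">ref3", "CCGA"]
subset = first_n_records(demo_lines, n=2)
print(to_fasta(subset), end="")
```

If these training sequences also come back with kingdom-level labels only, that points at the classifier or the taxonomy file rather than at the query data.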
The best performance for UNITE would be achieved by training the classifier without extracting reference reads, using the “developer” version of the UNITE sequences. Thanks for already trying these different approaches! It gets us a bit closer to solving the problem…
Would you mind sending your classifier and query sequences so that I could examine these files? You can send as a direct message to me if you do not want to post these publicly on this forum.
It sounds like your classifier is working fine (and we/other qiime2 users have used the UNITE classifier quite a bit so there should not be compatibility issues). So the bad news is this is not an easy fix, but the good news is that you can try to extract the useful sequences and proceed with that subset of the data (see below).
It seems like the sequences themselves are probably the issue — as a quick test I BLASTed the first five query seqs; 4/5 had no significant hits (but had short segments of similarity to non-fungal sequences) and 1/5 hit Acropora. So in spite of the fact that the primers are supposed to be fungi-specific, they appear to be hitting at least some non-target (e.g., host) sequences.
Here’s my advice for the next steps:
As a sanity check, you could test out classify-consensus-blast on the UNITE database to get a “second opinion” on sequence classification. Make sure max_accepts is set to 10 (so that you aren’t just blasting the top hit).
You will want to remove any non-target sequences. You could do this by aligning all sequences to the Acropora pan-genome, but the easier way to do this (in QIIME2) would be to use filter-features to include only sequences that align to the UNITE database within some % identity (maybe 90%? 80%? Off the top of my head I cannot think of a good number. Too low and you pick up more non-target seqs, too high and you might lose poorly characterized but truly fungal species; keep in mind that below a certain % id you probably won’t get a good classification anyway).
2A. All hits should be re-classified with your method of choice. Filter all non-hits (non-target sequences) out of your feature table before proceeding with downstream analyses.
2B. You can use NCBI BLAST or similar to try and determine what the non-hits are related to. It seems that at least some are Acropora and perhaps other non-fungi. There may be interesting/useful information in there (both for experimental and troubleshooting purposes).
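To illustrate why keeping 10 hits rather than only the top hit matters for the consensus step, here is a simplified Python sketch of rank-by-rank consensus across several hits. This is not QIIME 2's implementation: the function name, the 0.51 threshold, and the taxonomy strings are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import List


def consensus_taxonomy(hit_taxonomies: List[str], min_consensus: float = 0.51) -> str:
    """Keep each rank only while enough of the hits agree on the lineage.

    hit_taxonomies holds one semicolon-delimited taxonomy string per hit.
    min_consensus is an assumed threshold: the fraction of hits that must
    share a lineage for that rank to be reported.
    """
    if not hit_taxonomies:
        return "Unassigned"
    lineages = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    consensus: List[str] = []
    for depth in range(max(len(lineage) for lineage in lineages)):
        # Count whole lineage prefixes so that identical labels under
        # different parents are not merged together.
        prefixes = Counter(
            tuple(lineage[: depth + 1]) for lineage in lineages if len(lineage) > depth
        )
        prefix, count = prefixes.most_common(1)[0]
        agrees_with_parent = list(prefix[:-1]) == consensus
        if count / len(lineages) < min_consensus or not agrees_with_parent:
            break
        consensus.append(prefix[-1])
    return ";".join(consensus) if consensus else "Unassigned"


# Hypothetical hits for a single query sequence.
hits = [
    "k__Fungi;p__Ascomycota;c__Dothideomycetes",
    "k__Fungi;p__Ascomycota;c__Dothideomycetes",
    "k__Fungi;p__Ascomycota;c__Sordariomycetes",
    "k__Fungi;p__Basidiomycota;c__Agaricomycetes",
    "k__Fungi;p__Basidiomycota;c__Agaricomycetes",
]
print(consensus_taxonomy(hits))
```

In this made-up case the top hit alone would suggest a class-level label, while the consensus across all five hits stops at phylum because the hits disagree below that rank.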
The other good news is that I might just be wrong and the 5 seqs I blasted don’t represent the rest of the pack! If a large number of seqs align to the UNITE database with 90%+ similarity and still don’t classify well with either method, let us know!
I’m sorry I can’t be the bearer of better news! Please let us know if you need any more help with these downstream steps (which should probably be opened in a new thread).
Hi @Nicholas_Bokulich -
Thanks so much for your advice. I realized I had done something stupid in an earlier step, and figure it might help people if I share it here.
We used 2x300bp paired-end technology, so I thought naturally I should join the paired-end reads. However, due to variation in the ITS gene sequence length (which can be upwards of 800bp), the reads actually have a pretty high chance of not joining. I re-did the analysis using only the forward reads, and got way better results. Some sample communities were still classified only at kingdom level, but a BLAST of sequences shows that there is not a high percentage match to any one phylum (or lower level), and the conservative kingdom level classification is probably best.
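The arithmetic behind this can be sketched in a few lines of Python. The minimum overlap of 20 bp and the amplicon lengths below are assumptions for illustration, and read_len should be read as the usable length after any quality trimming, which is often shorter than 300.

```python
def overlap_length(amplicon_len: int, read_len: int = 300) -> int:
    """Bases covered by both reads of a pair; a negative value means a gap.

    Amplicons shorter than one read are covered entirely by both reads,
    so the overlap is capped at the amplicon length.
    """
    return min(2 * read_len - amplicon_len, amplicon_len)


def can_join(amplicon_len: int, read_len: int = 300, min_overlap: int = 20) -> bool:
    """True if the pair overlaps by at least min_overlap bases (assumed value)."""
    return overlap_length(amplicon_len, read_len) >= min_overlap


# Illustrative amplicon lengths spanning the range mentioned above.
for length in (250, 450, 580, 650, 800):
    print(length, overlap_length(length), can_join(length))
```

Under these assumptions, anything longer than 580 bp cannot be joined at all, so any taxa with longer ITS amplicons would drop out of the joined data.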
You can use exclude-seqs to filter out features based on their % identity. (sorry — I linked to the correct tutorial above, but gave the wrong method name. filter-features would be used to subsequently remove these features from your feature table).
You can use the perc-identity parameter to set the similarity threshold for filtering. Two outputs will be produced: features that exceed this % similarity, and those that do not.
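The two-way split can be pictured with a small Python sketch that works on BLAST tabular output (outfmt 6, where column 1 is the query ID and column 3 is percent identity). This only illustrates the idea and is not how exclude-seqs is implemented: the function names and demo rows are made up, the threshold is assumed to be given as a fraction between 0 and 1, and the sketch ignores how much of each query actually aligned.

```python
from typing import Dict, Iterable, Set, Tuple


def best_identity(blast_rows: Iterable[str]) -> Dict[str, float]:
    """Map each query ID to its highest percent identity across all hits."""
    best: Dict[str, float] = {}
    for row in blast_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue
        fields = row.split("\t")
        query_id, pident = fields[0], float(fields[2])
        best[query_id] = max(pident, best.get(query_id, 0.0))
    return best


def partition_by_identity(
    feature_ids: Iterable[str],
    blast_rows: Iterable[str],
    perc_identity: float = 0.9,
) -> Tuple[Set[str], Set[str]]:
    """Split features into (hits, misses) at the given identity threshold.

    Features with no alignment at all are treated as misses.
    """
    ids = set(feature_ids)
    best = best_identity(blast_rows)
    hits = {f for f in ids if best.get(f, 0.0) / 100 >= perc_identity}
    return hits, ids - hits


# Hypothetical alignments against a reference database.
rows = [
    "feat1\tSH001\t98.5\t250\t3\t0\t1\t250\t1\t250\t1e-120\t440",
    "feat1\tSH002\t91.0\t250\t22\t0\t1\t250\t1\t250\t1e-90\t340",
    "feat2\tSH003\t82.0\t120\t21\t0\t1\t120\t5\t124\t1e-20\t100",
]
hits, misses = partition_by_identity(["feat1", "feat2", "feat3"], rows, perc_identity=0.9)
print(sorted(hits), sorted(misses))
```

The hits are the features to re-classify and keep in the feature table; the misses are the ones worth BLASTing separately to see what they are.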