Large amount of 'Unassigned' ITS sequences in classify-consensus-blast results

kelmermcunha · October 15, 2021, 4:18am

Hello everyone!
First of all, this is my first post on the QIIME 2 Forum, so I apologize in advance if this is not the right category for my questions. I'll try to provide as much information as possible.

With that out of the way, I'm a bit confused about the results from feature-classifier classify-consensus-blast, where a huge number of features come out as 'Unassigned'. I'm working with paired-end, demultiplexed, and primer-trimmed ITS1 sequences of endophytic fungi, generated with Illumina MiSeq. These are the steps I have run so far, up to the taxonomy assignment via BLAST:

Importing sequence data.

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest.tsv \
--input-format PairedEndFastqManifestPhred33V2 \
--output-path demux-paired-end.qza

DADA2 denoising parameters. I chose the truncation lengths based on the quality plots.

qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-paired-end.qza \
--p-trunc-len-f 250 \
--p-trunc-len-r 120 \
--p-max-ee-f 0.5 \
--p-max-ee-r 0.5 \
--p-chimera-method consensus \
--p-n-threads 0 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza \
--verbose

This is the visualization of DADA2 stats.

BLAST

qiime feature-classifier classify-consensus-blast \
--i-query drimys/rep-seqs.qza \
--i-reference-reads unite-ver8-99-seqs-10.05.2021.qza \
--i-reference-taxonomy unite-ver8-99-tax-10.05.2021.qza \
--p-maxaccepts 1 \
--p-perc-identity 0.8 \
--p-query-cov 0.9 \
--o-classification blast.qza \
--verbose

I'm using the UNITE database (the latest QIIME 2 release I could find on their website). These are the results that I'm getting with BLAST. Note that ca. 500 features are classified as 'Unassigned'.
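To put a number on this, here is a minimal sketch that counts 'Unassigned' rows in the taxonomy table. It assumes the classification was exported with `qiime tools export --input-path blast.qza --output-path exported`, which writes a tab-separated `taxonomy.tsv` with a header row and the taxon in the second column. The function name and the `exported/` path are illustrative only.

```shell
# Print "<unassigned>\t<total>" for an exported QIIME 2 taxonomy table.
# Assumes: tab-separated, one header row, taxon string in column 2.
count_unassigned() {
  awk -F'\t' 'NR > 1 { total++; if ($2 == "Unassigned") n++ }
              END { printf "%d\t%d\n", n, total }' "$1"
}

# Usage (hypothetical path):
#   count_unassigned exported/taxonomy.tsv
```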

So I'm a bit confused here. My pipeline steps seem fine, as they are consistent with other posts I've seen. I recognize that my DADA2 and BLAST parameters are stringent, but that is probably not the cause. Even if I use the default feature-classifier classify-consensus-blast parameters, I still get a lot of 'Unassigned' features.

Any thoughts on what could be happening here? Are these features just sequencing or other errors, or could this have something to do with my pipeline steps? If they are errors, should I just disregard these features in further analysis, or is there something that can be done?

I'm aware that there are some posts regarding the same problem, but none of them are related to ITS data.

Thank you for reading this far, and thanks in advance for your responses!
Bye!

sln · October 15, 2021, 8:03am

Dear Kelmer,
We are encountering the same issue. Almost 95% of our ASVs are not assigned to any taxonomy when analyzing ITS data with the UNITE database.
I did not follow the same steps you did but followed this analysis pipeline. So I assume the problem arose in the last step, using UNITE to assign taxonomy.
Recently, I discovered a new reference data processing tutorial and applied it to 16S data. I also found this topic about applying the same approach to fungal data. This might be the solution to our problem, but I haven't tried it yet.
Hope we find a solution.
Selin

Nicholas_Bokulich · October 15, 2021, 8:27am

Hi @kelmermcunha and @sln ,

My guess is that you have sequences that do not match the reference at all. These could be plant or other non-fungal eukaryote sequences, depending on the sample type. What is the sample type?

The BLAST classifier is usually quite straightforward in this case: unassigned features are usually unassigned because they do not match the reference. Especially because you use this setting:

--p-maxaccepts 1

So the consensus taxonomy step is not run; you are just grabbing the first hit that satisfies these criteria:

--p-perc-identity 0.8
--p-query-cov 0.9

The unassigned sequences fail at least one of those criteria, meaning that (a) they bear little or no similarity to the reference sequences or (b) the query coverage is below the threshold, which would suggest the presence of an adapter or barcode sequence.

To troubleshoot I recommend grabbing a few of the unassigned sequences and using the NCBI BLAST webserver to check against the full nucleotide database. This would be a good way to see what off-target sequences you are hitting (e.g., host DNA?)
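One way to grab a few unassigned sequences for the NCBI webserver is sketched below. It assumes both artifacts were exported with `qiime tools export`, giving a tab-separated `taxonomy.tsv` (taxon in column 2) and a `dna-sequences.fasta` whose headers are the feature IDs. The function name and file paths are illustrative only.

```shell
# Print the first N 'Unassigned' sequences as FASTA.
# $1: exported taxonomy.tsv   $2: exported dna-sequences.fasta   $3: N (default 5)
unassigned_fasta() {
  awk -F'\t' -v limit="${3:-5}" '
    # First file: remember the IDs of features labelled Unassigned.
    NR == FNR { if (FNR > 1 && $2 == "Unassigned") want[$1] = 1; next }
    # Second file: on each header, decide whether to keep this record.
    /^>/ {
      id = substr($1, 2); sub(/[ \t].*/, "", id)
      keep = (id in want) && (shown < limit)
      if (keep) shown++
    }
    keep { print }
  ' "$1" "$2"
}

# Usage (hypothetical paths):
#   unassigned_fasta exported/taxonomy.tsv exported/dna-sequences.fasta 5 > unassigned.fasta
```

The resulting FASTA can be pasted directly into the webserver's query box.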

Good luck!

kelmermcunha · October 15, 2021, 2:39pm

Hello @sln and @Nicholas_Bokulich,
Thank you for the quick reply and attention!

@sln, thanks for the potential alternative solution; I'll try it as a fallback.

--

In my case the samples are dead wood from a plant species at different decay stages, so host DNA is a possibility. However, I followed your advice and checked some 'Unassigned' sequences using the NCBI BLAST webserver, and this is what came out:
[Screenshot from 2021-10-15 11-09-08]
For this particular 'Unassigned' feature, the --p-perc-identity and --p-query-cov thresholds are met, so was it filtered based on the E value? That is the only explanation I can think of when looking at the default parameters of classify-consensus-blast. This happened with other 'Unassigned' features as well.

In other cases, the features really did not satisfy the criteria I specified.

So I guess some of these features are being classified as 'Unassigned' because of the criteria.
But is there any "solution" for the E value cases? Aren't those E values good, since they are very small numbers?

Looking at the results more closely, I also noticed that some different features are classified as the same taxon:
[Screenshot from 2021-10-15 11-21-16]
Is this normal or some kind of red flag? I could imagine intra-species variation in the targeted region, but my lack of experience doesn't allow me to draw any conclusions. Sorry to bother you with more questions.

And again, thanks a ton guys for your attention and insights!!

Nicholas_Bokulich · October 15, 2021, 4:56pm

Now that's strange... I just checked UNITE directly and those species are in there too. So unless you modified the sequences somehow, there should be hits.

I don't think the E value is the cause... those E values are very, very low. But you can also adjust the E-value threshold with the classify-consensus-blast action.

Next, I would recommend running blastn locally, querying these same sequences against the same UNITE reference database, maybe adjusting the settings so that you get some hits and can inspect the alignments manually. Something is going on, and this might be the best way to check whether there is an issue with the query, the reference, or the method...
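A local check along these lines could look like the sketch below. It assumes BLAST+ (`makeblastdb`, `blastn`) is installed, that the query and the UNITE reference reads have been exported to FASTA, and that the database name `unite_db`, the function names, and the relaxed `-evalue 10` are illustrative choices rather than anything prescribed in this thread. Note that blastn reports identity and coverage as percentages, so the 0.8 and 0.9 used in QIIME 2 correspond to 80 and 90 here.

```shell
# Query a FASTA file against a reference FASTA with relaxed settings,
# printing tabular hits. Requires BLAST+ on PATH; paths are hypothetical.
# $1: query FASTA   $2: reference FASTA
run_local_blastn() {
  makeblastdb -in "$2" -dbtype nucl -out unite_db
  blastn -query "$1" -db unite_db -evalue 10 -max_target_seqs 5 \
    -outfmt "6 qseqid sseqid pident length qcovhsp evalue bitscore"
}

# Keep only hits that meet the identity and coverage thresholds (in percent).
# Expects the 7-column layout produced above: pident is column 3, qcovhsp column 5.
# $1: hits file   $2: min identity (default 80)   $3: min coverage (default 90)
filter_hits() {
  awk -F'\t' -v id="${2:-80}" -v cov="${3:-90}" '$3 >= id && $5 >= cov' "$1"
}

# Usage (hypothetical paths):
#   run_local_blastn unassigned.fasta unite-refs.fasta > hits.tsv
#   filter_hits hits.tsv 80 90
```

Comparing the raw and filtered hit lists shows which queries have no hit at all and which have hits that only fail the identity or coverage cutoff.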

Different features being classified as the same taxon is normal... single-nucleotide variants (even within the same genome there can be multiple variable copies of the ITS) will lead to multiple ASVs, so you can consider these possibly different strains, copy-number variants, or possible errors. There are many possibilities, but this is definitely very normal.
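To see how many features share each taxon label, a small tally can help. This sketch again assumes a `taxonomy.tsv` exported with `qiime tools export` (tab-separated, header row, taxon in column 2); the function name is illustrative only.

```shell
# Print "<feature count>\t<taxon>" for each taxon, most frequent first.
# Assumes: tab-separated, one header row, taxon string in column 2.
features_per_taxon() {
  awk -F'\t' 'NR > 1 { n[$2]++ }
              END { for (t in n) printf "%d\t%s\n", n[t], t }' "$1" |
    LC_ALL=C sort -k1,1nr
}

# Usage (hypothetical path):
#   features_per_taxon exported/taxonomy.tsv | head
```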

kelmermcunha · October 16, 2021, 6:14am

Hello @Nicholas_Bokulich, thanks again for the attentive and quick response!

I've run blastn locally and gained some insight into what is going on within QIIME 2.

It turns out that the discrepancies between the NCBI BLAST webserver and classify-consensus-blast results were caused by different datasets.

It seems that these 'Unassigned' features are listed only in the INSD dataset, while the UNITE QIIME 2 release I used does not contain the combined UNITE+INSD dataset that the BLAST webserver does (I suppose). So, when I ran blastn locally with the UNITE+INSD dataset, the particular feature that I used as an example was indeed assigned to Vertexicola sp.

However, I'm still getting a considerable number of 'Unassigned' sequences (428 out of 1004), but I think that is beyond what you can help with, since everything seems to be fine within QIIME 2. So I'll continue with my Unassigned features saga.

I would like to thank you again for your time and this helpful and fun discussion!

As for @sln, you could try running blastn locally with the UNITE+INSD dataset, just like I did, to check whether you get more assigned features as well!

Nicholas_Bokulich · October 16, 2021, 6:20am

The NCBI BLAST nucleotide database indeed contains many more sequences (and different content) than UNITE. But UNITE has many different versions in each release, and I believe at least one of the QIIME-compatible versions includes INSD sequences, so maybe check around here to see which one is right for you.

You could also build an ITS reference database from NCBI sequences following the link that @sln posted.

Good luck!

kelmermcunha · October 21, 2021, 8:25pm

Hello @Nicholas_Bokulich

I'll definitely check their versions to make sure I get the one that includes INSD!
Otherwise I'll try @sln's kind suggestion.

Thank you!
