Classification of ITS question

JedL · June 6, 2019, 6:01am

Hi,

New to QIIME2 (using 2019.4) but it's pretty easy to work your way around and learn thanks to the extensive tutorials and forum posts. I have however run into an issue that I can't quite figure out the reason for.

I initially, when classifying some ITS2 sequences, extracted all reads with my primer sequences from the UNITE (ver 8) dynamic reference sequences file and trained a classifier (feature-classifier fit-classifier-naive-bayes) with them as reference reads against the UNITE dynamic reference taxonomy and classified the sample reads (feature-classifier classify-sklearn) and got a nice barplot from it! ITS2extracted.qzv (875.1 KB)

After this though, i was reading on the forum and here, at the bottom, and saw that it's suggested that I should not extract the reads and in fact train the classifier using the whole UNITE database reference reads file provided. So I repeated the training with the full UNITE reference sequences they provide and the same taxonomy provided. I reclassified my samples using the newly trained classifier and got a much less impressive barplot... ITS2fullUNITEtrained.qzv (744.5 KB)

Can anyone explain why with more reference sequences less sample sequences are classified as anything more than unidentified at every level other than fungi? Any help would be most welcome

Nicholas_Bokulich · June 6, 2019, 1:58pm

Wow, that definitely does not look good, but it appears to be a quirk related to your data and/or reference database. Let me explain:

The k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified classification that you are seeing is actually the annotation that is provided in your reference database. So the classifier is not failing to classify beyond kingdom level (otherwise you would see empty ranks, e.g., k__Fungi;__;__;__;__;__;__), rather you are getting great species-level classification to a sequence in the reference database that is annotated as k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified. Does that make sense?

That recommendation is just based on our own basic benchmarks. In your case you definitely see better performance with the extracted reads (probably because extraction culls out the junky unannotated sequence that your sequences are matching to), so I recommend just sticking with that and not looking back unless if you want to do a little more experimentation.

One thing you could do is filter out the "unidentified" sequences from your UNITE database and see if that improves classification (with or without extraction).

Yum, looks like you must be analyzing some grape or fruit samples

Barbs · June 6, 2019, 3:56pm

Hi,

I am joining in as I have exactly the same problem with my ITS2 sequences and UNITE sh_taxonomy_qiime_ver8_dynamic_02.02.2019_dev ref sequence file: Almost 100% of my reads get great species-level classification as:
k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified
As suggested in other threads I did NOT extract the ref reads with my primer sequences but still I end up with this poor assignment.
Since you both suggest that extracting the ref reads before assignment might actually be helpful in some cases I will try that!

One thing you could do is filter out the “unidentified” sequences from your UNITE database and see if that improves classification (with or without extraction).

I will also try that! Thanks a lot for bringing up that topic

Nicholas_Bokulich · June 6, 2019, 4:12pm

oh thanks for listing the file name — I see that this is a new release of UNITE. I have checked out this release and I see that it has 623 sequences with that annotation! I compared this to an older release I have laying around, and there are 0 sequences with that annotation.

So that explains a few things.

we have not seen this particular issue in the past but probably because the new release contains this problematic sequence with a bad annotation, prior release(s) do not.
the decision to extract or not may be particular to individual release versions (and individual databases for that matter)! Ultimately, it may be best to try things both ways to see what effect it has, since peculiarities of individual datasets may also impact this decision.
Unless if you are happy with unhelpful results, I would recommend cutting out the dead wood (unannotated/unidentified sequences) from the UNITE database, and this is something I have done in the past.

Please let us know what you find!

JedL · June 6, 2019, 10:31pm

Thanks both for your help!

I had figured that it was more likely an issue with the database than QIIME but thought that i would check just in case. I hadn’t considered that a new version of the database had come out since the advice I was talking about was given. I’ll play around a little more and see what I can generate

And you nailed it, microbial terroir research is still going strong

Barbs · June 17, 2019, 2:53pm

Hello!

I tried now for days trimming the ITS database for the rogue sequences but I always end up with an error. Apparently there is no qiime plug-in for filtering taxa tables (FeatureData[Taxonomy]) for certain taxa like there is qiime taxa filter-seqs for filtering sequences based on their taxonomic assignment.

That’s why I tried to trim the unidentified taxa out of the fasta ref and tax ref file using R. It worked quite well and I was able to import the trimmed sequences in Qiime artifact. When trying to import the taxonomy file I get an error. I am sure it is something which went wrong during file conversion/import/export to R as the error mentions utf-8 code problem:

File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/tools.py", line 146, in import_data
view_type=input_format)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/result.py", line 240, in import_data
return cls.from_view(type, view, view_type, provenance_capture)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/result.py", line 265, in _from_view
result = transformation(view)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/core/transform.py", line 70, in transformation
new_view = transformer(view)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/core/transform.py", line 220, in wrapped
file_view = transformer(view)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_types/feature_data/_transformer.py", line 177, in _20
_taxonomy_formats_to_dataframe(str(ff), has_header=False))
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_types/feature_data/_transformer.py", line 51, in _taxonomy_formats_to_dataframe
header=None, dtype=object)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
data = parser.read(nrows)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 109: invalid start byte
An unexpected error has occurred:
'utf-8' codec can't decode byte 0xf9 in position 109: invalid start byte

Is there any other more elegant way to trim those rogue taxa out of my taxonomy file. Did I miss some magic plugin which can do it? Sorry for bringing up another issue here...

Nicholas_Bokulich · June 17, 2019, 6:07pm

That's correct. Usually in QIIME 2 feature data (e.g., taxonomy) is only used if that feature is present in a feature table or list of features (e.g., FeatureData[Sequence] artifact). So there is no need to filter out a FeatureData[Taxonomy] artifact.

I think the case is the same here: just use filter-seqs to filter your seqs, then try training a classifier with those filtered seqs and the full taxonomy. There might be an error, but I think it should just work. Other classification methods — classify-consensus-vsearch, for example — will work without filtering your taxonomy.

Give that a try first, please, then we can try debugging this utf-8 issue if I am wrong about fit-classifier-naive-bayes not needed the taxonomy to be filtered.

Barbs · June 18, 2019, 9:29am

Thanks!
So I tried:

qiime taxa filter-seqs
--i-sequences unite_dev_seq2.qza
--i-taxonomy unite_dev_tax.qza
--p-exclude p__unidentified
--o-filtered-sequences unite_dev_seq3_trim.qza

But ended up with a new error:

File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in call
results = action(**arguments)
File "</home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-340>", line 2, in fit_classifier_naive_bayes
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 316, in generic_fitter
pipeline)
File "/home/kschaefe/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 31, in fit_pipeline
y, X = list(zip(*data))
ValueError: not enough values to unpack (expected 2, got 0)
Plugin error from feature-classifier:
not enough values to unpack (expected 2, got 0)

There has been a previous posting for this error: Feature Classifier - Not enough values to unpack

But I checked the tax file and there is no # or blank space in the file.
Also I checked some of the lines in the tax file which are mentioned in the error message and they look just fine for me.

Sorry for coming up with a new error...again... Thanks for your help!
unite_dev_tax.qza (691.5 KB)
unite_dev_seq2.qza (4.4 MB)

Nicholas_Bokulich · June 18, 2019, 11:25am

The issue here is that some of the feature IDs in the taxonomy do not match those in the seqs. My guess is this is because you converted lowercase 'ACGT' in the seqs to uppercase at some point, because the mismatched IDs look like this:

# In the seqs
SH1660673.08FU_DQ974771_refs_sinGleTon
# in the taxonomy
SH1660673.08FU_DQ974771_refs_singleton

Go back to the start and uppercase using the example in this tutorial:

awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver7_99_01.12.2017_dev.fasta > developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta

That will only convert to uppercase if it is not in the header line.

Barbs · June 19, 2019, 8:48am

Thanks again! Cant believe I missed the GTCA conversion in the IDs. I’ve noticed it it a previous file but I thought I got rid of it-apparently I didnt. Sorry for that!

I got rid of the problem with your suggestion. AND by using the trimmed dataset aI got a much better taxa assignment.
I still get <50% of k__Fungi;__;__;__;__;__;__ assignments with a confidence of 1.00. So the classifier is failing to classify beyond kingdom level. But this is a different issue…

Many thanks for the great support!

Nicholas_Bokulich · June 19, 2019, 11:01am

These are most likely non-target hits, e.g., plants or non-fungal eukaryotes (since you don't have an outgroup in your ref seqs they get classified as Fungi). Spot-check a few to confirm (use NCBI BLAST to make sure they aren't fungi), remove, and move on. If NCBI suggests that these are fungi, let me know.

Barbs · June 19, 2019, 1:51pm

From my 3095 features 310 get the k__Fungi;__;__;__;__;__;__ assignment. I checked 10 of those sequences at NCBI and the outcome was pretty mixed. Most of them were confirmed as fungi - mainly uncultured samples, but >90% query cover & identity. 3 features had super bad eurkaryote hits (<30% query cover).

Those features are very abundand accounting for >50% reads in my samples. So I am a bit reluctant to remove them. Looks like I have to check all 310 individually!?

Nicholas_Bokulich · June 19, 2019, 2:16pm

are your reads mixed orientation?

try classify consensus-vsearch and let us know what happens.

Barbs · June 25, 2019, 12:46pm

Hello again,

are your reads mixed orientation?

My reads are paired-end.

I tried classify consensus-vsearch with all default settings. From the 3095 features are now 972
features Unassigned. So much more than with sklearn classifier. Those
unassigned features include almost all of the 310 k__Fungi;__;__;__;__;__;__ features
from sklearn classifier (and apparaently 662 features more). I cross-checked the
662 additional unassigned features for their assignment with
sklearn-classifier: Most of them do also have a poorer assignment there (max. to phylum
level). So if a feature is assigned as k__Fungi;__;__;__;__;__;__ with sklearn and as
unassigned with vsearch… its quite likely that it is not a fungi feature and I
might exlude it for further down-stream analysis?!

Thanks a lot for your patience!

Nicholas_Bokulich · June 26, 2019, 11:02am

Yes. If both of these classifiers are giving you unclassified/only kingdom level, it is most likely:

these are non-target hits (e.g., plant).
you are targeting the wrong region (sounds like that's not the case here!)

In which case I'd say do another BLAST spot check, filter these from your data, and move on.

NCBI BLASTn has an "exclude uncultured" option — always use that to avoid pesky "uncultured" hits!

system · July 27, 2019, 5:02pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.