I'm losing too many OTUs with taxa filter-table (~80%). Is this a normal/common phenomenon?

taxa
taxonomy
feature-classifier
filtering

#1

Dear QIIME2 users,
I have a few general questions about filtering 16S data. I understand this is not a QIIME2 software problem, but I’d appreciate insights from senior biologists/users of QIIME2 working with 16S data on this.

I have Illumina 16S data (V4 with 515-806 primers), and I’m running QIIME2-2018.11 to examine microbial abundance and diversity in my samples. I used SILVA132 (99_16S.fna and 99_taxonomy_7_levels.txt) as the reference database and taxonomy database.

$ time qiime feature-classifier classify-consensus-blast \
  --i-query fig2a/rep-seqs.qza \
  --i-reference-taxonomy fig2a/silva132_taxonomy.qza \
  --i-reference-reads fig2a/silva132-db.qza \
  --o-classification fig2a/classify2a \
  --p-perc-identity 0.90 \
  --p-maxaccepts 1 \
  --verbose

I’ve read about taxa filtering here https://docs.qiime2.org/2018.11/tutorials/filtering/ , and on this thread https://forum.qiime2.org/t/high-yield-of-d-2-alphaproteobacteria-d-3-rickettsiales-d-4-mitochondria-in-samples-from-wild-bats-artifact/7161.

After classification I scanned my taxonomy-to-tsv file and realized I had to filter out “Unassigned” features in order to improve relative abundance estimates in the samples downstream. To my dismay, “Unassigned” made up about 65%! How is this possible? Any suggestions for improving taxonomic assignment? I thought silva_132_99_16S.fna was appropriate for this (I could be wrong!), as several posts online suggest, for example https://www.researchgate.net/post/Can_anyone_help_with_pulling_specific_sequences_that_correspond_to_OTU_IDs_from_Qiime
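A quick way to put a number on the “Unassigned” share is to count rows in the exported taxonomy table. The following is only a rough sketch: it assumes a tab-separated file with a single header line and the taxon string in the second column, and the function name `fraction_unassigned` is made up for illustration. It counts features, not reads, so it says nothing about how abundant those features are.

```shell
# fraction_unassigned TAXONOMY_TSV
# Prints "<unassigned> <total> <percent>" for a taxonomy table.
# Assumption: tab-separated, one header line, taxon string in column 2.
fraction_unassigned() {
  awk -F '\t' '
    NR == 1 { next }                       # skip the header row
    { total++ }                            # every remaining row is one feature
    $2 == "Unassigned" { unassigned++ }    # exact match on the taxon column
    END {
      if (total == 0) { print "0 0 0.0"; exit }
      printf "%d %d %.1f\n", unassigned, total, 100 * unassigned / total
    }' "$1"
}
```

If the classification has been exported to a directory, the call might look like `fraction_unassigned exported/taxonomy.tsv` (the path is hypothetical).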

I also have quite a few other assignments labelled “metagenome” (approx. 17%). Do I need to filter these out too? I’m interested in Family, Genus and Species assignments, yet everything labelled metagenome (though it hits bacteria) is only classified to Phylum or Class. If I do filter them, I’m worried I’ll retain only 16% of the original data!

And lastly, I wanted to confirm whether some of the OTUs that are “Unassigned” are truly unassignable by doing a quick NCBI BLAST. How do I get the sequences corresponding to specific OTU IDs from my silva_132_99_16S.fna file? I’ve noted that filter_fasta.py (in QIIME1) could do this, but I’m running QIIME2. Where would I find an equivalent of that script in QIIME2?
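For spot checks like this, the query sequences normally come from the representative sequences (rep-seqs.qza) rather than from the SILVA reference file, since the feature IDs refer to your own reads. One possible approach, sketched here under assumptions, is to export the sequences to a FASTA file and pull records by ID with awk. The function name `extract_fasta_by_id` and both input files are hypothetical; QIIME 2 also ships `qiime feature-table filter-seqs`, which may cover the same need natively.

```shell
# extract_fasta_by_id IDS_FILE FASTA_FILE
# Prints the FASTA records whose ID (first word after ">") is listed
# in IDS_FILE, one ID per line. Multi-line sequences are kept intact.
extract_fasta_by_id() {
  awk '
    FILENAME == ARGV[1] { want[$1] = 1; next }   # first file: collect wanted IDs
    /^>/ { keep = (substr($1, 2) in want) }      # header line: decide keep or skip
    keep                                         # print lines of kept records
  ' "$1" "$2"
}
```

A call such as `extract_fasta_by_id unassigned-ids.txt exported/dna-sequences.fasta > to-blast.fasta` (all three file names are placeholders) would give a small FASTA file that can be pasted into NCBI BLAST.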

If any of my questions are too basic, my apologies. I am at my wits’ end (almost 2 weeks on this) and I appreciate all the help I can get. This forum has been my life-saver on all things QIIME2!!

Thanks for your help.


(Nicholas Bokulich) #2

Hi @halt_BB,
This sounds like you genuinely have a large fraction of non-target sequences and the high level of filtering is warranted. Of course the classifier impacts this, but you are using a very permissive classifier.
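For reference, removing unwanted taxa from the feature table could look roughly like the sketch below. The wrapper name `filter_taxa` and the exclusion list are illustrative assumptions; which labels to exclude depends on the study, and the list here is not a recommendation from this thread.

```shell
# filter_taxa TABLE_QZA TAXONOMY_QZA OUTPUT_QZA
# Thin wrapper around "qiime taxa filter-table".
# Assumption: the comma-separated exclusion terms below are only an example.
filter_taxa() {
  qiime taxa filter-table \
    --i-table "$1" \
    --i-taxonomy "$2" \
    --p-exclude Unassigned,mitochondria,chloroplast \
    --o-filtered-table "$3"
}
```

Comparing feature and read counts before and after filtering shows how much of the loss comes from each excluded label.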

You are telling BLAST to classify each sequence to the taxonomy of the first reference sequence with ≥ 90% similarity. This is generally a bad idea: the classifications will not be precise, and will be very prone to over-classification. You should use a higher perc-identity and higher maxaccepts if you use the blast-LCA classifier.
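As a rough illustration of “higher perc-identity and higher maxaccepts”, the original command could be adjusted along these lines. The specific values 0.97 and 10 are assumptions chosen for the example, not settings given in this thread, and the wrapper name `classify_blast_consensus` is hypothetical.

```shell
# classify_blast_consensus QUERY_QZA REF_READS_QZA REF_TAXONOMY_QZA OUTPUT_QZA
# Assumption: 0.97 and 10 are illustrative values; tune them for your data.
classify_blast_consensus() {
  qiime feature-classifier classify-consensus-blast \
    --i-query "$1" \
    --i-reference-reads "$2" \
    --i-reference-taxonomy "$3" \
    --p-perc-identity 0.97 \
    --p-maxaccepts 10 \
    --o-classification "$4"
}
```

With more than one accepted hit, the taxonomy is derived from a consensus of the hits instead of being copied from a single reference sequence, which should reduce over-classification.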

But the poor classifier choice will actually make it less likely for your query sequences to be unclassified… effectively, those 65% of sequences that are unclassified have no matches that are at least 90% similar.

Use a different classifier or different parameters. I think this particular classification stems from your parameter choices.

See the tabulate command in this section:
https://docs.qiime2.org/2018.11/tutorials/moving-pictures/#taxonomic-analysis

See https://docs.qiime2.org/2018.11/tutorials/quality-control/#evaluating-sequence-quality

Good luck!


#3

Hi @Nicholas_Bokulich,

Thank you for the detailed explanation of my current predicament, I really appreciate it.

“This sounds like you genuinely have a large fraction of non-target sequences and the high level of filtering is warranted. Of course the classifier impacts this, but you are using a very permissive classifier.”

I think I agree with you on the high percentage of “Unassigned”, although I’ve been in denial :sob: hoping I was wrong! The sequencing facility characterized most of my samples as having poor quality DNA (oh… that’s after they sent me totally different sequencing results, which turned out to be for a colleague of mine in the same lab!). Unfortunately, I can’t do anything about DNA quality since these are field samples collected at specific time points.

“But the poor classifier choice will actually make it less likely for your query sequences to be unclassified… effectively, those 65% of sequences that are unclassified have no matches that are at least 90% similar.”

Indeed I was being overly permissive (still stuck in the BLAST age :speak_no_evil:).

I am going to try the new QIIME 2 q2-feature-classifier with the pretrained silva-132-99-515-806-nb-classifier.qza, and hopefully, with more stringent filtering, this gives me better assignments, as detailed here: https://peerj.com/preprints/3208/

I’ll look at the resources on OTU picking and evaluating quality.

Feedback in 2 days :crossed_fingers:

Again, many thanks @Nicholas_Bokulich!


(Nicholas Bokulich) #5

Hi @halt_BB,

It happens to all of us! Read quality and non-target DNA amplification/contamination are both issues.

Yes, that would give more reliable assignments at the end of the day (though you will still lose many unassigned sequences — you can spot check a few with NCBI BLAST to see if these are garbage, non-target DNA, or an error, i.e., something that should be assignable in which case I can walk you through more troubleshooting).

Note that’s an ancient pre-print and I don’t think q2-feature-classifier is even benchmarked there… this is the paper you want: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0470-z

Let us know how it turns out!


#7

Hi @Nicholas_Bokulich,

I just tried running the feature-classifier and there was a compatibility issue. See the error below:

$ time qiime feature-classifier classify-sklearn \
  --i-classifier databases/silva-132-99-515-806-nb-classifier.qza \
  --i-reads fig2a/rep-seqs.qza \
  --o-classification fig2a/q2taxonomy.qza \
  --verbose
Traceback (most recent call last):
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "", line 2, in classify_sklearn
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 225, in bound_callable
    spec.view_type, recorder)
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/result.py", line 287, in _view
    result = transformation(self._archiver.data_dir)
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/core/transform.py", line 70, in transformation
    new_view = transformer(view)
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_feature_classifier/_taxonomic_classifier.py", line 64, in _1
    % (sklearn_version, sklearn.__version__))
ValueError: The scikit-learn version (0.20.2) used to generate this artifact does not match the current version of scikit-learn installed (0.19.1). Please retrain your classifier for your current deployment to prevent data-corruption errors.

Plugin error from feature-classifier:

The scikit-learn version (0.20.2) used to generate this artifact does not match the current version of scikit-learn installed (0.19.1). Please retrain your classifier for your current deployment to prevent data-corruption errors.

See above for debug info.

I am thinking about 2 options:

  1. Install qiime2 2019.1 and re-try to see if the error goes away (running the install now…)
  2. Use the classifier that I trained with SILVA132 myself.

Option 2: Tried this and it failed too
$ time qiime feature-classifier classify-sklearn \
  --i-reads fig2a/rep-seqs.qza \
  --i-classifier databases/silva132-taxonomy99.qza \
  --o-classification fig2a/q2taxonomy.qza \
  --verbose
Traceback (most recent call last):
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "", line 2, in classify_sklearn
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 199, in bound_callable
    self.signature.check_types(**user_input)
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/core/type/signature.py", line 301, in check_types
    name, kwargs[name].type, spec.qiime_type))
TypeError: Parameter 'classifier' received an argument of type FeatureData[Taxonomy]. An argument of subtype TaxonomicClassifier is required.

Plugin error from feature-classifier:

Parameter 'classifier' received an argument of type FeatureData[Taxonomy]. An argument of subtype TaxonomicClassifier is required.

See above for debug info.

Then I thought…wait perhaps the classifier requires a FeatureData[Sequence] artifact
So I did…

$ time qiime feature-classifier classify-sklearn \
  --i-reads fig2a/rep-seqs.qza \
  --i-classifier databases/silva132-db99.qza \
  --o-classification fig2a/q2taxonomy.qza \
  --verbose
Traceback (most recent call last):
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "", line 2, in classify_sklearn
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 199, in bound_callable
    self.signature.check_types(**user_input)
  File "/Users/papa/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/core/type/signature.py", line 301, in check_types
    name, kwargs[name].type, spec.qiime_type))
TypeError: Parameter 'classifier' received an argument of type FeatureData[Sequence]. An argument of subtype TaxonomicClassifier is required.

Plugin error from feature-classifier:

Parameter 'classifier' received an argument of type FeatureData[Sequence]. An argument of subtype TaxonomicClassifier is required.

See above for debug info.

And it failed too :disappointed:

While I wait for the install to complete (I hope it uses the current scikit-learn version, 0.20.2), shouldn’t option 2 work?
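The two TypeErrors indicate that classify-sklearn expects a trained TaxonomicClassifier artifact, not the raw reference sequences or the reference taxonomy. A minimal sketch of the missing training step follows, assuming the two SILVA artifacts are FeatureData[Sequence] and FeatureData[Taxonomy]; the wrapper name `train_nb_classifier` is made up for illustration.

```shell
# train_nb_classifier REF_READS_QZA REF_TAXONOMY_QZA OUTPUT_CLASSIFIER_QZA
# Trains a naive Bayes classifier from reference reads and taxonomy.
# Assumption: inputs are FeatureData[Sequence] and FeatureData[Taxonomy].
train_nb_classifier() {
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$1" \
    --i-reference-taxonomy "$2" \
    --o-classifier "$3"
}
```

For example, `train_nb_classifier databases/silva132-db99.qza databases/silva132-taxonomy99.qza databases/silva132-nb-classifier.qza` (the output name is hypothetical) would produce an artifact to pass to `--i-classifier`. Training in the same environment that runs classify-sklearn also avoids the scikit-learn version mismatch, though training on a full SILVA reference can take considerable time and memory.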

Thanks.


#8

Hi @Nicholas_Bokulich,

My qiime 2 2019.1 install was ok :ok_hand: I decided to keep qiime 2 2018.11 as well (in case I run into any issues).

I read up on some conflicts that might arise when running two conda environments (https://github.com/conda/conda/issues/3580) and saw that it can be done. I had a terminal window with qiime 2 2018.11 active, so I started a new session (in a new window) and activated qiime 2 2019.1.

Then I restarted my feature classifier:

$ time qiime feature-classifier classify-sklearn \
  --i-reads fig2/rep-seqs.qza \
  --i-classifier databases/silva-132-99-515-806-nb-classifier.qza \
  --o-classification fig2/q2taxonomy.qza \
  --verbose

The job seems to be running, but when I run top, I don’t see python anywhere in the running processes.

Maybe I’m being paranoid, but the same classification, with the same samples but using consensus-blast, took about 17 minutes.

Is this behavior ok for classify-sklearn?

Please help!


(Nicholas Bokulich) #9

@halt_BB,
You can definitely run two parallel environments, no problem. The sklearn classifier should take about as long as (or faster than) blast-LCA, but there are exceptions to the rule. In general, an error will be produced if something bad happens, e.g., a memory error or out of disk space… so my guess is it is just still running.