Feature-classifier classify-sklearn: all rep seqs 'unassigned'

Hello!
I would like to add another short question related to the classify-sklearn output and the subsequent taxa bar-plot visualization:
We would like to screen our samples for pathogenic fungi and oomycetes. We generated amplicons with fungi-specific primers as well as with modified primers known to amplify oomycota ITS regions in addition to fungal ITS regions. For taxonomic classification we used naive Bayes classifiers trained on UNITE 7.2. However, since these reference data sets contain only a few oomycota sequences, we combined the UNITE data with oomycota data from BOLD (Barcode of Life Data System) and generated a second ‘combined UNITE-BOLD classifier’.
A classify-sklearn analysis followed by ‘metadata tabulate’ and ‘taxa barplot’ showed no oomycota reads in our dataset using the combined classifier. To test the oomycota classifier, I re-ran the classify-sklearn analysis using a classifier consisting only of the BOLD oomycota data; this somehow ‘forced’ taxonomic annotations to oomycota sequences. Since this worked, I assumed that the classifiers were OK.

Since I could not identify oomycota reads in the CSV table of the taxa barplot using the combined UNITE-BOLD classifier, I conclude that there are no oomycetes ‘detectable’ in my amplicon sample. Is this conclusion correct?
If I want to test such non-standard classifiers, would it be advisable to add a certain number of known oomycota reads to the samples as ‘spike-ins’ to test the classifiers?

Hi, I ran into more problems with qiime feature-classifier classify-sklearn:
In addition to deblur, I wanted to analyse the ITS data (see taxa barplot from above) using
qiime vsearch cluster-features-open-reference ...
qiime vsearch uchime-denovo ...
qiime feature-table filter-features ...
Followed by
qiime feature-classifier classify-sklearn ...
However the last command resulted in memory errors (see: qiime2-q2cli-err-iy124g83.txt).

I updated from 2018.6 to 2018.8 and restarted classify-sklearn; I got another memory error, though a shorter one than the first (see qiime2-q2cli-err-tkph9_5b.txt).

My machine has the following installed:
KiB Mem : 49461372 total, 48903796 free, 371284 used, 186292 buff/cache
KiB Swap: 2097148 total, 1982504 free, 114644 used. 48729756 avail Mem
I assume this should be enough!

I renamed the log files from log to txt to allow upload...
qiime2-q2cli-err-tkph9_5b.txt (3.5 KB)
qiime2-q2cli-err-iy124g83.txt (98.8 KB)

Thanks for any suggestions!

Thanks @arwqiime, those are some really interesting experiments.

Are you using the “developer” UNITE sequences? The standard sequences come pre-trimmed so your special oomycota-specific regions may not be in the reference. See the tutorial for details.

Taxonomic classification of amplicons can only be as good as the information in the amplicons. Classification is not perfect. Our recent work has focussed on trying to resolve the inevitable ambiguities.

A classifier can only classify among the classes it has seen, which is why you were able to force classification by training on oomycota only.

You can’t conclude that there are no detectable oomycetes in your sample, unfortunately. Spike-ins could help. A cheaper method would be to look at what the oomycetes were misclassified as (by comparing results from your BOLD and UNITE-BOLD classifiers) and eyeballing the reads, the reference sequences that they were misclassified as, and the oomycota reference sequences, to see if there are any obvious reasons for the confusion.

In relation to the memory issue, try running classify-sklearn with --p-n-jobs 1. If you’ve got lots of CPUs, classify-sklearn might be loading many copies of a large classifier into memory.
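For example, something along these lines (the artifact names below are just placeholders for your own files):

# single-threaded run to keep only one copy of the classifier in memory
qiime feature-classifier classify-sklearn \
  --i-classifier your-classifier.qza \
  --i-reads your-rep-seqs.qza \
  --p-n-jobs 1 \
  --o-classification taxonomy.qza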

Thanks @BenKaehler, I assume that I have asked too many questions, which led to some confusion. Let me try to break it up into independent questions:

No. 1: My initial question was related to getting a good taxonomic classification when using amplicon data from independent runs (indicated as A and B, respectively), and the surprise that combining all runs in one analysis (A+B=C) did not result in any taxonomic classification. The same bacterial (not fungal) classifier (silva_132_97_16S_341F-805R_classifier.qza) was used in each case.

I still could not find out why.

The other question was related to ITS amplifications from the same experiment, which gave me a good taxonomic classification when using a classifier made from UNITE developer sequences plus some BOLD oomycota sequences, but one different from the BOLD classifier alone.

In fact, I wanted to use the full UNITE developer files: I imported the fasta file and used it together with the taxonomy artifact file in 'qiime feature-classifier fit-classifier-naive-bayes', but I received the following error message:
ValueError: Invalid character in sequence: b't'.
Valid characters: ['S', 'H', 'D', 'G', '.', 'N', 'W', 'Y', 'M', 'R', 'T', 'V', 'B', 'K', 'A', '-', 'C']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

I inspected the sequences in my default sequence analysis software (Geneious) and could not find errors. Since the sequences were quite long, I decided to trim them in Geneious using the primers used for amplification (ITS1F / ITS2; see screenshot below). Most sequences are too short in the SSU region and the ITS1F sequence could not be detected, but the ITS2 region could easily be detected and the downstream region trimmed (red boxes). Therefore, I used the trimmed sequences, exported them as fasta, and constructed the classifiers described above.

Following your reply today, I decided to test the original UNITE developer sequences, too. In order to avoid the error above, I imported the fasta into Geneious and re-exported it as fasta without any modification. When using this re-exported fasta file in QIIME 2 to make a classifier, the error message is gone. No idea what might have gone wrong, but the 'original' UNITE fasta file does not work in my hands...

[Screenshot: 20180914_Geneious-trim]

I will test this...
I will double check it on a machine with 192 GB RAM and more jobs to see if this is indeed related to my local machines.

Hi @arwqiime,
We are still looking into your initial question — sorry for the delay on this but we have been backed up this week and this is a complex problem. We will keep you posted.

I can answer your other questions though:

It sounds like BOLD must contain sequences covering clades not covered by UNITE, and vice versa. So they complement each other and improve performance on your dataset.

We've run into this before — the developer sequences do contain some lowercase characters. This causes problems with scikit-bio, the program QIIME 2 uses for handling these sequence data.

The fix is easy — just find and replace lowercase with uppercase using your preferred method.
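For example, one quick option is an awk one-liner that upper-cases the sequence lines but leaves the FASTA headers untouched (the file names here are just placeholders for your developer fasta):

# print header lines (starting with '>') unchanged, upper-case everything else
awk '/^>/ {print; next} {print toupper($0)}' unite_dev.fasta > unite_dev_upper.fasta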

Based on the fact that the problem disappeared after exporting from Geneious, my guess is this program already converts lower->uppercase on import.

Note that you can also do this trimming within QIIME 2 with feature-classifier extract-reads; just noting in case you are curious (the only advantage is that the trimming would then be stored in provenance, so you would have a better record of the workflow you used).

It really is just lowercase characters in the sequences. Very silly problem to have but an easy fix (and we have an open issue somewhere to streamline this by converting lowercase->uppercase on import)

I hope that helps! Stay tuned for more details on problem #1!

@arwqiime,
I think I have a pretty good guess what’s going on, but I will need you to run some tests to confirm.

classify-sklearn should not produce stochastic results like you describe, unless the classifier is getting confused about the orientation of your reads. Read orientation can be set manually using the --p-read-orientation parameter. If it is not set manually, the classifier will attempt to determine the read orientation based on the first 100 reads or so (I can’t remember the exact number).

This almost always works, unless the reads are in mixed orientations or contain non-biological sequence information (or at least sequence information not contained in the reference sequences, e.g., primers that were not trimmed from the query).

Could you try the following?

  1. Re-run classify-sklearn with --p-read-orientation set to same and then set to reverse-complement (see the example command after this list). Please post the QZVs of the results here and let us know whether that is a clear fix! No need to proceed to steps 2/3 if setting the orientation manually works.
  2. Confirm that your sequences are not in mixed orientations (it looks like they are not, based on cursory examination, but I do not know your sequencing protocol, so please let us know if they are mixed).
  3. Confirm that your query sequences are trimmed to the same sites as your reference sequences and do not contain untrimmed primers or non-biological DNA that is not present in the reference sequences.
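
For step 1, a minimal sketch of the command (the rep-seqs artifact name is a placeholder for your own representative sequences; your classifier name may also differ):

# force the classifier to treat reads as being in the same orientation as the reference
qiime feature-classifier classify-sklearn \
  --i-classifier silva_132_97_16S_341F-805R_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-read-orientation same \
  --o-classification taxonomy_same-orientation.qza

Then repeat with --p-read-orientation reverse-complement and compare the two results.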

If that doesn’t work, there could be an issue with the classifier that you trained, but that seems less likely at this moment.

Please give that a try and let us know if that solves this issue!

Hi @Nicholas_Bokulich,
Great! The `--p-read-orientation same` option seemed to solve my problem. See the attached visualization.

Nevertheless, I will inspect my reads for any non-biological sequences; they should be free of primer sequences, but I will check whether this is really true.

Thanks again for your really great support!

6b2_Lulo1-Bact1-Mi_deblur-149_silva_132_97_16S_341F-805R_classification.qzv (1.3 MB)

Hi @Nicholas_Bokulich,
you suggested trimming with the extract-reads command. I had done this with the developer sequences of UNITE 7.2 before, but I realized that of the 58,049 developer sequences used as input, only 8,659 sequences remained in the output artifact. I assume that most of the ITS1 primer targets (f-primer below) are not present in the original UNITE developer sequences (see my screenshot above). I could not find any optional parameter to tell the script to trim on only one side if only one primer could be detected.

qiime feature-classifier extract-reads \
  --i-sequences UNITE_7-2_97_dynamic-dev.qza \
  --p-f-primer CTTGGTCATTTACAGGAAGTAA \
  --p-r-primer GCTGCGTTCTTCATCGATGC \
  --o-reads UNITE_7-2_97_dynamic-dev_ITS1F-ITS2_ref-seqs.qza

For this reason, I decided to trim the developer sequences outside Qiime2.

Hi @Nicholas_Bokulich and @arwqiime,

I finally got some time to look at your original problem. I answered some of your smaller questions when I didn’t have access to my computer, which is why I initially ignored your first set of questions. Sorry for the confusion.

Anyway, Nick was right about the first 100 reads being used to guess the read orientation. Most of the time it works, but clearly we haven’t tested it with your custom reference data set. The outcome looks random because the first 100 reads of the combined data set (set C) differ from the first 100 reads of either the A or B sets, and, as it happens, they are just the right reads to fool the auto-orientation heuristic. The workaround is to force the orientation, as you have discovered.

As we found in our benchmarks, trimming the UNITE reads does not work spectacularly well, so my first recommendation is to stop trimming. In our tests the algorithm was quite robust to extraneous sequence from outside your primers being included in the reference data, and this will not increase classification time or memory overhead.

My second recommendation is to use the full 99% UNITE data set. Is there a reason that you’ve restricted your data set to the 97% OTUs?

My third recommendation only occurred to me this afternoon when I too ran into some memory issues when running classify-sklearn. Try setting --p-reads-per-batch to a small number, say 1000. When there are many reads in your samples (I ran into this problem when I had > 500,000), it can cause memory issues. This is in addition to making sure that --p-n-jobs is not too high.
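
As a rough sketch (artifact names below are placeholders for your own files):

# small batches plus a single job keep peak memory low
qiime feature-classifier classify-sklearn \
  --i-classifier your-classifier.qza \
  --i-reads your-rep-seqs.qza \
  --p-n-jobs 1 \
  --p-reads-per-batch 1000 \
  --o-classification taxonomy.qza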

Hope that helps,
Ben

Ah right, I had forgotten: we recommend against trimming UNITE ITS sequences for that reason.

Very glad to hear that forcing read orientation fixed this issue for you @arwqiime!

Hi @BenKaehler and @Nicholas_Bokulich,
Thank you for your suggestions, in particular for the option to force read orientation.

I used the dynamic UNITE reference data set because of the “dynamic use of clustering thresholds (…) These choices were made manually by experts of those particular lineages of fungi”. This statement on the UNITE website sounds great, but I will also compare it to the 99% clustering.

I had already thought about setting the --p-reads-per-batch parameter, but I had difficulties translating the autoscaled default setting (number of query sequences / n_jobs) into a meaningful integer value. I had a total of 6.5 million input sequences, and with 10 CPUs this would come to 650,000 reads per batch. Since you experienced memory problems with > 500,000 reads, it could well be that the default setting caused the memory issue. But it is good to now have a number (1,000) as an orientation for the future.

Thanks again for your great support!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.