Feature-classifier classify-sklearn: all rep seqs 'unassigned'

Hi @Nicholas_Bokulich,
You suggested trimming with the extract-reads command. I have done this with the developer sequences of UNITE 7.2 before, but I realized that of the 58,049 developer sequences provided as input, only 8,659 sequences remained in the output artifact. I assume that most of the ITS1 primer targets (f-primer below) are not present in the original UNITE developer sequences (see my screenshot above). I could not find an optional parameter telling the script to trim on one side only when just one of the two primers is detected.

qiime feature-classifier extract-reads \
  --i-sequences UNITE_7-2_97_dynamic-dev.qza \
  --p-f-primer CTTGGTCATTTACAGGAAGTAA \
  --p-r-primer GCTGCGTTCTTCATCGATGC \
  --o-reads UNITE_7-2_97_dynamic-dev_ITS1F-ITS2_ref-seqs.qza

For this reason, I decided to trim the developer sequences outside QIIME 2.
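In case it is useful to others, one way to do such one-sided trimming outside QIIME 2 is cutadapt, which can anchor on the 5' (forward) primer alone. A minimal sketch, assuming the exported FASTA lands at the usual dna-sequences.fasta path (output names below are placeholders, not necessarily what I used):

qiime tools export \
  --input-path UNITE_7-2_97_dynamic-dev.qza \
  --output-path unite-dev-export

# Trim the forward primer and everything 5' of it; sequences without a
# primer hit are kept unchanged unless --discard-untrimmed is added.
cutadapt \
  -g CTTGGTCATTTACAGGAAGTAA \
  -o unite-dev-trimmed.fasta \
  unite-dev-export/dna-sequences.fasta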


Hi @Nicholas_Bokulich and @arwqiime,

I finally got some time to look at your original problem. Earlier, when I didn't have access to my computer, I answered some of your smaller questions, which is why I initially skipped your first set of questions. Sorry for the confusion.

Anyway, Nick was right that the first 100 reads are used to guess the read orientation. Most of the time this works, but clearly we haven't tested it with your custom reference data set. The outcome looks random because the first 100 reads of the combined data set (set C) differ from those of both the A and B sets, and, as it happens, they are just the right reads to fool the auto-orientation heuristic. The workaround is to force the orientation, as you have discovered.
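For reference, the orientation can be forced with the --p-read-orientation option of classify-sklearn. A minimal sketch (the artifact names are placeholders for your own files):

qiime feature-classifier classify-sklearn \
  --i-classifier unite-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-read-orientation same \
  --o-classification taxonomy.qza

Use reverse-complement instead of same if your reads run opposite to the reference orientation.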

As we found in our benchmarks, trimming the UNITE reads does not work spectacularly well, so my first recommendation is to stop trimming. In our tests the algorithm was quite robust to extraneous sequence regions outside your primers being included in the reference data, and this should not increase classification time or memory overhead.

My second recommendation is to use the full 99% UNITE data set. Is there a reason that you’ve restricted your data set to the 97% OTUs?

My third recommendation only occurred to me this afternoon, when I too ran into memory issues while running classify-sklearn. Try setting --p-reads-per-batch to a small number, say 1000. When there are many reads in your samples (I ran into this problem with > 500,000), the default batch size can cause memory problems. This is in addition to making sure that --p-n-jobs is not set too high.
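Something like the following should work (again, artifact names are placeholders); each batch of 1000 query sequences is classified before the next one is loaded, which keeps the memory footprint down:

qiime feature-classifier classify-sklearn \
  --i-classifier unite-classifier.qza \
  --i-reads rep-seqs.qza \
  --p-reads-per-batch 1000 \
  --p-n-jobs 4 \
  --o-classification taxonomy.qza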

Hope that helps,
Ben


Ah, right, I had forgotten: we recommend against trimming UNITE ITS sequences for exactly that reason.

Very glad to hear that forcing read orientation fixed this issue for you @arwqiime!

Hi @BenKaehler and @Nicholas_Bokulich,
Thank you for your suggestions, in particular for the option to force read orientation.

I used the dynamic UNITE reference data set because of its "dynamic use of clustering thresholds (…) These choices were made manually by experts of those particular lineages of fungi". This statement on the UNITE website sounds great, but I will also compare the results against the 99% clustering.

I had already thought about setting the --p-reads-per-batch parameter, but I had difficulties translating the autoscaled default (number of query sequences / n_jobs) into a meaningful integer value. I had a total of 6.5 million input sequences, and with 10 CPUs this works out to 650,000 reads per batch. Since you experienced memory problems with > 500,000 reads, it may well be that the default setting caused the memory issue. In any case, it is good to now have a number (1,000) as an orientation for the future.
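To make the arithmetic explicit, a quick sanity check in the shell (numbers taken from my run):

# Autoscaled default batch size = total query sequences / n_jobs
TOTAL_READS=6500000
N_JOBS=10
echo $((TOTAL_READS / N_JOBS))   # 650000, above the > 500,000 that caused trouble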

Thanks again for your great support!

