Classifier training parameters

I’m taking a first whack at training my custom COI database and am following the tutorial guidelines. One caveat: I’m training two separate databases, one that contains all ~2 million arthropod records (ALL) and another that contains only references with at least order-level taxonomic information (O-plus).

There are actually four training sessions happening, two on each database: for each database input, I’m testing a trimmed and an untrimmed reference set. The trimming is still ongoing, but the untrimmed references have finished training and I’ve generated output.

Regarding that output: am I correct, from the classifier’s documentation, that there is a default number of features that get trained? I was surprised after exporting the equivalent of the --o-classification taxonomy.qza file from this part of the tutorial: I was expecting a list of 2 million taxa, but instead I get just over 10,000.

Almost identical numbers of trained taxa are present in both the ALL and O-plus outputs, which makes me think there is a default I’m not picking up. I thought perhaps it might be --p-feat-ext--n-features, but that isn’t the default number I’m seeing.

In addition to this specific question, I was wondering what parameters experienced users might suggest tweaking. Computational resources or time aren’t the issue here - I’d prefer accuracy even if it takes a week.

thanks!


You are describing two different things here. Looks like you are linking to the classifier fitting method, which trains a classifier using your reference sequences. The --o-classification is the output of classify-sklearn, which uses the trained classifier to classify all input sequences. So no, the output of this should contain classifications for all of the query sequences that you are inputting.

"Features" means something different there. Your sequences (features in QIIME 2 lingo) are being chopped up into kmers (the features used for classifier training). So no, there is not a subset being used for training.
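
For reference, the kmer feature-extraction settings are exposed on the fitting command; a minimal sketch with placeholder file names (and [7,7], i.e., 7-mers, is the default ngram range, as I recall):

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-tax.qza \
  --p-feat-ext--ngram-range '[7,7]' \
  --o-classifier classifier.qza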

Without benchmarking on your COI data I would discourage playing with the default parameters too much. These are rational settings that worked optimally for both 16S and ITS sequences. Adjusting them can cause unexpected issues. E.g., increasing confidence will increase precision (fewer false-positives), but at the expense of higher underclassification! See the original publication for q2-feature-classifier for more details.
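
If you do experiment, note that confidence is set at classification time, not training time; a sketch (the 0.9 value and the file names are just illustrative):

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads query-seqs.qza \
  --p-confidence 0.9 \
  --o-classification taxonomy.qza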

I hope that helps!

Sorry for struggling with this…

I want to assign taxonomy to my reads, say myguanoseqs.qza.

Am I correct that I start by training the classifier first, using my reference sequences, say COIdbSeqs.qza, and reference taxonomy, COIdbTax.qza?

To do this, I run

qiime feature-classifier fit-classifier-naive-bayes

with the COIdb*.qza as inputs, and generate a classifier.qza file.

The next step in the tutorial talks about testing the classifier. Maybe that’s where I’m mistaken. I wanted to do that, but I also want to actually classify my guano sequences, myguanoseqs.qza.
I have mock communities as well. Is each of these tests just a separate input to this function?

qiime feature-classifier classify-sklearn

My concern about doing this wrong arose because, thus far, I’ve only supplied the reference sequences as my input for all steps (either 2 million or about 1.6 million, depending on the database), yet in each case I got back a test result with ~10,000 sequences… If I trained with 2 million sequences and test with 2 million, shouldn’t my results include 2 million comparisons?

Thanks for the help!

ps. the full code executed for a particular database was as follows:

REFSEQ=/path/to/COIdbSeqs.qza
REFTAX=/path/to/COIdbTax.qza

## train the classifier
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads "$REFSEQ" \
  --i-reference-taxonomy "$REFTAX" \
  --o-classifier classifier_all_raw.qza

## test the classifier
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads "$REFSEQ" \
  --o-classification classifierTax_all_raw.qza

## export data for analysis
qiime tools export --input-path classifierTax_all_raw.qza --output-path classifierTax_all_raw
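
(To count what came back, I’ve just been looking at the exported file directly; as I understand it, the export writes a taxonomy.tsv with a one-line header:)

## count and peek at the exported assignments
wc -l classifierTax_all_raw/taxonomy.tsv
head classifierTax_all_raw/taxonomy.tsv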

"test" may be a misleading term here. You definitely should NOT input the sequences used to train as input (I mean you could but it is meaningless and technically incorrect from a "testing" standpoint).

The input here should be some other sequences, whether these are unknown query sequences or whether you are testing with a mock community.

That should not be happening — you should get the same # of classifications as you input query sequences. Could you share the inputs and outputs? cc: @BenKaehler

Thanks!

Am I correct, then, in thinking that what I should have done after training the classifier is input a different set of representative sequences, like my mock community?

I take it this is where the tax-credit piece comes in, where we can look at both cross-validated and mock-community means to evaluate our taxonomy assignments?

Given that I have multiple mock communities (3 different ones, though several mock members overlap among the 3), this seems like what I should be doing. My only worry is that each of these mock communities has only about 20-25 members, so it’s not a particularly long list to evaluate classification against (though they are quite diverged taxa).

Correct. So any sort of query but not reference sequences used to train that classifier.

Correct — at least for cross-validated. Mock communities you can test directly with QIIME 2, using q2-quality-control to evaluate accuracy for each mock community.
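
With q2-quality-control, that evaluation might look something like this (file names are placeholders; the expected table is your known mock composition and the observed table is the classifier’s output, collapsed to the same taxonomic level):

qiime quality-control evaluate-composition \
  --i-expected-features mock-expected.qza \
  --i-observed-features mock-observed.qza \
  --o-visualization mock-eval.qzv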

Yep, that's the problem with mock communities. This is where cross-validation and tax-credit come into play...

Thanks @devonorourke and @Nicholas_Bokulich!

I agree with Nick that testing your classifier on your reference data won’t tell you much (like classification accuracy) unless you use full cross validation (like we did for q2-clawback). As a unit test (used to detect whether basic functionality is correct), though, it’s not terrible. It appears to have done its job here: you should be getting classification for many more of your original reference sequences.

The sort of testing that you should be doing depends on the questions you’re trying to answer. If you’re wondering whether this approach works at all for your data, then the mock communities are great and q2-quality-control should tell you what you need to know. If you want to train hyperparameters, full cross validation is probably better.

Could you please share your reference data set with us so that we can debug why you’re seeing so few classified sequences?

Thanks!

Crap. I have everything backwards... Three things:

  1. @BenKaehler and @Nicholas_Bokulich should have received a direct message from me with a link to the reference sequences and taxonomy files. Hopefully this helps with troubleshooting.

  2. I was wrong with regards to which of my classifier tests actually finished:

The classifier script that finished was for the trimmed references, not for the untrimmed. It was this trimmed dataset that is apparently vastly smaller: just 10,073 sequences as opposed to 2 million. The input into classifier training, and the eventually finished classifier test that I thought had too few reads (the ~10k reads, not the 2 million), both came from this trimmed dataset. So, as you'd expect, my subsequent trained classifier and input read set were both just ~10k long. No technical glitches, just a dumb user.

Yet I'm unclear why so few sequences made the cutoff - the parameters I used were:

qiime feature-classifier extract-reads \
  --i-sequences "$REFSEQ" \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer GGWACTAATCAATTTCCAAATCC \
  --p-trunc-len 181 \
  --p-min-length 160 \
  --p-max-length 220 \
  --o-reads ref_seqs_all_trim.qza

I expect that my amplicons are all about 180 bp long, so I chose those parameters to try to target the correct fragment sizes. Perhaps I'm making a mistake with how the forward and reverse primers are implemented. Are the f-primer and r-primer both supposed to be in the 5' --> 3' orientation? Or is the r-primer supposed to be the reverse complement, as with most read-trimming programs? It's not clear in the documentation, but perhaps that's just one of those things you're supposed to know.

  3. Am I correct that neither the extract-reads nor the fit-classifier-naive-bayes methods are multithreaded?

Thanks!


Glad you sorted out the sequence count issue @devonorourke!

Either many of your reference sequences are not hit by the primers, are outside of the min/max length, or are in mixed orientations.
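
To see how many sequences actually survive extraction, you can tabulate them; a minimal sketch, reusing the output name from your command above:

qiime feature-table tabulate-seqs \
  --i-data ref_seqs_all_trim.qza \
  --o-visualization ref_seqs_all_trim.qzv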

Yes

No

Good point! That would help. I have opened an issue to clarify the docs.

Correct.

It looks like I set my primers in the expected orientations. I should have looked harder at your actual Python code, which shows that you take the reverse complement of the reverse primer...

I think where the results got wonky was my max-length parameter, like you suggested. It looks like the span my primers bound is usually 229 bases, and I specified the max to be 220.

Just to clarify, say I have a reference like this:

NNNNN[forwardPrimermatch]{coiAmplicon}[reversePrimermatch]NNNN

The total length from the left edge of [forwardPrimermatch] to the right edge of [reversePrimermatch] is 229 bases. The {coiAmplicon} internal span is 181 bases.

I set my parameters initially as:

qiime feature-classifier extract-reads \
  --p-trunc-len 181 \
  --p-min-length 160 \
  --p-max-length 220

But perhaps that was a mistake. I thought the min/max length referred to the amplicon itself that remains after trimming the primers, the {coiAmplicon} region. At least, that seems to be the idea in the documentation:

 --p-min-length INTEGER RANGE    Minimum amplicon length. Shorter amplicons
                                  are discarded. Applied after trimming and
                                  truncation, so be aware that trimming may
                                  impact sequence retention.

If that's correct, then perhaps the problem was the trunc-len parameter? I think I shouldn't have used this in the first place and should have just left it at zero.

Yep, everything you have written is correct (and you can look further down in the Python code you linked to to see where trimming occurs) — though I am not certain if trunc-len is the issue. I'd suggest giving it a try (dropping that param) and seeing what happens.
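
That re-run might look something like this (same primers as before; leaving out --p-trunc-len should fall back to no truncation, which I believe is its default, and the raised max length and new output name are just my guesses at sensible values):

qiime feature-classifier extract-reads \
  --i-sequences "$REFSEQ" \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer GGWACTAATCAATTTCCAAATCC \
  --p-min-length 160 \
  --p-max-length 500 \
  --o-reads ref_seqs_all_trim2.qza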

I meant the read orientation, not the primer orientation. If your reference sequences are in mixed orientations then the primers will only hit those that are complementary, not the reverse complement.

Just curious if there is any idea out there for how long one might expect a job to take if you’re trying to classify 2 million sequences with the naive-Bayes approach. This thread made it sound like I could be waiting a long time.

I know that the 2 million sequences are unique, but their associated taxonomies are not necessarily unique (i.e., one taxon can be represented multiple times by distinct sequences). Perhaps I should try to pare down the number of samples.

When the classifiers were trained for microbiome communities, how many samples were included?

Yep, it will take a long time for that many. I do not know off-hand... but we benchmarked classification time in the q2-feature-classifier paper if you want to take a look there (read length is another factor not accounted for there so your results may differ). Multithreading is available for classification though!
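
For the classification step, that's the --p-n-jobs flag; a sketch using your file names from earlier (the output name and thread count are just examples):

qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --p-n-jobs 8 \
  --o-classification guano_taxonomy.qza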

That's fine. That's a good thing.

This is why we still often use OTU clustering of reference databases — to reduce size/complexity while still retaining most information. That's just an aside, not a recommendation at this stage... I believe we've already had that conversation on a separate topic.

Number of samples does not matter. It is only the # of sequences (and also sequence length, etc., hence why extracting reads is helpful).

Sorry, you may be waiting for many hours! Try multithreading to speed things up a bit.

I'm going in circles I think.

Multithreading is available for classification with vsearch, but it's not available in the qiime feature-classifier fit-classifier-naive-bayes method, right? That's the part I'm wondering about, as the job has been running for about 36 hours (happy to wait 360 if it needs to, just wondering how long to expect). The runtime estimate, I think, is from Fig. 5b in your paper, which suggests my 2 million reads should only take about 12 hours. Like you mentioned, it's going to vary by read length, but most of my references are about 650 bp. I'm not sure if your calculations were based on full-length 16S or a portion thereof, but my sequences are likely no longer than yours, so I'm concerned something is going on with my parameters (though all I used were the bare-bones defaults):

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads refseqs.qza \
  --i-reference-taxonomy reftax.qza \
  --o-classifier classifier_all_raw.qza

Here's hoping you can invoke some holiday cheer by pointing out the errors of my ways and help me discover how to implement multithreading with the above command :christmas_tree:

Oh, you said classification, so I thought you were talking about classify-sklearn. That can be parallelized, but not training.

That's for classification only, not training. Training is usually a bit faster... but 2 million is more than I have tried previously!

Sorry, I think you'll just need to wait it out :frowning: Please let us know how long it takes, for future reference!

It appears the training finished after about 48 hours.
The classifier hasn’t been working, though; it keeps running out of memory. I’m unclear which parameter might be helpful (is it the batch size?). Most of our compute nodes have 128 GB RAM with up to 24 threads, but it’s failed even on our big nodes with up to 800 GB of RAM.

Thanks!

:champagne:

48 hr doesn't sound too bad, given the massive size of this dataset.

Not surprising, given the massive size of the dataset... SILVA is the largest database I work with, often needs ~32 GB to classify a large-ish # of query seqs, and contains only 700k ref seqs. So your database would take somewhat more memory...

There's lots of advice on the forum already regarding memory issues with this. Adjusting batch size will help. Don't run with too many jobs — each job reads a new instance of the classifier into memory, so if you have 128 GB total on the node you will max out quickly. Talk to the system admins to see if you can access a high-memory node to tackle this job... or else use fewer jobs and wait it out.
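
Concretely, a memory-conscious run might look something like this (the batch size is just a starting point to experiment with; smaller batches hold fewer query sequences in memory at once):

qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --p-n-jobs 1 \
  --p-reads-per-batch 1000 \
  --o-classification guano_taxonomy.qza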

I hope that helps!

Am I understanding this right:
Each node used the max memory?
We use SLURM for managing our jobs and I can configure the number of CPUs and memory allocation directly (I don’t need a sysadmin to request memory; the last job that failed was using the 1 TB RAM node).
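
For what it’s worth, the submission looks roughly like this (the partition name and resource numbers are placeholders for whatever I request):

#!/bin/bash
#SBATCH --partition=highmem    ## placeholder partition name
#SBATCH --cpus-per-task=1      ## single job, so only one classifier copy sits in memory
#SBATCH --mem=800G

qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --o-classification guano_taxonomy.qza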

I’ll look into the links you’ve provided and see if anyone has experience with using SLURM to manage this particular task.

Thanks