Classifier training parameters

Sorry for struggling with this…

I want to assign taxonomy to my reads, say myguanoseqs.qza.

Am I correct that I start by training the classifier, using my reference sequences, say COIdbSeqs.qza, and reference taxonomy, COIdbTax.qza?

To do this, I run

qiime feature-classifier fit-classifier-naive-bayes

with the COIdb*.qza as inputs, and generate a classifier.qza file.

The next step in the tutorial talks about testing the classifier. Maybe that's where I'm mistaken. I wanted to do that, but I also want to actually classify my guano sequences, myguanoseqs.qza.
I have mock communities as well. Is each of these tests just a separate input to this function?

qiime feature-classifier classify-sklearn

My concern about doing this wrong comes from the fact that, thus far, I've only supplied the reference sequences as input for every step (either 2 million or about 1.6 million sequences, depending on the database), yet in each case I got back a test result with only 10,000 sequences… If I trained with 2 million sequences and tested with 2 million, shouldn't my results include 2 million classifications?

Thanks for the help!

P.S. The full code executed for a particular database was as follows:

REFSEQ=/path/to/COIdbSeqs.qza
REFTAX=/path/to/COIdbTax.qza

## train the classifier
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads "$REFSEQ" \
  --i-reference-taxonomy "$REFTAX" \
  --o-classifier classifier_all_raw.qza

## test the classifier
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads "$REFSEQ" \
  --o-classification classifierTax_all_raw.qza

## export data for analysis
qiime tools export --input-path classifierTax_all_raw.qza --output-path classifierTax_all_raw

"test" may be a misleading term here. You definitely should NOT input the sequences used to train as input (I mean you could but it is meaningless and technically incorrect from a "testing" standpoint).

The input here should be some other sequences, whether these are unknown query sequences or whether you are testing with a mock community.
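
For example, to classify the actual guano reads mentioned above with the classifier you already trained (the output filename here is just illustrative):

## classify the query reads, not the reference reads used for training
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --o-classification myguanoseqs_taxonomy.qza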

That should not be happening — you should get the same # of classifications as query sequences you input. Could you share the inputs and outputs? cc: @BenKaehler

Thanks!

Am I correct, then, in thinking that what I should have done after training the classifier is to input a different set of representative sequences, like my mock community?

I take it this is where the tax-credit piece comes in, where we can use both cross-validation and mock communities to evaluate our taxonomy assignments?

Given that I have multiple mock communities (I have 3 different mock communities, though several mock members overlap among the 3), this seems likely to be what I should be doing. My only worry is that each of these mock communities has only about 20-25 members, so it's not a particularly long list to evaluate classification against (though they are quite diverged taxa).

Correct. So any sort of query sequences, but not the reference sequences used to train that classifier.

Correct — at least for the cross-validated part. Mock communities you can test directly in QIIME 2, using q2-quality-control to evaluate accuracy for each mock community.
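
Roughly along these lines (the filenames are placeholders; the inputs need to be your expected and observed mock compositions as relative-frequency feature tables collapsed to matching taxonomic levels):

## compare observed vs. expected composition for one mock community
qiime quality-control evaluate-composition \
  --i-expected-features mock1_expected_taxa.qza \
  --i-observed-features mock1_observed_taxa.qza \
  --o-visualization mock1_evaluation.qzv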

Yep, that's the problem with mock communities. This is where cross-validation and tax-credit come into play...

Thanks @devonorourke and @Nicholas_Bokulich!

I agree with Nick that testing your classifier on your reference data won't tell you much (like classification accuracy) unless you use full cross-validation (like we did for q2-clawback). As a unit test (used to detect whether basic functionality is correct), though, it's not terrible. It appears to have done its job here: you should be getting classifications for many more of your original reference sequences.

The sort of testing that you should be doing depends on the questions you’re trying to answer. If you’re wondering whether this approach works at all for your data, then the mock communities are great and q2-quality-control should tell you what you need to know. If you want to train hyperparameters, full cross validation is probably better.

Are you please able to share your reference data set with us so that we can debug why you’re seeing so few classified sequences?

Thanks!

Crap. I have everything backwards... Three things:

  1. @BenKaehler and @Nicholas_Bokulich should have received a direct message from me with a link to the reference sequences and taxonomy files. Hopefully this helps with troubleshooting.

  2. I was wrong with regard to which of my classifier tests actually finished:

The classifier script that finished was for the trimmed references, not the untrimmed ones. It is this trimmed dataset that is apparently vastly smaller: just 10,073 sequences as opposed to 2 million. Both the input to classifier training and the eventually finished classifier test that I thought had too few reads (the ~10k reads, not the 2 million) came from this trimmed dataset. So, as you'd expect, my trained classifier and input read set were both only ~10k sequences. No technical glitches, just a dumb user.

Yet I'm unclear why so few sequences made the cutoff - the parameters I used were:

qiime feature-classifier extract-reads \
  --i-sequences "$REFSEQ" \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer GGWACTAATCAATTTCCAAATCC \
  --p-trunc-len 181 \
  --p-min-length 160 \
  --p-max-length 220 \
  --o-reads ref_seqs_all_trim.qza

I expect that my amplicons are all about 180 bp long, so I chose those parameters to try to target the correct fragment sizes. Perhaps I'm making a mistake with how the forward and reverse primers are supposed to be supplied. Are the f-primer and r-primer both supposed to be in the 5' --> 3' orientation? Is the r-primer supposed to be the reverse complement, like in most read-trimming programs? It's not clear in the documentation, but perhaps that's just one of those things you're supposed to know.

  3. Am I correct that neither the extract-reads nor the fit-classifier-naive-bayes functions are multithreaded?

Thanks!


Glad you sorted out the sequence count issue @devonorourke!

Either many of your reference sequences are not hit by the primers, are outside of the min/max length, or are in mixed orientations.

Yes, both primers should be supplied in the 5' --> 3' orientation.

No, the reverse primer should not be reverse complemented.

Good point! That would help. I have opened an issue to clarify the docs.

Correct, neither is multithreaded.

It looks like I set my primers in the expected orientations. I should have looked harder into your actual python code which shows that you take the reverse complement of the reverse primer...

I think where the results got wonky was my max-length parameter, like you suggested. It looks like the region my primers span is usually 229 bases, and I specified the max to be 220.

Just to clarify, say I have a reference like this:

NNNNN[forwardPrimermatch]{coiAmplicon}[reversePrimermatch]NNNN

The total length from the left edge of [forwardPrimermatch] to the right edge of [reversePrimermatch] is 229 bases. The {coiAmplicon} internal span is 181 bases.

I set my parameters initially as:

qiime feature-classifier extract-reads \
  --p-trunc-len 181 \
  --p-min-length 160 \
  --p-max-length 220 \

But perhaps that was a mistake. I thought the min/max lengths applied to the amplicon itself that remains after trimming the primers - the {coiAmplicon} region. At least, that seems to be the idea in the documentation:

 --p-min-length INTEGER RANGE    Minimum amplicon length. Shorter amplicons
                                  are discarded. Applied after trimming and
                                  truncation, so be aware that trimming may
                                  impact sequence retention.

If that's correct, then perhaps the problem was the trunc-len parameter? I think I shouldn't have used it in the first place and should have just left it at zero.

Yep, everything you have written is correct (and you can look further down in the Python code you linked to to see where trimming occurs) — though I am not certain whether trunc-len is the issue. I'd suggest giving it a try (dropping that param) and seeing what happens.
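
i.e., something along these lines, keeping your primers and length bounds but leaving trunc-len at its default of zero:

## re-extract reference reads without truncation
qiime feature-classifier extract-reads \
  --i-sequences "$REFSEQ" \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer GGWACTAATCAATTTCCAAATCC \
  --p-min-length 160 \
  --p-max-length 220 \
  --o-reads ref_seqs_all_trim.qza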

I meant the read orientation, not the primer orientation. If your reference sequences are in mixed orientations, then the primers will only hit the sequences that are in the matching orientation, not the reverse-complemented ones.

Just curious if there is any idea out there for how long one might expect a job to take if you’re trying to classify 2 million sequences with the naive-Bayes approach. This thread made it sound like I could be waiting a long time.

I know that the 2 million sequences are unique, but their associated taxonomies are not necessarily unique (i.e., one taxon can be represented multiple times by distinct sequences). Perhaps I should try to pare down the number of samples.

When the classifiers were trained for microbiome communities, how many samples were included?

Yep, it will take a long time for that many. I do not know off-hand... but we benchmarked classification time in the q2-feature-classifier paper if you want to take a look there (read length is another factor not accounted for there so your results may differ). Multithreading is available for classification though!
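
For example, something like this for the classification step (the job count here is arbitrary):

## classification (not training) can be run with multiple jobs
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --p-n-jobs 4 \
  --o-classification classifierTax_all_raw.qza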

That's fine. That's a good thing.

This is why we still often use OTU clustering of reference databases — to reduce size/complexity while still retaining most information. That's just an aside, not a recommendation at this stage... I believe we've already had that conversation on a separate topic.

The number of samples does not matter. It is only the # of sequences (and also sequence length, etc., hence why extracting reads is helpful).

Sorry you may be waiting for many hours! Try multithreading to speed things up a bit

I'm going in circles I think.

Multithreading is available for classification with vsearch, but it's not available in the qiime feature-classifier fit-classifier-naive-bayes method, right? That's the part I'm wondering about, as the job has been running for about 36 hours (happy to wait 360 if it needs to, just wondering how long to expect). The runtime estimate, I think, is from Fig. 5b in your paper, which suggests my 2 million reads should only take about 12 hours. Like you mentioned, it's going to vary by read length, but most of my references are about 650 bp. I'm not sure whether your calculations were based on full-length 16S or a portion thereof, but it's not as if my sequences are likely any longer than yours, so I'm concerned something is going on with my parameters (though all I used were the bare-bones defaults):

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads refseqs.qza \
  --i-reference-taxonomy reftax.qza \
  --o-classifier classifier_all_raw.qza

Here's hoping you can invoke some holiday cheer by pointing out the errors of my ways and help me discover how to implement multithreading with the above command :christmas_tree:

Oh, you said classification, so I thought you were talking about classify-sklearn. That can be parallelized, but not training.

That's for classification only, not training. Training is usually a bit faster... but 2 million is more than I have tried previously!

Sorry, I think you'll just need to wait it out :frowning: Please let us know how long it takes, for future reference!

It appears the training finished after about 48 hours.
The classifier hasn't been working, though: it keeps running out of memory. I'm unclear which parameter might be helpful (is it the batch size?). Most of our compute nodes have 128 GB of RAM with up to 24 threads, but it has failed even on our big nodes with up to 800 GB of RAM.

Thanks!

:champagne:

48 hr doesn't sound too bad, given the massive size of this dataset.

Not surprising, given the massive size of the dataset... SILVA is the largest database I work with, often needs ~32 GB to classify a large-ish # of query seqs, and contains only 700k ref seqs. So your database would take somewhat more memory...

There's lots of advice on the forum already regarding memory issues with this. Adjusting batch size will help. Don't run with too many jobs — each job reads a new instance of the classifier into memory, so if you have 128 GB total on the node you will max out quickly. Talk to the system admins to see if you can access a high-memory node to tackle this job... or else use fewer jobs and wait it out.
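
As a rough sketch (the batch size is just an example starting point, and the output name is illustrative):

## one job = one copy of the classifier in memory; smaller batches = lower peak memory
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --p-n-jobs 1 \
  --p-reads-per-batch 1000 \
  --o-classification guano_taxonomy.qza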

I hope that helps!

Am I understanding this right:
Each node used the max memory?
We use SLURM for managing our jobs, and I can configure the number of CPUs and the memory allocation directly (I don't need a sys admin to request memory; the last job that failed was using the 1 TB RAM node).

I’ll look into the links you’ve provided and see if anyone has experience with using SLURM to manage this particular task.

Thanks

The good news is I got the classifier to finally complete, but the bad news is I'm not sure I fully understand why it was failing in the first place (suspicions outlined below).

It looks like many issues posted about classifier memory are from users testing the pre-trained SILVA classifier on a laptop with 8-16 GB of RAM, followed by replies indicating that they need more memory and suggesting they try a compute cluster or go the AWS route. My process was seg faulting on our cluster using either 120, 500, or 800 GB of memory, so while I'm still getting an error that quite distinctly seems memory-related, I don't understand how it could be crashing on a high-memory node.

slurmstepd: error: Job 36840 exceeded memory limit (408895836160 > 134209339392), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 36840 ON node108 CANCELLED AT 2018-12-08T00:48:02

One thing that would be valuable to understand is how the inputs provided to the script ultimately get read into memory for the process:

  1. My reference classifier is just over 500 MB; I'm guessing this is read into memory entirely before even one sequence is queried? I'm curious how much decompression inflates that value. The entire classifier must get read in at once, right?
  2. My representative sequence dataset contains about 10,000 sequence variants to be classified. It seems unlikely that this is causing a significant memory issue, but maybe it should be tuned down with smaller batches? Is this the only place where batching reads can reduce the memory footprint (at the cost of taking longer when there are fewer reads per batch)?

My suspicion as to why it finally completed without the memory error… I think I'm misinterpreting the documentation describing how I should be allocating CPUs. With our SLURM job manager we define the number of CPUs to use per task and the number of tasks (which I'm interpreting as "threads" and "CPUs", respectively). I wonder whether the --p-n-jobs parameter is supposed to refer to the number of CPUs, or to the number of threads per CPU. When I think about multithreading (like with vsearch) I usually think of that --p-n-jobs parameter as referring to the number of "threads", but maybe that's where I'm mistaken. Is it referring to the number of CPUs instead? If it is, I'm wondering what to specify for the number of tasks per CPU (the threads thing); it doesn't seem like there is a parameter in this script for that.

The program would always crash, no matter how much memory I gave it (up to 800 GB), if I specified 24 CPUs in my SLURM script and set the --p-n-jobs parameter to -1. I did that thinking that was how to use all the threads, but perhaps my mistake is that it's trying to use all the CPUs on the cluster, which my SLURM script is specifying not to do…

Once I switched that parameter back to 1, the program finished the job on the regular 128 GB RAM node after about 8 hours.
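
For anyone else juggling this with SLURM, here is a sketch of the kind of pairing that seems to behave (the SBATCH values and filenames are illustrative, not a recommendation):

#!/bin/bash
## classify-sklearn runs as a single task; give that task a few CPUs and plenty of memory
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=128G

## keep the QIIME job count matched to the CPUs SLURM actually allocated
## (or drop to 1 if memory is tight, since each job loads its own copy of the classifier)
qiime feature-classifier classify-sklearn \
  --i-classifier classifier_all_raw.qza \
  --i-reads myguanoseqs.qza \
  --p-n-jobs "$SLURM_CPUS_PER_TASK" \
  --o-classification guano_taxonomy.qza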

Thanks for any input you can provide about how that --p-n-jobs parameter is supposed to be interpreted with regard to threads vs. CPUs.

Yes, the entire classifier is read into memory before one sequence is queried. So that explains why you had immediate memory issues before... you were reading in N classifiers to memory (where N = number of jobs you are running). 500 MB would decompress to a much larger size (not sure how large), so all in all it is not unimaginable that 24 jobs would eat up 1 TB of RAM in no time.

Yep, a large number of queries will impact memory usage, and this is where batch size matters. (and more batches does increase time a little bit).

You should read up on what scikit-learn does in this regard — sorry, I don't know the actual answer so you should get it straight from the horse's mouth! (and then let me know — I've made the same assumptions you have so would like to know if I'm wrong) :horse:

Hi @Nicholas_Bokulich,
Here is the horse's documentation. Looking through the QIIME script implementing scikit-learn, it looks like the n_jobs parameter is implemented beginning at line 40, where you define the predict function.
In that code, you incorporate the n_jobs parameter but do not set the backend parameter, which changes how n_jobs is interpreted. The very beginning of the joblib.Parallel documentation makes it clear that n_jobs can be interpreted either as the number of CPUs (worker processes) to use or as the number of threads, depending on how that backend parameter is specified:

Parameters:	
n_jobs: int, default: None
The maximum number of concurrently running jobs, such as the 
number of Python worker processes when backend=”multiprocessing” 
or the size of the thread-pool when backend=”threading”. 

So I think, if I'm understanding any of this properly, there is more to it than QIIME's current implementation. If you leave it as-is, it looks like it defaults to process-based (CPU) workers, but if you want to be able to use this script and specify threads, then more needs to be added to the code. I think, at a minimum, it requires adding the prefer argument to joblib.Parallel, but again, I'm not certain. Whoever in QIIME made this plugin possible will probably understand exactly what modification is necessary, and my guess is it won't be a huge addition.

Hope this sheds some light on the situation rather than muddying the waters, but who knows.

Thanks!


Thanks @devonorourke

You're looking at him. Yep, sounds like a simple modification. I will need to mull this a bit more before deciding what is appropriate.