Building a COI database from BOLD references

devonorourke · September 10, 2020, 3:34pm

What are the primer pairs you use, and what’s the expected amplicon size?
I spent a lot of time banging my head against the wall trying to get mafft to cooperate, and the process felt a bit more art than science. My recommendation is to find a collection of sequences you think will be fairly similar, and generally complete - no ambiguous nucleotides, no gaps, and a similar length. Then use that subset to get your primers aligned first, and check if you’re introducing many gaps in that subset.

Thanks for catching the type! I’ll update that in a minute.

No easy answer to the question about how well primer trimming will perform with classification of target / non-target taxa. Stay tuned for an upcoming preprint soon though, as I think some of the COI tests I performed might help begin addressing how arthropod and chordate COI sequences perform a bit differently in terms of classification accuracy.
Remember that you’re not performing a particular primer-based trimming exactly - it’s not like using cutadapt or something where you seed it a primer sequence, and then remove/retain things based on whether or not you get a decent enough alignment to those primer sequences. Instead, with the mafft approach, we’re using those primers (which might be built for arthropods), to seed the global alignment itself. The first 100-1000 sequences you use to build this reference alignment anchor where all subsequent sequences get aligned to. So even if those mammalian sequences are quite different around the ~20 bp stretch of a primer sequence, there’s still all the rest of the COI sequence it’s going to get aligned to (relative to that reference input you start with). The result is that you will drop some sequences, but you shouldn’t see a drop in mammalian sequences if you seed with arthropod reference alignments - they’re similar enough.

Does that make sense?

Nicholas_Bokulich · September 10, 2020, 4:16pm

This would introduce other issues, though — many methods that accept a FeatureData[Sequence] artifact would choke on any gaps.

The point of QIIME 2's type system is to head off problems like this by validating that data are in the correct format and hence compatible with downstream methods. This is why validation was extended to check the whole file on import, to avoid unforeseen clashes downstream.

So sorry to say but I think you should add degapping to the list of pre-q2 processing (at least for now, until more BOLD-specific functionality can be added, because at the end of the day this is right now a peculiarity of BOLD).

devonorourke · September 10, 2020, 4:40pm

I can definitely appreciate that allowing funky sequences just opens up a huge can of worms. Or more specifically, about 9000% more forum posts that QIIME dev’s will have to handle

On the other hand, I think it would be kind of neat to have a new data import type where you can allow for these kinds of exceptions, but it’s labeled exactly as intended:
FeatureType[SequenceAtOwnRisk]

What do you think? … only 95% kidding …

I’ll keep all the BOLD-specific pre-processing, no worries. Thanks for the discussion though - it’s great to always know why developers arrive at the decisions they do.

hemprichbennett · September 15, 2020, 1:48pm

For the time being I’m using ANML as per your tutorial, as I’ve got some data generated with both ANML and ZBJ I’d like to use this on. But ultimately there could be a few other primer pairs that I eventually use. Right now my issues are with the initial alignment though (creating the file reference_MSA). It keeps having a weirdly high number of gaps, but I’ll attempt tweaking the input file a load today.

Yeah, that makes sense about the trimming of target/non-target taxa. I guess a potential issue could be that the amplicon from non-target taxa may not actually be for the CO1 region in question, but I assume that this would be unlikely, and in that case theres no guarantee that the amplicon is even for CO1 and not another gene entirely.

One thing I forgot to add in my previous post: my Mac had issues generating bold_rawMetadata_forQiime.tsv because of an unaccepted non-standard character in one of the metadata csv files. Changing the command from sort -uk1,1 to LC_ALL=C sort -uk1,1 fixed it.

hemprichbennett · September 23, 2020, 9:01am

I’ve now got to the final step! And have a slightly more basic question as a result. Can I ask what sort of computing system and parameters you’re using to make the classifier? When I run it on my laptop (FWIW a fairly solid MacBook Pro) or our cluster it quickly runs out of RAM.

On the cluster I get Plugin error from rescript: Unable to allocate 25.7 GiB for an array with shape (20000, 172343) and data type float64

Admittedly it was probably optimistic trying to do this on a laptop! I’m still learning my way around my new university cluster’s SLURM scheduler but it seems I’m hitting a wall with allocating enough space for the job to execute. I’ve tried increasing the number of nodes I’m using, increasing --p-n-jobs and decreasing --p-reads-per-batch, which I believe should split the rescript task over a larger number of nodes, but every time it seems that the issue is that all of the dataset has to be processed somehow by every node first, as the numbers in the error message don’t change. The seqs.qza file is 27MB and the taxa.qza file is 9.4MB so the fact that rescript is dealing with such a large object in the first 16 minutes of operation suggests that theres something going on that I could optimise, was wondering if you had any thoughts as to what

devonorourke · September 23, 2020, 9:46am

If I remember right, it’s somewhere between 40-80 GB RAM for these classifiers with 1+ million sequences. It’s a lot.

Reducing the number of reads per batch helps, but I’ve found increasing the number of jobs will increase the memory footprint. You can lower that n-jobs parameter to 1 or 2 if you’re stuck without having the ability to request more RAM.

Glad to hear you made it all the way to the end - congrats! Now you’re officially responsible for generating all the new COI classifiers

hemprichbennett · September 23, 2020, 12:39pm

Oh god, wish me luck!

Things seem to be working for generating my ANML classifier now (well, its proceeded 90 minutes without crashing, when normally it dies in under 20). I think the important thing was tinkering with some cluster settings though. Hopefully in a few days we’ll find that it worked! Now to begin working on a ZBJ classifier…

hemprichbennett · November 16, 2020, 9:44am

Sadly I’m back with another error message! Our cluster was being temperamental so I ran it with reduced demands locally on my MBP. It just crashed after 20 days of running seemingly happily ()

I was using the following command, with ${primer} corresponding to the primer I was wanting to use, with the relevant files in an appropriately named directory. This is a modification so that someday I can do multiple primers at once on the cluster in an array job.

qiime rescript evaluate-fit-classifier \ --i-sequences ${primer}/final_${primer}_seqs.qza \ --i-taxonomy ${primer}/final_${primer}_taxa.qza \ --p-reads-per-batch 400 \ --p-n-jobs 1 \ --output-dir reference_libraries/fitClassifier_bold_${primer}

I got the following error in my terminal from rescript

Plugin error from rescript:

Missing one or more files for TSVTaxonomyDirectoryFormat: 'taxonomy.tsv'

This seems odd: rescript was running fine for the previous 20 days using the taxa.qza object. The log file gives the following:

Validation: 111.59s /Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.21.2. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.) warnings.warn(warning, UserWarning) Training: 35045.83s Classification: 1683053.34s /Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/evaluate.py:76: UserWarning: The lists of input taxonomies and labels are different lengths. Additional taxonomies will be labeled numerically by their order in the inputs. Note that if these numbers match existing labels, those data will be grouped in the visualization. warnings.warn(msg, UserWarning) Traceback (most recent call last): File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__ results = action(**arguments) File "<decorator-gen-138>", line 2, in evaluate_fit_classifier File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable output_types, provenance) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 477, in _callable_executor_ outputs = self._callable(scope.ctx, **view_args) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/cross_validate.py", line 55, in evaluate_fit_classifier evaluation, = _eval([taxonomy], [observed_taxonomy]) File "<decorator-gen-334>", line 2, in evaluate_classifications File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable output_types, provenance) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 477, in _callable_executor_ outputs = self._callable(scope.ctx, **view_args) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/cross_validate.py", line 246, in evaluate_classifications expected_taxonomies = [t.view(pd.Series) for t in expected_taxonomies] File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/cross_validate.py", line 246, in <listcomp> expected_taxonomies = [t.view(pd.Series) for t in expected_taxonomies] File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/result.py", line 277, in view return self._view(view_type) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/result.py", line 289, in _view result = transformation(self._archiver.data_dir) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/core/transform.py", line 68, in transformation self.validate(view, validate_level) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/core/transform.py", line 143, in validate view.validate(level) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 171, in validate getattr(self, field)._validate_members(collected_paths, level) File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 105, in _validate_members % (self._directory_format.__class__.__name__, self.pathspec)) qiime2.core.exceptions.ValidationError: Missing one or more files for TSVTaxonomyDirectoryFormat: 'taxonomy.tsv'

I don’t know much about the inner workings of rescript, but it looks like it expected the qza object to contain a file that was specifically named taxonomy.tsv. My file ${primer}/final_${primer}_taxa.qza was working fine until that point, is there a step that I may have missed when creating the qza file? Or perhaps it would have worked if the file which was loaded into qiime earlier had specifically been named ‘taxonomy.tsv’?

devonorourke · November 16, 2020, 1:33pm

Hey @hemprichbennett,

The first thing that jumped out to me (and is a message I've received multiple times in various projects myself) is this:

I believe the conclusion we reached in this post addressing this issue was that you need to ensure that the version of scikit-learn used to train and test the classifier need to be the same (though the particular QIIME2 version does not necessarily, as there can be multiple QIIME releases with the same scikit-learn versions...).

From the error message above, it's evident that the classifier file was trained using scikit-learn v 0.21.2. Can you confirm whether or not your QIIME environment is using the same version of scikit-learn?

hemprichbennett · November 16, 2020, 2:11pm

Ooh, interesting. I’ll have a look into identifying the versions that were used for each, I’m not sure where that could have been set actually. As per the tutorial, I was using evaluate-fit-classifier to train and then test the classifier in a single command, so I would assume that the version of scikit-learn would be identical for both as there wouldn’t be any point where user input caused a different version to be loaded?

Thanks!

Nicholas_Bokulich · November 16, 2020, 2:56pm

You are right @hemprichbennett, but good sleuthing @devonorourke!

That warning is just issued whenever fitting a classifier, telling you not to use it with other sklearn releases, but does not impact anything here. We can just ignore it for now.

@hemprichbennett this error is basically saying that after running for 20 days you reached the final stage (evaluation) only to find that there was an error with the expected taxonomy file.

No... the filenames of inputs at import do not matter, as those files are renamed internally when they are saved to an artifact file.

I am not 100% sure what happened, but from the sound of it this file failed to save internally — however, the message itself is a bit of a red herring (as explained below) so I suspect basically this job was too big and too long, and something unintentional happened along the way. Basically, evaluate-fit-classifier creates a new reference taxonomy file by subsetting to match the sequence IDs (since the IDs of the two must match but sometimes the taxonomy is a superset). The file must have saved, because this taxonomy is already being used (and validated) at the classifier fitting step.

You could run "qiime tools validate" on the input sequences and taxonomy to see if that dredges up any details, but it probably won't... I do not think that the inputs are to blame, I also do not think that this is a bug (no evidence of that yet), I am inclined to blame system failure (which may or may not occur if you re-run the job! I'm guessing you don't want to wait 20 days to find out ).

For now, I think the best thing to do may be to see if you can run this with a smaller set of sequences. Either grabbing a random subset or clustering into OTUs at a high level (e.g., 90%) would accomplish this. Basically, what I am hoping to do here is just rule out the possibility of a bug (again, I don't see evidence of this, but this is the best first step)

devonorourke · November 16, 2020, 4:58pm

Thanks @Nicholas_Bokulich for providing all the great details!

@hemprichbennett - you might be able to randomly select sequences with this tip, but it might be easier to do externally from QIIME. Alternatively, there might be a few RESCRIPt tricks you could use if you have a list of IDs you want to play with first (and filter from the larger dataset)… something like qiime rescript filter-taxa - see here (section on ** Creating a classifier from RefSeqs data**).

SoilRotifer · November 16, 2020, 6:02pm

Hi @devonorourke and @hemprichbennett,

The great @misialq has recently added subsample-fasta to RESCRIPt. Which might make things a little easier.

-Mike

hemprichbennett · November 17, 2020, 1:17pm

Aha, subsample-fasta worked well thanks. I’ve successfully ran evaluate-fit-classifier on a proportion of 0.01 of the reads without any errors, am now trying it again on 0.1.

Interestingly I managed to run fit-classifier-naive-bayes on all of the reads overnight without issue, so presumably my previous error had occurred during the classifier evaluation stage. I guess at some point I’ll have to rerun the command on the full file and see if theres an inherent issue in trying to use evaluate-fit-classifier on these files, or if as I suspect @Nicholas_Bokulich was correct and it was an unfortunate system error. I expect in future I’ll keep the classifier fitting and evaluation steps separate, just in case any system error occurs during the evaluation and causes me to lose everything!

Nicholas_Bokulich · November 17, 2020, 2:01pm

great! you are part of the way there. you can just re-run the individual steps outside of this pipeline. Next steps:

run "classify-sklearn" and use your trained classifier to classify the sequences that you used for training (the purpose of this classifier is to see best-case performance, when you know the correct classifications).
run "rescript evaluate-classifications" to compare the true taxonomy vs. the predicted taxonomy for those reference reads.

All you are missing is the different subsampling/validation steps that evaluate-fit-classifier runs, but those are just safety checks and not necessarily needed.

With a database of this size that is probably wise!

smayne11 · April 15, 2021, 12:56pm

@devonorourke I see there’s a caveat not to use the the bold_anml artifacts with other primers. I’m using ZBJ, so it’s completely contained within the ANML region. I don’t have the computing power to do the whole trimming/filtering process, but I was thinking using the bold_anml_classifier.qza would be better than using a naive bayes trained on untrimmed sequences. At least with sequences trimmed to the ANML site, most of the excess is trimmed away.

Does that logic check out, or am I missing something important in the process?

devonorourke · April 15, 2021, 1:57pm

Hi @smayne11, thanks for your question.

I think you should give the ANML classifier a shot. I’d be curious to see how many of your sequence variants are classified (or unclassified), and among those with some taxonomic labels, what fraction are being assigned Family, Genus, or Species-level information.

If the classifier isn’t working for you and you want to generate your primer-specific classifier starting from the broader BOLD sequence and taxonomy files now hosted by QIIME2 here:

Even if you lack the local computing power, you might dive into renting a machine through something like Google Cloud or AWS - that’s the route I ended up taking for some of these compute-heavy tasks.

Good luck, and do please let me know how you make out with the ANML classifier.

smayne11 · April 15, 2021, 3:50pm

Thanks so much! I will definitely let you know how it goes with the ANML classifier.

Looks like those are newer versions of the files from the tutorial. Is there somewhere I can go for the most up to date versions in the future? I can download those here, but can’t for the life of me find them anywhere on the QIIME2 website or forums.

I’m also planning on trying a classifier trained on bold_anml_seqs.qza and bold_anml_taxa.qza filtered to just records from the US & Canada using the raw metadata file. I assume that would be in the same place, but if not, do you have an updated version of it as well?

devonorourke · April 15, 2021, 4:43pm

I believe the s3 buckets are the same files as those linked at the top of this post; at least, I didn’t change them on purpose, so if the files are different for some reason please let me know. The metadata file is also at the top of the post or click here, where you can download it directly from the OSF repo. I’ve been encouraged to submit a new post on the forum that clarifies just where these files will live and how to properly cite the resources - something I hope to do soon but haven’t had the time to get done properly.

A long time ago I once built a barplot of the various arthropod orders by their country of origin according to the BOLD records. My advice is to be wary of using geographic information as a requirement, as it may leave out may good quality references that lack any country name. I’d be interested to know how different your outcomes are between data classified by US&Canada only, versus those classified by any BOLD resources regardless of geographic information.

Good luck

smayne11 · April 15, 2021, 5:43pm

Got it, sorry for the confusion. I was just noticing the “2021.04” in the address and thinking that meant they were from this month.

In case you weren’t aware, the classifier linked above no longer works on the most recent Qiime version–specifically the new scikit-learn version. Easy to train a new one on the seqs and taxa you provide though, so not a big issue. Edit: looks like the server I have access to can’t handle this step for the full database. Only 64GB RAM, so not a surprise looking back at the tutorial. Looks like I won’t be able do a good comparison without getting access to more computing power, which might or might not be possible for me. Will still let you know how the US & Canada version works, which was possible with 64GB RAM.

Thanks for the warning on geographic filtering. I’ll let you know how that turns out too. I am mostly worried about adding a bunch of sequences that shouldn’t be in my study system (Western Mass.) at all and potentially decreasing the confidence of the classifier.