Building a COI database from BOLD references

Hey @hemprichbennett,

Welcome to the forum :qiime2: :partying_face: !

I don't think you're missing anything - if you're referring to the step where you import the raw BOLD sequences, you have to refer to these as FeatureData[AlignedSequence] because a great many of them contain gap characters (-). If you try to import them as FeatureData[Sequence] instead, you'll get an error indicating that the data contain characters that aren't allowed. So the reason we're importing as that data type is simply to avoid the import error.

However, you're 100% correct - these data are not aligned. We have to do that later on in the workflow. They aren't the same length either (though you can see in this tutorial that a great many of them fall within a very similar length). All the steps (and more!) you're curious about regarding trimming and filtering sequences are going to happen after these raw BOLD sequences are imported.
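For reference, that import step looks something like this (the output filename here is just an example):

qiime tools import \
  --type 'FeatureData[AlignedSequence]' \
  --input-path bold_rawSeqs_forQiime.fasta \
  --output-path bold_rawSeqs.qza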

Let me know if something in the workflow isn't working for you. Thanks, and good luck.


Hi Devon,

Great, thanks for the welcome and thanks for clarifying! I've now isolated the problem: QIIME 2 2020.8.0 doesn't allow the fasta file to be imported as FeatureData[AlignedSequence], whereas QIIME 2 2019.10.0 does. As the 2019 version appears to be currently required for using RESCRIPt, I guess it's unlikely that many others will encounter this error, but it's perhaps worth noting that it could become an issue as time goes on. When I try with QIIME 2 2020.8.0 I get the error message:

There was a problem importing bold_rawSeqs_forQiime.fasta:

bold_rawSeqs_forQiime.fasta is not a(n) AlignedDNAFASTAFormat file:

The sequence starting on line 38 was length 618. All previous sequences were length 659. All sequences must be the same length for AlignedDNAFASTAFormat.

I have two other minor issues.

  1. I'm on OSX and guess the tutorial was written for Linux? For the early bash loop to work I have to change the cut command from cut -f 1 badseqs.txt -d ':' to cut -f 1 -d ':' badseqs.txt (see the snippet after this list). If the Linux cut syntax is just more relaxed about option order, it could be worth changing that in the tutorial?
  2. The --input-format parameter on the taxonomy import step is currently commented out :wink:
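For the record, here are the two forms side by side:

cut -f 1 badseqs.txt -d ':'    # GNU cut (Linux) accepts options after the filename
cut -f 1 -d ':' badseqs.txt    # BSD cut (macOS) does not; this order works on both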

Cheers, I’ll continue through the tutorial now :slight_smile:

Thanks for the :mag: investigative work @hemprichbennett,
Perhaps @Nicholas_Bokulich or @thermokarst might know what’s going on with the discrepancy between what I was doing earlier and this error message with the most recent QIIME 2 release. In any event, glad you’re getting it to work with an earlier version.

Could you clarify your second point/minor issue? Were you getting an error message if you didn’t uncomment that line of code? I don’t recall that line being required. In other words, this should be sufficient, right?

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path bold_rawTaxa_forQiime.tsv \
  --output-path bold_rawTaxa.qza

Just to be clear, I’ve removed the #--input-format HeaderlessTSVTaxonomyFormat line…

For me it actually requires the input format to be specified; if I remove that line or keep it commented out, I get the error:

There was a problem importing bold_rawTaxa_forQiime.tsv:

bold_rawTaxa_forQiime.tsv is not a(n) TSVTaxonomyFormat file:

['Feature ID', 'Taxon'] must be the first two header values. The first two header values provided are: ['10001757', 'tax=k__Animalia;p__Annelida;c__Polychaeta;o__Phyllodocida;f__Nereididae;g__;s__'] (on line 1).

So it looks like, by default, it expects a header in the file. I guess the default input format may differ between versions of QIIME 2?
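For reference, the full command that works for me is:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path bold_rawTaxa_forQiime.tsv \
  --output-path bold_rawTaxa.qza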

Thanks for the sanity check. You were right all along - it's just how I ended up creating this taxonomy file (it has no header, so QIIME expects me to pass along that parameter I had commented out, to indicate that the incoming file has no header). You can import a taxonomy file without that parameter, but you have to pass in the two-column header line it expects:

Feature ID    Taxon

I updated the documentation to reflect that the parameter is required. Thanks for all your troubleshooting! Keep it up (though hopefully, no more troubles :slight_smile: )!

Sure thing!

Prior to the 2020.8 release, the import validation for FeatureData[AlignedSequence] only looked at the first 5-ish lines of the file. As of 2020.8, we have implemented robust validation for these types of data.


Thanks @thermokarst,
What I’d like to be able to do is import a sequence that is not aligned, but does contain gap characters. When retrieving reference sequences from BOLD, it’s possible for a sequence to contain one or more gap characters like this:

>seq1
AAT-CGAAACAGA-TC

What's awful is that they aren't actually aligned! So, in order to use the fancy new tools available in RESCRIPt, I need to be able to import these raw sequences. Trying to import them as a regular FeatureData[Sequence] will fail, correct? Is there an option that allows for gap characters to be tolerated as sequence data?

Certainly, it’s straightforward to remove these with other tools prior to importing in QIIME, but I wanted to check whether I could get these into QIIME straight from the BOLD downloads.

Thanks!

Since you are already doing some pre-processing of the seqs, maybe just add degapping to the list (with tr, sed, awk, whatever) instead of importing and degapping with RESCRIPt. After all, as you say, these are not really alignments, just seqs with gaps.

Of course, doing it all in RESCRIPt would be great, so that the processing steps are stored in provenance... but:
a) this is really an issue specific to BOLD
b) you are doing other pre-processing steps that are also specific to formatting BOLD data

so a bit of degapping prior to importing would not hurt.
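E.g., a quick sketch with awk that strips the gaps from the sequence lines while leaving the fasta headers alone (filenames here are just placeholders):

# print header lines unchanged; delete '-' characters from all other lines
awk '/^>/ {print; next} {gsub(/-/, ""); print}' bold_raw.fasta > bold_degapped.fasta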

The best solution would be to eventually get a BOLD-specific formatting step (e.g., create a BOLD fasta format with the appropriate transformers such that the seqs and taxonomy are automatically formatted when any BOLD seqs are imported). :wink:


I suppose my preference is to get data into QIIME in its earliest form so that provenance tracking can be retained, like you suggested. With that in mind, I guess the trade-off in question is provenance versus uniformity. It would be great if these raw sequences could be imported into QIIME, perhaps with a warning message noting that some characters may cause downstream errors, rather than an outright error preventing their import. Specifically, FeatureData[Sequence] could allow these gap characters, even though these aren't actually alignment files. That would resolve my BOLD-specific issue, but perhaps it could also help others who import data with gaps from other sources?


I'm making my way through; no major issues to report yet! Having some difficulties with mafft introducing lots of gaps, but that's a problem for next week now :wink:

One minor thing: there's a typo in the step where you remove leading/trailing ambiguous nucleotides. You use grep on ./degap_tmpdir/dna-sequences.fasta/dna-sequences.fasta when it should be ./degap_tmpdir/dna-sequences.fasta

I have a general question about the overall logic of the pipeline: I understand the utility of trimming to the primer region of interest, but I wonder how well this will work for assigning non-target taxa. In our case, when amplifying bat diet with arthropod-specific primers we'll sometimes amplify some bat or fungal DNA even though there isn't a perfect binding site. To assign these sequences as bat we would need some mammal sequences in our reference library, but I assume that trimming mammal DNA in silico using arthropod-specific primers won't work, as the primers won't match the template. However, it also seems problematic to include full-length mammal Folmer regions in a database of heavily trimmed arthropod sequences. Just wondering your thoughts on this?

Thanks very much
Dave

What are the primer pairs you use, and what’s the expected amplicon size?
I spent a lot of time banging my head against the wall trying to get mafft to cooperate, and the process felt a bit more art than science. My recommendation is to find a collection of sequences you think will be fairly similar, and generally complete - no ambiguous nucleotides, no gaps, and a similar length. Then use that subset to get your primers aligned first, and check if you’re introducing many gaps in that subset.

Thanks for catching the typo! I'll update that in a minute.

No easy answer to the question about how well primer trimming will perform for classification of target/non-target taxa. Stay tuned for an upcoming preprint, though, as I think some of the COI tests I performed might help begin addressing how arthropod and chordate COI sequences perform a bit differently in terms of classification accuracy.
Remember that you're not performing primer-based trimming exactly - it's not like using cutadapt or something where you feed it a primer sequence and then remove/retain things based on whether or not you get a decent enough alignment to those primer sequences. Instead, with the mafft approach, we're using those primers (which might be built for arthropods) to seed the global alignment itself. The first 100-1000 sequences you use to build this reference alignment anchor where all subsequent sequences get aligned. So even if those mammalian sequences are quite different around the ~20 bp stretch of a primer sequence, there's still all the rest of the COI sequence for them to align to (relative to that reference input you start with). The result is that you will drop some sequences, but you shouldn't see a drop in mammalian sequences if you seed with arthropod reference alignments - they're similar enough.
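If it helps to see it spelled out, the alignment step works roughly like this (these filenames are approximate - check the tutorial for the exact invocation):

# align the bulk of the BOLD seqs against the primer-seeded reference alignment
mafft --auto --thread -1 \
  --addfragments bold_seqs_to_align.fasta \
  reference_MSA > bold_alignment.fasta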

Does that make sense?


This would introduce other issues, though — many methods that accept a FeatureData[Sequence] artifact would choke on any gaps.

The point of QIIME 2's type system is to head off problems like this by validating that data are in the correct format and hence compatible with downstream methods. This is why validation was extended to check the whole file on import, to avoid unforeseen clashes downstream.

So, sorry to say, but I think you should add degapping to the list of pre-q2 processing (at least for now, until more BOLD-specific functionality can be added), because at the end of the day this is currently a peculiarity of BOLD.


I can definitely appreciate that allowing funky sequences just opens up a huge can of worms. Or, more specifically, about 9000% more forum posts that QIIME devs will have to handle :nauseated_face:

On the other hand, I think it would be kind of neat to have a new data import type where you can allow for these kinds of exceptions, but it’s labeled exactly as intended:
FeatureData[SequenceAtOwnRisk]

What do you think? :thinking: … only 95% kidding … :wink:

I'll keep all the BOLD-specific pre-processing, no worries. Thanks for the discussion though - it's always great to know why developers arrive at the decisions they do.


For the time being I'm using ANML as per your tutorial, as I've got some data generated with both ANML and ZBJ that I'd like to use this on. But ultimately there are a few other primer pairs that I may eventually use. Right now my issues are with the initial alignment, though (creating the reference_MSA file). It keeps having a weirdly high number of gaps, but I'll have a go at tweaking the input file today.

Yeah, that makes sense about the trimming of target/non-target taxa. I guess a potential issue could be that the amplicon from non-target taxa may not actually be from the COI region in question, but I assume that this would be unlikely, and in that case there's no guarantee that the amplicon is even COI and not another gene entirely.

One thing I forgot to add in my previous post: my Mac had issues generating bold_rawMetadata_forQiime.tsv because of a non-standard character in one of the metadata csv files. Changing the command from sort -uk1,1 to LC_ALL=C sort -uk1,1 fixed it.

I've now got to the final step, and have a slightly more basic question as a result. Can I ask what sort of computing system and parameters you're using to make the classifier? When I run it on my laptop (FWIW, a fairly solid MacBook Pro) or our cluster, it quickly runs out of RAM.

On the cluster I get Plugin error from rescript: Unable to allocate 25.7 GiB for an array with shape (20000, 172343) and data type float64

Admittedly it was probably optimistic trying to do this on a laptop! I'm still learning my way around my new university cluster's SLURM scheduler, but it seems I'm hitting a wall with allocating enough space for the job to execute. I've tried increasing the number of nodes I'm using, increasing --p-n-jobs, and decreasing --p-reads-per-batch, which I believed would split the rescript task over a larger number of nodes, but every time it seems the whole dataset has to be processed somehow by every node first, as the numbers in the error message don't change. The seqs.qza file is 27 MB and the taxa.qza file is 9.4 MB, so the fact that rescript is dealing with such a large object in the first 16 minutes of operation suggests there's something going on that I could optimise. I was wondering if you had any thoughts as to what :slight_smile:


If I remember right, it's somewhere between 40 and 80 GB of RAM for these classifiers with 1+ million sequences. It's a lot.

Reducing the number of reads per batch helps, but I've found that increasing the number of jobs will increase the memory footprint. You can lower that n-jobs parameter to 1 or 2 if you're stuck without the ability to request more RAM.
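If it helps, a SLURM submission for this step might look something like this sketch (the 64G memory request is a ballpark figure and the filenames are placeholders) - the key is to request more RAM for a single job rather than more nodes:

#!/bin/bash
#SBATCH --mem=64G            # RAM is the bottleneck here, not CPUs
#SBATCH --cpus-per-task=2
#SBATCH --time=4-00:00:00

qiime rescript evaluate-fit-classifier \
  --i-sequences final_seqs.qza \
  --i-taxonomy final_taxa.qza \
  --p-reads-per-batch 400 \
  --p-n-jobs 1 \
  --output-dir fitClassifier_bold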

Glad to hear you made it all the way to the end - congrats! Now you’re officially responsible for generating all the new COI classifiers :wink:


Oh god, wish me luck! :wink:

Things seem to be working for generating my ANML classifier now (well, it's proceeded 90 minutes without crashing, when normally it dies in under 20). I think the important thing was tinkering with some cluster settings, though. Hopefully in a few days we'll find that it worked! Now to begin working on a ZBJ classifier…

Sadly I’m back with another error message! Our cluster was being temperamental so I ran it with reduced demands locally on my MBP. It just crashed after 20 days of running seemingly happily (:cry:)

I was using the following command, where ${primer} corresponds to the primer I want to use and the relevant files live in an appropriately named directory. This is a modification so that someday I can do multiple primers at once on the cluster in an array job.

qiime rescript evaluate-fit-classifier \
  --i-sequences ${primer}/final_${primer}_seqs.qza \
  --i-taxonomy ${primer}/final_${primer}_taxa.qza \
  --p-reads-per-batch 400 \
  --p-n-jobs 1 \
  --output-dir reference_libraries/fitClassifier_bold_${primer}

I got the following error in my terminal from rescript

Plugin error from rescript:

Missing one or more files for TSVTaxonomyDirectoryFormat: 'taxonomy.tsv'

This seems odd: rescript was running fine for the previous 20 days using the taxa.qza object. The log file gives the following:

Validation: 111.59s
/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.21.2. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)
Training: 35045.83s
Classification: 1683053.34s
/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/evaluate.py:76: UserWarning: The lists of input taxonomies and labels are different lengths. Additional taxonomies will be labeled numerically by their order in the inputs. Note that if these numbers match existing labels, those data will be grouped in the visualization.
  warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
    results = action(**arguments)
  File "<decorator-gen-138>", line 2, in evaluate_fit_classifier
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 477, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/cross_validate.py", line 55, in evaluate_fit_classifier
    evaluation, = _eval([taxonomy], [observed_taxonomy])
  File "<decorator-gen-334>", line 2, in evaluate_classifications
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/action.py", line 477, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/cross_validate.py", line 246, in evaluate_classifications
    expected_taxonomies = [t.view(pd.Series) for t in expected_taxonomies]
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/rescript/cross_validate.py", line 246, in <listcomp>
    expected_taxonomies = [t.view(pd.Series) for t in expected_taxonomies]
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/result.py", line 277, in view
    return self._view(view_type)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/sdk/result.py", line 289, in _view
    result = transformation(self._archiver.data_dir)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/core/transform.py", line 68, in transformation
    self.validate(view, validate_level)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/core/transform.py", line 143, in validate
    view.validate(level)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 171, in validate
    getattr(self, field)._validate_members(collected_paths, level)
  File "/Users/davehemprichbennett/opt/anaconda3/envs/crag_conda/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 105, in _validate_members
    % (self._directory_format.__class__.__name__, self.pathspec))
qiime2.core.exceptions.ValidationError: Missing one or more files for TSVTaxonomyDirectoryFormat: 'taxonomy.tsv'

I don't know much about the inner workings of rescript, but it looks like it expected the qza object to contain a file specifically named taxonomy.tsv. My file ${primer}/final_${primer}_taxa.qza was working fine until that point; is there a step that I may have missed when creating the qza file? Or perhaps it would have worked if the file that was loaded into qiime earlier had specifically been named 'taxonomy.tsv'?
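I guess one way to check would be to peek at the artifact's contents, since a .qza is just a zip archive:

unzip -l ${primer}/final_${primer}_taxa.qza    # look at what the file under data/ is actually named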

Hey @hemprichbennett,

The first thing that jumped out at me (and is a message I've received multiple times in various projects myself) is the scikit-learn version warning in your log.

I believe the conclusion we reached in this post addressing this issue was that the version of scikit-learn used to train the classifier needs to match the version used to test it (though the particular QIIME 2 version does not necessarily need to match, as there can be multiple QIIME 2 releases with the same scikit-learn version...).

From the error message above, it's evident that the classifier file was trained using scikit-learn v 0.21.2. Can you confirm whether or not your QIIME environment is using the same version of scikit-learn?
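You can check from within your activated QIIME 2 environment with either of these:

conda list scikit-learn
# or, directly from python:
python -c "import sklearn; print(sklearn.__version__)"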

Ooh, interesting. I'll have a look into identifying the versions that were used for each; I'm not sure where that could have been set, actually. As per the tutorial, I was using evaluate-fit-classifier to train and then test the classifier in a single command, so I would assume the version of scikit-learn would be identical for both, as there wouldn't be any point where user input caused a different version to be loaded?

Thanks!