Building a COI database from BOLD references

I am confused about the classifier file bold_full_ArthOnly_classifier.qza. Can I extract or export the fasta sequences from this?
Where would I get the corresponding files maybe named bold_full_ArthOnly_classifier_seqs.qza and bold_full_ArthOnly_classifier_taxa.qza?

Were you ever able to figure out how to do this? It seems as though the arthonly classifier was generated with scikit-learn version 0.21.2. I'm not sure how to downgrade this. So my other option was to re-generate it, but the seq/tax files were not included. Maybe Devon would be willing to share?

Just a note that the R code used for retrieving sequence data from BOLD may sporadically cause records to be dropped; I saw a few instances in the R script where the parser had issues with quotes and had to do a workaround. See this issue, which was recently addressed in the bold R package on GitHub and should be available in an upcoming release.

2 Likes

Hello,

I was attempting to build a new BOLD database and was wondering if there was a suggested resolution to the error that began appearing after the update to "2020.8"...besides reverting to a version before the update. Here is the error: There was a problem importing bold_rawSeqs_forQiime.fasta:

bold_rawSeqs_forQiime.fasta is not a(n) AlignedDNAFASTAFormat file:

The sequence starting on line 34 was length 597. All previous sequences were length 602. All sequences must be the same length for AlignedFASTAFormat.

@John I got around it by using seqkit directly and removing gaps:

seqkit seq -w0 -g bold_rawSeqs_forQiime.fasta > bold_rawSeqs_forQiime.degapped.fasta

Then:

qiime tools import \
  --input-path bold_rawSeqs_forQiime.degapped.fasta \
  --output-path bold_rawSeqs.qza \
  --type 'FeatureData[Sequence]

You have to be a bit careful in steps w/ seqkit and use -w 0, as the default FASTA output will be wrapped at 60 characters so some of the one-liner steps like the following will break:

grep -v '^>' bold_rawSeqs_forQiime.degapped.fasta |  \
  sed 's/^N*//' | rev | sed 's/^N*//' | awk '{print length}' | sort | \
  uniq -c | sort -k2,2n > seqlength_Ntrimmed_hist.txt
3 Likes

2 off-topic replies have been split into a new topic: error while making BOLD reference database

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: extract-reads: why no matches?

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: can BOLD references be used on invertebrates

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: Using BOLD database

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: error while using rescript

Please keep replies on-topic in the future.

Hello, I use primers from “ An improved method for utilizing high-throughput amplicon sequencing to determine the diets of insectivorous animals”, and I find that you use different reverse primers. One is 5 '- GGWACTAATCATTTCAAATCC-3', and your is 5 '- ggatttggaaattgagtwcc-3'. Can I use the "bold_anml_classifier. qza" classifier provided by you for taxonomy annotation.Is that work?

Hi @PanZhu921,

I would not suggest doing this, as the primer sequences can affect how these amplicon regions are extracted from the sequence reference database. Potentially affecting your ability to classify your reads appropriately.

I would use your own primers to extract the amplicon region. You can also supplement parts of the BOLD tutorial, with this tutorial.

Hi @devonorourke,

Thanks for your detail instruction of building the COI database! I am trying to use it to build the COI database for Diptera, but currently stuck at the step:
qiime rescript dereplicate
--i-sequences bold_ambi_hpoly_length_filtd_seqs.qza
--i-taxa bold_rawTaxa.qza
--p-mode 'super'
--p-derep-prefix
--p-rank-handles 'greengenes'
--o-dereplicated-sequences bold_derep1_seqs.qza
--o-dereplicated-taxa bold_derep1_taxa.qza

I received this error:

Plugin error from rescript:

Parameter 'rank_handles' received ['greengenes'] as an argument, which is incompatible with parameter type: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]

Debug info has been saved to /dev/shm/jobs/39316550/qiime2-q2cli-err-f9wqntj8.log

Looks like it does not like "greengenes". Could you please help me with that? That would be much appreciated!

Thank you

Kind regards,

Toan

Hi @Toan,

Thanks (as usual!) with some insight from @SoilRotifer, I've learned that part of the error you're seeing is because my tutorial methods have since been updated, and the --p-rank-handles 'greengenes' parameter is no longer the method of choice. What I would have mentioned, using the old version of RESCRIPt that existed when I was putting this tutorial together would be to point out that the taxonomic divisions that your various labels fall into - are outside of the expected groupings that the greengenes rank-handle style is looking for. These taxonomic "groupings" are referred to as rank-handles. You could see the old rank-handle methods that were permitted in this code here, however, this my not exactly be your issue if you are using an updated version of QIIME2 and RESCRIPt.

Instead, the new code handling taxonomic ranks is perfectly fine with these expanded rank-handles, including the ones listed in your error message. Thus, I believe the trick for you is to just drop the --p-rank-handles 'greengenes' argument altogether, and instead define the ranks you have directly. Don't quote me on that very last bit about defining your ranks directly - I haven't used the updated RESCRIPt version, and others in the forum can likely point you exactly what your next steps are if you can't find them directly from the help command with:

rescript dereplicate --help

or

qiime rescript merge-taxa --help

To conclude, I suspect the issue is that the instructions in the tutorial are now outdated for certain input taxonomy files, like yours might be. Nevertheless, RESCRIPt is now indeed capable of handling these expanded rank-handle types, and you just need to adjust that --p-rank-handles parameter accordingly to resolve the error message you are receiving.

2 Likes

Hi @cjfields,
Your suggestion is actually good to import the bold database to qiime2, but how can you deal with the next steps when it requires type FeatureData[AlignedSequence]?
Thanks,
Toan

Hi @Toan,

I'm sorry, but I do not understand your questions.

  • Are you asking if I am suggesting to import data from BOLD to QIIME? You can certainly use that as a resource, and this tutorial shows how to leverage those resources, though I would point out that you can alternatively import from COI data from NCBI. We even compared the two data sources in the RESCRIPt paper, so have a look there if you're wondering which way to go about it.
  • Can you possibly rephrase your question about what you mean by "next steps when it requires..."? What are the exact steps you are referring to? Are they QIIME-specific, or as shown in this tutorial, related to preprocessing before any particular QIIME analysis?

Good luck,
Devon

Hi @devonorourke,

So sorry for the confusion here :slightly_smiling_face:!
I was having a problem to import my bold database with the type 'FeatureData[AlignedSequence]' into Qiime2 because it has gaps and Chris Field suggested to use seqkit to remove the gaps before importing to Qiime2 under 'FeatureData[Sequence]'. I have tried this and it works. However, when I used the output .qza to continue with your tutorial on REScript, for example, when I run this command "qiime rescript degap-seqs", it did not work because it requires the input Qiime artifact qza under 'FeatureData[AlignedSequence]' type.

I replied to his comment, not sure why it comes down to you. Hopefully I am not making you more confused :slight_smile:
.
Thanks,

Toan

Hi @Toan,

The reason why you can't import as a FeatureData[Sequence] or FeatureData[AlignedSequence] is that the file from BOLD is neither. See earlier in this thread. For whatever reason, there are gaps in some sequences but not others.

This creates a conundrum, as you can't import as FeatureData[Sequence] as there is a sequence within the file that contains a gap. So, we then decide to import as a FeatureData[AlignedSequence], but we can't because we're not actually importing an alignment. That is, in an alignment, every sequence should be the same exact length, i.e. the nucleotides and the gaps for each sequence should sum to the same string length, hence this error message:

This is why the seqkit command was used, to remove those spurious gaps in a few sequences, which were in fact, not in a sequence alignment. If you were able to import an actual alignment as FeatureData[AlignedSequence], then you can use qiime rescript degap-seqs .... This command will take an alignment and remove all gap characters to, essentially, "unalign" the sequences. That is, convert the alignment to FeatureData[Sequence].

Hopefully, this helps to clarify things. :slight_smile:

Hi @SoilRotifer,

Thanks a lot for clarifying this! I have the same problem that there are some spurious gaps in some sequences. I tried to use seqkit to remove these gap, and import as 'FeatureData[Sequence]'. Then, I used "qiime alignment mafft" to do alignment and convert it to QIIME artifact under the type FeatureData[AlignedSequence]. However, this command requires huge memory. I set 100GB but still failed to run :frowning:

Do you have any suggestions on this?

Thank you,

Toan

No problem. :wink:

Did you try using the --p-parttree option of mafft? That might help.

-Mike