Building a COI database from BOLD references

Nicholas_Bokulich · February 8, 2023, 6:21am

An off-topic reply has been split into a new topic: Using BOLD database

Please keep replies on-topic in the future.

lizgehret · June 4, 2023, 10:41pm

An off-topic reply has been split into a new topic: error while using rescript

Please keep replies on-topic in the future.

PanZhu921 · June 7, 2023, 10:22am

Hello, I use primers from “ An improved method for utilizing high-throughput amplicon sequencing to determine the diets of insectivorous animals”, and I find that you use different reverse primers. One is 5 '- GGWACTAATCATTTCAAATCC-3', and your is 5 '- ggatttggaaattgagtwcc-3'. Can I use the "bold_anml_classifier. qza" classifier provided by you for taxonomy annotation.Is that work?

SoilRotifer · June 8, 2023, 4:46pm

Hi @PanZhu921,

I would not suggest doing this, as the primer sequences can affect how these amplicon regions are extracted from the sequence reference database. Potentially affecting your ability to classify your reads appropriately.

I would use your own primers to extract the amplicon region. You can also supplement parts of the BOLD tutorial, with this tutorial.

Toan · September 13, 2023, 2:50am

Hi @devonorourke,

Thanks for your detail instruction of building the COI database! I am trying to use it to build the COI database for Diptera, but currently stuck at the step:
qiime rescript dereplicate
--i-sequences bold_ambi_hpoly_length_filtd_seqs.qza
--i-taxa bold_rawTaxa.qza
--p-mode 'super'
--p-derep-prefix
--p-rank-handles 'greengenes'
--o-dereplicated-sequences bold_derep1_seqs.qza
--o-dereplicated-taxa bold_derep1_taxa.qza

I received this error:

Plugin error from rescript:

Parameter 'rank_handles' received ['greengenes'] as an argument, which is incompatible with parameter type: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]

Debug info has been saved to /dev/shm/jobs/39316550/qiime2-q2cli-err-f9wqntj8.log

Looks like it does not like "greengenes". Could you please help me with that? That would be much appreciated!

Thank you

Kind regards,

Toan

devonorourke · September 13, 2023, 1:33pm

Hi @Toan,

Thanks (as usual!) with some insight from @SoilRotifer, I've learned that part of the error you're seeing is because my tutorial methods have since been updated, and the --p-rank-handles 'greengenes' parameter is no longer the method of choice. What I would have mentioned, using the old version of RESCRIPt that existed when I was putting this tutorial together would be to point out that the taxonomic divisions that your various labels fall into - are outside of the expected groupings that the greengenes rank-handle style is looking for. These taxonomic "groupings" are referred to as rank-handles. You could see the old rank-handle methods that were permitted in this code here, however, this my not exactly be your issue if you are using an updated version of QIIME2 and RESCRIPt.

Instead, the new code handling taxonomic ranks is perfectly fine with these expanded rank-handles, including the ones listed in your error message. Thus, I believe the trick for you is to just drop the --p-rank-handles 'greengenes' argument altogether, and instead define the ranks you have directly. Don't quote me on that very last bit about defining your ranks directly - I haven't used the updated RESCRIPt version, and others in the forum can likely point you exactly what your next steps are if you can't find them directly from the help command with:

rescript dereplicate --help

or

qiime rescript merge-taxa --help

To conclude, I suspect the issue is that the instructions in the tutorial are now outdated for certain input taxonomy files, like yours might be. Nevertheless, RESCRIPt is now indeed capable of handling these expanded rank-handle types, and you just need to adjust that --p-rank-handles parameter accordingly to resolve the error message you are receiving.

Toan · September 13, 2023, 10:35pm

Hi @cjfields,
Your suggestion is actually good to import the bold database to qiime2, but how can you deal with the next steps when it requires type FeatureData[AlignedSequence]?
Thanks,
Toan

devonorourke · September 14, 2023, 1:58pm

Hi @Toan,

I'm sorry, but I do not understand your questions.

Are you asking if I am suggesting to import data from BOLD to QIIME? You can certainly use that as a resource, and this tutorial shows how to leverage those resources, though I would point out that you can alternatively import from COI data from NCBI. We even compared the two data sources in the RESCRIPt paper, so have a look there if you're wondering which way to go about it.
Can you possibly rephrase your question about what you mean by "next steps when it requires..."? What are the exact steps you are referring to? Are they QIIME-specific, or as shown in this tutorial, related to preprocessing before any particular QIIME analysis?

Good luck,
Devon

Toan · September 14, 2023, 11:40pm

Hi @devonorourke,

So sorry for the confusion here !
I was having a problem to import my bold database with the type 'FeatureData[AlignedSequence]' into Qiime2 because it has gaps and Chris Field suggested to use seqkit to remove the gaps before importing to Qiime2 under 'FeatureData[Sequence]'. I have tried this and it works. However, when I used the output .qza to continue with your tutorial on REScript, for example, when I run this command "qiime rescript degap-seqs", it did not work because it requires the input Qiime artifact qza under 'FeatureData[AlignedSequence]' type.

I replied to his comment, not sure why it comes down to you. Hopefully I am not making you more confused
.
Thanks,

Toan

SoilRotifer · September 19, 2023, 3:01pm

Hi @Toan,

The reason why you can't import as a FeatureData[Sequence] or FeatureData[AlignedSequence] is that the file from BOLD is neither. See earlier in this thread. For whatever reason, there are gaps in some sequences but not others.

This creates a conundrum, as you can't import as FeatureData[Sequence] as there is a sequence within the file that contains a gap. So, we then decide to import as a FeatureData[AlignedSequence], but we can't because we're not actually importing an alignment. That is, in an alignment, every sequence should be the same exact length, i.e. the nucleotides and the gaps for each sequence should sum to the same string length, hence this error message:

This is why the seqkit command was used, to remove those spurious gaps in a few sequences, which were in fact, not in a sequence alignment. If you were able to import an actual alignment as FeatureData[AlignedSequence], then you can use qiime rescript degap-seqs .... This command will take an alignment and remove all gap characters to, essentially, "unalign" the sequences. That is, convert the alignment to FeatureData[Sequence].

Hopefully, this helps to clarify things.

Toan · September 19, 2023, 9:50pm

Hi @SoilRotifer,

Thanks a lot for clarifying this! I have the same problem that there are some spurious gaps in some sequences. I tried to use seqkit to remove these gap, and import as 'FeatureData[Sequence]'. Then, I used "qiime alignment mafft" to do alignment and convert it to QIIME artifact under the type FeatureData[AlignedSequence]. However, this command requires huge memory. I set 100GB but still failed to run

Do you have any suggestions on this?

Thank you,

Toan

SoilRotifer · September 19, 2023, 10:07pm

No problem.

Did you try using the --p-parttree option of mafft? That might help.

-Mike

Toan · September 20, 2023, 12:08am

Hi @SoilRotifer

I tried to add --p-parttree, but it keeps showing this error: Missing option '--o-alignment'. ("--output-dir" may also be used), even i already had it in the command.

Here is my command:

qiime alignment mafft
--i-sequences $input/COins_Seqs.qza
--p-parttree
--output-dir $output/alligned_COins_Seqs.qza

I tried either --o-alignment' or "--output-dir", it gave me the same error. I did not get this error when I did not use --p-parttree

Can you please check if i need to add anything else?

Many many thanks,

Toan

SoilRotifer · September 20, 2023, 1:17pm

Hi @Toan,

Are you adding a \ at the end of each line? Remember when spreading a command over multiple lines in the terminal, you need to add \ at the end of each line. Otherwise it will execute each line as if it is a separate command. This is the most common reason for the error message you're seeing. So, your command should look like this:

qiime alignment mafft \
--i-sequences $input/COins_Seqs.qza \
--p-parttree \
--output-dir $output/alligned_COins_Seqs.qza

I assume you've appropriately set up your $input and $output variables.

Toan · September 20, 2023, 7:19pm

Hi @SoilRotifer,

You are absolutely right! when I added --p-parttree, I completely forgot to add \ at the end. Then, I gave a try without --p-parttree option with 300GB of memory but it still failed because of OOM!

I just set another job with --p-parttree with 300GB for 96 hours. Hopefully it will be enough. I has been running well till now. Wish me luck!

Thanks a lot for your kind support!

Toan

Joby.robleto · February 5, 2024, 7:37pm

Hi Devon! Thank you very much for this tutorial.
I'm just starting to Build my own COI Database which will have only Nematodes and Platyhelminths sequences, I have found an issue while trying to remove sequences with non IUPAC characters. I'm getting an error: "Argument list too long". It seems like is trying to take all sequences, rather than just those with non IUPAC characters. I literally copied your code:

grep -v '^>' bold_allrawSeqs.fasta | grep [^THADGRC.SMBNWVKY-] -n > badseqs.txt

Have you heard of someone who had have this issue before? Can you check if there is something we are missing?

devonorourke · February 8, 2024, 12:43am

Hi @Joby.robleto ,
I'm guessing here, but it sounds like you are having a bash error (not an error with a particular QIIME command). I've received this kind of error when trying to run a function from the terminal that has just wayyyyyy too many results; for example, trying to list all the files in a directory, when there are 100,000+ files. I might get a result like this:

/bin/ls:  cannot execute [Argument list too long]

What's strange to me is that the command I can see in the tutorial that might fit with your question shouldn't generate such an error, because it's acting on a single file (rather than a list of many files):

grep -v '^>' bold_allrawSeqs.fasta | grep [^THADGRC\.SMBNWVKY-] -n > badseqs.txt

A couple of questions for troubleshooting:

Can you confirm how many records exist within that bold_allrawSeqs.fasta? Something like this should work:

grep -c '^>' bold_allrawSeqs.fasta

Assuming the above command works, what happens if you save the first part to a temporary file? In other words, don't pipe it to the second grep command directly. Does this command execute without issue?

grep -v '^>' bold_allrawSeqs.fasta > tmp_bold_allrawSeqs_noheaders.fasta

Assuming the tmp_bold_allrawSeqs_noheaders.fasta can be created, you can then test that file directly on the second grep command to see if that succeeds:

grep [^THADGRC.SMBNWVKY-] -n tmp_bold_allrawSeqs_noheaders.fasta > badseqs.txt

Good luck

Joby.robleto · February 8, 2024, 4:54am

Hi Devon, thank you for your answer!
Those first steps work fine. Sorry I didn't specify at first, but the exact line I'm getting the error is when I run the following commands:

myval=$(cat droplines.txt)
sed -e $myval bold_allrawSeqs.fasta > cleaned_bold_allrawSeqs.fasta

After that I get the error:
-bash: /usr/bin/sed: Argument list too long

What I have noticed is that my droplines.txt file contains all line numbers, like it is detecting an non IUPAC character in every sequence. Actually the error seems to come from the making of badseqs.txt since badseqs.txt contains all of my sequences(33117 same as bold_allrawSeqs.fasta).

Any ideas on whats wrong?
Thanks for your support.

devonorourke · February 8, 2024, 5:47pm

Thanks @Joby.robleto,
Apologies on my inability to give you an obvious "it's that!" response, but I haven't touched this pipeline in a few years. What I can tell you for certain is that the error -bash: /usr/bin/sed: Argument list too long is definitely attributed to this step:

sed -e $myval bold_allrawSeqs.fasta > cleaned_bold_allrawSeqs.fasta

In running sed -e $myval, you're telling the sed program to run an expression defined by that $myval file. I'm not sure that's necessarily what you want to do, because $myval is just the same as the entirety of all lines in droplines.txt (that is to say, maybe I shouldn't have done it this way in the first place! ).

If we could back up a step or two before the error:

How many lines do you have in badseqs.txt? (run wc -l badseqs.txt)
Can you print out the first few lines? (head -3 badseqs.txt)
Did you run the step to generate droplines.txt? If so, can you print out a few lines (head -3 droplines.txt)?

I think knowing what info you have on your end will help resolve this sed issue.

Joby.robleto · February 15, 2024, 2:07am

Hi Devon! Thank you very much for your assistance.
Reading in the comments I realized I would also have to degapp my sequences and then import them back to Qiime2, I mean at least remove the hyphens on the sequences because the newer versions of Qiime2 don´t accept them. So I did that and it seems that it also solve my problems in the step of removing non-IUPAC characters