An off-topic reply has been split into a new topic: Using BOLD database
Please keep replies on-topic in the future.
An off-topic reply has been split into a new topic: Using BOLD database
Please keep replies on-topic in the future.
An off-topic reply has been split into a new topic: error while using rescript
Please keep replies on-topic in the future.
Hello, I use primers from “ An improved method for utilizing high-throughput amplicon sequencing to determine the diets of insectivorous animals”, and I find that you use different reverse primers. One is 5 '- GGWACTAATCATTTCAAATCC-3', and your is 5 '- ggatttggaaattgagtwcc-3'. Can I use the "bold_anml_classifier. qza" classifier provided by you for taxonomy annotation.Is that work?
Hi @PanZhu921,
I would not suggest doing this, as the primer sequences can affect how these amplicon regions are extracted from the sequence reference database. Potentially affecting your ability to classify your reads appropriately.
I would use your own primers to extract the amplicon region. You can also supplement parts of the BOLD tutorial, with this tutorial.
Hi @devonorourke,
Thanks for your detail instruction of building the COI database! I am trying to use it to build the COI database for Diptera, but currently stuck at the step:
qiime rescript dereplicate
--i-sequences bold_ambi_hpoly_length_filtd_seqs.qza
--i-taxa bold_rawTaxa.qza
--p-mode 'super'
--p-derep-prefix
--p-rank-handles 'greengenes'
--o-dereplicated-sequences bold_derep1_seqs.qza
--o-dereplicated-taxa bold_derep1_taxa.qza
I received this error:
Plugin error from rescript:
Parameter 'rank_handles' received ['greengenes'] as an argument, which is incompatible with parameter type: List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')]
Debug info has been saved to /dev/shm/jobs/39316550/qiime2-q2cli-err-f9wqntj8.log
Looks like it does not like "greengenes". Could you please help me with that? That would be much appreciated!
Thank you
Kind regards,
Toan
Hi @Toan,
Thanks (as usual!) with some insight from @SoilRotifer, I've learned that part of the error you're seeing is because my tutorial methods have since been updated, and the --p-rank-handles 'greengenes'
parameter is no longer the method of choice. What I would have mentioned, using the old version of RESCRIPt that existed when I was putting this tutorial together would be to point out that the taxonomic divisions that your various labels fall into - are outside of the expected groupings that the greengenes
rank-handle style is looking for. These taxonomic "groupings" are referred to as rank-handles
. You could see the old rank-handle methods that were permitted in this code here, however, this my not exactly be your issue if you are using an updated version of QIIME2 and RESCRIPt.
Instead, the new code handling taxonomic ranks is perfectly fine with these expanded rank-handles, including the ones listed in your error message. Thus, I believe the trick for you is to just drop the --p-rank-handles 'greengenes'
argument altogether, and instead define the ranks you have directly. Don't quote me on that very last bit about defining your ranks directly - I haven't used the updated RESCRIPt version, and others in the forum can likely point you exactly what your next steps are if you can't find them directly from the help command with:
rescript dereplicate --help
or
qiime rescript merge-taxa --help
To conclude, I suspect the issue is that the instructions in the tutorial are now outdated for certain input taxonomy files, like yours might be. Nevertheless, RESCRIPt is now indeed capable of handling these expanded rank-handle types, and you just need to adjust that --p-rank-handles
parameter accordingly to resolve the error message you are receiving.
Hi @cjfields,
Your suggestion is actually good to import the bold database to qiime2, but how can you deal with the next steps when it requires type FeatureData[AlignedSequence]?
Thanks,
Toan
Hi @Toan,
I'm sorry, but I do not understand your questions.
Good luck,
Devon
Hi @devonorourke,
So sorry for the confusion here !
I was having a problem to import my bold database with the type 'FeatureData[AlignedSequence]' into Qiime2 because it has gaps and Chris Field suggested to use seqkit to remove the gaps before importing to Qiime2 under 'FeatureData[Sequence]'. I have tried this and it works. However, when I used the output .qza to continue with your tutorial on REScript, for example, when I run this command "qiime rescript degap-seqs", it did not work because it requires the input Qiime artifact qza under 'FeatureData[AlignedSequence]' type.
I replied to his comment, not sure why it comes down to you. Hopefully I am not making you more confused
.
Thanks,
Toan
Hi @Toan,
The reason why you can't import as a FeatureData[Sequence]
or FeatureData[AlignedSequence]
is that the file from BOLD is neither. See earlier in this thread. For whatever reason, there are gaps in some sequences but not others.
This creates a conundrum, as you can't import as FeatureData[Sequence]
as there is a sequence within the file that contains a gap. So, we then decide to import as a FeatureData[AlignedSequence]
, but we can't because we're not actually importing an alignment. That is, in an alignment, every sequence should be the same exact length, i.e. the nucleotides and the gaps for each sequence should sum to the same string length, hence this error message:
This is why the seqkit command was used, to remove those spurious gaps in a few sequences, which were in fact, not in a sequence alignment. If you were able to import an actual alignment as FeatureData[AlignedSequence]
, then you can use qiime rescript degap-seqs ...
. This command will take an alignment and remove all gap characters to, essentially, "unalign" the sequences. That is, convert the alignment to FeatureData[Sequence]
.
Hopefully, this helps to clarify things.
Hi @SoilRotifer,
Thanks a lot for clarifying this! I have the same problem that there are some spurious gaps in some sequences. I tried to use seqkit to remove these gap, and import as 'FeatureData[Sequence]'. Then, I used "qiime alignment mafft" to do alignment and convert it to QIIME artifact under the type FeatureData[AlignedSequence]. However, this command requires huge memory. I set 100GB but still failed to run
Do you have any suggestions on this?
Thank you,
Toan
No problem.
Did you try using the --p-parttree
option of mafft? That might help.
-Mike
Hi @SoilRotifer
I tried to add --p-parttree, but it keeps showing this error: Missing option '--o-alignment'. ("--output-dir" may also be used), even i already had it in the command.
Here is my command:
qiime alignment mafft
--i-sequences $input/COins_Seqs.qza
--p-parttree
--output-dir $output/alligned_COins_Seqs.qza
I tried either --o-alignment' or "--output-dir", it gave me the same error. I did not get this error when I did not use --p-parttree
Can you please check if i need to add anything else?
Many many thanks,
Toan
Hi @Toan,
Are you adding a \
at the end of each line? Remember when spreading a command over multiple lines in the terminal, you need to add \
at the end of each line. Otherwise it will execute each line as if it is a separate command. This is the most common reason for the error message you're seeing. So, your command should look like this:
qiime alignment mafft \
--i-sequences $input/COins_Seqs.qza \
--p-parttree \
--output-dir $output/alligned_COins_Seqs.qza
I assume you've appropriately set up your $input
and $output
variables.
Hi @SoilRotifer,
You are absolutely right! when I added --p-parttree, I completely forgot to add \ at the end. Then, I gave a try without --p-parttree option with 300GB of memory but it still failed because of OOM!
I just set another job with --p-parttree with 300GB for 96 hours. Hopefully it will be enough. I has been running well till now. Wish me luck!
Thanks a lot for your kind support!
Toan
Hi Devon! Thank you very much for this tutorial.
I'm just starting to Build my own COI Database which will have only Nematodes and Platyhelminths sequences, I have found an issue while trying to remove sequences with non IUPAC characters. I'm getting an error: "Argument list too long". It seems like is trying to take all sequences, rather than just those with non IUPAC characters. I literally copied your code:
grep -v '^>' bold_allrawSeqs.fasta | grep [^THADGRC.SMBNWVKY-] -n > badseqs.txt
Have you heard of someone who had have this issue before? Can you check if there is something we are missing?
Hi @Joby.robleto ,
I'm guessing here, but it sounds like you are having a bash error (not an error with a particular QIIME command). I've received this kind of error when trying to run a function from the terminal that has just wayyyyyy too many results; for example, trying to list all the files in a directory, when there are 100,000+ files. I might get a result like this:
/bin/ls: cannot execute [Argument list too long]
What's strange to me is that the command I can see in the tutorial that might fit with your question shouldn't generate such an error, because it's acting on a single file (rather than a list of many files):
grep -v '^>' bold_allrawSeqs.fasta | grep [^THADGRC\.SMBNWVKY-] -n > badseqs.txt
A couple of questions for troubleshooting:
bold_allrawSeqs.fasta
? Something like this should work:grep -c '^>' bold_allrawSeqs.fasta
grep -v '^>' bold_allrawSeqs.fasta > tmp_bold_allrawSeqs_noheaders.fasta
tmp_bold_allrawSeqs_noheaders.fasta
can be created, you can then test that file directly on the second grep command to see if that succeeds:grep [^THADGRC.SMBNWVKY-] -n tmp_bold_allrawSeqs_noheaders.fasta > badseqs.txt
Good luck
Hi Devon, thank you for your answer!
Those first steps work fine. Sorry I didn't specify at first, but the exact line I'm getting the error is when I run the following commands:
myval=$(cat droplines.txt)
sed -e $myval bold_allrawSeqs.fasta > cleaned_bold_allrawSeqs.fasta
After that I get the error:
-bash: /usr/bin/sed: Argument list too long
What I have noticed is that my droplines.txt file contains all line numbers, like it is detecting an non IUPAC character in every sequence. Actually the error seems to come from the making of badseqs.txt since badseqs.txt contains all of my sequences(33117 same as bold_allrawSeqs.fasta).
Any ideas on whats wrong?
Thanks for your support.
Thanks @Joby.robleto,
Apologies on my inability to give you an obvious "it's that!" response, but I haven't touched this pipeline in a few years. What I can tell you for certain is that the error -bash: /usr/bin/sed: Argument list too long
is definitely attributed to this step:
sed -e $myval bold_allrawSeqs.fasta > cleaned_bold_allrawSeqs.fasta
In running sed -e $myval
, you're telling the sed program to run an expression defined by that $myval
file. I'm not sure that's necessarily what you want to do, because $myval
is just the same as the entirety of all lines in droplines.txt (that is to say, maybe I shouldn't have done it this way in the first place! ).
If we could back up a step or two before the error:
badseqs.txt
? (run wc -l badseqs.txt
)head -3 badseqs.txt
)droplines.txt
? If so, can you print out a few lines (head -3 droplines.txt
)?I think knowing what info you have on your end will help resolve this sed issue.
Hi Devon! Thank you very much for your assistance.
Reading in the comments I realized I would also have to degapp my sequences and then import them back to Qiime2, I mean at least remove the hyphens on the sequences because the newer versions of Qiime2 don´t accept them. So I did that and it seems that it also solve my problems in the step of removing non-IUPAC characters