I am trying to create a classifier using sequences from BOLD to find the types of insects in my sample. I was able to download all the sequences from BOLD, but some contain dashes “-”. I think this is what is stopping me from creating the classifier. What should I do to move forward? Should I just delete those specific sequences, or is there a workaround?
I am fairly new to QIIME, so I am sorry if this is a dumb question.
I'm not super familiar with BOLD, but dashes like this could appear if the sequences have been aligned to each other, which is pretty common for databases.
If the BOLD database has undergone multiple sequence alignment, you could first cut out the region you have amplified and then remove the dashes (-) from the sequences, at which point they should be compatible with the QIIME 2 taxonomy classifiers.
Once we confirm why there are dashes, we can go from there.
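If the dashes do turn out to be alignment gaps, here is a minimal sketch of the dash-removal step, assuming the reference sequences are in a plain FASTA file. The function name and the example sequence are made up for illustration. It only touches sequence lines, because BOLD headers such as AMIG116-08 contain a dash of their own that must stay.

```python
def degap_fasta_lines(lines, gap_chars="-"):
    """Yield FASTA lines with gap characters removed from sequence lines only."""
    table = str.maketrans("", "", gap_chars)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: keep as-is (record IDs may legitimately contain "-").
            yield line
        else:
            # Sequence line: drop every gap character.
            yield line.translate(table)


# Hypothetical record; the sequence is invented for the example.
example = [
    ">AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895",
    "--ACTTTATATTTT---GGAGC",
    "ATGA-CT--",
]
degapped = list(degap_fasta_lines(example))
print("\n".join(degapped))
```

For a real file you could pass an open file handle instead of the list and write the yielded lines to a new FASTA file. Note that this sketch does not drop records that end up empty after degapping.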
I don’t know how to insert it like you did. I am just asking about the dashes because I don’t want to erase something important! I was also thinking about just completely removing that organism and sequence
Same! I wouldn't want to malform the database, or remove a section without fully understanding it.
I took a quick look at this part of the BOLD handbook, but I'm not sure it answered my question. Let me know if you find anything more useful in the handbook. I think we really need to get a BOLD expert here.
I found a way to filter out the sequences with dashes, so that’s exciting.
Now I am trying to create the taxonomy file. Do you have any recommendations? I am dealing with a lot of sequences, so writing them out individually would be time consuming.
If you have any troubles, you might find that it needs to have the full header in there.
Like this!
AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895	k__Animalia; p__Arthropoda; c__Insecta; o__Ephemeroptera; f__Ephemerellidae; g__Ephemerella; s__dorothea
There is a tab between HQ151895 and k__Animalia.
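To avoid writing these lines by hand, here is a rough sketch of how the file could be generated, assuming you also have the BOLD specimen data as a table. The column names (phylum_name, class_name, order_name, family_name, genus_name, species_name) are my assumption about the BOLD TSV download, and the function names and the hard-coded kingdom are made up for the example, so check them against your own file.

```python
import io

# (rank prefix, assumed BOLD TSV column name)
RANKS = [
    ("p", "phylum_name"),
    ("c", "class_name"),
    ("o", "order_name"),
    ("f", "family_name"),
    ("g", "genus_name"),
    ("s", "species_name"),
]


def taxonomy_string(record, kingdom="Animalia"):
    """Build a Greengenes-style taxonomy string from one record (a dict)."""
    parts = ["k__" + kingdom]
    genus = (record.get("genus_name") or "").strip()
    for prefix, column in RANKS:
        value = (record.get(column) or "").strip()
        # Species names usually come as "Genus epithet"; keep only the epithet.
        if prefix == "s" and genus and value.startswith(genus + " "):
            value = value[len(genus) + 1:]
        parts.append(prefix + "__" + value)
    return "; ".join(parts)


def write_taxonomy(records, out):
    """Write one 'header<TAB>taxonomy' line per (header, record) pair."""
    for header, record in records:
        out.write(header + "\t" + taxonomy_string(record) + "\n")


example = [(
    "AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895",
    {
        "phylum_name": "Arthropoda",
        "class_name": "Insecta",
        "order_name": "Ephemeroptera",
        "family_name": "Ephemerellidae",
        "genus_name": "Ephemerella",
        "species_name": "Ephemerella dorothea",
    },
)]
buffer = io.StringIO()
write_taxonomy(example, buffer)
line = buffer.getvalue()
print(line, end="")
```

Whatever you use as the first column has to match the sequence IDs in your FASTA file exactly, so build both from the same header.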
I was going based off the Greengenes files and I didn't see the full header. How did you make that l-looking character? Should I include the header before moving forward?
The | bar character is on the same key as \ on the right side of the keyboard, and you get it by pressing shift+\.
The bar character is used to separate things, just like , or ;, and is also used in Linux pipes.
Only if you need to! If using AMIG116-08 as the header works, you are good to go and don't need the full header at all. You will have to try it and find out.
So while I was extracting the reference reads I got an error code:
(qiime2-2019.7) ksup20672mac:Desktop swhitne1$ qiime feature-classifier extract-reads \
  --i-sequences US_insecta.qza \
  --p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAGATATTGGAACWTTATATTTTATTTTTGG \
  --p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGWACTAATCAATTWCCAAATCCTCC \
  --o-reads ref-seq.qza
Plugin error from feature-classifier:
string index out of range
Debug info has been saved to /var/folders/tk/jfjwhjvx36302yc8z85_ql6s__5t7k/T/qiime2-q2cli-err-bfhechjx.log
Traceback (most recent call last):
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2cli/commands.py", line 327, in __call__
results = action(**arguments)
File "</Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/decorator.py:decorator-gen-351>", line 2, in extract_reads
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
output_types, provenance)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in _callable_executor_
output_views = self._callable(**view_args)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 166, in extract_reads
first_read = next(reads)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 110, in _gen_reads
amp = _approx_match(seq, f_primer, r_primer, identity)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 94, in _approx_match
beg, b_matches, b_length = _semisemiglobal(f_primer, seq)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 60, in _semisemiglobal
_local_aln(primer, sequence)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 46, in _local_aln
skbio.alignment.local_pairwise_align_ssw(one_primer, sequence)
File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/alignment/_pairwise.py", line 731, in local_pairwise_align_ssw
constructor(alignment.aligned_target_sequence, metadata=metadata2,
File "skbio/alignment/_ssw_wrapper.pyx", line 339, in skbio.alignment._ssw_wrapper.AlignmentStructure.aligned_target_sequence (skbio/alignment/_ssw_wrapper.c:3560)
File "skbio/alignment/_ssw_wrapper.pyx", line 376, in skbio.alignment._ssw_wrapper.AlignmentStructure._get_aligned_sequence (skbio/alignment/_ssw_wrapper.c:4165)
IndexError: string index out of range
I'm not sure if this is a good idea. "Sequences with dashes" usually means that the sequences have been aligned, so you may be importing the wrong files here (if unaligned sequences are available for this database, I think those are what you want). By removing every sequence that contains alignment gap characters, you are removing valid features from the database, which is probably not what you want.
As for the "string index out of range" error, my off-the-cuff guess is that it is related to the aligned-ness I mentioned above.
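One quick way to check the aligned-ness guess is to count how many records contain gap characters and how many distinct sequence lengths there are. This is only a heuristic sketch with made-up function names and toy sequences: an aligned FASTA file typically pads every record to the same length with dashes, whereas unaligned sequences usually vary in length and contain none.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gap_summary(lines, gap_char="-"):
    """Count records, records containing gaps, and distinct sequence lengths."""
    total = with_gaps = 0
    lengths = set()
    for _, seq in read_fasta(lines):
        total += 1
        lengths.add(len(seq))
        if gap_char in seq:
            with_gaps += 1
    return {
        "records": total,
        "with_gaps": with_gaps,
        "distinct_lengths": len(lengths),
    }


# Toy input: three records, all six characters long, two with gaps.
example = [">seq1", "AC--GT", ">seq2", "ACTTGT", ">seq3", "A---", "GT"]
summary = gap_summary(example)
print(summary)
```

If nearly every record has the same length and many contain dashes, the file is probably an alignment, and degapping (or getting the unaligned download) would be safer than throwing records away.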
Someone at BOLD was able to filter out the sequences that contain dashes. That makes my database smaller, but I think it is still substantial since it has 101,120 records.
Should I go back and ask them for the sequences to be unaligned?
The issue here is that you are potentially misusing data. You appear to be using aligned sequences where you should be working with unaligned sequences. I would get my hands on the unaligned sequences and then go from there.
Thanks for the response. I am confused about aligned vs. unaligned sequences. What is the main difference? Sorry if this is a dumb question, I just thought unaligned sequences contained dashes and that QIIME 2 couldn’t interpret those for classifiers.