Trouble creating a classifier table

Hey guys,

I am trying to create a classifier using sequences from BOLD to find the types of insects in my sample. I was able to download all the sequences from BOLD, but some contain dashes “-”. I think this is what is preventing me from creating the classifier. What should I do to move forward? Should I just delete those specific sequences, or is there a workaround?

I am fairly new to QIIME, so I am sorry if this is a dumb question.

Thank you

Good morning Whitney,

This is not a dumb question at all!

Like this?

>heading1
ACTG--CTTG
>heading2
AATG-CCTTG

I'm not super familiar with BOLD, but dashes like this could appear if the sequences have been aligned to each other, which is pretty common for databases.

If the BOLD database has undergone Multiple Sequence Alignment, you could cut out the region you have amplified first, then remove the - dashes - from your reads, at which point they should be compatible with the Qiime 2 taxonomy classifiers.
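For the dash-removal step, a minimal Python sketch could look like the following. It assumes a plain FASTA file in which every header starts with `>`; the function name `strip_gaps` is made up for illustration and is not part of QIIME 2. Only sequence lines are touched, because BOLD headers such as `AMIG116-08` contain dashes of their own.

```python
def strip_gaps(fasta_lines):
    """Yield FASTA lines with '-' removed from sequence lines only."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: leave it alone, IDs may legitimately contain '-'.
            yield line
        else:
            # Sequence line: drop alignment gap characters.
            yield line.replace("-", "")


# The two toy records from the example above.
example = [">heading1", "ACTG--CTTG", ">heading2", "AATG-CCTTG"]
degapped = list(strip_gaps(example))
print("\n".join(degapped))
```

To run it on a real file, pass an open file handle instead of the list and write each yielded line to a new file. A sequence line made up entirely of dashes becomes an empty line, so it is worth checking for empty records afterwards.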

Once we confirm why there are dashes, we can go from there.

Colin

Hey Colin,

I guess I should have included an example. This is what a sequence from BOLD looks like:

>AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895
TTGGGACTTCTTTAAGTCTCCTTATTCGAGCTGAGTTAGGGCAGCCTGGGTCCCTTATTGGAGATGACCAAATCTATAATGTTATCGTAACTGCTCACGCCTTTATTATAATCTTCTTTATGGTAATGCCCATTATAATTGGAGGGTTTGGGAATTGGTTAGTTCCTCTCATGCTTGGAGCCCCTGATATAGCTTTCCCCCGTATAAATAACATAAGCTTTTGGCTTTTACCTCCTGCTCTAACACTCCTATTAGCTAGCAGCATAGTAGAAAGTGGGGCGGGGACAGGTTGAACAGTTTACCCTCCACTAGCTTCTGGGATTGCTCATGCTGGAGGCTCTGTAGACCTTGCCATTTTCTCACTTCATTTAGCGGGGGTTTCTTCTATTCTCGGGGCTGTAAACTTTATTACCACAACCATTAATATACGCGCAAGTGGTATATCAATAGACCGCATTCCACTTTTTGTGTGGTCAGTACTAATTACAGCTATTTTGCTCTTGCTTTCCCTCCCAGTTTTAGCGGGAGCCATCACCATGCTCCTCACTGACCGTAACCTTAATACATCCT-
>BEECD351-09|Hymenoptera|COI-5P
--------------------------------------------------------------------------------------GAAATTGAATTAATAATGATCAAATTTATAACTCAATTGTAACCTCACACGCATTCATTATAATTTTTTTCATAGTTATACCATTCATAATCGGAGGTTTCGGAAACTGACTTACACCGTTAATATTAGGAGCGCCCGACATGGCTTTCCCACGAATAAATAATATAAGATTCTGATTATTACCCCCATCAATTTTAATCATTTTAATAAGAAT----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I don’t know how to insert it like you did. I am just asking about the dashes because I don’t want to erase something important! I was also thinking about just completely removing that organism and sequence

Shannon

Same! I wouldn't want to malform the database, or remove a section without fully understanding it.

I took a quick look at this part of the BOLD handbook, but I'm not sure it answered my question. Let me know if you find anything more useful in the handbook. I think we really need to get a BOLD expert here. :1st_place_medal:

Hey Colin,

I got a way to filter out the sequences with dashes so that’s exciting :partying_face:

Now I am trying to create the taxonomy file, do you have any recommendations? I am dealing with a bunch of sequences so writing them individually would be time consuming

Many thanks

Shannon

Great job, Shannon! Did you ever figure out why they were there?

Good idea. OK, so I see some taxonomy inside your headers.

>AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895
        -> | between these lines| <-

So my idea would be to write a small script that would

  1. Make a new file with just the headers in your fasta file.
  2. Cut out the taxonomy string that is between those lines.
  3. Match up your original headers with the cut taxonomy strings you just made.

Do you want to try writing up that script? :scroll: :pen:
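If a starting point helps, here is a minimal Python sketch of those three steps. It assumes every header follows the `ID|taxon name|marker|...` layout shown above; the function names `header_to_row` and `taxonomy_rows` are hypothetical. It pulls out the headers, cuts the text between the first and second `|`, and pairs each original header with that text.

```python
def header_to_row(header):
    """Split a BOLD-style header into (full_header, taxon_name)."""
    full = header.lstrip(">").strip()
    fields = full.split("|")
    # The taxon name is assumed to be the second '|'-separated field.
    taxon = fields[1] if len(fields) > 1 else ""
    return full, taxon


def taxonomy_rows(fasta_lines):
    """Yield one (full_header, taxon_name) pair per FASTA record."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield header_to_row(line)


# Headers taken from the thread; the sequences are shortened for display.
example = [
    ">AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895",
    "TTGGGACTTC",
    ">BEECD351-09|Hymenoptera|COI-5P",
    "GAAATTGAAT",
]
rows = list(taxonomy_rows(example))
for full, taxon in rows:
    print(full + "\t" + taxon)
```

The header only carries the lowest assigned name (a species here, an order for the second record), so the full lineage string would still have to come from somewhere else, such as a BOLD taxonomy export.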

Colin

I was able to do it through Excel, I think. I just wanted to confirm this looks right:

AMIG116-08 k__Animalia; p__Arthropoda; c__Insecta; o__Ephemeroptera; f__Ephemerellidae; g__Ephemerella; s__dorothea

Thank you for all the help

That looks great!

If you have any troubles, you might find that it needs to have the full header in there.

Like this!

AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895 k__Animalia; p__Arthropoda; c__Insecta; o__Ephemeroptera; f__Ephemerellidae; g__Ephemerella; s__dorothea
                     There is a tab right here ^
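Since the tab is easy to lose when a file passes through Excel, a small check along these lines might be useful before importing. It is only a sketch: it assumes a headerless file with one `ID<TAB>taxonomy` pair per line, and the name `find_bad_rows` is invented for this example.

```python
def find_bad_rows(taxonomy_lines):
    """Return (line_number, reason) for rows that do not look like ID<TAB>taxonomy."""
    problems = []
    seen = set()
    for number, line in enumerate(taxonomy_lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            problems.append((number, "expected exactly two tab-separated columns"))
        elif parts[0] in seen:
            problems.append((number, "duplicate ID"))
        else:
            seen.add(parts[0])
    return problems


rows = [
    # Correct: a real tab between the ID and the taxonomy string.
    "AMIG116-08\tk__Animalia; p__Arthropoda; c__Insecta; o__Ephemeroptera; "
    "f__Ephemerellidae; g__Ephemerella; s__dorothea",
    # Wrong: a space where the tab should be.
    "AMIG116-08 k__Animalia; p__Arthropoda; c__Insecta",
]
bad = find_bad_rows(rows)
print(bad)
```

An empty list means every non-blank line has exactly one tab and no ID is repeated.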

Great progress today, Shannon.

Hey Colin,

I was going based off Greengenes, and I didn't see the full header there. How did you make that l looking character? Should I include the header before moving forward?

Thanks again

Shannon

Hello Shannon,

The | bar character is on the same key as \ on the right side of the keyboard, and you get it by pressing shift+\.

The bar character is used to separate things, just like , or ;, and is also used in linux pipes.

Only if you need to! If using AMIG116-08 as a heading works great, you are good to go and don't need the full header at all. You will have to try it and find out.
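One way to sidestep the question is to make the FASTA headers and the taxonomy IDs identical by shortening both to the BOLD process ID. The sketch below assumes the process ID is everything before the first `|`; `short_id` and `shorten_headers` are hypothetical helper names. It is also my understanding, not something confirmed in this thread, that FASTA readers such as scikit-bio's treat only the text up to the first whitespace as the ID, which would split a header like `AMIG116-08|Ephemerella dorothea|...` at the space.

```python
def short_id(header):
    """Return the text between '>' and the first '|' (assumed to be the BOLD process ID)."""
    return header.lstrip(">").split("|", 1)[0].strip()


def shorten_headers(fasta_lines):
    """Yield FASTA lines with each header reduced to its short ID."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + short_id(line)
        else:
            yield line


# Header from the thread; the sequence is shortened for display.
example = [">AMIG116-08|Ephemerella dorothea|COI-5P|HQ151895", "TTGGGACTTC"]
shortened = list(shorten_headers(example))
print("\n".join(shortened))
```

If two records share the same process ID, the shortened headers would collide, so it is worth checking for duplicates before importing.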

Let me know how it goes.

Hey Colin,

So while I was extracting the reference reads I got an error code:

(qiime2-2019.7) ksup20672mac:Desktop swhitne1$ qiime feature-classifier extract-reads \
--i-sequences US_insecta.qza \
--p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAGATATTGGAACWTTATATTTTATTTTTGG \
--p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGWACTAATCAATTWCCAAATCCTCC \
--o-reads ref-seq.qza
Plugin error from feature-classifier:
string index out of range
Debug info has been saved to /var/folders/tk/jfjwhjvx36302yc8z85_ql6s__5t7k/T/qiime2-q2cli-err-bfhechjx.log

Did I use the wrong f and r primers?

I’m not sure what ‘string index out of range’ means for this plugin. What does the log file say?

Traceback (most recent call last):
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2cli/commands.py", line 327, in __call__
    results = action(**arguments)
  File "</Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/decorator.py:decorator-gen-351>", line 2, in extract_reads
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 166, in extract_reads
    first_read = next(reads)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 110, in _gen_reads
    amp = _approx_match(seq, f_primer, r_primer, identity)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 94, in _approx_match
    beg, b_matches, b_length = _semisemiglobal(f_primer, seq)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 60, in _semisemiglobal
    _local_aln(primer, sequence)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_feature_classifier/_cutter.py", line 46, in _local_aln
    skbio.alignment.local_pairwise_align_ssw(one_primer, sequence)
  File "/Users/swhitne1/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/skbio/alignment/_pairwise.py", line 731, in local_pairwise_align_ssw
    constructor(alignment.aligned_target_sequence, metadata=metadata2,
  File "skbio/alignment/_ssw_wrapper.pyx", line 339, in skbio.alignment._ssw_wrapper.AlignmentStructure.aligned_target_sequence (skbio/alignment/_ssw_wrapper.c:3560)
  File "skbio/alignment/_ssw_wrapper.pyx", line 376, in skbio.alignment._ssw_wrapper.AlignmentStructure._get_aligned_sequence (skbio/alignment/_ssw_wrapper.c:4165)
IndexError: string index out of range

This is what it says!

Wow, I’ve never seen that one before!

Let’s get a real qiime dev in here and see what they say. @ebolyen

Thanks for posting that Shannon.

I'm not sure if this is a good idea --- the "sequences with dashes" usually mean that the sequences have been aligned, which means that you're possibly importing the wrong files here (if they have unaligned sequences for this database I think that is what you want). By removing any sequences with alignment gap characters you are now removing valid features from this database, probably not what you want.

As far as the "string index out of range" - my off-the-cuff guess is that this is related to the aligned-ness I mentioned above.
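To see how many records the dash filter would actually throw away, a quick count like this might help. It is a rough sketch that assumes a plain FASTA file with `>` headers; `gap_summary` is a made-up name, and the third record in the example is hypothetical.

```python
def gap_summary(fasta_lines):
    """Return (total_records, records_with_at_least_one_gap)."""
    total = 0
    gapped = 0
    in_record = False
    has_gap = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Close out the previous record before starting a new one.
            if in_record and has_gap:
                gapped += 1
            total += 1
            in_record, has_gap = True, False
        elif "-" in line:
            has_gap = True
    # Close out the final record.
    if in_record and has_gap:
        gapped += 1
    return total, gapped


example = [
    ">heading1", "ACTG--CTTG",
    ">heading2", "AATG-CCTTG",
    ">heading3", "ACTGAACTTG",  # hypothetical record without gaps
]
total, gapped = gap_summary(example)
print(f"{gapped} of {total} records contain gap characters")
```

If most records contain gaps, dropping them loses a large part of the reference, which is the concern raised above.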

Hey Matt,

Someone at BOLD was able to filter out the sequences that contain dashes, so it makes my database smaller but I think it is still substantial since it has 101,120 records.

Should I go back and ask them for the sequences to be unaligned?

Thank you

Shannon

The issue here is that you are potentially mis-using data. You appear to be using aligned sequences where you should be working with unaligned sequences. I would get my hands on the unaligned sequences and then go from there :crossed_fingers:

Hey Matt,

Thanks for the response. I am confused about the aligned vs unaligned sequences, what is the main difference? Sorry if this is a dumb question, I just thought unaligned contained dashes and I thought Qiime2 couldn’t interpret those for classifiers. :sweat_smile:

Thank you

Shannon
