Vsearch not finding reference sequence although it is in the database

Dear all,

I normally use Vsearch and a custom database consisting of the gene we have amplified and sequenced. I use “qiime vsearch cluster-features-open-reference” and representative sequences outputted from DADA2. This works perfectly.
Instead of using this custom database I tried the same procedure but instead of the extracted genes I have the entire genomes of the bacterial strains as database. I made sure that the extracted genes I normally look for are present in the new database. However, when running Vsearch it does not find the bacterial strains any longer (The clusters are identical but it outputs random names instead of the species names from the database). If I add a “> TEST” with just the amplified gene to the database it finds this one, so the procedure seems to work.
I am not really sure how to troubleshoot this. Anyone got an idea?

Thank you
Kristian

Hi @Kristiansj!

We need a little more information in order to help you:

  1. What version of QIIME 2 are you using?
  2. What are the commands you are running? Copy and paste please.
  3. What are the results or errors you are seeing? Copy and paste any relevant terminal output, and provide any relevant QZAs or QZVs (or screenshots).

Thanks!

:qiime2:

Thank you for the quick reply and sorry for not providing much information.

  1. I am using qiime2 2019.7.

  2. Database is loaded in by:
    qiime tools import
    --type 'FeatureData[Sequence]'
    --input-path 20200911_9strainmix_genomes_validated.txt
    --output-path 9strain_database.qza

Denoised etc with DADA2
qiime dada2 denoise-paired
--i-demultiplexed-seqs ../../../raw_data/9StrainSamples_data.qza
--o-table 9Strains_dada2
--p-trunc-len-f 290
--p-trunc-len-r 290
--output-dir 9Strains_dada2

OTU/taxonomy with Vsearch
qiime vsearch cluster-features-open-reference
--i-table ../dada2/9strains_dada2.qza
--i-sequences ../dada2/9strains_dada2/representative_sequences.qza
--i-reference-sequences ../../../databases/9strain_database.qza
--p-perc-identity 0.99
--o-clustered-table 9strains_vsearch_open_reference_table_099iden.qza
--o-clustered-sequences 9strains_vsearch_open_reference_sequences_099iden.qza
--o-new-reference-sequences 9strains_vsearch_new_reference_sequences_099iden.qza

I don't get any errors as such but when I use full genomes Vsearch only assigns a few OTUS. In the below I added a >test gene to the database

When I don't have that added it doesn't find it in the full genome database

See next post for screendumps of the database

Hope this clarifies the question. Let me know if you need more information

Best
Kristian

Database screendumps:

Screendump of the ">test" added to the database file
Screenshot 2020-11-05 at 17.40.32

Screendump of the exact same sequence present in one of the full genomes in the same database file:
Screenshot 2020-11-05 at 17.40.54

Hello @Kristiansj,
I am sorry my response has been so slow. Your question is on my radar and I am doing research to get you an answer.

Thank you so much for your patience.

1 Like

Hey Chloe. Thank you for the update. There is no need to spend too much time on - I kinda hoped there was something obvious I’d missed. I can just extract the alleles from the full genome database and use that. A bit time-consuming but it will provide the answers I need in this particular case, I think :slight_smile:

Hi @Kristiansj,

cluster-features-open-reference has a “strand” parameter, which is set to “plus” by default (i.e., only searches alignments in the forward direction). This is often a cause of failed searches, if the reference is in the reverse orientation. Try setting this parameter to “both” and let us know what happens.

3 Likes

Hey @Nicholas_Bokulich ,
Thank you for the reply. I tried your suggestion and unfortunately it had no effect. Since I know just having the allele in the database works, while having the whole genome didn't, I decided to brute force my way through to find the problem by iterative removing bases from the 5' end of the full genome till it works. In the below, "test2" works, however, when there is a bit more 5' it does not work anymore.
I did not find the exact place (I could continue but it got tiresome), however, based on what I found it seems like VSEARCH stops working when the database entry gets too long (?). Let me know if the below images don't make sense.

Best
Kristian

Hi @Kristiansj,

Sounds like perhaps an issue with VSEARCH itself. Increasing reference length will of course impact the time and search results. You could adjust the max-accepts and max-rejects parameters as well.

Otherwise, I think that trimming the reference to the gene of interest may be the solution.

1 Like

Hi again @Nicholas_Bokulich,

Yes… that is also what I fear. Strange though - I thought it was pretty common to look for reference sequences among full genomes. Anyways, I will just extract the GOI and use that in a smaller database. Thank you all for you time!

Maybe not with VSEARCH (or at least not for OTU clustering as you are doing). VSEARCH performs an initial search via kmer matching prior to alignment, so including the full genome may skew the kmer profile too much.

My recommendation not only because it will fix your problem; also it will be faster and more efficient downstream!

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.