I normally use Vsearch and a custom database consisting of the gene we have amplified and sequenced. I use “qiime vsearch cluster-features-open-reference” and representative sequences outputted from DADA2. This works perfectly.
Instead of using this custom database I tried the same procedure but instead of the extracted genes I have the entire genomes of the bacterial strains as database. I made sure that the extracted genes I normally look for are present in the new database. However, when running Vsearch it does not find the bacterial strains any longer (The clusters are identical but it outputs random names instead of the species names from the database). If I add a “> TEST” with just the amplified gene to the database it finds this one, so the procedure seems to work.
I am not really sure how to troubleshoot this. Anyone got an idea?
Hey Chloe. Thank you for the update. There is no need to spend too much time on - I kinda hoped there was something obvious I’d missed. I can just extract the alleles from the full genome database and use that. A bit time-consuming but it will provide the answers I need in this particular case, I think
cluster-features-open-reference has a “strand” parameter, which is set to “plus” by default (i.e., only searches alignments in the forward direction). This is often a cause of failed searches, if the reference is in the reverse orientation. Try setting this parameter to “both” and let us know what happens.
Hey @Nicholas_Bokulich ,
Thank you for the reply. I tried your suggestion and unfortunately it had no effect. Since I know just having the allele in the database works, while having the whole genome didn't, I decided to brute force my way through to find the problem by iterative removing bases from the 5' end of the full genome till it works. In the below, "test2" works, however, when there is a bit more 5' it does not work anymore.
I did not find the exact place (I could continue but it got tiresome), however, based on what I found it seems like VSEARCH stops working when the database entry gets too long (?). Let me know if the below images don't make sense.
Sounds like perhaps an issue with VSEARCH itself. Increasing reference length will of course impact the time and search results. You could adjust the max-accepts and max-rejects parameters as well.
Otherwise, I think that trimming the reference to the gene of interest may be the solution.
Yes… that is also what I fear. Strange though - I thought it was pretty common to look for reference sequences among full genomes. Anyways, I will just extract the GOI and use that in a smaller database. Thank you all for you time!
Maybe not with VSEARCH (or at least not for OTU clustering as you are doing). VSEARCH performs an initial search via kmer matching prior to alignment, so including the full genome may skew the kmer profile too much.
My recommendation not only because it will fix your problem; also it will be faster and more efficient downstream!