running uchime-ref after deblur -- general questions and parameters

anna-schrecengost · August 19, 2021, 9:09pm

Hi, I want to try using qiime vsearch uchime-ref on my ASVs that I've obtained after running Deblur. This is because I've noticed there are still a lot of chimeras in my dataset. On a side note -- has anyone else had this issue with chimeras and Deblur, and what did you do about it?

Anyway, the developer of usearch recommends trying both -mode high_confidence and -mode balanced in this case. I am wondering what the equivalent parameters are for qiime vsearch uchime-ref? I suspect it is a matter of setting the value of --p-minh, but I'm not sure what I should set it to for this case (-mode high_confidence has a high false-negative rate, while -mode balanced tries to balance false positives and false negatives).

I'd appreciate any insight, thanks!

thermokarst · August 23, 2021, 2:30pm

Hi @anna-schrecengost!

cc @wasade - any thoughts on this?

Hmm, that's a tricky one - q2-vsearch is built with vsearch, a wholly different software package from usearch. I just had a read through the vsearch support forum and came across this, which is a little useful, but doesn't seem to get to the root of your question:

https://groups.google.com/g/vsearch-forum/c/ffrrHqy4QM0/m/r0gMx3U2DwAJ

The cluster_otus command in USEARCH performs clustering and chimera detection simultaneously. VSEARCH does not include the cluster_otus command. The cluster_fast, cluster_size and cluster_smallmem commands in VSEARCH performs clustering only. It is therefore highly recommended to perform chimera detection with uchime_denovo or uchime_ref using VSEARCH. - Torbjørn

I had a quick read through the following thread, and it might help provide some guidance for you - let me know!

https://groups.google.com/g/vsearch-forum/c/kh8UZuJlT6g/m/1oLffmiqAgAJ
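For readers who want to experiment with the threshold mentioned above, here is a minimal sketch that only builds (and prints) `qiime vsearch uchime-ref` command lines for a few `--p-minh` values, so the resulting chimera calls can be compared side by side. The file names and the helper `uchime_ref_command` are hypothetical, and the minh values are arbitrary; none of them is claimed to be equivalent to either usearch mode. Raising minh makes the call stricter, so fewer sequences get flagged as chimeric.

```python
import shlex


def uchime_ref_command(seqs, table, reference, minh, out_prefix):
    """Build (but do not run) a `qiime vsearch uchime-ref` command line.

    All paths are placeholders supplied by the caller.
    """
    return [
        "qiime", "vsearch", "uchime-ref",
        "--i-sequences", seqs,
        "--i-table", table,
        "--i-reference-sequences", reference,
        "--p-minh", str(minh),
        "--o-chimeras", f"{out_prefix}-chimeras.qza",
        "--o-nonchimeras", f"{out_prefix}-nonchimeras.qza",
        "--o-stats", f"{out_prefix}-stats.qza",
    ]


# Arbitrary example thresholds (assumption: pick values that suit your data).
for minh in (0.28, 0.5, 1.0):
    cmd = uchime_ref_command(
        "rep-seqs.qza", "table.qza", "reference-seqs.qza",
        minh, f"uchime-minh-{minh}",
    )
    print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch runnable without a QIIME 2 environment; paste the output into a shell where QIIME 2 is active to run it.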


wasade · August 23, 2021, 3:25pm

Hi @anna-schrecengost,

Deblur uses VSEARCH internally to filter chimeras. Here is the exact command.

Chimera checking and filtering is an unsolved problem though.

All the best,
Daniel

anna-schrecengost · August 23, 2021, 3:40pm

Hi @thermokarst, thank you for these resources! And oops, I did not realize how different vsearch is from usearch.

anna-schrecengost · August 23, 2021, 3:50pm

Hi @wasade, thank you for your reply! I don't think I provided enough context in my post. I understand that Deblur removes chimeras with a de novo method (which is an implementation of uchime_denovo, is that right?). But I still have lots of chimeras in my data after running Deblur, which has really been messing with the phylogeny I'm trying to build. I read in the Deblur paper that "After applying Deblur, only reads likely to have been presented to the sequencer are retained. However, it is possible that the reads would still contain chimeras originating from PCR.", which could possibly be what's happening with my data? I would like to see if doing further chimera filtering would improve my phylogeny.

wasade · August 23, 2021, 4:48pm

Thanks, @anna-schrecengost. What method is being used to indicate a high presence of chimeras in the data, and a disrupted phylogeny? If this isn't being used, I recommend considering fragment-insertion as the backbone will be fixed.

Aggressive chimera filtering likely will remove real data, and the agreement among the algorithms isn't great. Is there an indication that the presence of chimeras disrupts the biological conclusions being derived from the data?

Best,
Daniel

anna-schrecengost · August 23, 2021, 8:54pm

Thank you @wasade - I did not know about the dangers of filtering chimeras with different algorithms. I know there are some other things going on that are likely affecting my data, and although this should probably be put in a different topic, I'll give you some context:

I can tell there is something weird happening with my data just by inspecting the phylogeny. Basically, I am working on a meta-analysis of 18S rRNA gene sequencing studies, focused on anaerobic ciliates. I processed ~20 studies (which involved both the V4 and V9 regions) and am now using fragment-insertion to construct a phylogeny. I constructed a ciliate reference tree for this (which after clustering at 97% similarity came out to ~2000 mostly nearly full-length sequences). After inserting my ASVs into the reference tree with fragment-insertion, I expected to see only sequences assigned to ciliates and unassigned sequences in the tree, or at least for the non-ciliate sequences to fall within the outgroup. However, I found many taxonomically assigned non-ciliate sequences inserted into my ciliate tree. I BLASTed a dozen of those and they were likely chimeras (50% query coverage to unrelated organisms).

So I know something is disrupting the biological conclusions that I can derive from the data, and I meant to test whether it might be the chimeras by filtering them out more. However, I am realizing there are many other factors that may have left me with an ugly tree, for example:

- The default % identity threshold that fragment-insertion uses is pretty low for my purposes, so I should use exclude-seqs at a higher threshold before running the insertion tree.
- I think the reference tree itself needs some work. There are two very long-branching outgroup sequences that I need to remove, and I also didn't manage to re-root the tree before going ahead and running SEPP, which I now think is likely a big problem (I followed these steps to build the tree). So I am building a new tree right now to fix these issues.
- It's always possible that the taxonomic annotations are wrong and these sequences are actually ciliate sequences (I used classify-sklearn), but it's hard to know if I can trust this.

Sorry for the rambling and not including more context with my original post. This forum has been very helpful for a newcomer like myself and I appreciate your help!

wasade · August 25, 2021, 12:00am

Hi @anna-schrecengost,

This is really cool!!! I agree with the exploration of the upstream pieces first. Getting a reference constructed can be painful, and as you're encountering, sometimes the data are wonky. If you're able, rooting the tree would be valuable. What types of sequences are being used for an outgroup? Long branches should be suspect. It may be useful to examine the alignment of suspect sequences relative to highly trusted ones to see if a breakpoint can be spotted. Bad chimeras can create a lot of problems. A breakpoint 2nt in from 5' probably won't be that bad though (and detecting that would be not fun).

Once rooted, you can take the taxonomy for your reference sequences and decorate it onto the tree with tax2tree (GitHub: biocore/tax2tree, automated taxonomy decoration onto a tree). The result can be visualized and explored with Empress (GitHub: biocore/empress, a fast and scalable phylogenetic tree viewer for microbiome data analysis). It may help with the overall QC process. And, the resulting taxonomy from decoration can be used for classification of short reads either where they place or with classify-sklearn.

It is possible that some of the input data to the reference are problematic. For Greengenes, we ended up dropping any sequence with > 1% non-ATGC, anything < 1200nt, and anything where the % variation relative to a "core" set of trusted sequences was >10%. We found a lot of not great data in GenBank when trolling for records. For chimeras, we compared against isolates IIRC using uchime-ref, and only considered a sequence chimeric if it bridged class-level or higher.
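The first two filters described above (more than 1% non-ATGC characters, shorter than 1200 nt) are easy to sketch. The function name `passes_reference_qc` is hypothetical, the thresholds are just the ones quoted in the post, and the third criterion (variation relative to a trusted "core" set) is not covered here. A lower length cutoff may suit other databases.

```python
def passes_reference_qc(seq, max_non_atgc_frac=0.01, min_length=1200):
    """Return True if `seq` passes a simple length and ambiguity filter.

    A sequence is rejected if it is shorter than `min_length` or if more
    than `max_non_atgc_frac` of its characters are not A, T, G or C.
    """
    seq = seq.upper()
    if not seq or len(seq) < min_length:
        return False
    non_atgc = sum(1 for base in seq if base not in "ATGC")
    return non_atgc / len(seq) <= max_non_atgc_frac


# Toy records (made up); real input would come from a FASTA file.
records = {
    "ref_ok": "ACGT" * 300,               # 1200 nt, no ambiguous bases
    "ref_short": "ACGT" * 100,            # 400 nt
    "ref_ambiguous": "N" * 20 + "ACGT" * 300,
}
kept = [name for name, seq in records.items() if passes_reference_qc(seq)]
print(kept)
```

Note that this sketch assumes DNA letters; sequences written with U instead of T would need converting first.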

Best,
Daniel

anna-schrecengost · August 25, 2021, 4:39pm

Thank you @wasade!!

In regards to the reference/outgroup sequences: For the reference sequences I used all the ciliate sequences from PR2 (which has been integrated with Euk-Ref/ Euk-Ref Ciliophora, so it should already be manually curated, similar to how you described, except that their bp threshold is lower (500 bp)). I also used full-length 18S sequences from my lab as well as from some collaborators. The reference tree looks like a nice standard ciliate tree except for the funky outgroup sequences. For those I used a couple hundred sequences of apicomplexans and dinoflagellates also from the PR2 database. Perhaps this is too many sequences and that combined with the tree being unrooted caused some of my problems? I am trying now with only 5 full-length, high quality outgroup sequences. I am curious about your thoughts on the ideal number of outgroup sequences for this application of fragment-insertion?

Thank you for pointing me toward tax2tree! And I have been using Empress, it's really great! It's really nice that I can easily visualize the q2 assigned taxonomy and metadata in the tree as well.

"And, the resulting taxonomy from decoration can be used for classification of short reads either where they place or with classify-sklearn"

When you write this, do you mean that I could either assign taxonomy based on where they place in the phylogenetic tree, or also by using the decorated taxonomy as a reference database in classify-sklearn? Also, tax2tree decorates the tree with Greengenes taxonomy, is that correct? I am interested in decorating the tree with our curated taxonomy (especially with the underrepresented anaerobic ciliates, many of these sequences have not been published yet). I have a tab-separated .txt file with the reference sequence IDs and their taxonomy but have been having trouble figuring out how to use that to automatically annotate the tree. I know this is possible in iTOL but have been curious if there are other options because I find it a bit hard to use.

Thank you!
Anna

wasade · September 13, 2021, 6:13pm

Hi @anna-schrecengost,

It's my understanding that there are a lot of open questions with outgroups. For Greengenes, we just used all of the sequences that we could objectively define as Archaea for the outgroup (in that case, IIRC we used the ssu-align model association).

That's wonderful to hear tax2tree + empress worked well

Tax2tree can decorate using any input taxonomy, and the taxonomy does not need to be complete. It just needs some tip name -> lineage information. The lineage information needs to be well formed such that a) taxonomic rank prefixes are included (e.g. "d__Eukarya; p__Chordata; ...etc"), b) for the records included, all must have the same number of taxonomic levels, and c) there cannot be any gaps in the ranks (e.g. "d__Eukarya; p__; c__Mammalia" would not be valid).
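The three conditions above can be checked before handing a taxonomy file to tax2tree. Below is a minimal sketch; `check_lineages` is a hypothetical helper, not part of tax2tree, and it assumes single-letter rank prefixes followed by two underscores and semicolon-separated ranks. It also assumes that empty ranks at the end of a lineage are acceptable (an incomplete taxonomy) while an empty rank followed by a named one counts as a gap.

```python
import re

RANK_PREFIX = re.compile(r"^[a-z]__")


def check_lineages(lineages):
    """Return a list of problems found in a {tip name: lineage} mapping."""
    problems = []
    depths = set()
    for tip, lineage in lineages.items():
        ranks = [rank.strip() for rank in lineage.split(";")]
        depths.add(len(ranks))
        seen_empty = False
        for rank in ranks:
            # (a) every rank needs a prefix such as "d__" or "p__"
            if not RANK_PREFIX.match(rank):
                problems.append(f"{tip}: missing rank prefix in {rank!r}")
                continue
            if rank[3:] == "":
                seen_empty = True
            elif seen_empty:
                # (c) a named rank after an empty one is a gap
                problems.append(f"{tip}: gap before {rank!r}")
                seen_empty = False
    # (b) all records must have the same number of levels
    if len(depths) > 1:
        problems.append(f"inconsistent number of levels: {sorted(depths)}")
    return problems


# Toy lineages (made up tip names).
example = {
    "ref_1": "d__Eukarya; p__Ciliophora; c__Spirotrichea",
    "ref_2": "d__Eukarya; p__; c__Mammalia",
}
for problem in check_lineages(example):
    print(problem)
```

An empty result means the mapping satisfies the three conditions as interpreted here; it says nothing about whether the taxonomy itself is correct.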

The names are placed on the internal nodes of the tree. If that tree is subsequently used for fragment insertion, the internal nodes will remain named, so you can pull off the lineage of the placed fragments by just walking the ancestors. I believe the fragment insertion plugin already supports this under the classify-otus-experimental action, but I haven't used it specifically; it's definitely possible via the skbio.TreeNode API though, using the .ancestors method.
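To make "walking the ancestors" concrete, here is a small self-contained sketch. The `Node` class is a hypothetical stand-in so the example runs without extra packages; with scikit-bio, a `TreeNode` found by tip name offers the same idea through its `.ancestors()` method and `.name` attribute. The tree and names below are made up.

```python
class Node:
    """Tiny stand-in for a tree node with a name and a parent pointer."""

    def __init__(self, name=None, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        """Return ancestors ordered from the immediate parent to the root."""
        found = []
        node = self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found


def lineage_from_ancestors(tip):
    """Join the names of all named ancestors, root first."""
    names = [node.name for node in tip.ancestors() if node.name]
    return "; ".join(reversed(names))


# Toy decorated tree: root -> phylum -> unnamed node -> placed fragment.
root = Node("d__Eukarya")
phylum = Node("p__Ciliophora", parent=root)
unnamed = Node(None, parent=phylum)
fragment = Node("ASV_1", parent=unnamed)

print(lineage_from_ancestors(fragment))  # d__Eukarya; p__Ciliophora
```

On a real decorated tree, internal node names may need extra parsing (for example if a node name carries a support value or more than one rank), so treat this as a starting point only.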

Does that help? Sorry for the delayed reply, busy few weeks.

All the best,
Daniel

system · October 15, 2021, 4:18am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.