RESCRIPt "super" dereplicate incorrect taxonomic behavior

Will_Rumfelt · October 2, 2023, 4:13pm

Hi QIIME Community,

I have been following Devon O'Rourke's tutorial on building a COI reference database and think I may have found a bug in RESCRIPt dereplicate (or I'm missing something ):

Prior to dereplication, I have a reference sequence with the taxonomic identification:
GMGMU2090-20 k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__

After dereplicating with the "super" option, this reference looks like:
GMGMU2090-20 k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis

However, Harmonia axyridis is not a member of the Hymenoptera order, it should be part of Coleoptera. At first I thought this might be due to an incorrect label of a Harmonia reference that allowed it to collapse with the Hymenoptera sequence, but no Harmonia axyridis sequences in the pre-dereplication file are labelled as Hymenoptera.

Any thoughts on what may be happening here? I appreciate any input!

SoilRotifer · October 2, 2023, 5:19pm

Hi @Will_Rumfelt,

This is interesting. Can you provide more details on the exact commands, and the options used?

Feel free to DM me any of your input files, especially the reference files prior and after dereplication.

-Mike

Will_Rumfelt · October 2, 2023, 5:58pm

Hi @SoilRotifer,

Here is the full command that produced the seemingly errant tax id:

qiime rescript dereplicate \
	--i-sequences qiimeArts/bold_ambi_hpoly_length_filtd_seqs.qza \
	--i-taxa qiimeArts/bold_cleaned_taxa.qza \
	--p-mode 'super' \
	--p-derep-prefix \
	--o-dereplicated-sequences qiimeArts/bold_derepSuper_seqs.qza \
	--o-dereplicated-taxa qiimeArts/bold_derepSuper_taxa.qza

Using the following command resulted in the sequence retaining its original tax id:

qiime rescript dereplicate \
    --i-sequences qiimeArts/bold_ambi_hpoly_length_filtd_seqs.qza \
    --i-taxa qiimeArts/bold_cleaned_taxa.qza \
    --p-mode 'uniq' \
    --p-derep-prefix \
    --o-dereplicated-sequences qiimeArts/bold_derepUniq_seqs.qza \
    --o-dereplicated-taxa qiimeArts/bold_derepUniq_taxa.qza

For now I went through the rest of the COI database construction with the 'uniq' dereplicated files - this resulted in retaining some extra sequences, which makes sense based on my understanding of 'super' and 'uniq' - and have seen no evidence of it recreating the Hymenoptera/Harmonia tax id.

I'll get the reference files either uploaded here or DM'ed to you shortly!

-Will

Will_Rumfelt · October 2, 2023, 6:59pm

Here is a Google Drive folder with the files before and after dereplication, using each of the 'super' and 'uniq' options.

BOLD_COI_DB

Cheers,
Will

SoilRotifer · October 2, 2023, 8:40pm

Thanks @Will_Rumfelt!

I just downloaded the data. Feel free to remove the link if needed. Give me a few days to investigate.

Just one quick off-the-cuff comment: I downloaded the files from Devon's tutorial (which used 'super'), and did not notice any taxonomic issues similar to the one you've outlined. Not sure what that means, but at least we have a baseline.

devonorourke · October 3, 2023, 2:19pm

The problem is the record, ultimately. But the bioinformatic process may be something to keep track of. The issue is that the specific record you have indicated, from BOLD, is labeled exactly as you are showing. I agree, Harmonia axyridis is a lady bug (a beetle!), and not an ant or wasp or bee. But that's what your record from BOLD says it is, a

If you search in V4 BOLD system's public website (click here), then search for your label ( GMGMU2090-20), you'll see the resultant taxonomic label (or, just click this link with the appropriate search string entered):

SoilRotifer · October 3, 2023, 4:07pm

Hi @devonorourke ,

This may not be the entire problem... Indeed his base bold_cleaned_taxa.qza file does contain the record with the incorrect order, Hymenoptera:

GMGMU2090-20	k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__

I downloaded @Will_Rumfelt 's files, then ran the commands myself. Like so:

qiime rescript dereplicate \
	--i-sequences bold_ambi_hpoly_length_filtd_seqs.qza \
	--i-taxa bold_cleaned_taxa.qza \
	--p-rank-handles 'kingdom' 'phylum' 'class' 'order' 'family' 'genus' 'species' \
	--p-mode 'super' \
	--p-derep-prefix \
	--o-dereplicated-sequences bold_derepSuper_seqs_test.qza \
	--o-dereplicated-taxa bold_derepSuper_taxa_test.qza

The output actually becomes:

GMGMU2090-20	k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis

So, I am wondering if there is an issue with the taxon rank backfilling or something? I've not figured out why this happens yet...

Will_Rumfelt · October 3, 2023, 4:13pm

Thank for the response, @devonorourke!

I'm still confused as to why the label would change after dereplication. If the BOLD record is k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__ when downloaded, would we expect it to end up with a mismatched family/genus/species after dereplication? Does the 'super' algorithm add lower taxonomic classifications to references without requiring the higher order classifications to match? I might be misunderstanding the 'super' option - any clarification is appreciated!

On a separate note, both Devon's and @SoilRotifer's tutorials on building databases and RESCRIPt work have been immensely helpful for various aspects of my database/metabarcoding work - thank you both!

SoilRotifer · October 4, 2023, 3:02pm

Hi @Will_Rumfelt ,

I just wanted to provide you with an update. We are actively investigating this issue. I was able to make a small reproducible example of the erroneous mixed taxonomy. In the short term, I'd suggest that you to stick with the uniq option for the time being. We'll keep you posted as we work on this.

SoilRotifer · October 5, 2023, 2:33pm

Hi @Will_Rumfelt

After conferring with @Nicholas_Bokulich, it turns out the code is working as written / intended. Below is the result of our discussion, which ultimately describes how --p-mode super works:

Dereplication with super mode, looks for the majority annotation (from among all collapsed sequences) at each rank independently. BUT it also collapses substrings into superstrings at each rank.

Let's assume we have 3 identical reference sequences with the following annotations:

ID	annotation
GMGMR1817-18	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis
GMGMU2090-20	k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__
GMGMV5203-20	k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__;g__;s__

Using --p-mode super on this sequence set would indeed yield the following (bolding the order annotation that is wrong):
k__Animalia;p__Arthropoda;c__Insecta;o__Hymenoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis

Why? Because super mode is first binning the annotations at each rank to yield something like:
[k__Animalia, k__Animalia, k__Animalia]
...
[o__Coleoptera, o__Hymenoptera, o__Hymenoptera]
[f__Coccinellidae, f__, f__]
[g__Harmonia, g__, g__]
[s__Harmonia axyridis, s__, s__]

It then finds the superstrings at each rank independently:
[o__Coleoptera, o__Hymenoptera, o__Hymenoptera]
[f__Coccinellidae, f__Coccinellidae, f__Coccinellidae]
[g__Harmonia, g__Harmonia, g__Harmonia]
[s__Harmonia axyridis, s__Harmonia axyridis]

and then finds the majority annotation to yield the incorrect hybrid annotation above. Note that because the majority annotation has a bunch of empty ranks, the final annotation is "polluted" by a single reference annotation that does not have empty ranks at genus and species levels.

Intended use of --p-mode super:
super mode was written with the assumption that you (a) do not have misannotations in your database and (b) you do not have empty ranks, or at least these empty ranks are not the "correct" annotation. Super was written to try to "fill in the blanks" when these annotations exist and will never work if the "desired" annotation is one with empty ranks.

I hope this helps!

Will_Rumfelt · October 5, 2023, 3:09pm

Cool, thanks @SoilRotifer! (And others along the way!)

That's a super helpful description of how super works, I didn't realize that it would add the lower ranks to a sequence if the higher ranks weren't all matches.

uniq seems to be the best choice for my large BOLD database then!

SoilRotifer · October 6, 2023, 4:17pm

Hi @Will_Rumfelt,

Just to follow up...

If you'd like to reduce any potential negative affects of incomplete or mis-annotated taxonomy, you can remove reference sequences that do not have at least a family or genus-level taxonomy. You can achieve this by using taxonomy-based filtering of the sequences. Then proceed with any other filtering that you require prior to finalizing and training your reference database.

For any retained sequences that you know are mis-annotated, you can use RESCRIPt's edit-taxonomy action to fix these.

-Cheers!