filter for taxonomic completeness

devonorourke · July 22, 2020, 1:59pm

Suppose I have the following taxonomic information assigned to some sequence features as follows:

seq1    k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Sciaridae;g__Bradysia;s__Bradysia nomica
seq2    k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Ceratopogonidae;g__Dasyhelea;s__
seq3    k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Cecidomyiidae;g__;s__

Notice how seq1 has taxa assigned at every rank through to species, but seq2 is lacking species-level information, and seq3 is missing both genus- and species-rank information?

It seems like the taxonomy-based filtering described here is the tool I'm looking for, but I wanted to confirm the appropriate syntax. The examples there show how to include features by taxonomic rank (--p-include p__) and how to exclude a taxon that is present but unwanted (--p-exclude chloroplast).

In my case, I want to drop features using a prefix that isn't distinct. I can't just use --p-include s__ because that keeps everything, right? Is there support for a wildcard, perhaps? Something like --p-exclude s__$? I'd probably need to be careful and also allow for trailing white space...

Thanks!

jwdebelius · July 22, 2020, 2:57pm

Hi @devonorourke,

I hope you're doing well!

I don't have an answer for you, just a philosophical question about your database. When I see your seq2 annotation, it suggests to me that your database (as formatted) doesn't have a species-level assignment for that organism. I know you're doing COI and I don't know the completeness of the database, but I'd be curious about how many such records there are.

Otherwise, sorry for the question and no answer.

Best,
Justine

devonorourke · July 22, 2020, 3:03pm

Yeah, thanks for nothing @jwdebelius!

The annotation can happen in two ways:

As you describe, a sequence can get assigned by a reference that was incomplete in the first place. To your question, yes, I have created COI databases where species information is lacking, so this is absolutely one possibility.
However, another situation can arise even if you filter that database ahead of time and require all references to have species information. This occurs when you apply an LCA consensus approach to the assigned taxonomy. If a sequence has two top hits with different species names, the resulting classification will be ambiguous.

In either case I want to filter these records out. While my example was using species-rank filtering, I'm actually hoping to apply this to records that lack family-rank information. I just gave the species-level example as it seemed simpler.

Hope that helps, and you're doing well also!

Nicholas_Bokulich · July 22, 2020, 3:11pm

For what purpose? Are you hoping to measure taxonomic completeness? Or do you want to count the number of features missing family-level assignments? Or do you want to wash those unclassified OTUs right out of your hair?

You could probably do something like `--p-exclude 'f__;g__'` if the taxonomies do not terminate at the rank where they are unclassified (as I believe is the case in your LCA example).
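To make that caveat concrete, here is a minimal sketch in plain Python (not q2-taxa itself) of which records a substring like f__;g__ would catch. The `in` test only stands in for contains-style matching, and seq4 and seq5 are hypothetical records added for illustration; they are not from the original post.

```python
# seq3 is taken from the first post; seq4 and seq5 are hypothetical
# records that both lack a family assignment, written two different ways.
taxonomy = {
    "seq3": "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Cecidomyiidae;g__;s__",
    # Empty family label, but the lower-rank prefixes are still written out.
    "seq4": "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__;g__;s__",
    # Taxonomy string that simply terminates at order.
    "seq5": "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera",
}


def matches_substring(taxon: str, pattern: str = "f__;g__") -> bool:
    """Plain substring test, standing in for contains-style matching."""
    return pattern in taxon


flagged = sorted(fid for fid, taxon in taxonomy.items() if matches_substring(taxon))
print(flagged)
```

Only seq4 is flagged. The truncated seq5 slips through because it contains no f__ prefix at all, which is why the suggestion depends on the taxonomy strings not terminating at the unclassified rank.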

devonorourke · July 22, 2020, 3:19pm

Thanks @Nicholas_Bokulich

The immediate purpose is to filter out features missing family-level information. After looking at this kind of data for a while, I've yet to come across an instance where an ASV or OTU feature was lacking at least family-level information and was itself highly abundant or occurred in repeated samples. I could filter using frequency (of detection)-based or abundance (of sequence)-based thresholds, and likely would filter out a similar grouping of features, but my motivation in looking at these samples was to retain features whose taxonomic identity I am more certain of.

The secondary purpose was to find out if wildcard filtering was even a thing. I didn't see anything in the filtering documentation that mentioned it, but perhaps that easter egg is in there somewhere... I think your example will certainly work in my specific use case, but I'm still unclear how I might filter out species-level ambiguous taxa. Not a problem for me right now, but just wanted to raise that question.

Thanks!

Nicholas_Bokulich · July 22, 2020, 3:44pm

I did a bit of digging and I think yes, certain wildcard expressions are allowed, so you could do something like --p-include s__? to include only features with a species label (in this case I think ? will match any single character, whereas * will match zero or more so s__* will still match empty species).

q2-taxa is using an sqlite query under the hood:

https://github.com/qiime2/q2-taxa/blob/3c82fac569f3b454bc74da3bc68f180797a1c700/q2_taxa/_method.py#L55

          # ensuring that there are no "extra ids" in the returned ids_to_keep.
          taxonomy = taxonomy.filter_ids(feature_ids)
          
          if mode == 'exact':
              query_template = "Taxon='%s'"
          elif mode == 'contains':
              if include is not None:
                  include = include.replace('_', '\\_')
              if exclude is not None:
                  exclude = exclude.replace('_', '\\_')
              query_template = "Taxon LIKE '%%%s%%' ESCAPE '\\'"
          else:
              raise ValueError('Unknown mode: %s' % mode)
          
          # First identify the features that are included (if no includes are
          # provided, include all features).
          if include is not None:
              include = include.split(query_delimiter)
              ids_to_keep = set()
              for e in include:
                  query = query_template % e

So it looks like any sqlite wildcard other than _ can be used for pattern matching.

devonorourke · July 22, 2020, 3:48pm

!
thanks @Nicholas_Bokulich

p.s. those are eggs, in case it's unclear

thermokarst · July 22, 2020, 4:54pm

I'm pretty sure SQLite (the backend that runs this filter query) uses % and _ as the wildcards in LIKE expressions:

https://sqlite.org/lang_expr.html#like

This is pretty common in databases, but doesn't seem to line up with the * glob we all know and love.
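A small sketch with Python's built-in sqlite3 module shows the behavior. It mirrors the escaping and the '%...%' wrapping from the q2-taxa excerpt quoted earlier in the thread, but the table name and the `like_contains` helper are hypothetical and this is not the plugin's actual code path.

```python
import sqlite3

# The three example records from the first post.
rows = [
    ("seq1", "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Sciaridae;g__Bradysia;s__Bradysia nomica"),
    ("seq2", "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Ceratopogonidae;g__Dasyhelea;s__"),
    ("seq3", "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Cecidomyiidae;g__;s__"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (id TEXT, Taxon TEXT)")
conn.executemany("INSERT INTO taxonomy VALUES (?, ?)", rows)


def like_contains(term: str) -> list:
    """Hypothetical helper: escape '_' and wrap the term in '%' on both sides."""
    escaped = term.replace("_", "\\_")
    query = "SELECT id FROM taxonomy WHERE Taxon LIKE ? ESCAPE '\\' ORDER BY id"
    return [row[0] for row in conn.execute(query, ("%" + escaped + "%",))]


# Every record contains the literal text 's__', so this keeps everything.
print(like_contains("s__"))
# '?' and '*' have no special meaning in LIKE, so they are matched literally.
print(like_contains("s__?"))
print(like_contains("s__*"))
# '%' does act as a wildcard (zero or more characters).
print(like_contains("f__C%;s__"))
# An empty genus followed by the species prefix can be matched as a substring.
print(like_contains("g__;s__"))
```

Because the search term is always wrapped in % on both sides, there is no way to anchor a pattern to the end of the string, so "s__ followed by nothing" cannot be expressed this way.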

Nicholas_Bokulich · July 22, 2020, 5:08pm

So I think I was mixing up my LIKEs and GLOBs when suggesting --p-include s__? above.

Since filter-seqs (a) uses LIKE and (b) escapes _ characters, I think the short answer to the question of whether wildcard filtering is supported is that filter-seqs does have wildcard filtering, but it is pretty limited since you can only use the % wildcard. That will not work for your unclassified species example.

So maybe RESCRIPt can help you out here... it has a filter-taxa method that uses regular expressions (including the familiar ? and * wildcards) for including/excluding taxa. You can use that method to filter your taxonomy, then filter your sequences with:

qiime feature-table filter-seqs \
    --i-data seqs.qza \
    --m-metadata-file filtered-taxonomy.qza \
    --o-filtered-data filtered-seqs.qza

to only include seqs that are found in the taxonomy file.

To be honest, I have not used wildcards for filtering with that method yet — it was designed and tested for specific substring filtering — but as far as I can tell wildcards are allowed, so this may be an opportunity to chase a few bugs out of the woodwork and tighten up regex filtering!
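For illustration, the kind of regex check involved can be sketched in plain Python with the standard re module. This is only a sketch of the idea, not RESCRIPt's API: the `rank_is_missing` helper is hypothetical, and it assumes semicolon-delimited taxonomy strings with rank prefixes like those in the first post.

```python
import re

# The three example records from the first post.
taxonomy = {
    "seq1": "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Sciaridae;g__Bradysia;s__Bradysia nomica",
    "seq2": "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Ceratopogonidae;g__Dasyhelea;s__",
    "seq3": "k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Cecidomyiidae;g__;s__",
}


def rank_is_missing(taxon: str, prefix: str) -> bool:
    """Hypothetical helper: True if the rank prefix is absent or has no label.

    Captures everything between the prefix and the next ';' (or the end of
    the string) and treats a blank or whitespace-only label as missing.
    """
    match = re.search(re.escape(prefix) + r"([^;]*)", taxon)
    return match is None or match.group(1).strip() == ""


# Keep only the features that carry a species label.
with_species = sorted(
    fid for fid, taxon in taxonomy.items() if not rank_is_missing(taxon, "s__")
)
print(with_species)

# The same helper works for any rank, including truncated taxonomy strings.
print(rank_is_missing(taxonomy["seq3"], "g__"))
print(rank_is_missing("k__Animalia;p__Arthropoda", "f__"))
```

Capturing the label and stripping it handles trailing white space, and the explicit check for an absent prefix covers taxonomies that terminate at a higher rank.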
