Equivalent to remove PyNAST failures in QIIME2

calozupone · May 23, 2018, 8:07pm

Hello,
I am working with a dataset that contains some samples with many sequences that would have been removed as PyNAST failures with QIIME1. These sequences (which are unassigned taxonomically when running through greengenes) appear to more common in samples with low amounts of bacterial versus human signal in biopsy (might be mitochondria). I am concerned about using the protocol of MAFFT + fast tree in the moving pictures tutorial because it will be trying to align and make a tree with these "junk sequences" and that is going to make for a very ugly tree. I just ran the data successfully through SEPP, which I believe will not add to the tree sequences that are very distant from the reference 16S rRNA sequences. This appears to be a good alternative to MAFFT+fast tree in QIIME2, but I do not see an equivalent part of QIIME2 where sequences that are not close to anything in the reference tree/database are dropped from the OTU table. It would be nice to have an equivalent to the production of a OTU table without PyNAST failures as is done in QIIME1 in recommended QIIME2 protocols. I am also just generally concerned about the recommended MAFFT + fast tree protocol because many datasets may have some PyNAST failure sequences and these will produce very long branches.
Thanks,
Cathy

gregcaporaso · May 24, 2018, 11:03am

Hi @calozupone,
I also worry about those sequences when using the MAFFT + FastTree protocol.

One thing that I've been doing lately is removing sequences that aren't assigned at least to the phylum level. This section of the docs illustrates how to do that for a FeatureTable, and this section illustrates how to do that for FeatureData[Sequences] .

I do think SEPP is likely to be a better choice here though, and we're planning to work some discussion of that into the Moving Pictures tutorial for the June release.

calozupone · May 24, 2018, 12:35pm

Thanks Greg! That is really helpful!

calozupone · May 24, 2018, 12:35pm

Hi again,

I think it is a great idea to remove the sequences not assigned at least to the phylum level and then make the table. Maybe this protocol would be good to work into moving pictures tutorial?

Thanks for your helpful response,
Cathy

gregcaporaso · May 24, 2018, 1:09pm

Hi @calozupone,
Yes, I agree about working this into the Moving Pictures tutorial. I made a note about that in this issue. I'm going to try to do this for the next release (scheduled for June).

antgonza · May 24, 2018, 4:18pm

Just to add to the discussion, perhaps the way to go about this is to have 2 options: (1) what has been discussed so far: assign taxonomies, remove "noisy" features by taxonomic-classification, build tree; and (2) build trees using fragment insertion, remove any features that can't be inserted and filter insertions based on a threshold, for example (this could be a full discussion too) any new long insertions or insertions that create new clades.

Bringing this up, as we all know, taxonomic classification is a complicated subject and filtering based on them, IMOO, is even more complicated. By the way, SEPP has a beta feature to assign taxonomy via the insertions, which also is another topic perhaps worth discussing for future benchmark and development.

calozupone · May 24, 2018, 9:24pm

Yes the latter option would be more similar to the QIIME 1 remove PyNAST failures protocol and I think is potentially a good option. Currently if you run things through SEPP to make the tree it is not really clear how many sequences may have been dropped and so then your estimates of what level to rarefy at before e.g. doing unifrac that is based on the full pre-SEPP set of sequences, may be wrong. Would be good at any rate to have a way to filter both the ref-seqs and feature table files based on what is kept in the tree.

gregcaporaso · May 25, 2018, 11:00am

Hi all,
I'd see the SEPP-based workflow for handling this working as follows:

Generate your FeatureTable and FeatureData[Sequences] (e.g., DADA2, deblur, ...)
Generate SEPP insertion tree from FeatureData[Sequences] to ultimately generate Phylogeny[Rooted], which may contain fewer tips than there are sequences in FeatureData[Sequences] (because some sequences get filtered).
Filter the FeatureTable based on the Phylogeny[Rooted], so that any features that aren't represented in Phylogeny[Rooted] are dropped from the FeatureTable. (Phylogeny[Rooted] can contain features/tips that aren't represented in Phylogeny[Rooted].)
Downstream steps (e.g., core-metrics) use this filtered FeatureTable.

@antgonza, it'd be very cool to see the other SEPP filtering functionality you describe exposed in the q2-fragment-insertion plugin! It should also be relatively straight-forward to benchmark the taxonomy assignment using tax-credit (the system used for this paper). I think @nbokulich would be willing to advise on that if one of the SEPP developers were interested.

antgonza · May 25, 2018, 2:39pm

I think that workflow makes sense and I would love to see if we can come up with a way to filter down spurious insertions.

Anyway, I'll be interested in seeing and helping with the benchmarks, cc-ing @Stefan as he's the main developer of the SEPP q2 plugin to get his interest on this.

Stefan · May 29, 2018, 8:37am

here are my two cents:

@rejected sequences: I haven't seen a direct option to change the threshold to reject sequences, but from my experience, only very unrelated sequences are rejected and their accumulated relative abundance is typically low ~ 1%.

@filter feature table: I agree, we should add a function to filter a feature table such that only features remain that are nodes of a phylogeny (and reports the fraction of lost reads per sample). Since this seems to be a quite general function, I am not sure if we should add this to the q2-fragment-insertion plugin or to the phylogeny or feature-table plugin. Right now, I don't know how to do it with qiime2 plugins only. I am using some homebrew python code like

    tips = {t.name for t in tree_sepp.tips() if t.name is not None if t.name.isdigit() is False}
    counts = counts_deblur[counts.index.isin(tips)]
    pandas2biom(file_counts_deblur_insepp, counts)

where counts is a pandas.DataFrame of the feature-table, I rely on the assumption that all tips of the reference tree are numeric (Greengenes 13.8 99% OTU-IDs) and inserted fragments are nucleotid sequences, i.e. non-numeric. pandas2biom is another homebrew function to save a pandas.DataFrame as a biom-table.

@new clades: that's a very interesting idea. We currently store all placements for fragment sequences of the 250k samples managed in Qiita. Once I find the time, I'd like to place all those fragments into one gigantic tree, traverse it and try to identify those nodes where many novel fragments with relative long insertion distance get inserted close to the root, which might be indicators for novel clades.
Should fragments of those candidate clades be really diverse, one could e.g. de-novo compute a phylogeny for them and try to graft it to the general Greengenes tree. How to combine branch length of graft and reference tree is currently unclear to me.

@taxonomic assignments:
Currently, I have two strategies to use the insertion tree for taxonomic assignment.

since the reference tree has some (not as many as in the Greengenes tables) taxonomic labels for internal nodes, I can simply traverse from each inserted fragment towards the root and collect all labels along this path.
the more conservative strategy is exposed as classify_otus-experimental and searches for every inserted fragment for the closest OTU-ID in the tree and uses this ID to look up the assigned taxonomic lineage information in the Greengenes tables.

Both methods have NOT been benchmarked! Recommendations for a thorough benchmarking setup are welcome, specifically hints for a good gold standard dataset.

Nicholas_Bokulich · May 29, 2018, 2:59pm

Hi @Stefan,

@gregcaporaso mentioned above how the :qiime2: classifiers were benchmarked:

The benchmark framework and test datasets used for :qiime2: are described in that paper, and intended for extensible use. We used a number of mock communities and simulated datasets that would be useful here. I would be happy to help out if you have questions, just send me a DM or email so we don't move this thread off-topic.

Stefan · June 6, 2018, 8:22pm

I started this post Filter feature table phylogenetically regarding the filtering functionallity and would be happy for some input to my open questions.

tanaes · June 13, 2018, 3:39pm

Re: recomputing novel clade structures from insertions, this is a problem that I believe Siavash Mirarab has a student actively working on. It might be efficient to wait to see if they come up with a release of SEPP that can handle this task, and then see if we can port the functionality over?