Fragment insertion - insertion tree seems too large

(Nick Scales) #1

Hi there,

I think I have successfully run the fragment insertion plugin on qiime2 v.2019.1:

qiime fragment-insertion sepp \
  --i-representative-sequences rep-seqs.qza \
  --i-reference-alignment alignment.qza \
  --i-reference-phylogeny rooted_tree.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza

qiime fragment-insertion filter-features \
  --i-table edited-table.qza \
  --i-tree insertion-tree.qza \
  --o-filtered-table filtered_table.qza \
  --o-removed-table removed_table.qza

qiime fragment-insertion classify-otus-experimental \
  --i-representative-sequences rep-seqs.qza \
  --i-tree insertion-tree.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy.qza

However the insertion tree output is absolutely ginormous when the reference tree input was really quite small. My understanding is that it takes my input tree, and places my sequences on that tree if they are similar enough, and rejects them if they aren’t - so how do I end up with a significantly larger tree than I began with?

The taxonomy output shows that around half have been placed only to genus level; could this be part of the answer? Another point is that the outgroup is still quite similar, so the 75% cutoff may be a little low for my purpose. Can it be set higher?

Feature ID Taxon
a7c303b7d6ee41311b734fcc9391d381 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium
75badcbf1fcbb970dc9c606afead6056 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium
cbfc0644207e9a6efd610ed31ee35ee7 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium; s__MCPF17_052
7ced95620266c820397b1b6ef6a01d36 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium
8e1b19982e67dc63cd428059831f1889 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium; s__MCPF17_002
f639e27b3ae54961084e72dc9343fb9d k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium; s__MCPF17_031

In addition, I added the --verbose flag during the filter-features step to get some statistics about the number of reads lost per sample, but it appears to have skipped the middle section. It reported that the statistics are from 96 samples which is correct, but there is a … section in the middle and only 61 lines of statistics - is there a maximum number that it can report?

RaxML_reroot.nwk (3.5 KB)
(input tree)
insertion-tree.nwk (514.8 KB)
(output tree)
fragment-insertion-stats.csv (1.4 KB)
(statistics about tree placement)
insertion-placements.json.txt (4.8 MB)
(placements json in case that is helpful)

Thank you for your help!

-Nick

(Nicholas Bokulich) #2

The tree grows substantially larger because many new sequences (your rep-seqs) are being inserted into it, each one as an additional tip.

The classify-otus-experimental is not a very good taxonomy classifier according to our benchmarks; it is overly “cautious” and hence less prone to false positives but much much more prone to false negatives and incomplete classification. So the genus-level classification is probably related to using this classifier (you might be able to do better with a different classification method) but is probably not diagnostic of tree quality.

Sounds like there is probably a maximum number reported but I am not sure.

@Stefan can you please confirm these points?

(Nick Scales) #3

Hi @Nicholas_Bokulich,

Many thanks for your responses - do you know if it’s possible then to get something like the pplacer guppy fat output where instead of adding each new sequence to the tree it gives a thicker branch to show where the sequences were placed?

Thanks!

(Nicholas Bokulich) #4

Not in QIIME 2 (yet). Sounds like someone should put together a pplacer plugin for QIIME 2! :wink:

(Stefan Janssen) #5

The number of tips in insertion-tree.qza should be equal to the number of tips in the reference rooted_tree.qza plus the number of unique fragments in rep-seqs.qza, or a little lower if some fragments have been rejected. What do you mean by "significantly"? Can you give some numbers to be on the safe side?
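To make that check concrete, here is a minimal Python sketch that counts tips in a Newick string and computes the upper bound described above. It is an illustration only, not part of the plugin: `count_tips` and `expected_max_tips` are hypothetical helper names, the toy tree is made up, and the counter assumes tip labels are not quoted strings containing commas or parentheses.

```python
def count_tips(newick: str) -> int:
    """Count tips (leaves) in a Newick string.

    A tip starts immediately after "(" or "," whenever the next
    character does not open a new clade with "(".
    Assumption: labels are unquoted and contain no "(", ")" or ",".
    """
    tips = 0
    prev = ""
    for ch in newick:
        if ch.isspace():
            continue  # ignore line breaks and spaces inside the tree
        if prev in ("(", ",") and ch != "(":
            tips += 1
        prev = ch
    return tips


def expected_max_tips(reference_newick: str, n_unique_fragments: int) -> int:
    """Upper bound on insertion-tree tips: reference tips + unique fragments."""
    return count_tips(reference_newick) + n_unique_fragments


if __name__ == "__main__":
    # Toy reference tree with three tips (made-up example data).
    toy_reference = "((A:0.1,B:0.2):0.3,C:0.4);"
    print(count_tips(toy_reference))
    # Five hypothetical unique fragments inserted into the toy tree.
    print(expected_max_tips(toy_reference, 5))
```

Running `count_tips` on the exported input and output .nwk files and comparing the difference with the number of sequences in rep-seqs.qza would give the numbers asked for here.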

As @Nicholas_Bokulich already stated correctly, you should not use this classifier unless you are dealing with very, very unknown taxa. Better to use, e.g., the RDP naive Bayes classifier. Results depend on the quality and quantity of ref-taxonomy.qza. If most lineages in this file do not go deeper than genus, classify-otus-experimental cannot be more specific.
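One way to see how deep the lineages in a taxonomy file go is to tally the deepest named rank per lineage. The sketch below is a rough illustration, assuming Greengenes-style strings with `k__` to `s__` prefixes like the ones posted above; `deepest_rank` is a hypothetical helper, and the two lineages are copied from the first post.

```python
from collections import Counter


def deepest_rank(lineage: str) -> str:
    """Return the one-letter prefix of the deepest named rank.

    Assumption: ranks look like "g__Curtobacterium", separated by ";".
    Empty ranks such as "s__" are treated as unassigned.
    """
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    named = [r for r in ranks if len(r) > 3]  # "s__" alone carries no name
    return named[-1][:1] if named else "unclassified"


if __name__ == "__main__":
    # Two lineages taken from the taxonomy output earlier in the thread.
    lineages = [
        "k__Bacteria; p__Actinobacteria; c__Actinobacteria; "
        "o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium",
        "k__Bacteria; p__Actinobacteria; c__Actinobacteria; "
        "o__Actinomycetales; f__Microbacteriaceae; g__Curtobacterium; "
        "s__MCPF17_052",
    ]
    print(Counter(deepest_rank(lineage) for lineage in lineages))
```

Applying the same tally to the reference taxonomy and to the classification output would show whether the genus-level results simply mirror the depth of the reference.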

I don’t know. You seem to be a real power user of SEPP - maybe it is time to leave the QIIME 2 wrapper and call SEPP directly? You can now install it via conda: https://anaconda.org/bioconda/sepp. That way, all available parameters (see https://github.com/smirarab/sepp) are directly accessible rather than hidden behind the QIIME 2 wrapper.

(Stefan Janssen) #6

I tried to compile pplacer from source, but because of its OCaml dependencies I have failed so far. Furthermore, if I understand correctly, it is not actively maintained anymore.

(Nick Scales) #7

Thanks for getting back to me, it makes a lot of sense now, I think I should indeed have a go at calling SEPP directly and see if this can get me to where I want to be. Thank you @Stefan and @Nicholas_Bokulich!
