However the insertion tree output is absolutely ginormous when the reference tree input was really quite small. My understanding is that it takes my input tree, and places my sequences on that tree if they are similar enough, and rejects them if they aren't - so how do I end up with a significantly larger tree than I began with?
The taxonomy output shows that perhaps around half have been placed only to genus level; could this be part of the answer? Another point is that the outgroup is still quite similar, so maybe the 75% cutoff is a little low for my purpose. Can this be set higher?
In addition, I added the --verbose flag during the filter-features step to get some statistics about the number of reads lost per sample, but it appears to have skipped the middle section. It reported that the statistics are from 96 samples which is correct, but there is a ... section in the middle and only 61 lines of statistics - is there a maximum number that it can report?
The tree grows because many new sequences (your rep-seqs) are being inserted into it, so the output ends up substantially larger than the reference tree.
The classify-otus-experimental is not a very good taxonomy classifier according to our benchmarks; it is overly "cautious" and hence less prone to false positives but much much more prone to false negatives and incomplete classification. So the genus-level classification is probably related to using this classifier (you might be able to do better with a different classification method) but is probably not diagnostic of tree quality.
Sounds like there is probably a maximum number reported but I am not sure.
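If the printed summary is cut off, one workaround is to compute the per-sample loss yourself. This is only a minimal sketch: it assumes you already have per-sample read totals before and after filtering as plain dicts, and the function name and sample IDs are made up for illustration (none of this is part of QIIME 2).

```python
def reads_lost_per_sample(before, after):
    """Return {sample_id: (reads_lost, fraction_lost)} for every sample in `before`."""
    report = {}
    for sample_id, n_before in before.items():
        # A sample missing from `after` was dropped entirely, so it kept 0 reads.
        n_after = after.get(sample_id, 0)
        lost = n_before - n_after
        fraction = lost / n_before if n_before else 0.0
        report[sample_id] = (lost, fraction)
    return report


# Hypothetical counts, for illustration only.
before = {"S1": 1000, "S2": 500, "S3": 200}
after = {"S1": 900, "S2": 500}

for sample_id, (lost, fraction) in reads_lost_per_sample(before, after).items():
    print(f"{sample_id}\t{lost}\t{fraction:.1%}")
```

Because the loop prints every sample, nothing gets elided no matter how many samples you have.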
Many thanks for your responses - do you know if it's possible then to get something like the pplacer guppy fat output where instead of adding each new sequence to the tree it gives a thicker branch to show where the sequences were placed?
The number of tips in insertion-tree.qza should equal the number of tips in the reference rooted_tree.qza plus the number of unique fragments in rep-seqs.qza, or a little lower if some fragments have been rejected. What do you mean by "significantly"? Can you give some numbers, to be on the safe side?
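To check those numbers, something like the following might help. It is a rough sketch, assuming you have exported both trees to Newick text; the function names are my own and not part of QIIME 2 or SEPP. It relies on the fact that a Newick tree has one more tip than it has commas, once quoted labels and bracketed comments are stripped.

```python
import re


def count_newick_tips(newick: str) -> int:
    """Count tips in a Newick string as (number of separating commas) + 1."""
    # Drop quoted labels and [comments] so commas inside them are not counted.
    stripped = re.sub(r"'(?:[^']|'')*'", "", newick)
    stripped = re.sub(r"\[[^\]]*\]", "", stripped)
    return stripped.count(",") + 1


def expected_insertion_tips(reference_tips: int, unique_fragments: int, rejected: int = 0) -> int:
    """Reference tips plus inserted fragments, minus any rejected fragments."""
    return reference_tips + unique_fragments - rejected


# Toy example: a 4-tip reference tree and 3 unique fragments, none rejected.
reference = "((A:0.1,B:0.2):0.05,(C:0.1,D:0.3):0.05);"
print(count_newick_tips(reference))
print(expected_insertion_tips(count_newick_tips(reference), 3))
```

If the tip count of your insertion tree is far above the expected value, that would be worth reporting together with the actual numbers.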
As @Nicholas_Bokulich already stated correctly, you should not use this classifier unless you are dealing with very, very unknown taxa. Better to use e.g. the RDP naive Bayes classifier. Results depend on the quality and quantity of ref-taxonomy.qza. If most lineages in this file do not go deeper than genus, classify-otus-experimental cannot be more specific.
I don't know. You seem to be a real power user of SEPP - maybe it is time to leave the Qiime2 wrapper and call SEPP directly? You can now install it via conda (Sepp | Anaconda.org). That way, all available parameters (see GitHub - smirarab/sepp: Ensemble of HMM methods (SEPP, TIPP, UPP)) are directly accessible rather than hidden from you behind the Qiime2 wrapper.
I tried to compile pplacer from source, but because of OCaml I have failed so far. Furthermore, if I understand correctly, it is no longer actively maintained.
Thanks for getting back to me, it makes a lot of sense now, I think I should indeed have a go at calling SEPP directly and see if this can get me to where I want to be. Thank you @Stefan and @Nicholas_Bokulich!