Hi @Yin_Hui_Cheok,
Often I like to perform a variety of sequence and taxonomy filtering prior to inferring a phylogeny for use in downstream analyses. There are many ways to do this, but here are a few suggestions:
- Remove unwanted / poorly annotated taxonomic groups. The following example removes any sequence variants that lack at least a phylum-level classification, along with eukaryote, chloroplast, and mitochondria sequences and anything unassigned or unclassified. Feel free to modify it to suit your needs.
qiime taxa filter-table \
--i-table ./dada2_table.qza \
--i-taxonomy ./taxonomy.qza \
--p-mode 'contains' \
--p-include 'p__' \
--p-exclude 'p__;,Eukaryota,Chloroplast,Mitochondria,Unassigned,Unclassified' \
--o-filtered-table ./table-no-ecmu.qza
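- Further quality filtering. Although DADA2, deblur, etc. incorporate some quality filtering, you can take some extra steps to remove more extraneous data. The example below shows how to remove any sequences that do not have at least a 90% identity match, over at least 90% of the query length, to the SILVA 138 reference sequences. Note that rep_set-no-ecmu.qza here is the representative sequence file after it has been filtered to match the table from the previous step.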
qiime quality-control exclude-seqs \
--i-query-sequences ./rep_set-no-ecmu.qza \
--i-reference-sequences ./references/silva-138-99-seqs-515-806.qza \
--p-method blast \
--p-perc-identity 0.90 \
--p-perc-query-aligned 0.90 \
--o-sequence-hits ./hits.qza \
--o-sequence-misses ./misses.qza
- You can remove potential host or other sequences using q2-quality-control actions:
qiime quality-control bowtie2-build
qiime quality-control filter-reads
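As a rough sketch of how those two actions fit together, something like the following should work. All file names are hypothetical placeholders, and the flag names are as I recall them from q2-quality-control, so double check with `--help` on your QIIME 2 version. Note that filter-reads operates on demultiplexed reads, so this step would happen before denoising. The small `run` helper is just my own convenience so the commands are printed rather than executed when qiime is not on your PATH.

```shell
# Print the command instead of running it when the tool is not installed.
run() {
  if command -v "$1" >/dev/null 2>&1; then "$@"; else echo "[dry run] $*"; fi
}

# Hypothetical file names; swap in your own artifacts.
host_ref=./host-genome-seqs.qza       # host reference sequences
host_db=./host-bowtie2-index.qza      # index built from the host reference
demux=./demux.qza                     # demultiplexed reads, before denoising
demux_clean=./demux-no-host.qza       # reads left after host removal

# Build a Bowtie 2 index from the host reference sequences.
run qiime quality-control bowtie2-build \
  --i-sequences "$host_ref" \
  --o-database "$host_db"

# Drop reads that align to the host index and keep everything else.
run qiime quality-control filter-reads \
  --i-demultiplexed-sequences "$demux" \
  --i-database "$host_db" \
  --p-exclude-seqs \
  --o-filtered-sequences "$demux_clean"
```

If you instead want to keep only the reads that hit the reference, `--p-no-exclude-seqs` should flip the behavior.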
Other thoughts:
If you are making your tree with q2-fragment-insertion, you may not need to filter your data first, as this approach only retains sequences that can be reliably inserted into an existing Greengenes or SILVA reference tree. The SEPP reference files can be found here.
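For reference, a minimal sketch of the fragment insertion route might look like this. The file names are hypothetical (in particular `sepp-refs.qza` stands in for whichever SEPP reference file you download), and the `run` helper is only there to print the commands when qiime is not installed.

```shell
# Print the command instead of running it when the tool is not installed.
run() {
  if command -v "$1" >/dev/null 2>&1; then "$@"; else echo "[dry run] $*"; fi
}

# Hypothetical file names; swap in your own artifacts.
rep_seqs=./rep_set.qza
sepp_refs=./sepp-refs.qza             # placeholder for the downloaded SEPP reference
table=./dada2_table.qza
tree=./insertion-tree.qza

# Insert the representative sequences into the reference tree.
run qiime fragment-insertion sepp \
  --i-representative-sequences "$rep_seqs" \
  --i-reference-database "$sepp_refs" \
  --o-tree "$tree" \
  --o-placements ./insertion-placements.qza

# Keep only the features that were actually placed in the tree.
run qiime fragment-insertion filter-features \
  --i-table "$table" \
  --i-tree "$tree" \
  --o-filtered-table ./table-inserted.qza \
  --o-removed-table ./table-not-inserted.qza
```

The removed table is worth a quick look, since it tells you how much of your data could not be inserted.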
Note that after running such commands, it’s a good idea to keep your data in sync. That is, after you filter your sequences/features, you’ll want to remove those same features from the table, and vice versa. So check out our docs on filtering, specifically:
qiime feature-table filter-seqs
qiime feature-table filter-features
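To make that concrete, here is a sketch of the two sync steps using the file names from the examples above. The unfiltered `rep_set.qza` and the output name `table-no-ecmu-hits.qza` are my own placeholders, and the `run` helper just prints the commands if qiime is not installed.

```shell
# Print the command instead of running it when the tool is not installed.
run() {
  if command -v "$1" >/dev/null 2>&1; then "$@"; else echo "[dry run] $*"; fi
}

rep_seqs=./rep_set.qza                    # hypothetical: unfiltered representative sequences
tax_table=./table-no-ecmu.qza             # table after the taxonomy filter
hits=./hits.qza                           # sequences kept by exclude-seqs
synced_table=./table-no-ecmu-hits.qza     # hypothetical output name

# Table was filtered first: keep only the sequences still present in the table.
run qiime feature-table filter-seqs \
  --i-data "$rep_seqs" \
  --i-table "$tax_table" \
  --o-filtered-data ./rep_set-no-ecmu.qza

# Sequences were filtered next: keep only the table features found in the hits.
run qiime feature-table filter-features \
  --i-table "$tax_table" \
  --m-metadata-file "$hits" \
  --o-filtered-table "$synced_table"
```

The second command relies on the sequence artifact being readable as feature metadata, so only the feature IDs present in `hits.qza` are retained in the table.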
Finally, you can run through the Inferring Phylogenies tutorial, and proceed with your analyses.
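If you go the de novo route, one common option is the MAFFT + FastTree pipeline. A minimal sketch, assuming the filtered sequences from above (`hits.qza`) as input and hypothetical output names:

```shell
# Print the command instead of running it when the tool is not installed.
run() {
  if command -v "$1" >/dev/null 2>&1; then "$@"; else echo "[dry run] $*"; fi
}

filtered_seqs=./hits.qza              # quality-filtered representative sequences
rooted_tree=./rooted-tree.qza         # hypothetical output name

# Align, mask, build the tree, and midpoint-root it in one pipeline.
run qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences "$filtered_seqs" \
  --o-alignment ./aligned-seqs.qza \
  --o-masked-alignment ./masked-aligned-seqs.qza \
  --o-tree ./unrooted-tree.qza \
  --o-rooted-tree "$rooted_tree"
```

The rooted tree is the one you would typically pass to the phylogenetic diversity metrics.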
Once you have a de novo tree that has been produced with quality data, you should be fine if you are only removing a few samples here and there (i.e. you’ll likely not have to remake the tree). But if you are removing a substantial number of samples or features, then it’d probably be a good idea to remake the de novo tree. This may not apply if you are using the fragment insertion approach, though, as the reference tree is effectively ‘static’.
Again, there are many ways to go about this. These are just my suggestions to get you started. I’m sure others will have thoughts on this too. By the way, do not forget to check out Empress.
-Cheers!
-Mike