Phylogenetic tree effect on downstream analysis

Hi all. During data input I uploaded the 16S DNA sequences of all my samples into QIIME in one batch, and I created a phylogenetic tree based on the rep-seqs.qza file generated by DADA2.

Prior to the downstream analyses, I filtered my samples according to the sample groupings I desired. My question is: will this affect my diversity analysis results?

Hi @Yin_Hui_Cheok,

Often I like to perform a variety of sequence and taxonomy filtering prior to inferring a phylogeny for use in downstream analyses. There are many ways to do this, but here are a few suggestions:

  1. Remove unwanted or poorly annotated taxonomic groups. The following example removes any sequence variants that do not have at least a phylum-level classification, and excludes eukaryote, chloroplast, and mitochondrial sequences, as well as any unclassified sequences. Feel free to modify it to suit your needs.
qiime taxa filter-table \
    --i-table ./dada2_table.qza \
    --i-taxonomy ./taxonomy.qza \
    --p-mode 'contains'  \
    --p-include 'p__' \
    --p-exclude 'p__;,Eukaryota,Chloroplast,Mitochondria,Unassigned,Unclassified' \
    --o-filtered-table ./table-no-ecmu.qza
  2. Further quality filtering. Although DADA2, Deblur, etc. incorporate some quality filtering, you can take extra steps to remove more extraneous data. The example below shows how to remove any sequences that do not have at least a 90% match to the SILVA 138 database.
qiime quality-control exclude-seqs \
    --i-query-sequences ./rep_set-no-ecmu.qza \
    --i-reference-sequences ./references/silva-138-99-seqs-515-806.qza \
    --p-method blast \
    --p-perc-identity 0.90 \
    --p-perc-query-aligned 0.90 \
    --o-sequence-hits ./hits.qza \
    --o-sequence-misses ./misses.qza
  3. You can remove potential host or other contaminant sequences using q2-quality-control actions:
    • qiime quality-control bowtie2-build
    • qiime quality-control filter-reads
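For step 3, a minimal sketch might look like the following. Note that the file names here (host-ref-seqs.qza, demux.qza, etc.) are placeholders, not artifacts from your run, so adjust them to match your data:

```shell
# Build a bowtie2 index from a set of host reference sequences
qiime quality-control bowtie2-build \
    --i-sequences ./host-ref-seqs.qza \
    --o-database ./host-bowtie2-db.qza

# Discard any demultiplexed reads that map to the host database
qiime quality-control filter-reads \
    --i-demultiplexed-sequences ./demux.qza \
    --i-database ./host-bowtie2-db.qza \
    --p-exclude-seqs \
    --o-filtered-sequences ./demux-no-host.qza
```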

Other thoughts:
If you are making your tree with q2-fragment-insertion, you may not need to perform filtering on your data, as this approach will only retain sequence data that can be reliably inserted into an existing Greengenes or SILVA reference tree. These SEPP reference files can be found here.
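As a sketch of that approach (the reference database file name below is a placeholder for whichever SEPP reference you download):

```shell
# Insert representative sequences into a SEPP reference tree
qiime fragment-insertion sepp \
    --i-representative-sequences ./rep_set-no-ecmu.qza \
    --i-reference-database ./sepp-reference.qza \
    --o-tree ./insertion-tree.qza \
    --o-placements ./insertion-placements.qza
```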

Note: after running such commands, it’s a good idea to keep your data in sync. That is, after you filter your sequences/features, you’ll want to remove those same features from the table, and vice versa. So check out our docs on filtering, specifically:

  • qiime feature-table filter-seqs
  • qiime feature-table filter-features
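For example, to keep your representative sequences in sync with the table filtered in step 1 (assuming your unfiltered sequences are in rep_set.qza):

```shell
# Retain only the sequences whose features remain in the filtered table
qiime feature-table filter-seqs \
    --i-data ./rep_set.qza \
    --i-table ./table-no-ecmu.qza \
    --o-filtered-data ./rep_set-no-ecmu.qza
```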

Finally, you can run through the Inferring Phylogenies tutorial and then proceed with your analyses.
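For a de novo tree, one common route from that tutorial is the align-to-tree-mafft-fasttree pipeline (again using the filtered sequence file name from the steps above):

```shell
# Align sequences, mask the alignment, and build rooted + unrooted trees
qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences ./rep_set-no-ecmu.qza \
    --output-dir ./phylogeny
```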

Once you have a de novo tree that has been produced with quality data, you should be fine if you are only removing a few samples here and there (i.e. you’ll likely not have to remake the tree). But if you are removing a substantial number of samples or features, then it’d probably be a good idea to remake the de novo tree. This may not apply if you are using the fragment insertion approach, as the tree is effectively ‘static’.

Again, there are many ways to go about this. These are just my suggestions to get you started. I’m sure others will have thoughts on this too. By the way, do not forget to check out Empress.

-Cheers!
-Mike


Hi @SoilRotifer, thanks for your response :relaxed:. I will go through your suggestions.
Before that, allow me to add some details regarding my work. I was referring to the Moving Pictures tutorial when I performed my analysis in QIIME. The phylogenetic tree was created using FastTree after the removal of unwanted and low-quality sequences. Also, I filtered out one-third of the sample IDs prior to the downstream analysis. Will that affect my results?

Thank you.

No problem. :slight_smile:

It might :man_shrugging:. Sorry to be so vague :wink:. But it doesn’t hurt to check :bar_chart:. Comparing these might even give you more insight into how your processing can affect your interpretation of the data. Also, view the tutorials as guidelines :spiral_notepad: that are set up to help you get used to using :qiime2:. There is no one way to process your data, as each data set is different.