Taxonomy table looked different in R than when it was opened in QIIME (missing taxon level prefix)?

Hi,

I ran QIIME2 for my raw reads via jupyter using %%bash in the shared Mac for heavy analysis (also a little Python for data viz). After that, we copied several outputs to build a phyloseq object in our personal computer (Win).

Previously, I also did that but found an interesting topic here to mix run Python and R in one Jupyter, so I tried it. Because I am not comfortable with jupyter yet, I only stopped at building phyloseq (also for others to use the shared Mac).

# in bash
# install brew first, check brew.sh
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# then install R and libgit2, either from jupyter or shell
brew install r
brew install libgit2 # this is for the R devtools package
# in Jupyter
%load_ext rpy2.ipython
%%R
if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools", repos = "https://cloud.r-project.org")} 
devtools::install_github("jbisanz/qiime2R", quiet = TRUE, upgrade = FALSE)
%%R
library(qiime2R)
%%R
physeq <- qza_to_phyloseq(
    features="table.qza",
    tree="rooted-tree.qza",
    taxonomy="taxonomy.qza",
    metadata = "metadata.tsv"
    )
%%R
saveRDS(physeq, file = "ps_mother.rds")

It runs without any error notice, but no output. Is this normal?

So it went well and we continued the analysis with RStudio by ps0 <- readRDS("ps_mother.rds") to do some filtering. After finishing, we export it back to make both new taxonomy2.(qza|qzv)s.

Only to realize that there is a small difference in our taxonomy files. We are missing the taxonomy level prefixes.

Filenames Different taxa names
taxonomy.qzv d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Rikenellaceae; g__Alistipes; s__uncultured_bacterium
taxonomy2.qzv d__Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Rikenellaceae;Alistipes;uncultured_bacterium

We then tried to investigate at what step does this change occurred by open ps_mother.rds first time in R.

ps0 <- readRDS("ps_mother.rds")
ps0 %>% tax_table() %>% data.frame() %>% filter(Species != "<NA>") %>% head(1)

                                     Kingdom         Phylum               Class         Order
f876bfbfd29d664d606243d9f95edf84 d__Bacteria Proteobacteria Alphaproteobacteria Rickettsiales
                                       Family        Genus              Species
f876bfbfd29d664d606243d9f95edf84 Mitochondria Mitochondria uncultured_bacterium

We found that it is already like this from the first phyloseq object. Do you perhaps know what is wrong here? We worried that other invisible changes might have also occurred.

Our current solution here is to just go back to our regular steps (i.e. build our phyloseq inside R), but we are curious about what happened here.

Any thoughts are appreciated.

Thank you very much.

R `sessionInfo()` R version 4.1.3 (2022-03-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rlang_1.0.2 ggrepel_0.9.1 biomformat_1.22.0
[4] RColorBrewer_1.1-3 ggstatsplot_0.9.3 cowplot_1.1.1
[7] decontam_1.12.0 speedyseq_0.5.3.9018 phyloseq_1.38.0
[10] qiime2R_0.99.6 forcats_0.5.1 stringr_1.4.0
[13] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2
[16] tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
[19] tidyverse_1.3.1 here_1.0.1

loaded via a namespace (and not attached):
[1] readxl_1.4.0 backports_1.4.1 Hmisc_4.7-0
[4] plyr_1.8.7 igraph_1.3.1 splines_4.1.3
[7] TH.data_1.1-1 GenomeInfoDb_1.30.1 digest_0.6.29
[10] foreach_1.5.2 htmltools_0.5.2 fansi_1.0.3
[13] magrittr_2.0.3 checkmate_2.1.0 paletteer_1.4.0
[16] cluster_2.1.3 tzdb_0.3.0 Biostrings_2.62.0
[19] modelr_0.1.8 vroom_1.5.7 sandwich_3.0-1
[22] jpeg_0.1-9 colorspace_2.0-3 rvest_1.0.2
[25] haven_2.5.0 xfun_0.31 crayon_1.5.1
[28] RCurl_1.98-1.6 jsonlite_1.8.0 zeallot_0.1.0
[31] survival_3.3-1 zoo_1.8-10 iterators_1.0.14
[34] ape_5.6-2 glue_1.6.2 gtable_0.3.0
[37] zlibbioc_1.40.0 emmeans_1.7.4-1 XVector_0.34.0
[40] statsExpressions_1.3.2 Rhdf5lib_1.16.0 BiocGenerics_0.40.0
[43] scales_1.2.0 mvtnorm_1.1-3 DBI_1.1.2
[46] Rcpp_1.0.8.3 performance_0.9.0 xtable_1.8-4
[49] htmlTable_2.4.0 bit_4.0.4 foreign_0.8-82
[52] Formula_1.2-4 stats4_4.1.3 DT_0.23
[55] truncnorm_1.0-8 datawizard_0.4.1 htmlwidgets_1.5.4
[58] httr_1.4.3 ellipsis_0.3.2 farver_2.1.0
[61] pkgconfig_2.0.3 NADA_1.6-1.1 nnet_7.3-17
[64] dbplyr_2.1.1 utf8_1.2.2 labeling_0.4.2
[67] tidyselect_1.1.2 reshape2_1.4.4 munsell_0.5.0
[70] cellranger_1.1.0 tools_4.1.3 cli_3.3.0
[73] generics_0.1.2 ade4_1.7-19 broom_0.8.0
[76] evaluate_0.15 fastmap_1.1.0 yaml_2.3.5
[79] rematch2_2.1.2 bit64_4.0.5 knitr_1.39
[82] fs_1.5.2 nlme_3.1-157 xml2_1.3.3
[85] correlation_0.8.1 compiler_4.1.3 rstudioapi_0.13
[88] png_0.1-7 zCompositions_1.4.0-1 reprex_2.0.1
[91] stringi_1.7.6 parameters_0.18.1 lattice_0.20-45
[94] Matrix_1.4-1 vegan_2.6-2 permute_0.9-7
[97] multtest_2.50.0 vctrs_0.4.1 pillar_1.7.0
[100] lifecycle_1.0.1 rhdf5filters_1.6.0 estimability_1.3
[103] data.table_1.14.2 bitops_1.0-7 insight_0.17.1
[106] patchwork_1.1.1 R6_2.5.1 latticeExtra_0.6-29
[109] gridExtra_2.3 IRanges_2.28.0 codetools_0.2-18
[112] MASS_7.3-57 assertthat_0.2.1 rhdf5_2.38.1
[115] rprojroot_2.0.3 withr_2.5.0 multcomp_1.4-19
[118] S4Vectors_0.32.4 GenomeInfoDbData_1.2.7 mgcv_1.8-40
[121] bayestestR_0.12.1 parallel_4.1.3 hms_1.1.1
[124] grid_4.1.3 rpart_4.1.16 coda_0.19-4
[127] rmarkdown_2.14 Biobase_2.54.0 lubridate_1.8.0
[130] base64enc_0.1-3

I'm not 100% sure what's going on here. I wonder if something is being lost during import.

For all the Phyloseq import functions, including qza_to_phyloseq(), all the input data is joined based on SampleIDs and FeatureIDs that overlap. This means that if some ASVs are missing from the tree, for example, then they would be dropped from the feature table and taxonomy table.

This 'inner join' during import silently drops all non-matching data!! :scream_cat:

I wonder if that's what happened here. Try importing these one at a time and check to see if all SampleIDs and FeatureIDs match.

1 Like

Hi @colinbrislawn,

Thank you for your response.

I think I know now why the taxon prefix is missing, it is because of parse_taxonomy() function inside the qza_to_phyloseq(). :sweat_smile:

However, to my understanding, this line in parse_taxonomy(), should also corrects leading d__ in taxonomy2.qza there. Is it something else?

I will try it.

1 Like

You are correct. As Mike reminded me:

there is a set of options within phyloseq to parse the taxonomic information. By default it expects GreenGenese labels: k__...; p__...; c__...; ..., this is why all of the labels but the d__ are missing.

This should not change featureIDs, so investigating the inner join is still a good idea!