installation conflict: QIIME 2 v2024.5 and q2-greengenes2

Hi @Nicholas_Bokulich @wasade,
Following this notification, I installed the 2024.5 version of QIIME 2, which includes the patched version of RESCRIPt 5.1. However, I am now unable to install the q2-greengenes2 plugin into this version of QIIME 2, and the following error message is generated:

[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for scikit-bio
Failed to build scikit-bio
ERROR: Could not build wheels for scikit-bio, which is required to install pyproject.toml-based projects

Please check the attached greengenes2_installation log for more details. No greengenes2 plugin appears in the attached plugin list either.
greengenes2_installation.txt (10.5 KB)
qiime2-amplicon-2024.5_list.txt (155 Bytes)

However, the greengenes2 plugin installs successfully under the 2024.2 version of QIIME 2 (the one with the affected RESCRIPt); list log attached.
qiime2-amplicon-2024.2_list.txt (220 Bytes)

Thanks and best,
D_S

Hi @D_S,

Thank you for the ping. I don't see an obvious explanation in the log file. I just performed a fresh installation of qiime2-amplicon-2024.5 on macOS, followed by pip install q2-greengenes2, and it appeared to work. I'm currently attempting a Linux build, but the initial install of qiime2-amplicon-2024.5 failed, I suspect due to a dirty conda cache. I will try again after the cleaning process completes.

In the meantime, it looks like QIIME 2 2024.5 pins the iow library to version 1.0.5. That version does not support scikit-bio 0.6.0, and I suspect that is what's causing the downgrade in the build process and the subsequent attempt to build a wheel for scikit-bio.
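
If it helps, here is a minimal sketch for listing what is actually installed in the environment. It uses only the Python standard library; the distribution names iow and scikit-bio are the ones mentioned above, and the helper name installed_versions is just illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Run inside the activated qiime2-amplicon-2024.5 environment.
    for name, ver in installed_versions(["iow", "scikit-bio"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it before and after pip install q2-greengenes2 should show whether either package changed version.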

@Nicholas_Bokulich, can the pin on iow be relaxed?

Best,
Daniel

Hi @D_S,

I was able to install successfully on one of our Linux hosts (CentOS 7.9). The default system compiler, GCC 4.8.5, was unable to compile the libraries, but using GCC 9.3.0 worked fine.

Is there further output from pip? I was anticipating seeing information from compilation. The issue with GCC 4.8.5, which is quite old, is that it doesn't directly support OpenMP, whereas later versions of GCC do. If you do not have the log files, could you run the following commands and provide their output?

$ which gcc
$ gcc --version
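
The same check can be done from Python, which is handy if you want to paste a single block of output. This is only a sketch: compiler_report is a made-up helper name, and it assumes the compiler prints its version on the first line of its --version output.

```python
import shutil
import subprocess


def compiler_report(compiler="gcc"):
    """Return (path, first line of --version output), or (None, None) if not on PATH."""
    path = shutil.which(compiler)
    if path is None:
        return None, None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    lines = (result.stdout or result.stderr).splitlines()
    return path, (lines[0] if lines else "")


if __name__ == "__main__":
    path, first_line = compiler_report("gcc")
    if path is None:
        print("gcc was not found on PATH")
    else:
        print(path)
        print(first_line)
```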

Best,
Daniel

Hi @wasade,

Thanks a lot for the replies.

No, that was all from pip. So, I tried:

pip install q2-greengenes2 --verbose

This flag generated more information about the installation error, which can be seen in the attached verbose log file:
greengenes2_installation_verbose.txt (12.7 KB)

Interestingly, gcc --version initially reported that GCC could not be found and suggested installing it with:

sudo apt install gcc

I ran the above command, which successfully installed GCC. After this, I tried pip install q2-greengenes2 again, and this time it installed successfully. Here are the relevant logs and output:
greengenes2_installation_verbose_aftergcc.txt (92.2 KB)
qiime2-amplicon-2024.5_list_aftergcc.txt (222 Bytes)

Below are the outputs from the gcc commands:

Does everything seem fine to you?

Best,
D_S

That's great! And would certainly explain it -- the installation triggered a compilation, which implicitly assumes the presence of GCC.

Best,
Daniel

Hi @wasade,

Thank you for the clarification and your help!

Best,
D_S


Hi @wasade,

Back again with another query.
I am trying to run the following commands after DADA2 (16S V3-V4 region, Illumina 300x2, 341F/785R):

qiime rescript orient-seqs --i-sequences rep-denoise-trimmed-seqs.qza --i-reference-sequences 2022.10.backbone.full-length.fna.qza --o-oriented-seqs oriented-rep-denoise-trimmed-seqs.qza --o-unmatched-seqs unmatched-rep-denoise-trimmed-seqs.qza --p-threads 36

qiime feature-table merge-seqs --i-data oriented-rep-denoise-trimmed-seqs.qza --i-data unmatched-rep-denoise-trimmed-seqs.qza --o-merged-data rescript-rep-denoise-trimmed-seqs.qza

qiime feature-classifier classify-sklearn --i-reads rescript-rep-denoise-trimmed-seqs.qza --i-classifier 2022.10.backbone.full-length.nb.qza --o-classification sklrean-rescript-rep-denoise-trimmed-seqs.tax.qza --p-n-jobs 10

Plugin error from feature-classifier:

The scikit-learn version (0.24.1) used to generate this artifact does not match the current version of scikit-learn installed (1.4.2). Please retrain your classifier for your current deployment to prevent data-corruption errors.

Debug info has been saved to /tmp/qiime2-q2cli-err-7zxl3qzv.log
qiime2-q2cli-err-7zxl3qzv.txt (1.9 KB)
What could I do to remedy this?

Next, I tried the Greengenes2 plugin as shown below, and it seems to be working fine.

qiime greengenes2 non-v4-16s --i-table table-denoise-trimmed-seqs.qza --i-sequences rescript-rep-denoise-trimmed-seqs.qza --i-backbone 2022.10.backbone.full-length.fna.qza --o-mapped-table icu.gg2.biom.qza --o-representatives icu.gg2.fna.qza

qiime greengenes2 taxonomy-from-table --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza --i-table icu.gg2.biom.qza --o-classification icu.gg2.taxonomy.qza

qiime metadata tabulate --m-input-file icu.gg2.taxonomy.qza --m-input-file icu.gg2.fna.qza --o-visualization gg2-before-filter-seqs.tax.qzv

qiime taxa barplot --i-table icu.gg2.biom.qza --i-taxonomy icu.gg2.taxonomy.qza --m-metadata-file 16S-seqs-metadata.tsv --o-visualization gg2-before-filter-vis-bar.qzv

qiime phylogeny align-to-tree-mafft-fasttree --i-sequences icu.gg2.fna.qza --o-alignment gg2-aligned-rep-trimmed-seqs.qza --o-masked-alignment gg2-masked-aligned-rep-trimmed-seqs.qza --o-tree gg2-unrooted-trimmed-tree.qza --o-rooted-tree gg2-rooted-trimmed-tree.qza

qiime diversity core-metrics-phylogenetic --i-phylogeny gg2-rooted-trimmed-tree.qza --i-table icu.gg2.biom.qza --m-metadata-file 16S-seqs-metadata.tsv --output-dir core-metrics-results

However, I noticed two specific differences with SILVA:

  1. All of the sequences were classified under bacteria for Greengenes2, unlike SILVA, which also assigned a very tiny percentage under _unassigned and _eukaryota.
  2. SILVA taxonomy still had a much higher number of reads in each sample, even after filtering non-bacterial sequences (see attachment below).
    SILVA vs. Greengenes2.txt (6.5 KB)

Do you think there are any problems with the Greengenes2 plugin commands I'm following, or does everything seem fine to you?

Best,
D_S

Hi @D_S,

QIIME 2 recently changed the version of scikit-learn it depends on. The new Naive Bayes classifiers can be found on the resources page.
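
As a quick sanity check before running classify-sklearn, something like the sketch below could compare a classifier's version tag with the installed scikit-learn. It assumes the classifier file name carries a tag of the form sklearn-X.Y.Z, which is a naming assumption rather than something QIIME 2 guarantees; sklearn_tag and matches_installed are hypothetical helper names.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def sklearn_tag(filename):
    """Extract the version from a 'sklearn-X.Y.Z' tag in a file name, if present."""
    match = re.search(r"sklearn-(\d+(?:\.\d+)*)", filename)
    return match.group(1) if match else None


def matches_installed(filename, installed=None):
    """True if the tagged version equals the installed scikit-learn version."""
    if installed is None:
        try:
            installed = version("scikit-learn")
        except PackageNotFoundError:
            return False
    return sklearn_tag(filename) == installed


if __name__ == "__main__":
    # Hypothetical file name; substitute the classifier you downloaded.
    name = "example-classifier.nb.sklearn-1.4.2"
    print(sklearn_tag(name), matches_installed(name))
```

An untagged file name yields no version, so the check reports a mismatch rather than guessing.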

If using non-v4-16s, the phylogenomic Greengenes2 phylogeny can be used rather than re-estimating a tree from ASVs, which is known to yield poor-quality trees.

How were the ASVs mapped to SILVA?

Best,
Daniel

Hi @wasade,

Thanks again for your help.
The new gg2 nb classifier from the resources worked flawlessly with the qiime feature-classifier classify-sklearn command. I was able to follow the subsequent commands easily from the output files generated until running qiime diversity core-metrics-phylogenetic. At this point, I encountered the following error:
Plugin error from diversity:
module 'skbio.diversity.alpha' has no attribute 'sobs'
Debug info has been saved to /tmp/qiime2-q2cli-err-t2yrma6i.log
qiime2-q2cli-err-t2yrma6i.txt (3.5 KB)

I noticed an active thread (still unresolved) in the forum addressing this issue. I followed the suggestions but couldn't resolve it. Initially, I was working on qiime2-amplicon-2024.5, but just for the sake of it, tried the command with the same input files on qiime2-amplicon-2024.2 and it worked!!! Please check the screenshot below.

Regarding your recommendation on non-v4-16s, I understand that the phylogenomic Greengenes2 phylogeny can be used rather than re-estimating a tree from ASVs. However, I still have the same query: when using QIIME 2 (qiime feature-classifier classify-sklearn) with either SILVA or Greengenes2 (2022.10.backbone.full-length.nb.sklearn-1.4.2), I observe a significantly higher number of reads per sample, even after filtering out non-bacterial/unassigned sequences, than with gg2-non-v4-16s (see the attachment for comparison).
SILVA vs. Greengenes2.txt (8.7 KB)

Regarding your question, "How were the ASVs mapped to SILVA?":
To classify the ASVs after DADA2, I used the classify-sklearn method with a SILVA Naive Bayes classifier. The command was:
qiime feature-classifier classify-sklearn --i-reads rescript-rep-denoise-trimmed-seqs.qza --i-classifier classifier.qza --o-classification rescript-rep-denoise-trimmed-seqs.tax.qza

Thanks again for your help, and apologies if I am repeating myself on some points. Still new to many aspects here and trying to understand things better to get a clearer view of what I'm doing.

Best,
D_S

Hi @D_S,
I am just popping in to address this question. I'll leave the rest for the expert @wasade :lab_coat:

This user on the forum talks about this: [Moving Picture Tutorial] Issues when running the "core-metrics-phylogeny" pipeline in "diversity" plugin - #23 by Iyarit.

Seems like installing the greengenes2 plugin might downgrade scikit-bio, causing this issue.
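
One way to check whether an environment is affected, sketched with the standard library only. The module and attribute names come from the error message earlier in this thread; module_has is a hypothetical helper.

```python
import importlib


def module_has(module_name, attribute):
    """Return True/False for attribute presence, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attribute)


if __name__ == "__main__":
    # False here would be consistent with the "no attribute 'sobs'" error.
    print(module_has("skbio.diversity.alpha", "sobs"))
```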

Hope this helps!

Hi @D_S

Thanks for the follow up! There are some unexpected dependency requirements in 2024.5 being discussed internally, and it looks like it may depend on an old version of scikit-bio.

There are a lot of technical differences between the two approaches. The most comparable point would be to perform closed-reference clustering against SILVA using the q2-vsearch plugin at the same level of identity as performed with Greengenes2, and then compare. That would have the effect of applying a similar type of filter to the data.

Even then, a difference in the number of sequences per sample doesn't mean the sample-sample relationships are altered appreciably, or that different biological conclusions are necessarily drawn. I would anticipate that SILVA would have somewhat better recovery for marine environments, but I also anticipate SILVA to have more noise, as its input data constraints are more relaxed than those of Greengenes2. Rather than focusing on the number of sequences, though, I would advise accounting for the compositional nature of the data. For example, the log ratio of two phyla common in your samples is probably highly correlated between SILVA and Greengenes2. Or, if you plot the relative abundance of the same taxa with SILVA on one axis and Greengenes2 on the other, I would guess the correlation would be quite good. Note, though, that mapping lineages between reference databases is unfortunately tedious.
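
To make the log-ratio suggestion concrete, here is a rough sketch in plain Python. The counts are invented for illustration and are not taken from the attached comparison; the pseudocount of 1 (to avoid taking the log of zero) and the helper names are assumptions too.

```python
import math


def log_ratio(count_a, count_b, pseudocount=1.0):
    """Natural log of (a + pseudocount) / (b + pseudocount) for one sample."""
    return math.log((count_a + pseudocount) / (count_b + pseudocount))


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-sample counts of two phyla (A, B); one tuple per sample.
silva = [(1200, 300), (800, 800), (150, 900), (2000, 100)]
gg2 = [(600, 140), (390, 410), (80, 430), (950, 60)]

silva_lr = [log_ratio(a, b) for a, b in silva]
gg2_lr = [log_ratio(a, b) for a, b in gg2]
print(round(pearson(silva_lr, gg2_lr), 3))
```

In these toy numbers the Greengenes2 totals are roughly half the SILVA totals, yet the per-sample log ratios still track each other closely, which is the point about compositional data.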

Best,
Daniel

To close out the installation issue: it appears that for 2024.5, it is necessary to install cython first, then q2-greengenes2 with a constraint on the version of scikit-bio. Doing so avoids any downgrades of existing packages.

$ conda install "cython<1.0"
$ pip install q2-greengenes2 "scikit-bio>=0.6.0"
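
To confirm afterwards that nothing was downgraded, the version reported by pip show scikit-bio can be compared against the 0.6.0 minimum. A minimal sketch, assuming plain numeric versions; release_tuple and at_least are hypothetical helpers, and pre-release suffixes such as rc1 are simply ignored, which is a simplification.

```python
import re


def release_tuple(version_string):
    """Leading numeric release of a version string, e.g. '0.6.0' -> (0, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"no numeric release in {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def _padded(release, width):
    """Right-pad a release tuple with zeros so '0.6' compares equal to '0.6.0'."""
    return release + (0,) * (width - len(release))


def at_least(installed, minimum):
    """True if the installed release is greater than or equal to the minimum."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    return _padded(a, width) >= _padded(b, width)


if __name__ == "__main__":
    # Replace the first argument with the version that pip show reports.
    print(at_least("0.6.0", "0.6.0"))
```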