Analyzing whole-metagenomic shotgun sequencing data with V4 amplicon sequence data

Is it possible to analyze whole-metagenomic shotgun sequencing data with amplicon sequence data of the V4 region?

I am doing my thesis and I have two cohorts who have V4 amplicon sequence. So it was quite easy to merge them to analyze together. But, now that I have another cohort with whole-metagenomic shotgun sequencing data, I'm confused if I could use this data along with the previous two.

And if it is possible in the qiime2 pipeline? how could I make these whole-metagenomic shotgun sequencing data?

1 Like

Hi @turtle,

You may find these papers relevant:

There's a Greengenes2 plug-in for QIIME 2, you can learn more here:

I personally haven't used it (yet!) but it might be starting place.


1 Like

Please can you just tell me how could I use Woltka? Because they just told me to pre-process the data first. I just downloaded whole genome sequence data from the SRA database using the fondue plugin. But what I'm seeing in the Woltka tutorial they have used a .qza file against a taxonomy.qza file.

qiime woltka classify \
  --i-alignment align_sam.qza \
  --p-target-rank genus \
  --i-reference-taxonomy taxonomy.qza \
  --o-classified-table table.qza

But I don't have that file yet. Then how could I use Woltka to pre-process my shotgun sequence data?

1 Like

Hi @turtle,

I don't have a cookbook. Partially becasue while this is published, it doesn't mean its straightforward. Partially becasue as I said, I have't done this yet. And partially because there is never a "just" in microbiome bioinformatics and the answer is always dependent on your data and your goals.

When I look at the q2-Woltka tutorial, they link to a taxonomy and tree file. So, that might be somewhere to go. But, I think first you need to figure out the align_sam bit.




Let me tell you the whole scenario. I've downloaded sequence data from the SRA database with the accession number PRJNA672260 to test the q2-greengenes plugins. These sequences are 16S rRNA V3-V4 regions. So first I have run the following code using deblur to produce 150nt long sequences.

qiime quality-filter q-score \
  --i-demux paired_reads.qza \
  --o-filtered-sequences demux-filtered_PRJNA672260.qza \
  --o-filter-stats demux-filter-stats_PRJNA672260.qza
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered_PRJNA672260.qza \
  --p-trim-length 150 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs-deblur_PRJNA672260_150.qza \
  --o-table table-deblur_PRJNA672260_150.qza \
  --o-stats deblur-stats_PRJNA672260_150.qza

Here is the files if you want to look into it,
deblur-stats_PRJNA672260_150.qza (33.4 KB)
rep-seqs-deblur_PRJNA672260_150.qza (102.5 KB)
table-deblur_PRJNA672260_150.qza (162.3 KB)
and their visualization files,
table-deblur_PRJNA672260_150.qzv (507.9 KB)
paired_reads.qzv (328.1 KB)
metadata.qzv (1.2 MB)

Then I considered table-deblur_PRJNA672260_150.qza file for the next q2-greengenes2 pipeline. Then I tried to follow the following instructions and write my code as follows,

qiime greengenes2 filter-features \
  --i-feature-table table-deblur_PRJNA672260_150.qza \
  --i-reference 2022.10.taxonomy.asv.nwk.qza \
  --o-filtered-feature-table table-deblur_PRJNA672260_150_gg2.qza
qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
  --i-table table-deblur_PRJNA672260_150_gg2.qza \
  --o-classification table-deblur_PRJNA672260_150_sequence_gg2.qza

But this shows an error report like this,

Traceback (most recent call last):
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/", line 478, in call
results = self._execute_action(
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/", line 539, in _execute_action
results = action(**arguments)
File "", line 2, in taxonomy_from_table
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/", line 342, in bound_callable
outputs = self.callable_executor(
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/", line 566, in callable_executor
output_views = self._callable(**view_args)
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_gg2/", line 778, in taxonomy_from_table
tree = _load_tree_and_cache(open(str(reference_taxonomy)), features)
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_gg2/", line 543, in _load_tree_and_cache
tree = tree.shear(names & features)
File "bp/_bp.pyx", line 758, in bp._bp.BP.shear
File "bp/_bp.pyx", line 800, in bp._bp.BP.shear
ValueError: No requested tips found

what I've managed to learn is that after the referencing with 2022.10.taxonomy.asv.nwk.qza file the resulting table-deblur_PRJNA672260_150_sequence_gg2.qza file lists all the samples in it.
table-deblur_PRJNA672260_150_gg2.qzv (394.5 KB)

So, can you now suggest me any solution? or why I'm having this much trouble? If you can't help me is there anyone with an understanding of this problem?

1 Like

Hi @turtle,

The processing of your 16S and metagenomic data are two seperate pipelines.

For your 16S data, you need to preprocess:

  • Remove primers
  • Quality filter
  • Denoise

You have the potential issue that Greengenes2 asssumes your 16S data was amplified specifcially with 515F primers and performs taxonomic classification based on that (V4). In your initial post, you said the data was V4 and not V34, so the standard Greengenes pipeline makes sense.
Because you start at a different primer position, the assumptions don't hold, and a simple deblurring and trimming way or may not work if you want to perfrom Greengenes classification. Your simplest best will be to do Greengenes for non-V4 amplicons.

The metagenomic data has to be processed seperate because, unfortunately, its a seperate data type.



Thanks, @jwdebelius. Specifically, the non-v4-16s action for closed reference to the backbone. Taxonomy classification can also be performed using the full length Naive Bayes model


1 Like