Analyzing whole-metagenomic shotgun sequencing data with V4 amplicon sequence data

turtle · June 3, 2024, 4:12pm

Is it possible to analyze whole-metagenomic shotgun sequencing data with amplicon sequence data of the V4 region?

I am doing my thesis and I have two cohorts who have V4 amplicon sequence. So it was quite easy to merge them to analyze together. But, now that I have another cohort with whole-metagenomic shotgun sequencing data, I'm confused if I could use this data along with the previous two.

And if it is possible in the qiime2 pipeline? how could I make these whole-metagenomic shotgun sequencing data?

jwdebelius · June 3, 2024, 4:41pm

Hi @turtle,

You may find these papers relevant:

Usyk et al, 2024: Comprehensive evaluation of shotgun metagenomics, amplicon sequencing, and harmonization of these platforms for epidemiological studies
McDonald et al, 2024, Greengenes2 unifies microbial data in a single reference tree

There's a Greengenes2 plug-in for QIIME 2, you can learn more here:

I personally haven't used it (yet!) but it might be starting place.

Best,
Justine

turtle · June 3, 2024, 5:43pm

Please can you just tell me how could I use Woltka? Because they just told me to pre-process the data first. I just downloaded whole genome sequence data from the SRA database using the fondue plugin. But what I'm seeing in the Woltka tutorial they have used a .qza file against a taxonomy.qza file.

qiime woltka classify \
  --i-alignment align_sam.qza \
  --p-target-rank genus \
  --i-reference-taxonomy taxonomy.qza \
  --o-classified-table table.qza

But I don't have that file yet. Then how could I use Woltka to pre-process my shotgun sequence data?

jwdebelius · June 3, 2024, 6:25pm

Hi @turtle,

I don't have a cookbook. Partially becasue while this is published, it doesn't mean its straightforward. Partially becasue as I said, I have't done this yet. And partially because there is never a "just" in microbiome bioinformatics and the answer is always dependent on your data and your goals.

When I look at the q2-Woltka tutorial, they link to a taxonomy and tree file. So, that might be somewhere to go. But, I think first you need to figure out the align_sam bit.

Best,
Justine

turtle · June 5, 2024, 11:23am

@jwdebelius

Let me tell you the whole scenario. I've downloaded sequence data from the SRA database with the accession number PRJNA672260 to test the q2-greengenes plugins. These sequences are 16S rRNA V3-V4 regions. So first I have run the following code using deblur to produce 150nt long sequences.

qiime quality-filter q-score \
  --i-demux paired_reads.qza \
  --o-filtered-sequences demux-filtered_PRJNA672260.qza \
  --o-filter-stats demux-filter-stats_PRJNA672260.qza
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered_PRJNA672260.qza \
  --p-trim-length 150 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs-deblur_PRJNA672260_150.qza \
  --o-table table-deblur_PRJNA672260_150.qza \
  --o-stats deblur-stats_PRJNA672260_150.qza

Here is the files if you want to look into it,
deblur-stats_PRJNA672260_150.qza (33.4 KB)
rep-seqs-deblur_PRJNA672260_150.qza (102.5 KB)
table-deblur_PRJNA672260_150.qza (162.3 KB)
and their visualization files,
table-deblur_PRJNA672260_150.qzv (507.9 KB)
paired_reads.qzv (328.1 KB)
metadata.qzv (1.2 MB)

Then I considered table-deblur_PRJNA672260_150.qza file for the next q2-greengenes2 pipeline. Then I tried to follow the following instructions and write my code as follows,

Introducing Greengenes2 2022.10

Great! Now we have a handful of shotgun samples. How can we get the taxonomy out? First, we will filter-features like we did above, as a small set of WoLr2 records are not in Greengenes2. Then, we will use taxonomy-from-table. These commands below assume the wgets for the Greengenes2 data, noted in the above command blocks, have been run:
$ qiime greengenes2 filter-features \
>     --i-feature-table woltka.example.biom.qza \
>     --i-reference 2022.10.taxonomy.asv.nwk.qza \
>     --o-filtered-feature-table woltka_gg2.example.biom.qza
Saved FeatureTable[Frequency] to: woltka_gg2.example.biom.qza
$ qiime greengenes2 taxonomy-from-table \
>     --i-reference-taxonomy 2022.10.taxon
>     --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
>     --i-table woltka_gg2.example.biom.qza \
>     --o-classification woltka_gg2.example.taxonomy.qza
Saved FeatureData[Taxonomy] to: woltka_gg2.example.taxonomy.qza
...and that's it!

qiime greengenes2 filter-features \
  --i-feature-table table-deblur_PRJNA672260_150.qza \
  --i-reference 2022.10.taxonomy.asv.nwk.qza \
  --o-filtered-feature-table table-deblur_PRJNA672260_150_gg2.qza
qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
  --i-table table-deblur_PRJNA672260_150_gg2.qza \
  --o-classification table-deblur_PRJNA672260_150_sequence_gg2.qza

But this shows an error report like this,

Traceback (most recent call last):
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/commands.py", line 478, in call
results = self._execute_action(
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/commands.py", line 539, in _execute_action
results = action(**arguments)
File "", line 2, in taxonomy_from_table
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
outputs = self.callable_executor(
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in callable_executor
output_views = self._callable(**view_args)
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_gg2/_methods.py", line 778, in taxonomy_from_table
tree = _load_tree_and_cache(open(str(reference_taxonomy)), features)
File "/home/turtle/miniforge3/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_gg2/_methods.py", line 543, in _load_tree_and_cache
tree = tree.shear(names & features)
File "bp/_bp.pyx", line 758, in bp._bp.BP.shear
File "bp/_bp.pyx", line 800, in bp._bp.BP.shear
ValueError: No requested tips found

what I've managed to learn is that after the referencing with 2022.10.taxonomy.asv.nwk.qza file the resulting table-deblur_PRJNA672260_150_sequence_gg2.qza file lists all the samples in it.
table-deblur_PRJNA672260_150_gg2.qzv (394.5 KB)

So, can you now suggest me any solution? or why I'm having this much trouble? If you can't help me is there anyone with an understanding of this problem?

jwdebelius · June 5, 2024, 1:05pm

Hi @turtle,

The processing of your 16S and metagenomic data are two seperate pipelines.

For your 16S data, you need to preprocess:

Remove primers
Quality filter
Denoise

You have the potential issue that Greengenes2 asssumes your 16S data was amplified specifcially with 515F primers and performs taxonomic classification based on that (V4). In your initial post, you said the data was V4 and not V34, so the standard Greengenes pipeline makes sense.
Because you start at a different primer position, the assumptions don't hold, and a simple deblurring and trimming way or may not work if you want to perfrom Greengenes classification. Your simplest best will be to do Greengenes for non-V4 amplicons.

The metagenomic data has to be processed seperate because, unfortunately, its a seperate data type.

Best,
Justine

wasade · June 5, 2024, 3:30pm

Thanks, @jwdebelius. Specifically, the non-v4-16s action for closed reference to the backbone. Taxonomy classification can also be performed using the full length Naive Bayes model

Best,
Daniel

system · July 6, 2024, 9:30pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.