QIIME 2 Meta-Analysis of Rhizosphere Microbiome with Mixed V3-V4/V4 Data: Best Practices for Downstream Analysis

Dear QIIME 2 Community,

I hope this message finds you well. We are currently doing a meta-analysis of rhizosphere microbiome data from banana plants using QIIME 2 and are seeking guidance on our current workflow and best practices for downstream analyses.

Project Overview:

  • Data Source: Raw reads were obtained from the NCBI SRA database.
  • Target Region: Our primary target region is the V3-V4 hypervariable region of the bacterial 16S rRNA gene.
  • Challenge: Due to the limited availability of Bioprojects on NCBI SRA, our dataset contains samples with varying target regions: some are V3-V4, and others are V4 only. Additionally, sequence lengths vary across samples.

Current Workflow:

  1. Data Import: Raw reads were imported into QIIME 2 as FASTQ artifacts. They are a mix of paired-end and single-end reads.
  2. Preprocessing:
  • Following recommendations from this forum, we decided to use only the forward reads.
  • Primer and adapter sequences were trimmed using Cutadapt (the q2-cutadapt plugin).
  • Denoising was performed using DADA2, resulting in a feature table and representative sequences.
  • All representative sequences were trimmed to a uniform length of 231 base pairs.
  • We know from the published articles which primers each bioproject used, and the primer sets differ between projects.

Specific Questions and Concerns:

  1. Validity of Using Mixed V3-V4 and V4 Data:
  • Given the inherent differences in the amplified regions, is it statistically sound to combine V3-V4 and V4 datasets for alpha and beta diversity analyses?
  • Are there specific considerations or potential biases we should be aware of?
  • We have one bioproject that contains only the V4 region, so we would lose those samples if we analyzed only the V3-V4 bioprojects. Would it be valid to restrict the analysis to the V4 region and trim the V3-V4 samples down to V4 as well?
  2. Uniform Sequence Length Trimming:
  • Trimming all sequences to 231 bp was done to ensure consistency. Is this the most appropriate approach, or are there alternative methods (e.g., using a minimum overlap during merging in DADA2) that might preserve more information?
  • What are the potential drawbacks of aggressively trimming to a short uniform length?
  3. Downstream Analysis Suitability:
  • We plan to perform the following downstream analyses:
    • Taxonomic classification with Greengenes2.
    • Possibly fragment-insertion as well, though we are still not sure how to incorporate it into the workflow, so any advice would help (a rough sketch of what we have in mind follows this list).
    • Alpha and beta diversity analyses to assess richness and community composition.
    • Differential abundance analysis for biomarker identification.
    • Co-occurrence network analysis.
    • Functional prediction (e.g., using PICRUSt2).
  • Given the potential biases from mixed target regions and uniform trimming, are these downstream analyses still reliable?
  • Are there specific plugins or parameters in QIIME 2 that we should pay close attention to for these analyses, especially given the mentioned data variations?
  4. Alternative Strategies:
  • Are there alternative preprocessing or analysis strategies you would recommend to mitigate the challenges posed by our dataset?
  • Is there a way to computationally correct for the difference in the amplified regions?
  5. Metadata Consideration:
  • Should the targeted hypervariable region be included as a metadata column so that this variation can be accounted for in the statistical analysis?
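
To make the fragment-insertion question above (item 3) concrete, this is roughly the step we have been considering after denoising. It is only a minimal, untested sketch: the SEPP reference artifact (sepp-refs-gg-13-8.qza, from the QIIME 2 data resources) and the other file names are placeholders.

qiime fragment-insertion sepp \
--i-representative-sequences merged_rep-seqs.qza \
--i-reference-database sepp-refs-gg-13-8.qza \
--p-threads 10 \
--o-tree insertion-tree.qza \
--o-placements insertion-placements.qza

# keep only features that were actually placed on the insertion tree
qiime fragment-insertion filter-features \
--i-table merged_table.qza \
--i-tree insertion-tree.qza \
--o-filtered-table merged_table_sepp-filtered.qza \
--o-removed-table merged_table_sepp-removed.qza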

We appreciate any insights, criticisms, and recommendations you can provide. We are eager to learn and ensure the robustness of our analysis.

Thank you for your time and expertise.

Sincerely,

John_Kim

Hello and welcome to the forum!

That is a lot of text and questions! I read it yesterday but only today found some time to dedicate to it. I hope that I will be able to clarify some things, at least partially.

In our group we also performed a meta-analysis and ended up splitting the datasets by region (V1-V2, V3-V4, V4), looking for similar trends in each region instead of merging them.

Using only the forward reads is a valid recommendation if you fail to merge the reads during DADA2. If the merging stats look good, you can still use both reads to get longer sequences.

Now I understand why you decided to use only forward reads. But check whether the single-end reads are actually single-end - some researchers merge forward and reverse reads and upload them like that.

Make sure to use the same primers! Even if different sets of primers (for the same region) were used, you should create mock primers that target the same region, then use Cutadapt to trim with them, so that sequences from the same region but generated with different biological primers will match.
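
For single-end data, a minimal sketch of what I mean (the degenerate primer shown is only a placeholder, as are the file names; you would replace it with a mock primer designed to match all of your studies):

# trim a mock forward primer from single-end reads; reads without a match are discarded
qiime cutadapt trim-single \
--i-demultiplexed-sequences study1_single_demux.qza \
--p-front CCTACGGGNGGCWGCAG \
--p-match-read-wildcards \
--p-discard-untrimmed \
--p-cores 10 \
--o-trimmed-sequences study1_mock-trimmed.qza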

Did you run dada2 separately for all studies/batches? Did you merge different regions at this step?

As I mentioned above, we faced the same issue and solved it by using mock primers that matched all the studies pooled in one analysis by region.

The issue is that ASVs obtained from different regions are a priori different. Comparing them would be the same as comparing apples to bananas - they are just different, independent of other factors such as metadata and experimental design.

Even a single nucleotide difference differentiates ASVs. With different regions, there will be no shared ASVs, which will affect PCoA and statistical analyses.

In my opinion, you can:

  • Option 1. Exclude V4.
  • Option 2. If the V4 dataset contains the same groups as V3-V4 and includes enough samples, you can run analyses in parallel for each region and see if you get the same results.
  • Option 3. Run both regions independently of each other, assign taxonomy, collapse the tables to species level, merge the tables from the different regions, and then calculate non-phylogenetic metrics. I am aware that species-level taxonomy annotations are not reliable with 16S data, but here your purpose is to convert unshared ASVs into shared features (taxonomic groups). It should be noted, though, that different regions will still produce different taxonomy profiles (we tested this with the same samples sequenced with different target regions). A rough sketch of this option follows the list.
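
A minimal sketch of Option 3, assuming a Greengenes2-trained classifier and one table per region; the classifier, sampling depth, and file names are placeholders to adapt to your data:

# assign taxonomy to the V3-V4 representative sequences (repeat for V4)
qiime feature-classifier classify-sklearn \
--i-classifier gg2-classifier.qza \
--i-reads v3v4_rep-seqs.qza \
--o-classification v3v4_taxonomy.qza

# collapse the V3-V4 table to species level (level 7)
qiime taxa collapse \
--i-table v3v4_table.qza \
--i-taxonomy v3v4_taxonomy.qza \
--p-level 7 \
--o-collapsed-table v3v4_table_species.qza

# merge the collapsed per-region tables and compute non-phylogenetic metrics
qiime feature-table merge \
--i-tables v3v4_table_species.qza \
--i-tables v4_table_species.qza \
--o-merged-table all_regions_table_species.qza

qiime diversity core-metrics \
--i-table all_regions_table_species.qza \
--p-sampling-depth 10000 \
--m-metadata-file metadata.tsv \
--output-dir core-metrics-collapsed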

I would prefer the option with mock primers, as I described above. The problem is that if you have different primers (especially forward primers with different ending positions) targeting the same region, such trimming will not work, since your ASVs will still be different.

As for trimming all sequences to a uniform 231 bp: for me, it is pointless - it will not solve the issue of different primers.

Your planned downstream analyses look good in general, but the above-mentioned issues should be addressed first.

They will be reliable if you account for the different regions in your analyses.

Regarding specific plugins or parameters to watch, I don't have anything in mind, but maybe other mods will add something.

I am not sure about the results, but you may also consider trying the following (roughly sketched after the list):

  • merge paired reads with VSEARCH
  • construct mock primers that will match both the V4 part of the V3-V4 sequences and the V4 sequences
  • trim with the mock primers
  • run Deblur instead of DADA2.
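
Roughly, that could look like the commands below (file names and the trim length are placeholders; note that the VSEARCH action is called merge-pairs in recent QIIME 2 releases and join-pairs in older ones):

# merge forward and reverse reads with VSEARCH
qiime vsearch merge-pairs \
--i-demultiplexed-seqs paired_demux.qza \
--output-dir merged_reads

# after trimming the mock primers from the merged reads with cutadapt (as above),
# denoise everything to a fixed length with Deblur
qiime deblur denoise-16S \
--i-demultiplexed-seqs merged_mock-trimmed.qza \
--p-trim-length 231 \
--p-sample-stats \
--o-table deblur-table.qza \
--o-representative-sequences deblur-rep-seqs.qza \
--o-stats deblur-stats.qza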

I would prefer not to mix different regions, but rather to account for different primers targeting the same region by creating mock primers that match all sequences from a given region.

Why not include it?
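
For example, the targeted region could then be included as a term in a PERMANOVA model through q2-diversity's adonis action. A minimal sketch, assuming hypothetical metadata columns named region and condition:

qiime diversity adonis \
--i-distance-matrix bray_curtis_distance_matrix.qza \
--m-metadata-file metadata.tsv \
--p-formula "region+condition" \
--o-visualization adonis_region_condition.qzv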

The bottom line: In your shoes, I would:

  • Step 1. Decide whether or not to exclude V4.
  • Step 2. Design mock primers for V3-V4 that would match all datasets. The same goes for V4, if applicable.
  • Step 3. Trim with the mock primers (separately for each study).
  • Step 4. Run DADA2 (separately for each study) or Deblur (separately for each targeted region).
  • Step 5. Merge feature tables and representative sequences by targeted region (sketched below).

And then proceed to downstream analyses, separately for each region.
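
A minimal sketch of Step 5 for two hypothetical studies from the same region (file names are placeholders):

# merge the per-study feature tables and representative sequences for one region
qiime feature-table merge \
--i-tables studyA_v3v4_table.qza \
--i-tables studyB_v3v4_table.qza \
--o-merged-table v3v4_merged_table.qza

qiime feature-table merge-seqs \
--i-data studyA_v3v4_rep-seqs.qza \
--i-data studyB_v3v4_rep-seqs.qza \
--o-merged-data v3v4_merged_rep-seqs.qza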

PS. I will leave it queued. To all, please feel free to add something or correct me if my recommendations are inappropriate.

Best,

Hi @timanix, I hope you're having a good day. First of all, thank you very much for taking the time to respond to each of my concerns. I greatly appreciate it. Now, let me share my team's thoughts.

In the past, our approach was to optimize "merging statistics." To achieve this, we used various trim and truncation parameters to improve merging and reduce chimeric sequences. However, we have since reviewed recommendations suggesting that using only forward reads is preferable when working with a mix of single-end and paired-end (forward) reads. Additionally, they recommended trimming all reads to the same length. Following this advice, we truncated our reads to 231 bp. Implementing these steps resulted in a significant reduction of merging percentage and an elevation of chimeric sequences (low non-chimeric percentage). Is this an acceptable trade-off?

We have also confirmed that the single reads obtained from NCBI SRA are indeed single-end reads.

To address the discrepancy in primers used by the sequence submitters, we created mock primers. We will now implement your suggestion using the following sequences:
Forward: 5'-CCTACGGGNBGCASCA-3'
Reverse: 5'-GRMYWMNVGGGTATCTAAT-3'

We also have a question regarding primer trimming with Cutadapt. For the --p-front-r parameter, is it necessary to reverse complement the reverse mock primer? Please verify; here is our command:
qiime cutadapt trim-paired \
--i-demultiplexed-sequences PRJNA827236_diseased.qza \
--p-front-f CCTACGGGNBGCASCA \
--p-front-r GRMYWMNVGGGTATCTAAT \
--p-discard-untrimmed \
--p-match-read-wildcards \
--p-cores 10 \
--o-trimmed-sequences trimmed/diseased/PRJNA827236_diseased_noncomp_trimmed-seq.qza \
--verbose

Yes, we first denoised each study individually using DADA2 and subsequently merged the feature tables and representative sequences.

Our current workflow involves the following steps:

  1. Separate trimming of reads using our curated mock primers.
  2. Separate denoising of reads to a uniform length, using only the forward reads for paired-end data.
  3. Merging the denoised outputs (feature tables and representative sequences) from the forward reads and the single-end reads for the V3-V4 region.

We will update you on any improvements or problems we encounter, and we encourage you and the rest of the community to share additional thoughts on this workflow. We deeply appreciate your and the community's assistance. Thank you, and stay safe.

Hello John,

I agree with Timur's detailed commentary on your pipeline!

I wanted to briefly comment on this part:

However, we have since reviewed recommendations suggesting that using only forward reads is preferable when working with a mix of single-end and paired-end (forward) reads.

I've heard this too. I think the idea is that it may be more consistent to use only forward reads because it avoids issues with merging, so you can keep more reads. It's a tradeoff.

Additionally, they recommended trimming all reads to the same length.

Yeah, Qiita does this. Shorter reads mean less data for taxonomy, and similar ASVs may not be differentiated from one another, but it's consistent and you can keep more reads. Same tradeoff.

Following this advice, we truncated our reads to 231 bp. Implementing these steps resulted in a significant reduction of merging percentage and an elevation of chimeric sequences (low non-chimeric percentage). Is this an acceptable trade-off?

Wait... isn't that the worst of both worlds?
The point was to get longer reads (by merging)
or
more reads per sample (by not merging).

Hi @colinbrislawn and @timanix ,

I hope this message finds you well. Our approach generally aims to keep more reads per sample by using only the forward reads (no merging in the denoising step).

UPDATE: We used mock primers with Cutadapt to trim the sequences.
Sample command:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences PRJNA827236_diseased.qza \
--p-front-f CCTACGGGNBGCASCAG \
--p-adapter-r CCTGSTGCVNCCCGTAGG \
--p-front-r GRMYWMNVGGGTATCTAAT \
--p-adapter-f ATTAGATACCCBNKWRKYC \
--p-discard-untrimmed \
--p-match-adapter-wildcards \
--p-cores 10 \
--o-trimmed-sequences trimmed/diseased/PRJNA827236_diseased_trimmed-seq.qza \
--verbose

For single-end reads: We did not trim the primers further because sequence inspection indicated they were already removed.

Subsequently, we denoised all samples using DADA2 with the following parameters:

qiime dada2 denoise-single \
--i-demultiplexed-seqs paired_reads.qza \
--p-trunc-len 232 \
--p-min-fold-parent-over-abundance 8 \
--p-n-threads 10 \
--o-table denoised/02-table.qza \
--o-representative-sequences denoised/02-rep-seqs.qza \
--o-denoising-stats denoised/02-denoising-stats.qza

We then merged the feature tables and representative sequences from the bioprojects, grouping samples to compare healthy versus diseased states.

The sequence statistics for these two groups are as follows:
all_merged_table.qzv (1.0 MB)
all_merged_seqs.qzv (1.7 MB)
However, we are concerned about the observed sample frequencies:

[per-sample frequency screenshots from the merged feature-table summary]

Is this level of frequency variation normal? In previous analyses, where we merged tables without confirming proper primer removal, we typically observed much higher frequencies across all samples. We noted that samples with frequencies exceeding 80,000 reads originated from a single bioproject that contained only forward reads.

Regards,
John_Kim