QIIME 2 Meta-Analysis of Rhizosphere Microbiome with Mixed V3-V4/V4 Data: Best Practices for Downstream Analysis

Dear QIIME 2 Community,

I hope this message finds you well. We are conducting a meta-analysis of rhizosphere microbiome data from banana plants using QIIME 2 and are seeking guidance on our workflow and best practices for downstream analyses.

Project Overview:

  • Data Source: Raw reads were obtained from the NCBI SRA database.
  • Target Region: Our primary target region is the V3-V4 hypervariable region of the bacterial 16S rRNA gene.
  • Challenge: Due to the limited availability of Bioprojects on NCBI SRA, our dataset contains samples with varying target regions: some are V3-V4, and others are V4 only. Additionally, sequence lengths vary across samples.

Current Workflow:

  1. Data Import: Raw reads were imported into QIIME 2 as FASTQ artifacts. They are a mix of paired-end and single-end reads.
  2. Preprocessing:
  • Following recommendations from this forum, we decided to use only the forward reads.
  • Primer and adapter sequences were trimmed using cutadapt (via the q2-cutadapt plugin).
  • Denoising was performed using DADA2, resulting in a feature table and representative sequences.
  • All representative sequences were trimmed to a uniform length of 231 base pairs.
  • We have the primer information for each bioproject from their published articles; the studies used different primers.

Specific Questions and Concerns:

  1. Validity of Using Mixed V3-V4 and V4 Data:
  • Given the inherent differences in the amplified regions, is it statistically sound to combine V3-V4 and V4 datasets for alpha and beta diversity analyses?
  • Are there specific considerations or potential biases we should be aware of?
  • We have one bioproject that only covers the V4 region, so we would lose those samples if we analyzed only the V3-V4 bioprojects. Would it be possible to analyze only the V4 region, trimming the V3-V4 samples down to V4 as well?
  2. Uniform Sequence Length Trimming:
  • Trimming all sequences to 231 bp was done to ensure consistency. Is this the most appropriate approach, or are there alternative methods (e.g., using a minimum overlap during merging in DADA2) that might preserve more information?
  • What are the potential drawbacks of aggressively trimming to a short uniform length?
  3. Downstream Analysis Suitability:
  • We plan to perform the following downstream analyses:
    • Taxonomic classification with Greengenes2.
    • Possibly fragment-insertion as well, though we are still not sure how to incorporate this into the workflow, so any advice would help.
    • Alpha and beta diversity analyses to assess richness and community composition.
    • Differential abundance analysis for biomarker identification.
    • Co-occurrence network analysis.
    • Functional prediction (e.g., using PICRUSt2).
  • Given the potential biases from mixed target regions and uniform trimming, are these downstream analyses still reliable?
  • Are there specific plugins or parameters in QIIME 2 that we should pay close attention to for these analyses, especially given the mentioned data variations?
  4. Alternative Strategies:
  • Are there alternative preprocessing or analysis strategies you would recommend to mitigate the challenges posed by our dataset?
  • Is there a way to computationally correct for the difference in the amplified regions?
  5. Metadata Considerations:
  • Should the hypervariable target region be included as metadata to account for the variation in the statistical analysis?
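
For instance, we imagine a metadata file along these lines (tab-separated; the sample IDs and the second accession are hypothetical placeholders):

sample-id    bioproject     condition    target_region
#q2:types    categorical    categorical  categorical
sample-1     PRJNA827236    diseased     V3-V4
sample-2     PRJNA000000    healthy      V4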

We appreciate any insights, criticisms, and recommendations you can provide. We are eager to learn and ensure the robustness of our analysis.

Thank you for your time and expertise.

Sincerely,

John_Kim


Hello and welcome to the forum!

That is a lot of text and questions! I read it yesterday but only today found some time to dedicate to it. I hope that I will be able to clarify some things, at least partially.

In our group, we also performed a meta-analysis and ended up splitting the datasets by region, looking for similar trends in each region instead of merging them (V1-V2, V3-V4, V4).

Using only forward reads is a valid approach if the reads fail to merge during DADA2. If the merging stats look good, you can still use both reads to get longer sequences.

Now I understand why you decided to use only forward reads. But check whether the single-end reads are actually single-end - some researchers merge forward and reverse reads and upload them that way.

Make sure to use the same primers! Even if different sets of primers (for the same region) were used, you can create mock primers that target the same region and trim with cutadapt, so that sequences from the same region but amplified with different biological primers will match.
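
For a single-end study, that trimming could look roughly like this (the file names are placeholders, and the degenerate sequence is only an example of a mock forward primer, not a recommendation):

# trim the mock primer from the 5' end and drop reads where it is not found
qiime cutadapt trim-single \
--i-demultiplexed-sequences study1_demux.qza \
--p-front CCTACGGGNBGCASCA \
--p-discard-untrimmed \
--p-cores 10 \
--o-trimmed-sequences study1_trimmed.qza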

Did you run DADA2 separately for all studies/batches? Did you merge the different regions at this step?

As I mentioned above, we faced the same issue and solved it by using mock primers that matched all the studies pooled in one analysis by region.

The issue is that ASVs obtained from different regions are a priori different. Comparing them would be the same as comparing apples to bananas - they are just different, independent of other factors such as metadata and experimental design.

Even a single nucleotide difference differentiates ASVs. With different regions, there will be no shared ASVs, which will affect PCoA and statistical analyses.

In my opinion, you can:

  • Option 1. Exclude V4.
  • Option 2. If the V4 dataset contains the same groups as V3-V4 and includes enough samples, you can run analyses in parallel for each region and see if you get the same results.
  • Option 3. Run the regions independently of each other, assign taxonomy, collapse the tables to species level, merge the collapsed tables from the different regions, and then calculate non-phylogenetic metrics (see the sketch below). I am aware that species-level taxonomy annotations are not reliable with 16S data, but here the purpose is to convert unshared ASVs into shared features such as taxa. It should be noted, though, that different regions will produce different taxonomic profiles (we tested this with the same samples sequenced with different target regions).
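
A rough sketch of Option 3 (all file names are hypothetical):

# collapse each region's table to species level (level 7 in Greengenes-style taxonomy)
qiime taxa collapse \
--i-table v34_table.qza \
--i-taxonomy v34_taxonomy.qza \
--p-level 7 \
--o-collapsed-table v34_species.qza

qiime taxa collapse \
--i-table v4_table.qza \
--i-taxonomy v4_taxonomy.qza \
--p-level 7 \
--o-collapsed-table v4_species.qza

# merge the collapsed tables; sample IDs should not overlap between regions
qiime feature-table merge \
--i-tables v34_species.qza \
--i-tables v4_species.qza \
--o-merged-table species_level_merged.qza

Non-phylogenetic metrics can then be calculated on the merged species-level table.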

I would prefer the option with mock primers, as I described above. The problem is that if you have different primers targeting the same region (especially forward primers that end at different positions), such trimming will not work, since your ASVs will still differ.

As for computationally correcting for the different amplified regions: for me, it is pointless - it will not solve the issue of different primers.

Your planned downstream analyses look good in general, but the abovementioned issues should be addressed first.

They will be reliable if you account for the different regions in your analyses.

Regarding specific plugins or parameters, I don't have anything in mind, but maybe other mods will add something.

I am not sure about the results, but you may also consider trying:

  • merge paired reads with VSEARCH
  • construct mock primers that match both the V4 portion of the V3-V4 sequences and the V4-only sequences
  • trim with the mock primers
  • run Deblur instead of DADA2 (a rough sketch follows below).
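
A minimal sketch of that route, with placeholder file names (merge-pairs was called join-pairs in older QIIME 2 releases):

# merge forward and reverse reads with VSEARCH
qiime vsearch merge-pairs \
--i-demultiplexed-seqs study1_paired.qza \
--p-minovlen 20 \
--output-dir study1_merged/

# (trim the mock primers from the merged reads with cutadapt, as above)

# denoise with Deblur at a fixed trim length
# (400 is an example value - pick it from your quality profiles)
qiime deblur denoise-16S \
--i-demultiplexed-seqs study1_merged_trimmed.qza \
--p-trim-length 400 \
--p-sample-stats \
--o-table study1_table.qza \
--o-representative-sequences study1_rep_seqs.qza \
--o-stats study1_deblur_stats.qza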

I would prefer not to mix different regions, but to account for different primers targeting the same region by creating mock primers that match all sequences from a given region.

Why not include it? Keeping the target region as a metadata column lets you check and account for its effect in the statistical analysis.

The bottom line: In your shoes, I would:

  • Step 1. Decide whether to exclude V4 or not.
  • Step 2. Design mock primers for V3-V4 that match all datasets. The same goes for V4, if applicable.
  • Step 3. Trim with the mock primers (separately for each study).
  • Step 4. Run DADA2 (separately for each study) or Deblur (separately for each targeted region).
  • Step 5. Merge feature tables and representative sequences by targeted region (see the sketch below).

And then proceed to downstream analyses, separately for each region.
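
For Step 5, the merge itself could look like this (two hypothetical studies shown):

qiime feature-table merge \
--i-tables study1_table.qza \
--i-tables study2_table.qza \
--o-merged-table v34_merged_table.qza

qiime feature-table merge-seqs \
--i-data study1_rep_seqs.qza \
--i-data study2_rep_seqs.qza \
--o-merged-data v34_merged_rep_seqs.qza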

PS. I will leave it queued. To all, please feel free to add something or correct me if my recommendations are inappropriate.

Best,


Hi @timanix, I hope you're having a good day. First of all, thank you very much for taking the time to respond to each of my concerns. I greatly appreciate it. Now, let me share my team's thoughts.

In the past, our approach was to optimize "merging statistics." To achieve this, we used various trim and truncation parameters to improve merging and reduce chimeric sequences. However, we have since reviewed recommendations suggesting that using only forward reads is preferable when working with a mix of single-end and paired-end (forward) reads. Additionally, they recommended trimming all reads to the same length. Following this advice, we truncated our reads to 231 bp. Implementing these steps resulted in a significant reduction of merging percentage and an elevation of chimeric sequences (low non-chimeric percentage). Is this an acceptable trade-off?

We have also confirmed that the single reads obtained from NCBI SRA are indeed single-end reads.

To address the discrepancy in primers used by the sequence submitters, we created mock primers. We will now implement your suggestion using the following sequences:
Forward: 5'-CCTACGGGNBGCASCA-3'
Reverse: 5'-GRMYWMNVGGGTATCTAAT-3'

We also have a question regarding primer trimming with cutadapt. For the --p-front-r parameter, is it necessary to reverse-complement the reverse mock primer? Please verify; here is our command:
qiime cutadapt trim-paired \
--i-demultiplexed-sequences PRJNA827236_diseased.qza \
--p-front-f CCTACGGGNBGCASCA \
--p-front-r GRMYWMNVGGGTATCTAAT \
--p-discard-untrimmed \
--p-match-read-wildcards \
--p-cores 10 \
--o-trimmed-sequences trimmed/diseased/PRJNA827236_diseased_noncomp_trimmed-seq.qza \
--verbose

Yes, we first denoised each study individually using DADA2, and subsequently merged the feature tables and representative sequences.

Our current workflow involves the following steps:

  1. Separate trimming of reads using our curated mock primers.
  2. Separate denoising of reads to a uniform length, using only the forward reads for paired-end data.
  3. Merging all denoised forward reads and denoised single-end reads for the V3-V4 region.

We will update you on any improvements or problems we encounter. Please feel free, and we encourage the community, to share additional thoughts on this workflow. We deeply appreciate your and the community's assistance. Thank you, and stay safe.


Hello John,

I agree with Timur's detailed commentary on your pipeline!

I wanted to briefly comment on this part:

However, we have since reviewed recommendations suggesting that using only forward reads is preferable when working with a mix of single-end and paired-end (forward) reads.

I've heard this too. I think the idea is that it may be more consistent to use only the forward reads because it avoids issues with merging, so you can keep more reads. It's a tradeoff.

Additionally, they recommended trimming all reads to the same length.

Yeah, Qiita does this. Shorter reads mean less data for taxonomy, and similar ASVs may not differentiate themselves, but it's consistent and you can keep more reads. Same tradeoff.

Following this advice, we truncated our reads to 231 bp. Implementing these steps resulted in a significant reduction of merging percentage and an elevation of chimeric sequences (low non-chimeric percentage). Is this an acceptable trade-off?

Wait... isn't that the worst of both worlds?
The point was to get longer reads (by merging)
or
more reads per sample (by not merging).


Hi @colinbrislawn and @timanix ,

I hope this message finds you well. Our approach generally aims to keep more reads per sample by using only the forward reads (no merging in the denoising step).

UPDATE: We used mock primers with Cutadapt to trim the sequences.
Sample command:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences PRJNA827236_diseased.qza \
--p-front-f CCTACGGGNBGCASCAG \
--p-adapter-r CCTGSTGCVNCCCGTAGG \
--p-front-r GRMYWMNVGGGTATCTAAT \
--p-adapter-f ATTAGATACCCBNKWRKYC \
--p-discard-untrimmed \
--p-match-adapter-wildcards \
--p-cores 10 \
--o-trimmed-sequences trimmed/diseased/PRJNA827236_diseased_trimmed-seq.qza \
--verbose

For single-end reads: We did not trim the primers further because sequence inspection indicated they were already removed.

Subsequently, we denoised all samples using DADA2 with the following parameters:

qiime dada2 denoise-single \
--i-demultiplexed-seqs paired_reads.qza \
--p-trunc-len 232 \
--p-min-fold-parent-over-abundance 8 \
--p-n-threads 10 \
--o-table denoised/02-table.qza \
--o-representative-sequences denoised/02-rep-seqs.qza \
--o-denoising-stats denoised/02-denoising-stats.qza

We then merged the feature tables and representative sequences from the bioprojects, grouping samples to compare healthy versus diseased states.

The sequence statistics for these two groups are as follows:
all_merged_table.qzv (1.0 MB)
all_merged_seqs.qzv (1.7 MB)
However, we are concerned about the observed sample frequencies:

[screenshots of per-sample frequencies]

Is this level of frequency variation normal? In previous analyses, where we merged tables without confirming proper primer removal, we typically observed much higher frequencies across all samples. We noted that samples with frequencies exceeding 80,000 reads originated from a single bioproject that contained only forward reads.

Regards,
John_Kim

Hello!

I would check the number of reads before and after cutadapt to see how many reads were discarded due to undetected primers - you probably need to check the primers again or play with the mismatch rate.
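
For example, comparing demux summaries before and after trimming (the output names are placeholders):

qiime demux summarize \
--i-data PRJNA827236_diseased.qza \
--o-visualization before_cutadapt.qzv

qiime demux summarize \
--i-data trimmed/diseased/PRJNA827236_diseased_trimmed-seq.qza \
--o-visualization after_cutadapt.qzv

# if too many reads are lost, consider relaxing cutadapt's --p-error-rate (default 0.1)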

I advised designing primers that match all the reads from the dataset you are going to run together to make them comparable. Different primers, even when removed, will introduce artificial variability in your dataset, biasing your statistical analyses.

Best,


Hi @timanix and @colinbrislawn

The forward primers have been removed from the bioproject with single-end reads, resulting in forward reads approximately 400 base pairs in length. This is longer than the trimmed paired reads, which are around 250 base pairs. Upon inspection, the sequences from the forward-read-only project were found to align with the beginning of the trimmed paired reads, indicating successful primer removal.

We've decided to leave out the study with only V4.

We are still unsure whether we should truncate the studies with the same parameters when using denoise-single in DADA2, or truncate them independently of each other.

We suspect the single-orientation study was originally paired-end and then merged. Is there a workaround for this, or would we be better off excluding it from our analysis and proceeding with all of our studies as paired reads, merging them ourselves?

If merging stats aren’t looking good for us, how do you suggest we perform downstream analysis from there?

Regards,
John_Kim

I would still run DADA2 separately for each study, but with the same settings.
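
For example, something like this (the study accessions and paths are placeholders; the point is that --p-trunc-len and the other settings stay identical across studies):

# run DADA2 per study with identical settings (second accession is hypothetical)
for study in PRJNA827236 PRJNA000000; do
  qiime dada2 denoise-single \
    --i-demultiplexed-seqs ${study}_trimmed.qza \
    --p-trunc-len 232 \
    --p-n-threads 10 \
    --o-table denoised/${study}_table.qza \
    --o-representative-sequences denoised/${study}_rep_seqs.qza \
    --o-denoising-stats denoised/${study}_stats.qza
done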

You can check whether a study was merged from the quality plot (see figures 2 and 3 from this post) or from the original paper.

Check your DADA2 settings and try to improve the output.

Considering that you decided to work only with the V3-V4 region, I would choose between:
Option 1. Work only with paired reads. Run DADA2 for each study separately but with identical settings, and merge the outputs.
Option 2. Merge paired reads with VSEARCH (included in QIIME 2), then use Deblur instead of DADA2 (I think you can even run all the studies together with Deblur).
Option 3. Use only forward reads and trim them at the same position, including the datasets with single reads.

I like options 2 and 1. But check whether the dataset with long single reads was sequenced with Illumina. If not, check how it can be denoised - if it is incompatible with DADA2 and Deblur and you decide to keep it, then there will be option 4.

Option 4. Merge paired reads with VSEARCH, then proceed with VSEARCH for dereplication, de novo OTU clustering (at 100%, 99%, or 97% identity), and chimera removal (sketched below).
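
A rough sketch of Option 4 (file names are placeholders):

# join paired reads
qiime vsearch merge-pairs \
--i-demultiplexed-seqs study_paired.qza \
--output-dir merged/

# dereplicate the joined reads
qiime vsearch dereplicate-sequences \
--i-sequences merged/merged_sequences.qza \
--o-dereplicated-table derep_table.qza \
--o-dereplicated-sequences derep_seqs.qza

# de novo OTU clustering at 99% identity (or 1.0 / 0.97)
qiime vsearch cluster-features-de-novo \
--i-table derep_table.qza \
--i-sequences derep_seqs.qza \
--p-perc-identity 0.99 \
--o-clustered-table otu_table.qza \
--o-clustered-sequences otu_seqs.qza

# flag de novo chimeras, then drop them from the table
qiime vsearch uchime-denovo \
--i-table otu_table.qza \
--i-sequences otu_seqs.qza \
--o-chimeras chimeras.qza \
--o-nonchimeras nonchimeras.qza \
--o-stats uchime_stats.qza

qiime feature-table filter-features \
--i-table otu_table.qza \
--m-metadata-file nonchimeras.qza \
--o-filtered-table otu_table_nonchimeric.qza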

All mods who are reading (I don't want to spam with @): please join the discussion if you have better options in mind. I will leave it queued for a while.

Best,
Timur


Hello,

I hope this message finds you well.

Regarding this, we re-examined the paper and found that the study using long single reads actually employed paired-end reads, which the authors merged using FLASH. Additionally, all reads in our study were sequenced using Illumina. When we examined their interactive quality plot, it appeared as follows:

[screenshot of the interactive quality plot]

We'll update you as soon as possible with our progress. Thank you very much for guiding us!

Regards,
John_Kim


Hi @timanix

I hope this message finds you well. I would like to update you with our current progress and verify if we're on the right track:

  1. We designed mock primers and trimmed the sequences with them using cutadapt.
  2. We confirmed that the only dataset labeled "single-end" actually consisted of paired-end reads that had been merged. So, we also merged all of the paired-end reads using VSEARCH.
  3. After that, we consistently truncated the merged reads at position 404 and denoised them using Deblur instead of DADA2.
  4. Then, we merged the feature tables and representative sequences.
  5. We classified our representative sequences using the "non-V4 16S" Greengenes2 2024.09 classifier.
  6. Using our merged feature table, we filtered the Greengenes2 2024.09 reference tree (2024.09.phylogeny.id.nwk) to obtain a phylogenetic tree for further downstream analyses (roughly the commands shown below).
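
For reference, the tree-filtering step looked roughly like this (paths abbreviated; please double-check against your own files):

# import the Greengenes2 reference tree, then keep only the features in our merged table
qiime tools import \
--type 'Phylogeny[Rooted]' \
--input-path 2024.09.phylogeny.id.nwk \
--output-path gg2_2024_09_phylogeny.qza

qiime phylogeny filter-tree \
--i-tree gg2_2024_09_phylogeny.qza \
--i-table all_merged_table.qza \
--o-filtered-tree filtered_tree.qza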

Our upstream analyses end there, and we will proceed with the downstream analyses such as diversity analyses, differential abundance analysis, co-occurrence network analysis, etc.

Speaking of co-occurrence network analysis, can you recommend your preferred or the latest tools or packages for building and visualizing a co-occurrence network, using our feature table from the upstream analyses as input?

That's all for now. We would appreciate it if you have any comments and recommendations to further improve our study.

Regards,
John_Kim


Hi!

That sounds good. By following those steps, are you satisfied with the number of reads you retained for the analyses? Also, do the taxonomies assigned make sense to you? I would keep that output if both answers are "Yes".

If all reads from the different studies were trimmed with the same mock primers, merged with VSEARCH (except those already merged), and then denoised with Deblur, the ASVs should be directly comparable between the studies you selected. I would keep the study and original-primer information in the metadata file to check their effect on beta diversity. I would definitely check what the PCoA plots look like, and account for study/original-primer effects in the statistical analyses.
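
For the statistics, adonis lets you model the study effect alongside your variable of interest; for example (the metadata column names are hypothetical):

qiime diversity adonis \
--i-distance-matrix unweighted_unifrac_distance_matrix.qza \
--m-metadata-file metadata.tsv \
--p-formula "condition+study" \
--o-visualization adonis_condition_study.qzv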

To be honest, I am not a big fan of network analyses, so my experience with that is not sufficient to recommend the best tool and practices.
But you can check:

  • one recommended here
  • SCNIC. This one even has a QIIME 2 plugin, but it is outdated, so it is better to use the standalone version.
  • And here is a guide.

I tried SCNIC a while ago, and it worked for me.

I will leave the topic queued for other mods with more experience in network analyses to add to it.

Best,

I think SPIEC-EASI and FlashWeave are the most commonly used compositionally aware methods for microbiome network analysis. Both are standalone packages and are also accessible via this QIIME 2 plugin for network analysis:


Good day @timanix and @Nicholas_Bokulich,

I hope this message finds you well. I appreciate all of your suggestions and will be trying all of them. But before that, a question: before proceeding to the different downstream analyses (diversity analyses, microbiome composition analyses, etc.), do you prefer rarefaction as the normalization method to account for variability in sequencing depth and library size?

Regards,
John_Kim

Hi @John_Kim_Aligato
I still prefer rarefaction... There is a paper against it and another that, on the contrary, supports it - and probably many more if you look for them.
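
In practice, that just means choosing a sampling depth from your feature-table summary and either rarefying explicitly or letting core-metrics do it internally; for example (the depth and file names are placeholders):

qiime feature-table rarefy \
--i-table all_merged_table.qza \
--p-sampling-depth 10000 \
--o-rarefied-table rarefied_table.qza

# or rarefy as part of the core diversity metrics
qiime diversity core-metrics-phylogenetic \
--i-phylogeny filtered_tree.qza \
--i-table all_merged_table.qza \
--p-sampling-depth 10000 \
--m-metadata-file metadata.tsv \
--output-dir core_metrics/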

For beta diversity, I often use Aitchison distances from Gemelli, which does not require rarefaction. I had some issues with that plugin in the latest QIIME 2 versions, so I keep QIIME 2 2023.9 for Gemelli.
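
If it helps, the RPCA call in Gemelli is roughly the following (action and output names as I remember them from the Gemelli tutorials, so verify against your installed version; file names are placeholders):

qiime gemelli rpca \
--i-table all_merged_table.qza \
--o-biplot rpca_ordination.qza \
--o-distance-matrix rpca_distance.qza

qiime emperor biplot \
--i-biplot rpca_ordination.qza \
--m-sample-metadata-file metadata.tsv \
--o-visualization rpca_biplot.qzv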

Best,
