Need help in in silco PCR from WGS

Dear Qiime2 community.

I am a complete novice to the field of Bioinformatics and have recently managed to analyze bacterial diversity in sea surface aerosols using 16S amplicon sequencing with Qiime2, mostly thanks to valuable and useful information that this community constantly provides (seriously, many thanks for that!).

I am now looking to employ microbial source tracking to identify the contribution of various sources to the airborne diversity observed in my study. However, these kind of software typically requires reference communities (also known as “sources”), and unfortunately, we could not collect water samples during the sea aerosols sampling campaign.

I am considering using data from the TARA ocean sampling campaign, particularly the whole genome sequencing data from Sunagawa et al., 2015 (https://doi.org/10.1126/science.1261359). My concern however, lies in the compatibility of this whole genome sequencing data with my amplicon sequencing data (V3-V4 regions). Is there any possibly to adapt or “translate” these WGS to match the 16S V3-V4 for comparative analysis? I thought about in silco PCR on WGS using the primers from my amplicon sequencing, but I am unsure how to proceed nor if this is even possible.

Alternatively, would it be more practical and relevant to directly use amplicon sequencing data from a different 16S region, such as the V4-V5 region found in Lang-Yona et al., (Terrestrial and marine influence on atmospheric bacterial diversity over the north Atlantic and Pacific Oceans | Communications Earth & Environment)?

I would greatly appreciate your insights on this matter, as well as any other suggestions that could assist in resolving my “source tracking” issue.

Thank you in advance for your guidance.

Best,

F

Hey @FloRos,

welcome to the forum!

This is a "partial" answer to the first half of your question about simulating PCR in silico. We have put together a little QIIME 2 plugin to do exactly that - you can find it under q2-exonerate. You should be able to install it in pretty much any QIIME 2 environment - just follow the instructions in the README. You will also find there a usage example - here, just two things concerning the inputs to keep in mind:

  • your template sequences need to be provided as a single artifact of FeatureData[Sequence] type
  • all the information about your PCR needs to be provided in another experiment artifact - see the README for details.

:warning: Note: please be aware that this plugin has not been extensively tested. If you encounter any issues/bugs/unexpected behaviour please do reach out!

Hope that helps a bit :slight_smile:

Cheers,
Michal

4 Likes

Hello @misialq,

Thank you very much for your suggestion. I will try it as soon as I can. I will be sure to update you on my progress and reach out if I encounter any issues.

Best regards,

F

Hi @FloRos,

Integrating WGS and 16S data, for source tracking, has as far as I know not been demonstrated. One challenge is having a common feature space to represent the differing data. The framework provided by Greengenes2 may provide a means here, as it can be used to normalize the features used.

What I recommend specifically is:

  1. Map the short read shotgun data against Web of Life 2 (WoL2); publication. We've only attempted bowtie2 with the SHOGUN parameter set (expressed here) but I would assume the results are reasonably robust to aligners. We use WoL2 as that is the genome set which backs Greengenes2
  2. Filter the resulting feature table against Greengenes2 to omit genomes in WoL2 which lack 16S using the filter-features action of q2-greengenes2
  3. Map the 16S V3-4 data against Greengenes2 using the non-v4-16s action
  4. Collapse both the shotgun data and 16S data to either species or genus using q2-taxa

More information on using Greengenes2 can be found in this tutorial.

At this point, the feature identifiers within both tables will use a common namespace -- the Greengenes2 taxonomy. These tables can then be merged, and subsequent analysis (or sourcetracking performed). We performed a variety of integration assessments including with environmental data, although we did not explicitly attempt source tracking. The methods text (and actual code) may be if interest. However, there are a few important caveats to be aware of which I don't have exact guidance on:

  • Short read mapping to genomes has high false positive. Removal based on coverage, and/or filtering low relative abundance / low prevalence features may be important for some downstream analysis
  • The dynamic range of the 16S and WGS read counts will likely be quite different. If an analysis assumes rarefaction, it may be best to rarefy each table separately, unit normalize, and then merge. Analyses which do not require rarefaction may still be sensitive to library depth
  • As a general rule of thumb, the genomic reference databases are weak for environmental samples (see e.g., read recruitment in the EMP 500).

I wasn't entirely clear from the description if you have or are planning on obtaining V4 16S data. That matters in so much as more precise phylogenetic coordinates can be obtained currently, particularly with the EMP 16S primers, but I don't know if that has an impact on model performance or biological conclusions.

There are some incredible advances in formal sourcetracking techniques as well, like STENSL, if you aren't already aware.

A word of caution on in silico PCR and 16S: these regions are notoriously difficult to assemble. Extracting variable regions from assembled WGS data may exhibit unusual fragments.

Really curious to hear how this analysis goes!

All the best,
Daniel

Hey @wasade,

Thanks a lot for the detailed instructions! I've been kinda stuck trying to figure out the best way to handle the metagenome assembly (mostly given my inexperience in the topic), and to be honest, I wasn’t really sure about the outputs after the in silico PCR. You really cleared up my confusion about it.

About your question, I've got my aerosol samples sequenced using an Illumina platform, targeting the V3-V4 region (341F-806R). The data were processed through a standard 16S rRNA gene diversity analysis pipeline using DADA2, followed by taxonomic classification using a pre-trained Naive Bayes classifiers from the SILVA 138 SSU_NR99 reference database.

I'm looking forward to trying out your suggestions. I'll definitely circle back with some feedback as soon as I've given it a go!

Thanks again!

Regards,

F

1 Like