Need help with in silico PCR from WGS

Hi @FloRos,

Integrating WGS and 16S data for source tracking has, as far as I know, not been demonstrated. One challenge is having a common feature space to represent the two data types. The Greengenes2 framework may offer a way forward, as it can be used to normalize the features used.

What I recommend specifically is:

  1. Map the short-read shotgun data against Web of Life 2 (WoL2); publication. We've only attempted bowtie2 with the SHOGUN parameter set (expressed here), but I would assume the results are reasonably robust to the choice of aligner. We use WoL2 because it is the genome set that backs Greengenes2
  2. Using the filter-features action of q2-greengenes2, filter the resulting feature table against Greengenes2 to omit WoL2 genomes that lack 16S
  3. Map the 16S V3-4 data against Greengenes2 using the non-v4-16s action
  4. Collapse both the shotgun data and 16S data to either species or genus using q2-taxa

More information on using Greengenes2 can be found in this tutorial.
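As a rough illustration of step 4 and of combining the two tables, here is a minimal pandas sketch. It is a stand-in for what q2-taxa does, not the plugin itself. The feature IDs, sample IDs, counts, and lineage strings are made up for the example, and `collapse_to_level` is a hypothetical helper.

```python
import pandas as pd


def collapse_to_level(table: pd.DataFrame, taxonomy: pd.Series, level: int) -> pd.DataFrame:
    """Sum the counts of features that share the first `level` taxonomy ranks.

    table: samples x features counts; taxonomy: feature ID -> lineage string.
    With seven-rank lineages, level=6 is genus and level=7 is species.
    """
    lineages = taxonomy.reindex(table.columns)
    if lineages.isna().any():
        raise ValueError("some features have no taxonomy assignment")
    prefixes = lineages.map(
        lambda lineage: "; ".join(rank.strip() for rank in lineage.split(";")[:level])
    )
    # Group the feature axis by the truncated lineage and sum within each group.
    return table.T.groupby(prefixes.values).sum().T


# Illustrative lineages only (assumed format, not copied from the database).
ESCH = "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia"
BACT = "d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides"
LACT = "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus"

# Toy shotgun table: features are genome IDs.
shotgun = pd.DataFrame(
    {"G1": [10, 0], "G2": [5, 3], "G3": [0, 7]}, index=["wgs_s1", "wgs_s2"]
)
shotgun_tax = pd.Series({
    "G1": ESCH + "; s__Escherichia coli",
    "G2": ESCH + "; s__Escherichia fergusonii",
    "G3": BACT + "; s__Bacteroides fragilis",
})

# Toy 16S table: features are ASV IDs.
amplicon = pd.DataFrame(
    {"asv1": [20, 1], "asv2": [4, 30], "asv3": [2, 0]}, index=["16s_s1", "16s_s2"]
)
amplicon_tax = pd.Series({
    "asv1": ESCH + "; s__Escherichia coli",
    "asv2": BACT + "; s__",
    "asv3": LACT + "; s__",
})

shotgun_genus = collapse_to_level(shotgun, shotgun_tax, level=6)
amplicon_genus = collapse_to_level(amplicon, amplicon_tax, level=6)

# Both tables now share genus-level lineage strings as feature IDs, so they
# can be stacked; a genus absent from one data type is filled with zero.
merged = pd.concat([shotgun_genus, amplicon_genus]).fillna(0).astype(int)
print(merged)
```

Stacking with a zero fill treats a genus missing from one data type as absent there, which may or may not be appropriate depending on the downstream analysis.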

At this point, the feature identifiers within both tables will use a common namespace -- the Greengenes2 taxonomy. These tables can then be merged, and subsequent analysis (or source tracking) performed. We performed a variety of integration assessments, including with environmental data, although we did not explicitly attempt source tracking. The methods text (and actual code) may be of interest. However, there are a few important caveats to be aware of, for which I don't have exact guidance:

  • Short-read mapping to genomes has a high false positive rate. Removal based on coverage, and/or filtering low relative abundance / low prevalence features, may be important for some downstream analyses
  • The dynamic range of the 16S and WGS read counts will likely be quite different. If an analysis assumes rarefaction, it may be best to rarefy each table separately, unit normalize, and then merge. Analyses which do not require rarefaction may still be sensitive to library depth
  • As a general rule of thumb, the genomic reference databases are weak for environmental samples (see e.g., read recruitment in the EMP 500).
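The first two caveats can be sketched roughly as follows: filter low-prevalence and low-abundance features, rarefy each table separately, unit normalize, then merge. This is a minimal sketch under stated assumptions, not a recommended protocol. The thresholds, rarefaction depths, sample IDs, and counts are all hypothetical, and the helper functions are illustrative rather than part of any plugin.

```python
import numpy as np
import pandas as pd


def filter_features(table, min_prevalence, min_rel_abundance):
    """Keep features seen in enough samples and reaching a minimum relative
    abundance in at least one sample (both thresholds are assumptions)."""
    rel = table.div(table.sum(axis=1), axis=0)
    prevalent = (table > 0).mean(axis=0) >= min_prevalence
    abundant = rel.max(axis=0) >= min_rel_abundance
    return table.loc[:, prevalent & abundant]


def rarefy(table, depth, seed=0):
    """Subsample each sample to `depth` reads without replacement,
    dropping samples that have fewer than `depth` reads."""
    rng = np.random.default_rng(seed)
    kept = table[table.sum(axis=1) >= depth]
    rows = [
        rng.multivariate_hypergeometric(row.to_numpy(dtype=np.int64), depth)
        for _, row in kept.iterrows()
    ]
    return pd.DataFrame(rows, index=kept.index, columns=kept.columns)


def unit_normalize(table):
    """Convert counts to per-sample proportions that sum to 1."""
    return table.div(table.sum(axis=1), axis=0)


# Toy genus-level tables (samples x features); the shotgun samples are much deeper.
shotgun = pd.DataFrame(
    {"g__Escherichia": [9000, 4000], "g__Bacteroides": [1000, 6000], "g__Spurious": [1, 0]},
    index=["wgs_s1", "wgs_s2"],
)
amplicon = pd.DataFrame(
    {"g__Escherichia": [300, 50, 20], "g__Bacteroides": [200, 450, 10]},
    index=["16s_s1", "16s_s2", "16s_s3"],
)

# Drop likely false positives from the shotgun table (hypothetical thresholds).
shotgun_filtered = filter_features(shotgun, min_prevalence=0.5, min_rel_abundance=1e-3)

# Rarefy each table separately, at a depth suited to that data type.
shotgun_rare = rarefy(shotgun_filtered, depth=5000)
amplicon_rare = rarefy(amplicon, depth=400)

# Unit normalize, then merge on the shared feature namespace.
merged = pd.concat(
    [unit_normalize(shotgun_rare), unit_normalize(amplicon_rare)]
).fillna(0.0)
print(merged)
```

Here `g__Spurious` is removed by the abundance threshold and `16s_s3` is dropped because it falls below the 16S rarefaction depth. Coverage-based removal, also mentioned above, is not shown since it needs per-genome alignment coverage rather than a count table.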

It wasn't entirely clear to me from the description whether you have, or are planning on obtaining, V4 16S data. That matters insofar as more precise phylogenetic coordinates can currently be obtained, particularly with the EMP 16S primers, but I don't know if that has an impact on model performance or biological conclusions.

There are some incredible advances in formal source tracking techniques as well, like STENSL, if you aren't already aware.

A word of caution on in silico PCR and 16S: these regions are notoriously difficult to assemble. Extracting variable regions from assembled WGS data may yield unusual fragments.
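If you do go the in silico PCR route, one simple guard is to flag extracted fragments whose length falls outside what you expect for the region. Below is a minimal sketch of that idea, assuming exact matching of IUPAC-degenerate primers on a single strand. The primers, contigs, and length bounds are toy values for illustration, not real 16S primers, and `in_silico_pcr` is a hypothetical helper.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer.upper())


def in_silico_pcr(contig, forward, reverse, min_len, max_len):
    """Return (insert, length_ok) for each forward primer hit that is followed
    by a reverse primer site. Only the given strand is searched, and primers
    must match exactly (no mismatches allowed)."""
    contig = contig.upper()
    fwd = re.compile(primer_regex(forward))
    # On the searched strand, the reverse primer appears as its reverse complement.
    rev = re.compile(primer_regex(reverse_complement(reverse)))
    amplicons = []
    for hit in fwd.finditer(contig):
        site = rev.search(contig, hit.end())
        if site is None:
            continue
        insert = contig[hit.end():site.start()]
        amplicons.append((insert, min_len <= len(insert) <= max_len))
    return amplicons


# Toy primers and contigs (hypothetical sequences).
FORWARD = "GGATCCW"
REVERSE = "CTCGAGR"
good_contig = "TTT" + "GGATCCA" + "ACGTACGTAC" + "TCTCGAG" + "AAA"
odd_contig = "TTT" + "GGATCCT" + "AC" + "CCTCGAG" + "AAA"  # suspiciously short insert

good = in_silico_pcr(good_contig, FORWARD, REVERSE, min_len=5, max_len=20)
odd = in_silico_pcr(odd_contig, FORWARD, REVERSE, min_len=5, max_len=20)
print(good, odd)
```

A length check only catches gross problems such as truncated or chimeric assemblies; a fragment of plausible length can still be misassembled.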

Really curious to hear how this analysis goes!

All the best,
Daniel