Correlation analysis problem

Hello,
I have been working through qiime2, and with the help of this forum, I think Ive got most of it down, so thank you! I do have some specific questions about the gneiss tutorial.

what Im trying to do now is 1) limit the OTUs in a specific way, for example, OTUs that are only found in 50% of the samples, I was unable to find a tutorial that said how to do this.
Once I have that done, I would like to perform correlation analysis, and identify species that are found together, with stat significance. I have gone through gneiss, but was unable to generate this type of data, speciifcally, is this possible??.. does that make sense?

Thank you so much for your help!

Scott

Hi Scott,

I’m working on performing a co-occurrence analysis myself, I have not reached the definitive answers to your questions.
On your question 1, I’d like to add one consideration. How many features you have in total? Currently I’m playing with two different ways of filtering data: a) using a Python script to set the filtering criteria, it outputs a otu table; b) Using Bokulich filtering approach, aka discard any features less than 0.005% of total frequency, I use qiime1 script:
filter_otus_from_otu_table.py -i otu_table.biom -o otu_table.flt.biom --min_count_fraction 0.00005

For the analysis itself, I’m working with SpiecEasi (https://github.com/zdk123/SpiecEasi/blob/master/R/fitdistr.R) in R.
A recent alternative as q2 plug in is:
q2-SCNIC: A tool for making correlation networks, finding modules of observations and summarizing them

(There is a filtering plug in as well included if you want to use their filtering settings).

Now, I’d like to add my concern on the filtering, with the hope that @Nicholas_Bokulich or @michael.shaffer my help us. When you filter the dataset, on the assumption you use a method considering the compositional nature of the data for the analysis as SpiecEasi or SCNIC above, you change the total sum of count per sample. My understanding is that this therefore put a bias in the dataset. Is this correct? Do any of you can suggest a method to filter the data with your own criteria but keeping the total of count per sample unchanged?

Luca

Check out filter-features, which has a min-samples parameter. Just what you need!

Try out the q2-SCNIC plugin that @llenzi linked to. Sounds like just what you are after.

filter-samples will replicate this functionality in QIIME 2, but you must know the total frequency in your feature table (e.g., use qiime feature-table summarize to easily figure that out). Currently there is not a min_count_fraction parameter.

That's a very interesting question. I do not know the answer, but will say that filtering out features is fairly standard practice for any kind of co-occurrence analysis. This includes all sorts of statistical methods for compositional analysis (e.g., ANCOM), so I am guessing that this is statistically admissible and does not bias the results since the counts of other features are not disturbed (we are working with raw abundances here, not relative abundances). @mortonjt @michael.shaffer do you have any thoughts on this?

thank you both so much for the info. I will try this out today!!

hello again,

I was able to use the filter samples without issue, thank you!!
I am not able to install SCNIC though, although Im very excited to use it,

I did what was said…
conda install -c conda-forge blas=1.1
then
conda install -c lozuponelab q2-SCNIC

but I am getting this error …

ackagesNotFoundError: The following packages are not available from current channels:

  • q2-scnic
  • scnic[version=’>=0.5.3’]

Current channels:

Do I need to have the q2-SCNIC folder somewhere specific? sorry for this, I realise its super basic stuff, but I just dont have this skillset, thank you!!

Scott

Hey @swillyb,

Thanks for trying out q2-SCNIC! The issue here is that you don’t have bioconda in your channel list and SCNIC is available through the bioconda channel. You can add bioconda to your channel list using:

conda config --append channels bioconda

And then your install command should work.

Let me know if you have any other issues!

Mike

Hi Mike,

Thank you for the fast reply, much appreciated!

I added bioconda successfully, and now i see this when I run conda install -c lozuponelab q2-SCNIC…

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  • q2-scnic
  • scnic[version=’>=0.5.3’]
  • fastspar
  • armadillo[version=’>=6.7’]
  • q2-scnic
  • scnic[version=’>=0.5.3’]
  • fastspar
  • gsl=1.16
  • q2-scnic
  • scnic[version=’>=0.5.3’]
  • fastspar
  • openmp[version=’>=4.0’]

Current channels:

and when I checked the install of bioconda package, that is I tried to install again as I thought maybe it didnt install correctly, it says… ‘bioconda’ already in ‘channels’ list, moving to the bottom… this suggests to me that its been added to my list pf packages…right??

Thanks for the help!!

@llenzi and @Nicholas_Bokulich

For the effects of filtering on compositionality. I agree it is an interesting question. For SCNIC I am using the SparCC correlation metric by default and so the filtering I provide is naively using the filtering parameters that were used in the original SparCC manuscript. I know that SparCC is using a Dirichlet multinomial distribution in order to estimate true relative abundances and I think this process at least tries to account for this. (I’d recommend digging into the Estimation of component fractions section of the original SparCC paper, it’s pretty interesting (https://doi.org/10.1371/journal.pcbi.1002687)). Ideally the features that are being removed should be so lowly abundant that they do not affect this estimation step.

Hello again @swillyb,

Let’s try adding one more channel. Do these two commands:

conda config --add channels conda-forge
conda config --add channels defaults

And then try the conda install again.

This worked! thank you so much!!!

I am now having an issue with the filtering (of course I am), I am able to run the filtering command... something along these lines

qiime feature-table filter-features
--i-table table.qza
--p-min-samples 2
--p-max-samples 18
--m-metadata-file sample-metadata.tsv
--o-filtered-table table-filtered2max18-sample.qza

however when I generate a bar plot from this, there is no data there. I have tried many iterations of this, with out a max, with minimum samples as 1, just to see, and no matter wgat I do, I get a bar plot with nothing ...for example...

taxa-bar-plots-table-no-Rhizobium-no-mito-filtered5-freq.qzv (316.2 KB)
taxa-bar-plots-table-filtered1nomax-feature.qzv (312.1 KB)
taxa-bar-plots-table-filtered2max18-feature.qzv (312.1 KB)

am I running this command incorrectly? Again, thank you so much for all your help!! you guys are the best!

Scott

Maybe you should feature-table summarize the post-filtered table, to see what the filtering is doing to the table.

this is great advice, thank you!

When I look at my unfiltered table.qzv, it says that features are only observed in 1 or 2 samples, so clearly if I was filtering things not in 5 everything would go away, and it did. However, when I filter everything observed in at least 1, just to check how the filtering command is working, everything also goes away, which confuses me.

I suppose Im also confused about how a feature is determined. I know from my bar plots that there are many species found in all samples, or in 10 samples, for example, are these species not features? Im very confused by this, sorry for the basic question.

And lastly, how do I convert the output of q2-SCNIC, the corr_net.qza or the membership.qza, fpr cytoscape.

Sorry for all the questions, and thank you so much for the help!

Scott

If you filter out any feature observed in at least one sample, that would remove everything! Because if a feature is observed in 0 samples, well, it would not exist in your feature table (and current default behavior for the filtering commands is to drop samples/features with 0 frequency after filtering).

Great question — see this post. Feature can mean any kind of observation in a feature table... in your case I assume feature = sequence variant, but it can also mean taxa if you are using a feature table where taxa are the observations.

No clue — you should consult the cytoscape docs for more details, but maybe @michael.shaffer knows?

You can use:

qiime tools export \
  --input-path corr_net.qza \
  --output-path corr_net

And in the corr_net directory will be a file called network.gml which can be use an an input for cytoscape.

1 Like

this is all very helpful! thank you so much!!

If the features are sequence variants, then based on my bar plots I should have many features in each sample, not just 1 or 2… is that correct?

Is there a way to specifically build a table that uses sequence variants (or genera or species, something like this)

Thanks for the help!!

The bar plots collapse feature variants based on taxonomic affiliation — so the bar plots have no relation to the total # of SVs.

A feature table of SVs (or OTUs, depending on how you processed your data) is what will be produced by default by dada2, deblur, or OTU clustering methods. You can use qiime taxa collapse to collapse these features based on taxonomic affiliation, producing a new feature table where the features are unique taxonomic groups at whatever taxonomic rank (level) you select.

Good luck!