Gemelli argmin error

jwdebelius · January 5, 2022, 10:51pm

Hi Gemelli team (probably @cmartino),

I've uncovered a new and interesting error! I had a table and metadata file which were absolutely able to generate an ordination a month ago. But, I had to update my computer thanks to some institutional IT policies. When I ran the command in my 2020.6 environment, I got an error:

tf_res = \
    q2_gemelli.ctf(table=table_q2,
                   sample_metadata=meta_q2,
                   individual_id_column='host_subject_id',
                   state_column='gradient',
                   # min_feature_count=100,
                   )

And here's my error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1628521/3140904666.py in <module>
      1 ctf_res = \
----> 2     q2_gemelli.ctf(table=table_q2,
      3                    sample_metadata=meta_q2,
      4                    individual_id_column='host_subject_id',
      5                    state_column='tissue_num',

<decorator-gen-559> in ctf(table, sample_metadata, individual_id_column, state_column, n_components, min_sample_count, min_feature_count, max_iterations_als, max_iterations_rptm, n_initializations, feature_metadata)

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py in bound_callable(*args, **kwargs)
    243 
    244                 # Execute
--> 245                 outputs = self._callable_executor_(scope, callable_args,
    246                                                    output_types, provenance)
    247 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py in _callable_executor_(self, scope, view_args, output_types, provenance)
    389 
    390     def _callable_executor_(self, scope, view_args, output_types, provenance):
--> 391         output_views = self._callable(**view_args)
    392         output_views = tuplize(output_views)
    393 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/ctf.py in ctf(table, sample_metadata, individual_id_column, state_column, n_components, min_sample_count, min_feature_count, max_iterations_als, max_iterations_rptm, n_initializations, feature_metadata)
     28                                                  DataFrame):
     29     # run CTF helper and parse output for QIIME
---> 30     state_ordn, ord_res, dists, straj, ftraj = ctf_helper(table,
     31                                                           sample_metadata,
     32                                                           individual_id_column,

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/ctf.py in ctf_helper(table, sample_metadata, individual_id_column, state_columns, n_components, min_sample_count, min_feature_count, max_iterations_als, max_iterations_rptm, n_initializations, feature_metadata)
    123 
    124     # factorize
--> 125     TF = TensorFactorization(
    126         n_components=n_components,
    127         max_als_iterations=max_iterations_als,

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/factorization.py in fit(self, tensor, y)
    169 
    170         self.tensor = tensor.copy()
--> 171         self._fit()
    172         return self
    173 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/factorization.py in _fit(self)
    204 
    205         # run the tensor factorization method [2]
--> 206         loads, s, dist = tenals(tensor,
    207                                 mask,
    208                                 n_components=self.n_components,

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/factorization.py in tenals(tensor, mask, n_components, n_initializations, max_als_iterations, max_rtpm_iterations, tol_als, tol_rtpm, fillna)
    532             n_, m_ = obs_tmp.shape
    533             eps_tmp = total_nonzeros / np.sqrt(n_ * m_)
--> 534             if rank_estimate(obs_tmp, eps_tmp) >= (min(obs_tmp.shape) - 1):
    535                 warnings.warn('A component of your data may be high-rank.',
    536                               RuntimeWarning)

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/optspace.py in rank_estimate(obs, eps, k, lam, min_rank, max_iter)
    460             cost.append(lam * max(s_one_[idx:]) + idx)
    461         # estimate the rank
--> 462         r_one = np.argmin(cost)
    463         lam += lam
    464         iter_ += 1

<__array_function__ internals> in argmin(*args, **kwargs)

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py in argmin(a, axis, out)
   1274 
   1275     """
-> 1276     return _wrapfunc(a, 'argmin', axis=axis, out=out)
   1277 
   1278 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     52     bound = getattr(obj, method, None)
     53     if bound is None:
---> 54         return _wrapit(obj, method, *args, **kwds)
     55 
     56     try:

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     41     except AttributeError:
     42         wrap = None
---> 43     result = getattr(asarray(obj), method)(*args, **kwds)
     44     if wrap:
     45         if not isinstance(result, mu.ndarray):

ValueError: attempt to get argmin of an empty sequence

I thought maybe that it was a version problem, so I made a new environment and upgraded to 2021.11. (I also noticed the paired PR got merged, hopefully this is true ). I installed gemilli, deicode, maybe empress, and then openpxyl because my metadata starts in an excel file, and jupyter lab because jupyter lab is self love. I think most were pip, but I don't fully remember.

Same error.

Then, I tried on the linux server at work, since it might have been a mac issue (IT file system "upgrade", Switch to Big Sur, etc).
New system, new environment, same error.

It's totally possible that I have done something stupid and obvious; but i'm really struggling to figure out what went wrong.

Thanks!
Justine

cmartino · January 10, 2022, 5:38pm

Hi @jwdebelius,

Thanks for reporting this bug! It is a known issue and should be addressed in the newest installation. See this issue from @Mehrbod_Estaki that was closed. That was merged in PR#40 and so a fresh pip install should solve the problem. Let me know if it does not. We are planning a new version release soon with several updates/fixes.

Thanks,

Cameron

jwdebelius · January 11, 2022, 8:39am

Thanks @cmartino!

I'll try pip instead of the conda build and report back.

Best,
Justine

jwdebelius · January 17, 2022, 3:47pm

Just for future people, (hopefully only short term): you have to pip isntall the git repository, rather than a direct pip install.

I might have a new problem, @cmartino. The version 0.7.0 I've got in the final merged branch is giving me different results from the earlier git branch, even working off the same set of upstream tables.

The tables are filtered slightly differently before they enter the pipeline, and then re-filtered. The final tables contain identical samples, feature, and counts, I checked. However, the order of features between the tables is not the same.

In the same environment, using two tables with features in a different order, I get a different ordination. This seems like a problem (oversight? Bug ?). I'm really not sure what to do about it

cmartino · January 19, 2022, 4:46pm

Hi @jwdebelius,

Thank you for digging into this. I am certain this is due to a bug in the ordering of the features/samples in the structuring of the tensor before the rclr transformation (impacts CTF but not RPCA). A fix is coming to the master branch before the new version (v. 0.0.7) is being uploaded to PyPi. I had not seen or had reports of it changing results drastically but if you are seeing that, then I will push for this fix to get merged ASAP. It may also be the case this is more of a problem in the study design that would generate the error from the previous version (the error you reported originally). I will report back on this thread when that is done.

Thanks again!

jwdebelius · January 19, 2022, 5:09pm

Thanks @cmartino! I might make a point to specifically order the taxa for reproducibility in the current version (maybe filtering by taxonomic metadata ). In the two-sample data I'm working with, I calculated a set of cohen's ds between the two groups in my gradient. Sort 1 placed the primary emphasis on PC 2/PC 3; sort 2 moved it all to PC 1/PC 2. I don't really care which is "right", but I'd like to be consistent.

Thanks!