Gemelli argmin error

Hi Gemelli team (probably @cmartino),

I've uncovered a new and interesting error! I had a table and metadata file which were absolutely able to generate an ordination a month ago. But, I had to update my computer thanks to some institutional IT policies. When I ran the command in my 2020.6 environment, I got an error:

tf_res = \
    q2_gemelli.ctf(table=table_q2,
                   sample_metadata=meta_q2,
                   individual_id_column='host_subject_id',
                   state_column='gradient',
                   # min_feature_count=100,
                   )

And here's my error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1628521/3140904666.py in <module>
      1 ctf_res = \
----> 2     q2_gemelli.ctf(table=table_q2,
      3                    sample_metadata=meta_q2,
      4                    individual_id_column='host_subject_id',
      5                    state_column='tissue_num',

<decorator-gen-559> in ctf(table, sample_metadata, individual_id_column, state_column, n_components, min_sample_count, min_feature_count, max_iterations_als, max_iterations_rptm, n_initializations, feature_metadata)

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py in bound_callable(*args, **kwargs)
    243 
    244                 # Execute
--> 245                 outputs = self._callable_executor_(scope, callable_args,
    246                                                    output_types, provenance)
    247 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/qiime2/sdk/action.py in _callable_executor_(self, scope, view_args, output_types, provenance)
    389 
    390     def _callable_executor_(self, scope, view_args, output_types, provenance):
--> 391         output_views = self._callable(**view_args)
    392         output_views = tuplize(output_views)
    393 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/ctf.py in ctf(table, sample_metadata, individual_id_column, state_column, n_components, min_sample_count, min_feature_count, max_iterations_als, max_iterations_rptm, n_initializations, feature_metadata)
     28                                                  DataFrame):
     29     # run CTF helper and parse output for QIIME
---> 30     state_ordn, ord_res, dists, straj, ftraj = ctf_helper(table,
     31                                                           sample_metadata,
     32                                                           individual_id_column,

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/ctf.py in ctf_helper(table, sample_metadata, individual_id_column, state_columns, n_components, min_sample_count, min_feature_count, max_iterations_als, max_iterations_rptm, n_initializations, feature_metadata)
    123 
    124     # factorize
--> 125     TF = TensorFactorization(
    126         n_components=n_components,
    127         max_als_iterations=max_iterations_als,

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/factorization.py in fit(self, tensor, y)
    169 
    170         self.tensor = tensor.copy()
--> 171         self._fit()
    172         return self
    173 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/factorization.py in _fit(self)
    204 
    205         # run the tensor factorization method [2]
--> 206         loads, s, dist = tenals(tensor,
    207                                 mask,
    208                                 n_components=self.n_components,

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/factorization.py in tenals(tensor, mask, n_components, n_initializations, max_als_iterations, max_rtpm_iterations, tol_als, tol_rtpm, fillna)
    532             n_, m_ = obs_tmp.shape
    533             eps_tmp = total_nonzeros / np.sqrt(n_ * m_)
--> 534             if rank_estimate(obs_tmp, eps_tmp) >= (min(obs_tmp.shape) - 1):
    535                 warnings.warn('A component of your data may be high-rank.',
    536                               RuntimeWarning)

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/gemelli/optspace.py in rank_estimate(obs, eps, k, lam, min_rank, max_iter)
    460             cost.append(lam * max(s_one_[idx:]) + idx)
    461         # estimate the rank
--> 462         r_one = np.argmin(cost)
    463         lam += lam
    464         iter_ += 1

<__array_function__ internals> in argmin(*args, **kwargs)

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py in argmin(a, axis, out)
   1274 
   1275     """
-> 1276     return _wrapfunc(a, 'argmin', axis=axis, out=out)
   1277 
   1278 

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     52     bound = getattr(obj, method, None)
     53     if bound is None:
---> 54         return _wrapit(obj, method, *args, **kwds)
     55 
     56     try:

~/.conda/envs/micc-2021.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     41     except AttributeError:
     42         wrap = None
---> 43     result = getattr(asarray(obj), method)(*args, **kwds)
     44     if wrap:
     45         if not isinstance(result, mu.ndarray):

ValueError: attempt to get argmin of an empty sequence

I thought maybe that it was a version problem, so I made a new environment and upgraded to 2021.11. (I also noticed the paired PR got merged, hopefully this is true :crossed_fingers:). I installed gemilli, deicode, maybe empress, and then openpxyl because my metadata starts in an excel file, and jupyter lab because jupyter lab is self love. I think most were pip, but I don't fully remember.

Same error.

Then, I tried on the linux server at work, since it might have been a mac issue (IT file system "upgrade", Switch to Big Sur, etc).
New system, new environment, same error.

It's totally possible that I have done something stupid and obvious; but i'm really struggling to figure out what went wrong.

Thanks!
Justine

4 Likes

Hi @jwdebelius,

Thanks for reporting this bug! It is a known issue and should be addressed in the newest installation. See this issue from @Mehrbod_Estaki that was closed. That was merged in PR#40 and so a fresh pip install should solve the problem. Let me know if it does not. We are planning a new version release soon with several updates/fixes. :slight_smile:

Thanks,

Cameron

4 Likes

Thanks @cmartino!

I'll try pip instead of the conda build and report back.

Best,
Justine

1 Like

Just for future people, (hopefully only short term): you have to pip isntall the git repository, rather than a direct pip install.

I might have a new problem, @cmartino. The version 0.7.0 I've got in the final merged branch is giving me different results from the earlier git branch, even working off the same set of upstream tables. :scream_cat:

The tables are filtered slightly differently before they enter the pipeline, and then re-filtered. The final tables contain identical samples, feature, and counts, I checked. However, the order of features between the tables is not the same.

In the same environment, using two tables with features in a different order, I get a different ordination. This seems like a problem (oversight? Bug :bug:?). I'm really not sure what to do about it

2 Likes

Hi @jwdebelius,

Thank you for digging into this. I am certain this is due to a bug in the ordering of the features/samples in the structuring of the tensor before the rclr transformation (impacts CTF but not RPCA). A fix is coming to the master branch before the new version (v. 0.0.7) is being uploaded to PyPi. I had not seen or had reports of it changing results drastically but if you are seeing that, then I will push for this fix to get merged ASAP. It may also be the case this is more of a problem in the study design that would generate the error from the previous version (the error you reported originally). I will report back on this thread when that is done.

Thanks again!

2 Likes

Thanks @cmartino! I might make a point to specifically order the taxa for reproducibility in the current version (maybe filtering by taxonomic metadata :crossed_fingers:). In the two-sample data I'm working with, I calculated a set of cohen's ds between the two groups in my gradient. Sort 1 placed the primary emphasis on PC 2/PC 3; sort 2 moved it all to PC 1/PC 2. I don't really care which is "right", but I'd like to be consistent.

Thanks!

1 Like