
[Feature] Recommended workflow for storing results returned as numpy.ndarray? #544

Open
joschaschmiedt opened this issue Feb 27, 2023 · 4 comments
Labels: enhancement, new functionality

Comments

joschaschmiedt commented Feb 27, 2023

Several functions that operate on AnalogSignal data return simple numpy.ndarrays, e.g.

Most users will probably want to store the results in some way. For the AnalogSignal outputs, Neo provides easy-to-use saving to disk. For the spectral and correlation measures however, Neo does not offer this (yet).

Are there any best practices what to do with these "pure" results, or plans towards implementing a Spectrum data model including I/O?


mdenker commented Mar 2, 2023

Hi Joscha,
thanks for your input. Indeed, the return values of analysis functions are a hot topic on our agenda. For some analysis functions, we used Neo objects as return types for precisely the reasons you mentioned. However, this approach quickly reaches its limits. For example, a time histogram could be interpreted as an analog signal, but in a way it's more than just that -- it has a concept of bin width, for example.

Therefore, we are planning to move to an alternative representation: something like Neo, but for analysis results rather than input data. The idea is that a minimal number of objects represent the analysis results, certain key metadata, and additional information such as Neo annotations, and of course support serialization to disk (maybe even the option to temporarily dump objects to disk, similar to Neo's lazy loading, to deal with large analysis results). These objects would not become part of Neo, since structurally this would not fit; however, it is possible to draw links between the tools nevertheless. You can find an early prototype of how this could look for a TimeHistogram object here:
https://github.com/INM-6/elephant/blob/feature/basic_provenance/elephant/buffalo/objects/histogram.py
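As a rough, hypothetical sketch (not the actual prototype's API -- names like `TimeHistogram`, `bin_width`, and `annotations` are illustrative), such a result object might bundle the raw array with its key metadata:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TimeHistogram:
    """Hypothetical analysis-result container: an array plus key metadata."""

    values: np.ndarray  # histogram counts per bin
    bin_width: float    # bin width, e.g. in seconds
    t_start: float = 0.0  # time of the first bin edge
    annotations: dict = field(default_factory=dict)  # Neo-style annotations

    @property
    def bin_edges(self) -> np.ndarray:
        # edges reconstructed from t_start and bin_width
        return self.t_start + self.bin_width * np.arange(len(self.values) + 1)


hist = TimeHistogram(values=np.array([3, 1, 4]), bin_width=0.1)
print(hist.bin_edges.shape)
```

The point is that metadata such as the bin width travels with the array instead of being lost when a plain ndarray is returned.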

Implementing such objects would further simplify interoperability with a companion project, alpaca (first release pending within the next weeks, https://alpaca-prov.readthedocs.io/en/latest/), which captures the provenance of an analysis workflow. We had prioritized this work on provenance over the data objects; however, we are confident that the data objects will be on the agenda this year (together with a new object to represent an experimental trial, which is what we are currently working on).

I hope this goes in the direction of what you had in mind. Of course, we are very open to any ideas, suggestions, and contributions on this topic!

mdenker added the enhancement and new functionality labels on Mar 2, 2023
@joschaschmiedt

Hi @mdenker, great to hear that there are plans for this. I agree that this is probably out of scope for Neo.

In general I like the direction that the AnalysisObject is taking, and alpaca looks very interesting!

As data analysis in electrophysiology is often exploratory and constantly changing, it would already be great to offer something a little more rigid than saving a complete workspace in MATLAB, but not too much more. Often an analysis result is not much more than a couple of numpy arrays plus metadata, which could be stored in simple, future-proof formats such as JSON and (flat) HDF or NPY. If I understood it correctly, alpaca is basically almost doing that already, except for serializing the arrays. Correct?

From an architectural point of view, I'm not sure that each analysis method needs to implement its own result class inheriting from AnalysisObject. This may be useful for achieving forward compatibility of the stored analysis results, but I'm not sure that's achievable or necessary. I think AnalysisObject could be treated as a flexible container that stores as much metadata as possible (auto-magically) together with the arrays, and serializes results using simple, flat data formats.
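A minimal sketch of that "arrays plus metadata in flat files" idea (all function and file names here are hypothetical, not an existing Elephant API): each array goes into its own .npy file and everything else into a JSON sidecar.

```python
import json
from pathlib import Path

import numpy as np


def save_result(folder, arrays, metadata):
    """Store each ndarray as <name>.npy plus a metadata.json sidecar."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, arr in arrays.items():
        np.save(folder / f"{name}.npy", arr)
    (folder / "metadata.json").write_text(json.dumps(metadata, indent=2))


def load_result(folder):
    """Rebuild the arrays dict and metadata from the folder."""
    folder = Path(folder)
    arrays = {p.stem: np.load(p) for p in folder.glob("*.npy")}
    metadata = json.loads((folder / "metadata.json").read_text())
    return arrays, metadata


# usage: a made-up spectral result
save_result("psd_result",
            {"frequencies": np.linspace(0, 100, 5), "power": np.ones(5)},
            {"method": "welch", "units": "V**2/Hz"})
arrays, meta = load_result("psd_result")
```

Both files remain readable without Elephant or Neo installed, which is the future-proofing argument made above.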


joschaschmiedt commented Mar 3, 2023

Thinking about it, maybe a dataclass, which tells both the user and the developer which attributes should be there, in combination with a metadata-enhanced serializer would be robust enough. The serializer could iterate over the dataclass fields and store all numpy.ndarrays in binary form (H5/NPY) and everything else as ASCII (JSON/...).
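For instance (a sketch under the assumptions above; `SpectrumResult` and `serialize` are made-up names, and only the NPY/JSON branch is shown), the serializer could dispatch on the field type:

```python
import dataclasses
import json
from pathlib import Path

import numpy as np


@dataclasses.dataclass
class SpectrumResult:
    # The dataclass documents which attributes a result is expected to have.
    frequencies: np.ndarray
    power: np.ndarray
    method: str
    n_segments: int


def serialize(result, folder):
    """Iterate over the dataclass fields: ndarrays -> .npy, the rest -> JSON."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    plain = {}
    for f in dataclasses.fields(result):
        value = getattr(result, f.name)
        if isinstance(value, np.ndarray):
            np.save(folder / f"{f.name}.npy", value)
        else:
            plain[f.name] = value  # must be JSON-serializable
    (folder / "attributes.json").write_text(json.dumps(plain, indent=2))


res = SpectrumResult(np.linspace(0, 50, 3), np.zeros(3), "welch", 8)
serialize(res, "spectrum_out")
```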

Edit: I stumbled upon https://github.com/lidatong/dataclasses-json, which may be useful in this context.


mdenker commented Mar 6, 2023

Hi, and thanks for all your great comments, ideas, and suggestions. I agree that your idea of a generic AnalysisObject-type container that "always works" is a very interesting concept that could already help a lot. At the same time, it may still be beneficial to have more specialized (e.g., subclassed) objects to describe certain recurring types of analysis results; these would define the structure of the analysis in greater depth and help -- in the long run -- with interoperability and clarity of code. I think both concepts could work well together.

(Regarding alpaca: it is aimed merely at tracking the provenance and data flow of inputs and outputs during a script execution, and does not get involved with the structure or serialization of the data as such. However, the two approaches could be seen as synergistic in this discussion.)

Moritz-Alexander-Kern added this to the v0.14.0 milestone on Jul 24, 2023
Moritz-Alexander-Kern removed this from the v0.14.0 milestone on Sep 18, 2023
Moritz-Alexander-Kern changed the title from "Recommended workflow for storing results returned as numpy.ndarray?" to "[Feature] Recommended workflow for storing results returned as numpy.ndarray?" on Mar 20, 2024