feat: Implement experimental DataCollector API #2013
```
class DataCollector:
    """
    Example: a model consisting of a hybrid of Boltzmann wealth model and
```
i.e., what if, during the "we are the 99%" protest, people are constantly gifting money randomly, to the point that an emergent 1% arises within the protesters?
Performance benchmarks:
mesa/experimental/observer.py
```
    Example: a model consisting of a hybrid of Boltzmann wealth model and
    Epstein civil violence.

    def get_citizen():
```
There is a problem with this line: since this function is defined within a model's `__init__`, there is no way to refer to it later on when doing further analysis. The only way that makes sense is to define a group dict `{"citizen": lambda: model.get_agents_of_type(Citizen)}` that `DataCollector.collect` can use to resolve the named group.

The implementation may not use the observer pattern, but at least it allows parallel evolution of the API design, so that we can merge this once there is a consensus, and implement #1145 on top of the new API.
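To make the named-group idea concrete, here is a minimal, runnable sketch of how a group dict of zero-argument callables could be resolved at collect time. Note this is an illustration only: the `DataCollector` signature, the `groups`/`collectors` parameter names, and the toy `Citizen` class are all my assumptions, not the merged Mesa API.

```python
# Hypothetical sketch: a DataCollector that resolves named groups lazily
# via a dict of zero-argument callables (an assumption, not settled API).

class DataCollector:
    def __init__(self, groups, collectors):
        # groups: group name -> zero-argument callable returning the current members
        self.groups = groups
        # collectors: (column name, group name) -> attribute name to collect
        self.collectors = collectors
        self.rows = []

    def collect(self):
        for (column, group_name), attr in self.collectors.items():
            members = self.groups[group_name]()  # resolve the group lazily
            self.rows.append((column, [getattr(m, attr) for m in members]))


class Citizen:
    def __init__(self, condition):
        self.condition = condition


citizens = [Citizen("Quiescent"), Citizen("Active")]
dc = DataCollector(
    groups={"citizen": lambda: citizens},
    collectors={("condition", "citizen"): "condition"},
)
dc.collect()
print(dc.rows)  # [('condition', ['Quiescent', 'Active'])]
```

Because the group is resolved through the callable on each `collect()`, the same name keeps working even as the underlying agent set changes over time.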
Thanks for picking this up. Leaving the API aside for now, I notice that in your code you try to solve everything within a single datacollector class. This differs from my thinking; let me try to articulate it here. Once I have some time, I'll also try to give a draft implementation.

I doubt that solving the data collection problems within a single class is possible. It is bound to violate the single responsibility principle and produce code that is difficult to read.
This PR is mainly discussing the API; the backend implementation can be refactored later. I have commented on this in the PR description: because the latter is more concise.
I personally find nested dicts virtually unreadable. I agree that you can potentially end up in a situation where you want to collect multiple things from the same object, but that is relatively easy to handle within a Collector class, at least as long as you only want to retrieve attributes. Worrying about conciseness is relevant, but not at the expense of clarity and at least some consideration of the underlying implementation.
For consideration about clarity, let's compare them side by side.

```python
collectors = {
    model: {
        "n_quiescent": lambda model: len(
            model.agents.select(
                agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
            )
        ),
        "gini": lambda model: calculate_gini(model.agents.get("wealth")),
    },
    get_citizen: {"condition": "condition"},
    "agents": {"wealth": "wealth"},
}
```

With `collect`:

```python
collectors = {
    "n_quiescent": collect(
        model,
        lambda model: len(
            model.agents.select(
                agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
            )
        ),
    ),
    "gini": collect(model, lambda model: calculate_gini(model.agents.get("wealth"))),
    "condition": collect(get_citizen(), "condition"),
    "wealth": collect(model.agents, "wealth"),
}
```

In what way is the latter clearer? In both cases the storage is still a DataFrame with two indexes: the group/collection and the name.
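To make the two-index storage claim above concrete, here is a toy illustration (not Mesa code; the `record` helper and the sample values are made up) of a table keyed by (group, name), with one value appended per step:

```python
# Toy sketch of two-index storage: one table keyed by (group, name),
# accumulating one value per collect step. Values are illustrative only.
storage = {}  # (group, name) -> list of per-step values

def record(group, name, value):
    storage.setdefault((group, name), []).append(value)

record("model", "gini", 0.38)
record("model", "gini", 0.41)
record("agents", "wealth", [3, 5, 2])

print(storage[("model", "gini")])  # [0.38, 0.41]
```

Either spelling of `collectors` above could feed this same storage shape, which is why the storage layout does not by itself decide between the two APIs.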
For me, the second is more straightforward to read because it is flat.
In case you are concerned about the flat/nested structure, how about:

```python
collectors = {
    ("n_quiescent", model): lambda model: len(
        model.agents.select(
            agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
        )
    ),
    ("gini", model): lambda model: calculate_gini(model.agents.get("wealth")),
    ("condition", get_citizen): "condition",
    ("wealth", lambda: model.agents): "wealth",
}
```

?
Proposal 4: separating between group selectors and collectors.

```python
groups = {
    "quiescents": lambda: model.agents.select(
        agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
    ),
    "citizens": lambda: model.get_agents_of_type(Citizen),
}
collectors = {
    ("n_quiescent", "quiescents"): len,
    ("gini", model): lambda model: calculate_gini(model.agents.get("wealth")),
    # Edit: a better way to do the former:
    ("gini", "agents"): lambda agents: calculate_gini(agents.get("wealth")),
    ("condition", "citizens"): "condition",
    ("wealth", "agents"): "wealth",
}
```
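A runnable sketch of how proposal 4 could be resolved at collect time. The dispatch rules here (a callable value is applied to the resolved group; a string value is an attribute gathered from each member) are my reading of the proposal, not settled API, and the `collect_once` name and toy `Citizen` class are invented for illustration:

```python
# Hypothetical resolution of proposal 4: group selectors are zero-arg
# callables; each collector key is (column name, group name).

def collect_once(groups, collectors):
    out = {}
    for (column, group_name), spec in collectors.items():
        group = groups[group_name]()  # resolve the group selector
        if callable(spec):
            out[column] = spec(group)           # e.g. len, or a lambda
        else:
            out[column] = [getattr(m, spec) for m in group]  # attribute name
    return out


class Citizen:
    def __init__(self, condition):
        self.condition = condition


agents = [Citizen("Quiescent"), Citizen("Active"), Citizen("Quiescent")]
groups = {
    "citizens": lambda: agents,
    "quiescents": lambda: [a for a in agents if a.condition == "Quiescent"],
}
collectors = {
    ("n_quiescent", "quiescents"): len,
    ("condition", "citizens"): "condition",
}
print(collect_once(groups, collectors))
# {'n_quiescent': 2, 'condition': ['Quiescent', 'Active', 'Quiescent']}
```

The key property is that a group selector like `"quiescents"` is defined once and reused by any number of collectors.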
I have to think on that. For example, I am unsure how to read the last lambda statement. Note, however, where we agree. Data collection involves:

So it seems we need at least a Collector class:

```python
class Collector:
    def __init__(self, name: str, obj: Any, attrs: str | List[str], func: Callable = None):
        ...
```

If we want brevity, we might make the name optional and default it to the attribute name when no name is provided. To be clear, this Collector class is a building block in the overall architecture, not the actual API that the user would need to interact with.

Added after seeing proposal 4: yes, I think you are on to something here. Basically, your groups are contextual objects (i.e., they change over time) that you want the data collector to operate on. So yes, we might need this.
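Filling in the stub, a minimal runnable version of this building block might look as follows. The name-defaulting behavior and the `collect` method are my interpretation of the suggestion, and `ToyModel` is a placeholder; none of this is an agreed design:

```python
# Sketch of the Collector building block: when no name is given, it
# defaults to the (first) attribute name, as suggested above.

class Collector:
    def __init__(self, obj, attrs, func=None, name=None):
        self.obj = obj
        # normalize a single attribute name to a list
        self.attrs = [attrs] if isinstance(attrs, str) else list(attrs)
        self.func = func
        # default the name to the first attribute name if not provided
        self.name = name if name is not None else self.attrs[0]

    def collect(self):
        values = {a: getattr(self.obj, a) for a in self.attrs}
        return self.func(values) if self.func else values


class ToyModel:
    wealth = 10


c = Collector(ToyModel(), "wealth")
print(c.name)       # wealth
print(c.collect())  # {'wealth': 10}
```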
The reason why I added lambda instead of straight

My proposal 4 further splits the Collector class into a GroupSelector and a collector function. The reason is that I want to reuse the GroupSelector in various collectors without having to define a named function, because that would be less concise.
Meta: we now have 4 different terms:

I was wondering whether I should have called it collector and collection, instead of collector and group.
Fair enough, I guess a group is indeed a set or a collection. Another question related to your remark: at the moment, data collection is defined within the model init. For me, this has always been a bit strange. A model runs; data collection is conceptually external to this. This is just a weird idea for discussion's sake, but what about:

```python
model = SomeModel()

# Setup data collection
data_collector = DataCollector("whatever the API will look like")

for _ in range(100):
    model.step()
    data_collector.collect()
```

At least in this way, you have a clean separation of concerns.
@quaquel I tried to implement #2013 (comment) (i.e., separating the data collector spec and object from the model):

```python
self.datacollector = mesa.DataCollector(
    {"happy": "happy"},  # Model-level count of happy agents
)
```

That said, I find it reasonable to separate the data collector spec for larger models.
I also had this idea somewhere that, similar to our
In some situations, the viz module has to depend on the data collection output. Schelling's happy agent count is taken from the data collector output, not from a direct model attribute. This means that the structure for small and large models has to diverge. For a small model, it's OK to solder on the data collection and have JupyterViz detect whether the model contains a
Any objections to proposal 4?
I disagree with this. In my view, we need to develop a design that scales from small to large models rather than encourage the use of (dirty) shortcuts in small models. The ongoing discussion is really useful for this.

So, conceptually, data collection and visualization are separate. The current practice of hijacking the data collector thus needs to change. It also gives rise to a problem I recently ran into: I had a Jupyter visualization that ran slower and slower because the data frame being displayed in one of the graphs became bigger and bigger. Ideally, you only want to add new data to a visual element rather than replace historic data.

I agree with @rht, however, that sometimes there are model statistics that we both want to store for later analysis and display in a GUI. So, one possibility would be to have some kind of Statistic class. This class is only responsible for tracking some state variables within the model (and possibly doing operations on them). The datacollector mechanism could query these statistics objects on each collect call; that is, the persistent storage of statistics over time is the responsibility of the datacollector. A GUI, likewise, could query these statistics objects for display purposes. If a graph shows the dynamics over time, handling that is the responsibility of the GUI element, not of the Statistic object.
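The Statistic idea above can be sketched in a few lines. All names here (`Statistic`, `HistoryCollector`, the toy `Model`) are placeholders I invented to illustrate the division of responsibility, not proposed Mesa classes:

```python
# Sketch of the separation described above: a Statistic only computes the
# current value; persisting values over time is the collector's job, and
# a GUI could query the same Statistic objects independently.

class Statistic:
    def __init__(self, name, compute):
        self.name = name
        self.compute = compute  # callable returning the current value

    def value(self, model):
        return self.compute(model)


class HistoryCollector:
    """Queries each Statistic on collect() and stores the results over time."""
    def __init__(self, statistics):
        self.statistics = statistics
        self.history = {s.name: [] for s in statistics}

    def collect(self, model):
        for s in self.statistics:
            self.history[s.name].append(s.value(model))


class Model:
    def __init__(self):
        self.happy = 0

    def step(self):
        self.happy += 1


model = Model()
collector = HistoryCollector([Statistic("happy", lambda m: m.happy)])
for _ in range(3):
    model.step()
    collector.collect(model)
print(collector.history)  # {'happy': [1, 2, 3]}
```

A GUI element that only needs the current value would call `statistic.value(model)` directly and never touch the collector's history, which addresses the ever-growing-DataFrame problem described above.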
Definitely not my best work, but I want to throw this PR in, for inspiration:
This is akin to saying Python's

But in the end, it just boils down to how the datacollector's collected data should be accessed: either from the model attribute or from an argument passed to JupyterViz. It is mainly a convention and doesn't affect the backend implementation that much. I can at least remove the hardcoding of
The canned statistics functions in #1145 can help process the simulation state in a way that is single-use only (i.e., not stored unless by a state replayer).
No, Print is built on top of
At least we are on the same page about the convenience of accessing a data collector via a model, for simple examples.

This is not the full story. While I don't know how the concept of the DataFrame came to be in R, I can at least see that the API of the DataFrame in pandas grew organically to suit usage needs, and that eventually faster backends (pyarrow instead of NumPy) were implemented. API and backend architecture may evolve semi-independently (as I said in #2013 (comment)). What matters is that the API should be designed in a way that doesn't restrict the backend possibilities. If mistakes are made in the API, there is always Mesa 4.0 for further iteration (after all, semver is designed to deal with API-breaking changes, not architecture).
I would not do that, even for simple models. I want a convenient high-level API for specifying data collection in simple cases (e.g., attributes from sets of agents and the model, end-of-life data from agents, simple callables operating on agents), built on a powerful set of classes and functions that users can use and extend for their own more complex models. I also want a clean separation between data collection for later analysis and anything to do with user interfaces.
As described in projectmesa#2013 (comment)
I implemented proposal 4, #2013 (comment), because it is strictly better than the original proposal in terms of deduplicating the way users specify the groups. The example in the docstring has been updated accordingly.
As described in projectmesa#2013 (comment)
Since we have concentrated the discussion in #1944, should we close this PR?
This is an attempt to implement the API as discussed in #1944. I figure it is easier to comment on a PR than on a linear GH thread.

I have constrained the implementation to be as simple as possible; as such, features like retrieving multiple attributes are not implemented, because the API would then get unnecessarily bigger, need more testing, and have a bigger surface area for bugs and gotchas. Edit 1: at least not until the initial small API has become well tested.

In this implementation, instead of `{name1: collect(collection, func1), name2: collect(collection, func2)}`, it is `{collection: {name1: func1, name2: func2}}`.

Note: has edit 1.
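The nested `{collection: {name: func}}` shape can be exercised with a toy resolver. This is illustrative only and not the PR's actual code; the `collect_all` name and the dispatch on callable collection keys are assumptions:

```python
# Toy sketch of collecting with the nested spec {collection: {name: func}}:
# a collection key may be a zero-arg callable (resolved per collect) or a
# concrete object, and each named func is applied to the resolved group.

def collect_all(spec):
    results = {}
    for collection, measures in spec.items():
        group = collection() if callable(collection) else collection
        for name, func in measures.items():
            results[name] = func(group)
    return results


wealths = [1, 5, 3]
spec = {
    (lambda: wealths): {
        "total_wealth": sum,
        "max_wealth": max,
    }
}
print(collect_all(spec))  # {'total_wealth': 9, 'max_wealth': 5}
```

Grouping the measures under their collection is what lets the resolver fetch each collection once per collect call, rather than once per named measure as in the flat `collect(...)` spelling.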