feat: Implement experimental DataCollector API #2013

rht · 2024-01-28T08:55:20Z

This is an attempt to implement the API as discussed in #1944. I figure it is easier to comment on a PR than on a linear GH thread.

I have kept the implementation as simple as possible. As such, a feature like retrieving multiple attributes

"pos": collect(model.agents, ["x", "y"]),  # retrieve multiple attributes

is not implemented, because it would make the API unnecessarily bigger, require more testing, and enlarge the surface area for bugs and gotchas. Edit1: at least not until the initial small API has been well tested.

In this implementation, instead of {name1: collect(collection, func1), name2: collect(collection, func2)}, it is {collection: {name1: func1, name2: func2}}.

Note: has edit1.

rht · 2024-01-28T08:57:47Z

mesa/experimental/observer.py

+
+class DataCollector:
+    """
+    Example: a model consisting of a hybrid of Boltzmann wealth model and


aka what if during the "we are the 99%" protest, people are constantly gifting money randomly, to the point that there are emergent 1% within the protesters.

github-actions · 2024-01-28T08:59:39Z

Performance benchmarks:

Model	Size	Init time [95% CI]	Run time [95% CI]
Schelling	small	🔵 -0.1% [-0.4%, +0.2%]	🔵 +0.1% [-0.1%, +0.2%]
Schelling	large	🔵 -0.3% [-1.0%, +0.5%]	🔵 -0.8% [-1.7%, +0.1%]
WolfSheep	small	🔵 +0.4% [+0.0%, +0.8%]	🔵 +0.3% [+0.2%, +0.4%]
WolfSheep	large	🔵 +0.3% [-0.9%, +1.4%]	🔵 +0.6% [-0.3%, +1.4%]
BoidFlockers	small	🔵 +0.1% [-0.5%, +0.7%]	🔵 +1.0% [+0.4%, +1.5%]
BoidFlockers	large	🔵 -0.9% [-1.2%, -0.5%]	🔵 +0.5% [-0.0%, +1.0%]

rht · 2024-01-28T09:23:06Z

mesa/experimental/observer.py

+    Example: a model consisting of a hybrid of Boltzmann wealth model and
+    Epstein civil violence.
+    ```
+    def get_citizen():


There is a problem with this line: because the function is defined inside the model's __init__, there is no way to refer to it later on when doing further analysis. The only approach that makes sense is to define a group dict
{"citizen": lambda: model.get_agents_of_type(Citizen)} that DataCollector.collect can use to resolve the named group.
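A minimal sketch of the named-group idea, using stand-in classes rather than Mesa's own. `ToyModel`, `resolve_group`, and the toy `Citizen` are hypothetical names chosen for illustration. The group is stored as a zero-argument callable under a string key, so analysis code can still refer to it by name after `__init__` has returned.

```python
# Hypothetical stand-ins; none of these classes or helpers come from Mesa.
class Citizen:
    def __init__(self, condition):
        self.condition = condition


class ToyModel:
    def __init__(self):
        self.agents = [Citizen("Quiescent"), Citizen("Active")]
        # Named groups: string key -> zero-argument callable.
        self.groups = {
            "citizen": lambda: [a for a in self.agents if isinstance(a, Citizen)],
        }


def resolve_group(model, name):
    """Look up a group by its name and evaluate it at call time."""
    return model.groups[name]()


model = ToyModel()
conditions = [a.condition for a in resolve_group(model, "citizen")]
```

Because the key is a plain string, the same name can later label the collected data.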

rht · 2024-01-28T11:42:34Z

The implementation may not use the observer pattern, but at least it allows parallel evolution of the API design, so that we can merge this once there is a consensus, and implement #1145 on top of the new API.

quaquel · 2024-01-28T19:15:27Z

Thanks for picking this up. Leaving the API aside for now, I notice that in your code, you try to solve everything within a single datacollector class. This is different from my thinking. Let me try to articulate it here. Once I have some time, I'll also try to give a draft implementation.

I want a DataCollector class. This is effectively a container of Collectors and the primary point of interaction for the user.
I want a Collector class. This class would be responsible for gathering the data from a single object. This class would also be responsible for returning a data frame/series when requested.
It might be necessary to have multiple Collector classes because, for example, how you interact with the model object is different from how you interact with an AgentSet.
I was considering using a factory method, collect, for constructing the appropriate Collector instance based on the provided arguments.
It might be possible to have collectors who operate on other collectors, but I am unsure about this.

I doubt solving the data collection problems within a single class is possible. It is bound to violate the single responsibility principle and produce code that is difficult to read.
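The split described above could look roughly like the sketch below. It is only an illustration under assumptions: `ModelCollector`, `AgentSetCollector`, and the type check inside `collect` are hypothetical, and plain Python containers stand in for Mesa's model and AgentSet.

```python
from types import SimpleNamespace


# Hypothetical sketch; class and function names are illustrative only.
class ModelCollector:
    """Collects one attribute from a single model-like object."""

    def __init__(self, obj, attr):
        self.obj, self.attr, self.rows = obj, attr, []

    def collect(self):
        self.rows.append(getattr(self.obj, self.attr))


class AgentSetCollector:
    """Collects one attribute from every member of a group of agents."""

    def __init__(self, agents, attr):
        self.agents, self.attr, self.rows = agents, attr, []

    def collect(self):
        self.rows.append([getattr(a, self.attr) for a in self.agents])


def collect(obj, attr):
    """Factory: choose a collector type based on the object passed in."""
    if isinstance(obj, (list, tuple, set)):
        return AgentSetCollector(obj, attr)
    return ModelCollector(obj, attr)


class DataCollector:
    """Container of named collectors; the user-facing entry point."""

    def __init__(self, collectors):
        self.collectors = dict(collectors)

    def collect(self):
        for collector in self.collectors.values():
            collector.collect()

    def __getitem__(self, name):
        return self.collectors[name]


model = SimpleNamespace(
    steps=3, agents=[SimpleNamespace(wealth=1), SimpleNamespace(wealth=2)]
)
dc = DataCollector(
    {"steps": collect(model, "steps"), "wealth": collect(model.agents, "wealth")}
)
dc.collect()
```

Each collector owns its own rows, so converting to a data frame or series could live in the collector rather than in the container.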

rht · 2024-01-28T20:12:20Z

This PR is mainly about the API. The backend implementation can be refactored later.

I was considering using a factory method, collect, for constructing the appropriate Collector instance based on the provided arguments.

I have commented on this in the PR description:

In this implementation, instead of {name1: collect(collection, func1), name2: collect(collection, func2)}, it is {collection: {name1: func1, name2: func2}}.

Because the latter is more concise.

quaquel · 2024-01-28T20:30:03Z

I personally find nested dicts virtually unreadable.

In my original proposal, I saw each Collector as having a name. The user can retrieve each collector by this name from the CollectorContainer/DataCollector. Next, a Collector is nothing but the retrieval of one or more things from an object. This object can be the model, an agentset, or whatever. Hence, I wanted to keep the object and what is being collected together.

I agree that you can potentially end up in a situation where you want to collect multiple things from the same object. But that is relatively easy to handle within a Collector class. At least as long as you only want to retrieve attributes.

Worrying about conciseness is relevant, but not at the expense of clarity and at least some consideration of the underlying implementation.

rht · 2024-01-28T20:45:28Z

Worrying about conciseness is relevant, but not at the expense of clarity and at least some consideration of the underlying implementation.

For consideration about clarity, let's compare them side by side.
Without collect (the mental model is that you specify a collector by 2 keys: the collection/group, and the name)

collectors = {
    model: {
        "n_quiescent": lambda model: len(
            model.agents.select(
                agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
            )
        ),
        "gini": lambda model: calculate_gini(model.agents.get("wealth")),
    },
    get_citizen: {"condition": "condition"},
    "agents": {"wealth": "wealth"},
}

With collect

collectors = {
    "n_quiescent": collect(
        model,
        lambda model: len(
            model.agents.select(
                agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
            )
        ),
    ),
    "gini": collect(model, lambda model: calculate_gini(model.agents.get("wealth"))),
    "condition": collect(get_citizen(), "condition"),
    "wealth": collect(model.agents, "wealth"),
}

In what way is the latter clearer? In both cases, the storage is still a DF with two index levels: the group/collection and the name.
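To illustrate that both spellings carry the same two-key information, here is a small sketch that converts the nested form into a flat one keyed by (name, group). `flatten_spec` is a hypothetical helper, and the string values are placeholders for real collector functions.

```python
def flatten_spec(nested):
    """Turn {group: {name: func}} into {(name, group): func}."""
    return {
        (name, group): func
        for group, entries in nested.items()
        for name, func in entries.items()
    }


# Placeholder strings stand in for groups and collector functions.
nested = {
    "model": {"gini": "compute_gini", "n_quiescent": "count_quiescent"},
    "agents": {"wealth": "wealth"},
}
flat = flatten_spec(nested)
```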

quaquel · 2024-01-28T20:49:30Z

For me, the second is more straightforward to read because it is flat.

rht · 2024-01-28T20:49:48Z

In case you are concerned about the flat/nested structure, how about

collectors = {
    ("n_quiescent", model): lambda model: len(
        model.agents.select(
            agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
        )
    ),
    ("gini", model): lambda model: calculate_gini(model.agents.get("wealth")),
    ("condition", get_citizen): "condition",
    ("wealth", lambda: model.agents): "wealth",
}

?

rht · 2024-01-28T21:04:12Z

Proposal 4: separating between group selectors and collectors

groups = {
    "quiescents": lambda: model.agents.select(
        agent_type=Citizen, filter_func=lambda a: a.condition == "Quiescent"
    ),
    "citizens": lambda: model.get_agents_of_type(Citizen),
}
collectors = {
    ("n_quiescent", "quiescents"): len,
    ("gini", model): lambda model: calculate_gini(model.agents.get("wealth")),
    # Edit: a better way to do the former:
    ("gini", "agents"): lambda agents: calculate_gini(agents.get("wealth")),
    ("condition", "citizens"): "condition",
    ("wealth", "agents"): "wealth",
}
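A minimal sketch of how a spec shaped like proposal 4 might be evaluated. This is not the PR's implementation: `collect_once` is a hypothetical helper, plain lists stand in for AgentSets, and the dispatch rule (a callable is applied to the whole group, a string is read as an attribute of each member) is an assumption.

```python
from types import SimpleNamespace

# Toy agents standing in for Citizen objects (hypothetical data).
agents = [
    SimpleNamespace(condition="Quiescent", wealth=1),
    SimpleNamespace(condition="Active", wealth=3),
    SimpleNamespace(condition="Quiescent", wealth=2),
]

groups = {
    "agents": lambda: list(agents),
    "quiescents": lambda: [a for a in agents if a.condition == "Quiescent"],
}
collectors = {
    ("n_quiescent", "quiescents"): len,  # callable: applied to the whole group
    ("wealth", "agents"): "wealth",      # string: attribute read per member
}


def collect_once(groups, collectors):
    """Evaluate every collector against its freshly resolved group."""
    row = {}
    for (name, group_name), spec in collectors.items():
        members = groups[group_name]()  # re-evaluated on every call
        if callable(spec):
            row[name] = spec(members)
        else:
            row[name] = [getattr(m, spec) for m in members]
    return row


first = collect_once(groups, collectors)
agents.append(SimpleNamespace(condition="Quiescent", wealth=5))
second = collect_once(groups, collectors)
```

The group selectors are defined once and reused by name, which is the deduplication the proposal is after.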

quaquel · 2024-01-28T21:04:34Z

I have to think on that. For example, I am unsure how to read the last lambda statement.

Note, however, where we agree. Data collection involves

1. an object
2. something to collect from this object
3. and/or an operation to apply to the object / what was collected in 2
4. a name by which whatever is collected will be known.

So it seems we need at least a Collector class:

from typing import Any, Callable

class Collector:
    def __init__(self, name: str, obj: Any, attrs: str | list[str], func: Callable | None = None):
        ...

If we want brevity, we might make the name optional and default the name to the attribute name if name is not provided.

To be clear, this Collector class is a building block in the overall architecture. Not the actual API that the user would need to interact with.
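One way the optional name could work is sketched below. The argument order differs from the signature above so that `name` can be omitted, and the defaulting rule (joining the attribute names with an underscore) is an assumption made for illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional, Union


class Collector:
    """Hypothetical building block: read attributes from one object."""

    def __init__(
        self,
        obj: Any,
        attrs: Union[str, list],
        func: Optional[Callable] = None,
        name: Optional[str] = None,
    ):
        self.obj = obj
        self.attrs = [attrs] if isinstance(attrs, str) else list(attrs)
        self.func = func
        # Assumed defaulting rule: join the attribute names when no name is given.
        self.name = name if name is not None else "_".join(self.attrs)

    def collect(self):
        """Read the attributes, then apply func to them if one was given."""
        values = [getattr(self.obj, attr) for attr in self.attrs]
        return self.func(*values) if self.func is not None else values


agent = SimpleNamespace(x=2, y=5)
pos = Collector(agent, ["x", "y"], name="pos")
total = Collector(agent, ["x", "y"], func=lambda x, y: x + y)
```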

added after seeing proposal 4:

Yes, I think you are on to something here. Basically, your groups are contextual objects (i.e., they change over time) that you want the data collector to operate on. So yes, we might need this.

rht · 2024-01-28T21:09:54Z

For example, I am unsure how to read the last lambda statement.

The reason why I added lambda instead of straight model.agents is because the latter returns an immutable AgentSet. model.agents might have more/fewer agents since the previous data collection.
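The difference between storing a reference and storing a callable can be shown with plain lists. This sketch assumes, as the comment above does, that a stored `model.agents` value is a snapshot; a copied list stands in for it here.

```python
agents = ["a1", "a2"]

# Evaluated once, at definition time: a snapshot of the population.
snapshot = list(agents)


def live():
    """Evaluated again at every call, so it sees the current population."""
    return list(agents)


agents.append("a3")  # the population changes between collections

snapshot_size = len(snapshot)
live_size = len(live())
```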

So it seems we need at least a Collector class:

My proposal 4 splits further the Collector class into a GroupSelector and a collector function. The reason is that I want to reuse the GroupSelector in various collectors, without having to define a named function, because it'd be less concise.

rht · 2024-01-28T21:16:31Z

Meta: we are having 4 different terms now:

space: Cell & "Collection"
time: Agent & "Set"
observation: Collector & "Group"
parallel universe: Runner & "Batch" (or configuration?)

I was wondering whether I should have called it collector and collection, instead of collector and group.

quaquel · 2024-01-28T21:23:36Z

Fair enough, I guess group is indeed a set or a collection.

another question related to your remark

There is a problem with this line: since this function is defined within a model's __init__,

At the moment, data collection is defined within the model init. For me, this has always been a bit strange. A model runs. Data collection is conceptually external to this. This is just a weird idea for discussion's sake but what about

model = SomeModel()

# Setup data collection
data_collector = DataCollector("whatever the API will look like")

for _ in range(100):
    model.step()
    data_collector.collect()

At least in this way, you have a clean separation of concerns.

rht · 2024-01-28T21:40:34Z

@quaquel I tried to implement #2013 (comment) (i.e. separate data collector spec and object from model __init__) in the experimental Schelling example, but encountered a road block where the current JupyterViz accepts only the model class -- it assumes that the data collector spec is soldered into the model spec.
However, the data collector spec is rather small:

        self.datacollector = mesa.DataCollector(
            {"happy": "happy"},  # Model-level count of happy agents
        )

That said, I find it reasonable to separate the data collector spec for larger models.

Corvince · 2024-01-28T21:52:38Z

Fair enough, I guess group is indeed a set or a collection.

another question related to your remark

There is a problem with this line: since this function is defined within a model's __init__,

At the moment, data collection is defined within the model init. For me, this has always been a bit strange. A model runs. Data collection is conceptually external to this. This is just a weird idea for discussion's sake but what about
model = SomeModel()

# Setup data collection
data_collector = DataCollector("whatever the API will look like")

for _ in range(100):
    model.step()
    data_collector.collect()
At least in this way, you have a clean separation of concerns.

I also had the idea that, similar to our batch_run function, we could add a model_run function, where you can attach a datacollector and/or a stop condition to the model. I agree that the data collector should be separate from the model definition. I disagree that our visualisation module should depend on a data collector; for me, data collection runs and visualisation runs are conceptually different and usually don't depend on the same variables (e.g. mesa-interactive has no such dependency).
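A rough sketch of what such a `model_run` helper might look like. The signature, `CounterModel`, and `StepRecorder` are hypothetical; no such function exists in the thread beyond the idea itself.

```python
class CounterModel:
    """Hypothetical model whose only state is a step counter."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


class StepRecorder:
    """Hypothetical collector that records the model's step count."""

    def __init__(self, model):
        self.model, self.rows = model, []

    def collect(self):
        self.rows.append(self.model.steps)


def model_run(model, max_steps, data_collector=None, stop_condition=None):
    """Step the model, collecting after each step, until a stop condition holds."""
    for _ in range(max_steps):
        model.step()
        if data_collector is not None:
            data_collector.collect()
        if stop_condition is not None and stop_condition(model):
            break
    return model


model = CounterModel()
recorder = StepRecorder(model)
model_run(
    model,
    max_steps=100,
    data_collector=recorder,
    stop_condition=lambda m: m.steps >= 5,
)
```

The model class itself never mentions data collection, which is the separation being argued for.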

rht · 2024-01-28T22:56:47Z

I disagree that our visualisation module should depend on a data collector, for me data collection runs and visualisation runs are conceptually different and usually don't depend on the same variables (e.g. mesa-interactive has no such dependency).

In some situations, the viz module has to depend on the data collection output. The Schelling's happy agent count is taken from the data collector output, not the model's direct attribute.

This means that the structure for small and large models has to diverge. For a small model, it's OK to solder in the data collection and have JupyterViz detect whether the model contains a datacollector attribute. Otherwise, it looks for the optional argument to fetch the DF from the data collector object(s).
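The lookup order described above could be sketched as follows. `find_data_collector` is a hypothetical helper, not part of JupyterViz, and strings stand in for real collector objects.

```python
from types import SimpleNamespace


def find_data_collector(model, data_collector=None):
    """Use model.datacollector if present, else the explicitly passed one."""
    attached = getattr(model, "datacollector", None)
    return attached if attached is not None else data_collector


# Placeholder strings stand in for real collector objects.
small = SimpleNamespace(datacollector="attached")
large = SimpleNamespace()

from_small = find_data_collector(small)
from_large = find_data_collector(large, data_collector="external")
```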

rht · 2024-01-28T23:06:46Z

Any objections to proposal 4?

quaquel · 2024-01-29T07:02:03Z

This means that the structure for small model and large model has to diverge.

I disagree with this. In my view, we need to develop a design that scales from small to large models rather than stimulate using (dirty) shortcuts in small models. The ongoing discussion is really useful for this.

So, conceptually, data collection and visualization are separate. The current practice of hijacking the data collector thus needs to change. It also gives rise to a problem I recently ran into: I had a Jupyter visualization that ran slower and slower because the data frame being displayed in one of the graphs became bigger and bigger. Ideally, you only want to add new data to a visual element rather than replace historic data.

I, however, agree with @rht that sometimes there are model statistics that we both want to store for later analysis and display in a GUI. So, one possibility would be to have some kind of Statistic class. This class is only responsible for tracking some state variables within the model (and possibly doing operations on them). The datacollector mechanism could query these statistics objects on each collect call. That is, the persistent storage of statistics over time is the responsibility of the datacollector. A GUI, likewise, could query these statistics objects for display purposes. If a graph shows the dynamics over time, it is the responsibility of the GUI element to handle that. Not the responsibility of the Statistics object.
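A minimal sketch of that division of responsibility. `Statistic` and `HistoryCollector` are hypothetical classes: the statistic only computes a current value, the collector owns the history, and a GUI element can query the same statistic directly.

```python
from types import SimpleNamespace


class Statistic:
    """Hypothetical: computes one current value from the model; keeps no history."""

    def __init__(self, name, func):
        self.name, self.func = name, func

    def value(self, model):
        return self.func(model)


class HistoryCollector:
    """Hypothetical data collector: owns the persistent time series."""

    def __init__(self, statistics):
        self.statistics = statistics
        self.history = {s.name: [] for s in statistics}

    def collect(self, model):
        for s in self.statistics:
            self.history[s.name].append(s.value(model))


model = SimpleNamespace(wealth=[1, 2, 3])
total = Statistic("total_wealth", lambda m: sum(m.wealth))
collector = HistoryCollector([total])

collector.collect(model)
model.wealth.append(4)
collector.collect(model)

# A GUI element can query the same statistic without touching the collector.
gui_label = f"{total.name}: {total.value(model)}"
```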

EwoutH · 2024-01-29T07:54:35Z

Definitely not my best work, but I want to throw this PR in, for inspiration:

Aggegrated agent metric in DataCollection, graph in ChartModule #1145

rht · 2024-01-29T12:23:10Z

I disagree with this. In my view, we need to develop a design that scales from small to large models rather than stimulate using (dirty) shortcuts in small models. The ongoing discussion is really useful for this.

This is akin to saying Python's print function shouldn't be part of builtins, and that the user has to go through a ceremony as in C/Java (e.g. #include <stdio.h>) because it is I/O rather than a logical building block of programming. In reality, Python does have sys.stdout.write in addition to print, and the latter provides a beginner-friendly interface.

But in the end, it just boils down to how the datacollector's collected data should be accessed: either from the model attribute or from an argument passed to JupyterViz. This is mainly a convention and doesn't affect the backend implementation that much. I can at least remove the hardcoding of model.datacollector.data_collection.to_df() later on.

It also gives rise to a problem I recently ran into: I had a Jupyter visualization that ran slower and slower because the data frame being displayed in one of the graphs became bigger and bigger. Ideally, you only want to add new data to a visual element rather than replace historic data.

The canned statistics functions in #1145 can help process the simulation state in a way that is single use only (or not stored unless by a state replayer).

quaquel · 2024-01-29T12:40:12Z

This is akin to saying Python's print function shouldn't be part of builtins,

No, print is built on top of sys.stdout.write for convenience. I am advocating for doing exactly the same: building a proper structure first and then providing convenience functions that cover commonly encountered use cases that are built on top of the proper structure.

rht · 2024-01-29T18:55:24Z

No, print is built on top of sys.stdout.write for convenience.

At least we are on the same page with the convenience of accessing a data collector via a model, for simple examples.

I am advocating for doing exactly the same: building a proper structure first and then providing convenience functions that cover commonly encountered use cases that are built on top of the proper structure.

This is not the full story. While I don't know how the concept of DataFrame came to be in R, at least I can see that the API of the DataFrame in pandas grew organically to suit usage needs, and that eventually faster backends (pyarrow instead of NumPy) were implemented. API and backend architecture may evolve semi-independently (as I said in #2013 (comment)). What matters is that the API should be designed in a way that doesn't restrict the backend possibilities.

If mistakes for the API are made, there is always Mesa 4.0 for further iterations (to begin with, the semver is devised to deal with API breaking changes, instead of architecture).

quaquel · 2024-01-29T19:53:43Z

At least we are on the same page with the convenience of accessing a data collector via a model, for simple examples.

I would not do that, even for simple models.

I want a convenient high-level API to specify data collection for simple cases (e.g., attributes from sets of agents and the model, end-of-life data from agents, simple callables operating on agents), which is built on a powerful set of classes and functions that users can use and extend for their own more complex models.

I also want to have a clean separation between data collection for later analysis and anything to do with user interfaces.


rht · 2024-02-03T08:25:04Z

I implemented proposal 4, #2013 (comment), because it is strictly better than the original proposal in terms of deduplicating the way users specify groups. The example in the docstring has been updated accordingly.


quaquel · 2024-02-24T13:09:15Z

Since we have concentrated the discussion in #1944, should we close this PR?

quaquel · 2024-02-24T13:09:38Z

Since we have concentrated the discussion in #1944, should we close this PR?

rht force-pushed the exp_dc branch from fe89a10 to 94be87f Compare January 28, 2024 18:56

rht requested a review from EwoutH February 1, 2024 00:13

rht added 2 commits February 3, 2024 03:23

feat: Implement experimental DataCollector API

bc8b40e

Rename _collect_element to _collect_group

b4afc6e

rht added a commit to rht/mesa that referenced this pull request Feb 3, 2024

feat: Implement proposal 4

d72715f

As described in projectmesa#2013 (comment)

rht force-pushed the exp_dc branch from 94be87f to d72715f Compare February 3, 2024 08:23

feat: Implement proposal 4

fe7a53d

As described in projectmesa#2013 (comment)

rht force-pushed the exp_dc branch from d72715f to fe7a53d Compare February 3, 2024 08:30

rht mentioned this pull request Feb 4, 2024

feat: Implement experimental DataCollector API 2 #2024

Closed

rht closed this Feb 24, 2024

rht deleted the exp_dc branch February 24, 2024 17:24
