@nablabits

This is an early attempt to fix #1329

I'm not very familiar with the code, but with my own knowledge and the help of an LLM I managed to put together this solution. I'm aware that this could be quite disruptive for users, as it changes the output of well-established functions such as to_list (see the example in this test). People more familiar with the use cases can judge this.

If we want to preserve the nested.level1.name structure, I think the options are to change either:

  1. how the signals are extracted here
  2. the to_pandas method, and more precisely the get_headers_with_length function (source), which is responsible for getting the columns that then become a multi-index.
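
To illustrate why dotted names end up as a multi-index, here is a rough sketch of the general idea. It is not datachain's actual code: `split_headers` is a hypothetical helper, and the padding behaviour is an assumption about how equal-length header tuples are typically prepared for a pandas MultiIndex.

```python
def split_headers(column_names, sep="."):
    """Hypothetical helper: split dotted names into equal-length tuples.

    Shorter paths are padded with empty strings so every tuple has the
    same depth, which is the shape a multi-level column index expects.
    """
    parts = [name.split(sep) for name in column_names]
    depth = max((len(p) for p in parts), default=0)
    padded = [tuple(p + [""] * (depth - len(p))) for p in parts]
    return padded, depth


headers, depth = split_headers(["nested.level1.name", "cnt"])
# headers holds one tuple per column; depth is the number of index levels
```

Any column with a dot in its name forces every other column to gain extra (empty) levels, which is what makes the resulting DataFrame awkward to use.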

Let me know what you think

(There are a couple of lint errors that we can address once a consensus on the approach is reached)

When nested columns are used in the `partition_by` argument of `group_by`, the output now flattens column names with double underscores (e.g., `nested__level1__name` instead of `nested.level1.name`) to avoid a MultiIndex in the pandas output.
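
The renaming described in this commit message can be sketched as a plain string transformation. This is only an illustration under stated assumptions: `flatten_column_name` is a hypothetical name, not a function from the PR, and the real change may handle more cases.

```python
def flatten_column_name(name, sep=".", flat_sep="__"):
    """Hypothetical helper: turn a dotted column path into a flat name."""
    return name.replace(sep, flat_sep)


def flatten_row_keys(row):
    """Apply the same renaming to every key of a dict-like row."""
    return {flatten_column_name(key): value for key, value in row.items()}


flat_name = flatten_column_name("nested.level1.name")
flat_row = flatten_row_keys({"nested.level1.name": "a", "cnt": 2})
```

Because the flattened names contain no dots, each column maps to a single-level header, at the cost of losing the nested.level1.name structure discussed above.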

datachain-ai#1329
Contributor

@shcheklein left a comment

@nablabits PTAL https://github.com/datachain-ai/datachain/pull/1496/files. Could you try to run the use case on top of that PR?

@nablabits
Author

nablabits commented Dec 13, 2025

@shcheklein Yep, I have rebased the branch on top of #1496 and yes, it is working as expected, see: nablabits@1570d50

I will close this now, but feel free to reopen if something else comes up 🙂

Edit: wait, it's working because I didn't remove the flatten=True 🤦

Development

Successfully merging this pull request may close these issues.

Group by nested column doesn't work
