
@davidjcastrejon davidjcastrejon commented Nov 3, 2025

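MultiIndex.factorize() silently converted extension dtypes (Int64, boolean, string) to base NumPy dtypes, so the returned uniques no longer matched the input: Int64 came back as int64, and an Int32 level containing a missing value came back as float64. This fix preserves extension dtypes by restoring them level by level after factorization.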

Before:

import pandas as pd

mi = pd.MultiIndex.from_arrays([pd.array([1, 2, 3], dtype="Int64")])
codes, uniques = mi.factorize()
print(uniques.dtypes.iloc[0])  # int64 ← Lost extension dtype

x = pd.Series([1, None], dtype='Int32').to_frame(name='col')

# This is 'Int32Dtype()' as expected
print(pd.MultiIndex.from_frame(x).to_frame()['col'].dtype)

# This is float64
print(pd.MultiIndex.from_frame(x).factorize()[1].to_frame().iloc[:, 0].dtype)

After:

import pandas as pd

mi = pd.MultiIndex.from_arrays([pd.array([1, 2, 3], dtype="Int64")])
codes, uniques = mi.factorize()
print(uniques.dtypes.iloc[0])  # Int64 ← Extension dtype preserved

x = pd.Series([1, None], dtype='Int32').to_frame(name='col')

# This is 'Int32Dtype()' as expected
print(pd.MultiIndex.from_frame(x).to_frame()['col'].dtype)

# This is Int32Dtype()
print(pd.MultiIndex.from_frame(x).factorize()[1].to_frame().iloc[:, 0].dtype)

Performance Increase:
Some MultiIndex operations ~10% faster due to better type consistency.

Benchmarks:

asv continuous -f 1.1 upstream/main HEAD -b ^multiindex_object

@davidjcastrejon davidjcastrejon changed the title from "BUG: Fix multiindex factorize extension dtypes (#62337)" to "BUG: Fix multiindex factorize extension dtypes" on Nov 3, 2025
Member

@mroeschke mroeschke left a comment

Thanks for the PR, but I think this works around the core issue: algorithms.factorize is called on self._values, which for a MultiIndex is just a NumPy array.

I think MultiIndex would need to override factorize and use a custom implementation if any level has an ExtensionDtype.
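
As a rough illustration of that direction, here is a minimal sketch that factorizes from the per-level integer codes instead of the tuple values, so the levels (and their dtypes) are never rebuilt. This is not the pandas implementation and not code from this PR: `factorize_via_codes` is a hypothetical helper name, it only covers the default unsorted case, and it assumes a pandas version where MultiIndex levels can hold extension dtypes.

```python
import numpy as np
import pandas as pd


def factorize_via_codes(mi: pd.MultiIndex):
    """Hypothetical helper: factorize a MultiIndex without touching its values."""
    # Stack the per-level codes into an (n_rows, n_levels) integer array.
    # A missing value is stored as code -1, so it is compared like any other code.
    stacked = np.column_stack([np.asarray(c) for c in mi.codes])

    # Distinct code rows, the position where each first occurs,
    # and the group number of every input row.
    _, first_pos, inverse = np.unique(
        stacked, axis=0, return_index=True, return_inverse=True
    )
    inverse = np.asarray(inverse).ravel()

    # np.unique sorts rows lexicographically; renumber the groups by
    # order of first appearance, which is what an unsorted factorize returns.
    order = np.argsort(first_pos, kind="stable")
    rank = np.empty(len(order), dtype=np.intp)
    rank[order] = np.arange(len(order))

    codes = rank[inverse]
    # take() reuses the existing levels, so their dtypes are left as they are.
    uniques = mi.take(first_pos[order])
    return codes, uniques


mi = pd.MultiIndex.from_arrays(
    [
        pd.array([2, 1, 2, None], dtype="Int64"),
        pd.array(["b", "a", "b", "a"], dtype="string"),
    ]
)
codes, uniques = factorize_via_codes(mi)
print(codes)           # [0 1 0 2]
print(uniques.dtypes)  # same level dtypes as mi.dtypes
```

Because the result is built with take() on the original index, there is no dtype to restore afterwards, which is the difference from casting each level back after the fact.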

Development

Successfully merging this pull request may close these issues.

BUG: factorize does not preserve extension dtypes
