jorisvandenbossche
Member

Resolves some xfails by restoring the same padding as we had before.

On current main with string dtype:

>>> pd.CategoricalIndex(["a", "bb", "ccc"] * 10)
CategoricalIndex([  'a',  'bb', 'ccc',   'a',  'bb', 'ccc',   'a',  'bb',
                  'ccc',   'a',  'bb', 'ccc',   'a',  'bb', 'ccc',   'a',
                   'bb', 'ccc',   'a',  'bb', 'ccc',   'a',  'bb', 'ccc',
                    'a',  'bb', 'ccc',   'a',  'bb', 'ccc'],
                 categories=['a', 'bb', 'ccc'], ordered=False, dtype='category')

With this PR (which matches what it looks like with object dtype):

>>> pd.CategoricalIndex(["a", "bb", "ccc"] * 10)
CategoricalIndex(['a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a',
                  'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb',
                  'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc', 'a', 'bb', 'ccc'],
                 categories=['a', 'bb', 'ccc'], ordered=False, dtype='category')

@jbrockmendel
Member

On second look, I retract my claim that the old padding is nicer. No preference.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jul 18, 2025

I think the non-aligned version (how it was before, and how it is with object dtype) is better, especially for cases where your categories have different lengths. The example here only has 1 vs 3 characters, but for example:

# on main with str dtype / without this PR
>>> pd.CategoricalIndex(["low", "intermediate", "high", "low"] * 10)
CategoricalIndex([         'low', 'intermediate',         'high',
                           'low',          'low', 'intermediate',
                          'high',          'low',          'low',
                  'intermediate',         'high',          'low',
                           'low', 'intermediate',         'high',
                           'low',          'low', 'intermediate',
                          'high',          'low',          'low',
                  'intermediate',         'high',          'low',
                           'low', 'intermediate',         'high',
                           'low',          'low', 'intermediate',
                          'high',          'low',          'low',
                  'intermediate',         'high',          'low',
                           'low', 'intermediate',         'high',
                           'low'],
                 categories=[high, intermediate, low], ordered=False, dtype='category')

vs

# with object dtype / with str dtype with this PR
>>> pd.CategoricalIndex(["low", "intermediate", "high", "low"] * 10)
CategoricalIndex(['low', 'intermediate', 'high', 'low', 'low', 'intermediate',
                  'high', 'low', 'low', 'intermediate', 'high', 'low', 'low',
                  'intermediate', 'high', 'low', 'low', 'intermediate', 'high',
                  'low', 'low', 'intermediate', 'high', 'low', 'low',
                  'intermediate', 'high', 'low', 'low', 'intermediate', 'high',
                  'low', 'low', 'intermediate', 'high', 'low', 'low',
                  'intermediate', 'high', 'low'],
                 categories=['high', 'intermediate', 'low'], ordered=False, dtype='category')

Of course this can also happen with non-strings such as integers, but I think it is a lot less common.
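
To make the line-count difference concrete, here is a minimal pure-Python sketch of the two layouts. It is not the pandas implementation: `render_items` and `wrap_items` are hypothetical helpers, and the 80-character width is an assumption chosen to mirror the output above.

```python
def render_items(values, justify):
    """Return the repr of each value, optionally right-justified to a common width."""
    reprs = [repr(v) for v in values]
    if justify:
        width = max(len(r) for r in reprs)
        reprs = [r.rjust(width) for r in reprs]
    return reprs


def wrap_items(items, prefix, line_width=80):
    """Greedily wrap comma-separated items after `prefix` (hypothetical helper).

    Continuation lines are indented to line up with the first item, and one
    column is reserved for the trailing "," or "]".
    """
    indent = " " * len(prefix)
    rows, current = [], []
    for item in items:
        candidate = ", ".join(current + [item])
        if current and len(indent) + len(candidate) + 1 > line_width:
            rows.append(current)
            current = [item]
        else:
            current.append(item)
    if current:
        rows.append(current)

    lines = []
    for i, row in enumerate(rows):
        lead = prefix if i == 0 else indent
        trail = "," if i < len(rows) - 1 else "]"
        lines.append(lead + ", ".join(row) + trail)
    return lines


values = ["low", "intermediate", "high", "low"] * 10
prefix = "CategoricalIndex(["

# Padded layout: every item is as wide as the longest repr.
padded = wrap_items(render_items(values, justify=True), prefix)
# Compact layout: items keep their natural width.
compact = wrap_items(render_items(values, justify=False), prefix)

print("\n".join(padded))
print("\n".join(compact))
print(len(padded), "lines padded vs", len(compact), "lines compact")
```

With one long category, the padded layout spends most of each line on whitespace, so the same 40 values take noticeably more lines than the compact layout.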

@jorisvandenbossche jorisvandenbossche merged commit 8de38e8 into pandas-dev:main Jul 19, 2025
50 of 51 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-categorical-index-repr-justify branch July 19, 2025 10:34
Labels
Bug Output-Formatting __repr__ of pandas objects, to_string