Skip to content

*: shrink types.Datum from 72 to 56 bytes#67977

Open
mjonss wants to merge 5 commits intopingcap:masterfrom
mjonss:Datum-improvement
Open

*: shrink types.Datum from 72 to 56 bytes#67977
mjonss wants to merge 5 commits intopingcap:masterfrom
mjonss:Datum-improvement

Conversation

@mjonss
Copy link
Copy Markdown
Contributor

@mjonss mjonss commented Apr 22, 2026

What problem does this PR solve?

Issue Number: close #67976

Problem Summary:

types.Datum is a 72-byte value type that appears on every row- and value-level
path in TiDB: row decoding, expression evaluation, planner constant folding,
index lookups, join keys, and stats sampling. A significant share of those bytes
is waste:

  • collation string (16 B) stores a collation name per Datum, but the set of
    collations is a small fixed set of MySQL ids.
  • decimal uint16 (2 B) carries frac, whose maximum value is 30.
  • The first struct word had a 1-byte alignment hole.

Shrinking the struct reduces the per-row working set (especially noticeable for
[]Datum of N columns, where savings scale linearly) and also turns the
collation comparison in Datum.Equals into an integer compare.

What changed and how does it work?

1. Benchmark baseline (first commit)

  • Enabled b.ReportAllocs() on existing Datum-adjacent benches in pkg/types,
    pkg/tablecodec, pkg/util/codec so benchstat can diff B/op and
    allocs/op.
  • Added three targeted benches in pkg/types:
    • BenchmarkDatumCopy — tight Copy loop over a heterogeneous []Datum;
      most direct signal on struct size.
    • BenchmarkDatumSetMysqlTime — locks in the current 1 alloc / 8 B per call
      caused by boxing Time through d.x any.
    • BenchmarkCompareDatumCollation — exercises the named-collator compare
      path that BenchmarkCompareDatum bypasses via the binary collator.

2. Shrink types.Datum from 72 to 56 bytes (second commit)

  • Replaced the 16-byte collation string field with a 2-byte
    collationID uint16 carrying the MySQL collation id (0 = unset, matching the
    legacy empty-string sentinel).
  • Shrunk decimal from uint16 to uint8.
  • The freed padding absorbs the new collationID so the first 8-byte word is
    fully packed with k + decimal + collationID + length.

Layout transition:

before (72 B):  k(1) pad(1) decimal(2) length(4) i(8) collation(16) b(24) x(16)
after  (56 B):  k(1) decimal(1) collationID(2) length(4) i(8) b(24) x(16)

API compatibility:

  • Collation() string and SetCollation(string) are preserved as facades that
    resolve via the charset package. External callers keep working unchanged.
  • New CollationID() uint16 / SetCollationID(uint16) for hot paths.
  • MarshalJSON / UnmarshalJSON keep the on-wire "collation" field as a name
    string, so older JSON blobs round-trip.
  • collate.GetCollatorByID(0) now falls through silently to match
    GetCollator("").

Hot-path callers were migrated from collate.GetCollator(d.Collation()) to
collate.GetCollatorByID(int(d.CollationID())) so they do a single map lookup
instead of a name lookup followed by a collator lookup:
util/codec/codec.go, util/ranger/detacher.go,
planner/cardinality/row_count_column.go, table/tables/mutation_checker.go
(2 sites), executor/foreign_key.go, plus all internal Datum compare/key paths
in pkg/types/datum.go.

Datum.MemUsage drops the len(collation) term; the collation is now inline
and fully accounted for in EmptyDatumSize. TestRangeMemUsage and
GetDatumMemSize in aggfuncs/func_group_concat follow suit.

Measurements

benchstat, linux/amd64, Intel i7-13700, -count=5:

Benchmark Before After Delta
unsafe.Sizeof(types.Datum) 72 B 56 B −22%
BenchmarkDecodeWithSize (codec) B/op 8192 6144 −25%
BenchmarkDecodeWithSize (codec) ns/op 2435 1862 −24%
BenchmarkDecodeWithOutSize (codec) B/op 20.88 KiB 15.83 KiB −24%
BenchmarkCompareDatum (types) ns/op 34.06 32.12 −5.7%
BenchmarkEncodeValue (tablecodec) ns/op 135 122 −10%

No allocation regressions observed across the full bench set.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Reduce the in-memory size of types.Datum from 72 to 56 bytes by replacing the per-Datum collation name string with a uint16 collation id, shrinking the frac field, and repacking the struct.

Summary by CodeRabbit

  • Performance & Optimization

    • Reduced per-value memory accounting and tightened collation storage, lowering memory estimates and improving efficiency for string comparisons and range construction.
    • Standardized collation resolution across validation, planning, and encoding, improving consistency of comparisons and encoding behavior.
  • Tests

    • Enabled allocation reporting in multiple benchmarks and added new collation/Datum benchmarks.
    • Updated integration test expectations to match revised memory/warning thresholds.

mjonss and others added 2 commits April 22, 2026 18:08
Turn on b.ReportAllocs() across Datum-related benchmarks so benchstat can
diff B/op and allocs/op against upcoming struct-size changes. Add three
new targeted benches in pkg/types:

- BenchmarkDatumCopy: tightest signal on Datum struct size (heterogeneous
  slice copied per iteration).
- BenchmarkDatumSetMysqlTime: locks in the current 1 alloc / 8 B per call
  caused by boxing Time through d.x; will drop to 0 allocs once Time packs
  into d.i.
- BenchmarkCompareDatumCollation: exercises the named-collator compare
  path, which the existing BenchmarkCompareDatum skips by using the
  binary collator.

Behavior-only change in benchmarks; no production code touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the 16-byte `collation string` field on types.Datum with a 2-byte
`collationID uint16` carrying the MySQL collation id, and shrink
`decimal uint16` (frac; MySQL DECIMAL frac <= 30, fsp <= 6) to `uint8`.
The freed padding absorbs the new collationID so the first struct word
stays 8 bytes. Net: Datum is now 56 bytes instead of 72 (-22%), every
`[]Datum` allocation is correspondingly smaller, and Datum.Equals now
compares collations with a uint16 compare instead of a string compare.

API compat:
- Collation() string and SetCollation(string) are preserved as
  facades that resolve via the charset package. External callers keep
  working unchanged.
- New CollationID() uint16 / SetCollationID(uint16) for hot paths.
- MarshalJSON/UnmarshalJSON keep the on-wire "collation" field as a
  name string, so older JSON blobs round-trip.
- collate.GetCollatorByID(0) now falls through silently (matching
  GetCollator("")), since id 0 is the uint16 counterpart of the
  legacy empty-string "no collation" sentinel.

Hot-path callers migrated from GetCollator(d.Collation()) to
GetCollatorByID(int(d.CollationID())) in:
codec.encodeString, ranger/detacher, planner/cardinality,
table/tables/mutation_checker, executor/foreign_key,
plus the internal Datum compare/key paths.

Datum.MemUsage drops the len(collation) term — the collation is now
inline (counted by EmptyDatumSize); TestRangeMemUsage is updated to
match. GetDatumMemSize in aggfuncs/func_group_concat follows suit.

Measured on linux/amd64, Intel i7-13700:
- unsafe.Sizeof(types.Datum) = 56 B (was 72 B)
- BenchmarkDecodeWithSize (codec):  8.0 KiB -> 6.0 KiB B/op (-25%),
  2435 -> 1862 ns/op (-24%)
- BenchmarkDecodeWithOutSize (codec): 20.88 KiB -> 15.83 KiB (-24%)
- BenchmarkCompareDatum (types): 34.06 -> 32.12 ns/op (-5.7%)
- BenchmarkEncodeValue (tablecodec): 135 -> 122 ns/op (-10%)
- No allocation regressions across the bench set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-tests-checked labels Apr 22, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Walkthrough

Datum in-memory layout was compacted (collation name → collationID; decimal narrowed), callers switched to collator-by-ID, memory accounting and benchmarks updated, and numerous tests adjusted to the new Datum representation and mem-usage expectations.

Changes

Cohort / File(s) Summary
Core Datum & tests
pkg/types/datum.go, pkg/types/datum_test.go
Replaced collation stringcollationID uint16, decimal uint16uint8; added CollationID()/SetCollationID(); Collation() resolves ID→name; updated MemUsage, JSON marshal/unmarshal, equality semantics, and added/updated benchmarks/tests.
Collator lookup callsites
pkg/executor/foreign_key.go, pkg/planner/cardinality/row_count_column.go, pkg/table/tables/mutation_checker.go, pkg/util/codec/codec.go, pkg/util/ranger/detacher.go
Migrated collator lookups to collate.GetCollatorByID(int(d.CollationID())) (replacing name-based GetCollator(...)) for comparisons and encoding paths.
Memory accounting & aggfuncs
pkg/executor/aggfuncs/func_group_concat.go, pkg/util/ranger/types_test.go
Removed per-datum variable collation string length from memory-size calculations; tests updated accordingly.
Collation utilities
pkg/util/collate/collate.go
GetCollatorByID treats id == 0 as “no collation set” and suppresses noisy warnings for unknown id 0; fallback remains binary collator.
Benchmark instrumentation
pkg/tablecodec/bench_test.go, pkg/tablecodec/tablecodec_test.go, pkg/util/codec/bench_test.go
Added b.ReportAllocs() to multiple benchmarks to include allocation metrics; no functional benchmark logic changed.
Planner / integration tests
tests/integrationtest/.../index/index.test, tests/integrationtest/.../index/index.result, pkg/util/ranger/types_test.go
Adjusted integration test thresholds and range memory-usage expectations to match reduced memory estimates from inline collation ID.
Minor accessor fixes
pkg/types/field_type.go
InferParamTypeFromDatum now uses d.Collation() accessor rather than direct field access.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/XXL, ok-to-test

Suggested reviewers

  • qw4990
  • yudongusa
  • 0xPoe

Poem

🐰 I swapped my strings for tiny IDs today,
Datums slim and nimbly hop away,
Cache lines cheer, benchmarks dance bright,
A rabbit’s tweak made memory light! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: shrinking types.Datum from 72 to 56 bytes by repacking the struct.
Description check ✅ Passed The PR description comprehensively covers all required sections: problem statement with issue link, detailed changes, measurements, test coverage checklist, and release note.
Linked Issues check ✅ Passed All coding objectives from issue #67976 are met: Datum struct reduced from 72 to 56 bytes via collationID uint16, decimal uint8, API facades preserved, hot paths migrated, MemUsage updated, and benchmarks added with measured gains.
Out of Scope Changes check ✅ Passed All changes align with the PR objectives: struct repacking, collation field migration, hot-path caller updates, benchmark instrumentation, and test additions are all directly in scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added sig/planner SIG: Planner size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 22, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cfzjywxk, elsa0520, xuhuaiyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tiprow
Copy link
Copy Markdown

tiprow Bot commented Apr 22, 2026

Hi @mjonss. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mjonss mjonss changed the title types: Datum storage improvement *: shrink types.Datum from 72 to 56 bytes Apr 22, 2026
@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 22, 2026
@mjonss mjonss marked this pull request as ready for review April 22, 2026 18:32
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/types/datum_test.go`:
- Around line 619-634: The benchmark comment is inaccurate:
BenchmarkCompareDatumCollation currently passes a collator into Datum.Compare
(the local variable coll), so Compare → compareString uses the provided comparer
and never calls d1.Collation() or reads d1.collationID; update the benchmark or
its comment. Either reword the docstring on BenchmarkCompareDatumCollation to
say it measures the cost of using an external collator via the comparer argument
(coll) or change the benchmark to exercise the Datum.Collation() path (e.g.,
call Compare without supplying the coll argument so compareString obtains the
collator from d1.Collation()/d1.collationID). Make the change around the
BenchmarkCompareDatumCollation function and mention Datum.Compare,
compareString, Datum.Collation()/collationID, and the local coll variable in
your fix.

In `@pkg/types/datum.go`:
- Around line 167-170: The SetFrac method silently narrows int to uint8; add a
guard using intest.Assert to ensure the provided frac is within the allowable
range before casting and assigning in Datum.SetFrac. Specifically, in
Datum.SetFrac(frac int) add an assertion such as intest.Assert(frac >= 0 && frac
<= 255) (or the tighter documented invariant if you prefer, e.g., frac <= 30 for
DECIMAL) and then perform d.decimal = uint8(frac); this will catch
UnspecifiedLength (-1) or out-of-range values passed from callers like
convertToMysqlDecimal/target.GetDecimal() or ProduceDecWithSpecifiedTp instead
of silently truncating.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4671e5c7-975b-4ed2-826b-d8ddf3a2ce50

📥 Commits

Reviewing files that changed from the base of the PR and between 5581189 and 2a3c81e.

📒 Files selected for processing (14)
  • pkg/executor/aggfuncs/func_group_concat.go
  • pkg/executor/foreign_key.go
  • pkg/planner/cardinality/row_count_column.go
  • pkg/table/tables/mutation_checker.go
  • pkg/tablecodec/bench_test.go
  • pkg/tablecodec/tablecodec_test.go
  • pkg/types/datum.go
  • pkg/types/datum_test.go
  • pkg/types/field_type.go
  • pkg/util/codec/bench_test.go
  • pkg/util/codec/codec.go
  • pkg/util/collate/collate.go
  • pkg/util/ranger/detacher.go
  • pkg/util/ranger/types_test.go

Comment thread pkg/types/datum_test.go
Comment on lines +619 to +634
// BenchmarkCompareDatumCollation compares collated string datums. Unlike
// BenchmarkCompareDatum, the datums carry a non-empty collation and the
// compare path dispatches through the named collator instead of the binary
// shortcut, so it exercises Datum.Collation() on the hot path.
func BenchmarkCompareDatumCollation(b *testing.B) {
d1 := NewCollationStringDatum("abcdefghij", charset.CollationUTF8MB4)
d2 := NewCollationStringDatum("abcdefghik", charset.CollationUTF8MB4)
coll := collate.GetCollator(charset.CollationUTF8MB4)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
if _, err := d1.Compare(DefaultStmtNoWarningContext, &d2, coll); err != nil {
b.Fatal(err)
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Benchmark comment is misleading — the compare path doesn't dispatch through Datum.Collation().

Datum.Compare → compareString uses the comparer argument you pass in (coll here); it never reads d1.Collation() / d1.collationID. So the bench measures collated comparator cost, not any Collation() accessor on the hot path. Consider rewording the doc, or change what the bench is intended to measure.

📝 Suggested comment tweak
-// BenchmarkCompareDatumCollation compares collated string datums. Unlike
-// BenchmarkCompareDatum, the datums carry a non-empty collation and the
-// compare path dispatches through the named collator instead of the binary
-// shortcut, so it exercises Datum.Collation() on the hot path.
+// BenchmarkCompareDatumCollation compares collated string datums through
+// a named (non-binary) collator. Unlike BenchmarkCompareDatum, this avoids
+// the binary-collator shortcut so the cost of the collated comparison
+// dominates the measurement.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// BenchmarkCompareDatumCollation compares collated string datums. Unlike
// BenchmarkCompareDatum, the datums carry a non-empty collation and the
// compare path dispatches through the named collator instead of the binary
// shortcut, so it exercises Datum.Collation() on the hot path.
func BenchmarkCompareDatumCollation(b *testing.B) {
d1 := NewCollationStringDatum("abcdefghij", charset.CollationUTF8MB4)
d2 := NewCollationStringDatum("abcdefghik", charset.CollationUTF8MB4)
coll := collate.GetCollator(charset.CollationUTF8MB4)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
if _, err := d1.Compare(DefaultStmtNoWarningContext, &d2, coll); err != nil {
b.Fatal(err)
}
}
}
// BenchmarkCompareDatumCollation compares collated string datums through
// a named (non-binary) collator. Unlike BenchmarkCompareDatum, this avoids
// the binary-collator shortcut so the cost of the collated comparison
// dominates the measurement.
func BenchmarkCompareDatumCollation(b *testing.B) {
d1 := NewCollationStringDatum("abcdefghij", charset.CollationUTF8MB4)
d2 := NewCollationStringDatum("abcdefghik", charset.CollationUTF8MB4)
coll := collate.GetCollator(charset.CollationUTF8MB4)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
if _, err := d1.Compare(DefaultStmtNoWarningContext, &d2, coll); err != nil {
b.Fatal(err)
}
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/types/datum_test.go` around lines 619 - 634, The benchmark comment is
inaccurate: BenchmarkCompareDatumCollation currently passes a collator into
Datum.Compare (the local variable coll), so Compare → compareString uses the
provided comparer and never calls d1.Collation() or reads d1.collationID; update
the benchmark or its comment. Either reword the docstring on
BenchmarkCompareDatumCollation to say it measures the cost of using an external
collator via the comparer argument (coll) or change the benchmark to exercise
the Datum.Collation() path (e.g., call Compare without supplying the coll
argument so compareString obtains the collator from
d1.Collation()/d1.collationID). Make the change around the
BenchmarkCompareDatumCollation function and mention Datum.Compare,
compareString, Datum.Collation()/collationID, and the local coll variable in
your fix.

Comment thread pkg/types/datum.go
Comment on lines 167 to 170
// SetFrac sets the frac of the datum.
func (d *Datum) SetFrac(frac int) {
d.decimal = uint16(frac)
d.decimal = uint8(frac)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all SetFrac call sites and show surrounding context to audit negative/UnspecifiedLength inputs.
rg -nP --type=go -C2 '\.SetFrac\s*\('

Repository: pingcap/tidb

Length of output: 3893


🏁 Script executed:

#!/bin/bash
# 1. Show the context around line 1615 in convertToMysqlDecimal
echo "=== convertToMysqlDecimal context (lines 1600–1630) ==="
sed -n '1600,1630p' pkg/types/datum.go

# 2. Check what GetDecimal() returns and UnspecifiedLength value
echo -e "\n=== UnspecifiedLength constant definition ==="
rg -n 'const\s+UnspecifiedLength' pkg/types/

# 3. Check GetDecimal() method signature
echo -e "\n=== GetDecimal() method definition ==="
rg -n 'func.*GetDecimal\(' pkg/types/field_type.go | head -5

# 4. Check actual values returned by GetDecimal in realistic scenarios
echo -e "\n=== Sample GetDecimal() usages to understand typical values ==="
rg -B2 -A1 '\.GetDecimal\(' pkg/types/datum.go | head -30

Repository: pingcap/tidb

Length of output: 1890


🏁 Script executed:

#!/bin/bash
# 5. Check if Frac() is actually used after SetFrac in line 1615 caller chain
echo "=== Who calls convertToMysqlDecimal? ==="
rg -n 'convertToMysqlDecimal' pkg/types/ | head -10

# 6. Show function signature of convertToMysqlDecimal to understand target parameter
echo -e "\n=== convertToMysqlDecimal function signature ==="
sed -n '1600,1620p' pkg/types/datum.go | grep -A5 'func.*convertToMysqlDecimal'

Repository: pingcap/tidb

Length of output: 535


🏁 Script executed:

#!/bin/bash
# 7. Verify: does row.go line 184 actually pass UnspecifiedLength?
echo "=== chunk/row.go lines 175–195 (context around SetFrac calls) ==="
sed -n '175,195p' pkg/util/chunk/row.go

Repository: pingcap/tidb

Length of output: 879


🏁 Script executed:

#!/bin/bash
# 8. Check if intest.Assert is already used elsewhere in datum.go for type contracts
echo "=== intest.Assert patterns in datum.go ==="
rg -n 'intest.Assert' pkg/types/datum.go

Repository: pingcap/tidb

Length of output: 131


🏁 Script executed:

#!/bin/bash
# Check who calls ConvertDatum (line 1091 references convertToMysqlDecimal)
echo "=== ConvertDatum function and its callers ==="
sed -n '1050,1100p' pkg/types/datum.go | head -50

# Find all callers of ConvertDatum
echo -e "\n=== ConvertDatum callers ==="
rg -n 'ConvertDatum\(' pkg/ --type=go | head -15

Repository: pingcap/tidb

Length of output: 2021


🏁 Script executed:

#!/bin/bash
# Check if there are any tests that verify SetFrac with UnspecifiedLength or -1
echo "=== Tests calling SetFrac or testing frac narrowing ==="
rg -n 'SetFrac.*-1|UnspecifiedLength.*SetFrac' pkg/types/ --type=go -B2 -A2

# Also check: are there any test cases for convertToMysqlDecimal with UnspecifiedLength?
echo -e "\n=== convertToMysqlDecimal tests ==="
rg -n 'convertToMysqlDecimal' pkg/types/ --type=go -B1 -A3

Repository: pingcap/tidb

Length of output: 722


🏁 Script executed:

#!/bin/bash
# Check the actual GetDecimal() implementation to confirm it can return -1
echo "=== FieldType.GetDecimal() implementation ==="
rg -A5 'func.*\*FieldType\).*GetDecimal' pkg/types/field_type.go

Repository: pingcap/tidb

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Search for GetDecimal method more broadly
echo "=== Searching for GetDecimal method ==="
rg -n 'func.*GetDecimal' pkg/types/ --type=go

# Also check the actual field definition
echo -e "\n=== FieldType struct definition ==="
sed -n '1,100p' pkg/types/field_type.go | grep -A20 'type FieldType struct'

Repository: pingcap/tidb

Length of output: 213


🏁 Script executed:

#!/bin/bash
# Now check: can convertToMysqlDecimal receive a FieldType with Decimal == -1?
# Look at callers of ConvertTo to understand what FieldTypes are passed
echo "=== ConvertTo callers in tests/real code ==="
rg -n '\.ConvertTo\(' pkg/ --type=go -B1 -A1 | head -30

Repository: pingcap/tidb

Length of output: 1614


🏁 Script executed:

#!/bin/bash
# Key: check if any cast/decimal type with GetDecimal() == -1 is passed to ConvertTo
# Look at tests that call ConvertTo with TypeNewDecimal
echo "=== Tests for ConvertTo with TypeNewDecimal ==="
rg -n 'TypeNewDecimal' pkg/types/datum_test.go -B2 -A2 | head -40

Repository: pingcap/tidb

Length of output: 1382


🏁 Script executed:

#!/bin/bash
# Check FieldTypeBuilder to see what the default decimal value is
echo "=== FieldTypeBuilder implementation ==="
sed -n '1,100p' pkg/types/field_type_builder.go

# Also check NewFieldType helper
echo -e "\n=== NewFieldType function ==="
rg -A10 '^func NewFieldType' pkg/types/field_type.go

Repository: pingcap/tidb

Length of output: 3556


🏁 Script executed:

#!/bin/bash
# Most importantly: check if any caller of ConvertTo passes a FieldType 
# where GetDecimal() was not explicitly set (would default to -1)
echo "=== Checking ConvertTo calls with default FieldType ==="
rg -B5 'ConvertTo.*NewFieldType' pkg/ --type=go | head -40

Repository: pingcap/tidb

Length of output: 2759


🏁 Script executed:

#!/bin/bash
# Also check the actual behavior: does convertToMysqlDecimal actually fail or has issues
# when SetFrac gets -1?
echo "=== What happens after SetFrac in convertToMysqlDecimal ==="
sed -n '1612,1680p' pkg/types/datum.go

Repository: pingcap/tidb

Length of output: 2147


🏁 Script executed:

#!/bin/bash
# Check minFlenAndDecimalForType to see default decimals for TypeNewDecimal
echo "=== minFlenAndDecimalForType function ==="
rg -A50 'func minFlenAndDecimalForType' pkg/types/field_type.go | head -70

Repository: pingcap/tidb

Length of output: 2150


🏁 Script executed:

#!/bin/bash
# Check if ProduceDecWithSpecifiedTp actually handles the case where decimal is -1
echo "=== ProduceDecWithSpecifiedTp full implementation ==="
sed -n '1653,1730p' pkg/types/datum.go

Repository: pingcap/tidb

Length of output: 2608


🏁 Script executed:

#!/bin/bash
# Most importantly: verify if Frac() is actually read and used after SetFrac
# Check what Frac() does
echo "=== Datum.Frac() method ==="
rg -A3 'func \(d \*Datum\) Frac\(' pkg/types/datum.go

Repository: pingcap/tidb

Length of output: 138


Silent narrowing in SetFrac — add a guard on the documented invariant.

uint8(frac) silently truncates values outside [0, 255]. The struct doc (line 78) documents the contract that MySQL fsp ≤ 6 and DECIMAL frac ≤ 30, but convertToMysqlDecimal (line 1615) passes target.GetDecimal() unconditionally, which can be UnspecifiedLength (-1) for newly constructed FieldTypes:

  • Old behavior: uint16(-1) = 65535, so Frac() returned 65535
  • New behavior: uint8(-1) = 255, so Frac() now returns 255

For KindMysqlDecimal the authoritative frac lives in *MyDecimal, and ProduceDecWithSpecifiedTp skips rounding when frac is unspecified, so the practical impact is small. However, like the pattern shown in nearby chunk/row.go (lines 181–184), the narrowing should be guarded to prevent silent value changes. Add an intest.Assert to catch any future violations:

Suggested guard
 // SetFrac sets the frac of the datum.
 func (d *Datum) SetFrac(frac int) {
+	intest.Assert(frac >= 0 && frac <= math.MaxUint8, "Datum.SetFrac: frac out of range for uint8")
 	d.decimal = uint8(frac)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// SetFrac sets the frac of the datum.
func (d *Datum) SetFrac(frac int) {
d.decimal = uint16(frac)
d.decimal = uint8(frac)
}
// SetFrac sets the frac of the datum.
func (d *Datum) SetFrac(frac int) {
intest.Assert(frac >= 0 && frac <= math.MaxUint8, "Datum.SetFrac: frac out of range for uint8")
d.decimal = uint8(frac)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/types/datum.go` around lines 167 - 170, The SetFrac method silently
narrows int to uint8; add a guard using intest.Assert to ensure the provided
frac is within the allowable range before casting and assigning in
Datum.SetFrac. Specifically, in Datum.SetFrac(frac int) add an assertion such as
intest.Assert(frac >= 0 && frac <= 255) (or the tighter documented invariant if
you prefer, e.g., frac <= 30 for DECIMAL) and then perform d.decimal =
uint8(frac); this will catch UnspecifiedLength (-1) or out-of-range values
passed from callers like convertToMysqlDecimal/target.GetDecimal() or
ProduceDecWithSpecifiedTp instead of silently truncating.

mjonss and others added 3 commits April 22, 2026 23:25
The inl_join range-fallback thresholds (2900/2300/700) were calibrated
to Datum's old 72-byte size. After shrinking Datum to 56 bytes, ranges
fit comfortably under the old limits and no longer exercise the
progressive fallback. Lower the thresholds to 2300/1500/500 so the
same three fallback levels (drop trailing col, drop two cols, INL_JOIN
inapplicable) are triggered again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shrinking Datum.decimal from uint16 to uint8 in 2a3c81e dropped the
int8 sign-extending cast in GetMysqlDuration, so Fsp=-1 (UnspecifiedFsp)
came back as 255 and crashed Duration.formatFrac with a slice-bounds
panic (TestDateArithFuncs).

Restore the int8 round-trip so the -1 sentinel survives the uint storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After shrinking Datum.collation from a byte string to a uint16 id, Equals
compares the canonical MySQL collation id instead of the raw name. That
makes "UTF8MB4_BIN" and "utf8mb4_bin" compare equal where they previously
compared unequal. It aligns with MySQL's case-insensitive collation names
but was not previously test-pinned, so a future refactor could silently
flip it back.

Add TestDatumEqualsCollation covering same-canonical/different-casing,
distinct-canonical, and the unknown-name-collapses-to-0 edge case. Also
call out the lossy name -> id storage in SetCollation's doc so future
readers understand the trade-off without hunting the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/types/datum.go`:
- Around line 2274-2280: The MarshalJSON in function MarshalJSON is currently
converting d.decimal/d.length to uint8/uint16 directly which loses the legacy
sentinel for UnspecifiedFsp/UnspecifiedLength; update the jsonDatum Decimal and
Length population so that when d.decimal == UnspecifiedFsp or d.length ==
UnspecifiedLength you set the marshaled uint16 value to 65535 (the old -1
sentinel) otherwise cast the actual value, i.e., preserve the legacy sentinel
mapping for the Decimal and Length fields in jsonDatum before returning the
JSON.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6bdb37f6-3adc-49f7-acf4-3d98cbdc99c8

📥 Commits

Reviewing files that changed from the base of the PR and between 2a3c81e and ac5a044.

📒 Files selected for processing (3)
  • pkg/types/datum.go
  • tests/integrationtest/r/planner/core/casetest/index/index.result
  • tests/integrationtest/t/planner/core/casetest/index/index.test

Comment thread pkg/types/datum.go
Comment on lines 2274 to +2280
func (d *Datum) MarshalJSON() ([]byte, error) {
jd := &jsonDatum{
K: d.k,
Decimal: d.decimal,
Decimal: uint16(d.decimal),
Length: d.length,
I: d.i,
Collation: d.collation,
Collation: d.Collation(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve the legacy JSON sentinel when marshaling narrowed decimal.

For UnspecifiedFsp/UnspecifiedLength, the old uint16 backing encoded -1 as 65535; this now marshals as 255. That breaks the stated JSON wire-compatibility goal for older readers, which would decode 255 as a positive FSP/frac instead of the sentinel.

Compatibility-preserving marshal adjustment
 func (d *Datum) MarshalJSON() ([]byte, error) {
+	decimal := uint16(d.decimal)
+	if d.decimal == ^uint8(0) {
+		decimal = ^uint16(0)
+	}
 	jd := &jsonDatum{
 		K:         d.k,
-		Decimal:   uint16(d.decimal),
+		Decimal:   decimal,
 		Length:    d.length,
 		I:         d.i,
 		Collation: d.Collation(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/types/datum.go` around lines 2274 - 2280, The MarshalJSON in function
MarshalJSON is currently converting d.decimal/d.length to uint8/uint16 directly
which loses the legacy sentinel for UnspecifiedFsp/UnspecifiedLength; update the
jsonDatum Decimal and Length population so that when d.decimal == UnspecifiedFsp
or d.length == UnspecifiedLength you set the marshaled uint16 value to 65535
(the old -1 sentinel) otherwise cast the actual value, i.e., preserve the legacy
sentinel mapping for the Decimal and Length fields in jsonDatum before returning
the JSON.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 62.06897% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.1441%. Comparing base (5581189) to head (604bac5).
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67977        +/-   ##
================================================
- Coverage   77.7952%   77.1441%   -0.6511%     
================================================
  Files          1984       1966        -18     
  Lines        549728     549753        +25     
================================================
- Hits         427662     424102      -3560     
- Misses       121146     125647      +4501     
+ Partials        920          4       -916     
Flag Coverage Δ
integration 40.8538% <62.0689%> (+1.0565%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4888% <ø> (ø)
parser ∅ <ø> (∅)
br 49.9988% <ø> (-13.0317%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 22, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/types/datum.go (1)

782-784: Stale datum.collation reference in Compare docstring.

The warning on line 783 references datum.collation, but that field no longer exists — it was replaced by collationID in this PR. Meanwhile, ToHashKey (line 2208/2214) and SortDatums (line 2463) now deliberately derive the collator from d.collationID via GetCollatorByID, so the comment is both stale (wrong field name) and somewhat inconsistent with the new pattern established by this PR. Consider updating it to clarify the actual invariant the caller must respect.

♻️ Suggested doc tweak
 // Compare compares datum to another datum.
-// Notes: don't rely on datum.collation to get the collator, it's tend to buggy.
+// Notes: callers must pass an explicit `comparer` for string-typed datums;
+// do not assume the datum's stored collationID alone selects the right
+// collator for the comparison context.
 func (d *Datum) Compare(ctx Context, ad *Datum, comparer collate.Collator) (int, error) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/types/datum.go` around lines 782 - 784, Update the stale docstring on
Datum.Compare: replace the reference to the removed datum.collation with the
current collationID-based pattern and clarify the caller contract — callers must
not rely on a non-existent per-datum collator field and should instead ensure
the correct collator is provided (or derive it via GetCollatorByID using
d.collationID, as done in ToHashKey and SortDatums); edit the comment on Compare
to mention collatationID and the established invariant that collators must be
obtained via GetCollatorByID when needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/types/datum_test.go`:
- Around line 558-585: In TestDatumEqualsCollation, ensure the case-variant
equality check isn't vacuous by asserting the resolved collation ID is non-zero
before comparing; specifically, after creating a and b (from mk("UTF8MB4_BIN")
and mk("utf8mb4_bin")), add an assertion that a.CollationID() (and/or
b.CollationID()) != 0 (use the test helper like require.NotZero or
require.NotEqual with uint16(0)) before calling require.Equal and require.True
on CollationID() and Equals(&b) so this block verifies a known (non-zero)
canonical collation.

---

Nitpick comments:
In `@pkg/types/datum.go`:
- Around line 782-784: Update the stale docstring on Datum.Compare: replace the
reference to the removed datum.collation with the current collationID-based
pattern and clarify the caller contract — callers must not rely on a
non-existent per-datum collator field and should instead ensure the correct
collator is provided (or derive it via GetCollatorByID using d.collationID, as
done in ToHashKey and SortDatums); edit the comment on Compare to mention
collatationID and the established invariant that collators must be obtained via
GetCollatorByID when needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f1189cf6-162c-42a5-ab47-7e2364f32b1a

📥 Commits

Reviewing files that changed from the base of the PR and between ac5a044 and 604bac5.

📒 Files selected for processing (2)
  • pkg/types/datum.go
  • pkg/types/datum_test.go

Comment thread pkg/types/datum_test.go
Comment on lines +558 to +585
func TestDatumEqualsCollation(t *testing.T) {
mk := func(name string) Datum {
var d Datum
d.SetString("abc", name)
return d
}

// Same canonical collation, different name casing → equal.
a := mk("UTF8MB4_BIN")
b := mk("utf8mb4_bin")
require.Equal(t, a.CollationID(), b.CollationID())
require.True(t, a.Equals(&b))

// Different canonical collations → not equal.
c := mk(charset.CollationUTF8MB4)
d := mk(charset.CollationASCII)
require.NotEqual(t, c.CollationID(), d.CollationID())
require.False(t, c.Equals(&d))

// Unknown names both collapse to id 0. The resulting equality is a
// documented consequence of lossy name → id storage; pin it so a
// future change doesn't accidentally flip it back to string compare.
u1 := mk("nonexistent_one")
u2 := mk("nonexistent_two")
require.Equal(t, uint16(0), u1.CollationID())
require.Equal(t, uint16(0), u2.CollationID())
require.True(t, u1.Equals(&u2))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Strengthen TestDatumEqualsCollation so the case-variant claim cannot pass vacuously.

The first block (lines 566-569) asserts a.CollationID() == b.CollationID() and a.Equals(&b), but does not assert the resolved ID is non-zero. If charset.GetCollationByName were ever changed such that "UTF8MB4_BIN"/"utf8mb4_bin" failed to resolve, collationNameToID would return 0 for both and this test would still pass — while silently merging into the unknown-name case lower down. Since the stated purpose is to pin canonical-id equality for known case-variant names, pin a non-zero ID here.

🛡️ Proposed tightening
 	// Same canonical collation, different name casing → equal.
 	a := mk("UTF8MB4_BIN")
 	b := mk("utf8mb4_bin")
+	require.NotZero(t, a.CollationID(), "expected UTF8MB4_BIN to resolve to a known collation id")
 	require.Equal(t, a.CollationID(), b.CollationID())
 	require.True(t, a.Equals(&b))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestDatumEqualsCollation(t *testing.T) {
mk := func(name string) Datum {
var d Datum
d.SetString("abc", name)
return d
}
// Same canonical collation, different name casing → equal.
a := mk("UTF8MB4_BIN")
b := mk("utf8mb4_bin")
require.Equal(t, a.CollationID(), b.CollationID())
require.True(t, a.Equals(&b))
// Different canonical collations → not equal.
c := mk(charset.CollationUTF8MB4)
d := mk(charset.CollationASCII)
require.NotEqual(t, c.CollationID(), d.CollationID())
require.False(t, c.Equals(&d))
// Unknown names both collapse to id 0. The resulting equality is a
// documented consequence of lossy name → id storage; pin it so a
// future change doesn't accidentally flip it back to string compare.
u1 := mk("nonexistent_one")
u2 := mk("nonexistent_two")
require.Equal(t, uint16(0), u1.CollationID())
require.Equal(t, uint16(0), u2.CollationID())
require.True(t, u1.Equals(&u2))
}
func TestDatumEqualsCollation(t *testing.T) {
mk := func(name string) Datum {
var d Datum
d.SetString("abc", name)
return d
}
// Same canonical collation, different name casing → equal.
a := mk("UTF8MB4_BIN")
b := mk("utf8mb4_bin")
require.NotZero(t, a.CollationID(), "expected UTF8MB4_BIN to resolve to a known collation id")
require.Equal(t, a.CollationID(), b.CollationID())
require.True(t, a.Equals(&b))
// Different canonical collations → not equal.
c := mk(charset.CollationUTF8MB4)
d := mk(charset.CollationASCII)
require.NotEqual(t, c.CollationID(), d.CollationID())
require.False(t, c.Equals(&d))
// Unknown names both collapse to id 0. The resulting equality is a
// documented consequence of lossy name → id storage; pin it so a
// future change doesn't accidentally flip it back to string compare.
u1 := mk("nonexistent_one")
u2 := mk("nonexistent_two")
require.Equal(t, uint16(0), u1.CollationID())
require.Equal(t, uint16(0), u2.CollationID())
require.True(t, u1.Equals(&u2))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/types/datum_test.go` around lines 558 - 585, In TestDatumEqualsCollation,
ensure the case-variant equality check isn't vacuous by asserting the resolved
collation ID is non-zero before comparing; specifically, after creating a and b
(from mk("UTF8MB4_BIN") and mk("utf8mb4_bin")), add an assertion that
a.CollationID() (and/or b.CollationID()) != 0 (use the test helper like
require.NotZero or require.NotEqual with uint16(0)) before calling require.Equal
and require.True on CollationID() and Equals(&b) so this block verifies a known
(non-zero) canonical collation.

@mjonss
Copy link
Copy Markdown
Contributor Author

mjonss commented Apr 23, 2026

/retest

@tiprow
Copy link
Copy Markdown

tiprow Bot commented Apr 23, 2026

@mjonss: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 23, 2026

@mjonss: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen 604bac5 link true /test pull-unit-test-next-gen
idc-jenkins-ci-tidb/unit-test 604bac5 link true /test unit-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/planner SIG: Planner size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve storage efficiency of types.Datum

1 participant