
Add defensive checks for Parquet statistics handling and improve library path resolution#941

Open
a-hirota wants to merge 1 commit into heterodb:master from a-hirota:fix-parquet-statistics-crash

Conversation

@a-hirota

Summary

Adds defensive checks for Parquet statistics handling to prevent crashes when ABI
incompatibilities exist between compile-time and runtime Arrow/Parquet library
versions.

Background

Issue #940 reported segmentation faults when reading Parquet files with statistics
created by PyArrow 19.0.1. Investigation traced the crashes to ABI mismatches
between the Arrow/Parquet library versions linked at compile time and those
loaded at runtime.

Changes

arrow_meta.cpp

  • Add null pointer checks for statistics objects before accessing methods
  • Wrap physical_type() and statistics method calls in try-catch blocks
  • Gracefully handle exceptions from virtual method calls on corrupted vtables
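As a rough sketch of that guard pattern, decoupled from the actual patch: `StatsLike` below is a hypothetical stand-in for `parquet::Statistics`, and `read_physical_type` is an illustrative helper, not a function in arrow_meta.cpp. The shape — null check first, then the virtual call inside a catch-all — is what the change describes.

```cpp
#include <memory>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for parquet::Statistics; the real patch guards
// calls on the actual Arrow/Parquet statistics object.
struct StatsLike {
    bool broken;                       // simulates an ABI-mismatched object
    int physical_type() const {
        if (broken) throw std::runtime_error("ABI mismatch");
        return 42;
    }
};

// Return the physical type, or std::nullopt if the pointer is null or the
// call throws -- degrade gracefully instead of crashing.
std::optional<int> read_physical_type(const std::shared_ptr<StatsLike> &stats)
{
    if (!stats)
        return std::nullopt;           // null pointer check before any access
    try {
        return stats->physical_type(); // may fail under library version skew
    } catch (...) {
        return std::nullopt;           // swallow the failure, skip statistics
    }
}
```

In the actual code the failure branch additionally emits a DEBUG1 `elog` in debug builds rather than failing silently.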

Makefile

  • Add automatic rpath configuration for Arrow/Parquet libraries using pkgconf
  • Ensure the runtime library path matches the compile-time library selection
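A minimal sketch of what that wiring could look like; the variable names are illustrative, not the actual Makefile contents, and it assumes arrow.pc and parquet.pc are visible to pkgconf:

```make
# Resolve the library directory chosen at compile time via pkgconf, then
# embed it as an rpath so the same libraries are found at runtime.
ARROW_LIBDIR   := $(shell pkgconf --variable=libdir arrow 2>/dev/null)
PARQUET_LIBDIR := $(shell pkgconf --variable=libdir parquet 2>/dev/null)

ifneq ($(ARROW_LIBDIR),)
LDFLAGS += -Wl,-rpath,$(ARROW_LIBDIR)
endif
ifneq ($(PARQUET_LIBDIR),)
LDFLAGS += -Wl,-rpath,$(PARQUET_LIBDIR)
endif
```

Baking the rpath in at link time avoids depending on LD_LIBRARY_PATH or ldconfig ordering, which is exactly where the mixed-version loads came from.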

Testing

  • ✅ Successfully reads Parquet files with statistics created by PyArrow 19.0.1
  • ✅ Handles mixed data types (int64, string, double) without crashes
  • ✅ Graceful degradation when statistics cannot be accessed

Impact

While the root cause (library version conflicts) should be resolved at the
environment level, these defensive changes provide:

  • Crash prevention in mixed-version environments
  • Better error handling for statistics processing
  • Improved library path resolution

Fixes #940

@a-hirota a-hirota marked this pull request as ready for review September 21, 2025 03:42
Copilot AI review requested due to automatic review settings September 21, 2025 03:42

Copilot AI left a comment


Pull Request Overview

This PR adds defensive error handling to prevent segmentation faults when reading Parquet files with statistics created by different Arrow/Parquet library versions, addressing issue #940.

  • Added null pointer checks and exception handling for Parquet statistics method calls
  • Wrapped statistics access in try-catch blocks to handle ABI incompatibilities gracefully
  • Enhanced Makefile to automatically configure runtime library paths using pkgconf

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/arrow_meta.cpp Added defensive checks for statistics access with try-catch blocks and null pointer validation
src/Makefile Added automatic rpath configuration for Arrow/Parquet libraries to ensure runtime path matches compile-time selection


Comment on lines +3232 to +3243
phys_type = stats->physical_type();
} catch (const std::exception& e) {
#ifdef PGSTROM_DEBUG
elog(DEBUG1, "Failed to get physical_type from Parquet statistics: %s", e.what());
#endif
return;
} catch (...) {
#ifdef PGSTROM_DEBUG
elog(DEBUG1, "Unknown error getting physical_type from Parquet statistics");
#endif
return;
}

Copilot AI Sep 21, 2025


The code still calls stats->physical_type() without additional protection after the null check. If the object is corrupted due to ABI mismatch, this call itself could segfault before the exception handlers can catch it. Consider adding a more defensive approach or additional validation before making any method calls on the stats object.

Comment on lines +3382 to +3385
if (stats->HasNullCount())
field->null_count = stats->null_count();
if (stats->HasMinMax())
__readParquetMinMaxStats(field, stats);

Copilot AI Sep 21, 2025


Similar to the issue in __readParquetMinMaxStats, the method calls stats->HasNullCount() and stats->HasMinMax() could still segfault due to corrupted vtables before the exception handlers can catch them. These calls should be moved inside the try block or additional validation should be added.

Add defensive checks for Parquet statistics handling and improve library path resolution

This commit addresses potential crashes when reading Parquet files with statistics,
particularly when there are ABI incompatibilities between compile-time and runtime
Arrow/Parquet library versions.

Changes:
- arrow_meta.cpp: Enhanced null pointer checks and exception handling for statistics access
  * Removed redundant shared_ptr.get() checks
  * Added specific exception catching with debug logging
  * Separated try-catch blocks for null count and min/max statistics access
  * Improved granular error handling to isolate different types of statistics failures
- Makefile: Added automatic rpath configuration for both Parquet and Arrow libraries
  * Ensures runtime library path matches compile-time library selection
  * Covers both HAS_PARQUET=yes and Arrow-only scenarios

The defensive programming approach prevents segmentation faults when virtual method
calls fail due to vtable corruption from library version mismatches, while providing
better debugging information in debug builds. Each statistics method call is now
individually protected to maximize data recovery in case of partial failures.
@a-hirota a-hirota force-pushed the fix-parquet-statistics-crash branch from 1ee6e8b to 7106e33 Compare September 21, 2025 03:46
@a-hirota
Author

I've addressed the Copilot suggestions regarding defensive programming for Parquet
statistics handling:

Changes Made

1. Enhanced Exception Handling Granularity

  • Separated null count and min/max statistics into individual try-catch blocks
  • This allows partial recovery: if one type of statistics fails, the other can still
    be processed
  • Each failure type now has specific debug logging for better troubleshooting

2. Improved Fault Isolation

// Before: Single try-catch for all statistics methods
try {
    if (stats->HasNullCount()) field->null_count = stats->null_count();
    if (stats->HasMinMax()) __readParquetMinMaxStats(field, stats);
} catch (...) { /* handle all failures together */ }

// After: Individual protection for each statistics type
try {
    if (stats->HasNullCount()) field->null_count = stats->null_count();
} catch (...) { /* handle null count failures specifically */ }

try {
    if (stats->HasMinMax()) __readParquetMinMaxStats(field, stats);
} catch (...) { /* handle min/max failures specifically */ }

3. Benefits of This Approach
  • Partial recovery: if HasNullCount() fails due to ABI mismatch, min/max
    statistics can still be processed
  • Better debugging: specific error messages help identify which statistics
    method is affected
  • Graceful degradation: maximum data extraction even in mixed-version
    environments

This addresses the Copilot concern about insufficient protection while maintaining
performance and providing better error isolation for debugging ABI compatibility
issues.

kaigai added a commit that referenced this pull request Sep 21, 2025
this change was based on pull request #941 by a-hirota san


Development

Successfully merging this pull request may close these issues.

Segmentation fault when importing Parquet files with statistics from PyArrow 19.0.1
