Add defensive checks for Parquet statistics handling and improve library path resolution#941
Add defensive checks for Parquet statistics handling and improve library path resolution#941a-hirota wants to merge 1 commit intoheterodb:masterfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds defensive error handling to prevent segmentation faults when reading Parquet files with statistics created by different Arrow/Parquet library versions, addressing issue #940.
- Added null pointer checks and exception handling for Parquet statistics method calls
- Wrapped statistics access in try-catch blocks to handle ABI incompatibilities gracefully
- Enhanced Makefile to automatically configure runtime library paths using pkgconf
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/arrow_meta.cpp | Added defensive checks for statistics access with try-catch blocks and null pointer validation |
| src/Makefile | Added automatic rpath configuration for Arrow/Parquet libraries to ensure runtime path matches compile-time selection |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| phys_type = stats->physical_type(); | ||
| } catch (const std::exception& e) { | ||
| #ifdef PGSTROM_DEBUG | ||
| elog(DEBUG1, "Failed to get physical_type from Parquet statistics: %s", e.what()); | ||
| #endif | ||
| return; | ||
| } catch (...) { | ||
| #ifdef PGSTROM_DEBUG | ||
| elog(DEBUG1, "Unknown error getting physical_type from Parquet statistics"); | ||
| #endif | ||
| return; | ||
| } |
There was a problem hiding this comment.
The code still calls stats->physical_type() without additional protection after the null check. If the object is corrupted due to ABI mismatch, this call itself could segfault before the exception handlers can catch it. Consider adding a more defensive approach or additional validation before making any method calls on the stats object.
| if (stats->HasNullCount()) | ||
| field->null_count = stats->null_count(); | ||
| if (stats->HasMinMax()) | ||
| __readParquetMinMaxStats(field, stats); |
There was a problem hiding this comment.
Similar to the issue in __readParquetMinMaxStats, the method calls stats->HasNullCount() and stats->HasMinMax() could still segfault due to corrupted vtables before the exception handlers can catch them. These calls should be moved inside the try block or additional validation should be added.
…ary path resolution This commit addresses potential crashes when reading Parquet files with statistics, particularly when there are ABI incompatibilities between compile-time and runtime Arrow/Parquet library versions. Changes: - arrow_meta.cpp: Enhanced null pointer checks and exception handling for statistics access * Removed redundant shared_ptr.get() checks * Added specific exception catching with debug logging * Separated try-catch blocks for null count and min/max statistics access * Improved granular error handling to isolate different types of statistics failures - Makefile: Added automatic rpath configuration for both Parquet and Arrow libraries * Ensures runtime library path matches compile-time library selection * Covers both HAS_PARQUET=yes and Arrow-only scenarios The defensive programming approach prevents segmentation faults when virtual method calls fail due to vtable corruption from library version mismatches, while providing better debugging information in debug builds. Each statistics method call is now individually protected to maximize data recovery in case of partial failures.
1ee6e8b to
7106e33
Compare
|
I've addressed the Copilot suggestions regarding defensive programming for Parquet Changes Made1. Enhanced Exception Handling Granularity
2. Improved Fault Isolation // Before: Single try-catch for all statistics methods
try {
if (stats->HasNullCount()) field->null_count = stats->null_count();
if (stats->HasMinMax()) __readParquetMinMaxStats(field, stats);
} catch (...) { /* handle all failures together */ }
// After: Individual protection for each statistics type
try {
if (stats->HasNullCount()) field->null_count = stats->null_count();
} catch (...) { /* handle null count failures specifically */ }
try {
if (stats->HasMinMax()) __readParquetMinMaxStats(field, stats);
} catch (...) { /* handle min/max failures specifically */ }
3. Benefits of This Approach
- Partial recovery: If HasNullCount() fails due to ABI mismatch, min/max statistics
can still be processed
- Better debugging: Specific error messages help identify which statistics method is
affected
- Graceful degradation: Maximum data extraction even in mixed-version environments
This addresses the Copilot concern about insufficient protection while maintaining
performance and providing better error isolation for debugging ABI compatibility
issues. |
this change was based on pull request #941 by a-hirota san
Summary
Adds defensive checks for Parquet statistics handling to prevent crashes when ABI
incompatibilities exist between compile-time and runtime Arrow/Parquet library
versions.
Background
Issue #940 reported segmentation faults when reading Parquet files with statistics
created by PyArrow 19.0.1. Investigation revealed that the crashes occurred due to
ABI mismatches between different versions of Arrow/Parquet libraries loaded at
compile-time vs runtime.
Changes
arrow_meta.cpp
physical_type()and statistics method calls in try-catch blocksMakefile
Testing
Impact
While the root cause (library version conflicts) should be resolved at the
environment level, these defensive changes provide:
Fixes #940