Skip to content

Conversation

@tuhaihe
Copy link
Member

@tuhaihe tuhaihe commented Oct 9, 2025

Move yyjson from top-level dependency directory to contrib/pax_storage/src/cpp/contrib/yyjson since it's only used by the PAX storage component. Update all references in CMake build files, .gitmodules, and documentation to reflect the new location. This change improves the project structure by keeping PAX-specific dependencies together and removing the unnecessary top-level dependency directory.

The move includes:

  • Moving yyjson directory to contrib/pax_storage/src/cpp/contrib/yyjson
  • Updating CMakeLists.txt and pax.cmake to reference the new location
  • Updating .gitmodules to reflect the new submodule path
  • Updating README.md documentation to show the new location

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


Move yyjson from top-level dependency directory to
`contrib/pax_storage/src/cpp/contrib/yyjson` since it's only used by the
PAX storage component. Update all references in CMake build files,
.gitmodules, and documentation to reflect the new location. This change
improves the project structure by keeping PAX-specific dependencies
together and removing the unnecessary top-level dependency directory.

The move includes:
- Moving yyjson directory to contrib/pax_storage/src/cpp/contrib/yyjson
- Updating CMakeLists.txt and pax.cmake to reference the new location
- Updating .gitmodules to reflect the new submodule path
- Updating README.md documentation to show the new location
@edespino
Copy link
Contributor

Hi @tuhaihe !

I reviewed your PR and wanted to share an alternative implementation that eliminates the need for git submodules entirely. This approach uses CMake's FetchContent module, which I've seen work well in other Apache projects like MADlib.

Alternative: CMake FetchContent

I've implemented this on branch fetchcontent-yyjson for your consideration.

Key change: Replace the git submodule with automatic dependency fetching during CMake configuration.

Implementation

contrib/pax_storage/CMakeLists.txt:

if(USE_MANIFEST_API AND NOT USE_PAX_CATALOG)
    include(FetchContent)

    # Try system package first
    find_package(yyjson QUIET)

    if(NOT yyjson_FOUND)
        message(STATUS "yyjson not found in system, fetching from GitHub...")
        FetchContent_Declare(
            yyjson
            GIT_REPOSITORY https://github.com/ibireme/yyjson.git
            GIT_TAG 0.12.0
            GIT_SHALLOW TRUE
        )

        set(SAVED_BUILD_SHARED_LIBS ${BUILD_SHARED_LIBS})
        set(BUILD_SHARED_LIBS ON)

        FetchContent_MakeAvailable(yyjson)

        set(BUILD_SHARED_LIBS ${SAVED_BUILD_SHARED_LIBS})
    else()
        message(STATUS "Using system yyjson package")
    endif()
endif()

contrib/pax_storage/src/cpp/cmake/pax.cmake:

# Update include path to use FetchContent variable
set(pax_target_include ${pax_target_include} ${yyjson_SOURCE_DIR}/src)

Remove from .gitmodules - No submodule entry needed.

Benefits

  1. Simpler workflow - No git submodule update --init --recursive required
  2. Automatic handling - CMake downloads yyjson when needed during configuration
  3. Version explicit - GIT_TAG 0.12.0 (latest release) is clearer than a commit hash
    • Submodules point to opaque commit SHAs - without checking the yyjson repo, you can't tell if it's a release, a random commit, or how old it is
    • FetchContent uses semantic versions - immediately clear you're using the latest stable release
  4. System package support - Respects system-installed yyjson if available
  5. Pattern consistency - Aligns with how protobuf and zstd are already handled via find_package()

Important Note

During my review, I noticed that yyjson is only used when both USE_MANIFEST_API=ON and USE_PAX_CATALOG=OFF, which is not the default configuration. This means:

  • Default builds (./configure --enable-pax) don't use yyjson at all
  • Your reorganization and this alternative have zero impact on normal builds
  • The dependency is only relevant for the JSON manifest implementation mode

Testing

To verify this works when yyjson IS needed:

cd contrib/pax_storage/build
cmake .. -DUSE_MANIFEST_API=ON -DUSE_PAX_CATALOG=OFF
# Should see: "yyjson not found in system, fetching from GitHub..."
make

Recommendation

Your PR's organizational improvement is valuable regardless. This FetchContent approach is simply an alternative that:

  • Reduces the dependency management burden (no submodule commands)
  • Follows patterns used in other Apache projects
  • Makes the build more self-contained

Both approaches work - this is just offered as food for thought. Happy to discuss!

Branch for reference: fetchcontent-yyjson

@tuhaihe
Copy link
Member Author

tuhaihe commented Oct 11, 2025

Hi @edespino, thanks for your great idea. There are other submodules needed for PAX building, such as tabulate. I prefer to keep the same behavior across yyjson and the other submodules so that users can download all required submodules with git submodule update --init --recursive once, rather than downloading yyjson separately during the build.

@tuhaihe
Copy link
Member Author

tuhaihe commented Oct 11, 2025

More:

I searched and found that yyjson is available via the EPEL repository on Rocky Linux 9, but it is not available in Ubuntu 22.04 (though it can be found in Ubuntu 25.04). Given that it is not a commonly provided package across popular Linux distros, we can continue using the submodule approach for now.

@tuhaihe
Copy link
Member Author

tuhaihe commented Oct 11, 2025

Would like to have more voices from the core PAX developers on this. cc @gfphoenix78 @jiaqizho @gongxun0928

At least, it's not good to place the yyjson separately at the top of the dir.

@edespino
Copy link
Contributor

Subject: Release Engineering View – Using EPEL vs. Building from Source

When building on Rocky Linux 9, EPEL can be convenient for missing dependencies, but it introduces several trade-offs worth noting:

Key Issues

  • Reproducibility: EPEL is rolling; rebuilds can silently pick up new versions. Exact, bit-reproducible releases become hard to guarantee.
  • Supply-chain control: Packages are maintained outside ASF governance. Building from source keeps provenance and signatures within our audit trail.
  • Licensing clarity: ASF policy expects verifiable source for all distributed code. Relying on EPEL binaries delegates license vetting to Fedora.
  • Runtime portability: Binaries linked against EPEL libraries may fail on systems where EPEL isn’t enabled or where ABI flags differ.
  • Longevity: EPEL mirrors move forward; old package versions aren’t preserved, complicating long-term rebuilds.

Recommended Approach

  • Build third-party libraries from verified upstream source whenever they appear in our release artifacts.
  • Use EPEL only for transient developer tools, not for runtime dependencies.
  • If EPEL is required in CI, snapshot or mirror it to fix versions.
  • Record all external sources and hashes in DEPENDENCIES or release notes.

Bottom Line

EPEL is fine for development convenience, but from a release-engineering and ASF-compliance standpoint, source builds give stronger reproducibility, auditability, and long-term stability.

@edespino
Copy link
Contributor

Hi @edespino, thanks for your great idea. There are other submodules needed for PAX building, such as tabulate. I prefer to keep the same behavior across yyjson and the other submodules so that users can download all required submodules with git submodule update --init --recursive once, rather than downloading yyjson separately during the build.

One thing I'm curious about is why the selected submodule commit shas were used. It is not important for the relocation work. But it wasn't obvious at all why the yyjson submodule commit sha was used.

@jiaqizho
Copy link
Contributor

Would like to have more voices from the core PAX developers on this. cc @gfphoenix78 @jiaqizho @gongxun0928

At least, it's not good to place the yyjson separately at the top of the dir.

The reason why dependency/yyjson is separated and stored instead of being placed under PAX is because not only PAX depends on yyjson. Does the core of cloud also rely on yyjson?

@gfphoenix78 do you still have any context about this module?

@tuhaihe
Copy link
Member Author

tuhaihe commented Oct 16, 2025

Would like to have more voices from the core PAX developers on this. cc @gfphoenix78 @jiaqizho @gongxun0928
At least, it's not good to place the yyjson separately at the top of the dir.

The reason why dependency/yyjson is separated and stored instead of being placed under PAX is because not only PAX depends on yyjson. Does the core of cloud also rely on yyjson?

Thanks @jiaqizho for your reply. I remember our private conversation on this topic. ❤️

Now, from the open-source side, I believe we can move the yyjson to the correct place. If some downstream vendors refer to it as a dependency, the downstream should have a patch for it or clone it separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants