
Conversation

@brycekbargar
Collaborator

Using COPY FROM is an order of magnitude faster than row-by-row inserts for bulk loading into postgres. This is the last piece of low-hanging fruit for download performance until async/await is supported and optimized for in a future release. DuckDB has partial support for bulk operations, but it isn't a priority right now, especially with sqlite still supported in this release.
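As a rough sketch of the COPY FROM approach (table and column names here are illustrative, not from this PR), the work splits into building a COPY payload and streaming it through psycopg. The payload builder below uses the text format for simplicity and skips full escaping of tabs/backslashes inside values:

```python
from typing import Iterable, Sequence, Optional

def to_copy_text(rows: Iterable[Sequence[Optional[object]]]) -> bytes:
    """Build a text-format payload for COPY ... FROM STDIN.

    NULLs become \\N; values are tab-separated, one row per line.
    (Real code must also escape tabs, newlines, and backslashes in values.)
    """
    lines = []
    for row in rows:
        lines.append("\t".join("\\N" if v is None else str(v) for v in row))
    return ("\n".join(lines) + "\n").encode()

# With psycopg 3 the payload would be streamed roughly like this
# (hypothetical table name, requires a live connection):
#
# with conn.cursor() as cur:
#     with cur.copy("COPY records (id, data) FROM STDIN") as copy:
#         copy.write(to_copy_text(rows))
```

A single COPY round trip avoids the per-statement overhead that makes executemany-style bulk inserts slow.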

Getting bytes to go directly into the database was a bit of a struggle because postgres expects a "version" byte at the beginning of each record. By doing the byte munging ourselves we skip a whole lot of conversion in ldlite and conversion/logic in psycopg, which would have added points of failure and slowness. I'm going to test that it works across 5C FOLIO data before releasing the change.
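A minimal sketch of the byte munging, assuming the jsonb binary wire format (postgres prefixes each binary jsonb value with a version byte, currently 1, and each field in a binary COPY row is length-prefixed with a big-endian int32); the function name is illustrative:

```python
import json
import struct

JSONB_VERSION = b"\x01"  # leading byte postgres expects on binary jsonb values

def jsonb_field(obj: object) -> bytes:
    """Encode one jsonb field for a binary COPY row.

    Prepends the jsonb version byte to the serialized JSON, then adds the
    int32 length prefix used by the binary COPY format.
    """
    payload = JSONB_VERSION + json.dumps(obj, separators=(",", ":")).encode()
    return struct.pack(">i", len(payload)) + payload
```

Building these bytes directly means the JSON never has to be parsed into Python objects and re-adapted by the driver on the way into the database.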

@brycekbargar brycekbargar merged commit 63e9c88 into library-data-platform:release-v3.2.0 Sep 17, 2025
5 checks passed
brycekbargar added a commit that referenced this pull request Sep 18, 2025
I wasn't super happy with the weird type shenanigans in the last MR #43, but I thought that doing a loads/dumps on all the source records to keep things consistent would be too slow. Then I discovered orjson.Fragment, which means we can just use dumps and treat both srs and non-srs records as bytes. This simplified the signatures and type handling quite a bit.
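The PR itself uses orjson.Fragment; the stdlib sketch below (function names are illustrative) shows the idea it enables: splicing already-serialized record bytes into an output document without the loads/dumps round trip, so both paths can hand around plain bytes:

```python
import json

def embed_roundtrip(record_bytes: bytes, wrapper: dict) -> bytes:
    # The slow approach: parse the record, then re-serialize everything.
    merged = dict(wrapper, record=json.loads(record_bytes))
    return json.dumps(merged, separators=(",", ":")).encode()

def embed_fragment(record_bytes: bytes, wrapper: dict) -> bytes:
    # The fragment-style approach: splice the pre-serialized bytes directly
    # into the wrapper's serialization without ever decoding them.
    head = json.dumps(wrapper, separators=(",", ":")).encode()[:-1]  # drop '}'
    sep = b"," if wrapper else b""
    return head + sep + b'"record":' + record_bytes + b"}"
```

With orjson this is simply `orjson.dumps({..., "record": orjson.Fragment(record_bytes)})`, which keeps one dumps call for both srs and non-srs data.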

Streaming support for SRS was rushed in because non-streaming became even more unstable under Ramsons. There was a (probably small) chance that someone loading source-storage endpoints other than source-records would have broken. This adds more consistency around which endpoints we support with streaming and fixes any accidental breaks.
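One way the "which endpoints support streaming" consistency can be expressed is an explicit allow-list check; the endpoint set and function name below are hypothetical, not ldlite's actual API:

```python
# Hypothetical allow-list: only endpoints known to stream safely qualify.
STREAMING_ENDPOINTS = frozenset({"/source-storage/source-records"})

def supports_streaming(endpoint: str) -> bool:
    """Return True only for endpoints explicitly known to support streaming."""
    return endpoint.rstrip("/") in STREAMING_ENDPOINTS
```

An explicit set makes the supported surface visible in one place, so other source-storage endpoints fall back to the non-streaming path instead of breaking by accident.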
@brycekbargar brycekbargar deleted the pg-copyto branch December 9, 2025 14:14