Commit b786900
FEAT: Add arrow fetch support (#354)
### Work Item / Issue Reference
<!--
IMPORTANT: Please follow the PR template guidelines below.
For mssql-python maintainers: Insert your ADO Work Item ID below (e.g.
AB#37452)
For external contributors: Insert Github Issue number below (e.g. #149)
Only one reference is required - either GitHub issue OR ADO Work Item.
-->
<!-- External contributors: GitHub Issue -->
> GitHub Issue: #130
-------------------------------------------------------------------
### Summary
<!-- Insert your summary of changes below. Minimum 10 characters
required. -->
Hey, you mentioned in issue #130 that you were willing to consider
community contributions for adding Apache Arrow support, so here you go.
I have focused only on fetching data into Arrow structures from the
Database.
The Function signatures I chose are:
* **`arrow_batch(chunk_size=10000)`**: Fetch a single
`pyarrow.RecordBatch`, base for the other two methods.
* **`arrow(chunk_size=10000)`**: Fetches the entire result set as a
single `pyarrow.Table`.
* **`arrow_reader(chunk_size=10000)`**: Returns a
`pyarrow.RecordBatchReader` for streaming results without loading the
entire dataset into RAM.
Using `fetch_arrow...` instead of just `arrow...` could also be a good
option, but I think the terse version is not too ambiguous.
### Technical details
I am not very familiar with C++, but I did have some prior practice for
this task from implementing my [own ODBC driver in
Zig](https://github.com/ffelixg/zodbc_py) (a very good language for
projects like this!). The implementation is written almost entirely in
C++ in the `FetchArrowBatch_wrap` function, which produces PyCapsules
that are then consumed by `arrow_batch` and turned into actual arrow
objects.
The function itself is very large. I'm sure it could be factored in a
better way, even sharing some code with the other methods of fetching,
but my goal was to keep the whole thing as straight forward as possible.
I have also implemented my own loop for SQLGetData for Lob-Columns.
Unlike with the python fetch methods, I don't use the result directly,
but instead copy it into the same buffer I would use for the case with
bound columns. Maybe that's an abstraction that would make sense for
that case as well.
### Notes on data types
I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The arrow
fetch does the same, but I think it would be better to use
SQL_C_SS_TIME2, since that supports fractional seconds.
Datetimeoffset is a bit tricky, since SQL Server stores timezone
information alongside each cell, while arrow tables expect a fixed
timezone for the entire column. I don't really see any solution other
than converting everything to UTC and returning a UTC column, so that's
what I did.
SQL_C_CHAR columns get copied directly into arrow utf8 arrays. Maybe
some encoding options would be useful.
### Performance
I think the main performance win to be gained is not interacting with
any Python data structures in the hot path. That is satisfied. Further
optimizations, which I did not make are:
* Releasing the GIL for the entire fetch loop
* Sharing the bound fetch buffer across repeated fetch calls
* Improve the hot loop switching
Instead of looping over rows and columns and then switching on the data
type for each cell, you could
* Put the row loop inside each switch case (fastest I think, but would
bloat the code a lot more)
* Use function pointers like you recently did for python fetching (has
overhead because of the indirect function call I think, also code is
more scattered)
* Replace both loops and the switch with computed gotos. That's what I
opted for in my ODBC driver (the Zig equivalent is a labeled switch) and
I am quite happy with how it came out. Performance seems very good and
it allows you to abstract the fetching process on a row by row basis. I
don't know how well that would translate to C++.
Overall the arrow performance seems not too far off from what I achieved
with zodbc.
<!--
### PR Title Guide
> For feature requests
FEAT: (short-description)
> For non-feature requests like test case updates, config updates ,
dependency updates etc
CHORE: (short-description)
> For Fix requests
FIX: (short-description)
> For doc update requests
DOC: (short-description)
> For Formatting, indentation, or styling update
STYLE: (short-description)
> For Refactor, without any feature changes
REFACTOR: (short-description)
> For release related changes, without any feature changes
RELEASE: #<RELEASE_VERSION> (short-description)
### Contribution Guidelines
External contributors:
- Create a GitHub issue first:
https://github.com/microsoft/mssql-python/issues/new
- Link the GitHub issue in the "GitHub Issue" section above
- Follow the PR title format and provide a meaningful summary
mssql-python maintainers:
- Create an ADO Work Item following internal processes
- Link the ADO Work Item in the "ADO Work Item" section above
- Follow the PR title format and provide a meaningful summary
-->
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: gargsaumya <saumyagarg.100@gmail.com>
Co-authored-by: Gaurav Sharma <sharmag@microsoft.com>1 parent 87f653b commit b786900
File tree
6 files changed
+1736
-0
lines changed- mssql_python
- pybind
- tests
6 files changed
+1736
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
| 39 | + | |
39 | 40 | | |
| 41 | + | |
| 42 | + | |
40 | 43 | | |
41 | 44 | | |
42 | 45 | | |
| |||
775 | 778 | | |
776 | 779 | | |
777 | 780 | | |
| 781 | + | |
| 782 | + | |
| 783 | + | |
| 784 | + | |
| 785 | + | |
| 786 | + | |
| 787 | + | |
| 788 | + | |
| 789 | + | |
| 790 | + | |
| 791 | + | |
| 792 | + | |
| 793 | + | |
778 | 794 | | |
779 | 795 | | |
780 | 796 | | |
| |||
2516 | 2532 | | |
2517 | 2533 | | |
2518 | 2534 | | |
| 2535 | + | |
| 2536 | + | |
| 2537 | + | |
| 2538 | + | |
| 2539 | + | |
| 2540 | + | |
| 2541 | + | |
| 2542 | + | |
| 2543 | + | |
| 2544 | + | |
| 2545 | + | |
| 2546 | + | |
| 2547 | + | |
| 2548 | + | |
| 2549 | + | |
| 2550 | + | |
| 2551 | + | |
| 2552 | + | |
| 2553 | + | |
| 2554 | + | |
| 2555 | + | |
| 2556 | + | |
| 2557 | + | |
| 2558 | + | |
| 2559 | + | |
| 2560 | + | |
| 2561 | + | |
| 2562 | + | |
| 2563 | + | |
| 2564 | + | |
| 2565 | + | |
| 2566 | + | |
| 2567 | + | |
| 2568 | + | |
| 2569 | + | |
| 2570 | + | |
| 2571 | + | |
| 2572 | + | |
| 2573 | + | |
| 2574 | + | |
| 2575 | + | |
| 2576 | + | |
| 2577 | + | |
| 2578 | + | |
| 2579 | + | |
| 2580 | + | |
| 2581 | + | |
| 2582 | + | |
| 2583 | + | |
| 2584 | + | |
| 2585 | + | |
| 2586 | + | |
| 2587 | + | |
| 2588 | + | |
| 2589 | + | |
| 2590 | + | |
| 2591 | + | |
| 2592 | + | |
| 2593 | + | |
| 2594 | + | |
| 2595 | + | |
| 2596 | + | |
| 2597 | + | |
| 2598 | + | |
| 2599 | + | |
| 2600 | + | |
| 2601 | + | |
| 2602 | + | |
| 2603 | + | |
| 2604 | + | |
| 2605 | + | |
| 2606 | + | |
| 2607 | + | |
| 2608 | + | |
| 2609 | + | |
| 2610 | + | |
| 2611 | + | |
| 2612 | + | |
| 2613 | + | |
| 2614 | + | |
| 2615 | + | |
| 2616 | + | |
| 2617 | + | |
| 2618 | + | |
| 2619 | + | |
| 2620 | + | |
| 2621 | + | |
| 2622 | + | |
2519 | 2623 | | |
2520 | 2624 | | |
2521 | 2625 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| 10 | + | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
| |||
199 | 200 | | |
200 | 201 | | |
201 | 202 | | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
202 | 208 | | |
203 | 209 | | |
204 | 210 | | |
| |||
0 commit comments