
feat(mysql): switch to adbc #11958

Open
tokoko wants to merge 5 commits into ibis-project:main from tokoko:master

Conversation

@tokoko
Contributor

@tokoko tokoko commented Feb 28, 2026

Description of changes

  • switches mysql backend to use adbc rather than mysqlclient. export to arrow no longer goes through row-based cursors and pandas. managed to get rid of extra type wrangling.
  • metadata operations are still retrieved with sql queries rather than adbc get_objects.
  • this is a breaking change in terms of backend setup. adbc driver needs to be installed separately with dbc.

@github-actions github-actions bot added the docs, tests, ci, mysql, and dependencies labels on Feb 28, 2026
@github-actions github-actions bot added the sql (Backends that generate SQL) label on Feb 28, 2026
def test_get_schema_from_query_special_cases(
con, mysql_type, get_schema_expected_type, table_expected_type
):
def test_get_schema_from_query_special_cases(con, mysql_type, expected_type):
Contributor


should this function get renamed? It doesn't seem to be testing special cases any more, since now the two implementations result in the same thing (which is awesome!)

@NickCrews
Contributor

Thanks @tokoko. I looked through the tests and all other files besides the actual backend, and they looked good. I got scared by the amount of changes in the backend so I wanted to reply here before I took more time reading.

What is the original motivation here? performance? maintenance burden? There is a cost to requiring users to install a binary, and a cost to the (admittedly small-looking) migration they will have to go through, so I want to understand the tradeoffs.

Can you also give us a roadmap as we review this? What are the parts that are most contentious that I should really pay attention to? Were there any particular design decisions you weren't sure about that you want to call out/ask about?

@tokoko
Contributor Author

tokoko commented Mar 4, 2026

@NickCrews sure, happily. thanks for taking the time.

I guess the motivation is two-fold. One is performance: the current implementation needs to go through both row objects and pandas, which is frankly hard not to beat. The other is standardization on adbc as the way ibis talks to backends, similar to how it relies on sqlglot for sql conversion for most of the backends now. I have to admit I'm not much of a user of the ibis mysql backend myself; I just picked a random backend and went to work on it.

There is a cost to requiring users to install a binary

yes, there is, but to be fair, in this particular instance users already had to provide a system-level dependency in order for mysqldb to work, so the only remaining issue is that now they have to install a different binary with a different "package manager" (dbc). If we had been using something pure-python like pymysql before, this would be more of a newly introduced inconvenience.

One driver of so many changes in the tests was that errors are now classified differently, which is technically a breaking change, but I assumed we can look past it. I'll leave PR comments for all the other workarounds I had to employ.

sqlite3.InterfaceError,
),
)
@pytest.mark.notyet(
Contributor Author


this is a regression due to a quirk in mysql adbc behavior. I think it's a bug; I have a ticket open on the driver side here.

# nan can not be used with MySQL
df = df.replace(float("nan"), None)
ncols = len(schema)
# MySQL has a 65535 prepared statement placeholder limit.
Contributor Author


this is already fixed in the mysql driver (here). we can drop the batch_size calculation here once a new version of the driver is out.
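For context, here is a rough pure-Python sketch of the interim workaround discussed above. The 65535 constant is MySQL's documented prepared-statement placeholder limit; the helper names and sample data are illustrative, not the PR's actual code:

```python
PLACEHOLDER_LIMIT = 65535  # MySQL's cap on placeholders per prepared statement


def nan_to_none(rows):
    # MySQL rejects NaN as a bound parameter, so map it to NULL first.
    # NaN is the only float that compares unequal to itself.
    return [
        tuple(None if isinstance(v, float) and v != v else v for v in row)
        for row in rows
    ]


def batches(rows, ncols):
    # Each row contributes `ncols` placeholders, so cap rows per statement
    # at the largest count that still fits under the placeholder limit.
    batch_size = max(PLACEHOLDER_LIMIT // ncols, 1)
    for i in range(0, len(rows), batch_size):
        yield rows[i : i + batch_size]


print(nan_to_none([(1, float("nan")), (2, 3.5)]))  # [(1, None), (2, 3.5)]
print(max(len(b) for b in batches([(0, 0)] * 100_000, ncols=2)))  # 32767
```

Once the driver-side fix ships, the `batches` half of this disappears and rows can be handed over in one go.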

if isinstance(col, pa.Array)
else col.combine_chunks().storage
)
# All-null opaque columns (e.g., type_name=NULL) can't be cast
Contributor Author


This conversion stuff here is the part I'm least confident about. Frankly, it's just enough hacks to pass the tests.

@NickCrews
Contributor

Thanks a lot @tokoko for the detailed response, that is super helpful.

re the performance, I realize this might be a big ask, but if it's easy, would you be willing to do a VERY rough benchmark of the difference for doing .to_pyarrow() and .to_pandas() on tables of slightly varying sizes (I guess the complexity of the query shouldn't matter?), e.g. .limit(5) to represent interactive queries, .limit(10_000) to represent medium-size queries, and .limit(1_000_000) for pretty big queries? You can push back if you want; I realize that is nontrivial.

At first pass, I might guess that perf isn't that much of a motivator for me. MySQL is a row-oriented DB, so won't we have to do the row->column transformation somewhere anyway? Either it will be in the ADBC driver, or here in ibis. I might think that for a lot of queries, the time required is going to be dominated by the execution time within MySQL (as it slowly operates over rows), not by the conversion time. I also haven't seen any issues filed here where a user asks for better perf, but they very well might be experiencing it; please correct me/link if someone does have examples.

On the other hand, I would consider much larger motivators for me, if this PR does bring them:

  1. Greater consistency between backends from the user experience, eg datatypes are treated the same when someone switches backends.
  2. Easier installation of the dependencies for users.
  3. Reduced complexity of the ibis codebase. If all backends could use the same ADBC API, that would be awesome for us!

1. Greater consistency between backends from the user experience

It does seem like this PR would bring that. Or at least it wouldn't hurt it.

2. Easier installation of the dependencies for users.

Thanks for the clarification that we already did depend on a binary for mysql, I didn't know that. Do you have any sense of the difference in ease/portability between the installation stories of these two binaries? It looks like [dbc](https://docs.columnar.tech/dbc/getting_started/installation/) is pretty easy to install across all platforms. Maybe that is even easier than the current mysql binary driver? I want to be very conscious of what we are asking of users: if installing the adbc driver is hard, and that is the only way they can use ibis, then I want to be careful about tying ourselves to depend on another project. But we're not looking for perfect, just better than the existing mysql driver 😉. Any comments here?

3. Reduced complexity of the ibis codebase

I took a look at the mysql backend changes you proposed. They look very well crafted, I really appreciate you taking the time. The comments and docstrings were the perfect level of explanation.

I want to hear your thoughts on whether you think switching to ADBC would actually reduce the complexity of the ibis codebase. It looks like a similar amount of code as before. Some things could be removed, such as the datatype mapping, but other complexities were added, like the parsing of ["42 40"] binary data?? Or another example: why do we still need to use sge.Describe() to get the schema of a table, why can we not use the standard adbc_get_table_schema() provided by ADBC? Are there other existing bits of the backend that you tried to convert to a standard ADBC API, and then discovered it wouldn't work, so you had to fall back to backend-specific SQL? Basically, I'm wondering: if we switch the postgres backend to use ADBC, are we going to be able to reuse this same conversion logic, or are we going to have to do the same thing again there? How many of these things are just temporary, e.g. bugs or limitations in the ADBC driver, where we can drop the complexity here once they get fixed there (e.g. in the future we could switch to adbc_get_table_schema()), and which of them are inherent to ADBC, such that we are going to be stuck with them forever?

I see that in terms of line count, this is +404 and -399, with some added tests and some added lines in lockfiles, but you managed to entirely remove the 150-line datatypes.py, swapping it out for the added code needed to translate the arrow batches, e.g. _decode_opaque_storage(). At this stage that looks to me like the complexity is about the same as before, but would you disagree? Or is this an investment that other backends would also be able to share?

In general: how much is this actually going to help us unify the ibis codebase?

@tokoko
Contributor Author

tokoko commented Mar 5, 2026

All really good questions, let me take them one at a time.

I can do a benchmark a little later, no problem. I found this one but it's measuring write rather than read.

MySQL is a row-oriented DB, so won't we have to do the row->column transformation somewhere anyways? Either it will be in the ADBC driver, or here in ibis.

well, technically yes, you can't avoid the row -> columnar conversion at some stage of the pipeline, but in practice it matters quite a lot where the conversion happens. with the current approach, you're creating a new python object for each row of the query output, which needs to be garbage collected as well, not to mention the pandas conversion as an intermediate step. adbc drivers, on the other hand, do the conversion on the native side (most implementations are in go) and effectively only transfer pointers to the data over to the python side, because of how arrow works under the hood. there are other projects (like turbodbc) that use similar techniques to expose as little data as possible to the python vm.

That being said, the actual impact of the performance boost depends on the type of query, as you hinted above. If a query needs to do a lot of processing on the server side and returns only a few rows, the difference will be negligible, while a query that needs to output a huge number of rows to the caller will see a dramatic improvement.
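To make the row-object cost concrete, here is a toy pure-Python illustration (not driver or ibis code) of the row-to-columnar transpose that the DB-API path performs inside the Python VM, and that an ADBC driver instead performs natively while handing back arrow buffers:

```python
# A DB-API cursor yields one Python tuple per row; transposing to columns
# touches every cell as a heap-allocated Python object, which is exactly
# the per-row overhead ADBC avoids by building arrow buffers natively.
rows = [(1, "a"), (2, "b"), (3, "c")]  # cursor.fetchall()-style output
names = ["id", "label"]
columns = {name: list(vals) for name, vals in zip(names, zip(*rows))}
print(columns)  # {'id': [1, 2, 3], 'label': ['a', 'b', 'c']}
```

For three rows this is trivial; for millions of rows, the per-cell object creation and the subsequent garbage collection dominate, which is where the native-side conversion pays off.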

If installing the adbc driver is hard, and that is the only way they can use ibis

It is very simple to install drivers and locate them by name from the client (especially compared to odbc). The main complexity comes from the fact that you need another tool to do it. In (my) ideal world all drivers would be conda packages installable with conda or pixi, but Columnar is working on improving adbc driver UX with a separate tool, so there's that. As long as you have dbc, managing drivers is straightforward.

But we're not looking for perfect, just better than the existing mysql driver

exactly. the fact that you're asking a user to prepare an environment for ibis is not perfect, but at least with dbc the instructions will be similar across most backends.

Are there other existing bits of the backend that you tried to convert to a standard ADBC API

I deliberately did not convert metadata extraction to use the adbc api, simply to limit the scope of the changes in this PR. I can go ahead and include those changes here as well to illustrate what that would look like.

How many of these things are just temporary, eg they are bugs or limitations in the ADBC driver and we can avoid the complexity here if they get implemented there (eg in the future we could switch to adbc_get_table_schema()) , and which of them are inherent to ADBC and we are going to be stuck with them forever?

The way I look at it, both adbc and ibis try to solve the same problem in this regard. granted, adbc doesn't really care about the ibis type system, but even w/o adbc, the physical conversion ibis does is still between source systems and arrow. I don't think there can be any issues inherent to ADBC (other than driver bugs). A particular driver might have a solution to a particular problem that differs from the one that's already in ibis, but that's more of a backward-compat issue than something fundamentally problematic imho.

Of course I can't really claim that all conversion troubles are going to go away, since a lot of them stem from the "peculiarities" of the backend type systems. I think the most realistic expectation is that most of those conversions will be handled by adbc unless the type system itself makes it impossible. In other words, the problems don't go away, but the solutions move to the adbc side.

That looks to me at this stage like the complexity is about the same as before, but would you disagree? Or is this an investment that other backends would also be able to share?

The goal is to share most of this stuff across backends, yes (maybe we can have a base AdbcSqlBackend, for example). But I have to note that moving all backends to adbc is probably not realistic at the moment: some backends don't have driver implementations, while some others (like oracle) exist but are not free.

@NickCrews
Contributor

@tokoko I think I am convinced that this is a good idea. I can do a slightly more thorough review but in general it looks good.

But before I ask you to put in any more time, I think we should get the opinions of other ibis maintainers that honestly have more background and history here. @gforsyth and @cpcloud, could you give a thumbs up if this looks like a good direction in general to you?

@NickCrews
Contributor

@tokoko do you know what the licensing story of the ADBC ecosystem is? In this discussion, a user wants us to switch away from mysqlclient because it isn't licensed permissively enough. Would switching to ADBC be a win here?

@tokoko
Contributor Author

tokoko commented Mar 6, 2026

yes, both the driver manager and the mysql driver are Apache 2.0. Different drivers can have different licenses, but all of the free ones under the adbc-drivers org seem to be Apache 2.0.

@b4rlw

b4rlw commented Mar 8, 2026

@NickCrews As per your request, this migration looks helpful for our use case. The more permissive licensing of the components would make adoption straightforward.

We're using a uv-based workflow. While @tokoko mentions this migration requires incorporating another tool, dbc, that isn't necessarily a new precedent, in the sense that building mysqlclient from source (only windows wheels exist) often requires some configuration of system packages.

To obtain a working Ibis MySQL backend against MariaDB on macOS for example, I have something like this set in my shell:

set -gx OPENSSL_PREFIX (brew --prefix openssl@3)
set -gx MDB_PREFIX (brew --prefix mariadb-connector-c)
# Compile flags
set -gx CPPFLAGS "-I$OPENSSL_PREFIX/include -I$MDB_PREFIX/include -I$MDB_PREFIX/include/mariadb $CPPFLAGS"
# Linker flags
set -gx LDFLAGS "-L$OPENSSL_PREFIX/lib -L$MDB_PREFIX/lib $LDFLAGS"
# pkg-config
set -gx PKG_CONFIG_PATH "$OPENSSL_PREFIX/lib/pkgconfig:$MDB_PREFIX/lib/pkgconfig:$PKG_CONFIG_PATH"

If this is reduced to:

dbc install mysql

and having adbc_driver_manager available in the virtual environment, then that seems like a reduction in complexity from a user perspective.

@gforsyth
Member

gforsyth commented Mar 8, 2026

I'm broadly in favor of switching over to ADBC connectors -- this was on our roadmap in the olden days.
The more we get away from pulling in row-wise tuples the better, and as more ADBC connectors are developed, we can also standardize a bunch more of the connector logic (that is currently necessarily splintered).

I'd like to get the ADBC connectors on conda-forge, at least, but installing dbc is straightforward and also provides a more consistent cross-platform experience.

@NickCrews
Contributor

Thanks all. Sounds good to move this forward then! I can do another, more detailed review with just a few bits of feedback and then get this merged.

I think we can skip the perf benchmarks I originally asked about, there are enough other reasons for the switch and I'm pretty confident the perf won't get worse.

