Conversation
```diff
-def test_get_schema_from_query_special_cases(
-    con, mysql_type, get_schema_expected_type, table_expected_type
-):
+def test_get_schema_from_query_special_cases(con, mysql_type, expected_type):
```
should this function get renamed? It doesn't seem to be testing special cases any more, since now the two implementations result in the same thing (which is awesome!)
---
Thanks @tokoko. I looked through the tests and all other files besides the actual backend, and they looked good. I got scared by the amount of changes in the backend, so I wanted to reply here before I took more time reading. What is the original motivation here? Performance? Maintenance burden? There is a cost to requiring users to install a binary, plus the (admittedly small-looking) migration cost they will have to go through, so I want to understand the tradeoffs. Can you also give us a roadmap as we review this? What are the parts that are most contentious that I should really pay attention to? Were there any particular design decisions you weren't sure about that you want to call out/ask about?
---
@NickCrews sure, happily. thanks for taking the time. I guess the motivation is two-fold, one is performance. The current implementation needs to go through both row objects and pandas, which is frankly hard not to beat. Another is standardization on adbc as the way ibis talks to the backends, similar to how it relies on sqlglot for sql conversion for most of the backends now. I have to admit I'm not much of a user of the ibis mysql backend myself, I just picked a random backend and went to work on it.
yes, there is, but to be fair, in this particular instance users already had to provide a system-level dependency in order for mysqldb to work, so the only remaining issue is that now they have to install a different binary with a different "package manager" (dbc). If we were using something pure-python like pymysql before, this would be more of a newly introduced inconvenience. One driver of so many changes in the tests was that errors are being classified differently, which is technically a breaking change, but I assumed we can look past it. I'll leave PR comments for all other workaround solutions I had to employ.
```python
        sqlite3.InterfaceError,
    ),
)
@pytest.mark.notyet(
```
this is a regression due to a quirk in mysql adbc behavior. I think it's a bug, I have a ticket open on the driver side here.
```python
# nan can not be used with MySQL
df = df.replace(float("nan"), None)
ncols = len(schema)
# MySQL has a 65535 prepared statement placeholder limit.
```
this is already fixed in mysql driver (here). we can drop batch_size calculation here once a new version of the driver is out.
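The workaround being discussed can be sketched roughly like this (a toy sketch with hypothetical names, not the backend's actual code): cap the number of rows per batched INSERT so that rows × columns never exceeds MySQL's 65535 placeholder limit.

```python
PLACEHOLDER_LIMIT = 65535  # MySQL's cap on prepared-statement placeholders


def rows_per_batch(ncols: int, limit: int = PLACEHOLDER_LIMIT) -> int:
    """Largest row count per INSERT such that rows * ncols <= limit."""
    if ncols <= 0:
        raise ValueError("need at least one column")
    return max(1, limit // ncols)


def batches(rows: list, ncols: int):
    """Split rows into chunks that each fit under the placeholder limit."""
    size = rows_per_batch(ncols)
    for start in range(0, len(rows), size):
        yield rows[start : start + size]
```

Once a driver release that handles the limit internally is out, this whole batch-size calculation can be dropped, as the comment above notes.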
```python
    if isinstance(col, pa.Array)
    else col.combine_chunks().storage
)
# All-null opaque columns (e.g., type_name=NULL) can't be cast
```
This conversion stuff here is the one I'm least confident about. Frankly, it's simply just enough hacks to pass the tests.
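To make the all-null special case above concrete, here is a stripped-down sketch of that fallback in plain Python (the function name and shape are hypothetical, and the real code operates on Arrow arrays rather than lists): an opaque column that carries no type name can only be decoded when it is entirely null.

```python
def decode_opaque(values, type_name):
    """Hypothetical fallback for opaque columns.

    When the driver reports no type name (type_name=NULL), there is nothing
    to cast to -- the only safe interpretation is an all-null column.
    """
    if type_name is None:
        if any(v is not None for v in values):
            raise TypeError("opaque column has data but no type_name to decode it")
        return [None] * len(values)
    # A real decoder keyed on type_name would run here; this sketch just
    # passes the values through untouched.
    return list(values)
```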
---
Thanks a lot @tokoko for the detailed response, that is super helpful. Re the performance: I realize this might be a big ask, but if it's easy, would you be willing to do a VERY rough benchmark of the difference? At first pass, I might guess that perf isn't that much of a motivator for me. MySQL is a row-oriented DB, so won't we have to do the row->column transformation somewhere anyway? Either it will be in the ADBC driver, or here in ibis. I might think that for a lot of queries, the time required is going to be dominated by the execution time within MySQL (as it slowly operates over rows), not by the conversion time. I also haven't seen any issues filed here where a user asks for better perf, but they very well might be experiencing it; please correct me/link if someone does have some examples. On the other hand, I would consider the following much larger motivators, if this PR does bring them:
### 1. Greater consistency between backends from the user experience

It does seem like this PR would bring that. Or at least it wouldn't hurt it.

### 2. Easier installation of the dependencies for users

Thanks for the clarification that we already did depend on a binary for mysql, I didn't know that. Do you have any sense of the diff in ease/portability between the installation stories of these two binaries? It looks like …

### 3. Reduced complexity of the ibis codebase

I took a look at the mysql backend changes you proposed. They look very well crafted, I really appreciate you taking the time. The comments and docstrings were the perfect level of explanation. I want to hear your thoughts on whether you think switching to ADBC would actually reduce the complexity of the ibis codebase. It looks like a similar amount of code as before. Some things were able to be removed, such as the datatype mapping. But then other complexities were added, like the parsing of … I see that in terms of line count, this is +404 and -399, with some added tests and some added lines in lockfiles, but you managed to entirely remove the 150-line datatypes.py, instead swapping this out for the added code needed to translate the arrow batches, e.g. _decode_opaque_storage(). That looks to me at this stage like the complexity is about the same as before, but would you disagree? Or is this an investment that other backends would also be able to share? In general: how much is this actually going to help us unify the ibis codebase?
---
All really good questions, let me take them one at a time. I can do a benchmark a little later, no problem. I found this one but it's measuring write rather than read.
well, technically yes, you can't avoid row -> columnar conversion at some stage of the pipeline, but in practice it matters quite a lot where the conversion happens. with the current approach, you're creating a new python object for each row of the query output that needs to be garbage-collected as well, not to mention pandas conversion as an intermediate step. adbc drivers on the other hand do the conversion on the native side (most implementations are in go) and effectively only transfer pointers to the data over to the python side, because of how arrow works under the hood. there are other projects (like turbodbc) that use similar techniques to expose as little data as possible to the python vm. That being said, the actual impact of the performance boost depends on the type of query, as you hinted above. If a query needs to do a lot of processing on the server side and returns only a few rows, then the difference will be negligible, while a query that needs to output a huge number of rows to the caller will see a dramatic impact.
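As a toy illustration of the per-row cost being described (pure Python, nothing driver-specific): a DB-API style fetch materializes one tuple per row plus one object per cell, and the row-to-column transpose then touches every one of those cells again, whereas an Arrow-native driver hands back whole column buffers and keeps that per-object overhead out of the Python VM.

```python
# Toy model of a row-based fetchall(): one tuple per row, one object per cell.
rows = [(i, f"name-{i}") for i in range(1_000)]

# The row -> column transpose happens in pure Python and visits every cell;
# all the intermediate tuples then have to be garbage-collected.
ids, names = (list(col) for col in zip(*rows))
```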
It is very simple to install drivers and locate them by name from the client (especially compared to odbc). The main complexity comes from the fact that you need another tool to do it. In (my) ideal world all drivers would be conda packages installable with conda or pixi, but Columnar is working on improving adbc driver UX with a separate tool, so there's that. As long as you have dbc, managing drivers is straightforward.
exactly, the fact that you're asking a user to prepare an environment for ibis is not perfect, but at least with dbc, the instructions will be similar across most backends.
I deliberately did not convert metadata extraction to use the adbc api, simply to limit the scope of the changes in this PR. I can go ahead and include those changes here as well to illustrate what that would look like.
The way I look at it both adbc and ibis try to solve the same problem in this regard. granted, adbc doesn't really care about ibis type system, but even w/o adbc the physical conversion ibis does is still between source systems and arrow. I don't think there can be any issues inherent to ADBC (other than driver bugs). A particular driver might have a solution to a particular problem that differs from the one that's already in ibis, but that's more of a backward compat issue rather than something fundamentally problematic imho. Of course I can't really claim that all conversion troubles are going to go away since a lot of them stem from the "peculiarities" of the backend type systems. I think the most realistic expectation is that most of those conversions will be handled by adbc unless the type system itself makes it impossible. In other words, problems don't go away, but solutions will move to the adbc side.
The goal is to share most of this stuff across backends, yes (maybe we can have a base …)
---
@tokoko I think I am convinced that this is a good idea. I can do a slightly more thorough review, but in general it looks good. But before I ask you to put in any more time, I think we should get the opinions of other ibis maintainers who honestly have more background and history here. @gforsyth and @cpcloud, could you give a thumbs up if this looks like a good direction in general to you?
---
@tokoko do you know what the licensing story of the ADBC ecosystem is? In this discussion, a user wants us to switch away from mysqlclient because it isn't licensed permissively enough. Would switching to ADBC be a win here?
---
yes, both the driver manager and the mysql driver are Apache 2.0. Different drivers can have different licenses, but all of the free ones under the adbc-drivers org seem to be Apache 2.0.
---
@NickCrews As per your request, this migration looks helpful for our use case. The more permissive licensing of the components would make adoption straightforward. We're using a … To obtain a working Ibis MySQL backend against MariaDB on macOS, for example, I have something like this set in my shell:

```fish
set -gx OPENSSL_PREFIX (brew --prefix openssl@3)
set -gx MDB_PREFIX (brew --prefix mariadb-connector-c)
# Compile flags
set -gx CPPFLAGS "-I$OPENSSL_PREFIX/include -I$MDB_PREFIX/include -I$MDB_PREFIX/include/mariadb $CPPFLAGS"
# Linker flags
set -gx LDFLAGS "-L$OPENSSL_PREFIX/lib -L$MDB_PREFIX/lib $LDFLAGS"
# pkg-config
set -gx PKG_CONFIG_PATH "$OPENSSL_PREFIX/lib/pkgconfig:$MDB_PREFIX/lib/pkgconfig:$PKG_CONFIG_PATH"
```

If this is reduced to … and having …
---
I'm broadly in favor of switching over to ADBC connectors -- this was on our roadmap in the olden days. I'd like to get the ADBC connectors on conda-forge, at least, but installing …
---
Thanks all. Sounds good to move this forward then! I can take another more detailed review with just a few bits of feedback and then get this merged. I think we can skip the perf benchmarks I originally asked about, there are enough other reasons for the switch and I'm pretty confident the perf won't get worse. |
Description of changes
… mysqlclient. Export to Arrow no longer goes through row-based cursors and pandas. Managed to get rid of extra type wrangling.