Skip to content

Drop MySQL dependencyΒ #145

@bmschmidt

Description

@bmschmidt

This is a big one that I put here partly just to see who's still interested in this repo!

I've been playing around a bit the last couple days with DuckDB, a columnar database that's the heir to MonetDB, which I had thought about for this project but never used.

Duck is much lighter than anything out there except SQLlite, but unlike SQLlite, does a column-oriented store more appropriate for the queries here because related blocks of memory will be close together.

The purpose of MySQL for this project has always been:

  1. Handle the details of JOIN queries
  2. Build massive B-tree indices that allow reasonably fast access to wordid pages and put all the bookids for a given wordid contiguously on disk so that you don't have to seek to a million different places on the hard drive.

Duck DB can do the first fine; and for the second, the builtin BRIN indexes turn out to be faster than MySQL if you can handle the fairly difficult work of sorting a billion records or so before loading it into duck. Having managed to sort the rate my professor bookworm master_bookcounts data in Apache Arrow feather format (that's a whole different story--this can takes days on MySQL for a trillion words, but I think I've got a decent O(N-log(N)) multi-pass on-disk sort going.), it performs better than MySQL on some standard queries.

Plus, it doesn't have the expensive request for large in-memory tables; and the actual on-disk files seems to be a bit smaller, even though we're using 4-byte ints instead of 3-byte ints.

SELECT SUM(count), date_year FROM master_bookcounts NATURAL JOIN fastcat WHERE wordid = 9 GROUP BY date_year;

MySQL: 16 rows in set (5.246 sec)
DuckDB: 3.65 s

SELECT date_year, department, SUM(count) FROM master_bookcounts NATURAL JOIN catalog WHERE wordid = 118 GROUP BY date_year, department;

8624 rows. 1 min 2 seconds in MySQL (Not fair because we've always grouped on integer keys, not text keys for department).
3.78 seconds in DuckDB (Still grouping on text keys!!)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions