---
title: "duckplyr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{duckplyr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(duckplyr)
```

## Design principles

The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**.
These two facts create a tension:

-   When using dplyr, we are not used to explicitly collecting results: data frames are eager by default.
    Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
    Therefore, _duckplyr needs eagerness_!

-   The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
    Therefore, _duckplyr needs laziness_!

As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports **deferred evaluation**.

> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed."

If the duckplyr data.frame is accessed by...

-   not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
-   duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()`, for instance).

Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
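
As a sketch of these two access paths (assuming the `duckdb_tibble()` constructor; the printed output depends on your duckplyr version, so it is omitted):

```r
library(duckplyr)

# The pipeline below is only recorded; DuckDB has not computed anything yet.
result <- duckdb_tibble(x = 1:5, y = c(1, 2, 1, 2, 1)) |>
  mutate(z = x + y) |>
  filter(z > 2)

# Accessing the data from the "outside" (printing, nrow(), ggplot2, ...)
# triggers the ALTREP callback, and DuckDB runs the query:
nrow(result)

# From the "inside", collect() materializes explicitly:
collect(result)
```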

Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all RAM?
Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package).
A funneled data.frame cannot be materialized by default; it needs a call to a `compute()` function.
By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumably large) are _funneled_.

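As a sketch of working with a funneled frame (the funneling API is still evolving in the development version; `compute()` is the dplyr generic that duckplyr implements, and `lineitem.parquet` is a hypothetical file):

```r
library(duckplyr)

# Frames read from Parquet are funneled: they are not materialized
# automatically, which protects your RAM from huge results.
lineitem <- read_parquet_duckdb("lineitem.parquet")

# Build up a (small) aggregation lazily...
totals <- lineitem |>
  group_by(l_returnflag) |>
  summarise(n = n())

# ...then materialize it explicitly with compute():
totals <- compute(totals)
```
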
## How to use duckplyr

### For normal sized data (instead of dplyr)

To replace dplyr with duckplyr, you can either

- load duckplyr and then keep your pipeline as is.

```r
library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr", quiet = TRUE)
```

- convert individual data.frames to duck frames, which allows you to control their automatic materialization parameters. To do that, use `duckdb_tibble()`, `as_duckdb_tibble()`, or read data with `read_*()` functions like `read_csv_duckdb()`.
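
For instance (a minimal sketch; `flights.csv` is a hypothetical file path):

```r
library(duckplyr)

# Create a duck frame from vectors:
ducks <- duckdb_tibble(a = 1:3, b = c("x", "y", "z"))

# Convert an existing data.frame:
mtcars_ducks <- as_duckdb_tibble(mtcars)

# Read data straight into DuckDB, skipping an in-memory copy:
# flights <- read_csv_duckdb("flights.csv")
```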

In both cases, if an operation cannot be performed by duckplyr (see `vignette("limits")`), it will be outsourced to dplyr.
You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
You can disable fallbacks by turning off automatic materialization.
In that case, if an operation cannot be performed by duckplyr, your code will error.

### For large data (instead of dbplyr)

With large datasets, you want:

- input data in an efficient format, like Parquet files. Therefore, you might read data with `read_parquet_duckdb()`.
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
- output that does not clutter all the memory. Therefore, you can make use of these features:
    - funneling (see vignette TODO ADD CURRENT NAME) to disable automatic materialization completely, or only beyond a certain output size.
    - computation to files using `compute_parquet()` or `compute_csv()`.
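
Putting these pieces together, a large-data pipeline might look like the following sketch (the Parquet file paths and column names are hypothetical):

```r
library(duckplyr)

# Read lazily from Parquet; the data stays out of R's memory.
sales <- read_parquet_duckdb("sales.parquet")

# DuckDB optimizes and executes the whole pipeline...
by_region <- sales |>
  group_by(region) |>
  summarise(total = sum(amount))

# ...and writes the result to a file, so it never clutters RAM:
compute_parquet(by_region, "sales_by_region.parquet")
```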

A drawback of analyzing large data with duckplyr is that the limitations of duckplyr (unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks, since fallbacks to dplyr necessitate putting data into memory.

## How to improve duckplyr

- telemetry
- report issues, contribute