The duckplyr package aims to provide a fully compatible drop-in replacement for dplyr.
48
-
All operations, R functions, and data types that are supported by dplyr should work in an identical way with duckplyr.
49
-
This is achieved in two ways:
48
+
Currently, only a carefully selected subset of dplyr's operations, R functions, and R data types is implemented (see `vignette("limits")`).
49
+
Whenever a request cannot be handled by DuckDB, duckplyr falls back to dplyr.
50
50
51
-
- A carefully selected subset of dplyr operations, R functions, and R data types are implemented in DuckDB, focusing on faithful translation.
52
-
- When DuckDB does not support an operation, duckplyr falls back to dplyr, guaranteeing identical behavior.
53
-
54
-
## DuckDB mode
51
+
## A pipeline directly supported by duckplyr
55
52
56
53
The following operation is supported by duckplyr:
57
54
@@ -70,18 +67,18 @@ duckdb %>%
70
67
explain()
71
68
```
72
69
73
-
The plan shows three operations:
70
+
The plan shows three **operations**:
74
71
75
-
- a data frame scan (the input),
72
+
- a data frame scan (the input),
76
73
- a sort operation,
77
74
- a projection (adding the `b` column and removing the `a` column).
78
75
79
-
Each operation is supported by DuckDB.
80
-
The resulting object contains a plan for the entire pipeline that is executed lazily, only when the data is needed.
76
+
Because each operation is supported by DuckDB, the resulting object contains a **plan for the entire pipeline**.
77
+
The plan is only executed when the data is needed, i.e. lazily (see `vignette("prudence")`).
81
78
82
-
## Relation objects
79
+
### Relation objects
83
80
84
-
DuckDB accepts a tree of interconnected _relation objects_ as input.
81
+
DuckDB accepts a tree of interconnected *relation objects* as input.
85
82
Each relation object represents a logical step of the execution plan.
86
83
The duckplyr package translates dplyr verbs into relation objects.
87
84
@@ -101,7 +98,7 @@ duckplyr::last_rel()
101
98
102
99
The `last_rel()` function now shows a relation that describes the logical plan for executing the whole pipeline.
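
As a rough illustration of the fully supported case, the sketch below builds a pipeline with the same three steps named in the plan (sort, add `b`, drop `a`) and inspects it. The input `duckdb_tibble(a = 3:1)` and the object name `duck` are assumptions made for this example; the vignette's own input is not visible in this excerpt.

```r
library(dplyr)

# Hypothetical input data, chosen only for illustration.
duck <-
  duckplyr::duckdb_tibble(a = 3:1) %>%
  arrange(a) %>%
  mutate(b = a + 1) %>%
  select(-a)

# Inspect the relation object recorded for the most recent operation.
duckplyr::last_rel()

# Print the DuckDB plan for the pipeline.
duck %>%
  explain()

# Materialize the result explicitly.
result <- collect(duck)
```

Nothing in this sketch requires a fallback, so all three verbs should end up in a single plan.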
103
100
104
-
## Help from dplyr
101
+
## A pipeline with functionality not directly supported by duckplyr
105
102
106
103
Using a custom function with a side effect is not supported by DuckDB and triggers a dplyr fallback:
107
104
@@ -118,7 +115,7 @@ fallback <-
118
115
select(-a)
119
116
```
120
117
121
-
The `verbose_plus_one()` function is not supported by DuckDB, so the `mutate()` step is forwarded to dplyr and already executed (eagerly) when the pipeline is defined.
118
+
The `verbose_plus_one()` function is not supported by DuckDB, so the `mutate()` step is handled by dplyr and already executed when the pipeline is defined, i.e. eagerly.
122
119
This is confirmed by the `last_rel()` function:
123
120
124
121
```{r}
@@ -148,30 +145,26 @@ duckplyr::last_rel()
148
145
149
146
The `last_rel()` function confirms that only the final `select()` is handled by DuckDB again.
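
The fallback case can be sketched in the same way. The body of `verbose_plus_one()` below is an assumption: the text only says that the function has a side effect, so a `message()` call stands in for it here, and the input data is again hypothetical.

```r
library(dplyr)

# Assumed definition: any side effect (here, a message) prevents
# translation to DuckDB.
verbose_plus_one <- function(x) {
  message("Adding one to ", paste(x, collapse = ", "))
  x + 1
}

# arrange() can be handled by DuckDB, mutate() needs dplyr,
# and select() can be handled by DuckDB again.
fallback <-
  duckplyr::duckdb_tibble(a = 3:1) %>%
  arrange(a) %>%
  mutate(b = verbose_plus_one(a)) %>%
  select(-a)

# Inspect what DuckDB was asked to do for the last step.
duckplyr::last_rel()

result <- collect(fallback)
```

Because the `mutate()` step runs in dplyr, the side effect is expected to occur while the pipeline is being defined rather than when the result is collected.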
150
147
151
-
## Enforce DuckDB operation
152
-
153
-
For any duck frame, one can control the automatic materialization.
154
-
For fallbacks to dplyr, automatic materialization must be allowed for the frame at hand, as dplyr necessitates eager evaluation.
155
-
156
-
Therefore, by making a data frame frugal, one can ensure a pipeline will error when a fallback to dplyr would have normally happened.
157
-
See `vignette("prudence")` for details.
158
-
159
-
By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.
160
-
161
148
## Configure fallbacks
162
149
163
150
Using the `fallback_sitrep()` and `fallback_config()` functions you can examine and change settings related to fallbacks.
164
151
165
152
- You can choose to make fallbacks verbose with `fallback_config(info = TRUE)`.
166
153
167
-
- You can change settings related to logging and reporting fallback to duckplyr development team to inform their work.
154
+
- You can change settings related to logging and to reporting fallbacks to the duckplyr development team to inform their work. See `vignette("telemetry")`.
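
A minimal sketch of examining and changing these settings, using only the two functions and the `info` argument named above. Whether a changed setting persists beyond the current session is not covered in this excerpt, so check `?fallback_config` before relying on it.

```r
# Show the current fallback-related settings.
duckplyr::fallback_sitrep()

# Ask duckplyr to announce each fallback with a message.
duckplyr::fallback_config(info = TRUE)

# Show the settings again to see the change.
duckplyr::fallback_sitrep()
```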
155
+
156
+
### Enforcing DuckDB operation
157
+
158
+
For any duck frame, one can control the automatic materialization.
159
+
For fallbacks to dplyr, automatic materialization must be allowed for the frame at hand, as dplyr necessitates eager evaluation.
160
+
161
+
Therefore, by making a data frame frugal, one can ensure a pipeline will error when a fallback to dplyr would have normally happened. See `vignette("prudence")`.
168
162
169
-
See `vignette("telemetry")` for details.
163
+
By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.
170
164
171
165
## Conclusion
172
166
173
167
The fallback mechanism in duckplyr allows for a seamless integration of dplyr verbs and R functions that are not supported by DuckDB.
174
168
It is transparent to the user and only triggers when necessary.
175
169
With small or medium-sized data sets, it will not even be noticeable in most settings.
176
170
177
-
See `vignette("large")` for techniques for working with large data, `vignette("limits")` for the currently implemented translations, `vignette("prudence")` for details on controlling fallback behavior, and `vignette("telemetry")` for the automatic reporting of fallback situations.