Commit d58973d

alamb authored and comphead committed
Docs: Update the crate configuration / build settings page (apache#17038)
* Docs: Update the crate configuration / build settings page
* Update docs/source/user-guide/crate-configuration.md

Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>
1 parent 75e9192 commit d58973d

1 file changed

+54
-24
lines changed

docs/source/user-guide/crate-configuration.md

Lines changed: 54 additions & 24 deletions
@@ -19,18 +19,19 @@
1919

2020
# Crate Configuration
2121

22-
This section contains information on how to configure DataFusion in your Rust
23-
project. See the [Configuration Settings] section for a list of options that
24-
control DataFusion's behavior.
22+
This section contains information on how to configure builds of DataFusion in
23+
your Rust project. The [Configuration Settings] section lists options that
24+
control additional aspects of DataFusion's runtime behavior.
2525

2626
[configuration settings]: configs.md
2727

28-
## Add latest non published DataFusion dependency
28+
## Using the nightly DataFusion builds
2929

3030
DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process)
3131

32-
If you would like to test out DataFusion changes which are merged but not yet
33-
published, Cargo supports adding dependency directly to GitHub branch:
32+
If you would like to use or test versions of the DataFusion code which are
33+
merged but not yet published, you can use Cargo's [support for adding
34+
dependencies] directly to a GitHub branch:
3435

3536
```toml
3637
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"}
@@ -50,22 +51,58 @@ datafusion = { git = "https://github.com/apache/datafusion", branch = "main", de
5051

5152
More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies)
5253

53-
## Optimized Configuration
54+
## Optimizing Builds
5455

55-
For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
56-
worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
56+
Here are several suggestions to get the Rust compiler to produce faster code when
57+
compiling DataFusion. Note that these changes may increase compile time and
58+
binary size.
5759

58-
```toml
59-
[dependencies]
60-
datafusion = { version = "22.0" }
61-
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
62-
snmalloc-rs = "0.3"
60+
### Generate Code with CPU Specific Instructions
61+
62+
By default, the Rust compiler produces code that runs on a wide range of CPUs,
63+
but may not take advantage of all the features of your specific CPU (such as
64+
certain [SIMD instructions]). This is especially true for x86_64 CPUs, where the
65+
default target is `x86_64-unknown-linux-gnu`, which only guarantees support for
66+
the `SSE2` instruction set. DataFusion can benefit from the more advanced
67+
instructions in `AVX2` and `AVX512` to speed up operations like filtering,
68+
aggregation, and joins. To tell the Rust compiler to use these instructions, set
69+
the `RUSTFLAGS` environment variable to specify a more specific target CPU.
6370

71+
We recommend setting `target-cpu` to at least `avx2`, or preferably to
72+
`native` (whatever the current CPU is). For example, to build and run DataFusion
73+
with optimizations for your current CPU:
74+
75+
```shell
76+
RUSTFLAGS='-C target-cpu=native' cargo run --release
77+
```
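One way to confirm that the flag took effect is to print which SIMD features the compiler was allowed to assume. The following is a minimal sketch and not part of the DataFusion documentation; the helper name `compiled_features` is made up for illustration. It relies only on the standard `cfg!(target_feature = "...")` macro, which is evaluated at compile time.

```rust
/// Hypothetical helper (not a DataFusion API): reports which SIMD target
/// features were enabled at compile time, for example through
/// `RUSTFLAGS='-C target-cpu=native'`.
fn compiled_features() -> Vec<(&'static str, bool)> {
    vec![
        // `cfg!` expands to a `true`/`false` literal during compilation.
        ("sse2", cfg!(target_feature = "sse2")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}

fn main() {
    for (name, enabled) in compiled_features() {
        let state = if enabled { "enabled" } else { "not enabled" };
        println!("{name}: {state}");
    }
}
```

Building this once with and once without the `RUSTFLAGS` setting shown above should make the difference visible on an AVX2-capable x86_64 machine.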
78+
79+
[simd instructions]: https://en.wikipedia.org/wiki/SIMD
80+
81+
### Enable Link Time Optimization / Single Codegen Unit
82+
83+
You can potentially improve your performance by compiling DataFusion into a
84+
single codegen unit which gives the Rust compiler more opportunity to optimize
85+
across crate boundaries. To do so, modify your project's `Cargo.toml` to include
86+
`lto = true` and `codegen-units = 1` as shown below. Beware that using a single
87+
codegen unit _significantly_ increases `--release` build times.
88+
89+
```toml
6490
[profile.release]
6591
lto = true
6692
codegen-units = 1
6793
```
6894

95+
### Alternate Allocator: `snmalloc`
96+
97+
You can also use the [snmalloc-rs](https://crates.io/crates/snmalloc-rs) crate as
98+
the memory allocator for DataFusion to improve performance. To do so, add the
99+
dependency to your `Cargo.toml` as shown below.
100+
101+
```toml
102+
[dependencies]
103+
snmalloc-rs = "0.3"
104+
```
105+
69106
Then, in `main.rs`, update the memory allocator with the code below after your imports:
70107

71108
<!-- Note can't include snmalloc-rs in a runnable example, because it takes over the global allocator -->
@@ -82,17 +119,10 @@ async fn main() -> datafusion::error::Result<()> {
82119
}
83120
```
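The hunk above elides the allocator lines themselves. As a sketch of the mechanism only: Rust selects the process-wide allocator through a static marked `#[global_allocator]`. The example below is an assumption-laden stand-in that wraps the standard `System` allocator in a hypothetical `CountingAlloc` type so that it runs without extra dependencies. With `snmalloc-rs`, the static would instead use the allocator type that crate exports (`snmalloc_rs::SnMalloc`, as far as its README shows).

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in allocator: forwards to the system allocator and
/// counts how many allocations were requested.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This attribute is what replaces the allocator for the whole program.
// Only one such static may exist in a binary.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations made through the global allocator so far.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}

fn main() {
    let before = allocation_count();
    let buffer: Vec<u64> = Vec::with_capacity(1024);
    // Keep the compiler from optimizing the allocation away.
    std::hint::black_box(&buffer);
    let after = allocation_count();
    println!("allocations made for the buffer: {}", after - before);
}
```

Because the attribute applies to the whole binary, it belongs in the application crate rather than in a library.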
84121

85-
Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally
86-
with `native` or at least `avx2`.
87-
88-
```shell
89-
RUSTFLAGS='-C target-cpu=native' cargo run --release
90-
```
91-
92-
## Enable backtraces
122+
## Enable Backtraces
93123

94-
By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error,
95-
like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file:
124+
By default, DataFusion returns errors as a plain text message. You can enable more verbose details about the error,
125+
such as backtraces, by adding the `backtrace` feature to your `Cargo.toml` file like this:
96126

97127
```toml
98128
datafusion = { version = "31.0.0", features = ["backtrace"]}
