Thanks for Criterion!
I spent some time understanding sample-size and measurement-time, which I believe I now have right. My questions concern statistical significance and may reflect my lack of prior knowledge of benchmarking best practices. Here they go.
https://bheisler.github.io/criterion.rs/book/analysis.html seems to make it clear that the samples of any given benchmark are taken with linearly increasing iteration counts: [d, 2d, 3d, ... Nd].
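To make sure I am reading that page correctly, here is a minimal sketch of the scheme as I understand it. The function names and the values of d and N are my own, chosen for illustration; this is not Criterion's actual code.

```rust
/// Iteration counts for a linear sampling scheme: sample i runs i * d iterations.
/// `d` and `sample_size` (N) are illustrative parameters, not Criterion internals.
fn linear_iteration_counts(d: u64, sample_size: u64) -> Vec<u64> {
    (1..=sample_size).map(|i| i * d).collect()
}

/// Total iterations across all samples: d * (1 + 2 + ... + N) = d * N * (N + 1) / 2.
fn total_iterations(d: u64, sample_size: u64) -> u64 {
    d * sample_size * (sample_size + 1) / 2
}

fn main() {
    let counts = linear_iteration_counts(10, 5);
    println!("{:?}", counts); // [10, 20, 30, 40, 50]
    println!("{}", total_iterations(10, 5)); // 150
}
```

If this is right, the total work grows quadratically with the sample count, which I assume is why sample-size and measurement-time interact the way they do.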
How does this linear scheme interact with the statistical analysis that Criterion performs, which can end up producing warnings like:
Found 10 outliers among 50 measurements (20.00%)
6 (12.00%) low severe
2 (4.00%) high mild
2 (4.00%) high severe
Or, what is the motivation for [d, 2d, 3d, ... Nd] if it is not there to satisfy a statistical property per se?
Is there already some place where the low/high and mild/severe labels are defined?
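For context, my current guess is that these labels come from something like Tukey's fences applied to the per-iteration time of each sample: "mild" beyond 1.5 × IQR from the quartiles and "severe" beyond 3 × IQR, with low/high indicating the side. The sketch below shows that guess; the `Label` enum, the `percentile` helper (linear interpolation) and the sample values are all hypothetical and may not match what Criterion actually does.

```rust
#[derive(Debug, PartialEq)]
enum Label {
    LowSevere,
    LowMild,
    Normal,
    HighMild,
    HighSevere,
}

/// Percentile of an already sorted slice, using linear interpolation between ranks.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = p / 100.0 * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo as f64)
}

/// Classify each value against fences at 1.5 * IQR (mild) and 3 * IQR (severe).
fn classify(samples: &[f64]) -> Vec<Label> {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let q1 = percentile(&sorted, 25.0);
    let q3 = percentile(&sorted, 75.0);
    let iqr = q3 - q1;
    let (low_severe, low_mild) = (q1 - 3.0 * iqr, q1 - 1.5 * iqr);
    let (high_mild, high_severe) = (q3 + 1.5 * iqr, q3 + 3.0 * iqr);
    samples
        .iter()
        .map(|&x| {
            if x < low_severe {
                Label::LowSevere
            } else if x < low_mild {
                Label::LowMild
            } else if x > high_severe {
                Label::HighSevere
            } else if x > high_mild {
                Label::HighMild
            } else {
                Label::Normal
            }
        })
        .collect()
}

fn main() {
    // Made-up per-iteration times (ns); the first two and last two are planted outliers.
    let times = [
        80.0, 90.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 120.0, 130.0,
    ];
    for (t, label) in times.iter().zip(classify(&times)) {
        println!("{t:>6.1} ns -> {label:?}");
    }
}
```

If the definition is different, or if the thresholds are applied to something other than the per-iteration averages, a pointer to where this is documented would be much appreciated.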