@kmiller68
Contributor

Kotlin-compose-wasm is the longest-running test in JS3 by a pretty wide margin. However, after the first few iterations, almost all of its code is already running in each engine's most optimizing compiler tier.

This change reduces the iterations to 5 and adjusts the worst case count accordingly.
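
To illustrate how the iteration count and worst case count interact, here is a minimal sketch. It assumes that the first iteration is reported separately as First and that Worst is the mean of the slowest `worstCaseCount` remaining iterations. The helper name `summarizeIterations` and the timings are hypothetical and are not taken from the JetStream driver.

```javascript
// Hypothetical helper, not JetStream's actual code. Assumption: the first
// iteration is scored on its own, Worst averages the slowest
// `worstCaseCount` of the remaining iterations, and Average is their mean.
function summarizeIterations(timesMs, worstCaseCount) {
  if (worstCaseCount < 1 || worstCaseCount > timesMs.length - 1) {
    throw new RangeError("need at least worstCaseCount + 1 iterations");
  }
  const [first, ...rest] = timesMs;
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  // Sort a copy so the slowest iterations come first.
  const slowestFirst = [...rest].sort((a, b) => b - a);
  return {
    first,
    worst: mean(slowestFirst.slice(0, worstCaseCount)),
    average: mean(rest),
  };
}

// Made-up timings (ms) for a 5-iteration run.
const times = [900, 420, 400, 380, 400];
const oneWorst = summarizeIterations(times, 1);
const twoWorst = summarizeIterations(times, 2);
console.log(oneWorst.worst, twoWorst.worst);
```

With only four post-warmup iterations, a worst case count of 1 makes Worst depend on a single sample, which is the trade-off weighed in the review below.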

@netlify

netlify bot commented Oct 23, 2025

Deploy Preview for webkit-jetstream-preview ready!

Name Link
🔨 Latest commit f9f5be3
🔍 Latest deploy log https://app.netlify.com/projects/webkit-jetstream-preview/deploys/68fa6ea2ee4b1d0008186c52
😎 Deploy Preview https://deploy-preview-225--webkit-jetstream-preview.netlify.app

@kmiller68
Contributor Author

CC @eqrion

@kmiller68 kmiller68 requested a review from danleh October 23, 2025 18:06
Contributor

@danleh danleh left a comment

Fine by me. I also briefly checked that the scores on x64 with all three shells are roughly the same. I am however a bit worried about noise/flaky results with worstCaseCount: 1, how about compromising on 10 iterations with 2 worst case?

"./simple/doxbee-promise.js",
],
tags: ["default", "js", "promise", "Simple"],
iterations: 80,
Contributor

Are the doxbee changes intended (or accidental)?

Contributor Author

Whoops, accidental, will revert.

@kmiller68
Contributor Author

> I am however a bit worried about noise/flaky results with worstCaseCount: 1, how about compromising on 10 iterations with 2 worst case?

That would still make it quite long (~4 sec) on my M2 MBP, and probably a lot worse on phones and older CPUs. Let me run a variance analysis on the worst case count for this benchmark and get back to you. In general, though, it seems somewhat unlikely that Worst is dramatically noisier than Firsts, especially since the iteration time is so long.

@kmiller68
Contributor Author

Some numbers:

====================================================================================================
JetStream3 Benchmark Power Analysis
Significance Level (α): 0.05
Statistical Power: 0.8
Detection multiple: 1.02
====================================================================================================

Chrome:
Kotlin-compose-wasm
----------------------------------------------------------------------------------------------------

  Category: RawTime - Worst
    Sample size (n): 40
    Mean: 392.1590
    Std Dev: 8.8443
    CV: 2.26%
    Effect Size (Cohen's d): 0.887
    Sample size needed: 20.966

  Category: Worst
    Sample size (n): 20
    Mean: 12.7520
    Std Dev: 0.1661
    CV: 1.30%
    Effect Size (Cohen's d): 1.535
    Sample size needed: 7.747
    
Firefox:
Kotlin-compose-wasm
----------------------------------------------------------------------------------------------------

  Category: RawTime - Worst
    Sample size (n): 40
    Mean: 1361.0875
    Std Dev: 68.3518
    CV: 5.02%
    Effect Size (Cohen's d): 0.398
    Sample size needed: 99.940

  Category: Worst
    Sample size (n): 20
    Mean: 3.6805
    Std Dev: 0.1648
    CV: 4.48%
    Effect Size (Cohen's d): 0.447
    Sample size needed: 79.664

Safari: 
Kotlin-compose-wasm
----------------------------------------------------------------------------------------------------

  Category: RawTime - Worst
    Sample size (n): 40
    Mean: 491.7625
    Std Dev: 15.5822
    CV: 3.17%
    Effect Size (Cohen's d): 0.631
    Sample size needed: 40.386

  Category: Worst
    Sample size (n): 20
    Mean: 10.1751
    Std Dev: 0.2878
    CV: 2.83%
    Effect Size (Cohen's d): 0.707
    Sample size needed: 32.380
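
The reported effect sizes are consistent with Cohen's d computed as `mean * (detectionMultiple - 1) / stdDev`. The sketch below reproduces that for the Chrome "RawTime - Worst" row and estimates the per-group sample size with the normal approximation for a two-sided, two-sample test. This is an assumption about how the analysis works, not the script that produced the output: the function names are hypothetical and the z quantiles are hardcoded for α = 0.05 and power = 0.8. The approximation gives roughly 20 samples, slightly below the reported 20.966, which would be expected if the original tool solves against the t distribution instead.

```javascript
// Standard normal quantiles for the settings in the report header.
const Z_ALPHA_TWO_SIDED = 1.959964; // quantile at 1 - α/2 for α = 0.05
const Z_POWER = 0.841621;           // quantile at power = 0.8

// Effect size for detecting a change of (detectionMultiple - 1) * mean.
function cohensD(mean, stdDev, detectionMultiple) {
  return (mean * (detectionMultiple - 1)) / stdDev;
}

// Normal-approximation sample size per group for a two-sample comparison.
// Assumption: the report's own solver is t-based and lands slightly higher.
function approxSamplesPerGroup(d) {
  return (2 * (Z_ALPHA_TWO_SIDED + Z_POWER) ** 2) / d ** 2;
}

// Values copied from the Chrome "RawTime - Worst" row above.
const chromeRawWorst = { mean: 392.1590, stdDev: 8.8443 };
const d = cohensD(chromeRawWorst.mean, chromeRawWorst.stdDev, 1.02);
const n = approxSamplesPerGroup(d);
console.log(d.toFixed(3), n.toFixed(1));
```

Because the required sample size scales with 1/d², halving the coefficient of variation cuts the number of runs needed to detect a 2% change by about a factor of four.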
    

@kmiller68
Contributor Author

We can discuss offline, but those numbers don't seem too bad relative to other Worsts in the benchmark. Many of the line items need on the order of 1000s of samples.
