JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path by pmatos · Pull Request #5343 · FEX-Emu/FEX

pmatos · 2026-03-03T15:45:35Z

No description provided.

pmatos · 2026-03-03T15:48:39Z

I didn't manage to replicate the exact algorithm in the AdvSIMD routines due to a lack of registers, but I managed to rewrite it in a similar form with the available registers and got some good results:

  │ Operation │ vs ABI fallback │ vs Softfloat 80-bit │
  │ sin       │ 2.1x faster     │ 11.8x faster        │
  │ cos       │ 2.9x faster     │ 15.6x faster        │
  │ tan       │ 1.8x faster     │ 11.2x faster        │
  │ sincos    │ 2.4x faster     │ 25.3x faster        │

I am going to try to see if I can get similar results by jitting other f64 operations.
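For readers unfamiliar with this style of routine, the general shape (range reduction followed by an odd polynomial) can be sketched as below. This is only an illustration under stated assumptions: `approx_sin` is a hypothetical helper, it uses plain Taylor coefficients rather than the tuned coefficients a real AdvSIMD routine would use, and it is not the code emitted by this PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper for illustration only; this is not the code from the PR.
// Approximates sin(x) in double precision for moderate |x|.
double approx_sin(double x) {
  // pi split into the nearest double plus a small correction, so that
  // x - n*pi can be formed with less cancellation error.
  constexpr double kPiHi = 3.141592653589793116;
  constexpr double kPiLo = 1.2246467991473532e-16;
  constexpr double kInvPi = 0.31830988618379067154;

  // Taylor coefficients of the r^3, r^5, ..., r^19 terms of sin(r).
  // A production routine would typically use minimax coefficients instead.
  constexpr double kCoeffs[] = {
      -1.0 / 6.0,
      1.0 / 120.0,
      -1.0 / 5040.0,
      1.0 / 362880.0,
      -1.0 / 39916800.0,
      1.0 / 6227020800.0,
      -1.0 / 1307674368000.0,
      1.0 / 355687428096000.0,
      -1.0 / 121645100408832000.0,
  };

  // This simple reduction loses accuracy for very large inputs.
  assert(std::fabs(x) < 1.0e6);

  // Range reduction: x = n*pi + r with |r| <= pi/2 (approximately).
  const double n = std::nearbyint(x * kInvPi);
  double r = std::fma(-n, kPiHi, x);
  r = std::fma(-n, kPiLo, r);

  // Horner evaluation in r^2: p = c3 + c5*r^2 + ... + c19*r^16.
  const double r2 = r * r;
  double p = 0.0;
  for (int i = 8; i >= 0; --i) {
    p = std::fma(p, r2, kCoeffs[i]);
  }

  // sin(r) = r + r^3 * p, and sin(x) = (-1)^n * sin(r).
  const double s = std::fma(r * r2, p, r);
  return std::fmod(n, 2.0) != 0.0 ? -s : s;
}
```

Every step maps onto a handful of fused multiply-add instructions, which is why a routine of this shape is a candidate for emitting directly from the JIT instead of calling out to a library function.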

Sonicadvance1 · 2026-03-03T21:29:44Z

AmpereA1A:

Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 13208488520, 250000000, 52.83, 52.83 nanosecond, 18927222.42
FSIN, 11246190780, 250000000, 44.98, 44.98 nanosecond, 22229749.16
FCOS, 11428738260, 250000000, 45.71, 45.71 nanosecond, 21874680.68

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 7781475640, 250000000, 31.13, 31.13 nanosecond, 32127582.42
FSIN, 3908226340, 250000000, 15.63, 15.63 nanosecond, 63967636.02
FCOS, 4101809360, 250000000, 16.41, 16.41 nanosecond, 60948712.64

Cortex-A720/Radxa:
Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 16818118130, 250000000, 67.27, 67.27 nanosecond, 14864921.16
FSIN, 14680384340, 250000000, 58.72, 58.72 nanosecond, 17029526.90
FCOS, 15025618350, 250000000, 60.10, 60.10 nanosecond, 16638250.37

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 21518495820, 250000000, 86.07, 86.07 nanosecond, 11617912.43
FSIN, 3161922390, 250000000, 12.65, 12.65 nanosecond, 79065824.26
FCOS, 3327755660, 250000000, 13.31, 13.31 nanosecond, 75125708.00

Looks like Cortex-A720 is worse off in SINCOS because of the code emission. Would be interesting to keep it in the dispatcher and branch out so we don't annihilate icache. Kind of like the conversion operations.
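To make the regression explicit, the per-iteration times above can be turned into speedup factors. The helper below is a hypothetical sketch (the names `Sample`, `speedup`, and `print_speedups` are not from FEX); the numbers are copied from the "Iter Time Average" column of the tables in this comment.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical record: one benchmark row, times in nanoseconds per iteration.
struct Sample {
  const char* cpu;
  const char* test;
  double before_ns;
  double after_ns;
};

// Speedup factor: values above 1.0 mean the "after" build is faster.
double speedup(double before_ns, double after_ns) {
  assert(before_ns > 0.0 && after_ns > 0.0);
  return before_ns / after_ns;
}

// Prints one line per row with the computed speedup factor.
void print_speedups() {
  const Sample samples[] = {
      {"AmpereA1A", "FSINCOS", 52.83, 31.13},
      {"AmpereA1A", "FSIN", 44.98, 15.63},
      {"AmpereA1A", "FCOS", 45.71, 16.41},
      {"Cortex-A720", "FSINCOS", 67.27, 86.07},
      {"Cortex-A720", "FSIN", 58.72, 12.65},
      {"Cortex-A720", "FCOS", 60.10, 13.31},
  };
  for (const Sample& s : samples) {
    std::printf("%-12s %-8s %.2fx\n", s.cpu, s.test, speedup(s.before_ns, s.after_ns));
  }
}
```

On these numbers, FSINCOS on the Cortex-A720 comes out below 1.0 (about 28% more time per iteration), while FSIN and FCOS on the same core are both above 4.5x.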

pmatos · 2026-03-17T17:15:07Z

> Looks like Cortex-A720 is worse off in SINCOS because of the code emission. Would be interesting to keep it in the dispatcher and branch out so we don't annihilate icache. Kind of like the conversion operations.

Ufff. Inside the dispatcher I could use more registers, so I tried to make the code closer to the AdvSIMD routines in the Arm optimized-routines repo.

I have done the same for other operations and will push them as separate PRs. If you run the same microbenchmarks as earlier, do you get better results?

Sonicadvance1 · 2026-03-17T18:46:55Z

Much better! So I guess the main question now is whether the precision difference causes real problems.
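One rough way to put a number on that precision difference is to compare a double-precision sine against a wider reference. x87 extended precision carries a 64-bit significand, while a double carries 53 bits. The sketch below is an assumption-laden illustration: `max_rel_diff_sin` is a hypothetical helper, it uses the host libm rather than FEX's JIT output, and it only gives a non-zero result where `long double` is wider than `double` on the host.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: largest relative difference between sin evaluated in
// double and in long double, sampled at `steps` evenly spaced points in [lo, hi].
long double max_rel_diff_sin(double lo, double hi, int steps) {
  assert(steps > 1 && lo < hi);
  long double worst = 0.0L;
  for (int i = 0; i < steps; ++i) {
    const double x = lo + (hi - lo) * i / (steps - 1);
    // Reference: the long double overload of std::sin.
    const long double ref = std::sin(static_cast<long double>(x));
    // Reduced precision: the double overload, widened afterwards.
    const long double f64 = static_cast<long double>(std::sin(x));
    if (ref != 0.0L) {
      worst = std::fmax(worst, std::fabs((f64 - ref) / ref));
    }
  }
  return worst;
}
```

A difference of this size says nothing by itself about whether a given game misbehaves; it only bounds how far a correctly rounded f64 result can sit from the wider one over the sampled range.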

A1A-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 16301669240, 250000000, 65.21, 65.21 nanosecond, 15335852.81
FSIN, 10976102700, 250000000, 43.90, 43.90 nanosecond, 22776754.81
FCOS, 11436870040, 250000000, 45.75, 45.75 nanosecond, 21859127.46
FSINCOS, 13527759940, 250000000, 54.11, 54.11 nanosecond, 18480517.18

A1A-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 5591198400, 250000000, 22.36, 22.36 nanosecond, 44713133.41
FSIN, 3363389100, 250000000, 13.45, 13.45 nanosecond, 74329788.37
FCOS, 4450470540, 250000000, 17.80, 17.80 nanosecond, 56173835.50
FSINCOS, 7663827020, 250000000, 30.66, 30.66 nanosecond, 32620778.02

Cortex-A720/Radxa-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 17921762970, 250000000, 71.69, 71.69 nanosecond, 13949520.50
FSIN, 14688639670, 250000000, 58.75, 58.75 nanosecond, 17019955.94
FCOS, 15022294320, 250000000, 60.09, 60.09 nanosecond, 16641931.96
FSINCOS, 16760394360, 250000000, 67.04, 67.04 nanosecond, 14916116.81

Cortex-A720/Radxa-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 4265674550, 250000000, 17.06, 17.06 nanosecond, 58607377.82
FSIN, 3624740340, 250000000, 14.50, 14.50 nanosecond, 68970457.62
FCOS, 3544529050, 250000000, 14.18, 14.18 nanosecond, 70531231.79
FSINCOS, 6225296450, 250000000, 24.90, 24.90 nanosecond, 40158730.11
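The derived columns in these dumps follow directly from the first two numeric fields, which makes rows easy to sanity-check. The sketch below assumes the row layout shown in the header line above and a cycle counter frequency of 1000000000 as reported; `BenchRow`, `parse_row`, `cycles_average`, and `iterations_per_second` are hypothetical names, not part of the benchmark tool.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical record for the first three fields of one benchmark row.
struct BenchRow {
  char name[32];
  unsigned long long total_cycles;
  unsigned long long iterations;
};

// Parses "NAME, total_cycles, iterations, ..." and ignores the remaining fields.
bool parse_row(const char* line, BenchRow* out) {
  assert(line != nullptr && out != nullptr);
  return std::sscanf(line, "%31[^,], %llu, %llu", out->name, &out->total_cycles,
                     &out->iterations) == 3;
}

// "Cycles Average" column: total cycles divided by iteration count.
double cycles_average(const BenchRow& row) {
  return static_cast<double>(row.total_cycles) / static_cast<double>(row.iterations);
}

// "iterations/Second" column, given the cycle counter frequency in Hz.
double iterations_per_second(const BenchRow& row, double counter_hz) {
  return static_cast<double>(row.iterations) * counter_hz /
         static_cast<double>(row.total_cycles);
}
```

Feeding it the last FSINCOS row above reproduces the reported 24.90 cycles per iteration; because the counter runs at 1 GHz, cycles and nanoseconds coincide in these tables.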

Sonicadvance1 · 2026-03-17T19:33:57Z

Oop, looks like something about this implementation breaks Mirror's Edge from running.

pmatos · 2026-03-17T20:16:41Z

> Oop, looks like something about this implementation breaks Mirror's Edge from running.

That's odd - thanks for pointing that out.

pmatos changed the title ~~F64 sin cos tan~~ JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path Mar 6, 2026

pmatos added 3 commits March 16, 2026 15:43

6e82efe JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path
dfeaff4 asm_tests: JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path
29de640 instcountci: JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path

pmatos force-pushed the f64-sin-cos-tan branch from e6d6b85 to 29de640 Compare March 17, 2026 17:13

pmatos wants to merge 3 commits into FEX-Emu:main from pmatos:f64-sin-cos-tan
