JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path#5343

Open
pmatos wants to merge 3 commits into FEX-Emu:main from pmatos:f64-sin-cos-tan

Conversation

@pmatos
Collaborator

@pmatos pmatos commented Mar 3, 2026

No description provided.

@pmatos
Collaborator Author

pmatos commented Mar 3, 2026

I didn't manage to replicate the exact algorithm in the advsimd routines due to a lack of registers, but I managed to rewrite it in a similar form using the available registers and got some good results:

  │ Operation │ vs ABI fallback │ vs Softfloat 80-bit │
  │ sin       │ 2.1x faster     │ 11.8x faster        │
  │ cos       │ 2.9x faster     │ 15.6x faster        │
  │ tan       │ 1.8x faster     │ 11.2x faster        │
  │ sincos    │ 2.4x faster     │ 25.3x faster        │

I am going to try to see if I can get similar results by jitting other f64 operations.
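
For readers unfamiliar with this kind of routine, here is a minimal sketch of the general shape: reduce the argument to a small interval, evaluate an odd polynomial, and fix up the sign. It is not the code emitted by this PR and not the Arm optimized-routines implementation. The name `SketchSin` is hypothetical, and plain Taylor coefficients are an assumption made for readability (production routines normally use tuned coefficients and a multi-part representation of pi).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not FEX code: sin(x) for moderate |x| using
// range reduction followed by an odd Taylor polynomial.
double SketchSin(double x) {
  assert(std::isfinite(x));
  constexpr double Pi = 3.14159265358979323846;
  // n is the nearest integer to x / pi, so r = x - n*pi lies in [-pi/2, pi/2].
  // A single-constant reduction like this loses accuracy for large |x|.
  const double n = std::nearbyint(x / Pi);
  const double r = x - n * Pi;
  const double r2 = r * r;
  // Taylor terms up to r^15, evaluated with Horner's scheme in r^2.
  double p = -1.0 / 1307674368000.0; // -1/15!
  p = p * r2 + 1.0 / 6227020800.0;   // +1/13!
  p = p * r2 - 1.0 / 39916800.0;     // -1/11!
  p = p * r2 + 1.0 / 362880.0;       // +1/9!
  p = p * r2 - 1.0 / 5040.0;         // -1/7!
  p = p * r2 + 1.0 / 120.0;          // +1/5!
  p = p * r2 - 1.0 / 6.0;            // -1/3!
  const double s = r + r * r2 * p;
  // sin(x) = (-1)^n * sin(r): flip the sign when n is odd.
  return (std::fmod(n, 2.0) != 0.0) ? -s : s;
}
```

Each Horner step maps onto one fused multiply-add, so the number of live coefficients and temporaries is what drives register pressure in a routine shaped like this.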

@Sonicadvance1
Member

AmpereA1A:

Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 13208488520, 250000000, 52.83, 52.83 nanosecond, 18927222.42
FSIN, 11246190780, 250000000, 44.98, 44.98 nanosecond, 22229749.16
FCOS, 11428738260, 250000000, 45.71, 45.71 nanosecond, 21874680.68

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 7781475640, 250000000, 31.13, 31.13 nanosecond, 32127582.42
FSIN, 3908226340, 250000000, 15.63, 15.63 nanosecond, 63967636.02
FCOS, 4101809360, 250000000, 16.41, 16.41 nanosecond, 60948712.64

Cortex-A720/Radxa:
Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 16818118130, 250000000, 67.27, 67.27 nanosecond, 14864921.16
FSIN, 14680384340, 250000000, 58.72, 58.72 nanosecond, 17029526.90
FCOS, 15025618350, 250000000, 60.10, 60.10 nanosecond, 16638250.37

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 21518495820, 250000000, 86.07, 86.07 nanosecond, 11617912.43
FSIN, 3161922390, 250000000, 12.65, 12.65 nanosecond, 79065824.26
FCOS, 3327755660, 250000000, 13.31, 13.31 nanosecond, 75125708.00

Looks like Cortex-A720 is worse off in SINCOS because of the code emission. It would be interesting to keep the routine in the dispatcher and branch out to it so we don't annihilate the icache, kind of like the conversion operations.
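
To put the tables above in relative terms, the per-iteration averages can be turned into speedup ratios. The sketch below only restates the posted numbers. The names `Speedup`, `Row`, `AmpereA1`, and `CortexA720` are hypothetical and chosen for illustration.

```cpp
#include <cassert>

// Speedup = time before / time after; a value below 1.0 is a regression.
inline double Speedup(double BeforeNs, double AfterNs) {
  assert(AfterNs > 0.0);
  return BeforeNs / AfterNs;
}

struct Row {
  const char* Test;
  double BeforeNs;
  double AfterNs;
};

// "Iter Time Average" values copied from the Before/After tables above.
const Row AmpereA1[] = {
  {"FSINCOS", 52.83, 31.13},
  {"FSIN", 44.98, 15.63},
  {"FCOS", 45.71, 16.41},
};
const Row CortexA720[] = {
  {"FSINCOS", 67.27, 86.07},
  {"FSIN", 58.72, 12.65},
  {"FCOS", 60.10, 13.31},
};
```

Calling `Speedup(Row.BeforeNs, Row.AfterNs)` on each entry gives roughly 2.9x for FSIN on the Ampere machine and roughly 0.78x for FSINCOS on the Cortex-A720, which is the regression called out above.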

@pmatos pmatos changed the title from "F64 sin cos tan" to "JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path" Mar 6, 2026
@pmatos
Collaborator Author

pmatos commented Mar 17, 2026

> Looks like Cortex-A720 is worse off in SINCOS because of the code emission. It would be interesting to keep the routine in the dispatcher and branch out to it so we don't annihilate the icache, kind of like the conversion operations.

Ufff - inside the dispatcher I could use more registers, so I tried to make the code closer to the advsimd routines in the Arm optimized-routines repo.

I have done the same for other operations and will push them as separate PRs. If you run the same microbenchmarks as earlier, do you get better results?

@Sonicadvance1
Member

Much better! So I guess the main question now is whether the precision difference causes real problems.

A1A-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 16301669240, 250000000, 65.21, 65.21 nanosecond, 15335852.81
FSIN, 10976102700, 250000000, 43.90, 43.90 nanosecond, 22776754.81
FCOS, 11436870040, 250000000, 45.75, 45.75 nanosecond, 21859127.46
FSINCOS, 13527759940, 250000000, 54.11, 54.11 nanosecond, 18480517.18

A1A-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 5591198400, 250000000, 22.36, 22.36 nanosecond, 44713133.41
FSIN, 3363389100, 250000000, 13.45, 13.45 nanosecond, 74329788.37
FCOS, 4450470540, 250000000, 17.80, 17.80 nanosecond, 56173835.50
FSINCOS, 7663827020, 250000000, 30.66, 30.66 nanosecond, 32620778.02

Cortex-A720/Radxa-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 17921762970, 250000000, 71.69, 71.69 nanosecond, 13949520.50
FSIN, 14688639670, 250000000, 58.75, 58.75 nanosecond, 17019955.94
FCOS, 15022294320, 250000000, 60.09, 60.09 nanosecond, 16641931.96
FSINCOS, 16760394360, 250000000, 67.04, 67.04 nanosecond, 14916116.81

Cortex-A720/Radxa-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 4265674550, 250000000, 17.06, 17.06 nanosecond, 58607377.82
FSIN, 3624740340, 250000000, 14.50, 14.50 nanosecond, 68970457.62
FCOS, 3544529050, 250000000, 14.18, 14.18 nanosecond, 70531231.79
FSINCOS, 6225296450, 250000000, 24.90, 24.90 nanosecond, 40158730.11
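
One way to start on the precision question raised above is to measure how far a double-precision result lands from a wider reference, in ULPs. This is a minimal sketch under stated assumptions, not the project's test harness. `UlpError`, `MaxUlpError`, and `LibmSin` are hypothetical names, and the host libm stands in for the values the JIT path would produce. The width of `long double` depends on the platform (80-bit on x86-64 Linux, 128-bit on AArch64 Linux), so the reference is not necessarily the x87 format.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Error of a double result against a long double reference, measured in
// units of the spacing between adjacent doubles near the reference.
double UlpError(double Result, long double Reference) {
  const double Ref = static_cast<double>(Reference);
  const double Next = std::nextafter(Ref, std::numeric_limits<double>::infinity());
  const long double Ulp = static_cast<long double>(Next) - Ref;
  const long double Diff = std::fabs(static_cast<long double>(Result) - Reference);
  return static_cast<double>(Diff / Ulp);
}

// Stand-in candidate: replace with the outputs of the implementation under test.
double LibmSin(double x) {
  return std::sin(x);
}

// Largest ULP error of Candidate over Samples evenly spaced points in [Lo, Hi].
template <typename F>
double MaxUlpError(F Candidate, double Lo, double Hi, int Samples) {
  assert(Samples > 1 && Lo < Hi);
  double Worst = 0.0;
  for (int i = 0; i < Samples; ++i) {
    const double x = Lo + (Hi - Lo) * i / (Samples - 1);
    const long double Reference = std::sin(static_cast<long double>(x));
    const double Err = UlpError(Candidate(x), Reference);
    if (Err > Worst) {
      Worst = Err;
    }
  }
  return Worst;
}
```

A call such as `MaxUlpError(LibmSin, -1.0, 1.0, 1001)` reports the worst case over a small interval. Near zeros of the function the ULP shrinks, so inputs close to multiples of pi are where an absolute error shows up as a large ULP count.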

@Sonicadvance1
Member

Oops, looks like something about this implementation stops Mirror's Edge from running.

@pmatos
Collaborator Author

pmatos commented Mar 17, 2026

> Oops, looks like something about this implementation stops Mirror's Edge from running.

That's odd - thanks for pointing that out.
