Conversation


@majocha majocha commented Aug 13, 2025

Proposed Changes

Use synchronous, single-thread trampolines whenever recursion is deep enough.
Cache exceptions in an ExceptionDispatchInfo to cut short deep stack traces and enable fast recovery from deep call stacks.

This builds and the test passes without a StackOverflowException.
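The exception-caching idea can be sketched roughly as below. This is a minimal illustration only; `CachedFailure` and `capture` are made-up names, not types from this PR:

```fsharp
open System.Runtime.ExceptionServices

// Illustrative sketch: capture an exception once in an ExceptionDispatchInfo
// so that rethrowing it later preserves the original stack trace without
// re-walking (and re-growing) a deep call stack on every hop.
type CachedFailure(edi: ExceptionDispatchInfo) =
    member _.Rethrow() : 'T =
        edi.Throw()
        Unchecked.defaultof<'T> // unreachable; satisfies the type checker

let capture (f: unit -> 'T) =
    try
        Ok(f ())
    with ex ->
        Error(CachedFailure(ExceptionDispatchInfo.Capture ex))
```

`ExceptionDispatchInfo.Throw` rethrows with the captured stack intact, which is what makes a short, cached rethrow path possible.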

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

Many todos here:

  • Build and tests pass locally
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • Implement the dynamic path
  • Use a trampoline for SetResult / SetException to make this work on old .NET Framework
  • Double-check that the exception caching with EDI works as intended
  • Clean up and organize the code
  • Come up with some way to dispose of trampolines that are no longer in use (is it worth it?)
  • Add more tests
  • Benchmark this
  • I have added necessary documentation (if appropriate)



majocha commented Aug 14, 2025

This is just a proof of concept. Most probably I'll rework it quite a lot before it's ready.


majocha commented Aug 17, 2025

This currently is blocked by an issue that possibly needs fixing elsewhere:

When you queue a lot of work items concurrently (for example with Task.Yield() in all the "for in ..." tests here), the following starts to throw in resumable code:

```fsharp
if sm.Data.Finished then
    if not sm.Data.Finished then failwith "corrupted state machine"
```

I'll try to make a minimal repro later.

This is probably the same as dotnet/fsharp#18853, with the difference being TaskSeq uses a reference type for state machine Data, hence the null there and zeroed Data struct here.


majocha commented Aug 21, 2025

> This currently is blocked by an issue that possibly needs fixing elsewhere:

Yeah, the bug is in my code after all :D

The thing is, AwaitUnsafeOnCompleted does not wait for the current MoveNext to finish before starting the next one.
That's why it's crucial to call it as the very last thing in the step. Any logic or assignments after AwaitUnsafeOnCompleted are not safe.
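The ordering constraint can be illustrated as follows (a comment-only sketch; the builder call follows the standard `AsyncTaskMethodBuilder.AwaitUnsafeOnCompleted` shape):

```fsharp
// Inside a MoveNext step: AwaitUnsafeOnCompleted may resume the state machine
// on another thread before this call even returns, so nothing may touch
// sm.Data after the call.
//
// UNSAFE:
//   sm.Data.MethodBuilder.AwaitUnsafeOnCompleted(&awaiter, &sm)
//   sm.Data.Finished <- false   // races with the already-resumed MoveNext
//
// SAFE: do all bookkeeping first, schedule as the very last action
//   sm.Data.Finished <- false
//   sm.Data.MethodBuilder.AwaitUnsafeOnCompleted(&awaiter, &sm)
```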


majocha commented Aug 22, 2025

Ok, major things are solved and this seems to be holding up when I test it locally. I'll try to run some benchmarks later.


majocha commented Aug 22, 2025

AsyncCompletion benchmark

master


BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.26100.4946)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.304
  [Host]     : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2


| Method | Categories | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| CSharp_TenBindsAsync_TaskBuilder | AsyncBinds,CSharp,TaskBuilder | 3.082 μs | 0.0353 μs | 0.0331 μs | 1.00 | 0.00 | 0.0039 | 96 B | 1.00 |
| CSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,CSharp,ValueTaskBuilder | 3.157 μs | 0.0426 μs | 0.0377 μs | 1.03 | 0.02 | 0.0078 | 104 B | 1.08 |
| FSharp_TenBindsAsync_AsyncBuilder | AsyncBinds,FSharp,AsyncBuilder | 62.071 μs | 0.7312 μs | 0.6482 μs | 20.16 | 0.29 | 0.6250 | 8224 B | 85.67 |
| FSharp_TenBindsAsync_CancellableTaskBuilder | AsyncBinds,FSharp,CancellableTaskBuilder | 3.796 μs | 0.0203 μs | 0.0190 μs | 1.23 | 0.01 | 0.0625 | 808 B | 8.42 |
| FSharp_TenBindsAsync_CancellableTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableTaskBuilder,BindCancellableValueTask | 3.593 μs | 0.0188 μs | 0.0176 μs | 1.17 | 0.02 | 0.0625 | 808 B | 8.42 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder | AsyncBinds,FSharp,CancellableValueTaskBuilder | 3.611 μs | 0.0244 μs | 0.0228 μs | 1.17 | 0.02 | 0.0625 | 824 B | 8.58 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableValueTaskBuilder,BindCancellableValueTask | 3.597 μs | 0.0297 μs | 0.0278 μs | 1.17 | 0.02 | 0.0625 | 824 B | 8.58 |
| FSharp_TenBindsAsync_PlyTaskBuilder | AsyncBinds,FSharp,PlyTaskBuilder | 3.847 μs | 0.0400 μs | 0.0375 μs | 1.25 | 0.02 | 0.0508 | 656 B | 6.83 |
| FSharp_TenBindsAsync_PlyValueTaskBuilder | AsyncBinds,FSharp,PlyValueTaskBuilder | 3.587 μs | 0.0225 μs | 0.0211 μs | 1.16 | 0.02 | 0.0508 | 656 B | 6.83 |
| FSharp_TenBindsAsync_TaskBuilder | AsyncBinds,FSharp,TaskBuilder | 3.183 μs | 0.0427 μs | 0.0400 μs | 1.03 | 0.02 | 0.0078 | 112 B | 1.17 |
| FSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,FSharp,ValueTaskBuilder | 3.802 μs | 0.0240 μs | 0.0225 μs | 1.23 | 0.02 | 0.0586 | 744 B | 7.75 |

trampoline, this PR


BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.26100.4946)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.304
  [Host]     : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2


| Method | Categories | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| CSharp_TenBindsAsync_TaskBuilder | AsyncBinds,CSharp,TaskBuilder | 3.152 μs | 0.0316 μs | 0.0295 μs | 1.00 | 0.00 | 0.0039 | 96 B | 1.00 |
| CSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,CSharp,ValueTaskBuilder | 3.197 μs | 0.0197 μs | 0.0184 μs | 1.01 | 0.01 | 0.0078 | 104 B | 1.08 |
| FSharp_TenBindsAsync_AsyncBuilder | AsyncBinds,FSharp,AsyncBuilder | 62.764 μs | 0.7427 μs | 0.6947 μs | 19.92 | 0.31 | 0.6250 | 8224 B | 85.67 |
| FSharp_TenBindsAsync_CancellableTaskBuilder | AsyncBinds,FSharp,CancellableTaskBuilder | 3.673 μs | 0.0149 μs | 0.0140 μs | 1.17 | 0.01 | 0.0664 | 840 B | 8.75 |
| FSharp_TenBindsAsync_CancellableTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableTaskBuilder,BindCancellableValueTask | 3.754 μs | 0.0193 μs | 0.0180 μs | 1.19 | 0.01 | 0.0664 | 840 B | 8.75 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder | AsyncBinds,FSharp,CancellableValueTaskBuilder | 3.747 μs | 0.0199 μs | 0.0186 μs | 1.19 | 0.01 | 0.0664 | 856 B | 8.92 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableValueTaskBuilder,BindCancellableValueTask | 3.727 μs | 0.0350 μs | 0.0327 μs | 1.18 | 0.02 | 0.0625 | 856 B | 8.92 |
| FSharp_TenBindsAsync_PlyTaskBuilder | AsyncBinds,FSharp,PlyTaskBuilder | 3.742 μs | 0.0290 μs | 0.0271 μs | 1.19 | 0.01 | 0.0508 | 656 B | 6.83 |
| FSharp_TenBindsAsync_PlyValueTaskBuilder | AsyncBinds,FSharp,PlyValueTaskBuilder | 3.709 μs | 0.0265 μs | 0.0221 μs | 1.18 | 0.01 | 0.0469 | 656 B | 6.83 |
| FSharp_TenBindsAsync_TaskBuilder | AsyncBinds,FSharp,TaskBuilder | 3.105 μs | 0.0288 μs | 0.0269 μs | 0.99 | 0.01 | 0.0078 | 112 B | 1.17 |
| FSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,FSharp,ValueTaskBuilder | 3.739 μs | 0.0169 μs | 0.0158 μs | 1.19 | 0.01 | 0.0586 | 760 B | 7.92 |


majocha commented Aug 23, 2025

Now there is yet another problem.

The general idea of how this PR works:
When each new task starts, it delays its first MoveNext() step onto the trampoline. This way a chain of tasks starting tasks starting tasks does not overflow the stack.

This generally works well within a computation where tasks are started when bound (recursive do!, return!, etc.), but what if the task is not started as the rhs of a let! or return!?

Consider

```fsharp
let rec foo n =
    async {
        if n = 0 then
            return 42
        else
            let x = foo (n - 1) |> Async.RunSynchronously
            return x
    }

foo 10_000 |> Async.RunSynchronously
```

this will SO because each Async.RunSynchronously starts on a new trampoline. There's nothing we can do about that, and it's OK.

However with this PR,

```fsharp
let rec foo n =
    coldTask {
        if n = 0 then
            return 42
        else
            let x = foo (n - 1) () |> _.Result
            return x
    }

foo 10_000 () |> _.Result
```

will hang instead. This is bad.

The issue is that, since ColdTask<'t> is just unit -> Task<'t>, there is no straightforward way to tell it that it starts as the rhs of a let! or return!.

OTOH I can easily make an Async2 type like this:

```fsharp
type Async2<'T>(start: unit -> Task<'T>) =

    member _.Start() =
        Trampoline.PushNewTrampoline()
        start ()

    member _.StartBound() = start ()
```

then in the builder:

```fsharp
member inline _.Source(code: Async2<_>) = code.StartBound() |> _.GetAwaiter()
```

and it will work just as async in the same scenario:

```fsharp
let rec foo n =
    async2 {
        if n = 0 then
            return 42
        else
            let x = foo (n - 1) |> _.Start() |> _.Result
            return x
    }

foo 10_000 |> Async2.run
```

So, the big TODO now is to come up with some way of doing this with tasks / cold tasks.

This seems doable for the cold task family, at least without changes to the API.
Hot tasks look hopeless; the only possibility there is another, stack-safe variant.


majocha commented Aug 23, 2025

> This seems doable for the cold task family at least without changes to the API.
> Hot tasks look hopeless, the only thing possible is another, stack safe variant.

So I removed hot tasks from the picture. This currently covers only the cold start variants.
The way the above problem is solved: BindContext.SetIsBind() is called before any evaluation of a cold task in the context of a CE. This signals to the task that it can use the current trampoline, because it is the rhs of a let!, return!, etc.
But it still feels hacky and unmaintainable. The tests pass now, but there are so many places where tasks get evaluated (all the Source methods, various extensions, getAwaiter and so on) that I feel I missed quite a few, even with Copilot's help.
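The thread-local bind flag described above could be sketched roughly like this; the member names follow the comment, but the actual implementation in the PR may well differ:

```fsharp
open System.Threading

// Hypothetical sketch of BindContext.SetIsBind: a per-thread flag set just
// before a cold task is started as the rhs of let!/return!, telling the task
// it may reuse the caller's trampoline instead of pushing a fresh one.
type BindContext private () =
    [<ThreadStatic; DefaultValue>]
    static val mutable private isBind: bool

    static member SetIsBind() = BindContext.isBind <- true

    /// Reads and clears the flag; a task started outside a bind sees false.
    static member ConsumeIsBind() =
        let v = BindContext.isBind
        BindContext.isBind <- false
        v
```

The fragility the comment mentions follows directly from this design: every code path that evaluates a cold task must remember to set the flag, or the task will push a fresh trampoline when it should not.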


majocha commented Aug 28, 2025

Unfortunately I can't find a clean way to make it work with CancellableTask. I understand it is by design erased into a standard Task; that is, there is no Source(task: CancellableTask<'TResult1>) to plug in to.

@TheAngryByrd
Owner

> Unfortunately I can't find a clean way to make it work with CancellableTask. I understand, it is by design erased into a standard Task, as in, there is no Source(task: CancellableTask<'TResult1>) to plug in to.

It should be similar to ColdTasks. There was a lot of shared code between the various builders, which was moved into CancellableTaskBuilderBase. I didn't use the alias here for some reason.

```fsharp
member inline _.Source
    ([<InlineIfLambda>] cancellableTask: CancellationToken -> Task<'T>)
    : CancellationToken -> Awaiter<TaskAwaiter<'T>, 'T> =
    (fun ct -> Awaitable.GetTaskAwaiter(cancellableTask ct))
```

```fsharp
if n = 0 then
    return false
else
    return! evenC (n - 1) CancellationToken.None
```
Author

Yeah, it seems this won't work for dissimilar CEs involving cancellableTask like here. The problem is that the function call CancellationToken -> Task<'t> happens before the resulting Task<'t> is passed to Source, so there is no way to intercept it.

Owner

@TheAngryByrd TheAngryByrd Aug 28, 2025

Right, you would have to do return! fun () -> evenC (n - 1) CancellationToken.None

Owner

Wonder if we can get an LLM to write an analyzer that checks whether tasks are being returned in a TCO spot.

Author

:) OTOH this is not a big deal; there just needs to be clear documentation that this does not work for hot tasks, and in the test we effectively bind a hot task.

Author

It's a pity: this could work for all tasks, hot and cold, if not for the possibility of the sync-over-async antipattern.

Owner

> :) OTOH this is not a big deal, there just should be clear documentation that this does not work for hot tasks, and in the test we effectively bind a hot task.

Feels like such an easy footgun though 😢

Owner

> It's a pity this could work for all tasks, hot and cold, if not for a possibility of sync over async antipattern.

What do you mean here? I'm familiar with the pattern, but not why it applies here. Is this something we could handle, where if someone calls .Result it's their problem?

Author

Yes, that's what I've been fighting with for almost a week.
In practice, we could just count tasks started synchronously on the thread and bounce every 50th or so, and we're safe from SO.
But if someone calls _.Result on a bounced task (from inside the CE), it's an instant deadlock. Because the trampoline is synchronous, it's not unlike a synchronization-context deadlock, I guess.
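The counting idea mentioned above could look roughly like this; `StackGuard` and `ShouldBounce` are illustrative names, and 50 matches the limit discussed elsewhere in this thread:

```fsharp
open System.Threading

// Sketch: count tasks started synchronously on the current thread and bounce
// to the trampoline every 50th start, keeping stack depth bounded.
type StackGuard private () =
    [<ThreadStatic; DefaultValue>]
    static val mutable private depth: int

    /// Increments the per-thread counter; true means "defer this start to
    /// the trampoline now" and resets the counter.
    static member ShouldBounce() =
        StackGuard.depth <- StackGuard.depth + 1
        if StackGuard.depth >= 50 then
            StackGuard.depth <- 0
            true
        else
            false
```

The deadlock risk follows from the trampoline being synchronous: a bounced task only runs when the trampoline loop gets control back, so blocking on its `_.Result` from inside that same loop waits forever.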

Owner

I feel like “sync over async” pattern is documented and taught enough that we rely on that footgun being user error.

Author

Yes, but I feel it happens a lot in the wild, in non-essential places like tests, for example. If this started deadlocking, it would break a lot of things.


majocha commented Aug 28, 2025

BTW, I reduced the bind limit from 100 to 50; that's probably why the macOS leg passes now.

@TheAngryByrd
Owner

> BTW i reduced the bind limit from 100 to 50, probably that's why the MacOS leg passes now.

Is this something we need to be a bit more dynamic about, or allow someone to set? I know you can set the stack size as an environment variable. I doubt we can handle it if someone uses editbin.exe to edit the stack size of a Windows binary, though.


majocha commented Sep 1, 2025

>> BTW i reduced the bind limit from 100 to 50, probably that's why the MacOS leg passes now.
>
> Is this something we need to be a bit more dynamic about or allow someone to set? I know you can set the stack size as an environment variable. Doubt we can handle if someone uses editbin.exe to edit the stacksize of a windows binary tho.

Classic async just has a hardcoded limit:
https://github.com/dotnet/fsharp/blob/9425e4d8f16eeb48e7cb499d3446950baa90e426/src/FSharp.Core/async.fs#L87

The internal StackGuard in the compiler has defaults that can be overriden by env vars.

I think for this PR this could eventually be made configurable by a utility function; the whole thing could also be opt-in if there is no good solution for the problematic edge cases.
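A configurable limit could follow the compiler StackGuard's default-plus-override pattern; a sketch under stated assumptions (the environment variable name is made up, and 50 is the default mentioned earlier in this thread):

```fsharp
open System

// Sketch: a hardcoded default bind limit that an environment variable can
// override, mirroring how the compiler's internal StackGuard is configured.
let bindLimit =
    match Environment.GetEnvironmentVariable "TASK_TRAMPOLINE_BIND_LIMIT" with
    | null | "" -> 50 // default when the variable is unset
    | s ->
        match Int32.TryParse s with
        | true, v when v > 0 -> v
        | _ -> 50 // ignore unparsable or non-positive values
```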

As an aside, there's also a mistake in this PR currently that I need to fix/revert: the limit should be checked each time there's a possible yield, not just once for each started task.


majocha commented Sep 1, 2025

A bind limit of 300 does pass the tests on Windows, but IIUC Macs usually have a shorter stack. This is also very much dependent on the complexity of the compiled state machine.

Another open question is how to do a seamless tail-call handover with ReturnFromFinal. One idea would be to abandon the previous state machine and continue on the next one on each tail call. But that would require a custom awaiter, some construct that has an overview of the whole execution, not just the single task that happens to be running at the moment.


majocha commented Sep 28, 2025

I'm not giving up on this. I think this can work in general. I have a few more ideas to try and some improvements to integrate, for example a working prototype of ReturnFromFinal tail calls.
