Conversation


@majocha majocha commented Aug 13, 2025

Proposed Changes

Use synchronous, single-thread trampolines whenever recursion is deep enough.
Cache exceptions in an ExceptionDispatchInfo to cut short deep stack traces and enable fast recovery from deep call stacks.

This builds and the test passes without a StackOverflowException.
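The exception-caching idea can be sketched roughly as below. This is a minimal illustration only; `CachedFailure` and `capture` are made-up names, not types from this PR:

```fsharp
open System.Runtime.ExceptionServices

// Illustrative sketch: capture an exception once in an ExceptionDispatchInfo
// so that rethrowing it later preserves the original stack trace without
// re-walking (and re-growing) a deep call stack on every hop.
type CachedFailure(edi: ExceptionDispatchInfo) =
    member _.Rethrow() : 'T =
        edi.Throw()
        Unchecked.defaultof<'T> // unreachable; satisfies the type checker

let capture (f: unit -> 'T) =
    try
        Ok(f ())
    with ex ->
        Error(CachedFailure(ExceptionDispatchInfo.Capture ex))
```

`ExceptionDispatchInfo.Throw` rethrows with the captured stack intact, which is what makes a short, cached rethrow path possible.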

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

Many todos here:

  • Build and tests pass locally
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • Implement the dynamic path
  • Use a trampoline for SetResult / SetException to make this work on old .NET Framework
  • Double-check that the exception caching with EDI works as intended
  • Clean up and organize the code
  • Come up with some way to dispose of trampolines that are no longer in use (is it worth it?)
  • Add more tests
  • Benchmark this
  • I have added necessary documentation (if appropriate)



majocha commented Aug 14, 2025

This is just a proof of concept. Most probably I'll rework it quite a lot before it's ready.


majocha commented Aug 17, 2025

This currently is blocked by an issue that possibly needs fixing elsewhere:

When you queue a lot of work items concurrently (for example with Task.Yield() in all the "for in ..." tests here), the following starts to throw in resumable code:

```fsharp
if sm.Data.Finished then
    if not sm.Data.Finished then failwith "corrupted state machine"
```

I'll try to make a minimal repro later.

This is probably the same as dotnet/fsharp#18853, with the difference being TaskSeq uses a reference type for state machine Data, hence the null there and zeroed Data struct here.


majocha commented Aug 21, 2025

> This currently is blocked by an issue that possibly needs fixing elsewhere:

Yeah, the bug is in my code after all :D

The thing is, AwaitUnsafeOnCompleted does not wait for the current MoveNext to finish before starting the next one.
That's why it's crucial to call it as the very last thing in the step. Any logic or assignments after AwaitUnsafeOnCompleted are not safe.
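The ordering constraint can be illustrated as follows (a comment-only sketch; the builder call follows the standard `AsyncTaskMethodBuilder.AwaitUnsafeOnCompleted` shape):

```fsharp
// Inside a MoveNext step: AwaitUnsafeOnCompleted may resume the state machine
// on another thread before this call even returns, so nothing may touch
// sm.Data after the call.
//
// UNSAFE:
//   sm.Data.MethodBuilder.AwaitUnsafeOnCompleted(&awaiter, &sm)
//   sm.Data.Finished <- false   // races with the already-resumed MoveNext
//
// SAFE: do all bookkeeping first, schedule as the very last action
//   sm.Data.Finished <- false
//   sm.Data.MethodBuilder.AwaitUnsafeOnCompleted(&awaiter, &sm)
```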


majocha commented Aug 22, 2025

Ok, major things are solved and this seems to be holding up when I test it locally. I'll try to run some benchmarks later.


majocha commented Aug 22, 2025

AsyncCompletion benchmark

master


BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.26100.4946)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.304
  [Host]     : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2


| Method | Categories | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| CSharp_TenBindsAsync_TaskBuilder | AsyncBinds,CSharp,TaskBuilder | 3.082 μs | 0.0353 μs | 0.0331 μs | 1.00 | 0.00 | 0.0039 | 96 B | 1.00 |
| CSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,CSharp,ValueTaskBuilder | 3.157 μs | 0.0426 μs | 0.0377 μs | 1.03 | 0.02 | 0.0078 | 104 B | 1.08 |
| FSharp_TenBindsAsync_AsyncBuilder | AsyncBinds,FSharp,AsyncBuilder | 62.071 μs | 0.7312 μs | 0.6482 μs | 20.16 | 0.29 | 0.6250 | 8224 B | 85.67 |
| FSharp_TenBindsAsync_CancellableTaskBuilder | AsyncBinds,FSharp,CancellableTaskBuilder | 3.796 μs | 0.0203 μs | 0.0190 μs | 1.23 | 0.01 | 0.0625 | 808 B | 8.42 |
| FSharp_TenBindsAsync_CancellableTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableTaskBuilder,BindCancellableValueTask | 3.593 μs | 0.0188 μs | 0.0176 μs | 1.17 | 0.02 | 0.0625 | 808 B | 8.42 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder | AsyncBinds,FSharp,CancellableValueTaskBuilder | 3.611 μs | 0.0244 μs | 0.0228 μs | 1.17 | 0.02 | 0.0625 | 824 B | 8.58 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableValueTaskBuilder,BindCancellableValueTask | 3.597 μs | 0.0297 μs | 0.0278 μs | 1.17 | 0.02 | 0.0625 | 824 B | 8.58 |
| FSharp_TenBindsAsync_PlyTaskBuilder | AsyncBinds,FSharp,PlyTaskBuilder | 3.847 μs | 0.0400 μs | 0.0375 μs | 1.25 | 0.02 | 0.0508 | 656 B | 6.83 |
| FSharp_TenBindsAsync_PlyValueTaskBuilder | AsyncBinds,FSharp,PlyValueTaskBuilder | 3.587 μs | 0.0225 μs | 0.0211 μs | 1.16 | 0.02 | 0.0508 | 656 B | 6.83 |
| FSharp_TenBindsAsync_TaskBuilder | AsyncBinds,FSharp,TaskBuilder | 3.183 μs | 0.0427 μs | 0.0400 μs | 1.03 | 0.02 | 0.0078 | 112 B | 1.17 |
| FSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,FSharp,ValueTaskBuilder | 3.802 μs | 0.0240 μs | 0.0225 μs | 1.23 | 0.02 | 0.0586 | 744 B | 7.75 |

trampoline, this PR


BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.26100.4946)
13th Gen Intel Core i5-13600KF, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.304
  [Host]     : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 8.0.19 (8.0.1925.36514), X64 RyuJIT AVX2


| Method | Categories | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| CSharp_TenBindsAsync_TaskBuilder | AsyncBinds,CSharp,TaskBuilder | 3.152 μs | 0.0316 μs | 0.0295 μs | 1.00 | 0.00 | 0.0039 | 96 B | 1.00 |
| CSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,CSharp,ValueTaskBuilder | 3.197 μs | 0.0197 μs | 0.0184 μs | 1.01 | 0.01 | 0.0078 | 104 B | 1.08 |
| FSharp_TenBindsAsync_AsyncBuilder | AsyncBinds,FSharp,AsyncBuilder | 62.764 μs | 0.7427 μs | 0.6947 μs | 19.92 | 0.31 | 0.6250 | 8224 B | 85.67 |
| FSharp_TenBindsAsync_CancellableTaskBuilder | AsyncBinds,FSharp,CancellableTaskBuilder | 3.673 μs | 0.0149 μs | 0.0140 μs | 1.17 | 0.01 | 0.0664 | 840 B | 8.75 |
| FSharp_TenBindsAsync_CancellableTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableTaskBuilder,BindCancellableValueTask | 3.754 μs | 0.0193 μs | 0.0180 μs | 1.19 | 0.01 | 0.0664 | 840 B | 8.75 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder | AsyncBinds,FSharp,CancellableValueTaskBuilder | 3.747 μs | 0.0199 μs | 0.0186 μs | 1.19 | 0.01 | 0.0664 | 856 B | 8.92 |
| FSharp_TenBindsAsync_CancellableValueTaskBuilder_BindCancellableTask | AsyncBinds,FSharp,CancellableValueTaskBuilder,BindCancellableValueTask | 3.727 μs | 0.0350 μs | 0.0327 μs | 1.18 | 0.02 | 0.0625 | 856 B | 8.92 |
| FSharp_TenBindsAsync_PlyTaskBuilder | AsyncBinds,FSharp,PlyTaskBuilder | 3.742 μs | 0.0290 μs | 0.0271 μs | 1.19 | 0.01 | 0.0508 | 656 B | 6.83 |
| FSharp_TenBindsAsync_PlyValueTaskBuilder | AsyncBinds,FSharp,PlyValueTaskBuilder | 3.709 μs | 0.0265 μs | 0.0221 μs | 1.18 | 0.01 | 0.0469 | 656 B | 6.83 |
| FSharp_TenBindsAsync_TaskBuilder | AsyncBinds,FSharp,TaskBuilder | 3.105 μs | 0.0288 μs | 0.0269 μs | 0.99 | 0.01 | 0.0078 | 112 B | 1.17 |
| FSharp_TenBindsAsync_ValueTaskBuilder | AsyncBinds,FSharp,ValueTaskBuilder | 3.739 μs | 0.0169 μs | 0.0158 μs | 1.19 | 0.01 | 0.0586 | 760 B | 7.92 |


majocha commented Aug 23, 2025

Now there is yet another problem.

The general idea of how this PR works:
When each new task starts, it delays its first MoveNext() step onto the trampoline. This way a chain of tasks starting tasks starting tasks does not overflow the stack.

This generally works well within a computation where tasks are started when bound (recursive do!, return!, etc.), but what if the task is not started as the rhs of a let! or return!?

Consider

```fsharp
let rec foo n =
    async {
        if n = 0 then
            return 42
        else
            let x = foo (n - 1) |> Async.RunSynchronously
            return x
    }

foo 10_000 |> Async.RunSynchronously
```

this will SO because each Async.RunSynchronously starts on a new trampoline. There's nothing we can do about that, and it's OK.

However with this PR,

```fsharp
let rec foo n =
    coldTask {
        if n = 0 then
            return 42
        else
            let x = foo (n - 1) () |> _.Result
            return x
    }

foo 10_000 () |> _.Result
```

will hang instead. This is bad.

The issue is that, since ColdTask<'t> is just unit -> Task<'t>, there is no straightforward way to tell it that it starts as the rhs of a let! or return!.

OTOH I can easily make an Async2 type like this:

```fsharp
type Async2<'T>(start: unit -> Task<'T>) =

    member _.Start() =
        Trampoline.PushNewTrampoline()
        start ()

    member _.StartBound() = start ()
```

then in the builder:

```fsharp
member inline _.Source(code: Async2<_>) = code.StartBound() |> _.GetAwaiter()
```

and it will work just as async in the same scenario:

```fsharp
let rec foo n =
    async2 {
        if n = 0 then
            return 42
        else
            let x = foo (n - 1) |> _.Start() |> _.Result
            return x
    }

foo 10_000 |> Async2.run
```

So, the big TODO now is to come up with some way of doing this with tasks / cold tasks.

This seems doable for the cold task family, at least without changes to the API.
Hot tasks look hopeless; the only possibility there is another, stack-safe variant.


majocha commented Aug 23, 2025

> This seems doable for the cold task family at least without changes to the API.
> Hot tasks look hopeless, the only thing possible is another, stack safe variant.

So I removed hot tasks from the picture. This currently covers only the cold start variants.
The way the above problem is solved: BindContext.SetIsBind() is called before any evaluation of a cold task in the context of a CE. This signals to the task that it can use the current trampoline, because it is the rhs of a let!, return!, etc.
But it still feels hacky and unmaintainable. The tests pass now, but there are so many places where tasks get evaluated (all the Source methods, various extensions, getAwaiter and so on) that I feel I missed quite a few, even with Copilot's help.
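The thread-local bind flag described above could be sketched roughly like this; the member names follow the comment, but the actual implementation in the PR may well differ:

```fsharp
open System.Threading

// Hypothetical sketch of BindContext.SetIsBind: a per-thread flag set just
// before a cold task is started as the rhs of let!/return!, telling the task
// it may reuse the caller's trampoline instead of pushing a fresh one.
type BindContext private () =
    [<ThreadStatic; DefaultValue>]
    static val mutable private isBind: bool

    static member SetIsBind() = BindContext.isBind <- true

    /// Reads and clears the flag; a task started outside a bind sees false.
    static member ConsumeIsBind() =
        let v = BindContext.isBind
        BindContext.isBind <- false
        v
```

The fragility the comment mentions follows directly from this design: every code path that evaluates a cold task must remember to set the flag, or the task will push a fresh trampoline when it should not.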


majocha commented Aug 28, 2025

Unfortunately I can't find a clean way to make it work with CancellableTask. I understand it is by design erased into a standard Task; that is, there is no Source(task: CancellableTask<'TResult1>) to plug in to.

@TheAngryByrd
Owner

> Unfortunately I can't find a clean way to make it work with CancellableTask. I understand, it is by design erased into a standard Task, as in, there is no Source(task: CancellableTask<'TResult1>) to plug in to.

It should be similar to ColdTasks. There was a lot of shared code between the various builders, which was moved into CancellableTaskBuilderBase. I didn't use the alias here for some reason.

```fsharp
member inline _.Source
    ([<InlineIfLambda>] cancellableTask: CancellationToken -> Task<'T>)
    : CancellationToken -> Awaiter<TaskAwaiter<'T>, 'T> =
    (fun ct -> Awaitable.GetTaskAwaiter(cancellableTask ct))
```

```fsharp
if n = 0 then
    return false
else
    return! evenC (n - 1) CancellationToken.None
```
Author

Yeah, it seems this won't work for dissimilar CEs involving cancellableTask like here. The problem is that the function call CancellationToken -> Task<'t> happens before the resulting Task<'t> is passed to Source, so there is no way to intercept it.

Owner

@TheAngryByrd TheAngryByrd Aug 28, 2025

Right, you would have to do return! fun () -> evenC (n - 1) CancellationToken.None

Owner

Wonder if we can get an LLM to write an analyzer that checks whether tasks are being returned in a TCO spot.

Author

:) OTOH this is not a big deal; there just needs to be clear documentation that this does not work for hot tasks, and in the test we effectively bind a hot task.

Author

It's a pity: this could work for all tasks, hot and cold, if not for the possibility of the sync-over-async antipattern.

Owner

> :) OTOH this is not a big deal, there just should be clear documentation that this does not work for hot tasks, and in the test we effectively bind a hot task.

Feels like such an easy footgun though 😢

Owner

> It's a pity this could work for all tasks, hot and cold, if not for a possibility of sync over async antipattern.

What do you mean here? I'm familiar with the pattern, but not why it applies here. Is this something we could handle, where if someone calls .Result it's their problem?

Author

Yes, that's what I've been fighting with for almost a week.
In practice, we could just count tasks started synchronously on the thread and bounce every 50th or so, and we're safe from SO.
But if someone calls _.Result on a bounced task (from inside the CE), it's an instant deadlock. Because the trampoline is synchronous, it's not unlike a synchronization-context deadlock, I guess.
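The counting idea mentioned above could look roughly like this; `StackGuard` and `ShouldBounce` are illustrative names, and 50 matches the limit discussed elsewhere in this thread:

```fsharp
open System.Threading

// Sketch: count tasks started synchronously on the current thread and bounce
// to the trampoline every 50th start, keeping stack depth bounded.
type StackGuard private () =
    [<ThreadStatic; DefaultValue>]
    static val mutable private depth: int

    /// Increments the per-thread counter; true means "defer this start to
    /// the trampoline now" and resets the counter.
    static member ShouldBounce() =
        StackGuard.depth <- StackGuard.depth + 1
        if StackGuard.depth >= 50 then
            StackGuard.depth <- 0
            true
        else
            false
```

The deadlock risk follows from the trampoline being synchronous: a bounced task only runs when the trampoline loop gets control back, so blocking on its `_.Result` from inside that same loop waits forever.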

Owner

I feel like “sync over async” pattern is documented and taught enough that we rely on that footgun being user error.

Author

Yes, but I feel it happens a lot in the wild, in non-essential places like tests, for example. If this started deadlocking, it would break a lot of things.


majocha commented Aug 28, 2025

BTW, I reduced the bind limit from 100 to 50; that's probably why the macOS leg passes now.

@TheAngryByrd
Owner

> BTW i reduced the bind limit from 100 to 50, probably that's why the MacOS leg passes now.

Is this something we need to be a bit more dynamic about, or allow someone to set? I know you can set the stack size as an environment variable. I doubt we can handle it if someone uses editbin.exe to edit the stack size of a Windows binary, though.


majocha commented Sep 1, 2025

>> BTW i reduced the bind limit from 100 to 50, probably that's why the MacOS leg passes now.
>
> Is this something we need to be a bit more dynamic about or allow someone to set? I know you can set the stack size as an environment variable. Doubt we can handle if someone uses editbin.exe to edit the stacksize of a windows binary tho.

Classic async just has a hardcoded limit:
https://github.com/dotnet/fsharp/blob/9425e4d8f16eeb48e7cb499d3446950baa90e426/src/FSharp.Core/async.fs#L87

The internal StackGuard in the compiler has defaults that can be overriden by env vars.

I think for this PR this could eventually be made configurable by a utility function; the whole thing could also be opt-in if there is no good solution for the problematic edge cases.
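A configurable limit could follow the compiler StackGuard's default-plus-override pattern; a sketch under stated assumptions (the environment variable name is made up, and 50 is the default mentioned earlier in this thread):

```fsharp
open System

// Sketch: a hardcoded default bind limit that an environment variable can
// override, mirroring how the compiler's internal StackGuard is configured.
let bindLimit =
    match Environment.GetEnvironmentVariable "TASK_TRAMPOLINE_BIND_LIMIT" with
    | null | "" -> 50 // default when the variable is unset
    | s ->
        match Int32.TryParse s with
        | true, v when v > 0 -> v
        | _ -> 50 // ignore unparsable or non-positive values
```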

As an aside, there's also a mistake in this PR currently that I need to fix/revert: the limit should be checked each time there's a possible yield, not just once for each started task.


majocha commented Sep 1, 2025

A bind limit of 300 does pass the tests on Windows, but IIUC Macs usually have a shorter stack. This is also very much dependent on the complexity of the compiled state machine.

Another open question is how to do a seamless tail-call handover with ReturnFromFinal. One idea would be to abandon the previous state machine and continue on the next one on each tail call. But that would require a custom awaiter, some construct that has an overview of the whole execution, not just the single task that happens to be running at the moment.


majocha commented Sep 28, 2025

I'm not giving up on this. I think this can work in general. I have a few more ideas to try and some improvements to integrate, for example a working prototype of ReturnFromFinal tail calls.
