Async multi packet fixes for 6.1.0 #3534

Open: wants to merge 15 commits into base: main

Contributor

@Wraith2 Wraith2 commented Jul 31, 2025

Fixes #3519

Fix 1:
When reading multi packet strings it is possible for multiple strings to happen in a single row. When reading asynchronously a snapshot is used which contains a linked list of packets. The current codebase has logic which keeps a cleared spare linked list node when the snapshot is cleared. The logic to clear the spare packet was faulty and did not clear all the fields leaving the data length in the node. In specific circumstances it is possible to re-use the spare linked list node containing an old data value as the first packet in a new linked list of packets. When this happens in a read which reaches the continue stage (3 or more packets) the size calculation is incorrect and various errors can occur.

The spare packet functionality is not very useful because it can only store a single node, and it doesn't retain the byte[] buffer, so the memory saving is tiny. I have removed it and changed the linked list node fields to be readonly. This resolves the bug.
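To illustrate why readonly node fields rule out this class of bug, here is a minimal sketch. `PacketNode` and `SnapshotSketch` are hypothetical names invented for this example and are not the driver's actual types. Because every field is assigned once in the constructor, a node can never carry a stale data length from an earlier read into a new list.

```csharp
using System;

// Hypothetical stand-in for a snapshot linked-list node (not the driver's real type).
public sealed class PacketNode
{
    public readonly byte[] Buffer;
    public readonly int DataLength;
    public readonly int RunningDataSize;
    public readonly PacketNode Previous;

    public PacketNode(byte[] buffer, int dataLength, PacketNode previous)
    {
        Buffer = buffer;
        DataLength = dataLength;
        Previous = previous;
        // The running total is derived only from constructor arguments,
        // so there is no leftover state that a "clear" could forget to reset.
        RunningDataSize = (previous == null ? 0 : previous.RunningDataSize) + dataLength;
    }
}

public static class SnapshotSketch
{
    // Builds a fresh list for each read and returns the total data size.
    public static int TotalDataSize(params int[] packetDataLengths)
    {
        PacketNode tail = null;
        foreach (int length in packetDataLengths)
        {
            tail = new PacketNode(new byte[length], length, tail);
        }
        return tail == null ? 0 : tail.RunningDataSize;
    }
}
```

The trade-off is one small allocation per packet instead of reusing a pooled node, which matches the observation above that the pooled node saved very little memory.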

Fix 2:
When reading a multi packet string the plp chunks are read from each packet and the end is signalled by a terminator. It is possible for the data to align such that the contents of a string complete exactly at the end of a packet and the terminator is in the next packet. In this case some pre-existing logic checks for 0 chars remaining and exits early.

This logic needed to be updated so that when continuing it returns the entire existing length read and not a 0 value.
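The change in behaviour can be sketched as follows. `PlpEarlyExitSketch` and its method names are hypothetical; only the idea (report the chars already read on earlier packets instead of 0) comes from the description above, and the sketch assumes 2 bytes per char, as for nvarchar data.

```csharp
using System;

public static class PlpEarlyExitSketch
{
    // Old behaviour: the early exit always reported zero chars read,
    // discarding whatever had been accumulated on previous packets.
    public static int CharsReadBefore(int startOffsetByteCount)
    {
        return 0;
    }

    // New behaviour: the early exit reports the chars read so far.
    // startOffsetByteCount is in bytes; chars are assumed to be 2 bytes each.
    public static int CharsReadAfter(int startOffsetByteCount)
    {
        return startOffsetByteCount / 2;
    }
}
```

For a first (non-continuing) read the offset would presumably be 0, so both versions agree in that case; they differ only when a continued read resumes exactly at the terminator.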

Fix 3:
While debugging the first two issues the buffer sizes and calculations were confusing me. I eventually realised that the code was directly using _longlenleft which is measured in bytes to size a char array, meaning that all char arrays were twice as long as needed. I have updated the code to handle that and use smaller appropriately sized arrays.
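A small sketch of the sizing issue, assuming UTF-16 data at 2 bytes per char. `CharBufferSizing` is a hypothetical helper written for this example and is not code from the PR; it only shows that a char array needs half as many elements as the byte count.

```csharp
using System;
using System.Text;

public static class CharBufferSizing
{
    // Sizes a char array from a byte count. Rounding up keeps room for a
    // trailing odd byte, which should not occur in well-formed UTF-16 data.
    public static char[] AllocateForUtf16Bytes(int byteCount)
    {
        return new char[(byteCount + 1) / 2];
    }

    // Decodes UTF-16 (little-endian) bytes into a string using the smaller buffer.
    public static string Decode(byte[] utf16Bytes)
    {
        char[] chars = AllocateForUtf16Bytes(utf16Bytes.Length);
        int written = Encoding.Unicode.GetChars(utf16Bytes, 0, utf16Bytes.Length, chars, 0);
        return new string(chars, 0, written);
    }
}
```

Sizing the array directly from the byte count, as the old code did, would still work but allocates twice the memory needed.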

I have updated the existing test to iterate over packet sizes from 512 (the minimum packet size) to 2048 bytes. This can cause lots of interesting alignments in the data, testing the paths through the string reading code more effectively. The range could be increased, but I considered that the runtime needed to be low enough not to time out CI runs, and most higher packet sizes will behave similarly to lower-sized runs due to factoring.
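A packet size sweep of this kind could be driven roughly as below. This is a sketch only: `PacketSizeSweep` is a hypothetical helper, the step size is an assumption (the PR does not state it here), and the actual test may set the packet size differently, for example through `SqlConnectionStringBuilder`. The `Packet Size` connection string keyword itself is a standard SqlClient keyword.

```csharp
using System;
using System.Collections.Generic;

public static class PacketSizeSweep
{
    // Yields one connection string per packet size in [minSize, maxSize].
    public static IEnumerable<string> ConnectionStrings(
        string baseConnectionString, int minSize, int maxSize, int step)
    {
        if (step <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(step));
        }

        string prefix = baseConnectionString.TrimEnd(';');
        for (int size = minSize; size <= maxSize; size += step)
        {
            // Each size shifts where string data falls relative to packet boundaries.
            yield return prefix + ";Packet Size=" + size;
        }
    }
}
```

A test would then open a connection for each yielded string and run the same multi packet string read against it.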

Thanks to @erenes and @Suchiman for their help finding the reproduction that worked on my machine; without that I would have been unable to fix anything.

@dotnet/sqlclientdevteam can I get a CI run please.

/cc @Jakimar

Wraith2 added 2 commits July 31, 2025 18:53
fix 0 length read at the start of a packe in plp stream returning 0 when continuing
handle char array sizing better
change existing test to use multiple packet sizes
@Wraith2 Wraith2 requested a review from a team as a code owner July 31, 2025 20:57
@Wraith2
Contributor Author

Wraith2 commented Jul 31, 2025

@ErikEJ this might reduce memory usage for string reads. It might be worth benching the artifacts if the CI runs green.

@mdaigle
Contributor

mdaigle commented Jul 31, 2025

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@Wraith2
Contributor Author

Wraith2 commented Aug 1, 2025

I've added an additional fix for the same 0-length-left-at-the-terminator case, which also occurs on the varchar (not nvarchar) read path.

@mdaigle
Contributor

mdaigle commented Aug 1, 2025

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@apoorvdeshmukh apoorvdeshmukh added this to the 6.1.1 milestone Aug 4, 2025
@cheenamalhotra cheenamalhotra removed this from the 6.1.1 milestone Aug 12, 2025
force process sni compatibility mode by default
@Wraith2
Contributor Author

Wraith2 commented Aug 12, 2025

@dotnet/sqlclientdevteam can I get a CI run on this please.

I've added a new commit which forces process sni mode to compatibility mode (and, by extension, disables async-continue mode) and adds a fix for the pending read counter imbalance that we discussed and that @rhuijben has been helping to track down today. This is a possible stable state of the current codebase to evaluate.

@mdaigle
Contributor

mdaigle commented Aug 12, 2025

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@Wraith2
Contributor Author

Wraith2 commented Aug 12, 2025

I've aligned the appcontext switch test with the new defaults. Can I get another run please, @dotnet/sqlclientdevteam?

@paulmedynski
Contributor

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

Contributor

@paulmedynski paulmedynski left a comment

Asking for some clarity on >> 1 versus / 2.

@@ -13206,7 +13206,7 @@ bool writeDataSizeToSnapshot
 if (stateObj._longlen == 0)
 {
     Debug.Assert(stateObj._longlenleft == 0);
-    totalCharsRead = 0;
+    totalCharsRead = startOffsetByteCount >> 1;
Contributor

Is this a division by 2 in disguise? Are you using a special property of right-bit-shift that divide-by-2 doesn't have? Something else?

If the former, please use startOffsetByteCount / 2 for clarity. If either of the latter, please document why.

Contributor Author

No magic. Just using the same idiom as the containing methods. I've changed it to use division instead of shift.

I've also changed the multiplexer test detection of compatibility to match the library which should skip the multiplexer tests correctly now.
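For readers wondering whether the shift and the division are interchangeable: in C# they agree for non-negative `int` values, which is presumably the only case that arises for a byte count, and differ only for negative odd values. A small sketch (`ShiftVersusDivide` is a hypothetical name used only here):

```csharp
using System;

public static class ShiftVersusDivide
{
    // Arithmetic right shift rounds toward negative infinity.
    public static int HalveByShift(int value)
    {
        return value >> 1;
    }

    // Integer division truncates toward zero.
    public static int HalveByDivision(int value)
    {
        return value / 2;
    }
}
```

With a value of 10 both return 5. With a value of -3 the shift returns -2 while the division returns -1, so the two forms are only equivalent when the input cannot be negative.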

@rhuijben

rhuijben commented Aug 13, 2025

@Wraith2 when I run the testcase from the other issue against this branch in DEBUG mode, I get:

 SearchDogCrash.SearchDogCrashTests.TestSearchDogCrash
   Source: Class1.cs line 11
   Duration: 947 ms

  Message: 
Microsoft.VisualStudio.TestPlatform.TestHost.DebugAssertException : Method Debug.Fail failed with 'Invalid token after performing CleanPartialRead: 04
', and was translated to Microsoft.VisualStudio.TestPlatform.TestHost.DebugAssertException to avoid terminating the process hosting the test.

  Stack Trace: 
SqlDataReader.TryCleanPartialRead() line 867
SqlDataReader.TryCloseInternal(Boolean closeReader) line 1058
SqlDataReader.Close() line 1009
SqlDataReader.Dispose(Boolean disposing) line 924
DbDataReader.DisposeAsync()
SearchDogCrashTests.Do() line 49
ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
Task.RunContinuations(Object continuationObject)
Task`1.TrySetResult(TResult result)
UnwrapPromise`1.TrySetFromTask(Task task, Boolean lookForOce)
UnwrapPromise`1.ProcessInnerTask(Task task)
Task.RunContinuations(Object continuationObject)
Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
ThreadPoolWorkQueue.Dispatch()
WorkerThread.WorkerThreadStart()

In release mode the test passes.
(Reverted to old version of the library for RepoDB and $dayjob)

@Wraith2
Contributor Author

Wraith2 commented Aug 13, 2025

multitasking here, can you link me to the exact repro you're talking about?

@rhuijben

rhuijben commented Aug 13, 2025

> multitasking here, can you link me to the exact repro you're talking about?

The testcase from
#3519 (comment)

I'm currently trying to get things reproduced against a docker instance of sqlserver 2019 so we can look at the same thing (and maybe even test this on github actions, like I do in the RepoDB project)

@rhuijben

rhuijben commented Aug 13, 2025

This case rhuijben@1964bc1
(Extracted from #3519 (comment))

fails for me on this docker setup.
(If you have docker, running docker compose up -d in the directory with the compose script will give you a local sqlserver instance. The testcase then adds the schema and runs)

Too bad it is not the error I'm seeing myself, but it is still a valid testcase. Trying to extend this to include my case.

It fails on the first (smallest) packetsize of 512.

 SearchDogCrash.SearchDogCrashTests.OtherRepro
   Source: CrashTest.cs line 822
   Duration: 3,6 min

  Message: 
Microsoft.VisualStudio.TestPlatform.TestHost.DebugAssertException : Method Debug.Fail failed with 'partially read packets cannot be appended to the snapshot
', and was translated to Microsoft.VisualStudio.TestPlatform.TestHost.DebugAssertException to avoid terminating the process hosting the test.

  Stack Trace: 
StateSnapshot.AppendPacketData(Byte[] buffer, Int32 read) line 4158
TdsParserStateObject.ProcessSniPacketCompat(PacketHandle packet, UInt32 error) line 529
TdsParserStateObject.ProcessSniPacket(PacketHandle packet, UInt32 error) line 19
TdsParserStateObject.ReadAsyncCallback(IntPtr key, PacketHandle packet, UInt32 error) line 353
TdsParserStateObject.ReadSni(TaskCompletionSource`1 completion) line 3236
TdsParserStateObject.TryReadNetworkPacket() line 2818
TdsParserStateObject.TryPrepareBuffer() line 1299
TdsParserStateObject.TryReadByteArray(Span`1 buff, Int32 len, Int32& totalRead, Int32 startOffset, Boolean writeDataSizeToSnapshot) line 1492
TdsParserStateObject.TryReadByteArray(Span`1 buff, Int32 len, Int32& totalRead) line 1453
TdsParserStateObject.TryReadInt64(Int64& value) line 1721
<13 more frames...>
AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining)
Task.RunContinuations(Object continuationObject)
Task`1.TrySetResult(TResult result)
TaskCompletionSource`1.TrySetResult(TResult result)
SqlDataReader.CompleteAsyncCall[T](Task`1 task, SqlDataReaderBaseAsyncCallContext`1 context) line 6123
SqlDataReaderBaseAsyncCallContext`1.CompleteAsyncCallCallback(Task`1 task, Object state) line 5822
ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
ThreadPoolWorkQueue.Dispatch()
WorkerThread.WorkerThreadStart()

@rhuijben

rhuijben commented Aug 13, 2025

Debug.Assert(TdsEnums.HEADER_LEN + Packet.GetDataLengthFromHeader(buffer) == read, "partially read packets cannot be appended to the snapshot");

read=512
buffer = byte[512], (first and last byte are 0x04)
TdsEnums.HEADER_LEN = 8.
Packet.GetDataLengthFromHeader(buffer) returns 503

503+8 = 511, so mismatch.

Looks like the first byte of the next packet is already in the buffer here.
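The arithmetic above can be reproduced with a small sketch. `TdsHeaderSketch` is a hypothetical helper, not the driver's `Packet.GetDataLengthFromHeader`; it assumes the TDS header layout in which bytes 2 and 3 hold the total packet length (header included) as a big-endian value.

```csharp
using System;

public static class TdsHeaderSketch
{
    // Same value as the TdsEnums.HEADER_LEN quoted above.
    public const int HeaderLength = 8;

    // Assumption: bytes 2-3 hold the total packet length, big-endian, header included.
    public static int GetDataLength(ReadOnlySpan<byte> buffer)
    {
        int packetLength = (buffer[2] << 8) | buffer[3];
        return packetLength - HeaderLength;
    }

    // Number of bytes in the read that lie beyond the first packet.
    // A positive result means the start of the next packet is already in the buffer.
    public static int TrailingBytes(ReadOnlySpan<byte> buffer, int read)
    {
        return read - (HeaderLength + GetDataLength(buffer));
    }
}
```

For a 512-byte read whose header declares a length of 511 (0x01FF), the data length is 503 and one trailing byte remains, which is consistent with the last byte of the buffer being 0x04, the type byte of the following packet.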

@Wraith2
Contributor Author

Wraith2 commented Aug 13, 2025

That assert will fire periodically when packet multiplexing is disabled. We should add in the context switch to the assertion.

That might be correct. I saw something similar while looking at the multipart xml reads with a weird packet size. If the packet status does not include the last packet bit and the requiredlength is less than the total packet, then as long as the transferred data amount is the same as the buffer size it's technically correct, I think. I'm referring to these as padded packets. I hadn't seen them before 2 weeks ago but the spec doesn't preclude them. When I saw them the remaining data space in the packet buffer was filled with FF. This is part of the reason that I added the DumpPackets and DumpInBuff functions to my debug branch.

@rhuijben

With packet size configured as 512 I see 511 byte packets (which fail these tests), but also one really large packet (>= 60 KB). Not sure if the debug assert does the right thing. It looks like the demultiplexer handles these cases just fine.

With this packet code you also always have to handle short-reads caused by network security and TCP packets. There are standard proxies for that last case so you can always get small (or large) packets from the network layer. The DotNet core project uses fuzzing with that to catch http errors, as do a lot of other libraries.

Looks like these asserts are on the wrong layer, as from the network you can have much smaller or larger packets than the TDS packets: smaller when processing really fast, and much larger when the network has already delivered more data than a single packet. The latter can also happen on slow networks when one packet got lost and is re-delivered while others are already in the queue.
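The layering point can be sketched as a reassembler that accepts network reads of any size and emits only complete TDS packets. This is an illustrative sketch, not the driver's multiplexer: `PacketReassembler` is a hypothetical class, and it assumes the header layout in which bytes 2 and 3 hold the total packet length (header included) as a big-endian value.

```csharp
using System;
using System.Collections.Generic;

public sealed class PacketReassembler
{
    private const int HeaderLength = 8;
    private readonly List<byte> _pending = new List<byte>();

    // Bytes received so far that do not yet form a complete packet.
    public int PendingBytes
    {
        get { return _pending.Count; }
    }

    // Accepts a network read of any size and returns the packets it completes.
    public List<byte[]> Append(byte[] chunk, int count)
    {
        for (int i = 0; i < count; i++)
        {
            _pending.Add(chunk[i]);
        }

        var complete = new List<byte[]>();
        while (_pending.Count >= HeaderLength)
        {
            int packetLength = (_pending[2] << 8) | _pending[3];
            if (packetLength < HeaderLength)
            {
                throw new InvalidOperationException("Declared packet length is shorter than the header.");
            }
            if (_pending.Count < packetLength)
            {
                break; // short read: wait for more data
            }
            complete.Add(_pending.GetRange(0, packetLength).ToArray());
            _pending.RemoveRange(0, packetLength);
        }
        return complete;
    }
}
```

With this shape, a short read simply yields no packets and a long read yields several, so an assertion that one network read equals one TDS packet is not needed at this layer.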

@Wraith2
Contributor Author

Wraith2 commented Aug 15, 2025

I've pushed a bunch of new fixes. Can I get a CI run @dotnet/sqlclientdevteam and if that builds some testing by brave people who have reproduced known issues please?

@paulmedynski
Contributor

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

codecov bot commented Aug 15, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.55%. Comparing base (1e1c52a) to head (0d74d99).
⚠️ Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...Data/SqlClient/TdsParserStateObject.Multiplexer.cs | 0.00% | 4 Missing ⚠️ |
| ...nt/netfx/src/Microsoft/Data/SqlClient/TdsParser.cs | 0.00% | 3 Missing ⚠️ |
| ...c/Microsoft/Data/SqlClient/TdsParserStateObject.cs | 62.50% | 3 Missing ⚠️ |
| ...icrosoft/Data/SqlClient/LocalAppContextSwitches.cs | 66.66% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (1e1c52a) and HEAD (0d74d99). Click for more details.

HEAD has 1 upload less than BASE

| Flag | BASE (1e1c52a) | HEAD (0d74d99) |
|---|---|---|
| addons | 1 | 0 |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
- Coverage   69.14%   63.55%   -5.59%     
==========================================
  Files         276      268       -8     
  Lines       62414    62154     -260     
==========================================
- Hits        43154    39504    -3650     
- Misses      19260    22650    +3390     
| Flag | Coverage Δ |
|---|---|
| addons | ? |
| netcore | 66.77% <57.89%> (-6.20%) ⬇️ |
| netfx | 63.44% <47.61%> (-4.99%) ⬇️ |

@Wraith2
Contributor Author

Wraith2 commented Aug 15, 2025

All green apart from a timeout. @rhuijben if you get the opportunity, could you try this branch or the PR artifacts and see if you can repro any problems?

@rhuijben

I'm away from my PC for a few days. Will follow up when I get back.

@ErikEJ
Contributor

ErikEJ commented Aug 16, 2025

I will test the artifact with my repro in the coming week

@ErikEJ
Contributor

ErikEJ commented Aug 18, 2025

@Wraith2 My repro code using the SQL 2019 VM that I gave you access to with build 6.11.0-pull.122707 and
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityProcessSni", false);
throws:

Unhandled exception. System.Xml.XmlException: Unexpected end tag. Line 1, position 39.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.Throw(String res, String arg)
   at System.Xml.XmlTextReaderImpl.Throw(Int32 pos, String res)
   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
   at System.Xml.XmlWriter.WriteNode(XmlReader reader, Boolean defattr)
   at System.Data.SqlTypes.SqlXml.get_Value()
   at Microsoft.Data.SqlClient.SqlCachedBuffer.ToString()
   at Microsoft.Data.SqlClient.SqlBuffer.get_String()
   at Microsoft.Data.SqlClient.SqlDataReader.GetString(Int32 i)
   at DatabaseTester.GetAndCompareDataAsync(List`1 originalData) in C:\Users\ErikEjlskovJensen(De\source\repos\SqlClientB

@Wraith2
Contributor Author

Wraith2 commented Aug 18, 2025

Odd. I can't connect to your server at all, could you check and see if the IP address changed?

To confirm, are these the settings you're seeing errors with?

bool compat = false;
bool managed = false;
bool useContinue = false;
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows", managed);
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityAsyncBehaviour", !useContinue);
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityProcessSni", compat);

@ErikEJ
Contributor

ErikEJ commented Aug 18, 2025

@Wraith2 I have just turned it back on 😄

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityProcessSni", false); is the only switch I have in my code. Not sure how it relates to all the others. And I am running the test on a Windows 11 PC (so using unmanaged networking).

@Wraith2
Contributor Author

Wraith2 commented Aug 18, 2025

OK. Well, I've run this branch against my server in Sweden, which is similar to yours, for a while and couldn't replicate the problem. Now I'm running directly against yours again and I'll let it go for a while, but I'm not seeing any problems occurring.

@ErikEJ
Contributor

ErikEJ commented Aug 18, 2025

@Wraith2 This is my code, and the issue repros right away!

using Microsoft.Data.SqlClient;

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows", false);
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityAsyncBehaviour", false);
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityProcessSni", false);

var connectionString = "Data Source=20.x.x.x,1433;Initial Catalog=TestDB;User Id=wraith2;Password=x;Encrypt=False;Trust Server Certificate=False;Command Timeout=30";

var dbTester = new DatabaseTester(connectionString);
        
var connectionsStringBuilder = new SqlConnectionStringBuilder(connectionString);
connectionsStringBuilder.Encrypt = true;
connectionsStringBuilder.TrustServerCertificate = true;
    
var dataSetTester = new DatabaseTester(connectionsStringBuilder.ConnectionString);
var originalRecords = await dataSetTester.GetAndCompareDataAsync(null);
while(true)
{
    Console.WriteLine("RUNNING");
    await dbTester.GetAndCompareDataAsync(originalRecords);
}

@ErikEJ
Contributor

ErikEJ commented Aug 18, 2025

@Wraith2 If I change above to this, the issue goes away

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityAsyncBehaviour", true);

@Wraith2
Contributor Author

Wraith2 commented Aug 18, 2025

I still can't replicate. I'm a few commits ahead locally though, so it's possible I've fixed it. I didn't push over the weekend so that the artifact would be available for people to test, but since we're seeing a problem I've pushed now. @dotnet/sqlclientdevteam can you run CI please?

@mdaigle
Contributor

mdaigle commented Aug 18, 2025

/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@Wraith2
Contributor Author

Wraith2 commented Aug 18, 2025

New artifacts are here: https://sqlclientdrivers.visualstudio.com/904996cc-6198-4d39-8540-eca72bdf0b7b/_apis/build/builds/123164/artifacts?artifactName=Artifacts&api-version=7.1&%24format=zip if you could try them please.

@ErikEJ
Contributor

ErikEJ commented Aug 18, 2025

@Wraith2 When using 6.11.0-pull.123002 my repro no longer crashes! 🎉

@Wraith2
Contributor Author

Wraith2 commented Aug 18, 2025

That's a relief. Thanks.

Edit: Though, reviewing the commits since that build I'm suspicious because none of them went near xml. I suspect that the problem was around continue mode with xml where I know there is a bug (I have a fix on my dev branch) so adding in the RequestContinue functionality is causing it to take the working vs broken path.

@edwardneal
Contributor

I've been running my own test cases against the branch prior to the most recent changes, the results are available here. These covered most of the combinations I can think of:

  • Servers: Local SQL 2019 instance, @Wraith2's server in Sweden, and two Azure SQL instances (UKSouth and AustraliaEast)
  • Read types: read a blob as a byte array or as a stream; read a varchar as a string or as a TextReader, read an nvarchar as a string or as a TextReader
  • Data sizes: 1MB, 5MB, 25MB
  • Packet sizes: 512, 567, 8000
  • SqlDataReader behaviour: Default and SequentialAccess
  • Parallel threads: 1, 2, 8
  • Type: sync and async
  • AppContext switches: all at default

I wasn't able to reproduce any exceptions which were unique to this PR; I'll kick them off against the latest version later this evening.

@ErikEJ
Contributor

ErikEJ commented Aug 19, 2025

@edwardneal

> AppContext switches: all at default

I am a little confused about the current state of the switches - does "all at default" enable all the new async multi packet features?

Comment on lines +81 to +87
 if (AppContext.TryGetSwitch(UseCompatibilityProcessSniString, out bool returnedValue) && !returnedValue)
 {
-    s_useCompatibilityProcessSni = Tristate.True;
+    s_useCompatibilityProcessSni = Tristate.False;
 }
 else
 {
-    s_useCompatibilityProcessSni = Tristate.False;
+    s_useCompatibilityProcessSni = Tristate.True;
Contributor Author

@ErikEJ in this PR the default is changed so that UseCompatibilityProcessSni is true by default. That means that the new async behaviour is disabled by default.
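The default described here can be sketched with the standard `AppContext` API. `CompatSwitchSketch` is a hypothetical class written for illustration; unlike the library code in the diff, it does not cache the result in a tristate field.

```csharp
using System;

public static class CompatSwitchSketch
{
    public const string SwitchName =
        "Switch.Microsoft.Data.SqlClient.UseCompatibilityProcessSni";

    // Returns false only when the switch has been explicitly set to false.
    // An unset switch, or one set to true, selects compatibility mode.
    public static bool UseCompatibilityProcessSni()
    {
        if (AppContext.TryGetSwitch(SwitchName, out bool value) && !value)
        {
            return false;
        }
        return true;
    }
}
```

Under this logic, opting in to the new packet processing requires an explicit `AppContext.SetSwitch(CompatSwitchSketch.SwitchName, false)` before the switch is first read.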

Contributor

Thanks, so in a repro context, we should use:

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityAsyncBehaviour", false);
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseCompatibilityProcessSni", false);

Contributor Author

For public consumption the settings you give are the defaults.

If you can find an issue I want to know about it so I can fix it, regardless of the settings but you will need to tell me what the settings are so I can try to repro.

Contributor

Got it. No issues found with latest build.

Successfully merging this pull request may close these issues.

6.1.0: Errors while executing the query
8 participants