Skip to content

Extract Ms-cabs while reading instead of loading all datablocks into memory#56

Merged
mnadareski merged 21 commits intoSabreTools:mainfrom
HeroponRikiBestest:mscab-read-extract
Jan 24, 2026
Merged

Extract Ms-cabs while reading instead of loading all datablocks into memory#56
mnadareski merged 21 commits intoSabreTools:mainfrom
HeroponRikiBestest:mscab-read-extract

Conversation

@HeroponRikiBestest
Copy link
Contributor

Also fixes #44 somehow.

While this is marked as draft, it's basically done. The only things I still want to do are add failsafes so that as much extraction as possible can continue to happen if errors occur, fix some edge cases (at present, I only know of one file in one cab set that's a regression, the other issues seem to have already been present), and convert most of my TODOs into PR comments. While I did mention before that I wanted to split up the main extraction loop, trying to do so resulted in some really ugly code due to how much state needs to be shared, so I went back on that. I'm perfectly fine with splitting things up if still asked, though.

@HeroponRikiBestest
Copy link
Contributor Author

HeroponRikiBestest commented Jan 15, 2026

Note to self- figure out why now-reverted commit f3c3de0 caused so many regressions. It's probably simple

@HeroponRikiBestest
Copy link
Contributor Author

Note to self- now-reverted commit has been re-implemented, the issue was that the case of 0 bytes remaining needs to be handled with a while loop and not an if statement in order to properly process 0-byte files.

@HeroponRikiBestest
Copy link
Contributor Author

All regressions have been identified and removed, and all remaining preexisting issues I noticed are fixed, with two exceptions. There are a very small number of cabs that have ST.IO MsZIP issues, which as far as I can tell aren't due to ST.Serialization. I do not know enough about MsZIP on that side of things to figure out what's going wrong, so I'll open an issue for those. Beyond that, the only potential remaining issue is that BOS might have superfluous errors scanning empty 36 byte mscabs, but I'm not sure. It's not really much of an issue if that is the case.

The only things left will be placing proper failsafes so that as much extraction occurs as possible for partial sets and damaged cabs, and going through my remaining TODOs to convert most of them into PR comments. Then the PR will be ready for review.

@HeroponRikiBestest HeroponRikiBestest marked this pull request as ready for review January 16, 2026 19:59
@HeroponRikiBestest
Copy link
Contributor Author

HeroponRikiBestest commented Jan 16, 2026

PR is ready for review.

/// <returns>True if the file extracted, false otherwise</returns>
private static bool ExtractFile(string outputDirectory, Stream blockStream, CFFILE file, bool includeDebug)
/// <returns>True if all files extracted, false otherwise</returns>
private bool ExtractSet(string? cabFilename, string outputDirectory, bool ignorePrev, bool includeDebug)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that this is a top-of-method comment and not at any specific other point.

This entire method feels very difficult to parse. There are multiple nested bits of code that feel like they should be private helper methods. It also feels like this entire thing would benefit from having some sort of external state tracking, similar to IS-CAB's extraction helper class. The way that you have to juggle variables to make sure that you're tracking the cabinet, folder, block, and file feel overly-complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I'll have to look into how IS-CAB manages state, then. It'll probably be a while then since a bunch of logic will need to be re-written, I'll avoid any other fixes until I have that done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A full rewrite can be done in a followup PR. I was putting this comment here as a very general statement. If it's working, then let's get a working version of the code in before it gets retooled. Incremental changes are fine here since there's no rush to get a new stable version of the library out.

@HeroponRikiBestest
Copy link
Contributor Author

As far as I'm aware, I've addressed everything mentioned here so far. Let me know if I've missed anything, or done anything wrong.

@mnadareski mnadareski merged commit 2e5c5b3 into SabreTools:main Jan 24, 2026
1 check passed
HeroponRikiBestest added a commit to HeroponRikiBestest/SabreTools.Serialization that referenced this pull request Jan 27, 2026
…memory (SabreTools#56)

* WIP

* WIP 2

* Todo: you're missing a read somehow and getting misaligned by two bytes? maybe properly implementing the buffer will magically fix it

* continued blocks my behated

* Pre-major-testing

* Forgot to add summaries

* Attempt to properly roll back. The state i wanted to roll back to wasn't in a commit before.

* Figured out the issue with the rolled back commit, this has to be a while loop because of 0 byte files. Reimplemented clean code.

* Comment so I don't forget why it's like this

* Skip unsupported compression types before opening filestream.

* Reenable non-start cab skipping

* TODO so i don't forget

* TODO so I don't forget

* iterate on continued block correctly.

* Handle incomplete extraction better

* Remove TODOs to ready for PR comments

* Missed one

* Some minor fixes before rewriting everything

* Next round of fixes.

* Next set of fixes

* Fix debug output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Correct mszip extraction on some cabs currently relies on an exception being thrown

2 participants