jcrussell

I took a look at adding support for MSI (#1211) and ran into a couple of issues.

Pymsi requires a path or BytesIO -- is there a way to convert from an unblob file? I was looking at file_utils.OffsetFile which seems closer (might need pymsi to support file-like objects, @nightlark). I made the following changes:

--- a/python/unblob/file_utils.py
+++ b/python/unblob/file_utils.py
@@ -125,6 +125,16 @@ class OffsetFile:
    def tell(self):
        return self._file.tell() - self._offset

+    @property
+    def closed(self):
+        return self._file is None
+
+    def close(self):
+        if self._file:
+            self._file.close()
+            self._file = None

And in pymsi changed the __init__ for Package to:

class Package:
    def __init__(self, path_or_bytesio: Union[Path, io.BytesIO]):
        if hasattr(path_or_bytesio, "read"):
            self.path = None
            self.file = path_or_bytesio
        else:
            self.path = path_or_bytesio.resolve(True)
            self.file = self.path.open("rb")

This seems to work but I'm not sure if there's an easier way.

I'm getting an error related to sandboxing which I suspect is unrelated:

Activated FS access restrictions; rules=[Read("/"), ReadWrite("/dev/shm"), ReadWrite("/tmp/foo"), RemoveDir("/tmp/foo"), RemoveFile("/tmp/foo"), MakeDir("/tmp"), ReadWrite("unblob.log")], status=FullyEnforced pid=3928538
Processing file                path=/home/jon/Downloads/7z2501.msi pid=3928542 size=0x17dc00
...
  File "/usr/lib/python3.11/pathlib.py", line 1045, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/tmp/foo/7z2501.msi_extract/0-1513984.msi'

I've been testing with a 7zip MSI downloaded from here:

https://www.7-zip.org/a/7z2501.msi

Finally, I commented out an entry in DEFAULT_SKIP_MAGIC to get the new handler called. I don't know what other side effects this will have.

Any suggestions?

Basic scaffolding, running into a few issues.
@qkaiser qkaiser self-requested a review August 25, 2025 06:45
@qkaiser qkaiser self-assigned this Aug 25, 2025
@qkaiser qkaiser added enhancement New feature or request format:archive format:executable python Pull requests that update Python code format:vendor Custom vendor format labels Aug 25, 2025
Contributor

qkaiser commented Aug 25, 2025

pymsi

You should add this dependency in an independent commit with the following command:

uv add python-msi
git add pyproject.toml uv.lock
git commit -m 'chore(deps): add python-msi dependency'

BytesIO and unblob File

I'd recommend submitting a PR to python-msi that does something along these lines:

diff --git a/src/pymsi/package.py b/src/pymsi/package.py
index 43ecaee..9f84c66 100644
--- a/src/pymsi/package.py
+++ b/src/pymsi/package.py
@@ -1,8 +1,10 @@
 import copy
 import io
+import mmap
 from pathlib import Path
 from typing import Iterator, Optional, Union
 
+
 import olefile
 
 from pymsi import streamname
@@ -18,13 +20,14 @@ from .summary import Summary
 
 
 class Package:
-    def __init__(self, path_or_bytesio: Union[Path, io.BytesIO]):
-        if isinstance(path_or_bytesio, io.BytesIO):
-            self.path = None
-            self.file = path_or_bytesio
-        else:
+    def __init__(self, path_or_bytesio: Union[Path, io.BytesIO, mmap.mmap]):
+        if isinstance(path_or_bytesio, Path):
             self.path = path_or_bytesio.resolve(True)
             self.file = self.path.open("rb")
+        else:
+            self.path = None
+            self.file = path_or_bytesio
+
         self.tables = {}
         self.ole = None
         self.summary = None

Reading from a BytesIO or an mmap'ed file in Python is pretty similar. Inverting the check and extending the type hint is sufficient to make unblob work with code like this:

def calculate_chunk(self, file: File, start_offset: int) -> Optional[ValidChunk]:
    file.seek(start_offset, io.SEEK_SET)

    package = pymsi.Package(file)
    msi = pymsi.Msi(package, False)

    # MSI moves the file pointer
    msi_end_offset = file.tell()

    return ValidChunk(
        start_offset=start_offset,
        end_offset=msi_end_offset,
    )

The type hint change is not even required. Just inverting the condition, so that pymsi is more lax when it's not working on a Path, is enough.
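
To illustrate why the inverted check is enough, here is a minimal sketch showing that a BytesIO and an mmap'ed file answer the same seek/read calls. It assumes the unblob File behaves like an mmap, as the type hint in the diff suggests; read_magic is a hypothetical helper, not part of pymsi or unblob.

```python
import io
import mmap
import tempfile


def read_magic(file, length=8):
    """Read `length` bytes from the start of any seekable binary file-like object."""
    file.seek(0)
    return file.read(length)


# Sample payload: the OLE compound file signature followed by filler bytes.
payload = bytes.fromhex("d0cf11e0a1b11ae1") + b"rest of file"

# Same helper, BytesIO input.
from_bytesio = read_magic(io.BytesIO(payload))

# Same helper, mmap input backed by a temporary file.
with tempfile.TemporaryFile() as tmp:
    tmp.write(payload)
    tmp.flush()
    with mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        from_mmap = read_magic(mapped)
```

Both calls go through the same code path, which is why a constructor that only special-cases Path can accept either object.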

Integration Tests

To validate that the handler works, you must create a directory and put files in it so that we can check that unblob works properly and catch regressions in the future:

# create directories
mkdir -p tests/integration/archive/msi/__input__
mkdir -p tests/integration/archive/msi/__output__
# create input file
wget -O tests/integration/archive/msi/__input__/package.msi https://www.7-zip.org/a/7z2501.msi
# create output files
unblob -vvv -e tests/integration/archive/msi/__output__ -f -k tests/integration/archive/msi/__input__/package.msi
# commit them
git add tests/integration/archive/msi/*
git commit ...

Skip Magic Change

It's okay to modify the skip magic list, as long as the file types you remove from the list are handled by a default handler (which is the case here).

You can simply remove the line, rather than commenting it out.

Sandboxing Exception

We need to fix that, but if you run it twice it'll disappear. Getting sandboxing right is hard :)

Comment on lines 65 to 66
# MSI moves the file pointer
msi_end_offset = buf.tell()
Contributor

This is not right. If you look at the output directory when run on the 7z MSI, you'll see that it carves two chunks:

0-1545728.msi
1545728-1563648.unknown

Looking into the unknown chunk, we see information that belongs to the Summary field:

hexdump -C 1545728-1563648.unknown
--snip--
00004470  1e 00 00 00 16 00 00 00  49 6e 73 74 61 6c 6c 61  |........Installa|
00004480  74 69 6f 6e 20 44 61 74  61 62 61 73 65 00 00 00  |tion Database...|
00004490  1e 00 00 00 0e 00 00 00  37 2d 5a 69 70 20 50 61  |........7-Zip Pa|
000044a0  63 6b 61 67 65 00 00 00  1e 00 00 00 0c 00 00 00  |ckage...........|
000044b0  49 67 6f 72 20 50 61 76  6c 6f 76 00 1e 00 00 00  |Igor Pavlov.....|
000044c0  0a 00 00 00 49 6e 73 74  61 6c 6c 65 72 00 00 00  |....Installer...|
000044d0  1e 00 00 00 0e 00 00 00  37 2d 5a 69 70 20 50 61  |........7-Zip Pa|
000044e0  63 6b 61 67 65 00 00 00  1e 00 00 00 0b 00 00 00  |ckage...........|
000044f0  49 6e 74 65 6c 3b 31 30  33 33 00 00 1e 00 00 00  |Intel;1033......|
00004500  27 00 00 00 7b 32 33 31  37 30 46 36 39 2d 34 30  |'...{23170F69-40|
00004510  43 31 2d 32 37 30 31 2d  32 35 30 31 2d 30 30 30  |C1-2701-2501-000|
00004520  30 30 32 30 30 30 30 30  30 7d 00 00 03 00 00 00  |002000000}......|
00004530  c8 00 00 00 03 00 00 00  02 00 00 00 03 00 00 00  |................|
00004540  02 00 00 00 40 00 00 00  80 8a 97 7e 8e 04 dc 01  |....@......~....|
00004550  40 00 00 00 80 8a 97 7e  8e 04 dc 01 1e 00 00 00  |@......~........|
00004560  31 00 00 00 57 69 6e 64  6f 77 73 20 49 6e 73 74  |1...Windows Inst|
00004570  61 6c 6c 65 72 20 58 4d  4c 20 76 32 2e 30 2e 33  |aller XML v2.0.3|
00004580  37 31 39 2e 30 20 28 63  61 6e 64 6c 65 2f 6c 69  |719.0 (candle/li|
00004590  67 68 74 29 00 00 00 00  00 00 00 00 00 00 00 00  |ght)............|
000045a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

You can check that by opening the file in https://pymsi.readthedocs.io/en/latest/msi_viewer.html and looking at the Summary tab.

I know very little about the MSI format, but it looks like the end offset could be calculated from the OLE container format that MSI is built on, probably with some arithmetic on sector sizes and sector counts.
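
One possible shape for that calculation, as a rough sketch only: read the sector size from the compound file header, walk the FAT sectors listed in the header, and treat the last non-free FAT entry as the last sector of the file. The header offsets follow the published compound file layout; cfb_end_offset is a hypothetical helper, it is not what pymsi or unblob does, and it deliberately refuses files with more than 109 FAT sectors instead of following the DIFAT chain.

```python
import struct

CFB_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
FREESECT = 0xFFFFFFFF      # FAT marker for an unallocated sector
MAXREGSECT = 0xFFFFFFFA    # largest regular sector number
HEADER_DIFAT_ENTRIES = 109  # FAT sector locations stored in the header


def cfb_end_offset(file, start_offset=0):
    """Estimate where a compound file ends, from its header and FAT.

    Assumption: the highest sector with a non-free FAT entry is the
    last sector that belongs to the file.
    """
    file.seek(start_offset)
    header = file.read(512)
    if len(header) < 512 or header[:8] != CFB_MAGIC:
        raise ValueError("not an OLE compound file")

    (sector_shift,) = struct.unpack_from("<H", header, 0x1E)
    (num_fat_sectors,) = struct.unpack_from("<I", header, 0x2C)
    if sector_shift not in (9, 12):
        raise ValueError("unexpected sector shift")
    if num_fat_sectors > HEADER_DIFAT_ENTRIES:
        raise NotImplementedError("DIFAT chain is not handled in this sketch")

    sector_size = 1 << sector_shift
    entries_per_sector = sector_size // 4
    difat = struct.unpack_from(f"<{HEADER_DIFAT_ENTRIES}I", header, 0x4C)

    last_used = -1
    for fat_index, fat_sector in enumerate(difat[:num_fat_sectors]):
        if fat_sector > MAXREGSECT:
            raise ValueError("invalid FAT sector location")
        # Sector n starts at (n + 1) * sector_size: the header takes the first slot.
        file.seek(start_offset + (fat_sector + 1) * sector_size)
        raw = file.read(sector_size)
        if len(raw) < sector_size:
            raise ValueError("truncated FAT sector")
        entries = struct.unpack(f"<{entries_per_sector}I", raw)
        for i, entry in enumerate(entries):
            if entry != FREESECT:
                last_used = max(last_used, fat_index * entries_per_sector + i)

    if last_used < 0:
        raise ValueError("empty FAT")
    # End of sector `last_used`, again shifted by one slot for the header.
    return start_offset + (last_used + 2) * sector_size
```

In a handler this value could stand in for file.tell() as the chunk end offset, but it would need to be checked against real samples such as the 7z MSI before relying on it.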

jcrussell added a commit to jcrussell/pymsi that referenced this pull request Aug 25, 2025
Based on suggestion from @qkaiser. This will make it easier to integrate
with unblob (see onekey-sec/unblob#1244).
Requires this PR to (almost) work properly:

nightlark/pymsi#81
Don't assume that pymsi will actually read the entire file.