395 add the information about match context to the database #439

michalkrzem · 2024-11-22T09:04:59Z

Your checklist for this pull request

I've read the contributing guideline.
I've tested my changes by building and running mquery, and testing changed functionality (if applicable)
I've added automated tests for my change (if applicable, optional)
I've updated documentation to reflect my change (if applicable)

What is the current behaviour?

What is the new behaviour?

Test plan

Closing issues

fixes #issuenumber

msm-cert · 2024-12-05T11:25:32Z

src/tasks.py

+        """Reads a specific range of bytes from the already loaded file content around a given offset.
+
+        Args:
+            data (bytes): Data to read.
+            matched_length (int): Number of bytes to read.
+            offset (int): The offset in bytes from which to start reading.
+            byte_range (int): The range in bytes to read around the offset (default is 32).
+
+        Returns:
+            bytes: A chunk of bytes from the file, starting from the given offset minus bit_range
+                   and ending at offset plus matched_length and byte_range.


Generated by chatgpt? 😀

I don't think the docstrings here are very good (some parts are actually plain wrong), and we don't generally use that docstring format for arguments, so I think argument docs can/should be removed. Maybe:

Suggested change

-        """Reads a specific range of bytes from the already loaded file content around a given offset.
-
-        Args:
-            data (bytes): Data to read.
-            matched_length (int): Number of bytes to read.
-            offset (int): The offset in bytes from which to start reading.
-            byte_range (int): The range in bytes to read around the offset (default is 32).
-
-        Returns:
-            bytes: A chunk of bytes from the file, starting from the given offset minus bit_range
-                   and ending at offset plus matched_length and byte_range.
+        """Return `matched_length` bytes from `offset`, along with `byte_range` bytes before and after the match.

msm-cert · 2024-12-05T11:28:50Z

src/tasks.py

+    @staticmethod
+    def read_file(file_path: str) -> bytes:
+        """Reads the entire file content.
+
+        Returns:
+            bytes: The content of the file.
+        """
+        with open(file_path, "rb") as file:
+            return file.read()


I don't think this method is necessary - you can just inline this.

Even if you prefer to keep it, it doesn't belong in this class (Agent). But IMO it's best to inline it.

msm-cert · 2024-12-12T11:45:16Z

src/tasks.py

+            return file.read()
+
+    @staticmethod
+    def read_bytes_from_offset(


Not a very good method name. Maybe read_bytes_with_context?

Also, there's no reason to use @staticmethod here - it's generic and has no relation to the Agent class. Please make it a global (module-level) function.
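For illustration, a minimal sketch of what a module-level read_bytes_with_context could look like. It assumes the argument order used later in this thread (data, matched_length, offset) and the default byte_range of 32 mentioned in the docstring. Clamping the start index at zero is an assumption of this sketch, not necessarily what the merged code does.

```python
from typing import Tuple


def read_bytes_with_context(
    data: bytes, matched_length: int, offset: int, byte_range: int = 32
) -> Tuple[bytes, bytes, bytes]:
    """Return `matched_length` bytes from `offset`, along with
    `byte_range` bytes before and after the match."""
    # Clamp at 0 so a match near the start of the file does not
    # produce a negative slice index (assumption of this sketch).
    start = max(0, offset - byte_range)
    end = offset + matched_length

    before = data[start:offset]
    matching = data[offset:end]
    # Slicing past the end of `data` is safe: Python truncates the slice.
    after = data[end : end + byte_range]
    return before, matching, after
```

Because the function only depends on its arguments, it needs neither self nor @staticmethod and can be called directly from anywhere in the module.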

msm-cert · 2024-12-12T11:58:19Z

src/tasks.py

+                            "after": base64.b64encode(after).decode("utf-8"),
+                        }
+                    )
+                    context.update({str(yara_match): match_context})


str(yara_match)? Probably better to avoid relying on str() representation and to use something like yara_match.rule (or .name? I don't remember)

One thing that's missing is that you don't put string name anywhere - this makes it impossible to tell which string matched. In most cases there is a single yara rule per query, but multiple strings.

It definitely must be stored somewhere. Probably the easiest way is to use a dict for context instead of a list:

{
    "rule_name_1": {
        "string_1": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        },
        "string_2": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        }
    },
    "rule_name_2": {
        "string_1": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        },
        "string_2": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        }
    }
}

Finally, this expression is overly complicated - I believe

context.update({A: B})

is equivalent to

context[A] = B

Here, in the match key, we have information about the string we are searching for. But okay, I'll replace the list with a dictionary.
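As a rough sketch of the layout proposed above (rule name, then string name, then before/match/after), the nested dict could be built as follows. The Fake* classes are hypothetical stand-ins so the example runs without yara installed. The attribute names rule, strings, instances, offset and matched_length come from the snippets in this thread, while identifier as the string name is an assumption.

```python
import base64
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical stand-ins for the yara match objects (not the real classes).
@dataclass
class FakeInstance:
    offset: int
    matched_length: int


@dataclass
class FakeStringMatch:
    identifier: str  # assumed name of the attribute holding e.g. "$s1"
    instances: List[FakeInstance]


@dataclass
class FakeMatch:
    rule: str
    strings: List[FakeStringMatch]


def b64(chunk: bytes) -> str:
    return base64.b64encode(chunk).decode("utf-8")


def build_context(
    data: bytes, matches: List[FakeMatch], byte_range: int = 32
) -> Dict[str, Dict[str, Dict[str, str]]]:
    context: Dict[str, Dict[str, Dict[str, str]]] = {}
    for yara_match in matches:
        match_context: Dict[str, Dict[str, str]] = {}
        for string_match in yara_match.strings:
            # Only the first instance of each string is stored.
            first = string_match.instances[0]
            start = first.offset
            end = start + first.matched_length
            match_context[string_match.identifier] = {
                "before": b64(data[max(0, start - byte_range):start]),
                "match": b64(data[start:end]),
                "after": b64(data[end:end + byte_range]),
            }
        # Plain item assignment instead of context.update({A: B}).
        context[yara_match.rule] = match_context
    return context
```

Keying by plain strings at both levels keeps the structure directly serializable to JSON.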

msm-cert · 2024-12-12T11:59:26Z

src/tasks.py

+            for string_match in yara_match.strings:
+                expression_keys = []
+                for expression_key in string_match.instances:
+                    if expression_key in expression_keys:


This check is a no-op - for a regular Python object (with no __eq__ override), object in SOMETHING compares by identity, so it will always be false for an instance that was not itself added before.

I don't fully get the idea behind this. If the intention was to only add a single match to the context (as originally planned, I think) then you should check if string_match was already added instead.

Or even better, just take the first instance - no loop necessary
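A small sketch of why the membership check never fires, and of the simpler alternative. Instance is a hypothetical stand-in class that, like most plain Python classes, defines no __eq__.

```python
class Instance:
    """Hypothetical stand-in for a match instance; defines no __eq__."""

    def __init__(self, offset: int, matched_length: int) -> None:
        self.offset = offset
        self.matched_length = matched_length


first = Instance(10, 4)
twin = Instance(10, 4)  # same field values, but a different object

seen = [first]

# Without __eq__, `in` falls back to identity comparison.
same_object_found = first in seen  # the very same object is found
twin_found = twin in seen          # a distinct object is never found

# Simpler alternative from the review: just take the first instance.
instances = [first, twin]
chosen = instances[0]
```

Since every iteration of the original loop yields a new object, the check against expression_keys could not skip anything.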

msm-cert · 2024-12-12T12:03:12Z

src/tasks.py

+        for yara_match in matches:
+            match_context = []
+            for string_match in yara_match.strings:
+                expression_keys = []


Since this is only used for presence checking, this should be a set instead of a list. But I think it's not necessary at all (see next comment)

msm-cert · 2024-12-12T12:07:57Z

src/tasks.py

+                    (before, matching, after,) = self.read_bytes_from_offset(
+                        data=data,
+                        offset=expression_key.offset,
+                        matched_length=expression_key.matched_length,
+                    )


Suggested change

-                    (before, matching, after,) = self.read_bytes_from_offset(
-                        data=data,
-                        offset=expression_key.offset,
-                        matched_length=expression_key.matched_length,
-                    )
+                    (before, matching, after) = self.read_bytes_from_offset(
+                        data,
+                        expression_key.offset,
+                        expression_key.matched_length,
+                    )

msm-cert · 2024-12-12T12:11:39Z

src/tasks.py


+    def get_match_context(
+        self, data: bytes, matches: List[yara.Match]
+    ) -> dict:


For now we use typing classes everywhere (and this annotation also lacks key/value types).

Suggested change

-    ) -> dict:
+    ) -> Dict[str, ???]:

(value type to be fixed depending on how other comments are resolved)

msm-cert · 2024-12-12T12:14:26Z

src/tasks.py

                    self.update_metadata(
-                        job.id, orig_name, path, [r.rule for r in matches]
+                        job=job.id,
+                        orig_name=orig_name,


Why explicit parameter names everywhere? It's a bit verbose, but OK - just wondering.

I read somewhere that when we use such a notation, we don’t have to worry about the order of the arguments, and we already know during the function call what variables are being assigned to (if the naming differs). That was someone’s opinion; I’ll adapt to the project then. :)

msm-cert

Thanks! Looks good. Sorry for waiting

msm-cert · 2024-12-19T12:24:36Z

src/tasks.py

+                (before, matching, after) = read_bytes_with_context(
+                    data, expression_key.matched_length, expression_key.offset
+                )
+                match_context[expression_key] = {


I'll fix the style in a second but... expression_key is an object. So match_context is not a Dict[str, ...], so context is not Dict[..., Dict[str, ...]], so something's wrong

msm-cert · 2024-12-19T13:00:42Z

I guess the PR was not fully tested, because:

                expression_key = string_match.instances[0]
                # ...
                match_context[expression_key] = { ... }

You're using a StringMatchInstance object as a key in JSON. This would raise a type error when trying to serialize...

...but it was not serialized, because typing on context model didn't match (it should be context: Dict[str, Dict[str, Dict[str, str]]]). It was then quietly dropped (so in the database context was always null).

Fixed both.
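To illustrate the first problem, here is a minimal sketch showing that the standard json module rejects a dict keyed by an arbitrary object, while the string-keyed shape serializes fine. FakeInstance and try_dump are hypothetical names used only for this example. The silent drop caused by the mismatched model typing is not reproduced here.

```python
import json
from typing import Dict, Optional

# The shape named in the comment above: rule -> string -> before/match/after.
MatchContext = Dict[str, Dict[str, Dict[str, str]]]


class FakeInstance:
    """Hypothetical stand-in for a StringMatchInstance object."""


def try_dump(context: dict) -> Optional[str]:
    """Return the JSON text, or None if the dict cannot be serialized."""
    try:
        return json.dumps(context)
    except TypeError:
        # json.dumps only accepts str, int, float, bool or None as keys.
        return None


payload = {"before": "eHg=", "match": "SEVMTE8=", "after": "eXk="}

# Object used as a key: serialization fails.
broken = {"rule_name_1": {FakeInstance(): payload}}

# String used as a key: matches MatchContext and serializes.
fixed: MatchContext = {"rule_name_1": {"$s1": payload}}

broken_json = try_dump(broken)
fixed_json = try_dump(fixed)
```

A test that round-trips the stored context through JSON would have caught both issues.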

Add the information about match context to the database

michalkrzem added 3 commits October 30, 2024 22:00

get offset and matched_length

ffdd46c

Draft code witch context example

511a255

Matches offest, len

106eff0

michalkrzem linked an issue Nov 22, 2024 that may be closed by this pull request

Add the information about match context to the database #395

Closed

michalkrzem added 5 commits December 2, 2024 00:12

Matching witch before and after ocntext.

8021242

lint

e041af2

lint

7b1b2e8

lint

ebc9277

lint

a536f95

michalkrzem requested a review from msm-cert December 1, 2024 23:23

michalkrzem added 6 commits December 2, 2024 10:33

bytes into base64 modified

acd8587

.

c253dfa

Name of rule in loop modified

12228bd

bug fixed, refactoring

c1ce768

refactoring

e259807

log test deleted

fc5a2f6

msm-cert requested changes Dec 12, 2024

michalkrzem added 8 commits December 13, 2024 11:13

after review

7fdfe9e

logging context deleted

36aa90e

lint

d0cdcd6

lint

1af4050

lint

7bdcef9

lint

a5a3058

Merge branch 'master' into 395-add-the-information-about-match-contex…

830a5b4

…t-to-the-database

fix migration

57fd651

msm-cert approved these changes Dec 19, 2024

msm-cert added 2 commits December 19, 2024 12:17

Update src/tasks.py

9809521

Update src/tasks.py

be6fd0b

msm-cert requested changes Dec 19, 2024

fix black/style

1dc9bd9

msm-code added 3 commits December 19, 2024 13:29

fix black/style

cde58bb

Fix the PR

5ade1a8

black

685e2b0

msm-cert marked this pull request as ready for review December 19, 2024 13:02

msm-cert approved these changes Dec 19, 2024

msm-cert merged commit 291a041 into master Dec 19, 2024
10 checks passed

msm-cert deleted the 395-add-the-information-about-match-context-to-the-database branch December 19, 2024 13:05

mickol34 pushed a commit that referenced this pull request Dec 19, 2024

Add the information about match context to the database (#439)

d931c43

4 participants