Skip to content

Conversation

PookieBuns
Copy link

Thanks to #610 we have covered a large step in being able to use insta for schema snapshotting as per #475 . However, currently binary files cannot be diffed when running cargo test or cargo insta review. This makes it difficult to compare and review changes when examining schema changes. This PR allows insta to try to utf-8 decode binary files that are text based to allow this subset of files to be diffed in the workflow

@max-sixty
Copy link
Collaborator

There's definitely something useful here; I can see some good cases for using the binary snapshot format for text files; for example when we want a raw file without a header.

I'm a bit concerned that this could lead to some unexpected behavior though:

  • If the binary snapshots are multiple megabytes, how does this behave?
    • We could add a limit, though then could be confusing why some snapshots are presented for a review and some aren't
  • What are the chances of some bytes getting through the String::from_utf8(new.to_vec()).ok())) check and the screen being filled with meaningless characters?

@PookieBuns
Copy link
Author

PookieBuns commented Jan 14, 2025

I think we can draw some inspiration to how diff behaves as part of GNU. Section 1.7 of this manual states

diff determines whether a file is text or binary by checking the first few bytes in the
file; the exact number of bytes is system dependent, but it is typically several thousand. If
every byte in that part of the file is non-null, diff considers the file to be text; otherwise
it considers the file to be binary.

You could also force the diff to compare files as text using the --text (-a) option. I think this would be a good start for this feature, as it would only be opt-in.

I could add an option in the Tool Config File that could enable diffing binary files as text with a behavior.text_diff option that by default is turned off. We could also add this as an option in cargo insta when reviewing files.

In regards to addressing your concerns in the case we make it opt-in

  1. It appears that if you force a text diff on large files (I tested with 2 video files), It will still try to diff them and produce erroneous output (at least this is the behavior on MacOS although its not technically GNU diff).

  2. I think if you do opt in for text_diff, you probably either

    1. Know for sure your files are text
    2. Are prepared to see some erroneous output

@max-sixty
Copy link
Collaborator

max-sixty commented Jan 27, 2025

Yes I think that could be viable!

That said, I'm a bit concerned that we're adding more options to our interface when there's a quite reasonable alternative — accepting the snapshot and running a git diff on the files.

If there's a way of building this so it's either:

  • really well-behaved by default (i.e. doesn't show a mess of characters), or
  • it's a single option in the config file and it doesn't have any really bad behavior (i.e. doesn't freeze on a multi-megabyte file)

...then I would support this

Also ofc open to others' views!

@PookieBuns
Copy link
Author

PookieBuns commented Feb 15, 2025

I updated the PR to accept a behavior.force_text or INSTA_FORCE_TEXT for clients to be able to opt in to text diffs on binary files. Also cargo-insta can take a --text argument to achieve the same effect. Note these are all opt in features which don't modify current behavior.

In regards to the better alternative in accepting then git diff, under the use case of a CI job, this would not be applicable and a pretty formatted textual diff would be a very nice-to-have when your test CI job fails

Some follow up

  1. Do you have resources where I can look at in regards to testing the SnapshotPrinter for these changes? I couldn't find any in the tests in the repo.

  2. As per really well behaved by default, gnu diff if you do choose to opt in and you end up comparing non-text files, it will also show a mess of characters, which is what I basically tried mimicking for the entire MR.

Thanks for the feedback and am as always super open on iterating more on this to achieve the desired behavior!

@max-sixty
Copy link
Collaborator

Thanks @PookieBuns

I'm still -0.2 about this — it seems to add settings to get to a config that a) is only marginally useful relative to alternatives, even taking into account CI b) is a setting that's better per-snapshot than per-project, because some binary snapshots are going to be too large or show mangled characters.

I'm very open to other thoughts — does anyone else have a view here?


I wonder whether there's some snapshot type that's like plain which uses the same architecture as binary snapshots, but it's text, and it's expected to be diff-ed. That would be useful for projects that want to commit generated text artifacts and test they match the output. (but not sure, possibly there are similar ideas that will develop)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants