Magika is a novel AI-powered file type detection tool that leverages recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized model that weighs only a few MBs and enables precise file identification within milliseconds, even when running on a single CPU. Magika has been trained and evaluated on a dataset of ~100M samples across 200+ content types (covering both binary and textual file formats), and it achieves an average accuracy of ~99% on our test set.
@@ -19,12 +22,12 @@ You can find more information on which content types are supported, extended doc
> **IMPORTANT**: This latest 0.6.1 version has a few breaking changes from the latest stable version, 0.5.1. Please consult the [CHANGELOG.md](https://github.com/google/magika/blob/main/python/CHANGELOG.md#061---2025-03-19) and the [migration guide](https://github.com/google/magika/blob/main/python/CHANGELOG.md#breaking-changes-and-migration-guide).
## Installing Magika
Magika is available as `magika` on [PyPI](https://pypi.org/project/magika):
To install the most recent stable version:
```shell
$ pip install magika
```
@@ -33,7 +36,6 @@ If you intend to use Magika only as a command line, you may want to use `$ pipx
If you want to test out the latest release candidate, you can install it with `pip install --pre magika`.
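Because the 0.6.x releases introduce breaking changes relative to 0.5.1, it can help to check which version is installed before relying on the newer API. The following is a minimal sketch that uses only the standard library; the helper names `installed_magika_version`, `major_minor`, and `has_new_api` are hypothetical, and the `(0, 6)` threshold is an assumption based on the version notes in this README.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def installed_magika_version() -> Optional[str]:
    """Return the installed magika version string, or None if it is not installed."""
    try:
        return metadata.version("magika")
    except metadata.PackageNotFoundError:
        return None


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '0.6.1' or '0.6.1rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_new_api(version: str) -> bool:
    """True for 0.6 and later (assumed threshold for the breaking changes)."""
    return major_minor(version) >= (0, 6)


version = installed_magika_version()
if version is None:
    print("magika is not installed")
else:
    print(f"magika {version}; 0.6+ API: {has_new_api(version)}")
```

Release candidates installed with `--pre` are handled as well, since only the leading numeric components are compared.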
## Using Magika as a command-line tool
> Beginning with version `0.6.0`, the magika Python package includes a pre-compiled Rust-based command-line tool, replacing the previous Python version. This binary is distributed as platform-specific wheels for most common architectures. For unsupported platforms, a pure-Python wheel is also available, providing the legacy Python client as a fallback.
@@ -168,10 +170,8 @@ Options:
Print version
```
Check the [Rust CLI docs](https://github.com/google/magika/blob/main/rust/cli/README.md) for more information.
## Using Magika as a Python module
> Note: The Python API introduced in version `0.6.0` closely resembles the previous version, but includes several enhancements and a few breaking changes. Migrating existing clients should be relatively straightforward. Where possible, we have maintained compatibility with the old API and added deprecation warnings. For a complete list of changes and migration guidance, consult the [CHANGELOG.md](https://github.com/google/magika/blob/main/python/CHANGELOG.md).
@@ -203,26 +203,26 @@ ini
ini
```
## Documentation on core concepts
To get the most out of Magika, it's worth learning about its core concepts. You can read about the models, prediction modes, output structure, and content type knowledge base in the documentation [here](https://github.com/google/magika/blob/main/docs/concepts.md).
### API documentation
First, create a `Magika` instance: `magika = Magika()`.
The constructor accepts the following optional arguments:
- `model_dir`: path to a model to use; defaults to the latest available model.
- `prediction_mode`: which prediction mode to use; defaults to `PredictionMode.HIGH_CONFIDENCE`.
- `no_dereference`: controls whether symlinks should be dereferenced; defaults to `False`.
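
As a brief illustration of these arguments, the sketch below builds a `Magika` instance with non-default options. It assumes that `Magika` and `PredictionMode` can be imported from the top-level `magika` package and that `PredictionMode.MEDIUM_CONFIDENCE` is a valid member (only `HIGH_CONFIDENCE` is named above). The import is guarded, so the snippet does nothing when the package is missing.

```python
try:
    from magika import Magika, PredictionMode
except ImportError:
    Magika = None  # magika is not installed; skip the demonstration
    PredictionMode = None

magika = None
if Magika is not None:
    magika = Magika(
        # model_dir is omitted, so the default (latest available) model is used.
        prediction_mode=PredictionMode.MEDIUM_CONFIDENCE,  # assumed enum member
        no_dereference=True,  # do not dereference symlinks
    )
    print(type(magika).__name__)
```

Creating the instance once and reusing it for many files is likely the better pattern, since the model is loaded at construction time.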
Once instantiated, the `Magika` object exposes methods to identify the content type of a `bytes` object, of files identified by their paths, and of an already-open binary stream:
- `magika.identify_bytes(b"test")`: takes as input a `bytes` object and predicts its content type.
- `magika.identify_path("test.txt")`: takes as input one `str | os.PathLike` object and predicts its content type.
- `magika.identify_paths(["test.txt", "test2.txt"])`: takes as input a list of `str | os.PathLike` objects and returns the predicted type for each of them.
- `magika.identify_stream(stream: typing.BinaryIO)`: takes as input an _already open_ binary file-like object (e.g., the output of `open(file_path, 'rb')`) and returns its predicted content type. Keep in mind that Magika will `seek()` around the stream, and that the stream _is not closed_ (closing is the responsibility of the caller).

If you are dealing with large files, the `identify_path`, `identify_paths`, and `identify_stream` variants are generally better: their implementation `seek()`s around the file/stream to extract the needed features, without loading the entire content into memory.
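
A minimal sketch of the bytes, path, and stream entry points on a temporary file follows. It assumes the label can be read as `result.prediction.output.label` (the attribute path this README gives for `MagikaResult`); `describe` is a hypothetical helper, and the import is guarded so the snippet is a no-op without `magika` installed.

```python
import os
import tempfile

try:
    from magika import Magika
except ImportError:
    Magika = None  # magika is not installed; skip the demonstration


def describe(result) -> str:
    """Return the output label of a result as a plain string."""
    return str(result.prediction.output.label)


labels = {}
if Magika is not None:
    magika = Magika()
    content = b"[section]\nkey = value\n"

    # Write the sample to disk so the path- and stream-based variants can be used.
    with tempfile.NamedTemporaryFile(suffix=".ini", delete=False) as tmp:
        tmp.write(content)
        path = tmp.name

    try:
        labels["bytes"] = describe(magika.identify_bytes(content))
        labels["path"] = describe(magika.identify_path(path))
        # The caller opens and closes the stream; Magika only seeks and reads.
        with open(path, "rb") as stream:
            labels["stream"] = describe(magika.identify_stream(stream))
    finally:
        os.remove(path)

    print(labels)
```

The `with` block around `open()` reflects the contract described above: Magika leaves the stream open, so the caller is the one who closes it.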
@@ -267,25 +267,24 @@ class ContentTypeLabel(StrEnum):
[...]
```
### Additional APIs
- `get_output_content_types()`: Returns a list of all possible content type labels that Magika can output (i.e., the possible values of `MagikaResult.prediction.output.label`). This is the recommended method for most users who want to list Magika's output space.
- `get_model_content_types()`: Returns a list of all possible content type labels the _deep learning model_ can output (i.e., `MagikaResult.prediction.dl.label`). Useful for debugging; most users should refer to `get_output_content_types()`.
- `get_module_version()` and `get_model_version()`: Return the module version and the name of the model in use, respectively.
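
To see how the two label spaces relate, the sketch below compares the lists returned by `get_output_content_types()` and `get_model_content_types()`. The helper `only_in_first` is hypothetical, and the comparison assumes the returned labels convert cleanly to strings; the import is guarded so the snippet does nothing when `magika` is not installed.

```python
from typing import Iterable, List

try:
    from magika import Magika
except ImportError:
    Magika = None  # magika is not installed; skip the demonstration


def only_in_first(first: Iterable[object], second: Iterable[object]) -> List[str]:
    """Return the sorted labels (as strings) present in `first` but not in `second`."""
    second_set = {str(label) for label in second}
    return sorted({str(label) for label in first} - second_set)


if Magika is not None:
    magika = Magika()
    output_labels = magika.get_output_content_types()
    model_labels = magika.get_model_content_types()
    print(f"{len(output_labels)} output labels, {len(model_labels)} model labels")
    # Labels that appear in the output space but not in the model's own label space.
    print(only_in_first(output_labels, model_labels))
```

Swapping the arguments lists the labels the model can emit that never appear in the final output space.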
## Development setup
- `magika` uses `uv` as a project and dependency management tool. To install all the dependencies: `$ cd python; uv sync`.
- To run the test suite: `$ cd python; uv run pytest tests -m "not slow"`. Check the GitHub Actions workflows for more information.
- We use the `maturin` backend to combine the Rust CLI with the Python codebase in the `magika` Python package. This process is automated via the [build python package GitHub action](https://github.com/google/magika/blob/main/.github/workflows/python-build-package.yml).
283
281
284
-
285
282
## Research Paper and Citation
We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. A pre-print of the paper is available on arXiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).
If you use this software for your research, please cite it as:
```bibtex
@InProceedings{fratantonio25:magika,
author = {Yanick Fratantonio and Luca Invernizzi and Loua Farah and Kurt Thomas and Marina Zhang and Ange Albertini and Francois Galilee and Giancarlo Metitieri and Julien Cretin and Alexandre Petit-Bianco and David Tao and Elie Bursztein},