25 changes: 25 additions & 0 deletions docs/analyzer/languages.md
@@ -75,3 +75,28 @@ the `docker build` phase and the models defined in it are installed automaticall

For `transformers` based models, the configuration [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/transformers.yaml).
A docker file supporting transformers models [can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/Dockerfile.transformers).

### Building custom Docker images for more languages

If you want to support languages beyond English in a custom Docker image, start with the NLP configuration file that the image copies during build:

- `presidio-analyzer/presidio_analyzer/conf/default.yaml` for the standard spaCy-based image
- `presidio-analyzer/presidio_analyzer/conf/transformers.yaml` for the transformers image
- `presidio-analyzer/presidio_analyzer/conf/stanza.yaml` for the Stanza image

Comment on lines +81 to +86

Copilot AI, Mar 23, 2026:

This section suggests only editing/passing `NLP_CONF_FILE`, but enabling additional languages in the container also typically requires updating `supported_languages` in `presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml` and ensuring the language has appropriate entries enabled in `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml`. Since the Dockerfiles already expose `ANALYZER_CONF_FILE` and `RECOGNIZER_REGISTRY_CONF_FILE` build args, consider documenting those alongside `NLP_CONF_FILE` (and mentioning that all three configs must be consistent) to avoid unsupported-language errors and the recognizer warnings mentioned below.

Then pass that file to the Docker build through `NLP_CONF_FILE`. For example:

```bash
docker build -f presidio-analyzer/Dockerfile \
  --build-arg NLP_CONF_FILE=presidio_analyzer/conf/default.yaml \
  -t presidio-analyzer-custom .
```

Comment on the `--build-arg NLP_CONF_FILE=...` line

Copilot AI, Mar 23, 2026:

The docker build example is likely incorrect for a repo-root build context (`.`): `--build-arg NLP_CONF_FILE=presidio_analyzer/conf/default.yaml` won’t exist at that relative path. Either change the build context to `./presidio-analyzer` (so `presidio_analyzer/conf/...` resolves), or keep `.` as the context and pass `presidio-analyzer/presidio_analyzer/conf/default.yaml` (and similarly for other args).

Suggested change: replace `--build-arg NLP_CONF_FILE=presidio_analyzer/conf/default.yaml \` with `--build-arg NLP_CONF_FILE=presidio-analyzer/presidio_analyzer/conf/default.yaml \`.

The same pattern works for the other analyzer Dockerfiles, such as `Dockerfile.transformers` and `Dockerfile.stanza`.
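
Taking the two review comments together, a build that passes all three configuration files might look like the sketch below. It assumes the build context is `./presidio-analyzer` (so the relative `presidio_analyzer/conf/...` paths resolve) and that the Dockerfile exposes the `ANALYZER_CONF_FILE` and `RECOGNIZER_REGISTRY_CONF_FILE` build args named in the review. The `build_analyzer_image` function and the `DRY_RUN` switch are hypothetical helpers added for illustration, not part of Presidio.

```bash
#!/usr/bin/env bash
# Sketch only: build a custom analyzer image with all three config files.
# Assumptions: the helper name, the DRY_RUN switch, and the image tag are
# illustrative. The build-arg names come from the review comments.
set -euo pipefail

build_analyzer_image() {
  # Paths are relative to the ./presidio-analyzer build context.
  local conf_dir="presidio_analyzer/conf"
  local cmd=(
    docker build
    -f presidio-analyzer/Dockerfile
    --build-arg "NLP_CONF_FILE=${conf_dir}/default.yaml"
    --build-arg "ANALYZER_CONF_FILE=${conf_dir}/default_analyzer.yaml"
    --build-arg "RECOGNIZER_REGISTRY_CONF_FILE=${conf_dir}/default_recognizers.yaml"
    -t presidio-analyzer-custom
    ./presidio-analyzer
  )
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    # Print the command instead of running it, so it can be reviewed first.
    printf '%s ' "${cmd[@]}"
    printf '\n'
  else
    "${cmd[@]}"
  fi
}

build_analyzer_image
```

By default the script only prints the command. Setting `DRY_RUN=0` runs the build from the repository root. All three YAML files should list the same languages before building.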

Practical tips:

- Add a few languages at a time and verify the image still builds cleanly.
- Keep the `models` list in the YAML file aligned with the languages you enable.
- If you see recognizer warnings such as a language missing an NLP recognizer, make sure the recognizer registry and NLP configuration define the same language set.
- For very large language sets, split the build into smaller steps so the Docker build has less work to do at once.
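
One way to act on the first tip is to query a running container once per language after each build. The sketch below assumes the image was started with the analyzer's HTTP port mapped to `localhost:5002`, that the service exposes a `/supportedentities` endpoint taking a `language` query parameter, and that it answers with a non-2xx status for a language it does not support. The `entities_url` and `check_languages` helpers are hypothetical names used only here.

```bash
#!/usr/bin/env bash
# Sketch only: check that a running analyzer container answers for each language.
# Assumptions: the host port mapping, the endpoint behaviour for unsupported
# languages, and both helper names are illustrative.
set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:5002}"

# Build the URL that lists supported entities for one language code.
entities_url() {
  printf '%s/supportedentities?language=%s\n' "$BASE_URL" "$1"
}

# Query the service once per language; return non-zero if any request fails.
check_languages() {
  local lang failed=0
  for lang in "$@"; do
    if curl --fail --silent --show-error "$(entities_url "$lang")" > /dev/null; then
      echo "ok: $lang"
    else
      echo "FAILED: $lang" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example (requires a running container):
#   check_languages en es de
```

Running the check after each small batch of added languages makes it easier to tell which addition introduced a mismatch between the configuration files.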