risksyn

⚠️ This is a research prototype. Avoid using it in production, or take extra care if you do.

Synthetic data generation with interpretable privacy guarantees in terms of attack risk.

This library provides privacy-preserving synthetic tabular data generation using differential privacy, with privacy specified in terms of interpretable attack risk (advantage, TPR/FPR) rather than the standard epsilon-delta parameters.

Intended Uses

The library is designed to work in the following settings:

Low-dimensional tabular data with up to ≈32 features. This library is designed to work with the state-of-the-art Select-Measure-Generate paradigm for privacy-preserving generation of synthetic data, which is competitive with non-private generation but, in high-dimensional domains, loses utility and speed to alternatives such as privacy-preserving GANs¹.
Validity for low-degree marginals. Replacing real data with synthetic data of any kind does not, in general, preserve the validity of statistical analyses performed on the substituted data: users of synthetic data cannot know how close the results of an analysis are to those on the real data². The Select-Measure-Generate architecture, however, does preserve the validity of marginals of a given degree (queries like "how many men over 60 with diabetes are in the dataset"), which are guaranteed to be close to the original, with the closeness depending on the privacy level. At the moment, the library does not support deriving confidence intervals for these queries.
Each record corresponds to one individual. At the moment, each record must correspond to one individual for the privacy guarantee to be meaningful at the level of individuals.
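
To make the notion of a marginal query concrete, here is a minimal, dependency-free sketch that counts a 3-way marginal and compares two datasets on it. The records, the column names, and the helpers `marginal` and `marginal_tv_distance` are made up for illustration and are not part of risksyn.

```python
from collections import Counter


def marginal(records, columns):
    """Count how often each combination of values occurs in `columns`."""
    return Counter(tuple(record[c] for c in columns) for record in records)


def marginal_tv_distance(real, synthetic, columns):
    """Total variation distance between the normalized marginals of two datasets."""
    p, q = marginal(real, columns), marginal(synthetic, columns)
    n_p, n_q = sum(p.values()), sum(q.values())
    cells = set(p) | set(q)
    # Counter returns 0 for cells that are missing from one of the datasets.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in cells)


def make_record(sex, over_60, diabetes):
    return {"sex": sex, "over_60": over_60, "diabetes": diabetes}


# Toy data, invented for this example.
real = [
    make_record("M", True, True),
    make_record("M", True, True),
    make_record("M", True, False),
    make_record("F", True, True),
    make_record("F", False, False),
]
synthetic = [
    make_record("M", True, True),
    make_record("M", True, False),
    make_record("M", True, False),
    make_record("F", True, True),
    make_record("F", False, False),
]

columns = ("sex", "over_60", "diabetes")

# "How many men over 60 with diabetes are in the dataset?"
men_over_60_with_diabetes = marginal(real, columns)[("M", True, True)]

# How far the synthetic 3-way marginal is from the real one (0 means identical).
distance = marginal_tv_distance(real, synthetic, columns)
```

A check like this only makes sense for marginals up to the degree the generator was asked to preserve; higher-degree queries carry no such closeness guarantee.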

Installation

Currently, you need to get a copy of the repo, e.g., with git clone, and install locally:

pip install -e .

For local development:

uv sync --dev

Quickstart

There are two ways to use the library:

Basic. Generate synthetic data using AIM³, one of the state-of-the-art algorithms for generating privacy-preserving synthetic tabular data. Unlike other implementations of this algorithm, this library lets you specify the privacy requirement in terms of interpretable attack risk.
Customized. Calibrate the noise parameters to a given level of attack risk, then pass them as input to dpmm library calls for precise control over the generation pipeline.

Specifying Target Risk

The main feature of this library is the ability to specify the target level of privacy risk in terms of interpretable attack success rates instead of the classical approach of using epsilon-delta parameters. This is done using the Risk class. We detail and explain all of the options in the risk specification and modeling guide.

from risksyn import Risk

# Membership inference attack error rates
risk = Risk.from_err_rates(tpr=0.6, fpr=0.1)  # Max 60% TPR at 10% FPR

# Inference attack advantage
risk = Risk.from_advantage(0.2)  # 20% max advantage

# Inference attack advantage at a given baseline
risk = Risk.from_advantage_at_baseline(advantage=0.2, baseline=0.5)

# Inference attack success rate at a given baseline
risk = Risk.from_success_at_baseline(success=0.7, baseline=0.5)

# Combine multiple risk requirements (takes the most restrictive)
risk = Risk.from_advantage(0.01) | Risk.from_advantage_at_baseline(0.05, 0.001)

# For advanced users: Check the converted zCDP value
print(f"zCDP rho: {risk.zcdp:.4f}")
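
For intuition about how such risk targets relate to a privacy parameter, the sketch below works through the textbook case of a single Gaussian mechanism, characterized by a Gaussian DP parameter mu. It assumes that "advantage" means TPR minus FPR, and the function names are hypothetical. This is not necessarily how risksyn performs its conversion; it only illustrates why several requirements can be reduced to the most restrictive one.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for its CDF and inverse CDF


def max_advantage(mu: float) -> float:
    """Largest TPR - FPR any attack can reach against a mu-GDP mechanism."""
    return 2.0 * _STD_NORMAL.cdf(mu / 2.0) - 1.0


def max_tpr_at_fpr(mu: float, fpr: float) -> float:
    """Largest TPR any attack can reach at the given FPR under mu-GDP."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(fpr) + mu)


def mu_from_advantage(advantage: float) -> float:
    """Inverse of max_advantage: the mu that allows exactly this advantage."""
    return 2.0 * _STD_NORMAL.inv_cdf((1.0 + advantage) / 2.0)


def mu_from_err_rates(tpr: float, fpr: float) -> float:
    """Inverse of max_tpr_at_fpr: the mu that allows exactly this TPR at this FPR."""
    return _STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr)


# Two requirements, mirroring the targets used in the examples above.
mu_adv = mu_from_advantage(0.2)
mu_err = mu_from_err_rates(tpr=0.6, fpr=0.1)

# A smaller mu means more noise and less risk, so the most restrictive
# requirement is the one with the smallest mu.
mu = min(mu_adv, mu_err)

# A Gaussian mechanism that is mu-GDP satisfies rho-zCDP with rho = mu^2 / 2.
rho = mu**2 / 2.0
```

In this toy setting the advantage target is the binding one, so satisfying it also satisfies the TPR/FPR target.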

Basic Usage

Here is an example of the basic usage. We generate synthetic data that preserves all 3-way marginals well and ensures that the maximum additive advantage of any inference attack that tries to learn information about the real records from the synthetic data is at most 20 percentage points:

import pandas as pd
from risksyn import Risk, AIMGenerator

# Sample data
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
    "city": ["NYC", "LA", "NYC", "SF", "LA"],
})

# Specify bounds for numeric columns and categories for categorical columns
domain = {
    "age": {"lower": 18, "upper": 100},
    "income": {"lower": 0, "upper": 500000},
    "city": {"categories": ["NYC", "LA", "SF"]},
}

# Create generator with risk specification
risk = Risk.from_advantage(0.2)  # At most 20 percentage points of advantage
gen = AIMGenerator(risk=risk, degree=3)
gen.fit(df, domain=domain)

# Generate synthetic records
synthetic_df = gen.generate(count=100)
print(synthetic_df.head())

We detail the options and parameters in the generation guide.

Customized Generation Pipelines with Calibration Utilities

For direct control over the dpmm generation pipeline, we provide calibrate_parameters_to_risk, which converts a Risk object into intermediate calibrated noise parameters. If numeric columns need private domain estimation, pass proc_epsilon to reserve part of the budget for preprocessing; otherwise the privacy guarantee may not hold. See the generation guide for details.

from risksyn import Risk, calibrate_parameters_to_risk
from dpmm.pipelines import AIMPipeline

risk = Risk.from_advantage(0.2)
params = calibrate_parameters_to_risk(risk, proc_epsilon=0.1)

# Verbose for illustrative purposes:
pipeline = AIMPipeline(
    epsilon=params["epsilon"],
    delta=params["delta"],
    gen_kwargs={"degree": 3},
)
# Simpler usage:
pipeline = AIMPipeline(
    gen_kwargs={"degree": 3},
    **calibrate_parameters_to_risk(risk)
)

# Reuses df and domain from the Basic Usage example
pipeline.fit(df, domain)
synthetic_df = pipeline.generate(n_records=10)
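
To show how a zCDP value such as risk.zcdp can relate to the epsilon and delta that a pipeline consumes, here is a sketch of the standard conversion from rho-zCDP to (epsilon, delta)-DP due to Bun and Steinke. The function names are hypothetical, and calibrate_parameters_to_risk may well use a different or tighter conversion, so treat this as background rather than a description of the library.

```python
import math


def zcdp_to_epsilon(rho: float, delta: float) -> float:
    """Epsilon such that rho-zCDP implies (epsilon, delta)-DP.

    Standard bound: epsilon = rho + 2 * sqrt(rho * ln(1 / delta)).
    """
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def epsilon_to_zcdp(epsilon: float, delta: float) -> float:
    """Inverse of zcdp_to_epsilon: the rho that maps to this epsilon."""
    log_term = math.log(1.0 / delta)
    # Solve rho + 2 * sqrt(rho * L) = epsilon for sqrt(rho), with L = ln(1 / delta).
    sqrt_rho = math.sqrt(log_term + epsilon) - math.sqrt(log_term)
    return sqrt_rho**2


# Illustrative values only, not defaults of any library.
rho = 0.1
delta = 1e-5
epsilon = zcdp_to_epsilon(rho, delta)
```

Because the bound grows with ln(1/delta), the same rho maps to a larger epsilon as delta shrinks.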

¹ Graphical vs. Deep Generative Models: Measuring the Impact of DP on Utility. ACM CCS 2024.
² Should I use Synthetic Data for That? 2025.
³ AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data. VLDB 2022.
