--corrupt <level> corrupts generated values: whitespace noise, encoding errors, OCR artifacts, truncation, masking, field swaps. 15 types across 3 severity tiers.
- Usage
- Levels — low, mid, high, extreme
- Types — 15 corruptions by severity
- In config files
    seedfaker name email --corrupt mid -n 5 --seed demo

Each field is independently corrupted with probability equal to the level's rate. A corrupted field receives between 1 and N passes (corruptions stack), where N is the level's maximum.
| Level | Rate | Max passes | Types available |
|---|---|---|---|
| low | 5% | 2 | 0-4 |
| mid | 15% | 3 | 0-9 |
| high | 45% | 3 | 0-14 |
| extreme | 95% | 5 | 0-14 |
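
The level rules can be sketched in Python. This is a minimal model of the behavior described here, not seedfaker's actual implementation: the name `plan_corruption` is hypothetical, and drawing the pass count and the type index uniformly is an assumption that the documentation does not state.

```python
import random

# Level parameters copied from the level table: (rate, max passes, highest type index).
LEVELS = {
    "low": (0.05, 2, 4),
    "mid": (0.15, 3, 9),
    "high": (0.45, 3, 14),
    "extreme": (0.95, 5, 14),
}


def plan_corruption(n_fields, level, rng):
    """Return one list of corruption type indices per field (empty list = untouched)."""
    rate, max_passes, max_type = LEVELS[level]
    plan = []
    for _ in range(n_fields):
        if rng.random() >= rate:
            plan.append([])  # field stays clean
            continue
        passes = rng.randint(1, max_passes)  # assumption: uniform over 1..max
        # assumption: each pass picks a type uniformly from the level's range
        plan.append([rng.randint(0, max_type) for _ in range(passes)])
    return plan


rng = random.Random(0)
plan = plan_corruption(10_000, "mid", rng)
corrupted = sum(1 for p in plan if p)
print(f"corrupted share: {corrupted / len(plan):.3f}")  # should land near 0.15
```

With enough fields, the corrupted share converges on the level's rate, and no field ever receives more passes than the level allows.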

Types 0-4 (available from low):

| # | Type | Example |
|---|---|---|
| 0 | Extra spaces | `John   Smith` |
| 1 | Invisible characters | ZWSP, NBSP, soft hyphen |
| 2 | Unicode decomposition | é → e + combining accent |
| 3 | Merged words | JohnSmith |
| 4 | Duplication | John Smith John Smith |

Types 5-9 (added at mid):

| # | Type | Example |
|---|---|---|
| 5 | OCR substitution | J0hn 5m!th |
| 6 | Mojibake | MÃ¼ller |
| 7 | HTML entities | `O&#39;Brien` |
| 8 | Garbled suffix | john@x.comR4a |
| 9 | Field swap | phone in email column |

Types 10-14 (added at high):

| # | Type | Example |
|---|---|---|
| 10 | Empty value | (blank) |
| 11 | Truncation | John Sm |
| 12 | Star redaction | ou*****tlook.co***m |
| 13 | Partial mask | ***-**-1234 |
| 14 | X-masking | XXXXXXXXXXXX1234 |
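
To make a few of these types concrete, here is a minimal Python sketch of types 2, 6, 11, and 14. The function names are hypothetical and the code is not taken from seedfaker. In particular, using Latin-1 as the wrong decoder for mojibake and keeping the last four characters for X-masking are assumptions chosen to match the examples in the table.

```python
import unicodedata


def decompose(text):
    """Type 2: split precomposed characters into base letter + combining mark (NFD)."""
    return unicodedata.normalize("NFD", text)


def mojibake(text):
    """Type 6: encode as UTF-8, then mis-decode the bytes as Latin-1 (assumed codec)."""
    return text.encode("utf-8").decode("latin-1")


def truncate(text, keep):
    """Type 11: keep only the first `keep` characters."""
    return text[:keep]


def x_mask(text, visible=4):
    """Type 14: replace everything except the last `visible` characters with X."""
    if len(text) <= visible:
        return text
    return "X" * (len(text) - visible) + text[-visible:]


card = "4111 1111 1111 1234".replace(" ", "")  # 16-digit sample value
print(len(decompose("\u00e9")))      # 2: 'e' plus U+0301 combining acute
print(mojibake("M\u00fcller"))       # two mis-decoded characters replace the u-umlaut
print(truncate("John Smith", 7))     # John Sm
print(x_mask(card))                  # XXXXXXXXXXXX1234
```

Because a corrupted field can receive several passes, functions like these would be composed, for example `truncate(mojibake(value), 5)`.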
For NER/PII annotations with byte-offset spans and original values, see --annotated.
    options:
      corrupt: high

Corruption is deterministic: the same seed and the same level always produce the same corrupted output. Base values are generated first and corruption is applied afterwards, so the base values are identical with or without --corrupt.
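
One way to get both properties (reproducible corruption, and base values that do not change when corruption is enabled) is to draw base values and corruption decisions from separate seeded random streams. The sketch below illustrates that idea only. `generate`, the `NAMES` list, and the stream-naming scheme are hypothetical, truncation stands in for the full set of corruption types, and nothing here is claimed to mirror seedfaker's internals.

```python
import random

NAMES = ["Ada Lovelace", "Alan Turing", "Grace Hopper", "Edsger Dijkstra"]


def generate(seed, n, corrupt_rate=0.0):
    """Generate n names, then corrupt some of them in a separate second pass."""
    base_rng = random.Random(f"{seed}:base")  # drives base values only
    noise_rng = random.Random(f"{seed}:corrupt:{corrupt_rate}")  # drives corruption only
    base = [base_rng.choice(NAMES) for _ in range(n)]
    # The second pass never draws from base_rng, so base values are unaffected.
    return [v[: len(v) // 2] if noise_rng.random() < corrupt_rate else v for v in base]


clean = generate("demo", 50)
noisy = generate("demo", 50, corrupt_rate=0.45)
print(noisy == generate("demo", 50, corrupt_rate=0.45))  # True: same seed, same rate
print(all(c.startswith(x) for c, x in zip(clean, noisy)))  # True: same base values underneath
```

Keeping the two streams independent is what lets a clean run and a corrupted run line up row by row.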
- Training and evaluation datasets — noisy training data with byte-offset spans