Feature Description
Add support for the OpenAI Privacy Filter model: https://huggingface.co/openai/privacy-filter
Use Case
PII Detection. Privacy Filter predicts spans across eight categories:
- private_person
- private_address
- private_email
- private_phone
- private_url
- private_date
- account_number
- secret
Proposed Solution
Privacy Filter is a small model with frontier-level personal data detection capability. It is designed for high-throughput privacy workflows and performs context-aware detection of PII in unstructured text. It can run locally, so PII can be masked or redacted without leaving your machine. It processes long inputs efficiently, making redaction decisions in a single fast pass.
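To make the masking step concrete, here is a minimal sketch of how predicted spans could be turned into redacted text. It does not call the model: the `Span` type and the `redact` function are hypothetical names for illustration, and the assumption is that the model output can be converted to character offsets plus one of the eight labels listed above.

```python
from dataclasses import dataclass

# The eight categories listed in this issue.
LABELS = frozenset({
    "private_person", "private_address", "private_email", "private_phone",
    "private_url", "private_date", "account_number", "secret",
})


@dataclass(frozen=True)
class Span:
    """Hypothetical container for one predicted PII span (character offsets)."""
    start: int
    end: int
    label: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with a [LABEL] placeholder, left to right."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.label not in LABELS:
            raise ValueError(f"unknown label: {span.label}")
        if span.start < cursor:
            continue  # skip a span that overlaps one already redacted
        pieces.append(text[cursor:span.start])
        pieces.append(f"[{span.label.upper()}]")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    text = "Email jane@example.com or call 555-0100."
    email_at = text.index("jane@example.com")
    phone_at = text.index("555-0100")
    spans = [
        Span(email_at, email_at + len("jane@example.com"), "private_email"),
        Span(phone_at, phone_at + len("555-0100"), "private_phone"),
    ]
    print(redact(text, spans))
```

Keeping redaction separate from detection would let the same masking code work with either the existing regex detector or the new model.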
Alternatives Considered
The results are much better than those of the existing regex-based solution. On the PII-Masking-300k benchmark, Privacy Filter achieves an F1 score of 96% (94.04% precision and 98.04% recall). On a corrected version of the benchmark that accounts for dataset annotation issues identified during review, the F1 score is 97.43% (96.79% precision and 98.08% recall).
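As a quick sanity check, the quoted F1 scores can be reproduced from the precision and recall figures using the standard harmonic-mean formula. The function name `f1_score` below is just an illustrative helper, not part of any existing API in this project.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted above for PII-Masking-300k.
original = f1_score(0.9404, 0.9804)
corrected = f1_score(0.9679, 0.9808)

print(f"original benchmark:  {original:.2%}")
print(f"corrected benchmark: {corrected:.2%}")
```

Both values agree with the reported 96% and 97.43% after rounding.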
Example Usage
# How you imagine using this feature
from localmod.classifiers.pii import PIIDetector
detector = PIIDetector()
# ...
Additional Context
None