Regex validators filter extracted spans to ensure they match expected patterns, improving extraction quality and reducing false positives.
```python
from gliner2 import GLiNER2, RegexValidator

extractor = GLiNER2.from_pretrained("your-model")

# Create a validator and attach it to a field
email_validator = RegexValidator(r"^[\w\.-]+@[\w\.-]+\.\w+$")
schema = (extractor.create_schema()
    .structure("contact")
        .field("email", dtype="str", validators=[email_validator])
)
```

`RegexValidator` accepts the following parameters:

- pattern: Regex pattern (string or compiled `Pattern`)
- mode: `"full"` (exact match) or `"partial"` (substring match)
- exclude: `False` (keep matches) or `True` (exclude matches)
- flags: Regex flags such as `re.IGNORECASE` (for string patterns only)
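To make the parameter semantics concrete, the following sketch mimics them with the standard `re` module. It is an illustration only, not GLiNER2's implementation: the helper `passes` is hypothetical, and it assumes that `"full"` behaves like `re.fullmatch` and `"partial"` like `re.search`.

```python
import re

def passes(span, pattern, mode="full", exclude=False, flags=0):
    """Hypothetical stand-in for a validator check (not the gliner2 API)."""
    compiled = re.compile(pattern, flags)
    # Assumption: "full" requires the whole span to match, "partial" any substring.
    matcher = compiled.fullmatch if mode == "full" else compiled.search
    matched = matcher(span) is not None
    # exclude=True inverts the decision, so matching spans are dropped.
    return not matched if exclude else matched

print(passes("12345", r"\d+"))                        # True
print(passes("order 12345", r"\d+"))                  # False
print(passes("order 12345", r"\d+", mode="partial"))  # True
print(passes("Test Phone", r"^test", mode="partial",
             exclude=True, flags=re.IGNORECASE))      # False
```

Under these assumptions, a prefix pattern such as `^test` only has an effect on longer spans when it is used in partial mode.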
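```python
# Email addresses: keep only spans that are complete email addresses
email_validator = RegexValidator(r"^[\w\.-]+@[\w\.-]+\.\w+$")
text = "Contact: john@company.com, not-an-email, jane@domain.org"
# Output: ['john@company.com', 'jane@domain.org']
```

```python
# Phone numbers (partial mode): the pattern may match anywhere in the span
phone_validator = RegexValidator(r"\(\d{3}\)\s\d{3}-\d{4}", mode="partial")
text = "Call (555) 123-4567 or 5551234567"
# Output: ['(555) 123-4567']  # Second number filtered out
```

```python
# URLs only: require an http:// or https:// prefix
url_validator = RegexValidator(r"^https?://", mode="partial")
text = "Visit https://example.com or www.site.com"
# Output: ['https://example.com']  # www.site.com filtered out
```

```python
import re

# Exclude test data: drop spans that start with "test", "demo", or "sample".
# mode="partial" is used because a prefix pattern cannot exactly match a
# longer span such as "Test Phone".
no_test_validator = RegexValidator(
    r"^(test|demo|sample)", mode="partial", exclude=True, flags=re.IGNORECASE
)
text = "Products: iPhone, Test Phone, Samsung Galaxy"
# Output: ['iPhone', 'Samsung Galaxy']  # Test Phone excluded
```

```python
# Length constraint: keep spans of 5 to 20 characters
length_validator = RegexValidator(r"^.{5,20}$")
text = "Names: Jo, Alexander, A Very Long Name That Exceeds Twenty Characters"
# Output: ['Alexander']  # Others filtered by length
```

When several validators are attached to one field, all of them must pass: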
```python
import re

username_validators = [
    RegexValidator(r"^[a-zA-Z0-9_]+$"),  # Alphanumeric + underscore
    RegexValidator(r"^.{3,20}$"),        # 3-20 characters
    # Reject names starting with "admin"; partial mode lets the prefix match longer names
    RegexValidator(r"^admin", mode="partial", exclude=True, flags=re.IGNORECASE),
]
schema = (extractor.create_schema()
    .structure("user")
        .field("username", dtype="str", validators=username_validators)
)
text = "Users: ab, john_doe, user@domain, admin, valid_user123"
# Output: ['john_doe', 'valid_user123']
```

Common patterns:

| Use Case | Pattern | Mode |
|---|---|---|
| Email | `r"^[\w\.-]+@[\w\.-]+\.\w+$"` | full |
| Phone (US) | `r"\(\d{3}\)\s\d{3}-\d{4}"` | partial |
| URL | `r"^https?://"` | partial |
| Numbers only | `r"^\d+$"` | full |
| No spaces | `r"^\S+$"` | full |
| Min length | `r"^.{5,}$"` | full |
| Alphanumeric | `r"^[a-zA-Z0-9]+$"` | full |

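One way to sanity-check the patterns in the table above before attaching them to a schema is to run them through Python's `re` module directly. The sketch below assumes that "full" corresponds to `re.fullmatch` and "partial" to `re.search`; the `check` helper and the sample strings are illustrative and not part of GLiNER2.

```python
import re

# (pattern, mode) pairs copied from the table of common patterns
PATTERNS = {
    "email": (r"^[\w\.-]+@[\w\.-]+\.\w+$", "full"),
    "phone_us": (r"\(\d{3}\)\s\d{3}-\d{4}", "partial"),
    "url": (r"^https?://", "partial"),
    "numbers_only": (r"^\d+$", "full"),
    "no_spaces": (r"^\S+$", "full"),
    "min_length": (r"^.{5,}$", "full"),
    "alphanumeric": (r"^[a-zA-Z0-9]+$", "full"),
}

def check(name, span):
    """Return True if the span satisfies the named pattern in its listed mode."""
    pattern, mode = PATTERNS[name]
    matcher = re.fullmatch if mode == "full" else re.search
    return matcher(pattern, span) is not None

samples = [
    ("email", "jane@domain.org"), ("email", "not-an-email"),
    ("phone_us", "Call (555) 123-4567"), ("phone_us", "5551234567"),
    ("url", "https://example.com"), ("url", "www.site.com"),
    ("numbers_only", "2024"), ("no_spaces", "a b"),
    ("min_length", "abcd"), ("alphanumeric", "abc_123"),
]
for name, span in samples:
    print(f"{name:13} {span!r:24} -> {check(name, span)}")
```

Running each pattern against one string that should pass and one that should fail is usually enough to catch anchoring mistakes such as a missing `$`.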
Best practices:

- Use specific patterns: more specific patterns produce fewer false positives
- Test your regex: validate patterns before deployment
- Combine validators: chain multiple simple validators
- Consider case sensitivity: use `re.IGNORECASE` when needed
- Start simple: begin with basic patterns and refine as needed

Notes on validator behavior:

- Validators run after span extraction but before formatting
- Failed validation simply excludes the span (no errors)
- Multiple validators use short-circuit evaluation (stops at first failure)
- Compiled patterns are cached automatically
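
The evaluation order described in these notes can be pictured with plain Python. This is a sketch of the idea, not GLiNER2's internals: `make_check` and `filter_spans` are hypothetical helpers, and the built-in `all()` supplies the short-circuiting by stopping at the first check that returns `False`.

```python
import re

calls = []  # records which checks actually ran, to make short-circuiting visible

def make_check(name, pattern):
    compiled = re.compile(pattern)  # compile once, reuse for every span
    def check(span):
        calls.append((name, span))
        return compiled.fullmatch(span) is not None
    return check

checks = [
    make_check("charset", r"[a-zA-Z0-9_]+"),
    make_check("length", r".{3,20}"),
]

def filter_spans(spans, checks):
    # all() over a generator stops at the first failing check;
    # a failing span is dropped silently, with no exception raised.
    return [s for s in spans if all(check(s) for check in checks)]

spans = ["ab", "john_doe", "user@domain", "valid_user123"]
kept = filter_spans(spans, checks)
print(kept)   # ['john_doe', 'valid_user123']
print(calls)  # the "length" check never runs for 'user@domain'
```

Ordering the cheapest or most selective check first means later checks are skipped for most rejected spans.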