Introduce Symbolic Constraint Solver for SQL-Driven Data Generation by wmoustafa · Pull Request #564 · linkedin/coral

wmoustafa · 2025-12-07T07:14:24Z

Introduce Symbolic Constraint Solver for SQL-Driven Data Generation

Overview

This PR introduces coral-data-generation, a symbolic constraint solver that inverts SQL expressions to derive input domain constraints. Instead of forward evaluation (generate → test → reject), it solves backward from predicates to derive what inputs must satisfy, enabling efficient test data generation with guaranteed constraint satisfaction.

Motivation

Problem: Traditional test data generation uses rejection sampling—generate random values, evaluate SQL predicates, discard mismatches. This is inefficient for complex nested expressions and cannot detect unsatisfiable queries.

Solution: Symbolic inversion treats SQL expressions as mathematical transformations with inverse functions. Starting from output constraints (e.g., = '50'), the system walks expression trees inward, applying inverse operations to derive input domains.

Examples

1. Nested String Operations

WHERE LOWER(SUBSTRING(name, 1, 3)) = 'abc'
→ name ∈ RegexDomain("^[aA][bB][cC].*$")
Generates: "Abc", "ABC123", "abcdef"

2. Cross-Domain Arithmetic

WHERE CAST(age * 2 AS STRING) = '50'
→ age ∈ IntegerDomain([25])

3. Date Extraction with Type Casting

WHERE SUBSTRING(CAST(birthdate AS STRING), 1, 4) = '2000'
→ birthdate ∈ DateDomain ∩ RegexDomain("^2000-.*$")
Generates: 2000-01-15, 2000-12-31, 2000-06-20

4. Complex Nested Substring

WHERE SUBSTRING(SUBSTRING(product_code, 5, 10), 1, 3) = 'XYZ'
→ product_code must have 'XYZ' at positions 5-7
→ product_code ∈ RegexDomain("^.{4}XYZ.*$")

5. Contradiction Detection

WHERE SUBSTRING(name, 1, 4) = '2000' AND SUBSTRING(name, 1, 4) = '1999'
→ Empty domain (unsatisfiable - no data generated)

6. Date String Pattern Matching

WHERE CAST(order_date AS STRING) LIKE '2024-12-%'
→ order_date ∈ RegexDomain("^2024-12-.*$") ∩ DateFormatConstraint
Generates: 2024-12-01, 2024-12-15, 2024-12-31

Key Components

1. Domain System

Domain<T, D>: Abstract constraint representation supporting intersection, union, emptiness checking
RegexDomain: Automaton-backed string constraints (powered by dk.brics.automaton)
IntegerDomain: Interval-based numeric constraints with arithmetic closure
Cross-domain conversions: CastRegexTransformer bridges string ↔ numeric types
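To make the interval-based representation concrete, here is a minimal sketch of how a multi-interval domain could be normalized by merging overlapping and adjacent intervals. The class name IntervalMergeSketch and the long[]{min, max} encoding are assumptions made for illustration; they are not the PR's IntegerDomain API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; each interval is a closed range encoded as long[]{min, max}. */
final class IntervalMergeSketch {
  private IntervalMergeSketch() {}

  /** Returns sorted, pairwise disjoint, non-adjacent intervals covering the same values. */
  static List<long[]> normalize(List<long[]> intervals) {
    List<long[]> sorted = new ArrayList<>(intervals);
    sorted.sort(Comparator.comparingLong((long[] iv) -> iv[0]));
    List<long[]> merged = new ArrayList<>();
    for (long[] iv : sorted) {
      long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      // The MAX_VALUE check keeps last[1] + 1 from wrapping around.
      if (last != null && (last[1] == Long.MAX_VALUE || iv[0] <= last[1] + 1)) {
        last[1] = Math.max(last[1], iv[1]); // overlapping or adjacent: extend in place
      } else {
        merged.add(new long[] {iv[0], iv[1]}); // copy so callers' arrays stay untouched
      }
    }
    return merged;
  }
}
```

Under these assumptions, [1, 10], [5, 15], [20, 30] collapses to [1, 15] ∪ [20, 30], and [1, 10], [11, 20], [30, 40] collapses to [1, 20] ∪ [30, 40].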

2. Transformer Architecture

Pluggable symbolic inversion functions implementing DomainTransformer:

SubstringRegexTransformer: Inverts SUBSTRING(x, start, len) with positional constraints
LowerRegexTransformer: Inverts LOWER(x) via case-insensitive regex generation
CastRegexTransformer: Cross-domain CAST inversion (string ↔ integer ↔ date)
PlusRegexTransformer: Arithmetic inversion: x + c = value → x = value - c
TimesRegexTransformer: Multiplication inversion: x * c = value → x = value / c
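As a rough illustration of what the LOWER and SUBSTRING inversions listed above compute for literal constraints, the sketch below builds the input regex as a plain string and targets java.util.regex. RegexInversionSketch, invertLower, and invertSubstring are hypothetical names; the PR's transformers operate on RegexDomain and automata, and this sketch ignores the SUBSTRING length argument.

```java
import java.util.regex.Pattern;

/** Hypothetical string-level analog of the LOWER and SUBSTRING inversions. */
final class RegexInversionSketch {
  private RegexInversionSketch() {}

  /** LOWER(x) = literal: every letter of the literal may appear in either case in x. */
  static String invertLower(String literal) {
    StringBuilder sb = new StringBuilder();
    for (char ch : literal.toCharArray()) {
      if (Character.isUpperCase(ch)) {
        throw new IllegalArgumentException("LOWER(x) can never produce '" + ch + "'");
      }
      if (Character.isLowerCase(ch)) {
        sb.append('[').append(ch).append(Character.toUpperCase(ch)).append(']');
      } else {
        sb.append(Pattern.quote(String.valueOf(ch))); // digits, punctuation: match as-is
      }
    }
    return sb.toString();
  }

  /** SUBSTRING(x, start, len) matches body: skip start - 1 characters, then body, then anything. */
  static String invertSubstring(String bodyRegex, int start) {
    String skip = start > 1 ? ".{" + (start - 1) + "}" : "";
    return "^" + skip + bodyRegex + ".*$";
  }
}
```

For example, invertSubstring(invertLower("abc"), 1) yields ^[aA][bB][cC].*$, and invertSubstring("XYZ", 5) yields ^.{4}XYZ.*$, matching the patterns shown in Examples 1 and 4.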

3. Relational Preprocessing

Normalizes Calcite RelNode trees for symbolic analysis:

ProjectPullUpController: Fixed-point projection normalization
CanonicalPredicateExtractor: Extracts predicates with global field indexing
DnfRewriter: Converts to Disjunctive Normal Form for independent disjunct solving

4. Solver

DomainInferenceProgram: Top-down expression tree traversal with domain refinement at each step, detecting contradictions via empty domain intersection.

Technical Approach

Symbolic Inversion: For nested expression f(g(h(x))) = constant:

Create output domain from constant
Apply f⁻¹ → intermediate domain
Apply g⁻¹ → refined domain
Apply h⁻¹ → input constraint on x

Contradiction Detection: Multiple predicates on same variable → domain intersection. Empty result = unsatisfiable query.
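The inversion steps and the empty-intersection check can be sketched on a single closed interval. IntervalSketch is a hypothetical type written for this illustration (it is not IntegerDomain), and it only handles positive multipliers.

```java
/** Hypothetical closed integer interval [min, max]; empty when min > max. */
final class IntervalSketch {
  final long min;
  final long max;

  IntervalSketch(long min, long max) {
    this.min = min;
    this.max = max;
  }

  boolean isEmpty() {
    return min > max;
  }

  /** Inverse of x + c: if x + c lies in this interval, x lies in [min - c, max - c]. */
  IntervalSketch invertPlus(long c) {
    return new IntervalSketch(Math.subtractExact(min, c), Math.subtractExact(max, c));
  }

  /** Inverse of x * c for c > 0: keep the integers x with min <= x * c <= max. */
  IntervalSketch invertTimes(long c) {
    if (c <= 0) {
      throw new IllegalArgumentException("sketch handles positive constants only");
    }
    long ceilMin = Math.floorDiv(min, c) + (Math.floorMod(min, c) == 0 ? 0 : 1);
    return new IntervalSketch(ceilMin, Math.floorDiv(max, c));
  }

  /** Conjunction of two predicates on the same variable. */
  IntervalSketch intersect(IntervalSketch other) {
    return new IntervalSketch(Math.max(min, other.min), Math.min(max, other.max));
  }
}
```

Solving age * 2 + 5 = 25 then reads outermost operation first: new IntervalSketch(25, 25).invertPlus(5).invertTimes(2) gives [10, 10]. Starting from 26 instead gives an empty interval, since 2 * age = 21 has no integer solution, and intersecting [1, 10] with [20, 30] is empty as well, which is how an unsatisfiable conjunction would surface.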

Extensibility: Architecture supports multi-table inference (join propagation), fixed-point iteration (recursive constraints), and arbitrary domain types (date, decimal, enum).

Testing

Integration Tests (RegexDomainInferenceProgramTest): 14+ test scenarios covering simple/nested transformations, cross-domain CAST operations, arithmetic inversion, and contradiction detection. All tests validate generated samples satisfy original SQL predicates.

Documentation

This module comes with a comprehensive README covering the conceptual model, examples, and an API reference.

Future Extensibility

The architecture naturally extends to additional domains (DecimalDomain, DateDomain), more transformers (CONCAT, REGEXP_EXTRACT), multi-table inference (join constraint propagation), and aggregate support (cardinality constraints).

simbadzina

Halfway through the README.md. Will continue reading and then proceed to the code.

How does the system handle, in general, expressions where the values depend on each other?
E.g.,
SELECT * FROM test.suitcase WHERE width + height + length < 25

Does this need a new domain type?

simbadzina · 2026-01-15T05:25:27Z

+/**
+ * Copyright 2025 LinkedIn Corporation. All rights reserved.
+ * Licensed under the BSD-2 Clause license.
+ * See LICENSE in the project root for license information.
+ */
+package com.linkedin.coral.datagen.domain;
+
+import java.util.Arrays;
+import java.util.List;
+
+import org.testng.annotations.Test;
+
+
+/**
+ * Tests for IntegerDomain class.
+ */
+public class IntegerDomainTest {
+
+  @Test
+  public void testSingleValue() {
+    System.out.println("\n=== Single Value Test ===");
+    IntegerDomain domain = IntegerDomain.of(42);
+    System.out.println("Domain: " + domain);
+    System.out.println("Is empty: " + domain.isEmpty());
+    System.out.println("Contains 42: " + domain.contains(42));
+    System.out.println("Contains 43: " + domain.contains(43));
+    System.out.println("Samples: " + domain.sampleValues(5));
+  }
+
+  @Test
+  public void testSingleInterval() {
+    System.out.println("\n=== Single Interval Test ===");
+    IntegerDomain domain = IntegerDomain.of(10, 20);
+    System.out.println("Domain: " + domain);
+    System.out.println("Contains 10: " + domain.contains(10));
+    System.out.println("Contains 15: " + domain.contains(15));
+    System.out.println("Contains 20: " + domain.contains(20));
+    System.out.println("Contains 21: " + domain.contains(21));
+    System.out.println("Samples: " + domain.sampleValues(5));
+  }
+
+  @Test
+  public void testMultipleIntervals() {
+    System.out.println("\n=== Multiple Intervals Test ===");
+    List<IntegerDomain.Interval> intervals = Arrays.asList(new IntegerDomain.Interval(1, 5),
+        new IntegerDomain.Interval(10, 15), new IntegerDomain.Interval(20, 30));
+    IntegerDomain domain = IntegerDomain.of(intervals);
+    System.out.println("Domain: " + domain);
+    System.out.println("Contains 3: " + domain.contains(3));
+    System.out.println("Contains 7: " + domain.contains(7));
+    System.out.println("Contains 12: " + domain.contains(12));
+    System.out.println("Contains 25: " + domain.contains(25));
+    System.out.println("Samples: " + domain.sampleValues(10));
+  }
+
+  @Test
+  public void testIntersection() {
+    System.out.println("\n=== Intersection Test ===");
+    IntegerDomain domain1 = IntegerDomain.of(1, 20);
+    IntegerDomain domain2 = IntegerDomain.of(10, 30);
+    IntegerDomain intersection = domain1.intersect(domain2);
+    System.out.println("Domain 1: " + domain1);
+    System.out.println("Domain 2: " + domain2);
+    System.out.println("Intersection: " + intersection);
+    System.out.println("Samples: " + intersection.sampleValues(5));
+  }
+
+  @Test
+  public void testUnion() {
+    System.out.println("\n=== Union Test ===");
+    IntegerDomain domain1 = IntegerDomain.of(1, 10);
+    IntegerDomain domain2 = IntegerDomain.of(20, 30);
+    IntegerDomain union = domain1.union(domain2);
+    System.out.println("Domain 1: " + domain1);
+    System.out.println("Domain 2: " + domain2);
+    System.out.println("Union: " + union);
+    System.out.println("Samples: " + union.sampleValues(10));
+  }
+
+  @Test
+  public void testAddConstant() {
+    System.out.println("\n=== Add Constant Test ===");
+    IntegerDomain domain = IntegerDomain.of(10, 20);
+    IntegerDomain shifted = domain.add(5);
+    System.out.println("Original domain: " + domain);
+    System.out.println("After adding 5: " + shifted);
+    System.out.println("Samples: " + shifted.sampleValues(5));
+  }
+
+  @Test
+  public void testMultiplyConstant() {
+    System.out.println("\n=== Multiply Constant Test ===");
+    IntegerDomain domain = IntegerDomain.of(10, 20);
+    IntegerDomain scaled = domain.multiply(2);
+    System.out.println("Original domain: " + domain);
+    System.out.println("After multiplying by 2: " + scaled);
+    System.out.println("Samples: " + scaled.sampleValues(5));
+  }
+
+  @Test
+  public void testNegativeMultiply() {
+    System.out.println("\n=== Negative Multiply Test ===");
+    IntegerDomain domain = IntegerDomain.of(10, 20);
+    IntegerDomain scaled = domain.multiply(-1);
+    System.out.println("Original domain: " + domain);
+    System.out.println("After multiplying by -1: " + scaled);
+    System.out.println("Samples: " + scaled.sampleValues(5));
+  }
+
+  @Test
+  public void testOverlappingIntervalsMerge() {
+    System.out.println("\n=== Overlapping Intervals Merge Test ===");
+    List<IntegerDomain.Interval> intervals = Arrays.asList(new IntegerDomain.Interval(1, 10),
+        new IntegerDomain.Interval(5, 15), new IntegerDomain.Interval(20, 30));
+    IntegerDomain domain = IntegerDomain.of(intervals);
+    System.out.println("Input intervals: [1, 10], [5, 15], [20, 30]");
+    System.out.println("Merged domain: " + domain);
+    System.out.println("Samples: " + domain.sampleValues(10));
+  }
+
+  @Test
+  public void testAdjacentIntervalsMerge() {
+    System.out.println("\n=== Adjacent Intervals Merge Test ===");
+    List<IntegerDomain.Interval> intervals = Arrays.asList(new IntegerDomain.Interval(1, 10),
+        new IntegerDomain.Interval(11, 20), new IntegerDomain.Interval(30, 40));
+    IntegerDomain domain = IntegerDomain.of(intervals);
+    System.out.println("Input intervals: [1, 10], [11, 20], [30, 40]");
+    System.out.println("Merged domain: " + domain);
+    System.out.println("Samples: " + domain.sampleValues(10));
+  }
+
+  @Test
+  public void testEmptyDomain() {
+    System.out.println("\n=== Empty Domain Test ===");
+    IntegerDomain empty = IntegerDomain.empty();
+    System.out.println("Empty domain: " + empty);
+    System.out.println("Is empty: " + empty.isEmpty());
+    System.out.println("Samples: " + empty.sampleValues(5));
+  }
+
+  @Test
+  public void testIntersectionEmpty() {
+    System.out.println("\n=== Intersection Empty Test ===");
+    IntegerDomain domain1 = IntegerDomain.of(1, 10);
+    IntegerDomain domain2 = IntegerDomain.of(20, 30);
+    IntegerDomain intersection = domain1.intersect(domain2);
+    System.out.println("Domain 1: " + domain1);
+    System.out.println("Domain 2: " + domain2);
+    System.out.println("Intersection: " + intersection);
+    System.out.println("Is empty: " + intersection.isEmpty());
+  }
+
+  @Test
+  public void testComplexArithmetic() {
+    System.out.println("\n=== Complex Arithmetic Test ===");
+    // Solve: 2*x + 5 = 25, where x in [0, 100]
+    // => 2*x = 20
+    // => x = 10
+    IntegerDomain output = IntegerDomain.of(25);
+    IntegerDomain afterSubtract = output.add(-5); // x = 20
+    IntegerDomain solution = afterSubtract.multiply(1).intersect(IntegerDomain.of(0, 100));
+
+    System.out.println("Equation: 2*x + 5 = 25");
+    System.out.println("Output domain: " + output);
+    System.out.println("After subtracting 5: " + afterSubtract);
+    System.out.println("Solution (x must be in [0, 100]): " + solution);
+
+    // Verify
+    if (!solution.isEmpty()) {
+      long x = solution.sampleValues(1).get(0);
+      System.out.println("Sample x: " + x);
+      System.out.println("Verification: 2*" + x + " + 5 = " + (2 * x + 5));
+    }
+  }
+
+  @Test
+  public void testMultiIntervalIntersection() {
+    System.out.println("\n=== Multi-Interval Intersection Test ===");
+    List<IntegerDomain.Interval> intervals1 =
+        Arrays.asList(new IntegerDomain.Interval(1, 20), new IntegerDomain.Interval(30, 50));
+    List<IntegerDomain.Interval> intervals2 =
+        Arrays.asList(new IntegerDomain.Interval(10, 35), new IntegerDomain.Interval(45, 60));
+
+    IntegerDomain domain1 = IntegerDomain.of(intervals1);
+    IntegerDomain domain2 = IntegerDomain.of(intervals2);
+    IntegerDomain intersection = domain1.intersect(domain2);
+
+    System.out.println("Domain 1: " + domain1);
+    System.out.println("Domain 2: " + domain2);
+    System.out.println("Intersection: " + intersection);
+    System.out.println("Expected: [10, 20] ∪ [30, 35] ∪ [45, 50]");
+    System.out.println("Samples: " + intersection.sampleValues(15));
+  }
+}


These tests don't have assertions. Some other files have tests like these too.

simbadzina · 2026-01-15T05:43:04Z

+  @Test
+  public void testArithmeticExpression() {
+    testDomainInference("Arithmetic Expression Test", "SELECT * FROM test.T WHERE age * 2 + 5 = 25", inputDomain -> {
+      assertTrue(inputDomain instanceof IntegerDomain, "Should be IntegerDomain");


Even when there is an error, a new test I'm adding still passes:

@Test
public void testMultiVariateArithmeticExpression() {
  testDomainInference("Arithmetic Expression Test",
      "SELECT * FROM test.suitcase WHERE width + height + length < 25", inputDomain -> {
        assertTrue(inputDomain instanceof IntegerDomain, "Should be IntegerDomain");
        IntegerDomain intDomain = (IntegerDomain) inputDomain;
        System.out.println(intDomain);
        assertTrue(intDomain.contains(10), "Should contain 10 (since 10 * 2 + 5 = 25)");
        assertTrue(intDomain.contains(10), "Should contain 10 (since 10 * 2 + 5 = 25)");
        assertTrue(intDomain.isSingleton(), "Should be singleton");
      });
}

Good catch. The old testDomainInference helper used if guards (if (disjunct instanceof RexCall), if (operator == EQUALS)) that would silently skip the assertion lambda when the structure didn't match, making any test pass vacuously.

I've refactored the helper to replace those if guards with hard assertions (assertTrue(..., "disjunct should be a RexCall"), assertEquals(..., EQUALS, "operator should be EQUALS")), so a test like your testMultiVariateArithmeticExpression example would now fail explicitly at the operator check instead of passing silently.
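The difference between the two helper styles can be shown without any test framework. This is a minimal sketch with hypothetical names (VacuousTestSketch, guarded, strict); the real helper inspects RexCall nodes and operators rather than plain objects.

```java
import java.util.function.Consumer;

/** Hypothetical illustration of a vacuously passing guard versus a hard assertion. */
final class VacuousTestSketch {
  private VacuousTestSketch() {}

  /** Old style: the check is silently skipped when the shape does not match. */
  static void guarded(Object node, Consumer<Integer> check) {
    if (node instanceof Integer) {
      check.accept((Integer) node);
    }
  }

  /** New style: a shape mismatch is itself reported as a failure. */
  static void strict(Object node, Consumer<Integer> check) {
    if (!(node instanceof Integer)) {
      throw new AssertionError("node should be an Integer but was: " + node);
    }
    check.accept((Integer) node);
  }
}
```

With a mismatched input, guarded(...) returns normally even if the lambda would always fail, while strict(...) throws before the lambda runs.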

wmoustafa · 2026-03-08T20:33:08Z

Halfway through the README.md. Will continue reading and then proceed to the code.

How does the system handle, in general, expressions where the values depend on each other? E.g., SELECT * FROM test.suitcase WHERE width + height + length < 25

Does this need a new domain type?

Thanks for the review. This sounds like a type of "domain propagation", which is used for joins (e.g., when we resolve one variable, we resolve the other based on the relationship between them). However, this case is a bit more complex than a join because all expressions mutually depend on each other, and there is no obvious expression to start from and propagate to the rest. We will tackle this separately.

simbadzina · 2026-04-06T14:30:18Z

+ * - Single interval: [10, 20]
+ * - Multiple intervals: [1, 5] ∪ [10, 15] ∪ [20, 30]
+ */
+public class IntegerDomain extends Domain<Long, IntegerDomain> {


IntegerDomain.class and IntegerDomain$Interval.class have also been committed; they need to be removed.

Good catch. Thanks. Removed.

simbadzina · 2026-04-06T15:07:38Z

+      if (max == Long.MAX_VALUE || min == Long.MIN_VALUE) {
+        return Long.MAX_VALUE; // Unbounded
+      }
+      return max - min + 1;


Potential overflow: max - min + 1 can overflow for large intervals where neither bound is exactly Long.MIN_VALUE/Long.MAX_VALUE. For example, Interval(Long.MIN_VALUE + 1, Long.MAX_VALUE - 1) bypasses both guards but the arithmetic overflows.

Now throws exception and handles gracefully.
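One way to make the size computation overflow-safe is exact arithmetic with saturation. The sketch below is an assumption about how such a fix could look (the thread does not show the final code), and IntervalSizeSketch is a hypothetical name.

```java
/** Hypothetical overflow-safe size of the closed interval [min, max]. */
final class IntervalSizeSketch {
  private IntervalSizeSketch() {}

  /** Number of values in [min, max], saturating at Long.MAX_VALUE instead of wrapping. */
  static long size(long min, long max) {
    if (min > max) {
      return 0;
    }
    try {
      return Math.addExact(Math.subtractExact(max, min), 1L);
    } catch (ArithmeticException overflow) {
      return Long.MAX_VALUE; // too many values to count in a long
    }
  }
}
```

With this shape, Interval(Long.MIN_VALUE + 1, Long.MAX_VALUE - 1) reports Long.MAX_VALUE rather than a wrapped value, and no special-casing of the exact bounds is needed.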

simbadzina · 2026-04-06T15:07:38Z

+    }
+
+    public boolean isAdjacent(Interval other) {
+      return this.max + 1 == other.min || other.max + 1 == this.min;


Potential overflow: this.max + 1 overflows when max == Long.MAX_VALUE (wraps to Long.MIN_VALUE), which could cause two non-adjacent intervals to incorrectly appear adjacent and get merged during normalization.

Rearranged.
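A possible rearrangement guards the increment so it can never wrap. This is a sketch under the assumption that intervals are closed long ranges; AdjacencySketch and its parameter layout are hypothetical and not the PR's Interval class.

```java
/** Hypothetical overflow-safe adjacency check for closed long intervals. */
final class AdjacencySketch {
  private AdjacencySketch() {}

  /** True when one interval ends exactly one below where the other begins. */
  static boolean isAdjacent(long min1, long max1, long min2, long max2) {
    return follows(max1, min2) || follows(max2, min1);
  }

  /** max + 1 == nextMin, evaluated only when max + 1 cannot overflow. */
  private static boolean follows(long max, long nextMin) {
    return max != Long.MAX_VALUE && max + 1 == nextMin;
  }
}
```

With the guard in place, an interval ending at Long.MAX_VALUE is never reported as adjacent to one starting at Long.MIN_VALUE.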

simbadzina · 2026-04-06T15:07:38Z

+      { ':', '@' }, // colon to @ (common punctuation)
+      { '[', '`' }, // [ to backtick
+      { '{', '~' } // { to tilde
+  };


Nit: ALPHABET_RANGES is hard to read — requires knowledge of ASCII table gaps. Consider deriving the alphabet programmatically from the printable ASCII range:

private static final String ALPHABET;
static {
  char printableStart = ' '; // 0x20 — first printable ASCII character
  char printableEnd = '~'; // 0x7E — last printable ASCII character
  StringBuilder sb = new StringBuilder();
  for (char c = printableStart; c <= printableEnd; c++) {
    sb.append(c);
  }
  ALPHABET = sb.toString();
}

This covers the same set of characters and makes the intent explicit.

enumerateAllowedChars can then just refer to the above.

I would keep it simple for now.

simbadzina · 2026-04-06T15:16:57Z

+   */
+  private RegexDomain(Automaton automaton) {
+    this.regex = automaton.toString();
+    this.automaton = automaton;


dk.brics Automaton is mutable — methods like determinize() and reduce() can modify internal state. Since intersection() and union() return new instances this is low-risk today, but as a defensive measure consider cloning:

private RegexDomain(Automaton automaton) {
  this.automaton = automaton.clone();
  this.regex = this.automaton.toString();
}

I would keep it simple since the above pattern does not exist.

simbadzina · 2026-04-06T15:16:58Z

+   * Creates a RegexDomain from an existing automaton.
+   */
+  private RegexDomain(Automaton automaton) {
+    this.regex = automaton.toString();


automaton.toString() returns a debug representation of states/transitions, not a valid regex pattern. This means this.regex here is not equivalent to the regex stored in the public constructor (line 41). This could be confusing if regex is used for display or debugging.

One option is to pass a descriptive string from the call sites, e.g.:

private RegexDomain(Automaton automaton, String regex) { ... }

// In intersect():
return new RegexDomain(intersection, "(" + this.regex + ")&(" + other.regex + ")");

// In union():
return new RegexDomain(union, "(" + this.regex + ")|(" + other.regex + ")");

Up to you whether this is worth addressing now.

This also affects LowerRegexTransformer: if a RegexDomain produced by intersect()/union() is passed as the output domain, getRegex() returns the debug string. isLiteral() will return false (since the debug string contains special characters), so the transformer silently skips the case-insensitive inversion and returns the domain unchanged. Fixing toString() here would fix that too.

No longer applicable. The regex String field has been eliminated entirely. RegexDomain is now purely automaton-driven. isLiteral() uses automaton.getFiniteStrings(2) and getLiteralValue() uses automaton.getFiniteStrings(1).

simbadzina · 2026-04-06T15:22:33Z

+   */
+  @Override
+  public List<String> sample(int limit) {
+    return sampleStrings(limit, 100);


Nit: consider extracting 100 to a named constant (e.g. DEFAULT_MAX_SAMPLE_LENGTH) to make the intent clearer.

Extracted to DEFAULT_MAX_SAMPLE_LENGTH = 100. sample() now throws IllegalStateException when the automaton is non-empty but DFS yields zero samples.

simbadzina · 2026-04-06T15:24:00Z

+   */
+  @Override
+  public List<String> sample(int limit) {
+    return sampleStrings(limit, 100);


If the regex only matches strings longer than maxLength (100), this silently returns an empty list. The caller can't distinguish "empty domain" from "domain exists but all valid strings exceed the length cap." Consider logging a warning or throwing when the domain is non-empty but no samples could be generated.

simbadzina · 2026-04-06T15:29:52Z

+
+    // Initialize generic domain inference program with all transformers
+    program = new DomainInferenceProgram(Arrays.asList(new LowerRegexTransformer(), new SubstringRegexTransformer(),
+        new PlusRegexTransformer(), new TimesRegexTransformer(), new CastRegexTransformer()));


As new transformers are added, every caller that assembles this list needs to remember to include them — easy to miss silently.

Consider adding a factory method to DomainInferenceProgram as the single source of truth:

public static DomainInferenceProgram withDefaultTransformers() {
  return new DomainInferenceProgram(List.of(
      new LowerRegexTransformer(),
      new SubstringRegexTransformer(),
      new PlusRegexTransformer(),
      new TimesRegexTransformer(),
      new CastRegexTransformer()));
}

You could also add a test that verifies all DomainTransformer implementations are included in the default list, so adding a new transformer without registering it fails a test.

I was considering whether we need a ServiceLoader, but that seems like overkill.

Agreed, ServiceLoader is overkill. Added DomainInferenceProgram.withDefaultTransformers() as the single source of truth for the default transformer list.

simbadzina · 2026-04-06T15:31:22Z

+   * Convenience method for deriving IntegerDomain constraints.
+   * Throws if the result is not an IntegerDomain.
+   */
+  public IntegerDomain deriveInputInteger(RexNode expr, IntegerDomain outputInteger) {


This method doesn't appear to be called anywhere. Consider removing it to avoid dead code, or adding test coverage if it's intended for future use.

simbadzina · 2026-04-06T15:33:51Z

+
+  @Override
+  public boolean canHandle(RexNode expr) {
+    return expr instanceof RexCall && ((RexCall) expr).getOperator() == SqlStdOperatorTable.LOWER;


Should we also handle UPPER? The inversion logic is the same — both produce a case-insensitive regex. Consider generalizing this into a CaseRegexTransformer that handles both SqlStdOperatorTable.LOWER and SqlStdOperatorTable.UPPER, to avoid duplicating the class.

Deferring to a follow-up. The logic is identical, but adding UPPER support can be done cleanly when it's actually needed.

simbadzina · 2026-04-06T15:37:38Z

+      if (isStringType(targetTypeName) && !isStringType(sourceTypeName)) {
+        if (isDateType(sourceTypeName)) {
+          // Date to String
+          String dateFormatRegex = "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|30)$";


Date end should handle 31 too. Right now handling ends at 30.
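A possible corrected pattern is sketched below, checked here with java.util.regex rather than the PR's automaton library. DateRegexSketch is a hypothetical name. Note that even with 31 allowed the pattern only constrains the shape of the string, so impossible dates such as 2024-02-31 still match; the sketch pairs it with java.time.LocalDate.parse to filter those out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

/** Hypothetical check combining a yyyy-MM-dd shape regex with calendar validation. */
final class DateRegexSketch {
  private DateRegexSketch() {}

  /** Same pattern as in the diff, with the day part extended from "30" to "3[01]". */
  static final Pattern ISO_DATE_SHAPE =
      Pattern.compile("^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$");

  /** True only for strings that have the right shape and name a real calendar date. */
  static boolean isRealDate(String s) {
    if (!ISO_DATE_SHAPE.matcher(s).matches()) {
      return false;
    }
    try {
      LocalDate.parse(s); // strict ISO parsing rejects dates like February 31
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }
}
```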

simbadzina · 2026-04-06T15:45:03Z

+    // For complex patterns, wrap with case-insensitive flag
+    // Note: Java regex doesn't have inline (?i) in automaton library,
+    // so we return the pattern as-is and rely on character-level matching
+    return outputRegex;


Consider adding a disabled test to track this known limitation, so it doesn't get silently forgotten:

@Test(enabled = false,
    description = "Non-literal LOWER patterns not yet case-insensitive inverted — see LowerRegexTransformer line 74")
public void testLowerWithComplexPattern() {
  // LOWER(name) LIKE '%abc%' should produce .*[aA][bB][cC].*
}

This way it shows up in test reports as skipped rather than being invisible.

The non-literal LOWER fallback (returning as-is for complex patterns like .*abc.*) is a known feature gap. A proper fix would walk the automaton transitions and expand alphabetic ranges to include both cases. Deferring to a follow-up alongside UPPER support.

simbadzina · 2026-04-06T15:50:47Z

+    return new Output(scans, remapped);
+  }
+
+  private static final Map<RexNode, RelNode> predicateOriginMap = new IdentityHashMap<>();


P0 — Thread-safety issue: predicateOriginMap is a static mutable field shared across all invocations. extract() calls .clear() (line 42) then populates it during collection (line 81). If two threads call extract() concurrently, they corrupt each other's data.

Fix: Make it a local variable inside extract() and pass it through as a parameter, or make the class non-static with an instance field.

Fixed. Made predicateOriginMap a local variable in extract() and passed it as a parameter to collectPredicates(). The class remains fully static with a private constructor — no API change since collectPredicates is private.
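The shape of that fix, with per-call state threaded through a private helper, can be sketched as follows. PredicateOriginSketch and its Object/String types are placeholders chosen for illustration; the real code maps RexNode to RelNode.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical static utility whose working map lives on the caller's stack. */
final class PredicateOriginSketch {
  private PredicateOriginSketch() {}

  /** Each call builds its own map, so concurrent calls cannot see each other's entries. */
  static Map<Object, String> extract(List<?> predicates, String origin) {
    Map<Object, String> originMap = new IdentityHashMap<>();
    collect(predicates, origin, originMap);
    return originMap;
  }

  /** The map is passed in explicitly instead of being read from a static field. */
  private static void collect(List<?> predicates, String origin, Map<Object, String> originMap) {
    for (Object predicate : predicates) {
      originMap.put(predicate, origin);
    }
  }
}
```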

simbadzina · 2026-04-06T15:57:21Z

+    }
+    // Remove escape sequences
+    return result.replaceAll("\\\\(.)", "$1");
+  }


Dead code: unescapeLiteral is defined but never called within this class. Consider removing it.

simbadzina · 2026-04-06T15:57:24Z

+    }
+    // Remove escape sequences
+    return result.replaceAll("\\\\(.)", "$1");
+  }


Duplicated code: this identical unescapeLiteral method also appears in SubstringRegexTransformer (line 119) and CastRegexTransformer (line 253, dead there). Consider extracting to a shared utility (e.g., RegexUtils.unescapeLiteral).

Leaving as-is. Duplication across 2 active files is tolerable; the method is small and stable.

simbadzina · 2026-04-06T15:59:06Z

+    }
+  }
+
+  private static class Node {


Dead code: inner class Node is never referenced anywhere. Likely a remnant of an earlier BFS-based sampling approach that was replaced by dfsCollect. Consider removing it.

Gone after refactor.

simbadzina · 2026-04-06T16:00:56Z

+      return interval.min;
+    }
+
+    long range = interval.max - interval.min;


Potential overflow: interval.max - interval.min can overflow for ranges exceeding Long.MAX_VALUE. For example, Interval(-1, Long.MAX_VALUE) produces a negative range, which would cause the range < Integer.MAX_VALUE check to behave unexpectedly and lead to incorrect sampling.
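One overflow-free way to sample from a wide interval is to compute the span with BigInteger and use rejection sampling. This is a sketch of that idea only; IntervalSamplingSketch is a hypothetical name, and the approach trades some speed for simplicity.

```java
import java.math.BigInteger;
import java.util.Random;

/** Hypothetical uniform sampler over the closed interval [min, max]. */
final class IntervalSamplingSketch {
  private IntervalSamplingSketch() {}

  static long sampleInclusive(long min, long max, Random rnd) {
    if (min > max) {
      throw new IllegalArgumentException("empty interval");
    }
    // Exact number of values; it can be as large as 2^64, which does not fit in a long.
    BigInteger span = BigInteger.valueOf(max).subtract(BigInteger.valueOf(min)).add(BigInteger.ONE);
    BigInteger offset;
    do {
      offset = new BigInteger(span.bitLength(), rnd); // uniform over [0, 2^bitLength - 1]
    } while (offset.compareTo(span) >= 0); // rejecting out-of-range draws keeps it uniform
    return BigInteger.valueOf(min).add(offset).longValueExact();
  }
}
```

For Interval(-1, Long.MAX_VALUE) the span is 2^63 + 1, which the sketch handles without any negative intermediate value.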

simbadzina · 2026-04-06T16:00:57Z

+ * x + 5 in [20, 30]
+ * produces: x in [15, 25]
+ */
+public class PlusRegexTransformer implements DomainTransformer {


Misleading name: PlusRegexTransformer operates on IntegerDomain and throws if given RegexDomain. The "Regex" prefix is a historical artifact. Consider renaming to PlusIntegerTransformer. Same applies to TimesRegexTransformer → TimesIntegerTransformer.

Renamed. PlusRegexTransformer → PlusIntegerTransformer, TimesRegexTransformer → TimesIntegerTransformer.

simbadzina · 2026-04-06T16:16:09Z

@@ -0,0 +1,659 @@
+/**


AI-Assisted Review Summary (In addition to the above comments)

Note: This is an automated review comment generated by AI (Claude). It is intended for consumption by both humans and AI agents. Author: please reply with a sub-comment indicating which issues should be fixed (e.g., fix: 1, 2, 5 or fix: all or skip: all).

P1 — Should Fix

| # | File | Line | Issue |
|---|------|------|-------|
| 1 | SubstringRegexTransformer.java | 33 | Operator name matching uses hardcoded "substr" but Hive's SUBSTRING may be registered as "substring" in Calcite. If mismatched, this transformer silently never fires. |
| 2 | RegexToIntegerDomainConverter.java | 462-465 | createIntegerDomainFromBounds computes a single [min, max] interval for large domains. Pattern ^(10\|20\|30)$ becomes [10, 30] including 20 false values. Should document the over-approximation explicitly. |
| 3 | RegexToIntegerDomainConverter.java | 109 | pattern.matches(".*[*+?].*") rejects *, +, ? anywhere — including inside character classes like [*+?] where they are literal. Could incorrectly reject valid patterns. |
| 4 | RegexToIntegerDomainConverter.java | 289 | No validation that quantifier min <= max. Pattern {5,2} is silently accepted; downstream loops iterate from min to max, producing no iterations rather than erroring. |
| 5 | CastRegexTransformer.java | 216-218 | break on first large interval (>100 values) loses remaining intervals. [10, 20] ∪ [1000, 2000] silently becomes -?[0-9]+, discarding the first interval's precision. |
| 6 | CanonicalPredicateExtractor.java | 84-89 | Join condition remapping applies a single base offset to all RexInputRefs. Correct only if Projects are already pulled up (stated in Javadoc precondition). No runtime validation — wrong results if called without pull-up. |
| 7 | ProjectPullUpRewriter.java | 241 | rightCount = join.getRowType().getFieldCount() - leftCount uses original join output type, but leftCount comes from leftProj.getRowType() (Project output, not input). Mismatch if the left Project changes field count. |

P2 — Consider

| # | File | Line | Issue |
|---|------|------|-------|
| 8 | RegexToIntegerDomainConverter.java | 331-342 | estimateSize sums across repetition counts: [0-9]{2,3} gives 10^2 + 10^3 = 1100 but actual domain is 1000 values. Over-estimates and may unnecessarily trigger bounds path, losing precision. |
| 9 | CastRegexTransformer.java | 206-219 | convertIntegerDomainToRegex emits -?[0-9]+ (matches ANY integer) for ranges >100 values. [1000, 9999] loses all constraint info. Known limitation but severe. |
| 10 | RegexToIntegerDomainConverter.java | 551, 590 | computeMinString/computeMaxString for AlternationNode: if alt.alternatives is empty, returns null → NPE in caller's sb.append(). Parser shouldn't produce empty alternations but not validated. |
| 11 | build.gradle | 2, 5 | Uses deprecated compile/testCompile instead of implementation/testImplementation. |
| 12 | rel/ package | — | No unit tests for ProjectPullUpRewriter, CanonicalPredicateExtractor, DnfRewriter. Only tested indirectly via integration tests. |

break on first large interval (>100 values) loses remaining intervals. [10, 20] ∪ [1000, 2000] silently becomes -?[0-9]+, discarding the first interval's precision.

Fixed with IntegerRangeAutomaton — a ~80-line utility that builds a minimal automaton accepting exactly the decimal representations of integers in [lo, hi] using recursive digit-by-digit construction. All intervals are now processed precisely via parts.add(IntegerRangeAutomaton.build(min, max)). The break and the generic -?[0-9]+ fallback are both gone. Added 19 tests in IntegerRangeAutomatonTest covering single values, cross-digit boundaries, large ranges, negative/mixed ranges, and leading-zero rejection.
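To give a feel for what precise range encoding means, here is a string-level analog that emits a java.util.regex pattern accepting exactly the decimal representations of the integers in [lo, hi]. It is a sketch only: IntegerRangeRegexSketch is a hypothetical name, it covers non-negative ranges up to an arbitrary limit, and it makes no claim to match the construction or minimality of the PR's IntegerRangeAutomaton.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical regex builder for a non-negative integer range, without leading zeros. */
final class IntegerRangeRegexSketch {
  private static final long LIMIT = 1_000_000_000_000L; // keeps block * 10 far from overflow

  private IntegerRangeRegexSketch() {}

  static String toRegex(long lo, long hi) {
    if (lo < 0 || lo > hi || hi > LIMIT) {
      throw new IllegalArgumentException("sketch supports 0 <= lo <= hi <= " + LIMIT);
    }
    List<String> parts = new ArrayList<>();
    long cur = lo;
    if (cur == 0) { // "0" is the only value allowed to start with the digit 0
      parts.add("0");
      cur = 1;
    }
    while (cur <= hi) {
      // Grow the largest block [cur, cur + 10^k - 1] that is aligned and still inside the range.
      long block = 1;
      int freeDigits = 0;
      while (cur % (block * 10) == 0 && cur + block * 10 - 1 <= hi) {
        block *= 10;
        freeDigits++;
      }
      // Every value in the block is a fixed prefix followed by freeDigits arbitrary digits.
      String prefix = Long.toString(cur / block);
      parts.add(freeDigits == 0 ? prefix : prefix + "[0-9]{" + freeDigits + "}");
      cur += block;
    }
    return "^(" + String.join("|", parts) + ")$";
  }
}
```

For [100, 200] this produces ^(1[0-9]{2}|200)$, which rejects "-5", "0", and "5000" while accepting "150".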

Join condition remapping applies a single base offset to all RexInputRefs. Correct only if Projects are already pulled up (stated in Javadoc precondition). No runtime validation — wrong results if called without pull-up.

The Javadoc precondition is sufficient. The single production caller always applies ProjectPullUpController.applyUntilFixedPoint() before extract().

rightCount = join.getRowType().getFieldCount() - leftCount uses original join output type, but leftCount comes from leftProj.getRowType() (Project output, not input). Mismatch if the left Project changes field count.

Fixed. The "only left had Project" and "only right had Project" branches now use newLeft.getRowType().getFieldCount() for pass-through RexInputRef offsets and newJoin.getRowType() for field types.

Uses deprecated compile/testCompile instead of implementation/testImplementation.

Fixed. Replaced throughout.

P1-2, P1-3, P1-4: RegexToIntegerDomainConverter issues (over-approximation, naive [*+?] check, no quantifier validation).

No longer applicable. The class was completely rewritten to use dk.brics automaton directly. isConvertible() checks a.isFinite() && isDigitOnly(a), convert() uses getFiniteStrings(5000) for enumeration with DFS fallback. The custom regex parser, estimateSize(), checkForInvalidConstructs(), computeMinString/computeMaxString, and createIntegerDomainFromBounds are all gone.

break on first large interval (>100 values) loses remaining intervals. [10, 20] ∪ [1000, 2000] silently becomes -?[0-9]+, discarding the first interval's precision.

Fixed via IntegerRangeAutomaton (see CastRegexTransformer.java:216-218 reply above).

estimateSize sums across repetition counts: [0-9]{2,3} gives 10^2 + 10^3 = 1100 but actual domain is 1000 values. Over-estimates and may unnecessarily trigger bounds path, losing precision.

No longer applicable (class rewritten, estimateSize removed).

convertIntegerDomainToRegex emits -?[0-9]+ (matches ANY integer) for ranges >100 values. [1000, 9999] loses all constraint info. Known limitation but severe.

Fixed via IntegerRangeAutomaton.

computeMinString/computeMaxString for AlternationNode: if alt.alternatives is empty, returns null → NPE in caller's sb.append(). Parser shouldn't produce empty alternations but not validated.

No longer applicable (class rewritten, methods removed).

simbadzina · 2026-04-06T17:10:18Z

Bug Bash: Failing Test Cases for Already-Identified Issues

Here are test cases for bugs already flagged in the review. They all fail on the current code and can be copy-pasted directly.

Tests to add to `CastRegexTransformerTest.java`

(Also add import java.util.Arrays; to the imports)

  @Test
  public void testCastStringToInteger_LargeRangePreservesConstraints() {
    // Bug: convertIntegerDomainToRegex (CastRegexTransformer.java:206-218) uses "-?[0-9]+"
    // for ranges >= 100, which matches ANY integer string, losing all constraint information.
    // For example, IntegerDomain([100, 200]) produces "^(-?[0-9]+)$" which matches -5, 0, 5000, etc.
    //
    // Fix: Generate proper digit-level regex patterns for large ranges (e.g., for [100, 200]
    // generate "(1[0-9]{2}|200)") or keep the IntegerDomain and convert lazily.

    RexNode inputRef = rexBuilder.makeInputRef(typeFactory.createSqlType(SqlTypeName.VARCHAR), 0);
    RexNode castExpr = rexBuilder.makeCast(typeFactory.createSqlType(SqlTypeName.INTEGER), inputRef);

    IntegerDomain outputDomain = IntegerDomain.of(Collections.singletonList(new IntegerDomain.Interval(100, 200)));

    Domain<?, ?> inputDomain = transformer.refineInputDomain(castExpr, outputDomain);

    assertTrue(inputDomain instanceof RegexDomain, "Should be RegexDomain");
    RegexDomain regexInput = (RegexDomain) inputDomain;

    // The regex should NOT match values outside [100, 200]
    assertTrue(regexInput.intersect(RegexDomain.literal("-5")).isEmpty(),
        "RegexDomain for [100,200] should reject '-5'");
    assertTrue(regexInput.intersect(RegexDomain.literal("0")).isEmpty(),
        "RegexDomain for [100,200] should reject '0'");
    assertTrue(regexInput.intersect(RegexDomain.literal("5000")).isEmpty(),
        "RegexDomain for [100,200] should reject '5000'");

    // Should still accept values within range
    assertFalse(regexInput.intersect(RegexDomain.literal("150")).isEmpty(),
        "RegexDomain for [100,200] should accept '150'");
  }

  @Test
  public void testCastStringToInteger_MultiIntervalWithLargeRange() {
    // Bug: The break statement in convertIntegerDomainToRegex (CastRegexTransformer.java:218)
    // exits the entire loop when a large interval (>= 100) is encountered, causing all
    // subsequent intervals to be dropped. Combined with Bug 1, the generic "-?[0-9]+" pattern
    // replaces all interval-specific constraints.
    //
    // Fix: Remove the 'break' at line 218 and handle each interval independently,
    // generating proper bounded patterns for each.

    RexNode inputRef = rexBuilder.makeInputRef(typeFactory.createSqlType(SqlTypeName.VARCHAR), 0);
    RexNode castExpr = rexBuilder.makeCast(typeFactory.createSqlType(SqlTypeName.INTEGER), inputRef);

    IntegerDomain outputDomain = IntegerDomain.of(Arrays.asList(
        new IntegerDomain.Interval(1, 5),
        new IntegerDomain.Interval(100, 200)));

    Domain<?, ?> inputDomain = transformer.refineInputDomain(castExpr, outputDomain);

    assertTrue(inputDomain instanceof RegexDomain, "Should be RegexDomain");
    RegexDomain regexInput = (RegexDomain) inputDomain;

    // Values in [1,5] should be accepted
    assertFalse(regexInput.intersect(RegexDomain.literal("3")).isEmpty(),
        "RegexDomain for [1,5]∪[100,200] should accept '3'");

    // Values in [100,200] should be accepted
    assertFalse(regexInput.intersect(RegexDomain.literal("150")).isEmpty(),
        "RegexDomain for [1,5]∪[100,200] should accept '150'");

    // Values in neither interval should be rejected
    assertTrue(regexInput.intersect(RegexDomain.literal("50")).isEmpty(),
        "RegexDomain for [1,5]∪[100,200] should reject '50'");
    assertTrue(regexInput.intersect(RegexDomain.literal("-999")).isEmpty(),
        "RegexDomain for [1,5]∪[100,200] should reject '-999'");
  }

  @Test
  public void testCastDateToString_AcceptsDay31() {
    // Bug: Date format regex at CastRegexTransformer.java:138 is
    // "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|30)$"
    // The day alternation (0[1-9]|[12][0-9]|30) covers days 01-30 but NOT 31.
    // Valid dates like 2024-01-31 or 2024-03-31 are incorrectly excluded.
    //
    // Fix: Change the day portion from "(0[1-9]|[12][0-9]|30)" to "(0[1-9]|[12][0-9]|3[01])".

    RexNode inputRef = rexBuilder.makeInputRef(typeFactory.createSqlType(SqlTypeName.DATE), 0);
    RexNode castExpr = rexBuilder.makeCast(typeFactory.createSqlType(SqlTypeName.VARCHAR), inputRef);

    RegexDomain outputDomain = RegexDomain.literal("2024-01-31");

    Domain<?, ?> inputDomain = transformer.refineInputDomain(castExpr, outputDomain);

    assertTrue(inputDomain instanceof RegexDomain, "Should be RegexDomain");
    RegexDomain regexInput = (RegexDomain) inputDomain;

    // 2024-01-31 is a valid date; the domain should not be empty
    assertFalse(regexInput.isEmpty(),
        "Domain for CAST(date AS string) should accept '2024-01-31' -- day 31 is valid");
  }
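The day-31 gap can also be reproduced without Calcite, using only `java.util.regex`. The sketch below copies the two patterns quoted in the test comment above; the class name `DateRegexCheck` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical standalone check, independent of CastRegexTransformer.
class DateRegexCheck {
  // Pattern quoted in the bug report: the day part stops at 30.
  static final Pattern OLD =
      Pattern.compile("^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|30)$");
  // Proposed fix: the day part allows 30 and 31.
  static final Pattern FIXED =
      Pattern.compile("^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$");

  static boolean oldAccepts(String date) {
    return OLD.matcher(date).matches();
  }

  static boolean fixedAccepts(String date) {
    return FIXED.matcher(date).matches();
  }

  public static void main(String[] args) {
    System.out.println(oldAccepts("2024-01-31"));   // false
    System.out.println(fixedAccepts("2024-01-31")); // true
    System.out.println(fixedAccepts("2024-02-31")); // true
  }
}
```

Note that the fixed pattern is still an over-approximation: it accepts strings such as 2024-02-31 that are not calendar dates, so a regex alone does not validate month lengths.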

simbadzina · 2026-04-06T23:55:36Z

Consider moving all transformer implementations (CastRegexTransformer, LowerRegexTransformer, PlusRegexTransformer, SubstringRegexTransformer, TimesRegexTransformer) into a transformer subpackage (e.g., com.linkedin.coral.datagen.domain.transformer). As more transformers get added, the domain package will get crowded with implementation classes. Keeping the DomainTransformer interface in the domain package and the implementations in a subpackage would make the structure cleaner.

simbadzina · 2026-04-07T00:08:52Z

+  }
+
+  @Override
+  public Domain<?, ?> refineInputDomain(RexNode expr, Domain<?, ?> outputDomain) {


Suggestion: Refactor to avoid the growing if-else chain

This method dispatches on (sourceType, targetType, domainType) via a series of if blocks. As more types are supported (decimal, timestamp, etc.), this will grow unwieldy.

Option A — Handler registry (simplest):

@FunctionalInterface
interface CastHandler {
  Domain<?, ?> handle(RexCall call, Domain<?, ?> outputDomain);
}

enum TypeCategory { STRING, INTEGER, DATE, NUMERIC_OTHER }

// Register handlers keyed by (source, target, domainClass)
private final Map<CastKey, CastHandler> handlers = new HashMap<>();

private void registerHandlers() {
  handlers.put(key(STRING, INTEGER, IntegerDomain.class), this::castStringToIntegerWithIntegerDomain);
  handlers.put(key(STRING, INTEGER, RegexDomain.class), this::castStringToIntegerWithRegexDomain);
  handlers.put(key(INTEGER, STRING, RegexDomain.class), this::castIntegerToStringWithRegexDomain);
  handlers.put(key(DATE, STRING, RegexDomain.class), this::castDateToStringWithRegexDomain);
  // ...add new entries as needed
}

Then refineInputDomain becomes:

public Domain<?, ?> refineInputDomain(RexNode expr, Domain<?, ?> outputDomain) {
  // ... extract sourceType, targetType ...
  if (sourceTypeName == targetTypeName) return outputDomain;

  CastHandler handler = handlers.get(
      key(categorize(sourceTypeName), categorize(targetTypeName), outputDomain.getClass()));
  if (handler != null) return handler.handle(call, outputDomain);

  // fallback for numeric-to-numeric, unknown combos, etc.
  return outputDomain;
}

Option B — Two-level dispatch: Group by (sourceCategory, targetCategory) into strategy objects, each of which handles the domain type internally. Better if strategies within a type pair share logic.

Option A is probably the right starting point — adding support for a new type is just adding a map entry and a private method, with no risk of breaking existing branches.
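A self-contained version of Option A, for anyone who wants to run the idea outside Calcite. This is only a sketch: `Domain`, `RegexDomain`, `IntegerDomain`, and `CastKey` here are simplified stand-ins defined inside the example, not the PR's classes, and the handler takes just the output domain rather than a `RexCall`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a handler registry; all nested types are stand-ins.
class CastRegistrySketch {
  enum TypeCategory { STRING, INTEGER, DATE }

  interface Domain {}

  static final class RegexDomain implements Domain {
    final String regex;
    RegexDomain(String regex) { this.regex = regex; }
  }

  static final class IntegerDomain implements Domain {
    final int value; // single-value domain keeps the sketch short
    IntegerDomain(int value) { this.value = value; }
  }

  /** Map key: (source category, target category, output domain class). */
  static final class CastKey {
    final TypeCategory source;
    final TypeCategory target;
    final Class<?> domainClass;

    CastKey(TypeCategory source, TypeCategory target, Class<?> domainClass) {
      this.source = source;
      this.target = target;
      this.domainClass = domainClass;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof CastKey)) return false;
      CastKey k = (CastKey) o;
      return source == k.source && target == k.target && domainClass == k.domainClass;
    }

    @Override
    public int hashCode() {
      return Objects.hash(source, target, domainClass);
    }
  }

  private final Map<CastKey, UnaryOperator<Domain>> handlers = new HashMap<>();

  CastRegistrySketch() {
    // CAST(string AS INTEGER) = n  =>  the input string must spell n.
    handlers.put(new CastKey(TypeCategory.STRING, TypeCategory.INTEGER, IntegerDomain.class),
        d -> new RegexDomain("^" + ((IntegerDomain) d).value + "$"));
  }

  Domain refineInputDomain(TypeCategory source, TypeCategory target, Domain outputDomain) {
    if (source == target) return outputDomain;
    UnaryOperator<Domain> handler =
        handlers.get(new CastKey(source, target, outputDomain.getClass()));
    // Unknown combinations fall back to passing the output domain through.
    return handler != null ? handler.apply(outputDomain) : outputDomain;
  }
}
```

Supporting a new cast then means one more `handlers.put(...)` entry; existing entries are not touched.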

Deferring. The current 8-branch if-else chain is manageable. Will refactor to a handler registry when it reaches ~15+ branches as more types are supported.

A handler registry sounds like a good approach as a follow-up when more branches are needed.

simbadzina · 2026-04-07T00:10:10Z

+ * 
+ * @deprecated Use {@link DomainInferenceProgram} for full cross-domain support
+ */
+public class RegexDomainInferenceProgram {


This is dead code; it has no callers.

Good catch. Removed.

simbadzina · 2026-04-07T00:31:26Z

+ * <p>Usage pattern:</p>
+ * <pre>
+ *   CanonicalPredicateExtractor.Output extracted = CanonicalPredicateExtractor.extract(rel);
+ *   CanonicalPredicateDnf.Output dnf = CanonicalPredicateDnf.convert(extracted, rexBuilder);


Nit: Javadoc references CanonicalPredicateDnf.Output / CanonicalPredicateDnf.convert but the class is DnfRewriter.

Thanks. Fixed.

simbadzina · 2026-04-07T00:31:28Z

+    }
+
+    List<RelNode> newInputs = new ArrayList<>();
+    boolean anyChanged = false;


Dead code: anyChanged is assigned at line 152 but never read — the method returns immediately after setting it.

simbadzina · 2026-04-07T00:31:29Z

+
+    // Inline right Project if present
+    if (rightProj != null) {
+      int leftFieldCount = newLeft.getRowType().getFieldCount();


Same category of bug as line 241: after newLeft = leftProj.getInput() (line 205), newLeft.getRowType().getFieldCount() gives the field count of the left project's input (child), not its output. If the left project changes field count (adds/drops columns), this offset will be wrong for right-side condition inlining.

Good catch. Fixed. Captured leftFieldCount from leftProj.getRowType().getFieldCount() before newLeft is reassigned. Added a unit test (testJoinLeftProjectPullUp_FieldCountChanged) that confirms the fix.
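The offset problem is easier to see with plain lists than with RexInputRef. In the sketch below (hypothetical names, not the ProjectPullUpRewriter code), a join row is the left fields followed by the right fields, so a join-level index into the right side is resolved by subtracting the left field count. The example assumes a left Project that narrows three input columns (a, b, c) to two output columns (a, c).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of join field offsets using field names only.
class JoinOffsetSketch {

  /** Resolves a join-level index to a right-side field, given the left field count. */
  static String resolveRightField(int joinIndex, int leftFieldCount, List<String> rightFields) {
    return rightFields.get(joinIndex - leftFieldCount);
  }

  public static void main(String[] args) {
    List<String> rightFields = Arrays.asList("x", "y");
    int leftProjectOutputCount = 2; // Project output: (a, c)
    int leftProjectInputCount = 3;  // Project input:  (a, b, c)

    // The original join row is (a, c, x, y), so index 3 refers to y.
    System.out.println(resolveRightField(3, leftProjectOutputCount, rightFields)); // y
    // Using the Project's input count shifts the reference onto the wrong column.
    System.out.println(resolveRightField(3, leftProjectInputCount, rightFields));  // x
  }
}
```

This is why the count has to be captured from the Project's output row type before `newLeft` is replaced by the Project's input.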

wmoustafa · 2026-04-15T04:21:29Z

Thanks for the Bug Bash tests. Added them. They pass after the above fixes.

wmoustafa · 2026-04-15T04:21:57Z

Consider moving all transformer implementations (CastRegexTransformer, LowerRegexTransformer, PlusRegexTransformer, SubstringRegexTransformer, TimesRegexTransformer) into a transformer subpackage (e.g., com.linkedin.coral.datagen.domain.transformer). As more transformers get added, the domain package will get crowded with implementation classes. Keeping the DomainTransformer interface in the domain package and the implementations in a subpackage would make the structure cleaner.

Moved to own package.

- Move transformers to dedicated subpackage (domain/transformer/)
- Rename PlusRegexTransformer → PlusIntegerTransformer, TimesRegexTransformer → TimesIntegerTransformer
- Add DomainInferenceProgram.withDefaultTransformers() factory method
- Rewrite RegexToIntegerDomainConverter with automaton-based approach and add IntegerRangeAutomaton for precise integer range constraints
- Add unit tests for CanonicalPredicateExtractor, ProjectPullUpRewriter, and IntegerRangeAutomaton
- Fix DnfRewriter Javadoc, extract RegexDomain magic number, make IntegerRangeAutomaton public for cross-package access
- Remove dead code: deriveInputRegex(), deprecated sampleValues()
- Normalize test conventions: setup method naming, section separators, camelCase method names, assertEquals parameter order, and IntegerDomain construction style

simbadzina

LGTM. Just merge conflicts need to be addressed.

wmoustafa added 4 commits November 28, 2025 13:15

- Add sampling based data generation (8f85e39)
- Add further features and fix end-to-end implementation (923f340)
- Add hive.xml (286ade3)
- Fix build (7a8549a)

simbadzina reviewed Jan 15, 2026

- Address review comments (f0e77e2)

simbadzina reviewed Apr 6, 2026

simbadzina reviewed Apr 7, 2026

simbadzina approved these changes Apr 21, 2026

| # | File | Line | Issue |
|---|------|------|-------|
| 1 | `SubstringRegexTransformer.java` | 33 | Operator name matching uses hardcoded `"substr"` but Hive's SUBSTRING may be registered as `"substring"` in Calcite. If mismatched, this transformer silently never fires. |
| 2 | `RegexToIntegerDomainConverter.java` | 462-465 | `createIntegerDomainFromBounds` computes a single `[min, max]` interval for large domains. Pattern `^(10\|20\|30)$` becomes `[10, 30]`, which includes 18 false values. Should document the over-approximation explicitly. |
| 3 | `RegexToIntegerDomainConverter.java` | 109 | `pattern.matches(".*[*+?].*")` rejects `*`, `+`, `?` anywhere, including inside character classes like `[*+?]` where they are literal. Could incorrectly reject valid patterns. |
| 4 | `RegexToIntegerDomainConverter.java` | 289 | No validation that quantifier `min <= max`. Pattern `{5,2}` is silently accepted; downstream loops iterate from min to max, producing no iterations rather than erroring. |
| 5 | `CastRegexTransformer.java` | 216-218 | `break` on first large interval (>100 values) loses remaining intervals. `[10, 20] ∪ [1000, 2000]` silently becomes `-?[0-9]+`, discarding the first interval's precision. |
| 6 | `CanonicalPredicateExtractor.java` | 84-89 | Join condition remapping applies a single base offset to all `RexInputRef`s. Correct only if Projects are already pulled up (stated in Javadoc precondition). No runtime validation; wrong results if called without pull-up. |
| 7 | `ProjectPullUpRewriter.java` | 241 | `rightCount = join.getRowType().getFieldCount() - leftCount` uses original join output type, but `leftCount` comes from `leftProj.getRowType()` (Project output, not input). Mismatch if the left Project changes field count. |

| # | File | Line | Issue |
|---|------|------|-------|
| 8 | `RegexToIntegerDomainConverter.java` | 331-342 | `estimateSize` sums across repetition counts: `[0-9]{2,3}` gives `10^2 + 10^3 = 1100` but actual domain is 1000 values. Over-estimates and may unnecessarily trigger bounds path, losing precision. |
| 9 | `CastRegexTransformer.java` | 206-219 | `convertIntegerDomainToRegex` emits `-?[0-9]+` (matches ANY integer) for ranges >100 values. `[1000, 9999]` loses all constraint info. Known limitation but severe. |
| 10 | `RegexToIntegerDomainConverter.java` | 551, 590 | `computeMinString`/`computeMaxString` for `AlternationNode`: if `alt.alternatives` is empty, returns null → NPE in caller's `sb.append()`. Parser shouldn't produce empty alternations but not validated. |
| 11 | `build.gradle` | 2, 5 | Uses deprecated `compile`/`testCompile` instead of `implementation`/`testImplementation`. |
| 12 | `rel/` package | - | No unit tests for `ProjectPullUpRewriter`, `CanonicalPredicateExtractor`, `DnfRewriter`. Only tested indirectly via integration tests. |
