Learn about the data validation process used by the Data Flywheel Foundational Blueprint. This validation process ensures that all data follows the OpenAI Chat Completion format and meets quality standards before being used for training or evaluation.
┌─────────────────────────┐
│ Start Validation │
│ Input: Records, │
│ Workload Type, Limit │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ validate_records_count │
│ - Check min_records │
│ - Check eval_size │
│ - Check limit validity │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Separate by Format │
│ - Valid OpenAI format │
│ - Invalid format │
└─────────────────────────┘
│
┌──────┴──────┐
│ │
Valid Invalid
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│Valid OpenAI │ │ Track │
│ Records │ │ Invalid │
└─────────────┘ │ Count │
│ └─────────────┘
▼
┌─────────────────────────┐
│ Apply Quality Filters │
│ Based on Workload Type │
└─────────────────────────┘
│
┌──────┴──────┐
│ │
TOOL_CALLING GENERIC
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Validate: │ │ No Special │
│ - Has tool │ │ Validation │
│ calls │ │ │
│ - Valid │ │ │
│ function │ │ │
│ args JSON │ │ │
│ - Parse │ │ │
│ args to │ │ │
│ objects │ │ │
└─────────────┘ └─────────────┘
│ │
└──────┬──────┘
▼
┌─────────────────────────┐
│ Remove Duplicates │
│ Based on User Queries │
└─────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Check: Have >= min_total_records │
│ valid records? │
└─────────────────────────────────────┘
│
┌──────┴──────┐
│ │
Yes│ No │
▼ ▼
┌─────────┐ ┌────────────────┐
│ Random │ │ FAIL: │
│ Select │ │ Not Enough │
│ limit │ │ Valid Records │
│ records │ └────────────────┘
└─────────┘
│
▼
┌─────────────────────────┐
│ Log Statistics │
│ Return Records │
└─────────────────────────┘The first validation step checks the input parameters and record counts:
- Minimum Records Check: Ensures total records ≥
min_total_records - Eval Size Check: Ensures
eval_size≤ total records available - Limit Validity Check: Ensures
limit≥min_total_records
If any of these checks fail, the validation stops with a descriptive error message.
The validator checks for a complete OpenAI format structure:
- Top-level:
request,response,workload_id,client_idfields - Request: Must have
messageslist with valid message objects - Response: Must have
choiceslist with valid choice objects - Each message: Must have
roleand appropriate content fields - Each choice: Must have
messagewith eithercontentortool_calls
Based on workload type:
- Record must have tool calls in response
- Tool Properties Limit: Tool function parameter definitions are limited to a maximum of 8 properties (for known NIM bug)
- Function arguments must be valid JSON
- Arguments are parsed from strings to JSON objects
- No special validation required
Removes duplicate records based on user queries:
- Handles both string and multimodal content
- Creates unique keys from user messages
- If valid records ≥ limit: Randomly selects
limitrecords - If valid records < min_total_records: Raises an error with detailed statistics
Use the load_test_data.py script to load data with OpenAI format conversion:
python src/scripts/load_test_data.py --workload-id <workload_id> --file data/aiva-test.jsonlThe validator implements the following logic based on the limit parameter from the configuration:
- If valid records ≥ limit: Randomly selects
limitrecords from valid ones - If valid records < min_total_records: Raises an exception asking for more valid records
{
"total_records": 0,
"valid_openai_format": 0,
"invalid_format": 0,
"removed_quality_filters": 0,
"deduplicated_queries": 0,
"final_selected": 0,
}When there are insufficient records:
Insufficient valid records. Found {deduplicated} but need {limit}.
Total records: {total}, valid OpenAI format: {valid_openai},
after quality filters: {quality_filtered}.
Please provide more valid records.Add the following to your config/config.yaml:
data_split_config:
eval_size: 100
val_ratio: 0.1
min_total_records: 50
random_seed: null
limit: 1000
parse_function_arguments: TrueThe validation is implemented in two classes:
-
OpenAIFormatValidator: Validates the OpenAI Chat Completion format
- Validates the request and response structure
- Checks message roles and content
- Validates tool calls if present
- Applies workload-specific quality filters
-
DataValidator: Orchestrates the validation process
- Uses OpenAIFormatValidator for format validation
- Applies quality filters based on workload type
- Handles deduplication
- Performs final selection
from src.lib.integration.data_validator import DataValidator
from src.api.models import WorkloadClassification
validator = DataValidator()
validated_records = validator.validate_records(
records=raw_records,
workload_type=WorkloadClassification.TOOL_CALLING,
limit=1000,
min_total_records=50
)The DataSplitConfig object, defined in src/config.py, controls parameters such as eval_size, min_total_records, limit, and others used during validation. Adjust these values in your configuration file (config/config.yaml) to change validation behavior.
All validation steps and errors are logged using the project's logging system. Check your log output (as configured in the project) for detailed statistics and troubleshooting information.
Setting the random_seed parameter in your configuration ensures that the random selection of records is reproducible across runs. This is useful for consistent evaluation and debugging.
To add new workload types or custom validation rules, extend the DataValidator or OpenAIFormatValidator classes in src/lib/integration/. Add new methods or modify existing ones as needed to implement your custom logic.
Example datasets for testing can be found in the tests/fixtures/ directory or generated using the provided scripts (see notebooks/ and src/scripts/).
Deduplication is performed using the content of user messages only. Non-textual or multimodal content is not currently hashed or compared; only the text content of user messages is used to identify duplicates.
Validation statistics and errors are output to the log (see your project logging configuration for details on where logs are written).