AI Data Validation
Overview
AI Data Validation uses machine learning to detect data quality issues in your records. It identifies typos, format errors, duplicate entries, and semantic inconsistencies, then suggests fixes or applies them automatically.
Key Features
- Real-time Validation: Records are validated as they're created or updated
- Batch Scanning: Scan all existing records to find issues
- Auto-Correct Mode: Automatically fix high-confidence issues
- Configurable Per Structure: Enable/disable and configure for each structure independently
Issue Types
| Type | Description | Examples |
|---|---|---|
| Format | Rule-based validation for standard formats | Invalid email, malformed URL, incorrect date format |
| Typo | AI-powered spelling and typo detection | "Jonh" → "John", "recieve" → "receive" |
| Duplicate | Potential duplicate records detected | Two customers with same email but different names |
| Semantic | Logical inconsistencies in data | End date before start date, negative quantities |
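To make the Format and Semantic categories concrete, here is a minimal sketch of the kind of rule-based checks they describe. It is not the product's implementation: the function names, the record fields (`email`, `startDate`, `endDate`, `quantity`), and the deliberately simplified email pattern are all assumptions made for illustration.

```javascript
// Hypothetical illustration only; not the actual validation engine.

// Deliberately simple email pattern: one "@", no whitespace, a dot in the domain.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Format check: returns an issue object, or null if the value looks valid.
function checkEmailFormat(record) {
  if (typeof record.email === 'string' && !EMAIL_PATTERN.test(record.email)) {
    return { type: 'format', field: 'email', reason: 'Invalid email' };
  }
  return null;
}

// Semantic checks: each value is well-formed on its own, but the record as a
// whole is logically inconsistent.
function checkSemantics(record) {
  const issues = [];
  if (record.startDate && record.endDate &&
      new Date(record.endDate) < new Date(record.startDate)) {
    issues.push({ type: 'semantic', field: 'endDate', reason: 'End date before start date' });
  }
  if (typeof record.quantity === 'number' && record.quantity < 0) {
    issues.push({ type: 'semantic', field: 'quantity', reason: 'Negative quantity' });
  }
  return issues;
}
```

Typo and Duplicate detection are described as AI-powered or record-to-record comparisons, so they are not reducible to a single-field rule like the ones sketched here.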
Configuration
AI Validation is configured per structure in the Console UI under the AI Validation tab.
Validation Mode
- Advisory: Issues are flagged for manual review. No automatic changes are made.
- Auto-Correct: High-confidence issues are automatically fixed. You set the confidence threshold (70%-100%).
Auto-Correct Threshold
When using auto-correct mode, only suggestions with confidence above your threshold are applied automatically. Lower-confidence suggestions are still flagged for manual review.
Recommended thresholds:
- 90%+ for production data (conservative)
- 80%+ for development/testing (balanced)
- 70%+ for initial data cleanup (aggressive)
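The threshold rule can be sketched as a small decision function. This is an illustration, not the product's code: `routeSuggestion` and its return values are hypothetical names, and both numbers are treated as percentages as in the Console. Whether a suggestion exactly at the threshold is applied is not stated above, so the sketch reads "above" as strictly greater.

```javascript
// Hypothetical sketch of the auto-correct decision; names are illustrative.
// `confidence` and `threshold` are percentages (0-100).
function routeSuggestion(confidence, threshold, mode) {
  // Advisory mode never applies changes, whatever the confidence.
  if (mode === 'auto-correct' && confidence > threshold) {
    return 'auto-apply';
  }
  // Everything else stays in the queue for a human to accept or reject.
  return 'manual-review';
}

console.log(routeSuggestion(95, 90, 'auto-correct')); // auto-apply
console.log(routeSuggestion(85, 90, 'auto-correct')); // manual-review
```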
Using AI Validation
Enable Validation
- Navigate to your structure in the Console
- Click the AI Validation tab
- Toggle Enable AI Validation
- Select which issue types to detect
- Choose your validation mode
- Click Save Configuration
Run a Batch Scan
To scan all existing records:
- Go to the AI Validation tab
- Click Start Scan
- Progress is shown in real time
- You can navigate away; the scan continues in the background
Review Suggestions
Pending suggestions appear in the Validation Suggestions table:
- Accept: Apply the suggested fix to the record
- Reject: Mark the suggestion as not applicable
- Bulk Actions: Select multiple suggestions to accept/reject at once
Suggestion Details
Each suggestion shows:
- Record: Link to the affected record
- Field: The field with the issue
- Original Value: Current value in the record
- Suggested Value: Recommended correction
- Confidence: How confident the AI is (0-100%)
- Reason: Explanation of why this was flagged (hover to view)
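As a rough picture of how these details might look in code, the sketch below models one suggestion as a plain object. Only `id`, `field`, `issueType`, and `confidence` appear in the SDK examples on this page; `recordId`, `originalValue`, `suggestedValue`, and `reason` are assumed property names, and the 0-1 confidence scale is inferred from the SDK filter rather than documented.

```javascript
// Hypothetical suggestion object. Property names other than `id`, `field`,
// `issueType`, and `confidence` are assumptions for illustration.
const exampleSuggestion = {
  id: 'suggestion-id',
  recordId: 'record-id',
  field: 'firstName',
  issueType: 'typo',
  originalValue: 'Jonh',
  suggestedValue: 'John',
  confidence: 0.97, // assumed 0-1 scale, displayed as 0-100% in the Console
  reason: 'Likely misspelling of a common first name',
};

// Render a one-line summary similar to a row in the suggestions table.
function describeSuggestion(s) {
  const percent = Math.round(s.confidence * 100);
  return `${s.field}: "${s.originalValue}" -> "${s.suggestedValue}" (${s.issueType}, ${percent}%)`;
}

console.log(describeSuggestion(exampleSuggestion));
```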
Real-time Notifications
When validation is enabled, you receive notifications for:
- New validation suggestions created
- Batch scan completion with summary

Subscribe to validation events in the Notification Settings.
SDK Usage
The SDK provides methods to work with validation:
// Trigger a batch validation scan
const batch = await centrali.validation.triggerScan('orders');
console.log('Scan started:', batch.data.batchId);

// List pending suggestions
const suggestions = await centrali.validation.listSuggestions({
  status: 'pending',
  issueType: 'typo'
});

// Accept a suggestion (applies the fix)
await centrali.validation.accept('suggestion-id');

// Reject a suggestion
await centrali.validation.reject('suggestion-id');

// Bulk accept high-confidence suggestions
// (this filter treats confidence as a 0-1 value; the Console shows it as 0-100%)
const highConfidence = suggestions.data.filter(s => s.confidence >= 0.95);
await centrali.validation.bulkAccept(highConfidence.map(s => s.id));

// Get validation summary
const summary = await centrali.validation.getSummary();
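The list-then-bulk-accept pattern above can be wrapped in a reusable helper. The following is a sketch under assumptions: `acceptHighConfidence` and `createFakeValidation` are hypothetical names, the helper relies only on `listSuggestions` and `bulkAccept` as they are called above, and confidence is assumed to be on the same 0-1 scale as the filter in that example. The in-memory stand-in exists only so the helper can be tried without a live workspace.

```javascript
// Sketch: accept every pending suggestion at or above a confidence cutoff.
// `validation` is any object exposing listSuggestions() and bulkAccept().
async function acceptHighConfidence(validation, minConfidence) {
  const suggestions = await validation.listSuggestions({ status: 'pending' });
  const ids = suggestions.data
    .filter((s) => s.confidence >= minConfidence)
    .map((s) => s.id);
  // Skip the bulk call entirely when nothing qualifies.
  if (ids.length > 0) {
    await validation.bulkAccept(ids);
  }
  return ids;
}

// In-memory stand-in for centrali.validation, for local experimentation only.
function createFakeValidation(items) {
  const accepted = [];
  return {
    accepted,
    listSuggestions: async () => ({ data: items }),
    bulkAccept: async (ids) => { accepted.push(...ids); },
  };
}
```

With the real SDK, the call would presumably look like `await acceptHighConfidence(centrali.validation, 0.95)`.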
Realtime Events
Subscribe to validation events via SSE:
const subscription = centrali.realtime.subscribe({
  structures: ['orders'],
  events: ['validation_suggestion_created', 'validation_batch_completed'],
  onEvent: (event) => {
    if (event.event === 'validation_suggestion_created') {
      console.log('New issue:', event.data.field, event.data.issueType);
    }
  }
});
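When a subscription handles both event types, it can help to keep the formatting logic in a separate, testable function. The sketch below is one way to do that. The two event names come from the subscription above; `formatValidationEvent` is a hypothetical helper, and the `summary.issuesFound` property on the batch-completed payload is an assumption, since the payload shape is not documented here.

```javascript
// Sketch of a pure function that could be called from `onEvent`.
// `event.data.summary.issuesFound` is a hypothetical payload property.
function formatValidationEvent(event) {
  switch (event.event) {
    case 'validation_suggestion_created':
      return `New ${event.data.issueType} issue on field "${event.data.field}"`;
    case 'validation_batch_completed': {
      const summary = event.data && event.data.summary;
      if (summary && typeof summary.issuesFound === 'number') {
        return `Batch scan completed: ${summary.issuesFound} issues found`;
      }
      return 'Batch scan completed';
    }
    default:
      return null; // ignore events this handler does not know about
  }
}
```

Inside the subscription, `onEvent` could then log the returned string whenever it is not `null`.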
Best Practices
Start with Advisory Mode
Begin with advisory mode to understand what issues exist in your data before enabling auto-correct.
Use High Thresholds for Production
Set the auto-correct threshold to 90% or higher for production data to minimize false positives.
Regular Batch Scans
Run batch scans periodically (weekly/monthly) to catch issues that may have been missed during real-time validation.
Review Low-Confidence Suggestions
Low-confidence suggestions often reveal edge cases or unusual but valid data. Review these carefully.
Limits
| Limit | Value |
|---|---|
| Batch scan size | All records in structure |
| Concurrent scans per workspace | 1 |
| Suggestion retention | 90 days |
| Max suggestions per batch | 10,000 |
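Because only one scan can run per workspace at a time, a script that triggers scans may want to avoid requesting a second one while the first is still running. The sketch below shows a purely client-side guard. `createScanGuard` is a hypothetical helper, it tracks only scans started by the current process, and how the service responds to an overlapping request is not documented here.

```javascript
// Sketch: client-side guard against overlapping scans from one process.
// `validation` is any object exposing triggerScan(structure).
function createScanGuard(validation) {
  let active = false;
  return {
    // Returns the triggerScan result, or null if a scan is already in flight.
    async start(structure) {
      if (active) return null;
      active = true;
      try {
        return await validation.triggerScan(structure);
      } catch (err) {
        active = false; // the scan never started, so release the guard
        throw err;
      }
    },
    // Call this when a validation_batch_completed event arrives.
    markCompleted() {
      active = false;
    },
  };
}
```

A typical arrangement would be to call `markCompleted()` from the realtime handler for `validation_batch_completed`, so the next scheduled scan is allowed to start.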
Related Documentation
- Anomaly Insights - AI-powered anomaly detection
- Schema Discovery - Automatic schema evolution
- Structures & Records - Data schemas and entries