Schema Discovery¶
Overview¶
Schema Discovery lets a structure's schema evolve automatically as new data patterns emerge. Instead of defining your schema upfront, you can let Centrali infer field types from incoming data and suggest schema changes, which suits dynamic data ingestion scenarios.
Key Features¶
- Schemaless Mode: Accept any data without predefined schema
- Auto-Evolving Mode: Automatically detect and suggest new fields
- AI-Powered Type Inference: Intelligently determines field types from sample data
- Review Before Apply: Suggestions require approval before modifying schema
Schema Discovery Modes¶
| Mode | Description | Use Case |
|---|---|---|
| Strict | All fields must be defined in schema | Production data with known structure |
| Schemaless | Accept any data, infer schema later | Initial development, data exploration |
| Auto-Evolving | Accept new fields, suggest schema updates | Dynamic integrations, evolving APIs |
Strict Mode (Default)¶
Records must conform to the defined schema. Unknown fields are rejected.
// Schema defines: name, email
// This record is REJECTED (unknown field: phone)
{
  "name": "John",
  "email": "john@example.com",
  "phone": "555-1234"
}
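The strict-mode rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Centrali's implementation; `unknown_fields` and `validate_strict` are hypothetical helper names, and the schema is reduced to a set of top-level field names.

```python
def unknown_fields(record, schema_fields):
    """Return the record's top-level keys that the schema does not define."""
    return sorted(key for key in record if key not in schema_fields)


def validate_strict(record, schema_fields):
    """Reject the record if it carries any field missing from the schema."""
    extra = unknown_fields(record, schema_fields)
    if extra:
        raise ValueError("unknown field(s): " + ", ".join(extra))
    return record


schema_fields = {"name", "email"}  # hypothetical schema for the example above
record = {"name": "John", "email": "john@example.com", "phone": "555-1234"}

try:
    validate_strict(record, schema_fields)
    rejected = False
except ValueError:
    rejected = True  # "phone" is not in the schema, so the record is rejected
```

Running the same check before sending data can help catch rejections early when a structure is in strict mode.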
Schemaless Mode¶
Records are accepted regardless of schema. Unknown fields are stored and analyzed.
// Schema defines: name, email
// This record is ACCEPTED (phone is stored as unknown)
{
  "name": "John",
  "email": "john@example.com",
  "phone": "555-1234"
}
Once enough records have been collected (the inference batch size, 10 by default), the AI analyzes patterns and suggests adding phone as a string field.
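One way to picture how a schemaless record is handled is as a split into known and unknown parts. The sketch below assumes a schema reduced to a set of field names; `split_record` is a hypothetical helper, and how Centrali actually stores unknown fields is not specified here.

```python
def split_record(record, schema_fields):
    """Separate fields the schema defines from fields it does not."""
    known = {k: v for k, v in record.items() if k in schema_fields}
    unknown = {k: v for k, v in record.items() if k not in schema_fields}
    return known, unknown


known, unknown = split_record(
    {"name": "John", "email": "john@example.com", "phone": "555-1234"},
    {"name", "email"},
)
# `unknown` now holds the values that later analysis would look at.
```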
Auto-Evolving Mode¶
Like schemaless mode, but designed for ongoing evolution. New fields trigger automatic suggestions.
Configuration¶
Schema Discovery is configured per structure in the Console UI under the Schema Discovery tab.
Setting the Mode¶
- Navigate to your structure in the Console
- Click the Schema Discovery tab
- Select your desired mode from the dropdown
- Click Save Configuration
Inference Batch Size¶
For schemaless and auto-evolving modes, you can set how many records to collect before triggering AI analysis. Default is 10 records.
- Smaller batches (5-10): Faster suggestions, less data to analyze
- Larger batches (50-100): More accurate type inference, slower suggestions
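The batching behavior can be sketched as a small buffer that hands records to an analysis step once the batch size is reached. This is a simplified model under stated assumptions: `InferenceBuffer`, `on_batch`, and `flush` are hypothetical names, and the real buffering happens inside Centrali.

```python
class InferenceBuffer:
    """Collects records and passes them to `on_batch` once `batch_size` is reached."""

    def __init__(self, on_batch, batch_size=10):
        self.on_batch = on_batch      # callback standing in for the analysis step
        self.batch_size = batch_size  # 10 mirrors the documented default
        self.records = []

    def add(self, record):
        """Buffer a record; return True if this call triggered analysis."""
        self.records.append(record)
        if len(self.records) >= self.batch_size:
            self.flush()
            return True
        return False

    def flush(self):
        """Analyze whatever is buffered now and return how many records were sent."""
        batch, self.records = self.records, []
        if batch:
            self.on_batch(batch)
        return len(batch)


batches = []
buffer = InferenceBuffer(batches.append, batch_size=3)
triggered = [buffer.add({"n": i}) for i in range(3)]
```

A smaller `batch_size` makes `add` trigger sooner with fewer samples per batch, which is the trade-off the two bullets above describe.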
Using Schema Discovery¶
Scan Existing Records¶
To discover unknown fields in existing data:
- Go to the Schema Discovery tab
- Click Scan Existing Records
- The system analyzes records and generates suggestions
Trigger Manual Inference¶
To process buffered records immediately:
- Check the Buffer Count in the summary
- Click Trigger Inference to analyze buffered data
Review Suggestions¶
Schema suggestions appear in the Schema Suggestions table:
- Accept: Add the suggested field to your schema
- Reject: Dismiss the suggestion
- View Details: See sample values and inferred type
Suggestion Types¶
| Operation | Description |
|---|---|
| Add Field | New field detected in incoming data |
| Update Type | Field exists but type mismatch detected |
| Initial Schema | Full schema inferred for schemaless structure |
Suggestion Details¶
Each suggestion includes:
- Field Name: The discovered field
- Inferred Type: Suggested type (string, number, boolean, etc.)
- Confidence: How confident the AI is in the type inference
- Sample Values: Actual values found in records
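For readers who think in data structures, a suggestion could be modeled roughly as below. The class name, attribute names, operation labels, status values, and the 0 to 1 confidence scale are all assumptions made for illustration; they are not taken from the Centrali API.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaSuggestion:
    operation: str       # e.g. "add_field", "update_type", "initial_schema" (assumed labels)
    field_name: str
    inferred_type: str
    confidence: float    # assumed to range from 0.0 to 1.0
    sample_values: list = field(default_factory=list)
    status: str = "pending"  # assumed states: pending, accepted, rejected


def accept(suggestion, schema):
    """Mark the suggestion accepted and return a new schema including its field."""
    suggestion.status = "accepted"
    return {**schema, suggestion.field_name: suggestion.inferred_type}


suggestion = SchemaSuggestion("add_field", "phone", "string", 0.9, ["555-1234"])
schema = {"name": "string", "email": "string"}
updated = accept(suggestion, schema)
```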
How Type Inference Works¶
The AI analyzes sample values to determine the most appropriate type:
| Sample Values | Inferred Type |
|---|---|
| "hello", "world" | String |
| 123, 456, 789 | Number |
| true, false | Boolean |
| "2024-01-15", "2024-02-20" | DateTime |
| "john@example.com" | String (with email pattern hint) |
| {"nested": "object"} | Object |
| [1, 2, 3] | Array |
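The mapping in the table can be approximated with plain rules. The sketch below is a rule-based stand-in, not the AI inference Centrali uses: `infer_value_type` and `infer_field_type` are hypothetical names, date detection relies on Python's ISO 8601 parser, and the confidence figure is simply the share of samples that agree on the winning type.

```python
from collections import Counter
from datetime import datetime


def infer_value_type(value):
    """Map one decoded JSON value to a type name from the table above."""
    if isinstance(value, bool):  # check before numbers: bool is a subclass of int
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, str):
        try:
            # Accept a trailing "Z" by rewriting it as an explicit UTC offset.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            return "DateTime"
        except ValueError:
            return "String"
    return "Unknown"


def infer_field_type(samples):
    """Return (most common type, share of samples with that type)."""
    if not samples:
        return "Unknown", 0.0
    counts = Counter(infer_value_type(v) for v in samples)
    best, hits = counts.most_common(1)[0]
    return best, hits / len(samples)
```

Pattern hints such as "email" are left out of this sketch; a real implementation would need additional checks on string values.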
Best Practices¶
Start Schemaless, Migrate to Strict¶
For new integrations:
1. Start in schemaless mode
2. Collect real data
3. Accept schema suggestions
4. Switch to strict mode for production
Review Suggestions Carefully¶
AI inference is not perfect. Check each suggested type against your expected data patterns, using the confidence and sample values shown with the suggestion.
Use Auto-Evolving for APIs¶
When integrating with external APIs that may add fields over time, auto-evolving mode ensures you capture new data without breaking ingestion.
Set Appropriate Batch Sizes¶
- Prototype/testing: Small batches (5-10)
- Production: Larger batches (20-50) for accuracy
Handle Type Conflicts¶
If the same field appears with different types, the AI suggests the most permissive type. Review these carefully.
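What "most permissive" means in practice is not spelled out here, so the following is only one plausible reading: when a field shows up with several scalar types, fall back to String, since every scalar value can be represented as text. The function name `resolve_conflict`, the `SCALARS` set, and the choice to return `None` (flag for manual review) for mixes involving objects or arrays are all assumptions.

```python
SCALARS = {"Boolean", "Number", "DateTime", "String"}


def resolve_conflict(types):
    """Pick a single type for a field observed with the given type names."""
    distinct = set(types)
    if len(distinct) == 1:
        return next(iter(distinct))  # no conflict
    if distinct and distinct.issubset(SCALARS):
        return "String"              # assumed widest scalar type
    return None                      # no safe widening: leave for manual review
```

Under this reading, a field seen as both Number and String would be suggested as String, which is the kind of suggestion worth checking against the sample values before accepting.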
Limits¶
| Limit | Value |
|---|---|
| Max fields per suggestion | 100 |
| Max nesting depth | 3 levels |
| Max string length sampled | 1000 characters |
| Suggestion retention | 90 days |
| Max buffered records | 1000 |
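To check data against the nesting and string-length limits before sending it, something like the sketch below could be used. How Centrali counts nesting levels is not stated, so the convention here (a scalar is depth 0, each enclosing object or array adds one level) is an assumption, as are the helper names.

```python
MAX_NESTING_DEPTH = 3     # from the limits table
MAX_SAMPLE_LENGTH = 1000  # from the limits table


def nesting_depth(value):
    """Depth of a decoded JSON value; scalars count as 0 (assumed convention)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def truncate_sample(value):
    """Trim string samples to the sampled length; leave other values untouched."""
    return value[:MAX_SAMPLE_LENGTH] if isinstance(value, str) else value


record = {"a": {"b": {"c": 1}}}
within_limit = nesting_depth(record) <= MAX_NESTING_DEPTH
```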
Example Workflow¶
Scenario: New Webhook Integration¶
You're receiving webhook data from a third-party service with an evolving schema.
- Create Structure in schemaless mode
- Configure Integration to POST data to Centrali
- Collect Data - let records flow in
- Review Suggestions - AI detects fields like event_type, payload, timestamp
- Accept Suggestions - fields are added to schema
- Switch to Auto-Evolving - new fields are detected automatically going forward
// Initial webhook data
{
  "event_type": "order.created",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "order_id": "ord_123",
    "amount": 99.99
  }
}
// AI suggests:
// - event_type: String
// - timestamp: DateTime
// - payload: Object with nested fields
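To see which fields a webhook body like the one above exposes, including the nested ones under payload, the record can be walked recursively. This is an illustrative sketch: `field_paths` is a hypothetical helper, and the dotted-path notation is an assumption rather than Centrali's naming for nested fields.

```python
def field_paths(record, prefix=""):
    """List every field in a nested dict as a dotted path, parents first."""
    paths = []
    for key, value in record.items():
        path = prefix + key
        paths.append(path)
        if isinstance(value, dict):
            paths.extend(field_paths(value, prefix=path + "."))
    return paths


event = {
    "event_type": "order.created",
    "timestamp": "2024-01-15T10:30:00Z",
    "payload": {"order_id": "ord_123", "amount": 99.99},
}
paths = field_paths(event)
```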
Related Documentation¶
- AI Validation - Data quality validation
- Anomaly Insights - Anomaly detection
- Structures & Records - Data schemas and entries
- Structures API - API reference