Schema Discovery¶

Overview¶

Schema Discovery lets a structure's schema evolve as new data patterns emerge. Instead of defining the schema upfront, you can let Centrali infer field types from incoming data and suggest schema changes for you to review, which suits dynamic data ingestion scenarios.

Key Features¶

Schemaless Mode: Accept any data without predefined schema
Auto-Evolving Mode: Automatically detect and suggest new fields
AI-Powered Type Inference: Intelligently determines field types from sample data
Review Before Apply: Suggestions require approval before modifying schema

Schema Discovery Modes¶

Mode	Description	Use Case
Strict	All fields must be defined in schema	Production data with known structure
Schemaless	Accept any data, infer schema later	Initial development, data exploration
Auto-Evolving	Accept new fields, suggest schema updates	Dynamic integrations, evolving APIs

Strict Mode (Default)¶

Records must conform to the defined schema. Unknown fields are rejected.

// Schema defines: name, email
// This record is REJECTED (unknown field: phone)
{
  "name": "John",
  "email": "john@example.com",
  "phone": "555-1234"
}
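The rule above can be illustrated with a minimal Python sketch. It assumes a schema can be represented as a plain set of field names; the function name `validate_strict` and the message format are hypothetical and are not part of the Centrali API.

```python
def validate_strict(schema_fields: set[str], record: dict) -> list[str]:
    """Return the unknown field names; an empty list means the record is accepted."""
    return sorted(name for name in record if name not in schema_fields)


schema_fields = {"name", "email"}
record = {"name": "John", "email": "john@example.com", "phone": "555-1234"}

unknown = validate_strict(schema_fields, record)
if unknown:
    print(f"REJECTED (unknown field: {', '.join(unknown)})")
else:
    print("ACCEPTED")
```

The sketch checks only for unknown fields, since that is the behavior described here; any other validation a real structure performs is out of scope.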

Schemaless Mode¶

Records are accepted regardless of schema. Unknown fields are stored and analyzed.

// Schema defines: name, email
// This record is ACCEPTED (phone is stored as unknown)
{
  "name": "John",
  "email": "john@example.com",
  "phone": "555-1234"
}

Once the number of buffered records reaches the inference batch size (10 by default), the AI analyzes patterns and suggests adding phone as a string field.
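One way to picture schemaless handling is as a split of each record into the fields the schema already knows and the fields it does not. The Python sketch below assumes a schema is a set of field names; `split_record` is a hypothetical helper, not a Centrali function, and how Centrali actually stores unknown fields is not described here.

```python
def split_record(schema_fields: set[str], record: dict) -> tuple[dict, dict]:
    """Split a record into (known, unknown) parts; nothing is rejected."""
    known = {k: v for k, v in record.items() if k in schema_fields}
    unknown = {k: v for k, v in record.items() if k not in schema_fields}
    return known, unknown


known, unknown = split_record(
    {"name", "email"},
    {"name": "John", "email": "john@example.com", "phone": "555-1234"},
)
print(unknown)  # {'phone': '555-1234'}
```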

Auto-Evolving Mode¶

Auto-evolving mode works like schemaless mode but is designed for ongoing evolution: records with new fields are accepted, and each new field triggers a schema suggestion automatically.

Configuration¶

Schema Discovery is configured per structure in the Console UI under the Schema Discovery tab.

Setting the Mode¶

Navigate to your structure in the Console
Click the Schema Discovery tab
Select your desired mode from the dropdown
Click Save Configuration

Inference Batch Size¶

For schemaless and auto-evolving modes, you can set how many records to collect before AI analysis is triggered. The default is 10 records.

Smaller batches (5-10): Faster suggestions, less data to analyze
Larger batches (50-100): More accurate type inference, slower suggestions
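The batching behavior can be sketched as a small buffer that runs an analysis callback once the configured number of records has been collected. This is an illustrative Python model only: the class `InferenceBuffer`, its `analyze` callback, and the `trigger` method are assumptions made for the example and do not describe Centrali's internal implementation.

```python
from typing import Callable


class InferenceBuffer:
    """Collects records and runs an analysis callback once batch_size is reached."""

    def __init__(self, analyze: Callable[[list[dict]], None], batch_size: int = 10):
        self.analyze = analyze
        self.batch_size = batch_size
        self.records: list[dict] = []

    @property
    def buffer_count(self) -> int:
        return len(self.records)

    def add(self, record: dict) -> None:
        self.records.append(record)
        if len(self.records) >= self.batch_size:
            self.trigger()

    def trigger(self) -> None:
        """Analyze whatever is buffered right now, even if the batch is not full."""
        if self.records:
            batch, self.records = self.records, []
            self.analyze(batch)


batches: list[int] = []
buffer = InferenceBuffer(analyze=lambda batch: batches.append(len(batch)), batch_size=10)
for i in range(25):
    buffer.add({"id": i})
print(batches, buffer.buffer_count)  # [10, 10] 5
```

With a batch size of 10, 25 records produce two analysis runs and leave 5 records waiting in the buffer. A smaller batch size would run the analysis more often on fewer samples, which is the trade-off described above.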

Using Schema Discovery¶

Scan Existing Records¶

To discover unknown fields in existing data:

Go to the Schema Discovery tab
Click Scan Existing Records
The system analyzes records and generates suggestions

Trigger Manual Inference¶

To process buffered records immediately:

Check the Buffer Count in the summary
Click Trigger Inference to analyze buffered data

Review Suggestions¶

Schema suggestions appear in the Schema Suggestions table:

Accept: Add the suggested field to your schema
Reject: Dismiss the suggestion
View Details: See sample values and inferred type

Suggestion Types¶

Operation	Description
Add Field	New field detected in incoming data
Update Type	Field exists but type mismatch detected
Initial Schema	Full schema inferred for schemaless structure

Suggestion Details¶

Each suggestion includes:

Field Name: The discovered field
Inferred Type: Suggested type (string, number, boolean, etc.)
Confidence: How confident the AI is in the type inference
Sample Values: Actual values found in records
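As a rough mental model, a suggestion can be thought of as a small record carrying these details plus the operation from the Suggestion Types table. The Python dataclass below is a hypothetical shape for illustration; the attribute names and the 0.0 to 1.0 confidence scale are assumptions, not the documented format of a Centrali suggestion.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaSuggestion:
    operation: str       # "Add Field", "Update Type", or "Initial Schema"
    field_name: str      # the discovered field
    inferred_type: str   # e.g. "string", "number", "boolean"
    confidence: float    # assumed 0.0-1.0 score; the real scale is not documented here
    sample_values: list = field(default_factory=list)


# Illustrative values only, not output from a real inference run.
suggestion = SchemaSuggestion(
    operation="Add Field",
    field_name="phone",
    inferred_type="string",
    confidence=0.9,
    sample_values=["555-1234"],
)
print(suggestion.field_name, suggestion.inferred_type)  # phone string
```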

How Type Inference Works¶

The AI analyzes sample values to determine the most appropriate type:

Sample Values	Inferred Type
`"hello"`, `"world"`	String
`123`, `456`, `789`	Number
`true`, `false`	Boolean
`"2024-01-15"`, `"2024-02-20"`	DateTime
`"john@example.com"`	String (with email pattern hint)
`{"nested": "object"}`	Object
`[1, 2, 3]`	Array
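To make the table concrete, here is a rule-based Python sketch that reproduces the same mappings. It is a deliberately simple stand-in, not the AI model Centrali uses: the function names, the email regular expression, and the "Mixed" fallback are all assumptions made for this example.

```python
import re
from datetime import datetime

# Loose email shape check, used only to attach a pattern hint.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def _is_iso_datetime(text: str) -> bool:
    """True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(text.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def infer_value_type(value) -> str:
    # bool is checked before numbers because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, str):
        if _is_iso_datetime(value):
            return "DateTime"
        if EMAIL_RE.match(value):
            return "String (with email pattern hint)"
        return "String"
    return "Unknown"


def infer_field_type(samples: list) -> str:
    """Return the shared type of all samples, or "Mixed" if they disagree."""
    types = {infer_value_type(v) for v in samples}
    return types.pop() if len(types) == 1 else "Mixed"


print(infer_field_type(["2024-01-15", "2024-02-20"]))  # DateTime
print(infer_field_type([123, 456, 789]))               # Number
```

Fixed rules like these break down on ambiguous data, for example numeric strings or mixed date formats, which is one reason to check the sample values shown with each suggestion.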

Best Practices¶

Start Schemaless, Migrate to Strict¶

For new integrations:

Start in schemaless mode
Collect real data
Accept schema suggestions
Switch to strict mode for production

Review Suggestions Carefully¶

AI inference can be wrong. Before accepting a suggestion, check its inferred type, confidence, and sample values against the data you expect.

Use Auto-Evolving for APIs¶

When integrating with external APIs that may add fields over time, auto-evolving mode ensures you capture new data without breaking ingestion.

Set Appropriate Batch Sizes¶

Prototype/testing: Small batches (5-10)
Production: Larger batches (20-50) for accuracy

Handle Type Conflicts¶

If the same field appears with different types, the AI suggests the most permissive type. Review these carefully.
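The exact rule Centrali uses to pick the "most permissive" type is not specified here. The Python sketch below shows one plausible interpretation, under the assumption that conflicting scalar types fall back to String because any scalar value can be represented as text. The function name, the type names, and the fallback rule are all hypothetical.

```python
SCALAR_TYPES = {"Boolean", "Number", "DateTime", "String"}


def most_permissive(observed_types: set[str]) -> str:
    """Pick a single type for a field that was observed with the given types."""
    if not observed_types:
        raise ValueError("at least one observed type is required")
    if len(observed_types) == 1:
        return next(iter(observed_types))
    if observed_types <= SCALAR_TYPES:
        # Assumed rule: every scalar can be stored as text, so String is the widest.
        return "String"
    # Scalars mixed with objects or arrays have no obvious common type.
    return "Mixed (needs manual review)"


print(most_permissive({"Number", "String"}))  # String
```

Under this assumed rule, a field that is usually numeric but occasionally arrives as text would be suggested as String, which may or may not be what you want.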

Limits¶

Limit	Value
Max fields per suggestion	100
Max nesting depth	3 levels
Max string length sampled	1000 characters
Suggestion retention	90 days
Max buffered records	1000
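If you want to check sample data against two of these limits before sending it, a small Python sketch like the following may help. It assumes that nesting depth is counted with the top-level record as level 1, and it does not model what Centrali does when a limit is exceeded, since that is not described here. The constant and function names are hypothetical.

```python
MAX_NESTING_DEPTH = 3
MAX_SAMPLED_STRING_LENGTH = 1000


def nesting_depth(value) -> int:
    """Depth of nested objects and arrays; a scalar has depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def sample_string(text: str) -> str:
    """Keep only the part of a string that falls within the sampling limit."""
    return text[:MAX_SAMPLED_STRING_LENGTH]


record = {"payload": {"customer": {"address": {"city": "Berlin"}}}}
depth = nesting_depth(record)
print(depth, depth <= MAX_NESTING_DEPTH)  # 4 False
```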

Example Workflow¶

Scenario: New Webhook Integration¶

You're receiving webhook data from a third-party service with an evolving schema.

Create Structure in schemaless mode
Configure Integration to POST data to Centrali
Collect Data - let records flow in
Review Suggestions - AI detects fields like event_type, payload, timestamp
Accept Suggestions - fields are added to schema
Switch to Auto-Evolving - new fields are detected automatically going forward

// Initial webhook data
{
  "event_type": "order.created",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "order_id": "ord_123",
    "amount": 99.99
  }
}

// AI suggests:
// - event_type: String
// - timestamp: DateTime
// - payload: Object with nested fields
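To see which fields such a record contains, including the nested ones under payload, you can walk it and list each field by its dotted path. The Python sketch below is a local illustration only; `field_paths` is a hypothetical helper and the dotted-path notation is an assumption, not a documented Centrali convention. It reports plain JSON types, so timestamp shows up as a string; recognizing it as DateTime is an additional inference step.

```python
def field_paths(record: dict, prefix: str = "") -> dict[str, str]:
    """Map dotted field paths to JSON type names, descending into objects."""
    paths: dict[str, str] = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths[path] = "object"
            paths.update(field_paths(value, prefix=f"{path}."))
        elif isinstance(value, bool):  # checked before numbers: bool subclasses int
            paths[path] = "boolean"
        elif isinstance(value, (int, float)):
            paths[path] = "number"
        elif isinstance(value, list):
            paths[path] = "array"
        elif isinstance(value, str):
            paths[path] = "string"
        else:
            paths[path] = "null" if value is None else "unknown"
    return paths


webhook = {
    "event_type": "order.created",
    "timestamp": "2024-01-15T10:30:00Z",
    "payload": {"order_id": "ord_123", "amount": 99.99},
}
paths = field_paths(webhook)
for path, type_name in paths.items():
    print(f"{path}: {type_name}")
```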

Related Documentation¶

AI Validation: Data quality validation
Anomaly Insights: Anomaly detection
Structures & Records: Data schemas and entries
Structures API: API reference