Schema Discovery

Overview

Schema Discovery lets a structure's schema evolve as new data patterns emerge. Instead of defining the schema upfront, you can let Centrali infer field types from incoming data and suggest schema changes, which suits dynamic data ingestion scenarios.

Key Features

  • Schemaless Mode: Accept any data without predefined schema
  • Auto-Evolving Mode: Automatically detect and suggest new fields
  • AI-Powered Type Inference: Intelligently determines field types from sample data
  • Review Before Apply: Suggestions require approval before modifying schema

Schema Discovery Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| Strict | All fields must be defined in schema | Production data with known structure |
| Schemaless | Accept any data, infer schema later | Initial development, data exploration |
| Auto-Evolving | Accept new fields, suggest schema updates | Dynamic integrations, evolving APIs |

Strict Mode (Default)

Records must conform to the defined schema. Unknown fields are rejected.

// Schema defines: name, email
// This record is REJECTED (unknown field: phone)
{
  "name": "John",
  "email": "john@example.com",
  "phone": "555-1234"
}
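
As a rough illustration of the strict-mode rule above, the following Python sketch flags a record that carries fields outside the schema. The function name `find_unknown_fields` and the set-based schema are hypothetical; they model the documented behavior locally and are not Centrali's implementation or API. The sketch only checks for unknown fields and does not model required fields or type checks.

```python
def find_unknown_fields(record: dict, schema_fields: set) -> list:
    """Return the field names in the record that the schema does not define."""
    return sorted(set(record) - schema_fields)


# Schema defines: name, email (same as the example above)
schema_fields = {"name", "email"}
record = {"name": "John", "email": "john@example.com", "phone": "555-1234"}

unknown = find_unknown_fields(record, schema_fields)
if unknown:
    print(f"REJECTED (unknown fields: {', '.join(unknown)})")
else:
    print("ACCEPTED")
```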

Schemaless Mode

Records are accepted regardless of schema. Unknown fields are stored and analyzed.

// Schema defines: name, email
// This record is ACCEPTED (phone is stored as unknown)
{
  "name": "John",
  "email": "john@example.com",
  "phone": "555-1234"
}

Once enough records are buffered (the inference batch size, 10 by default), the AI analyzes them and suggests adding phone as a string field.

Auto-Evolving Mode

Behaves like schemaless mode but is designed for ongoing evolution: records with new fields are accepted, and each new field automatically triggers a schema suggestion.

Configuration

Schema Discovery is configured per structure in the Console UI under the Schema Discovery tab.

Setting the Mode

  1. Navigate to your structure in the Console
  2. Click the Schema Discovery tab
  3. Select your desired mode from the dropdown
  4. Click Save Configuration

Inference Batch Size

For schemaless and auto-evolving modes, you can set how many records to collect before triggering AI analysis. Default is 10 records.

  • Smaller batches (5-10): Faster suggestions, less data to analyze
  • Larger batches (50-100): More accurate type inference, slower suggestions
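
To make the batch-size trade-off concrete, here is a minimal Python sketch of a buffer that collects records and hands the unknown-field samples to analysis only once the batch is full. The `DiscoveryBuffer` class and its methods are hypothetical names for illustration; the real buffering happens inside Centrali and may work differently.

```python
from collections import defaultdict


class DiscoveryBuffer:
    """Hypothetical in-memory model of the record buffer (not Centrali's code)."""

    def __init__(self, schema_fields, batch_size=10):
        self.schema_fields = set(schema_fields)
        self.batch_size = batch_size  # the documented default is 10 records
        self.records = []

    def add(self, record):
        """Buffer a record.

        Returns None while the batch is still filling. Once the batch is full,
        returns the sample values of every unknown field and empties the buffer.
        """
        self.records.append(record)
        if len(self.records) < self.batch_size:
            return None
        samples = defaultdict(list)
        for buffered in self.records:
            for field, value in buffered.items():
                if field not in self.schema_fields:
                    samples[field].append(value)
        self.records.clear()
        return dict(samples)


# A batch size of 3 keeps the demo short.
buffer = DiscoveryBuffer({"name", "email"}, batch_size=3)
results = [
    buffer.add({"name": f"user{i}", "email": f"user{i}@example.com", "phone": f"555-000{i}"})
    for i in range(3)
]
print(results)
```

With a larger `batch_size`, each analysis sees more sample values per field, at the cost of waiting longer for the first suggestion.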

Using Schema Discovery

Scan Existing Records

To discover unknown fields in existing data:

  1. Go to the Schema Discovery tab
  2. Click Scan Existing Records
  3. The system analyzes records and generates suggestions

Trigger Manual Inference

To process buffered records immediately:

  1. Check the Buffer Count in the summary
  2. Click Trigger Inference to analyze buffered data

Review Suggestions

Schema suggestions appear in the Schema Suggestions table:

  • Accept: Add the suggested field to your schema
  • Reject: Dismiss the suggestion
  • View Details: See sample values and inferred type

Suggestion Types

| Operation | Description |
| --- | --- |
| Add Field | New field detected in incoming data |
| Update Type | Field exists but type mismatch detected |
| Initial Schema | Full schema inferred for schemaless structure |

Suggestion Details

Each suggestion includes:

  • Field Name: The discovered field
  • Inferred Type: Suggested type (string, number, boolean, etc.)
  • Confidence: How confident the AI is in the type inference
  • Sample Values: Actual values found in records

How Type Inference Works

The AI analyzes sample values to determine the most appropriate type:

| Sample Values | Inferred Type |
| --- | --- |
| "hello", "world" | String |
| 123, 456, 789 | Number |
| true, false | Boolean |
| "2024-01-15", "2024-02-20" | DateTime |
| "john@example.com" | String (with email pattern hint) |
| {"nested": "object"} | Object |
| [1, 2, 3] | Array |
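
The mapping in the table can be approximated with simple rules. The Python sketch below is a rule-based stand-in written for illustration, not the AI inference Centrali runs; the function names and the "Mixed" and "Unknown" labels are assumptions, and the email pattern hint is not modeled.

```python
from datetime import datetime


def classify(value) -> str:
    """Map one sample value to a type name from the table above."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # accepts ISO dates such as "2024-01-15"
            return "DateTime"
        except ValueError:
            return "String"
    return "Unknown"


def infer_type(samples: list) -> str:
    """Return the shared type of all samples, or "Mixed" if they disagree."""
    if not samples:
        return "Unknown"
    kinds = {classify(sample) for sample in samples}
    return kinds.pop() if len(kinds) == 1 else "Mixed"


print(infer_type(["hello", "world"]))
print(infer_type([123, 456, 789]))
print(infer_type(["2024-01-15", "2024-02-20"]))
```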

Best Practices

Start Schemaless, Migrate to Strict

For new integrations:

  1. Start in schemaless mode
  2. Collect real data
  3. Accept schema suggestions
  4. Switch to strict mode for production

Review Suggestions Carefully

AI inference is usually accurate but not perfect. Before accepting a suggestion, compare the inferred type and sample values against the data you expect the field to hold.

Use Auto-Evolving for APIs

When integrating with external APIs that may add fields over time, auto-evolving mode ensures you capture new data without breaking ingestion.

Set Appropriate Batch Sizes

  • Prototype/testing: Small batches (5-10)
  • Production: Larger batches (20-50) for accuracy

Handle Type Conflicts

If the same field appears with different types, the AI suggests the most permissive type. Review these carefully.
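
When reviewing such a suggestion, it can help to see which types actually occur in your own data. The Python sketch below is a hypothetical local pre-check, not a Centrali feature: it groups the values of each field by a coarse type name and reports only the fields where more than one type appears. The function names and type labels are assumptions made for this example, and the sketch does not reproduce how Centrali picks the most permissive type.

```python
def kind(value) -> str:
    """Coarse type label for a single value (labels chosen for this example)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null" if value is None else "unknown"


def find_type_conflicts(records: list) -> dict:
    """Map each field to the set of kinds seen, keeping only fields with more than one."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(kind(value))
    return {field: kinds for field, kinds in seen.items() if len(kinds) > 1}


records = [
    {"order_id": 123, "amount": 99.99},
    {"order_id": "ord_124", "amount": 10},
]
conflicts = find_type_conflicts(records)
for field, kinds in conflicts.items():
    print(field, sorted(kinds))
```

Here `amount` is not reported because integers and floats both count as numbers, while `order_id` is reported because it appears as both a number and a string.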

Limits

| Limit | Value |
| --- | --- |
| Max fields per suggestion | 100 |
| Max nesting depth | 3 levels |
| Max string length sampled | 1000 characters |
| Suggestion retention | 90 days |
| Max buffered records | 1000 |

Example Workflow

Scenario: New Webhook Integration

You're receiving webhook data from a third-party service with an evolving schema.

  1. Create Structure in schemaless mode
  2. Configure Integration to POST data to Centrali
  3. Collect Data - let records flow in
  4. Review Suggestions - AI detects fields like event_type, payload, timestamp
  5. Accept Suggestions - fields are added to schema
  6. Switch to Auto-Evolving - new fields are detected automatically going forward

// Initial webhook data
{
  "event_type": "order.created",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "order_id": "ord_123",
    "amount": 99.99
  }
}

// AI suggests:
// - event_type: String
// - timestamp: DateTime
// - payload: Object with nested fields
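
To preview what a review of this payload might look like, the Python sketch below walks the webhook record and lists each field with a rule-based type guess. It is a local approximation under stated assumptions, not Centrali's AI inference: the dotted path notation, the function names, and the way the 3-level depth limit is counted are all hypothetical.

```python
from datetime import datetime

MAX_DEPTH = 3  # taken from the Limits table; how levels are counted is an assumption


def describe(value) -> str:
    """Guess a type name for one value using simple rules."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, str):
        # Older Python versions reject a trailing "Z", so rewrite it as a UTC offset.
        candidate = value[:-1] + "+00:00" if value.endswith("Z") else value
        try:
            datetime.fromisoformat(candidate)
            return "DateTime"
        except ValueError:
            return "String"
    return "Unknown"


def walk(record: dict, prefix: str = "", depth: int = 1) -> dict:
    """Return {field path: guessed type}, descending into nested objects up to MAX_DEPTH."""
    fields = {}
    for name, value in record.items():
        path = f"{prefix}{name}"
        fields[path] = describe(value)
        if isinstance(value, dict) and depth < MAX_DEPTH:
            fields.update(walk(value, prefix=f"{path}.", depth=depth + 1))
    return fields


webhook = {
    "event_type": "order.created",
    "timestamp": "2024-01-15T10:30:00Z",
    "payload": {"order_id": "ord_123", "amount": 99.99},
}
for path, guessed in walk(webhook).items():
    print(f"{path}: {guessed}")
```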