Designing the Data Schema
How to translate a business requirement into a Data Ocean SQL table schema — from identifying entities and attributes, to adding the standard metadata columns, to creating the migration script.
Step 1 — Identify the Business Entity
Start with the question: what real-world thing are we tracking? Each entity becomes a table. Common patterns:
- People: Customer, Employee, Contact, Applicant
- Events: Order, Invoice, Appointment, SupportTicket
- Assets: Product, InventoryItem, Document, Vehicle
- Relationships: CustomerOrder, EmployeeProject (junction tables)
For a lead management application, the primary entity is Lead. A lead represents a potential customer who has expressed interest but has not yet been qualified.
Step 2 — List the Business Attributes
For each entity, list the attributes that need to be captured. Focus on what the business actually needs, not what the technology allows:
-- Lead entity — business attributes identified with stakeholders:
-- Full name of the lead
-- Company name
-- Email address (required, unique per company)
-- Phone number
-- Source of the lead (web form, referral, conference, cold outreach)
-- Lead status (New, Contacted, Qualified, Disqualified, Converted)
-- Assigned sales representative (user ID)
-- Estimated deal value
-- Notes (free text)
-- Follow-up date
Step 3 — Design the SQL Schema
Translate the business attributes into SQL columns with correct types, constraints, and the required Data Ocean metadata columns:
CREATE TABLE Lead (
-- Primary key
LeadId uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
-- ── Required Data Ocean metadata columns ──
TenantId int NOT NULL,
CreatedAt datetime2 NOT NULL DEFAULT GETUTCDATE(),
UpdatedAt datetime2 NOT NULL DEFAULT GETUTCDATE(),
CreatedBy nvarchar(200) NULL,
IsDeleted bit NOT NULL DEFAULT 0,
DeletedAt datetime2 NULL,
-- ── Business columns ──
FullName nvarchar(300) NOT NULL,
CompanyName nvarchar(300) NULL,
Email nvarchar(300) NOT NULL,
Phone nvarchar(50) NULL,
Source nvarchar(100) NOT NULL DEFAULT 'Unknown',
Status nvarchar(50) NOT NULL DEFAULT 'New',
AssignedToUserId nvarchar(200) NULL,
EstimatedValue decimal(18,2) NULL,
Notes nvarchar(max) NULL,
FollowUpDate date NULL,
-- ── Optional AI enhancement columns ──
SummaryText nvarchar(max) NULL,
ClassificationLabel nvarchar(200) NULL, -- e.g., "High Value", "Unlikely to Convert"
SentimentScore decimal(5,4) NULL,
EmbeddingRef nvarchar(500) NULL,
AiProcessedAt datetime2 NULL,
AiModelVersion nvarchar(100) NULL,
-- Constraints
CONSTRAINT PK_Lead PRIMARY KEY CLUSTERED (LeadId)
);
-- ── Indexes ──
CREATE INDEX IX_Lead_TenantId_Status
ON Lead (TenantId, Status, UpdatedAt DESC)
INCLUDE (FullName, Email, AssignedToUserId, EstimatedValue)
WHERE IsDeleted = 0;
CREATE INDEX IX_Lead_TenantId_AssignedTo
ON Lead (TenantId, AssignedToUserId, Status)
WHERE IsDeleted = 0;
CREATE INDEX IX_Lead_TenantId_FollowUpDate
ON Lead (TenantId, FollowUpDate)
WHERE IsDeleted = 0 AND FollowUpDate IS NOT NULL;
CREATE INDEX IX_Lead_TenantId_AiProcessedAt
ON Lead (TenantId, AiProcessedAt)
WHERE AiProcessedAt IS NULL AND IsDeleted = 0;
Schema Design Decisions
| Decision | Choice Made | Rationale |
|---|---|---|
| Primary key type | uniqueidentifier (GUID) | GUIDs allow distributed ID generation — records created offline or in batch can be merged without collision |
| Primary key default | NEWSEQUENTIALID() | Sequential GUIDs avoid clustered index fragmentation (vs. random NEWID()) |
| Soft delete | IsDeleted + DeletedAt | Data Ocean convention — never physically delete records for audit trail |
| Status column type | nvarchar(50) with default | String status is more readable in SQL than integer status codes |
| Notes column type | nvarchar(max) | Notes are unbounded in length; nvarchar(max) handles this |
| FollowUpDate type | date (not datetime2) | Follow-up dates are day-granularity business data — storing time is unnecessary and confusing |
Even if you do not plan to use AI enrichment immediately, add the AI enhancement columns (SummaryText, ClassificationLabel, EmbeddingRef, AiProcessedAt) to every table from the start. Adding them later requires a schema migration and a backfill job. They are nullable, so they have zero impact on current functionality while keeping the door open for AI features.
Reference Data Tables
For columns like Source and Status that have a controlled set of valid values, create a companion reference data table rather than hardcoding values in application logic:
CREATE TABLE LeadStatus (
LeadStatusId int NOT NULL IDENTITY(1,1),
TenantId int NOT NULL,
Code nvarchar(50) NOT NULL,
Label nvarchar(200) NOT NULL,
SortOrder int NOT NULL DEFAULT 0,
IsActive bit NOT NULL DEFAULT 1,
CONSTRAINT PK_LeadStatus PRIMARY KEY (LeadStatusId),
CONSTRAINT UQ_LeadStatus_TenantCode UNIQUE (TenantId, Code)
);
-- Default values (seed via migration)
INSERT INTO LeadStatus (TenantId, Code, Label, SortOrder)
VALUES
(0, 'New', 'New Lead', 1), -- TenantId 0 = system defaults
(0, 'Contacted', 'Contacted', 2),
(0, 'Qualified', 'Qualified', 3),
(0, 'Disqualified', 'Disqualified', 4),
(0, 'Converted', 'Converted to Customer', 5);