Data Modelling

Designing Schemas for Messy Real-World Business Logic

The database schema is the foundation everything else builds on. Get it right, and your application flows naturally. Get it wrong, and you'll fight the data model for the lifetime of the system.

Good data modelling captures business reality accurately. It handles the exceptions, edge cases, and "but sometimes..." scenarios that real businesses have. Poor data modelling forces workarounds, creates data integrity issues, and makes simple queries complicated.

Architecture Depth
Level 1: The Tutorial View.

Standard relational model. Users have Orders. Orders have Items. This works for demos, but fails when business logic gets messy.
Example schema:

users: id BIGINT, email VARCHAR, tier ENUM
orders: id BIGINT, user_id BIGINT, status VARCHAR
order_lines: id BIGINT, order_id BIGINT, qty INT
pricing_rules: id BIGINT, name VARCHAR, conditions JSONB, actions JSONB
products: id BIGINT, sku VARCHAR, attributes JSONB
services: id BIGINT, duration INT, renew_at DATE
audit_log: id BIGINT, subject_id BIGINT, subject_type TEXT, changes JSONB

The Constraint: Business Logic is Messy

Tutorial schemas are clean. A user has many orders. An order has many line items. Foreign keys point where they should. The happy path works.

Then reality arrives. A customer merges with another company, and you need to combine their order histories without losing audit trails. A product gets discontinued, but existing orders still reference it. An invoice needs to reflect the price at the time of sale, not the current price. A manager needs to see "what did this record look like last Tuesday?"

These aren't edge cases. They're the normal state of business software that runs for more than six months. The schema either handles them gracefully or forces increasingly desperate workarounds in application code.

The pattern: Most schema problems aren't discovered during development. They surface six months later when someone asks for a report that the schema makes impossible, or when a business rule changes and you realise the data model assumed it never would.

Five Decisions That Shape Every Schema

Before diving into patterns and SQL, these are the decisions you will face on every project. The rest of this page gives you the frameworks to make them deliberately rather than by accident.

1. Normalisation level

Start normalised. Denormalise only for measured performance problems, using materialised views or snapshot-on-write. Document every trade-off.

2. Temporal data approach

Will anyone ask "what was true on this date?" If yes (and they will), choose between effective dating, snapshot-on-write, or system-versioned tables before you have live data.

3. Soft delete strategy

Soft deletes add query complexity and index bloat. Consider partial indexes, view abstractions, archive tables, or hard deletes with audit logging. For UK businesses, factor in GDPR right-to-erasure obligations.

4. Tenancy pattern

Row-level tenant_id, schema-per-tenant, or database-per-tenant. This decision cascades into every constraint, index, migration, and backup strategy.

5. Constraint philosophy

The database is the last line of defence. Encode business rules as constraints (CHECK, UNIQUE, EXCLUSION, foreign keys) rather than relying on application code alone. Application code has bugs. Constraints don't.


Domain Modelling: Start with the Language

Before writing any SQL, map the language of the business. What do they call things? An "account" might mean a customer to sales, a login to IT, and a billing entity to finance. A "project" in one department might overlap with "job" in another.

These aren't pedantic distinctions. They determine table names, relationship structures, and the queries you'll write for years. Get them wrong, and every developer who touches the codebase will need a translation guide.

1. Interview the domain experts

Talk to the people who do the work. They know the exceptions. "Usually orders have one customer, but sometimes we split billing between two." These details determine schema structure.

2. Build a glossary

Document terms and their meanings across departments. When sales says "account" and support says "account", do they mean the same entity? Resolve ambiguities before encoding them in tables.

3. Map the relationships

Draw the connections. Can a customer have multiple billing addresses? Can a line item span multiple orders? Can an employee report to multiple managers? Cardinality matters.

4. Find the lifecycle events

What happens when an order is cancelled? When a customer churns? When a product is discontinued? These events reveal requirements that happy-path discussions miss.

The result is a ubiquitous language: terms that mean the same thing in code, in the database, and in conversations with the business. When the code says Invoice, everyone knows what that means. No translation required. This shared vocabulary is essential for building a single source of truth.


Normalisation: The Naive Approach vs. Deliberate Design

The Naive Approach: Normalise Everything

Database courses teach normalisation as dogma. Third normal form. No redundant data. Every fact stored once. In isolation, this is correct.

In production, it creates problems: the "show me the order summary" query that joins six tables, the report that runs overnight because it aggregates across perfectly normalised tables, the API endpoint that times out because it needs to reconstruct data from fifteen related records.

The Other Naive Approach: Denormalise Everything

The pendulum swings. "Normalisation is slow, let's just copy data everywhere." Now you have synchronisation bugs. The customer's name is different in the orders table than in the customers table. Someone updated one and not the other. Which is correct? No one knows.

The Better Pattern: Normalise First, Then Denormalise Deliberately

Start with a normalised design. Understand the canonical structure. Then, with specific, measured performance problems in hand, introduce denormalisation where the trade-offs are worth it.

Scenario | Normalised approach | When to denormalise
Customer name on orders | Join to customers table | Copy at order time if the order must preserve the commercial record at the time of sale. This is about snapshot semantics, not join cost.
Order totals | SUM(line_items.price * quantity) | Store a computed total if the calculation is expensive and totals are queried frequently.
Product categories | Junction table for many-to-many | Keep the junction table as the source of truth. If reads are hot, add a materialised view or a cached display column.
Reporting aggregates | Calculate on demand | Materialised views or summary tables for dashboards that run constantly.

The key word is "deliberately". Every denormalisation is a trade-off: storage vs. computation, consistency vs. performance. Document why you made the trade-off. Future developers (including yourself in six months) will need to know.
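For the reporting-aggregates row in particular, a materialised view is the deliberate middle ground: the normalised tables stay the source of truth, and the expensive aggregation is pre-computed on a schedule. A minimal sketch, assuming order_lines carries price and qty columns (names are illustrative):

-- Pre-computed daily revenue per customer
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT o.customer_id,
       date_trunc('day', o.created_at) AS order_day,
       SUM(ol.price * ol.qty)          AS revenue,
       COUNT(DISTINCT o.id)            AS order_count
FROM orders o
JOIN order_lines ol ON ol.order_id = o.id
GROUP BY o.customer_id, date_trunc('day', o.created_at);

-- Unique index enables REFRESH ... CONCURRENTLY, so readers are not blocked
CREATE UNIQUE INDEX idx_daily_order_totals
ON daily_order_totals (customer_id, order_day);

-- Refresh from a scheduled job, not on every write
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_totals;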

The Decision Framework: Normalised vs. Denormalised vs. JSONB

Three storage approaches, each with a specific sweet spot. The wrong choice costs you either performance or data integrity.

Question | Normalised columns | Denormalised copy | JSONB column
Do you filter or sort by this data? | Yes. B-tree indexes, type checking, constraints. | Possible, but sync risk on every write. | Slower for most queries. GIN indexes work well for containment and existence checks, but range queries and sorting remain expensive compared to proper columns.
Does the structure vary per record? | No. Fixed columns cannot adapt. | No. Same limitation. | Yes. Form builder responses, API payloads, user-defined config.
Do you need referential integrity? | Yes. Foreign keys enforced by the database. | Partial. Copies break referential links. | No. Cannot reference other rows from inside JSON.
Is read performance the bottleneck? | Joins can be slow for complex aggregations. | Pre-computed values avoid joins. Use materialised views as a safer middle ground. | Single-row reads are fast, but filtering across rows is slow.

If you find yourself reaching for JSONB because you're unsure of the structure, pause. In most cases, the structure is known but the developer hasn't spoken to the domain experts yet. Talk to them first. The data model usually reveals itself.


Temporal Data: The Problem Everyone Ignores Until It's Too Late

Business data changes over time. Prices change. Employees change roles. Customers change addresses. A schema that only stores current state will eventually fail someone who needs to answer: "What was true on this date?"

The Constraint

An auditor asks: "What was this customer's credit limit when this order was approved?" The order was placed in March. The credit limit was updated in April. Your schema stores only the current credit limit. You cannot answer the question.

Effective Dating

Records that are valid for a period use effective dates: effective_from and effective_to. The current record has effective_to = NULL or a far-future date.

-- Required for EXCLUDE with BIGINT inside a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Price history table
CREATE TABLE product_prices (
    id BIGINT PRIMARY KEY,
    product_id BIGINT REFERENCES products(id),
    price DECIMAL(10,2) NOT NULL,
    effective_from DATE NOT NULL,
    effective_to DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT no_overlap EXCLUDE USING gist (
        product_id WITH =,
        daterange(effective_from, effective_to, '[)') WITH &&
    )
);

The exclusion constraint (PostgreSQL) prevents overlapping date ranges for the same product. This is the database enforcing business rules that application code would otherwise need to check on every insert.

Querying Effective-Dated Data

To find the price on a specific date:

SELECT price
FROM product_prices
WHERE product_id = 123
  AND daterange(effective_from, effective_to, '[)') @> DATE '2025-03-15';

This pattern applies to any data that changes over time: employee roles, customer addresses, tax rates, exchange rates, pricing tiers.

Snapshot vs. Temporal Tables

Snapshot approach

Copy relevant data at transaction time. The order stores the price, customer name, and address as they were when the order was placed. Simple to query. Duplicates data.

Use when: You need fast reads and the data is copied at a clear point in time (order placement, invoice generation).
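A minimal sketch of snapshot-on-write, reusing the order_lines name from the example schema with illustrative columns: the line captures the price and description as sold, regardless of later catalogue changes.

CREATE TABLE order_lines (
    id BIGINT PRIMARY KEY,
    order_id BIGINT NOT NULL REFERENCES orders(id),
    product_id BIGINT REFERENCES products(id),    -- link kept for reporting
    description TEXT NOT NULL,                    -- copied from the product at order time
    unit_price DECIMAL(10,2) NOT NULL,            -- price as sold, not the current price
    qty INT NOT NULL CHECK (qty > 0)
);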

Temporal tables

Maintain history in a separate table or with effective dates. Query "as of" any point in time. More complex queries. Row versions are duplicated into the history table, but the source of truth remains a single record at any given point.

Use when: You need to query historical state at arbitrary points, or when the same data is referenced by many records.

SQL Server offers system-versioned temporal tables that handle history automatically. PostgreSQL takes a different approach: use effective-dated rows, history tables, or triggers. PostgreSQL 18 adds temporal constraints such as WITHOUT OVERLAPS, but does not generate history tables automatically. For MySQL or SQLite, you'll implement temporal patterns manually.
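One common PostgreSQL approach is a trigger that copies the outgoing row into a history table before each update. A minimal sketch, assuming a customers table with an id column; the history table and function names are illustrative:

-- History table stores prior versions of each row
CREATE TABLE customers_history (
    history_id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    row_data JSONB NOT NULL,                              -- the row as it looked before the change
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE FUNCTION customers_record_history() RETURNS TRIGGER AS $$
BEGIN
    -- Capture the outgoing version before the update overwrites it
    INSERT INTO customers_history (customer_id, row_data)
    VALUES (OLD.id, to_jsonb(OLD));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_history_trigger
BEFORE UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION customers_record_history();

Querying "as of" a point in time then means combining the current row with its history rows by changed_at.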


Polymorphic Relationships: Several Implementation Options

A "note" that can belong to either a customer or an order. An "attachment" linked to invoices, projects, or support tickets. These are polymorphic relationships: one table relates to multiple other tables.

The Naive Approach: Morph Columns

Laravel's morphTo pattern stores notable_type and notable_id:

CREATE TABLE notes (
    id BIGINT PRIMARY KEY,
    notable_type VARCHAR(255),  -- 'App\Models\Customer' or 'App\Models\Order'
    notable_id BIGINT,
    content TEXT,
    created_at TIMESTAMP
);

This works at the application level but sacrifices database integrity. There's no foreign key constraint. notable_id might reference a row that doesn't exist. The database cannot enforce referential integrity.

The Better Pattern: Multiple Foreign Keys

Use nullable foreign keys for each possible parent:

CREATE TABLE notes (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES customers(id),
    order_id BIGINT REFERENCES orders(id),
    content TEXT,
    created_at TIMESTAMP,
    CONSTRAINT exactly_one_parent CHECK (
        (customer_id IS NOT NULL)::int +
        (order_id IS NOT NULL)::int = 1
    )
);

The check constraint ensures exactly one parent is set. The foreign keys ensure the parent exists. The database enforces the business rules.

The Junction Table Pattern

For true many-to-many polymorphism, use separate junction tables:

CREATE TABLE customer_notes (
    customer_id BIGINT REFERENCES customers(id),
    note_id BIGINT REFERENCES notes(id),
    PRIMARY KEY (customer_id, note_id)
);

CREATE TABLE order_notes (
    order_id BIGINT REFERENCES orders(id),
    note_id BIGINT REFERENCES notes(id),
    PRIMARY KEY (order_id, note_id)
);

More tables, but full referential integrity and the ability for a note to belong to multiple parents if the business requires it.

Trade-off: Morph columns are convenient for rapid development. Multiple foreign keys are safer for production systems where data integrity matters. Choose based on how much you trust your application code to never make mistakes.


Soft Deletes and Archival Strategies

Deleting a record removes it permanently. Sometimes that's correct. Often it isn't. Audit requirements, reporting needs, and the ability to undo mistakes all argue for soft deletes.

The Implementation

ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMP NULL;

A record with deleted_at IS NULL is active. A record with a timestamp is soft-deleted. Every query that should exclude deleted records needs:

SELECT * FROM customers WHERE deleted_at IS NULL;

The Problem: Query Complexity

Every developer must remember to add the filter. Every query, every join, every subquery. Miss one, and deleted records appear in reports. ORMs like Laravel handle this with global scopes, but raw queries bypass them.

The Better Pattern: Partial Indexes and Views

Create a view for active records:

CREATE VIEW active_customers AS
SELECT * FROM customers WHERE deleted_at IS NULL;

Application code queries the view by default. The underlying table is available when you need historical data.

Add a partial index for queries that filter on active records:

CREATE INDEX idx_customers_active_email
ON customers(email)
WHERE deleted_at IS NULL;

This index only includes active records, making it smaller and faster for the common case.

Archival for Large Tables

Soft deletes keep everything in one table. For high-volume tables (millions of rows), this slows queries even with indexes. The archival pattern moves old data to separate tables:

-- Archive table with defaults and CHECK constraints (does not copy indexes or foreign keys)
CREATE TABLE orders_archive (LIKE orders INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

-- Atomic move using DELETE ... RETURNING
WITH moved_rows AS (
    DELETE FROM orders
    WHERE created_at < NOW() - INTERVAL '2 years'
    RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM moved_rows;

The CTE deletes and inserts in a single atomic operation. Active queries hit the smaller table. Historical reports union both tables when needed. For very large tables, consider PostgreSQL's native range partitioning: if designed up front, the query planner automatically prunes old partitions and routes new rows to the correct partition. It won't archive data for you, but it removes the need for manual data movement.
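A minimal sketch of that partitioned design, assuming the orders table is created this way from the start (partition names and ranges are illustrative):

-- Parent table declares the partition key; rows live in the partitions
CREATE TABLE orders (
    id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    status VARCHAR(50) NOT NULL,
    total DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    PRIMARY KEY (id, created_at)   -- the partition key must be part of the primary key
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Queries filtering on created_at touch only the relevant partitions
SELECT count(*) FROM orders WHERE created_at >= DATE '2025-01-01';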

  • Soft delete: Add a deleted_at column. Simple, but every query needs the filter.
  • View abstraction: Query active_* views. Deleted records hidden by default.
  • Table archival: Separate tables for old data. Best for high volume.

Data Retention and GDPR Considerations

For UK businesses, soft deletes create a specific legal exposure. Under UK GDPR, a customer who exercises their right to erasure expects their personal data to be deleted, not just hidden behind a deleted_at timestamp. The storage limitation principle, which the ICO's guidance spells out, requires that personal data is kept no longer than necessary for its original purpose.

This means your soft-delete strategy needs a data lifecycle plan.

  • Anonymise rather than delete: Replace personal data with anonymised values while preserving the record for reporting and referential integrity. The order history stays intact; the customer name becomes "Customer #12847".
  • Define retention periods by data category: Transactional records (invoices, payments) have legal retention requirements. Marketing preferences don't. Set different retention policies and automate the lifecycle.
  • Separate personal data from business data: If customer contact details are in a separate table from order records, you can purge personal data without losing transactional history.

A schema that treats "deleted" as a single concept will struggle with these requirements. Design the data lifecycle into the schema from the start, not as a compliance afterthought.
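A minimal sketch of the anonymise-rather-than-delete route (column names such as name, phone, and anonymised_at are illustrative assumptions):

-- Erasure request: scrub the identity, keep the transactional record
UPDATE customers
SET name  = 'Customer #' || id,
    email = 'erased-' || id || '@example.invalid',   -- placeholder keeps NOT NULL/UNIQUE satisfied
    phone = NULL,
    anonymised_at = CURRENT_TIMESTAMP
WHERE id = 12847;

-- Order history still reports correctly; only the identity is gone
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.id = 12847;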


JSON Columns vs. Relational Tables

PostgreSQL's JSONB and MySQL's JSON columns offer schema flexibility within a relational database. Used correctly, they simplify certain patterns. Used incorrectly, they recreate all the problems of document databases without the benefits.

Good Uses for JSON Columns

Configuration that varies by record

Each customer has different notification preferences. Each product has different configurable options. The structure varies, and you don't query by these fields. Store as JSON, validate in application code.

External API responses

You store webhook payloads or API responses for debugging and audit purposes. You don't query by their contents. JSON preserves the original structure without schema maintenance.

Truly variable structures

Form builder responses where each form has different fields. Survey answers where questions vary. The structure is user-defined and cannot be predicted at schema design time.

Poor Uses for JSON Columns

Data you filter or sort by

Extracting values from JSON for WHERE clauses is slow. Even with JSON path indexes, it's slower than a proper column. If you query by it, it should be a column.

Data with consistent structure

If every record has the same fields, use columns. You get type checking, constraints, and indexing. JSON provides flexibility you're not using.

Relationships between records

Storing foreign keys in JSON arrays prevents referential integrity. The database cannot enforce that referenced records exist. Use junction tables for relationships.

The Hybrid Pattern

Use columns for data you query, JSON for data you don't:

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES customers(id),
    status VARCHAR(50) NOT NULL,           -- Queried frequently
    total DECIMAL(10,2) NOT NULL,          -- Queried frequently
    shipping_address_id BIGINT REFERENCES addresses(id),  -- Relational
    metadata JSONB,                         -- Webhook data, notes, flexible config
    created_at TIMESTAMP
);

-- Index the columns you query
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Optional: index specific JSON paths if you must query them
CREATE INDEX idx_orders_source ON orders((metadata->>'source'));

This gives you relational rigour where it matters and flexibility where it doesn't.
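If you do occasionally need to filter by the JSONB contents beyond a single known path, a GIN index supports containment and existence queries. A brief sketch using the metadata column above (the 'source' key is the same illustrative example):

-- GIN index over the whole JSONB document
CREATE INDEX idx_orders_metadata ON orders USING gin (metadata);

-- Containment query that can use the index
SELECT id, total
FROM orders
WHERE metadata @> '{"source": "webhook"}';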


Indexing Strategies for Common Query Patterns

Indexes speed up reads and slow down writes. Every index must earn its place. The strategy depends on your query patterns, not abstract best practices.

The Naive Approach: Index Everything

Add an index to every column. Queries are fast. Inserts and updates are slow. The database maintains indexes it never uses. Storage grows.

The Other Naive Approach: Index Nothing

Let the database figure it out. It won't. Full table scans on every query. The application slows to a crawl as data grows.

The Better Pattern: Index for Your Queries

Analyse your actual query patterns. Index the columns that appear in WHERE clauses, JOIN conditions, and ORDER BY clauses. Use EXPLAIN ANALYZE to verify indexes are being used.

Single-Column Indexes

The basics. Index columns you filter by:

-- If you frequently query orders by status
CREATE INDEX idx_orders_status ON orders(status);

-- If you frequently query customers by email
CREATE UNIQUE INDEX idx_customers_email ON customers(email);

Composite Indexes

For queries that filter on multiple columns:

-- Queries like: WHERE customer_id = ? AND status = ?
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);

-- Column order matters: this index helps queries that filter by customer_id alone,
-- but not queries that filter only by status

The leftmost columns in a composite index can satisfy queries independently. A (customer_id, status, created_at) index helps queries filtering by customer_id, or by customer_id + status, or by all three. It does not help queries filtering only by status or only by created_at.

Covering Indexes

Include columns the query selects to avoid table lookups:

-- Query: SELECT status, total FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering
ON orders(customer_id) INCLUDE (status, total);

The index contains all the data the query needs. PostgreSQL can use an index-only scan and avoid most heap lookups when the visibility map allows it. Faster reads at the cost of larger indexes.

Partial Indexes

Index a subset of rows:

-- Most queries only care about active orders
CREATE INDEX idx_orders_active_customer
ON orders(customer_id)
WHERE status NOT IN ('cancelled', 'completed');

-- Most queries exclude soft-deleted records
CREATE INDEX idx_customers_active
ON customers(email)
WHERE deleted_at IS NULL;

Smaller indexes. Faster maintenance. Only useful if your queries include the same filter condition.

Expression Indexes

Index computed values:

-- Queries like: WHERE LOWER(email) = ?
CREATE INDEX idx_customers_email_lower ON customers(LOWER(email));

-- Queries like: WHERE DATE(created_at) = ?
CREATE INDEX idx_orders_created_date ON orders(DATE(created_at));

The query must use the exact same expression for the index to apply.

Measure, don't guess: Use EXPLAIN ANALYZE to see which indexes are actually used. Check pg_stat_user_indexes (PostgreSQL) to find unused indexes. Remove indexes that cost more than they help.
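A starting-point query for PostgreSQL, listing indexes that have not been scanned since statistics were last reset. Treat the output with care: unique indexes backing constraints, and indexes used only on replicas or by rare jobs, will also show zero scans.

-- Indexes with no recorded scans, largest first
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;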


State Machines and Status Fields

Orders move through states: draft, submitted, approved, shipped, completed, cancelled. Modelling state transitions correctly prevents impossible states and provides audit trails.

The Naive Approach: Status Column Only

ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'draft';

The application updates the status. No record of when transitions happened. No enforcement of valid transitions. An order can jump from "draft" to "shipped" if the code has a bug.

The Better Pattern: Status with Transition History

CREATE TABLE order_status_transitions (
    id BIGINT PRIMARY KEY,
    order_id BIGINT REFERENCES orders(id),
    from_status VARCHAR(50),
    to_status VARCHAR(50) NOT NULL,
    transitioned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    transitioned_by BIGINT REFERENCES users(id),
    notes TEXT
);

-- The orders table stores current status for fast queries
ALTER TABLE orders ADD COLUMN current_status VARCHAR(50) DEFAULT 'draft';
ALTER TABLE orders ADD COLUMN current_status_at TIMESTAMP;

Insert a transition record whenever status changes. The orders table has the current status for filtering. The transitions table has the complete history for audit and reporting. For system-wide change history, see our audit trails guide.
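A minimal sketch of a single status change, wrapped in a transaction so the current status and its history row cannot drift apart (the ids, statuses, and the generated primary key on the transitions table are illustrative assumptions):

BEGIN;

-- Update the denormalised current status on the order
UPDATE orders
SET current_status = 'approved',
    current_status_at = CURRENT_TIMESTAMP
WHERE id = 42;

-- Record the transition in the history table (assumes id is generated)
INSERT INTO order_status_transitions (order_id, from_status, to_status, transitioned_by, notes)
VALUES (42, 'submitted', 'approved', 7, 'Credit check passed');

COMMIT;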

Valid Transition Enforcement

Define which transitions are allowed:

CREATE TABLE valid_status_transitions (
    from_status VARCHAR(50),
    to_status VARCHAR(50),
    PRIMARY KEY (from_status, to_status)
);

INSERT INTO valid_status_transitions VALUES
    ('draft', 'submitted'),
    ('submitted', 'approved'),
    ('submitted', 'rejected'),
    ('approved', 'shipped'),
    ('shipped', 'completed'),
    ('draft', 'cancelled'),
    ('submitted', 'cancelled');

Application code checks this table before allowing a transition. Or use a trigger to enforce it at the database level:

CREATE FUNCTION check_status_transition() RETURNS TRIGGER AS $$
BEGIN
    -- Skip check if status hasn't changed
    IF NEW.current_status IS NOT DISTINCT FROM OLD.current_status THEN
        RETURN NEW;
    END IF;

    IF NOT EXISTS (
        SELECT 1 FROM valid_status_transitions
        WHERE from_status = OLD.current_status
        AND to_status = NEW.current_status
    ) THEN
        RAISE EXCEPTION 'Invalid status transition from % to %',
            OLD.current_status, NEW.current_status;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_check_status_transition
BEFORE UPDATE OF current_status ON orders
FOR EACH ROW
EXECUTE FUNCTION check_status_transition();

The trigger fires on every status update. If the transition isn't in the allowed list, the database rejects it. This prevents bugs in application code from creating invalid states.


Multi-Tenancy Patterns

Multiple clients sharing one database. Three approaches, each with trade-offs.

Tenant ID on every table

Add tenant_id to every tenant-owned table. Scope unique constraints by tenant, include tenant_id in tenant-owned foreign keys, and enforce isolation with RLS. Risk of data leakage if a query misses the filter and RLS is not enabled.

Best for: Shared infrastructure, cost-sensitive deployments, tenants with similar data volumes.

Separate schemas

Each tenant gets their own PostgreSQL schema. Queries don't need tenant filters. Schema changes must be applied to all schemas.

Best for: Moderate isolation requirements, tenants who want some data separation guarantees.

Separate databases

Complete isolation. Each tenant has their own database. Maximum security. Higher operational cost. Cross-tenant queries require application-level aggregation.

Best for: Strict compliance requirements, large enterprise tenants, data residency requirements.

Implementing Tenant ID Safely

If using the tenant ID pattern, use Row Level Security (PostgreSQL) to enforce isolation at the database level. We cover multi-tenancy implementation in more detail in our multi-tenant Laravel guide.

-- Enable RLS (applies to non-owner roles)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- FORCE applies RLS even to the table owner (critical for admin operations)
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

-- Policy restricts both reads (USING) and writes (WITH CHECK)
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.current_tenant_id')::bigint)
    WITH CHECK (tenant_id = current_setting('app.current_tenant_id')::bigint);

-- Set tenant context at connection time
SET app.current_tenant_id = '123';

Without FORCE ROW LEVEL SECURITY, the table owner bypasses all RLS policies. This is a common source of data leakage during admin operations, migrations, and background jobs. The WITH CHECK clause prevents a tenant from inserting or updating rows belonging to another tenant. Even if application code forgets the WHERE clause, the database enforces tenant isolation.
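The schema itself can also encode tenancy, complementing RLS: unique constraints scoped per tenant, and composite foreign keys that stop a child row from pointing at a parent in another tenant. A minimal sketch, assuming a tenants table (names are illustrative):

CREATE TABLE customers (
    id BIGINT PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    email VARCHAR(255) NOT NULL,
    UNIQUE (tenant_id, email),   -- each tenant has its own email namespace
    UNIQUE (tenant_id, id)       -- target for the composite foreign key below
);

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    customer_id BIGINT NOT NULL,
    -- The order and its customer must belong to the same tenant
    FOREIGN KEY (tenant_id, customer_id) REFERENCES customers (tenant_id, id)
);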


Constraints: Your Last Line of Defence

Application code has bugs. ORMs have bugs. API consumers send unexpected data. Constraints are the database saying "this data is invalid" regardless of how it arrived.

Constraint Types

  • NOT NULL: Column must have a value. Prevents incomplete records.
  • UNIQUE: No duplicate values. Email addresses, order numbers, SKUs.
  • FOREIGN KEY: Referenced row must exist. Prevents orphaned records.
  • CHECK: Custom validation. CHECK (price >= 0), CHECK (end_date > start_date).
  • EXCLUSION: No overlapping values. Date ranges, scheduled resources, room bookings.
-- btree_gist required for BIGINT inside a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE bookings (
    id BIGINT PRIMARY KEY,
    room_id BIGINT REFERENCES rooms(id) NOT NULL,
    starts_at TIMESTAMP NOT NULL,
    ends_at TIMESTAMP NOT NULL,
    booked_by BIGINT REFERENCES users(id) NOT NULL,

    -- End must be after start
    CONSTRAINT valid_time_range CHECK (ends_at > starts_at),

    -- No double-booking (PostgreSQL with btree_gist extension)
    CONSTRAINT no_overlap EXCLUDE USING gist (
        room_id WITH =,
        tsrange(starts_at, ends_at) WITH &&
    )
);

The database prevents double-bookings. No race conditions in application code. No "oops, two people booked the same room" support tickets.

Quick Reference: Choosing the Right Constraint

Business rule | Constraint type | Example
Every order must have a customer | FOREIGN KEY + NOT NULL | customer_id BIGINT NOT NULL REFERENCES customers(id)
No two customers share an email | UNIQUE | UNIQUE (email)
Prices cannot be negative | CHECK | CHECK (price >= 0)
End date must be after start date | CHECK | CHECK (ends_at > starts_at)
No overlapping booking times | EXCLUSION (PostgreSQL) | EXCLUDE USING gist (room_id WITH =, tsrange(starts_at, ends_at) WITH &&)
No overlapping price periods | EXCLUSION (PostgreSQL) | EXCLUDE USING gist (product_id WITH =, daterange(effective_from, effective_to) WITH &&)
Status must be a known value | CHECK or ENUM | CHECK (status IN ('draft','submitted','approved')) or a PostgreSQL native ENUM

The investment: A missing constraint in a customer database allowed duplicate entries for years. Cleaning up the resulting data cost more than the original development. Constraints are cheap. Data problems are expensive.


Anti-Patterns to Recognise

Patterns that seem reasonable but cause problems at scale or over time.

The "god" table: One table representing multiple concepts. A "things" table with type columns and nullable fields for each type. Impossible to constrain. Queries become CASE statement nightmares.
Entity-Attribute-Value (EAV): Storing everything as key-value pairs. (entity_id, attribute_name, attribute_value). Seems flexible. Actually makes queries impossible to optimise and prevents meaningful constraints.
Comma-separated values in columns: Storing multiple values in one column (tags = 'red,blue,green'). Cannot be indexed properly. Cannot enforce referential integrity. Use a junction table.
Implicit typing: Storing different data types in the same column based on context. A "value" column that's sometimes a number, sometimes a date, sometimes text. Cannot be indexed or validated.
Premature denormalisation: Copying data "for performance" before measuring. Creates synchronisation bugs. Only denormalise with a specific, measured performance problem.

If You Inherited One of These Patterns

Recognising anti-patterns is only half the work. The harder part is migrating away from them on a live system without breaking production. Here's the general approach.

God table: Split by concept

Create new tables for each concept the god table represents. Add foreign keys back to the original. Run a dual-write period where both the old table and new tables receive data. Migrate historical data in batches. Cut over when the new tables are verified. Drop the old columns last.

EAV: Extract to typed columns

Identify the most-queried attributes. Add them as proper columns with constraints. Backfill from the EAV rows. Once verified, stop writing to EAV for those attributes. Keep the EAV table for genuinely variable attributes (if any exist); remove it entirely once it's empty.

JSONB overuse: Promote to columns

For each JSON path you filter or sort by, add a generated column or a proper column with a backfill migration. Add constraints and indexes on the new columns. Update application code to read from columns instead of JSON. Keep JSON only for data that genuinely varies per record.

String foreign keys: Add surrogate keys

Add an integer or UUID primary key to the referenced table. Add a new foreign key column to each referencing table. Backfill the new column using the string match. Switch application code to use the new key. Drop the string column when nothing references it.
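A minimal migration sketch for that sequence, assuming products is currently referenced by a product_sku string on order_lines (all names are illustrative):

-- 1. Give the referenced table a surrogate key
ALTER TABLE products ADD COLUMN id BIGINT GENERATED ALWAYS AS IDENTITY;
ALTER TABLE products ADD CONSTRAINT products_id_unique UNIQUE (id);

-- 2. Add the new foreign key column and backfill via the string match
ALTER TABLE order_lines ADD COLUMN product_id BIGINT;

UPDATE order_lines ol
SET product_id = p.id
FROM products p
WHERE p.sku = ol.product_sku;

-- 3. Enforce the relationship; drop the string column only after nothing reads it
ALTER TABLE order_lines
    ADD CONSTRAINT order_lines_product_fk
    FOREIGN KEY (product_id) REFERENCES products (id);
-- ALTER TABLE order_lines DROP COLUMN product_sku;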

Every remediation follows the same pattern: add the new structure alongside the old, dual-write, backfill, verify, cut over, clean up. Rushing any step risks data loss or downtime. Plan for each migration to take longer than you expect, and test with a copy of production data first.


The Business Link: Why This Matters

Schema work isn't glamorous. It doesn't produce visible features. The payoff is indirect but substantial.

  • Fewer bugs reach production: Constraints catch invalid data before it corrupts reports and confuses users.
  • Queries stay fast as data grows: Proper indexes and structure mean the system that works with 10,000 records still works with 10 million.
  • New features are easier to build: A well-structured schema makes new queries straightforward. A poorly-structured one makes every feature a battle.
  • Reports answer the actual questions: Historical data is available when auditors ask. Temporal queries return accurate results.
  • Migrations are less painful: A schema designed for change can be modified without rebuilding the entire system.

The cost of getting schema right is paid once, during design. The cost of getting it wrong is paid repeatedly, on every feature, every report, every migration, for the lifetime of the system.



Design Your Data Model

We design data models that capture your business reality accurately. Schemas that handle the exceptions, edge cases, and "but sometimes..." scenarios that real businesses have. The foundation everything else builds on.

Let's talk about data modelling →