The database schema is the foundation everything else builds on. Get it right, and your application flows naturally. Get it wrong, and you'll fight the data model for the lifetime of the system.
Good data modelling captures business reality accurately. It handles the exceptions, edge cases, and "but sometimes..." scenarios that real businesses have. Poor data modelling forces workarounds, creates data integrity issues, and makes simple queries complicated.
The Constraint: Business Logic is Messy
Tutorial schemas are clean. A user has many orders. An order has many line items. Foreign keys point where they should. The happy path works.
Then reality arrives. A customer merges with another company, and you need to combine their order histories without losing audit trails. A product gets discontinued, but existing orders still reference it. An invoice needs to reflect the price at the time of sale, not the current price. A manager needs to see "what did this record look like last Tuesday?"
These aren't edge cases. They're the normal state of business software that runs for more than six months. The schema either handles them gracefully or forces increasingly desperate workarounds in application code.
The pattern: Most schema problems aren't discovered during development. They surface six months later when someone asks for a report that the schema makes impossible, or when a business rule changes and you realise the data model assumed it never would.
Domain Modelling: Start with the Language
Before writing any SQL, map the language of the business. What do they call things? An "account" might mean a customer to sales, a login to IT, and a billing entity to finance. A "project" in one department might overlap with "job" in another.
These aren't pedantic distinctions. They determine table names, relationship structures, and the queries you'll write for years. Get them wrong, and every developer who touches the codebase will need a translation guide.
Interview the domain experts
Talk to the people who do the work. They know the exceptions. "Usually orders have one customer, but sometimes we split billing between two." These details determine schema structure.
Build a glossary
Document terms and their meanings across departments. When sales says "account" and support says "account", do they mean the same entity? Resolve ambiguities before encoding them in tables.
Map the relationships
Draw the connections. Can a customer have multiple billing addresses? Can a line item span multiple orders? Can an employee report to multiple managers? Cardinality matters.
Find the lifecycle events
What happens when an order is cancelled? When a customer churns? When a product is discontinued? These events reveal requirements that happy-path discussions miss.
The result is a ubiquitous language: terms that mean the same thing in code, in the database, and in conversations with the business. When the code says Invoice, everyone knows what that means. No translation required. This shared vocabulary is essential for building a single source of truth.
Normalisation: The Naive Approach vs. Deliberate Design
The Naive Approach: Normalise Everything
Database courses teach normalisation as dogma. Third normal form. No redundant data. Every fact stored once. In isolation, this is correct.
In production, it creates problems: the "show me the order summary" query that joins six tables, the report that runs overnight because it aggregates across perfectly normalised tables, the API endpoint that times out because it needs to reconstruct data from fifteen related records.
The Other Naive Approach: Denormalise Everything
The pendulum swings. "Normalisation is slow, let's just copy data everywhere." Now you have synchronisation bugs. The customer's name is different in the orders table than in the customers table. Someone updated one and not the other. Which is correct? No one knows.
The Robust Pattern: Normalise First, Then Denormalise Deliberately
Start with a normalised design. Understand the canonical structure. Then, with specific, measured performance problems in hand, introduce denormalisation where the trade-offs are worth it.
| Scenario | Normalised Approach | When to Denormalise |
|---|---|---|
| Customer name on orders | Join to customers table | Copy at order time if you need historical accuracy and the join costs more than the storage |
| Order totals | SUM(line_items.price * quantity) | Store computed total if the calculation is expensive and totals are queried frequently |
| Product categories | Junction table for many-to-many | JSON array if you never query by category and need fast reads |
| Reporting aggregates | Calculate on demand | Materialised views or summary tables for dashboards that run constantly |
The key word is "deliberately". Every denormalisation is a trade-off: storage vs. computation, consistency vs. performance. Document why you made the trade-off. Future developers (including yourself in six months) will need to know.
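As one illustration of the last row in the table, a dashboard aggregate can be precomputed rather than recalculated on every page load. A minimal sketch, assuming hypothetical orders and line_items tables:
-- Precompute daily totals once, rather than aggregating on every dashboard load
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT DATE(o.created_at) AS sale_date,
       COUNT(DISTINCT o.id) AS order_count,
       SUM(li.price * li.quantity) AS revenue
FROM orders o
JOIN line_items li ON li.order_id = o.id
GROUP BY DATE(o.created_at);
-- Refresh periodically, e.g. from a scheduled job
REFRESH MATERIALIZED VIEW daily_sales_summary;
The refresh schedule is the documented trade-off: the dashboard reads stale-by-minutes data in exchange for cheap queries.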
Temporal Data: The Problem Everyone Ignores Until It's Too Late
Business data changes over time. Prices change. Employees change roles. Customers change addresses. A schema that only stores current state will eventually fail someone who needs to answer: "What was true on this date?"
The Constraint
An auditor asks: "What was this customer's credit limit when this order was approved?" The order was placed in March. The credit limit was updated in April. Your schema stores only the current credit limit. You cannot answer the question.
Effective Dating
Records that are valid for a period use effective dates: effective_from and effective_to. The current record has effective_to = NULL or a far-future date.
-- Price history table
CREATE TABLE product_prices (
id BIGINT PRIMARY KEY,
product_id BIGINT REFERENCES products(id),
price DECIMAL(10,2) NOT NULL,
effective_from DATE NOT NULL,
effective_to DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT no_overlap EXCLUDE USING gist (
product_id WITH =,
daterange(effective_from, effective_to, '[)') WITH &&
)
);
The exclusion constraint (PostgreSQL, with the btree_gist extension enabled so the equality on product_id can be combined with the range comparison) prevents overlapping date ranges for the same product. This is the database enforcing business rules that application code would otherwise need to check on every insert.
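To see the constraint at work, a hypothetical pair of inserts (the literal values are illustrative):
-- Accepted: first price period for product 123
INSERT INTO product_prices (id, product_id, price, effective_from, effective_to)
VALUES (1, 123, 19.99, '2025-01-01', '2025-06-01');
-- Rejected by no_overlap: this open-ended period overlaps the one above
INSERT INTO product_prices (id, product_id, price, effective_from, effective_to)
VALUES (2, 123, 24.99, '2025-03-01', NULL);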
Querying Effective-Dated Data
To find the price on a specific date:
SELECT price
FROM product_prices
WHERE product_id = 123
AND effective_from <= '2025-03-15'
AND (effective_to IS NULL OR effective_to > '2025-03-15');
This pattern applies to any data that changes over time: employee roles, customer addresses, tax rates, exchange rates, pricing tiers.
Snapshot vs. Temporal Tables
Snapshot approach
Copy relevant data at transaction time. The order stores the price, customer name, and address as they were when the order was placed. Simple to query. Duplicates data.
Use when: You need fast reads and the data is copied at a clear point in time (order placement, invoice generation).
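A minimal sketch of the snapshot approach, assuming hypothetical order_items and products tables:
CREATE TABLE order_items (
    id BIGINT PRIMARY KEY,
    order_id BIGINT REFERENCES orders(id),
    product_id BIGINT REFERENCES products(id),
    -- Copied from products at order time, so later price or name changes don't rewrite history
    product_name VARCHAR(255) NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    quantity INT NOT NULL
);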
Temporal tables
Maintain history in a separate table or with effective dates. Query "as of" any point in time. More complex queries. No duplication.
Use when: You need to query historical state at arbitrary points, or when the same data is referenced by many records.
SQL Server and MariaDB offer system-versioned temporal tables that handle this automatically. For PostgreSQL, MySQL, or SQLite, you'll implement it with an extension, triggers, or application logic.
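If you go the manual route, one possible shape is a history table fed by a trigger. A sketch, assuming a hypothetical customers table with an updated_at column (PostgreSQL syntax):
-- Each row captures a prior version of the record and the period it was valid for
CREATE TABLE customers_history (
    customer_id BIGINT NOT NULL,
    data JSONB NOT NULL,
    valid_from TIMESTAMP NOT NULL,
    valid_to TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE FUNCTION customers_track_history() RETURNS TRIGGER AS $$
BEGIN
    -- Store the old version of the row before the update replaces it
    INSERT INTO customers_history (customer_id, data, valid_from)
    VALUES (OLD.id, to_jsonb(OLD), OLD.updated_at);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER customers_history_trigger
BEFORE UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION customers_track_history();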
Polymorphic Relationships: Several Implementation Options
A "note" that can belong to either a customer or an order. An "attachment" linked to invoices, projects, or support tickets. These are polymorphic relationships: one table relates to multiple other tables.
The Naive Approach: Morph Columns
Laravel's morphTo pattern stores notable_type and notable_id:
CREATE TABLE notes (
id BIGINT PRIMARY KEY,
notable_type VARCHAR(255), -- 'App\Models\Customer' or 'App\Models\Order'
notable_id BIGINT,
content TEXT,
created_at TIMESTAMP
);
This works at the application level but sacrifices database integrity. There's no foreign key constraint. notable_id might reference a row that doesn't exist. The database cannot enforce referential integrity.
The Robust Pattern: Multiple Foreign Keys
Use nullable foreign keys for each possible parent:
CREATE TABLE notes (
id BIGINT PRIMARY KEY,
customer_id BIGINT REFERENCES customers(id),
order_id BIGINT REFERENCES orders(id),
content TEXT,
created_at TIMESTAMP,
CONSTRAINT exactly_one_parent CHECK (
(customer_id IS NOT NULL)::int +
(order_id IS NOT NULL)::int = 1
)
);
The check constraint ensures exactly one parent is set. The foreign keys ensure the parent exists. The database enforces the business rules.
The Junction Table Pattern
For true many-to-many polymorphism, use separate junction tables:
CREATE TABLE customer_notes (
customer_id BIGINT REFERENCES customers(id),
note_id BIGINT REFERENCES notes(id),
PRIMARY KEY (customer_id, note_id)
);
CREATE TABLE order_notes (
order_id BIGINT REFERENCES orders(id),
note_id BIGINT REFERENCES notes(id),
PRIMARY KEY (order_id, note_id)
);
More tables, but full referential integrity and the ability for a note to belong to multiple parents if the business requires it.
Trade-off: Morph columns are convenient for rapid development. Multiple foreign keys are safer for production systems where data integrity matters. Choose based on how much you trust your application code to never make mistakes.
Soft Deletes and Archival Strategies
Deleting a record removes it permanently. Sometimes that's correct. Often it isn't. Audit requirements, reporting needs, and the ability to undo mistakes all argue for soft deletes.
The Implementation
ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMP NULL;
A record with deleted_at IS NULL is active. A record with a timestamp is soft-deleted. Every query that should exclude deleted records needs:
SELECT * FROM customers WHERE deleted_at IS NULL;
The Problem: Query Complexity
Every developer must remember to add the filter. Every query, every join, every subquery. Miss one, and deleted records appear in reports. ORMs like Laravel's Eloquent handle this with global scopes, but raw queries bypass them.
The Robust Pattern: Partial Indexes and Views
Create a view for active records:
CREATE VIEW active_customers AS
SELECT * FROM customers WHERE deleted_at IS NULL;
Application code queries the view by default. The underlying table is available when you need historical data.
Add a partial index for queries that filter on active records:
CREATE INDEX idx_customers_active_email
ON customers(email)
WHERE deleted_at IS NULL;
This index only includes active records, making it smaller and faster for the common case.
Archival for Large Tables
Soft deletes keep everything in one table. For high-volume tables (millions of rows), this slows queries even with indexes. The archival pattern moves old data to separate tables:
-- Active orders (last 2 years)
CREATE TABLE orders (...);
-- Archived orders (older than 2 years)
CREATE TABLE orders_archive (LIKE orders INCLUDING ALL);
-- Move old orders periodically, inside one transaction so both statements use the same cutoff
BEGIN;
INSERT INTO orders_archive
SELECT * FROM orders WHERE created_at < NOW() - INTERVAL '2 years';
DELETE FROM orders WHERE created_at < NOW() - INTERVAL '2 years';
COMMIT;
Active queries hit the smaller table. Historical reports union both tables when needed. The approach trades query simplicity for performance.
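For those historical reports, a simple sketch of a combined view over both tables:
-- Full history when a report genuinely needs it
CREATE VIEW all_orders AS
SELECT * FROM orders
UNION ALL
SELECT * FROM orders_archive;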
JSON Columns vs. Relational Tables
PostgreSQL's JSONB and MySQL's JSON columns offer schema flexibility within a relational database. Used correctly, they simplify certain patterns. Used incorrectly, they recreate all the problems of document databases without the benefits.
Good Uses for JSON Columns
Configuration that varies by record
Each customer has different notification preferences. Each product has different configurable options. The structure varies, and you don't query by these fields. Store as JSON, validate in application code.
External API responses
You store webhook payloads or API responses for debugging and audit purposes. You don't query by their contents. JSON preserves the original structure without schema maintenance.
Truly variable structures
Form builder responses where each form has different fields. Survey answers where questions vary. The structure is user-defined and cannot be predicted at schema design time.
Poor Uses for JSON Columns
Data you filter or sort by
Extracting values from JSON for WHERE clauses is slow. Even with JSON path indexes, it's slower than a proper column. If you query by it, it should be a column.
Data with consistent structure
If every record has the same fields, use columns. You get type checking, constraints, and indexing. JSON provides flexibility you're not using.
Relationships between records
Storing foreign keys in JSON arrays prevents referential integrity. The database cannot enforce that referenced records exist. Use junction tables for relationships.
The Hybrid Pattern
Use columns for data you query, JSON for data you don't:
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
customer_id BIGINT REFERENCES customers(id),
status VARCHAR(50) NOT NULL, -- Queried frequently
total DECIMAL(10,2) NOT NULL, -- Queried frequently
shipping_address_id BIGINT REFERENCES addresses(id), -- Relational
metadata JSONB, -- Webhook data, notes, flexible config
created_at TIMESTAMP
);
-- Index the columns you query
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Optional: index specific JSON paths if you must query them
CREATE INDEX idx_orders_source ON orders((metadata->>'source'));
This gives you relational rigour where it matters and flexibility where it doesn't.
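If you do end up querying a JSON path, the query must use the same expression as the index. A sketch against the schema above (the 'source' key is an assumed example):
SELECT id, total
FROM orders
WHERE metadata->>'source' = 'webhook'  -- matches the indexed expression
  AND status = 'pending';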
Indexing Strategies for Common Query Patterns
Indexes speed up reads and slow down writes. Every index must earn its place. The strategy depends on your query patterns, not abstract best practices.
The Naive Approach: Index Everything
Add an index to every column. Queries are fast. Inserts and updates are slow. The database maintains indexes it never uses. Storage grows.
The Other Naive Approach: Index Nothing
Let the database figure it out. It won't. Full table scans on every query. The application slows to a crawl as data grows.
The Robust Pattern: Index for Your Queries
Analyse your actual query patterns. Index the columns that appear in WHERE clauses, JOIN conditions, and ORDER BY clauses. Use EXPLAIN ANALYZE to verify indexes are being used.
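A minimal check, with illustrative table and column names:
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123 AND status = 'pending';
-- In the plan output, look for an Index Scan (or Bitmap Index Scan) on the expected index
-- rather than a Seq Scan over the whole table.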
Single-Column Indexes
The basics. Index columns you filter by:
-- If you frequently query orders by status
CREATE INDEX idx_orders_status ON orders(status);
-- If you frequently query customers by email
CREATE UNIQUE INDEX idx_customers_email ON customers(email);
Composite Indexes
For queries that filter on multiple columns:
-- Queries like: WHERE customer_id = ? AND status = ?
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
-- Column order matters: this index helps queries that filter by customer_id alone,
-- but not queries that filter only by status
The leftmost columns in a composite index can satisfy queries independently. A (customer_id, status, created_at) index helps queries filtering by customer_id, or by customer_id + status, or by all three. It does not help queries filtering only by status or only by created_at.
Covering Indexes
Include columns the query selects to avoid table lookups:
-- Query: SELECT status, total FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering
ON orders(customer_id) INCLUDE (status, total);
The index contains all the data the query needs. The database never touches the table itself. Faster reads at the cost of larger indexes.
Partial Indexes
Index a subset of rows:
-- Most queries only care about active orders
CREATE INDEX idx_orders_active_customer
ON orders(customer_id)
WHERE status NOT IN ('cancelled', 'completed');
-- Most queries exclude soft-deleted records
CREATE INDEX idx_customers_active
ON customers(email)
WHERE deleted_at IS NULL;
Smaller indexes. Faster maintenance. Only useful if your queries include the same filter condition.
Expression Indexes
Index computed values:
-- Queries like: WHERE LOWER(email) = ?
CREATE INDEX idx_customers_email_lower ON customers(LOWER(email));
-- Queries like: WHERE DATE(created_at) = ?
CREATE INDEX idx_orders_created_date ON orders(DATE(created_at));
The query must use the exact same expression for the index to apply.
Measure, don't guess: Use EXPLAIN ANALYZE to see which indexes are actually used. Check pg_stat_user_indexes (PostgreSQL) to find unused indexes. Remove indexes that cost more than they help.
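A sketch of the unused-index check in PostgreSQL:
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
-- idx_scan counts uses since statistics were last reset; zero suggests a removal candidate,
-- but check replicas and infrequent jobs before dropping anything.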
State Machines and Status Fields
Orders move through states: draft, submitted, approved, shipped, completed, cancelled. Modelling state transitions correctly prevents impossible states and provides audit trails.
The Naive Approach: Status Column Only
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'draft';
The application updates the status. No record of when transitions happened. No enforcement of valid transitions. An order can jump from "draft" to "shipped" if the code has a bug.
The Robust Pattern: Status with Transition History
CREATE TABLE order_status_transitions (
id BIGINT PRIMARY KEY,
order_id BIGINT REFERENCES orders(id),
from_status VARCHAR(50),
to_status VARCHAR(50) NOT NULL,
transitioned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
transitioned_by BIGINT REFERENCES users(id),
notes TEXT
);
-- The orders table stores current status for fast queries
ALTER TABLE orders ADD COLUMN current_status VARCHAR(50) DEFAULT 'draft';
ALTER TABLE orders ADD COLUMN current_status_at TIMESTAMP;
Insert a transition record whenever status changes. The orders table has the current status for filtering. The transitions table has the complete history for audit and reporting. For comprehensive change tracking across your entire system, see our audit trails guide.
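A sketch of recording a transition alongside the status update. The literal ids are illustrative, and it assumes the id columns are generated automatically (e.g. as identity columns):
BEGIN;
UPDATE orders
SET current_status = 'approved', current_status_at = CURRENT_TIMESTAMP
WHERE id = 42 AND current_status = 'submitted';  -- guard against concurrent transitions
INSERT INTO order_status_transitions (order_id, from_status, to_status, transitioned_by, notes)
VALUES (42, 'submitted', 'approved', 7, 'Credit check passed');
COMMIT;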
Valid Transition Enforcement
Define which transitions are allowed:
CREATE TABLE valid_status_transitions (
from_status VARCHAR(50),
to_status VARCHAR(50),
PRIMARY KEY (from_status, to_status)
);
INSERT INTO valid_status_transitions VALUES
('draft', 'submitted'),
('submitted', 'approved'),
('submitted', 'rejected'),
('approved', 'shipped'),
('shipped', 'completed'),
('draft', 'cancelled'),
('submitted', 'cancelled');
Application code checks this table before allowing a transition. Or use a trigger to enforce it at the database level:
CREATE FUNCTION check_status_transition() RETURNS TRIGGER AS $$
BEGIN
IF NOT EXISTS (
SELECT 1 FROM valid_status_transitions
WHERE from_status = OLD.current_status
AND to_status = NEW.current_status
) THEN
RAISE EXCEPTION 'Invalid status transition from % to %',
OLD.current_status, NEW.current_status;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
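The function needs a trigger to attach it to the table. A sketch, with a WHEN clause so updates that don't change the status aren't rejected:
CREATE TRIGGER enforce_status_transition
BEFORE UPDATE ON orders
FOR EACH ROW
WHEN (OLD.current_status IS DISTINCT FROM NEW.current_status)
EXECUTE FUNCTION check_status_transition();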
This prevents bugs in application code from creating invalid states. The database is the last line of defence.
Multi-Tenancy Patterns
Multiple clients sharing one database. Three approaches, each with trade-offs.
Tenant ID on every table
Add tenant_id to every table. Every query includes the filter. Simple to implement. Risk of data leakage if a query misses the filter.
Best for: Shared infrastructure, cost-sensitive deployments, tenants with similar data volumes.
Separate schemas
Each tenant gets their own PostgreSQL schema. Queries don't need tenant filters. Schema changes must be applied to all schemas.
Best for: Moderate isolation requirements, tenants who want some data separation guarantees.
Separate databases
Complete isolation. Each tenant has their own database. Maximum security. Higher operational cost. Cross-tenant queries require application-level aggregation.
Best for: Strict compliance requirements, large enterprise tenants, data residency requirements.
Implementing Tenant ID Safely
If using the tenant ID pattern, use Row Level Security (PostgreSQL) to enforce isolation at the database level. We cover multi-tenancy implementation in more detail in our multi-tenant Laravel guide.
-- Enable RLS
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Policy that restricts access to current tenant
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.current_tenant_id')::bigint);
-- Set tenant context at connection time
SET app.current_tenant_id = '123';
Even if application code forgets the WHERE clause, the database enforces tenant isolation. This is defence in depth.
Constraints: Your Last Line of Defence
Application code has bugs. ORMs have bugs. API consumers send unexpected data. Constraints are the database saying "this data is invalid" regardless of how it arrived.
Constraint Types
- NOT NULL: Column must have a value. Prevents incomplete records.
- UNIQUE: No duplicate values. Email addresses, order numbers, SKUs.
- FOREIGN KEY: Referenced row must exist. Prevents orphaned records.
- CHECK: Custom validation. CHECK (price >= 0), CHECK (end_date > start_date).
- EXCLUSION: No overlapping values. Date ranges, scheduled resources, room bookings.
CREATE TABLE bookings (
id BIGINT PRIMARY KEY,
room_id BIGINT REFERENCES rooms(id) NOT NULL,
starts_at TIMESTAMP NOT NULL,
ends_at TIMESTAMP NOT NULL,
booked_by BIGINT REFERENCES users(id) NOT NULL,
-- End must be after start
CONSTRAINT valid_time_range CHECK (ends_at > starts_at),
-- No double-booking (PostgreSQL with btree_gist extension)
CONSTRAINT no_overlap EXCLUDE USING gist (
room_id WITH =,
tsrange(starts_at, ends_at) WITH &&
)
);
The database prevents double-bookings. No race conditions in application code. No "oops, two people booked the same room" support tickets.
The investment: A missing constraint in a customer database allowed duplicate entries for years. Cleaning up the resulting data cost more than the original development. Constraints are cheap. Data problems are expensive.
Anti-Patterns to Recognise
Patterns that seem reasonable but cause problems at scale or over time.
- Entity-Attribute-Value (EAV) tables: a single (entity_id, attribute_name, attribute_value) table. Seems flexible. Actually makes queries impossible to optimise and prevents meaningful constraints.
- Comma-separated lists in a column (tags = 'red,blue,green'). Cannot be indexed properly. Cannot enforce referential integrity. Use a junction table.
If you recognise these patterns in an existing schema, plan migrations carefully. Fixing a schema with live data is surgery, not coding.
The Business Link: Why This Matters
Schema work isn't glamorous. It doesn't produce visible features. The payoff is indirect but substantial.
- Fewer bugs reach production: Constraints catch invalid data before it corrupts reports and confuses users.
- Queries stay fast as data grows: Proper indexes and structure mean the system that works with 10,000 records still works with 10 million.
- New features are easier to build: A well-structured schema makes new queries straightforward. A poorly-structured one makes every feature a battle.
- Reports answer the actual questions: Historical data is available when auditors ask. Temporal queries return accurate results.
- Migrations are less painful: A schema designed for change can be modified without rebuilding the entire system.
The cost of getting schema right is paid once, during design. The cost of getting it wrong is paid repeatedly, on every feature, every report, every migration, for the lifetime of the system.
Further Reading
- PostgreSQL Constraints - Official documentation for CHECK, UNIQUE, and EXCLUSION constraints.
- PostgreSQL Row Level Security - Database-enforced multi-tenancy isolation patterns.
- Use The Index, Luke - Comprehensive guide to database indexing strategies and query performance.
Design Your Data Model
We design data models that capture your business reality accurately. Schemas that handle the exceptions, edge cases, and "but sometimes..." scenarios that real businesses have. The foundation everything else builds on.
Let's talk about data modelling →