Designing Schemas for Messy Real-World Business Logic
The database schema is the foundation everything else builds on. Get it right, and your application flows naturally. Get it wrong, and you'll fight the data model for the lifetime of the system.
Good data modelling captures business reality accurately. It handles the exceptions, edge cases, and "but sometimes..." scenarios that real businesses have. Poor data modelling forces workarounds, creates data integrity issues, and makes simple queries complicated.
The Constraint: Business Logic is Messy
Tutorial schemas are clean. A user has many orders. An order has many line items. Foreign keys point where they should. The happy path works.
Then reality arrives. A customer merges with another company, and you need to combine their order histories without losing audit trails. A product gets discontinued, but existing orders still reference it. An invoice needs to reflect the price at the time of sale, not the current price. A manager needs to see "what did this record look like last Tuesday?"
These aren't edge cases. They're the normal state of business software that runs for more than six months. The schema either handles them gracefully or forces increasingly desperate workarounds in application code.
The pattern: Most schema problems aren't discovered during development. They surface six months later when someone asks for a report that the schema makes impossible, or when a business rule changes and you realise the data model assumed it never would.
Five Decisions That Shape Every Schema
Before diving into patterns and SQL, these are the decisions you will face on every project. The rest of this page gives you the frameworks to make them deliberately rather than by accident.
1. Normalisation level
Start normalised. Denormalise only for measured performance problems, using materialised views or snapshot-on-write. Document every trade-off.
2. Temporal data approach
Will anyone ask "what was true on this date?" If yes (and they will), choose between effective dating, snapshot-on-write, or system-versioned tables before you have live data.
3. Soft delete strategy
Soft deletes add query complexity and index bloat. Consider partial indexes, view abstractions, archive tables, or hard deletes with audit logging. For UK businesses, factor in GDPR right-to-erasure obligations.
4. Tenancy pattern
Row-level tenant_id, schema-per-tenant, or database-per-tenant. This decision cascades into every constraint, index, migration, and backup strategy.
5. Constraint philosophy
The database is the last line of defence. Encode business rules as constraints (CHECK, UNIQUE, EXCLUSION, foreign keys) rather than relying on application code alone. Application code has bugs. Constraints don't.
Domain Modelling: Start with the Language
Before writing any SQL, map the language of the business. What do they call things? An "account" might mean a customer to sales, a login to IT, and a billing entity to finance. A "project" in one department might overlap with "job" in another.
These aren't pedantic distinctions. They determine table names, relationship structures, and the queries you'll write for years. Get them wrong, and every developer who touches the codebase will need a translation guide.
Interview the domain experts
Talk to the people who do the work. They know the exceptions. "Usually orders have one customer, but sometimes we split billing between two." These details determine schema structure.
Build a glossary
Document terms and their meanings across departments. When sales says "account" and support says "account", do they mean the same entity? Resolve ambiguities before encoding them in tables.
Map the relationships
Draw the connections. Can a customer have multiple billing addresses? Can a line item span multiple orders? Can an employee report to multiple managers? Cardinality matters.
Find the lifecycle events
What happens when an order is cancelled? When a customer churns? When a product is discontinued? These events reveal requirements that happy-path discussions miss.
The result is a ubiquitous language: terms that mean the same thing in code, in the database, and in conversations with the business. When the code says Invoice, everyone knows what that means. No translation required. This shared vocabulary is essential for building a single source of truth.
Normalisation: The Naive Approach vs. Deliberate Design
The Naive Approach: Normalise Everything
Database courses teach normalisation as dogma. Third normal form. No redundant data. Every fact stored once. In isolation, this is correct.
In production, it creates problems. A simple "show me the order summary" query joins six tables. A report runs overnight because it aggregates across perfectly normalised tables. An API endpoint times out because it must reconstruct data from fifteen related records.
The Other Naive Approach: Denormalise Everything
The pendulum swings. "Normalisation is slow, let's just copy data everywhere." Now you have synchronisation bugs. The customer's name is different in the orders table than in the customers table. Someone updated one and not the other. Which is correct? No one knows.
The Better Pattern: Normalise First, Then Denormalise Deliberately
Start with a normalised design. Understand the canonical structure. Then, with specific, measured performance problems in hand, introduce denormalisation where the trade-offs are worth it.
| Scenario | Normalised Approach | When to Denormalise |
|---|---|---|
| Customer name on orders | Join to customers table | Copy at order time if the order must preserve the commercial record at the time of sale. This is about snapshot semantics, not join cost. |
| Order totals | SUM(line_items.price * quantity) | Store computed total if the calculation is expensive and totals are queried frequently |
| Product categories | Junction table for many-to-many | Keep the junction table as the source of truth. If reads are hot, add a materialised view or a cached display column. |
| Reporting aggregates | Calculate on demand | Materialised views or summary tables for dashboards that run constantly |
The key word is "deliberately". Every denormalisation is a trade-off: storage vs. computation, consistency vs. performance. Document why you made the trade-off. Future developers (including yourself in six months) will need to know.
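One of the safer middle grounds from the table above is a materialised view for reporting aggregates: the normalised tables remain the source of truth, and the pre-computed summary is refreshed on a schedule. A sketch in PostgreSQL, assuming the orders table (with total and created_at columns) shown later on this page; daily_sales_summary is a name chosen for illustration:

```sql
-- Materialised view for a reporting aggregate: computed once, refreshed on a schedule
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT created_at::date AS sales_date,
       COUNT(*)         AS order_count,
       SUM(total)       AS revenue
FROM orders
GROUP BY created_at::date;

-- A unique index lets REFRESH ... CONCURRENTLY run without blocking readers
CREATE UNIQUE INDEX idx_daily_sales_summary_date
    ON daily_sales_summary (sales_date);

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;
```

Because the view is derived, a sync bug can only ever make the summary stale, never corrupt the canonical data — a much cheaper failure mode than a denormalised copy drifting out of date.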
The Decision Framework: Normalised vs. Denormalised vs. JSONB
Three storage approaches, each with a specific sweet spot. The wrong choice costs you either performance or data integrity.
| Question | Normalised columns | Denormalised copy | JSONB column |
|---|---|---|---|
| Do you filter or sort by this data? | Yes. B-tree indexes, type checking, constraints. | Possible, but sync risk on every write. | Slower for most queries. GIN indexes work well for containment and existence checks, but range queries and sorting remain expensive compared to proper columns. |
| Does the structure vary per record? | No. Fixed columns cannot adapt. | No. Same limitation. | Yes. Form builder responses, API payloads, user-defined config. |
| Do you need referential integrity? | Yes. Foreign keys enforced by the database. | Partial. Copies break referential links. | No. Cannot reference other rows from inside JSON. |
| Is read performance the bottleneck? | Joins can be slow for complex aggregations. | Pre-computed values avoid joins. Use materialised views as a safer middle ground. | Single-row reads are fast, but filtering across rows is slow. |
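To make the JSONB row of the table concrete: a GIN index accelerates containment and existence checks, but not range comparisons on extracted values. A sketch assuming an orders table with a metadata JSONB column (as in the hybrid pattern later on this page):

```sql
-- GIN index supports containment (@>) and key-existence (?) checks on JSONB
CREATE INDEX idx_orders_metadata_gin ON orders USING gin (metadata);

-- Fast with the GIN index: does the document contain this key/value pair?
SELECT id FROM orders WHERE metadata @> '{"source": "webhook"}';

-- Still slow: a range comparison on an extracted value cannot use the GIN index
SELECT id FROM orders WHERE (metadata->>'retry_count')::int > 3;
```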
If you find yourself reaching for JSONB because you're unsure of the structure, pause. In most cases, the structure is known but the developer hasn't spoken to the domain experts yet. Talk to them first. The data model usually reveals itself.
Temporal Data: The Problem Everyone Ignores Until It's Too Late
Business data changes over time. Prices change. Employees change roles. Customers change addresses. A schema that only stores current state will eventually fail someone who needs to answer: "What was true on this date?"
The Constraint
An auditor asks: "What was this customer's credit limit when this order was approved?" The order was placed in March. The credit limit was updated in April. Your schema stores only the current credit limit. You cannot answer the question.
Effective Dating
Records that are valid for a period use effective dates: effective_from and effective_to. The current record has effective_to = NULL or a far-future date.
-- Required for EXCLUDE with BIGINT inside a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;
-- Price history table
CREATE TABLE product_prices (
id BIGINT PRIMARY KEY,
product_id BIGINT REFERENCES products(id),
price DECIMAL(10,2) NOT NULL,
effective_from DATE NOT NULL,
effective_to DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT no_overlap EXCLUDE USING gist (
product_id WITH =,
daterange(effective_from, effective_to, '[)') WITH &&
)
);
The exclusion constraint (PostgreSQL) prevents overlapping date ranges for the same product. This is the database enforcing business rules that application code would otherwise need to check on every insert.
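Changing a price under this scheme means closing the open period and opening a new one. A sketch with illustrative ids and values — if the two statements accidentally produce overlapping ranges, the exclusion constraint rejects the insert:

```sql
-- Hypothetical price change taking effect on 2025-06-01
BEGIN;

-- Close the currently open period
UPDATE product_prices
SET effective_to = DATE '2025-06-01'
WHERE product_id = 123
  AND effective_to IS NULL;

-- Open the new period; '[)' semantics mean the ranges meet but do not overlap
INSERT INTO product_prices (id, product_id, price, effective_from)
VALUES (1002, 123, 24.99, DATE '2025-06-01');

COMMIT;
```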
Querying Effective-Dated Data
To find the price on a specific date:
SELECT price
FROM product_prices
WHERE product_id = 123
AND daterange(effective_from, effective_to, '[)') @> DATE '2025-03-15';
This pattern applies to any data that changes over time: employee roles, customer addresses, tax rates, exchange rates, pricing tiers.
Snapshot vs. Temporal Tables
Snapshot approach
Copy relevant data at transaction time. The order stores the price, customer name, and address as they were when the order was placed. Simple to query. Duplicates data.
Use when: You need fast reads and the data is copied at a clear point in time (order placement, invoice generation).
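A snapshot-on-write sketch, combining this with the effective-dated prices above. The order_line_items table and its columns are illustrative, not part of the schema shown earlier:

```sql
-- Copy the current name and price into the line item at order time,
-- so the order preserves the commercial record as it was at sale
INSERT INTO order_line_items (order_id, product_id, product_name, unit_price, quantity)
SELECT 789, p.id, p.name, pp.price, 2
FROM products p
JOIN product_prices pp
  ON pp.product_id = p.id
 AND daterange(pp.effective_from, pp.effective_to, '[)') @> CURRENT_DATE
WHERE p.id = 123;
```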
Temporal tables
Maintain history in a separate table or with effective dates. Query "as of" any point in time. More complex queries. Row versions are duplicated into the history table, but the source of truth remains a single record at any given point.
Use when: You need to query historical state at arbitrary points, or when the same data is referenced by many records.
SQL Server offers system-versioned temporal tables that handle history automatically. PostgreSQL takes a different approach: use effective-dated rows, history tables, or triggers. PostgreSQL 18 adds temporal constraints such as WITHOUT OVERLAPS, but does not generate history tables automatically. For MySQL or SQLite, you'll implement temporal patterns manually.
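A minimal trigger-based history sketch for PostgreSQL. It assumes a customers table exists; the history table and function names are illustrative:

```sql
-- History table mirrors the source, plus a timestamp for when the version ended
CREATE TABLE customers_history (
    LIKE customers,
    superseded_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE FUNCTION customers_record_history() RETURNS TRIGGER AS $$
BEGIN
    -- Copy the outgoing row version; superseded_at takes its default
    INSERT INTO customers_history SELECT OLD.*;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;  -- allow the delete to proceed
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_capture_history
BEFORE UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION customers_record_history();
```

Every update or delete leaves the previous row version behind, so "as of" queries become a scan of customers_history filtered by superseded_at.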
Polymorphic Relationships: Several Implementation Options
A "note" that can belong to either a customer or an order. An "attachment" linked to invoices, projects, or support tickets. These are polymorphic relationships: one table relates to multiple other tables.
The Naive Approach: Morph Columns
Laravel's morphTo pattern stores notable_type and notable_id:
CREATE TABLE notes (
id BIGINT PRIMARY KEY,
notable_type VARCHAR(255), -- 'App\Models\Customer' or 'App\Models\Order'
notable_id BIGINT,
content TEXT,
created_at TIMESTAMP
);
This works at the application level but sacrifices database integrity. There's no foreign key constraint. notable_id might reference a row that doesn't exist. The database cannot enforce referential integrity.
The Better Pattern: Multiple Foreign Keys
Use nullable foreign keys for each possible parent:
CREATE TABLE notes (
id BIGINT PRIMARY KEY,
customer_id BIGINT REFERENCES customers(id),
order_id BIGINT REFERENCES orders(id),
content TEXT,
created_at TIMESTAMP,
CONSTRAINT exactly_one_parent CHECK (
(customer_id IS NOT NULL)::int +
(order_id IS NOT NULL)::int = 1
)
);
The check constraint ensures exactly one parent is set. The foreign keys ensure the parent exists. The database enforces the business rules.
The Junction Table Pattern
For true many-to-many polymorphism, use separate junction tables:
CREATE TABLE customer_notes (
customer_id BIGINT REFERENCES customers(id),
note_id BIGINT REFERENCES notes(id),
PRIMARY KEY (customer_id, note_id)
);
CREATE TABLE order_notes (
order_id BIGINT REFERENCES orders(id),
note_id BIGINT REFERENCES notes(id),
PRIMARY KEY (order_id, note_id)
);
More tables, but full referential integrity and the ability for a note to belong to multiple parents if the business requires it.
Trade-off: Morph columns are convenient for rapid development. Multiple foreign keys are safer for production systems where data integrity matters. Choose based on how much you trust your application code to never make mistakes.
Soft Deletes and Archival Strategies
Deleting a record removes it permanently. Sometimes that's correct. Often it isn't. Audit requirements, reporting needs, and the ability to undo mistakes all argue for soft deletes.
The Implementation
ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMP NULL;
A record with deleted_at IS NULL is active. A record with a timestamp is soft-deleted. Every query that should exclude deleted records needs:
SELECT * FROM customers WHERE deleted_at IS NULL;
The Problem: Query Complexity
Every developer must remember to add the filter. Every query, every join, every subquery. Miss one, and deleted records appear in reports. ORMs like Laravel handle this with global scopes, but raw queries bypass them.
The Better Pattern: Partial Indexes and Views
Create a view for active records:
CREATE VIEW active_customers AS
SELECT * FROM customers WHERE deleted_at IS NULL;
Application code queries the view by default. The underlying table is available when you need historical data.
Add a partial index for queries that filter on active records:
CREATE INDEX idx_customers_active_email
ON customers(email)
WHERE deleted_at IS NULL;
This index only includes active records, making it smaller and faster for the common case.
Archival for Large Tables
Soft deletes keep everything in one table. For high-volume tables (millions of rows), this slows queries even with indexes. The archival pattern moves old data to separate tables:
-- Archive table with defaults and CHECK constraints (does not copy indexes or foreign keys)
CREATE TABLE orders_archive (LIKE orders INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
-- Atomic move using DELETE ... RETURNING
WITH moved_rows AS (
DELETE FROM orders
WHERE created_at < NOW() - INTERVAL '2 years'
RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM moved_rows;
The CTE deletes and inserts in a single atomic operation. Active queries hit the smaller table. Historical reports union both tables when needed. For very large tables, consider PostgreSQL's native range partitioning: if designed up front, the query planner automatically prunes old partitions and routes new rows to the correct partition. It won't archive data for you, but it removes the need for manual data movement.
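A sketch of that native range partitioning, using illustrative table names. The partitioning key must be chosen when the table is created, which is why this works best when designed up front:

```sql
-- Parent table declares the partitioning scheme but holds no rows itself
CREATE TABLE orders_partitioned (
    id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    total DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per year; new rows route automatically by created_at
CREATE TABLE orders_2024 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- The planner prunes partitions that cannot match the filter
SELECT COUNT(*) FROM orders_partitioned
WHERE created_at >= DATE '2025-01-01';
```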
Data Retention and GDPR Considerations
For UK businesses, soft deletes create a specific legal exposure. Under GDPR, a customer who exercises their right to erasure expects their personal data to be deleted, not just hidden behind a deleted_at timestamp. The ICO's storage limitation principle requires that personal data is kept only as long as necessary for its original purpose.
This means your soft-delete strategy needs a data lifecycle plan: which records are merely hidden from normal use, which must be retained for legal or accounting purposes, and which must be anonymised or permanently erased on request.
A schema that treats "deleted" as a single concept will struggle with these requirements. Design the data lifecycle into the schema from the start, not as a compliance afterthought.
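One common approach is anonymisation: remove the personal fields while keeping the row for financial records that must be retained. A hedged sketch — the column names are illustrative, and whether anonymisation satisfies a given erasure request is a question for your legal and retention policy, not the schema:

```sql
-- Erasure sketch: strip personal data but keep the row for accounting history
UPDATE customers
SET name       = 'Erased per GDPR request',
    email      = NULL,
    phone      = NULL,
    deleted_at = CURRENT_TIMESTAMP
WHERE id = 42;
```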
JSON Columns vs. Relational Tables
PostgreSQL's JSONB and MySQL's JSON columns offer schema flexibility within a relational database. Used correctly, they simplify certain patterns. Used incorrectly, they recreate all the problems of document databases without the benefits.
Good Uses for JSON Columns
Configuration that varies by record
Each customer has different notification preferences. Each product has different configurable options. The structure varies, and you don't query by these fields. Store as JSON, validate in application code.
External API responses
You store webhook payloads or API responses for debugging and audit purposes. You don't query by their contents. JSON preserves the original structure without schema maintenance.
Truly variable structures
Form builder responses where each form has different fields. Survey answers where questions vary. The structure is user-defined and cannot be predicted at schema design time.
Poor Uses for JSON Columns
Data you filter or sort by
Extracting values from JSON for WHERE clauses is slow. Even with JSON path indexes, it's slower than a proper column. If you query by it, it should be a column.
Data with consistent structure
If every record has the same fields, use columns. You get type checking, constraints, and indexing. JSON provides flexibility you're not using.
Relationships between records
Storing foreign keys in JSON arrays prevents referential integrity. The database cannot enforce that referenced records exist. Use junction tables for relationships.
The Hybrid Pattern
Use columns for data you query, JSON for data you don't:
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
customer_id BIGINT REFERENCES customers(id),
status VARCHAR(50) NOT NULL, -- Queried frequently
total DECIMAL(10,2) NOT NULL, -- Queried frequently
shipping_address_id BIGINT REFERENCES addresses(id), -- Relational
metadata JSONB, -- Webhook data, notes, flexible config
created_at TIMESTAMP
);
-- Index the columns you query
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Optional: index specific JSON paths if you must query them
CREATE INDEX idx_orders_source ON orders((metadata->>'source'));
This gives you relational rigour where it matters and flexibility where it doesn't.
Indexing Strategies for Common Query Patterns
Indexes speed up reads and slow down writes. Every index must earn its place. The strategy depends on your query patterns, not abstract best practices.
The Naive Approach: Index Everything
Add an index to every column. Queries are fast. Inserts and updates are slow. The database maintains indexes it never uses. Storage grows.
The Other Naive Approach: Index Nothing
Let the database figure it out. It won't. Full table scans on every query. The application slows to a crawl as data grows.
The Better Pattern: Index for Your Queries
Analyse your actual query patterns. Index the columns that appear in WHERE clauses, JOIN conditions, and ORDER BY clauses. Use EXPLAIN ANALYZE to verify indexes are being used.
Single-Column Indexes
The basics. Index columns you filter by:
-- If you frequently query orders by status
CREATE INDEX idx_orders_status ON orders(status);
-- If you frequently query customers by email
CREATE UNIQUE INDEX idx_customers_email ON customers(email);
Composite Indexes
For queries that filter on multiple columns:
-- Queries like: WHERE customer_id = ? AND status = ?
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
-- Column order matters: this index helps queries that filter by customer_id alone,
-- but not queries that filter only by status
The leftmost columns in a composite index can satisfy queries independently. A (customer_id, status, created_at) index helps queries filtering by customer_id, or by customer_id + status, or by all three. It does not help queries filtering only by status or only by created_at.
Covering Indexes
Include columns the query selects to avoid table lookups:
-- Query: SELECT status, total FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering
ON orders(customer_id) INCLUDE (status, total);
The index contains all the data the query needs. PostgreSQL can use an index-only scan and avoid most heap lookups when the visibility map allows it. Faster reads at the cost of larger indexes.
Partial Indexes
Index a subset of rows:
-- Most queries only care about active orders
CREATE INDEX idx_orders_active_customer
ON orders(customer_id)
WHERE status NOT IN ('cancelled', 'completed');
-- Most queries exclude soft-deleted records
CREATE INDEX idx_customers_active
ON customers(email)
WHERE deleted_at IS NULL;
Smaller indexes. Faster maintenance. Only useful if your queries include the same filter condition.
Expression Indexes
Index computed values:
-- Queries like: WHERE LOWER(email) = ?
CREATE INDEX idx_customers_email_lower ON customers(LOWER(email));
-- Queries like: WHERE DATE(created_at) = ?
CREATE INDEX idx_orders_created_date ON orders(DATE(created_at));
The query must use the exact same expression for the index to apply.
Measure, don't guess: Use EXPLAIN ANALYZE to see which indexes are actually used. Check pg_stat_user_indexes (PostgreSQL) to find unused indexes. Remove indexes that cost more than they help.
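Finding unused indexes in PostgreSQL is a single query against the statistics views. This reports indexes that have never been scanned since statistics were last reset, largest first:

```sql
-- Indexes with zero scans are candidates for removal
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Check that statistics cover a representative period (including month-end and year-end jobs) before dropping anything the query surfaces.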
State Machines and Status Fields
Orders move through states: draft, submitted, approved, shipped, completed, cancelled. Modelling state transitions correctly prevents impossible states and provides audit trails.
The Naive Approach: Status Column Only
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'draft';
The application updates the status. No record of when transitions happened. No enforcement of valid transitions. An order can jump from "draft" to "shipped" if the code has a bug.
The Better Pattern: Status with Transition History
CREATE TABLE order_status_transitions (
id BIGINT PRIMARY KEY,
order_id BIGINT REFERENCES orders(id),
from_status VARCHAR(50),
to_status VARCHAR(50) NOT NULL,
transitioned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
transitioned_by BIGINT REFERENCES users(id),
notes TEXT
);
-- The orders table stores current status for fast queries
ALTER TABLE orders ADD COLUMN current_status VARCHAR(50) DEFAULT 'draft';
ALTER TABLE orders ADD COLUMN current_status_at TIMESTAMP;
Insert a transition record whenever status changes. The orders table has the current status for filtering. The transitions table has the complete history for audit and reporting. For system-wide change history, see our audit trails guide.
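Recording a transition means updating the current status and appending to the history in the same transaction, so the two can never disagree. A sketch with placeholder ids:

```sql
BEGIN;

-- Update the denormalised current status for fast filtering
UPDATE orders
SET current_status    = 'submitted',
    current_status_at = CURRENT_TIMESTAMP
WHERE id = 789;

-- Append the transition to the audit history
INSERT INTO order_status_transitions (id, order_id, from_status, to_status, transitioned_by)
VALUES (5001, 789, 'draft', 'submitted', 42);

COMMIT;
```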
Valid Transition Enforcement
Define which transitions are allowed:
CREATE TABLE valid_status_transitions (
from_status VARCHAR(50),
to_status VARCHAR(50),
PRIMARY KEY (from_status, to_status)
);
INSERT INTO valid_status_transitions VALUES
('draft', 'submitted'),
('submitted', 'approved'),
('submitted', 'rejected'),
('approved', 'shipped'),
('shipped', 'completed'),
('draft', 'cancelled'),
('submitted', 'cancelled');
Application code checks this table before allowing a transition. Or use a trigger to enforce it at the database level:
CREATE FUNCTION check_status_transition() RETURNS TRIGGER AS $$
BEGIN
-- Skip check if status hasn't changed
IF NEW.current_status IS NOT DISTINCT FROM OLD.current_status THEN
RETURN NEW;
END IF;
IF NOT EXISTS (
SELECT 1 FROM valid_status_transitions
WHERE from_status = OLD.current_status
AND to_status = NEW.current_status
) THEN
RAISE EXCEPTION 'Invalid status transition from % to %',
OLD.current_status, NEW.current_status;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER orders_check_status_transition
BEFORE UPDATE OF current_status ON orders
FOR EACH ROW
EXECUTE FUNCTION check_status_transition();
The trigger fires on every status update. If the transition isn't in the allowed list, the database rejects it. This prevents bugs in application code from creating invalid states.
Multi-Tenancy Patterns
Multiple clients sharing one database. Three approaches, each with trade-offs.
Tenant ID on every table
Add tenant_id to every tenant-owned table. Scope unique constraints by tenant, include tenant_id in tenant-owned foreign keys, and enforce isolation with RLS. Risk of data leakage if a query misses the filter and RLS is not enabled.
Best for: Shared infrastructure, cost-sensitive deployments, tenants with similar data volumes.
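Scoping constraints by tenant looks like this in practice. The order_items table is illustrative; the composite foreign key requires a matching UNIQUE constraint on the parent:

```sql
-- The same email may exist under different tenants, but not twice in one
ALTER TABLE customers
    ADD CONSTRAINT uniq_customers_tenant_email UNIQUE (tenant_id, email);

-- Composite FK keeps child rows in the same tenant as their parent
-- (requires UNIQUE (tenant_id, id) on orders)
ALTER TABLE order_items
    ADD CONSTRAINT fk_order_items_tenant_order
    FOREIGN KEY (tenant_id, order_id) REFERENCES orders (tenant_id, id);
```

The composite foreign key closes a subtle hole: without it, a row could reference a parent belonging to a different tenant while each column looked individually valid.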
Separate schemas
Each tenant gets their own PostgreSQL schema. Queries don't need tenant filters. Schema changes must be applied to all schemas.
Best for: Moderate isolation requirements, tenants who want some data separation guarantees.
Separate databases
Complete isolation. Each tenant has their own database. Maximum security. Higher operational cost. Cross-tenant queries require application-level aggregation.
Best for: Strict compliance requirements, large enterprise tenants, data residency requirements.
Implementing Tenant ID Safely
If using the tenant ID pattern, use Row Level Security (PostgreSQL) to enforce isolation at the database level. We cover multi-tenancy implementation in more detail in our multi-tenant Laravel guide.
-- Enable RLS (applies to non-owner roles)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- FORCE applies RLS even to the table owner (critical for admin operations)
ALTER TABLE orders FORCE ROW LEVEL SECURITY;
-- Policy restricts both reads (USING) and writes (WITH CHECK)
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.current_tenant_id')::bigint)
WITH CHECK (tenant_id = current_setting('app.current_tenant_id')::bigint);
-- Set tenant context at connection time
SET app.current_tenant_id = '123';
Without FORCE ROW LEVEL SECURITY, the table owner bypasses all RLS policies. This is a common source of data leakage during admin operations, migrations, and background jobs. The WITH CHECK clause prevents a tenant from inserting or updating rows belonging to another tenant. Even if application code forgets the WHERE clause, the database enforces tenant isolation.
Constraints: Your Last Line of Defence
Application code has bugs. ORMs have bugs. API consumers send unexpected data. Constraints are the database saying "this data is invalid" regardless of how it arrived.
Constraint Types
- NOT NULL: Column must have a value. Prevents incomplete records.
- UNIQUE: No duplicate values. Email addresses, order numbers, SKUs.
- FOREIGN KEY: Referenced row must exist. Prevents orphaned records.
- CHECK: Custom validation. CHECK (price >= 0), CHECK (end_date > start_date).
- EXCLUSION: No overlapping values. Date ranges, scheduled resources, room bookings.
-- btree_gist required for BIGINT inside a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE bookings (
id BIGINT PRIMARY KEY,
room_id BIGINT REFERENCES rooms(id) NOT NULL,
starts_at TIMESTAMP NOT NULL,
ends_at TIMESTAMP NOT NULL,
booked_by BIGINT REFERENCES users(id) NOT NULL,
-- End must be after start
CONSTRAINT valid_time_range CHECK (ends_at > starts_at),
-- No double-booking (PostgreSQL with btree_gist extension)
CONSTRAINT no_overlap EXCLUDE USING gist (
room_id WITH =,
tsrange(starts_at, ends_at) WITH &&
)
);
The database prevents double-bookings. No race conditions in application code. No "oops, two people booked the same room" support tickets.
Quick Reference: Choosing the Right Constraint
| Business Rule | Constraint Type | Example |
|---|---|---|
| Every order must have a customer | FOREIGN KEY + NOT NULL | customer_id BIGINT NOT NULL REFERENCES customers(id) |
| No two customers share an email | UNIQUE | UNIQUE(email) |
| Prices cannot be negative | CHECK | CHECK (price >= 0) |
| End date must be after start date | CHECK | CHECK (ends_at > starts_at) |
| No overlapping booking times | EXCLUSION (PostgreSQL) | EXCLUDE USING gist (room_id WITH =, tsrange(starts_at, ends_at) WITH &&) |
| No overlapping price periods | EXCLUSION (PostgreSQL) | EXCLUDE USING gist (product_id WITH =, daterange(effective_from, effective_to) WITH &&) |
| Status must be a known value | CHECK or ENUM | CHECK (status IN ('draft','submitted','approved')) or PostgreSQL native ENUM |
The investment: A missing constraint in a customer database allowed duplicate entries for years. Cleaning up the resulting data cost more than the original development. Constraints are cheap. Data problems are expensive.
Anti-Patterns to Recognise
Patterns that seem reasonable but cause problems at scale or over time.
- The god table: one table accumulating dozens of columns for several unrelated concepts, so every feature depends on it and every change risks all of them.
- Entity-Attribute-Value (EAV): rows of (entity_id, attribute_name, attribute_value). Seems flexible. Actually makes queries impossible to optimise and prevents meaningful constraints.
- Comma-separated values in a column (tags = 'red,blue,green'). Cannot be indexed properly. Cannot enforce referential integrity. Use a junction table.
- JSONB overuse: consistently structured, frequently queried data stuffed into JSON columns, losing type checking, constraints, and index performance.
- String foreign keys: referencing rows by name or code rather than a stable surrogate key, so a rename breaks every reference.
If You Inherited One of These Patterns
Recognising anti-patterns is only half the work. The harder part is migrating away from them on a live system without breaking production. Here's the general approach.
God table: Split by concept
Create new tables for each concept the god table represents. Add foreign keys back to the original. Run a dual-write period where both the old table and new tables receive data. Migrate historical data in batches. Cut over when the new tables are verified. Drop the old columns last.
EAV: Extract to typed columns
Identify the most-queried attributes. Add them as proper columns with constraints. Backfill from the EAV rows. Once verified, stop writing to EAV for those attributes. Keep the EAV table for genuinely variable attributes (if any exist), remove it entirely when it's empty.
JSONB overuse: Promote to columns
For each JSON path you filter or sort by, add a generated column or a proper column with a backfill migration. Add constraints and indexes on the new columns. Update application code to read from columns instead of JSON. Keep JSON only for data that genuinely varies per record.
String foreign keys: Add surrogate keys
Add an integer or UUID primary key to the referenced table. Add a new foreign key column to each referencing table. Backfill the new column using the string match. Switch application code to use the new key. Drop the string column when nothing references it.
Every remediation follows the same pattern: add the new structure alongside the old, dual-write, backfill, verify, cut over, clean up. Rushing any step risks data loss or downtime. Plan for each migration to take longer than you expect, and test with a copy of production data first.
The Business Link: Why This Matters
Schema work isn't glamorous. It doesn't produce visible features. The payoff is indirect but substantial.
- Fewer bugs reach production. Constraints catch invalid data before it corrupts reports and confuses users.
- Queries stay fast as data grows. Proper indexes and structure mean the system that works with 10,000 records still works with 10 million.
- New features are easier to build. A well-structured schema makes new queries straightforward. A poorly-structured one makes every feature a battle.
- Reports answer the actual questions. Historical data is available when auditors ask. Temporal queries return accurate results.
- Migrations are less painful. A schema designed for change can be modified without rebuilding the entire system.
The cost of getting schema right is paid once, during design. The cost of getting it wrong is paid repeatedly, on every feature, every report, every migration, for the lifetime of the system.
Further Reading
- PostgreSQL Constraints - Official documentation for CHECK, UNIQUE, and EXCLUSION constraints.
- PostgreSQL Range Types - Foundation for effective-dated data and exclusion constraints on date and timestamp ranges.
- PostgreSQL Row Level Security - Database-enforced multi-tenancy isolation patterns.
- PostgreSQL Partial Indexes - Index subsets of rows for soft-delete filtering and active-record queries.
- Use The Index, Luke - In-depth guide to database indexing strategies and query performance.
- Laravel Migrations - Schema versioning and migration tooling for implementing these patterns.
- ICO Storage Limitation Guidance - UK data retention requirements relevant to soft delete and archival strategies.
Design Your Data Model
We design data models that capture your business reality accurately. Schemas that handle the exceptions, edge cases, and "but sometimes..." scenarios that real businesses have. The foundation everything else builds on.
Let's talk about data modelling →