API Integrations

Connecting Systems That Weren't Designed to Talk


Modern businesses run on connected systems. Your application needs to talk to payment processors, email services, accounting software, CRMs, and a dozen other external services. Each connection is an API integration, and each one is a potential point of failure.

Good API integrations are invisible. Data flows between systems automatically. Bad integrations create operational headaches: failed payments that don't retry, missing customer data, orders that vanish between systems. The difference is in the engineering, not the choice of provider. This is a critical consideration when thinking through build vs buy decisions for your systems.



The Fundamental Constraint

Every integration has the same underlying problem: you are coupling your system to something you do not control. The external service can change behaviour, go offline, rate limit you, return unexpected data, or deprecate the endpoint you depend on. Your code runs on your infrastructure, but its correctness depends on infrastructure you cannot see.

This is not a solvable problem in the sense that you can eliminate it. It is a constraint you design around. The goal is not to prevent external services from failing. The goal is to ensure that when they fail (and they will), your system continues to operate in a predictable, recoverable state.

The constraint: You cannot control what the other side does. You can only control how your system responds to what the other side does.


Why Integrations Are Hard

Before diving into patterns and implementation, it is worth being explicit about what makes integrations difficult. These are not edge cases. They are the normal operating conditions you should expect.

You don't control the other side

External APIs change. Providers update their systems, modify response formats, deprecate endpoints. Sometimes they notify you. Sometimes they do not. We build integrations that handle versions we didn't anticipate and degrade gracefully when fields disappear.

Networks are unreliable

Requests fail. Timeouts happen. Services go down. Packets get lost. DNS resolves incorrectly. TLS handshakes fail. An integration that works perfectly in development can fail unpredictably in production. We assume failure and design for it.

Data doesn't map cleanly

Your system's concept of a "customer" might not match theirs. Field names differ. Required fields in one system are optional in another. Date formats vary. Currency precision differs. Data transformation is always more complex than it first appears.

Errors are ambiguous

Did the request succeed? Fail? Partially succeed? A timeout does not mean failure. A 500 error might have processed the request before erroring. Different APIs communicate errors differently. We build error handling that interprets responses correctly regardless of how they're formatted.


The Naive Approach (What Goes Wrong)

Most integration code starts the same way. A developer reads the API documentation, writes a function that makes an HTTP request, parses the response, and moves on. The code works in development. It works in the first few weeks of production. Then it fails.

The naive approach treats the external API like a local function call: synchronous, reliable, and consistent. This assumption is false in every particular.

No timeout configured: HTTP client uses default timeout (often 30 seconds or infinite). A hung connection blocks threads, exhausts connection pools, and cascades into system-wide failure.
No retry logic: Transient failures (network blips, momentary overloads) become permanent failures. Users retry manually, creating duplicate operations.
No idempotency: Retries create duplicate records, duplicate charges, duplicate emails. The customer gets charged twice. The order gets placed twice.
Inline execution: API calls happen in the request/response cycle. Slow external APIs make your application slow. Failing external APIs make your application fail.
No logging: When something fails, no one knows what was sent, what was received, or how long it took. Debugging becomes guesswork.
Tight coupling: API client code is scattered throughout the codebase. When the external API changes, dozens of files need updating.

These are not mistakes made by junior developers. They are the natural result of building features under time pressure without explicit requirements for resilience. The code does what it was asked to do: call the API and return the result. The problem is that "call the API and return the result" is not sufficient specification for production systems.

Scenario | Naive approach | Production impact
Timeout | Wait indefinitely | Thread pool exhaustion, cascading failure
5xx error | Return error to user | User retries manually, creates duplicates
Rate limit | Return error to user | Operation fails permanently
Malformed response | Crash with parsing error | Entire request fails, no partial recovery
Credential expiry | 401 error, operation fails | All operations fail until manual intervention

How We Build Reliable Integrations

Reliable integrations require explicit design decisions about failure modes. Every external call needs answers to specific questions before writing code.

We assume failure

Every API call can fail. We design for it from the start. Before implementing any integration, we answer these questions:

  • What happens if this request times out after 5 seconds? After 30 seconds?
  • What if the service is down for five minutes? For an hour? For a day?
  • What if the response is malformed or missing expected fields?
  • What if we get rate-limited?
  • What if the request succeeds but we never receive the response?
  • What if the same request is submitted twice?

Integrations that assume success work in demos. Integrations that assume failure work in production.
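
As a rough sketch of what those answers look like in code, using the Laravel HTTP client referenced in Further Reading (the endpoint, config key, and order fields are placeholders, not a real provider):

```php
use Illuminate\Http\Client\ConnectionException;
use Illuminate\Http\Client\RequestException;
use Illuminate\Support\Facades\Http;

try {
    $response = Http::withToken(config('services.provider.key'))
        ->timeout(5)      // fail fast instead of blocking a worker indefinitely
        ->retry(3, 200)   // up to 3 attempts, 200ms between them
        ->post('https://api.example.com/v1/orders', [
            'reference' => $order->reference,
            'amount'    => $order->amount,
        ]);
} catch (ConnectionException | RequestException $e) {
    // All attempts failed (timeout or error response). Decide explicitly:
    // queue for later, alert, or surface to the user; never fail silently.
    report($e);
    throw $e;
}
```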

We make operations idempotent

If a request fails and we retry, the same operation must not happen twice. We use unique identifiers (idempotency keys) to prevent duplicate charges, duplicate orders, duplicate records.

Why this matters: A payment integration without idempotency means customers can get charged twice when a timeout triggers a retry. Stripe processes the payment, the response times out on the network, your system retries, Stripe processes again. The customer sees two charges. That is a customer service problem, a chargeback risk, and a regulatory issue.

Idempotency works by sending a unique key with each operation. If the external system receives the same key twice, it returns the result of the first operation without re-executing. Most payment processors and financial APIs support this. For APIs that do not, we implement idempotency on our side by tracking operations in a database before sending them.
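
A minimal sketch against Stripe's payment intents endpoint (the order attributes and config key are placeholders); the same shape works for any provider that accepts an idempotency header:

```php
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Str;

// Generate the key once, when the operation is first created, persist it
// with the order, and reuse it on every retry. The provider then executes
// the charge at most once, no matter how many times this request is sent.
$order->idempotency_key ??= (string) Str::uuid();
$order->save();

$response = Http::withToken(config('services.stripe.secret'))
    ->withHeaders(['Idempotency-Key' => $order->idempotency_key])
    ->timeout(10)
    ->asForm()
    ->post('https://api.stripe.com/v1/payment_intents', [
        'amount'   => $order->amount_in_pence,
        'currency' => 'gbp',
    ]);
```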

We log everything

When an integration fails, we need to debug it. That means knowing exactly what happened, not reconstructing it from memory or guessing from error messages.

We log:

  • Request details: Endpoint, method, headers (sanitised), body (sanitised), timestamp
  • Response details: Status code, headers, body, timestamp
  • Timing: Total request duration, connection time, time to first byte
  • Outcome: Success, failure type, retry count, final disposition
  • Context: Correlation ID linking the request to the business operation that triggered it

Structured logs make troubleshooting possible. When something goes wrong at 3am, the logs tell us what happened. We can replay the request, compare it to successful requests, and identify the failure mode without guessing. For more on maintaining complete operational records, see our approach to audit trails.

Sanitisation is critical. Logs must not contain API keys, passwords, credit card numbers, or personal data. We strip these before logging and replace them with masked placeholders.
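
A sketch of what one such log entry might look like (endpoint, log name, and fields are illustrative):

```php
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Str;

$correlationId = (string) Str::uuid();   // also attached to the business operation
$start = microtime(true);

$response = Http::withToken(config('services.provider.key'))
    ->timeout(5)
    ->get('https://api.example.com/v1/customers/42');

Log::info('provider.customer_lookup', [
    'correlation_id' => $correlationId,
    'method'         => 'GET',
    'endpoint'       => '/v1/customers/42',
    'status'         => $response->status(),
    'duration_ms'    => (int) round((microtime(true) - $start) * 1000),
    'outcome'        => $response->successful() ? 'success' : 'failure',
    // Sanitised: never the Authorization header, API keys, or personal data.
    'authorization'  => '***redacted***',
]);
```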

We isolate external dependencies

We wrap external API calls in abstraction layers. The rest of the codebase interacts with our wrapper, not the external API directly. When the external API changes, we change the wrapper. Every other file in the codebase remains unchanged.

This pattern also allows us to:

  • Mock the external API for testing without hitting real endpoints
  • Swap providers without touching business logic
  • Add logging, retry logic, and circuit breakers in one place
  • Version our interface independently of the external API's versioning
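
In practice the wrapper is a small interface plus one concrete class per provider. A sketch, with all names purely illustrative:

```php
// Business code depends on this interface, never on a provider SDK or raw endpoint.
interface PaymentGateway
{
    // Returns the provider's charge identifier on success.
    public function charge(string $idempotencyKey, int $amountInPence, string $currency): string;

    public function refund(string $providerChargeId, int $amountInPence): void;
}

// One implementation per provider (e.g. a StripeGateway, an AdyenGateway).
// Timeouts, retries, idempotency headers, logging, and the circuit breaker
// all live inside those classes, in one place.
//
// Swapping providers, or faking the gateway in tests, is then a single
// container binding in a Laravel service provider:
//
//     $this->app->bind(PaymentGateway::class, StripeGateway::class);
```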

Integration Patterns We Use

Different integration scenarios call for different patterns. The choice depends on latency requirements, reliability needs, and the capabilities of the external system.

Synchronous request-response

The simplest pattern: send a request, wait for a response, proceed based on the result. Used for real-time data lookups, synchronous operations where the user is waiting (payment authorisation, address validation), and quick requests where latency matters.

The risk: your application blocks waiting. We mitigate with aggressive timeouts, circuit breakers, and fallback behaviour. If the external service is slow, your application must not become slow.

Asynchronous with webhooks

Send a request, receive acknowledgement that it was queued, then receive results via webhook callback. Used for operations that take time (document processing, batch imports, background checks) or events you need to react to (payment confirmations, subscription changes).

We build webhook handlers that verify authenticity, acknowledge quickly, process asynchronously, and handle duplicates gracefully.

Polling

Periodically check an external system for updates. Used when the external system does not support webhooks, when webhook delivery is unreliable, or when you need to sync data that changes outside your control.

We implement polling with exponential backoff during quiet periods, proper pagination for large datasets, and state tracking to process only changed records.

Queue-based processing

Decouple the triggering of an operation from its execution. The request goes into a queue, a background worker processes it, and the result is stored or forwarded. Used for high-volume operations, operations that can tolerate latency, and operations where retry logic is complex. See our background jobs guide for implementation details.

Queues provide natural backpressure, retry semantics, and isolation from external failures.
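
A sketch of the shape this takes in Laravel, using a hypothetical job that pushes an invoice to an accounting API:

```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class SyncInvoiceToAccounting implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 5;                    // retried by the queue, not by the user
    public array $backoff = [60, 300, 900];   // seconds to wait between attempts

    public function __construct(public Invoice $invoice) {}

    public function handle(): void
    {
        // The slow, failure-prone external call happens here, on a worker,
        // via the abstraction layer described earlier.
    }
}

// From business code: fast, and safe to do inside a web request.
SyncInvoiceToAccounting::dispatch($invoice);
```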

Pattern selection matrix

Choosing the right pattern depends on your specific requirements. This matrix provides a starting point.

Requirement | Pattern | Trade-off
User waiting for result | Synchronous | Must handle timeouts gracefully
Operation takes >5 seconds | Async + webhook or polling | More complex state management
High volume (1000+ calls/minute) | Queue-based | Latency between trigger and execution
External system unreliable | Queue with dead letter | Need monitoring and manual review process
Need eventual consistency | Polling or event-driven | Data may be stale between sync cycles

Error Handling Strategies

Different errors require different responses. A network timeout needs retry logic. A 400 Bad Request needs investigation and possibly a code fix. A 429 rate limit needs backoff. Treating all errors the same leads to either excessive retries (wasting resources, hitting rate limits) or insufficient retries (failing operations that would succeed on retry).

Error classification

We classify errors into categories that determine handling behaviour.

Transient errors (retry immediately)

Network timeouts, connection refused, DNS resolution failures, 502/503/504 errors. These often resolve on retry. We retry with exponential backoff, up to a configured maximum.

Rate limit errors (retry with delay)

429 Too Many Requests, provider-specific rate limit responses. The request is valid but we are sending too many. We respect the Retry-After header if present, otherwise back off exponentially.

Client errors (do not retry)

400 Bad Request, 401 Unauthorised, 403 Forbidden, 404 Not Found, 422 Unprocessable Entity. The request is malformed or unauthorised. Retrying will not help. Log and alert for investigation.

Ambiguous errors (retry with idempotency)

Timeouts where we do not know if the request was processed, 500 errors that might have partially succeeded. These are the hardest. We retry with idempotency keys and verify state before proceeding.

Retry implementation

Retry logic has several parameters that need tuning for each integration.

  • Maximum retries: How many times to retry before giving up. Typically 3-5 for synchronous operations, higher for queued operations.
  • Base delay: Initial wait before first retry. Typically 100ms-1s.
  • Backoff multiplier: How much to increase delay between retries. Typically 2x (exponential backoff).
  • Maximum delay: Cap on delay to prevent waiting forever. Typically 30s-60s.
  • Jitter: Random variation in delay to prevent thundering herd. Add 0-25% random variation.
With those defaults, the schedule looks like this: attempt 1 immediately, retry 1 after 1s, retry 2 after 2s, retry 3 after 4s, retry 4 after 8s, then give up and move the operation to the dead letter queue.
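
Putting the parameters and the schedule above together, a generic retry helper might look like the sketch below (plain PHP; the exception class caught here is a placeholder for whatever your error classification treats as transient):

```php
function retryWithBackoff(callable $operation, int $maxRetries = 4): mixed
{
    $baseDelayMs = 1000;
    $maxDelayMs  = 30_000;

    for ($attempt = 0; ; $attempt++) {
        try {
            return $operation();
        } catch (RuntimeException $e) {          // placeholder for "transient error"
            if ($attempt >= $maxRetries) {
                throw $e;                        // exhausted: dead-letter and alert
            }

            // 1s, 2s, 4s, 8s... capped, plus 0-25% jitter so that many
            // workers retrying at once do not stampede the provider.
            $delayMs = min($baseDelayMs * (2 ** $attempt), $maxDelayMs);
            $delayMs += random_int(0, intdiv($delayMs, 4));

            usleep($delayMs * 1000);
        }
    }
}
```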

Circuit breaker pattern

If an external service is failing repeatedly, continuing to call it wastes resources and can cascade into broader system failure. The circuit breaker pattern detects repeated failures and stops calling the failing service for a period, giving it time to recover.

A circuit breaker has three states:

Closed

Normal operation. Requests flow through. Failures are counted. If failures exceed the threshold (e.g., 5 failures in 60 seconds), the circuit opens.


Open

Service assumed down. Requests fail immediately without calling the external service. After a timeout period (e.g., 30 seconds), the circuit moves to half-open.


Half-open

Testing recovery. A limited number of requests are allowed through. If they succeed, the circuit closes. If they fail, the circuit opens again.

Circuit breakers prevent cascading failures. Without them, a failing external service can bring down your entire application as requests queue up waiting for responses that never come. Thread pools exhaust. Memory fills. Other, healthy integrations start failing because they share resources with the failing one.
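
A deliberately small, cache-backed sketch of the idea; thresholds and key names are illustrative, and the open key simply expiring stands in for a fuller half-open implementation:

```php
use Illuminate\Support\Facades\Cache;

final class CircuitBreaker
{
    public function __construct(
        private string $service,
        private int $failureThreshold = 5,   // failures within the window
        private int $windowSeconds = 60,
        private int $openSeconds = 30,       // how long to fail fast
    ) {}

    public function available(): bool
    {
        // Open circuit: fail immediately without calling the external service.
        return ! Cache::has("circuit:{$this->service}:open");
    }

    public function recordSuccess(): void
    {
        Cache::forget("circuit:{$this->service}:failures");
    }

    public function recordFailure(): void
    {
        $key = "circuit:{$this->service}:failures";

        Cache::add($key, 0, $this->windowSeconds);   // start a rolling window
        if (Cache::increment($key) >= $this->failureThreshold) {
            // Trip the breaker. When the open key expires, the next request
            // is let through again: a crude half-open probe.
            Cache::put("circuit:{$this->service}:open", true, $this->openSeconds);
            Cache::forget($key);
        }
    }
}
```

The abstraction layer described earlier is the natural home for this: check available() before each call and record the outcome afterwards.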


Rate Limiting and Throttling

Most external APIs have rate limits. Exceed them and you get 429 errors, temporary bans, or degraded service. Some providers are generous. Others are strict. Either way, your integration needs to operate within the limits.

Understanding rate limits

Rate limits come in several forms:

  • Requests per second: Common for real-time APIs. Often 10-100 requests per second.
  • Requests per minute/hour: Common for batch APIs. Often 1,000-10,000 per hour.
  • Concurrent connections: Limits how many simultaneous requests you can have open.
  • Daily quotas: Hard limits that reset at midnight UTC. Common for expensive operations.
  • Per-endpoint limits: Different limits for different endpoints. Write operations often have stricter limits than read operations.

Rate limits may apply per API key, per IP address, per user, or per organisation. Read the documentation carefully. Test at scale before production.

Client-side throttling

We implement client-side throttling to stay within limits rather than hitting limits and dealing with 429 errors. This is more efficient and more reliable.

  • Token bucket: Track available "tokens" that replenish over time. Each request consumes a token. If no tokens are available, wait (see the sketch after this list).
  • Sliding window: Track requests in the last N seconds. If at limit, wait until oldest request falls out of window.
  • Request queuing: Queue requests and dispatch at a controlled rate. Smooth out bursts.
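
A minimal in-process token bucket, sketching the first approach above (with multiple workers you would back the counters with Redis rather than object state):

```php
final class TokenBucket
{
    private float $tokens;
    private float $lastRefill;

    public function __construct(
        private int $capacity = 10,            // burst size
        private float $refillPerSecond = 10.0, // sustained rate
    ) {
        $this->tokens = $capacity;
        $this->lastRefill = microtime(true);
    }

    public function take(): void
    {
        while (true) {
            $now = microtime(true);
            $this->tokens = min(
                $this->capacity,
                $this->tokens + ($now - $this->lastRefill) * $this->refillPerSecond
            );
            $this->lastRefill = $now;

            if ($this->tokens >= 1) {
                $this->tokens -= 1;
                return;          // a token is available: proceed with the request
            }

            usleep(50_000);      // no tokens: wait briefly, then re-check
        }
    }
}

// Before each outbound request: $bucket->take();
```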

Handling 429 responses

When you do hit a rate limit:

  • Parse the Retry-After header if present. It tells you exactly how long to wait.
  • If no header, use exponential backoff starting at 1 second.
  • Log rate limit events for monitoring. Frequent rate limits indicate a problem.
  • Consider queuing affected requests rather than retrying immediately.

Warning: Aggressive retry on rate limits makes the problem worse. If you are rate limited and immediately retry thousands of requests, you extend your ban. Queue and wait.
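
When a 429 does get through, handling can be as small as reading the header and re-queuing with a delay. A sketch, assuming $response is the client response from the rate-limited call and re-using the hypothetical job from the queue example (only the seconds form of Retry-After is handled here):

```php
if ($response->status() === 429) {
    // Retry-After may be absent or an HTTP date; fall back to at least 1s.
    $retryAfter = max(1, (int) $response->header('Retry-After'));

    SyncInvoiceToAccounting::dispatch($invoice)
        ->delay(now()->addSeconds($retryAfter));
}
```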


Webhook Implementation

Receiving webhooks from external services requires care. Webhooks invert the normal request/response model: the external service calls you. This means you are running code triggered by external events, on external timing, with external data.

Webhook security

Anyone can send an HTTP request to a public URL. Webhook endpoints must verify that requests actually come from the expected source.

Signature verification

Most webhook providers sign payloads using HMAC-SHA256 or similar. They include the signature in a header. You compute the expected signature using your shared secret and compare. If they do not match, reject the request. This is the strongest verification method.
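
The check itself is a few lines. A sketch, with the header name and config key as placeholders; always verify against the raw request body, not a re-encoded copy:

```php
function webhookSignatureIsValid(string $rawBody, string $signatureHeader): bool
{
    $expected = hash_hmac('sha256', $rawBody, config('services.provider.webhook_secret'));

    // hash_equals is a constant-time comparison; never use == or === here.
    return hash_equals($expected, $signatureHeader);
}
```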

IP allowlisting

Some providers publish the IP addresses their webhooks come from. You can reject requests from other IPs. This is defence in depth, not a primary control. IP addresses can be spoofed and providers may not keep their documentation updated.

Shared secrets in headers

Some providers send a secret token in a header that you configured during webhook setup. Verify it matches. Less secure than signature verification but better than nothing.

Webhook processing pattern

Webhook endpoints should acknowledge quickly and process later. If your processing takes 30 seconds, the webhook provider times out and retries, creating duplicate events.

1. Receive and verify. Validate signature. Return 400 if invalid.
2. Store raw payload. Save to database with timestamp and status "pending". This is your audit trail and recovery mechanism.
3. Acknowledge. Return 200 OK immediately. Total time: under 1 second.
4. Queue processing. Dispatch a background job to process the payload. Handle failures with retries.
Handling duplicate webhooks

Webhook providers retry failed deliveries. Your endpoint might receive the same event multiple times. Processing must be idempotent.

  • Store the event ID (most providers include one) before processing
  • Check if you have seen this event ID before
  • If seen, return 200 without reprocessing
  • Use database transactions or atomic operations to prevent race conditions

Some providers send the same event from multiple servers simultaneously for reliability. You may receive duplicates within milliseconds of each other. Simple "check then insert" logic has race conditions. Use database constraints or atomic upserts.


Authentication and Security

API credentials are keys to external systems. Compromise them and attackers can charge credit cards, send emails as you, access customer data, or delete resources. Security is not optional.

Credential management

API credentials never belong in code or version control. Ever. Not even in private repositories.

  • Environment variables: The baseline. Credentials injected at runtime, not stored in code.
  • Secret management systems: AWS Secrets Manager, HashiCorp Vault, Azure Key Vault. Centralised, audited, access-controlled storage for secrets.
  • Rotation support: Credentials should be rotatable without code changes or deployments. When a credential is compromised, you need to rotate immediately.

Good: API key loaded from environment variable or secret store at runtime.
Avoid: API key hardcoded in source file, config file committed to git, or shared in Slack.
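
In a Laravel application the convention is that env() is read only inside config files and everything else reads config(). A sketch, with placeholder key names:

```php
// config/services.php: the only place the environment variable is read.
'accounting' => [
    'key'            => env('ACCOUNTING_API_KEY'),
    'webhook_secret' => env('ACCOUNTING_WEBHOOK_SECRET'),
],

// Everywhere else in the codebase: read config, never env(), never a literal.
$apiKey = config('services.accounting.key');
```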

OAuth implementation

For user-authorised integrations (connecting to a user's account on another service), we implement OAuth 2.0 properly.

  • Secure token storage: Access tokens and refresh tokens stored encrypted. Never logged. Never exposed to frontend code.
  • Token refresh: Refresh tokens before they expire, not when they fail. Schedule refresh at 80% of token lifetime.
  • Revocation handling: Users can revoke access at any time. Detect revoked tokens (usually 401 errors) and prompt for reauthorisation.
  • State parameter: Use cryptographically random state in OAuth flows to prevent CSRF attacks.
  • PKCE: For public clients (mobile apps, SPAs), use Proof Key for Code Exchange to prevent authorisation code interception.

Least privilege

Request only the permissions the integration actually needs. Broad permissions create security risk (more damage if compromised) and make users hesitant to authorise. If you only need to read contacts, do not request write access. If you only need email, do not request calendar.

Review permissions periodically. As integrations evolve, unused scopes accumulate. Remove them.

Transport security

  • TLS only: All API calls over HTTPS. No exceptions. No fallback to HTTP.
  • Certificate validation: Verify the server's certificate chain. Do not disable certificate validation for convenience.
  • TLS version: Minimum TLS 1.2. Prefer TLS 1.3 where supported.
  • Certificate pinning: For high-security integrations, pin the expected certificate or public key.

Monitoring and Alerting

An integration that is not monitored is an integration you learn about when users complain. By then, the damage is done: failed orders, missed emails, broken sync. Monitoring catches problems before they cascade.

What to monitor

Error rates

Percentage of requests that fail. Alert when error rate exceeds threshold (e.g., 1% in 5 minutes). Track by error type: 4xx vs 5xx, timeout vs connection refused.

Latency

Time for requests to complete. Track p50, p95, p99. Alert when latency increases significantly. Slow external APIs degrade your user experience.

Throughput

Requests per second/minute. Sudden drops may indicate a problem. Sudden spikes may indicate a runaway process or attack.

Queue depth

For queue-based integrations, monitor pending items. Growing queues indicate processing is not keeping up. Alert before queues become unmanageable.

Alerting thresholds

Alerts should be actionable. Too sensitive and you get alert fatigue. Too insensitive and you miss real problems.

Metric | Warning threshold | Critical threshold
Error rate | 1% over 5 minutes | 5% over 5 minutes
p99 latency | 2x baseline | 5x baseline
Circuit breaker open | Any open event | Open for >5 minutes
Queue depth | 100 items pending | 1000 items pending
Dead letter queue | Any item arrives | 10+ items in 1 hour

Health checks

Implement periodic health checks that verify the integration is working end to end. Not just "can we reach the API" but "can we complete a representative operation".

  • For payment integrations: verify credentials are valid, test mode available
  • For email integrations: send a test email to a monitored inbox
  • For data sync: verify last sync time is recent, record counts are plausible

Health checks run on a schedule (every 1-5 minutes) and alert if they fail. They catch problems like expired credentials, changed API endpoints, and silent failures that do not generate errors.
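
A sketch of one such check, intended to run from the scheduler every few minutes. The endpoint is a placeholder; the point is to exercise real credentials with a cheap, read-only call rather than just testing network reachability:

```php
use Illuminate\Http\Client\ConnectionException;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

function checkAccountingIntegration(): void
{
    try {
        $response = Http::withToken(config('services.accounting.key'))
            ->timeout(5)
            ->get('https://api.accounting.example.com/v1/organisation');
    } catch (ConnectionException $e) {
        Log::critical('accounting.health_check.unreachable', ['error' => $e->getMessage()]);
        return;
    }

    if ($response->status() === 401) {
        Log::critical('accounting.health_check.credentials_invalid');
    } elseif (! $response->successful()) {
        Log::warning('accounting.health_check.degraded', ['status' => $response->status()]);
    }
}
```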


Versioning and Backward Compatibility

External APIs change. New fields appear. Old fields disappear. Endpoints get deprecated. Response formats evolve. Your integration must handle change without breaking.

Defensive parsing

Do not assume the response contains exactly what the documentation says. Parse defensively, as in the sketch after this list.

  • Use optional fields: If a field might not be present, handle its absence gracefully.
  • Ignore unknown fields: New fields in responses should not cause parsing errors.
  • Validate types: A field documented as integer might arrive as string. Handle it.
  • Handle null vs missing: These are different conditions. A null value is not the same as a missing key.
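
For example, reading a customer record defensively (field names are illustrative):

```php
$data = $response->json();   // decoded associative array

$customer = [
    // Missing key and explicit null both end up as null here; log which
    // one you saw if the distinction matters to your sync logic.
    'email'      => $data['email'] ?? null,

    // Documented as an integer, occasionally observed as a numeric string.
    'balance'    => isset($data['balance']) ? (int) $data['balance'] : 0,

    // Nested structures may be absent entirely for some account types.
    'vat_number' => $data['tax']['vat_number'] ?? null,
];

// Any unknown extra fields in $data are simply ignored, never treated as errors.
```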

API versioning strategies

When external APIs version their endpoints, you need a strategy for migration.

Pin to specific version

Request a specific API version in headers or URL. Prevents surprise breakages from automatic updates. Requires active migration when versions are deprecated.

Monitor deprecation notices

Subscribe to provider changelogs, mailing lists, and status pages. Many providers announce deprecations 6-12 months in advance. Track these in your backlog.

Maintain version compatibility layer

Your abstraction layer can support multiple API versions simultaneously. Route requests to appropriate version based on configuration. Migrate gradually.

Testing against API changes

Record real API responses and use them in tests. When you suspect the API has changed, compare fresh responses against recordings. Automated tests that call real APIs (in sandbox mode) catch changes that documentation does not mention.


Common Failure Modes

Integration failures follow patterns. Knowing the patterns helps you build defences before you encounter the failure in production.

The timeout that succeeded

You send a request. It times out after 30 seconds. You assume it failed and retry. But the original request actually succeeded. Now you have duplicate records, duplicate charges, or inconsistent state.

Defence: Idempotency keys on all mutating operations. Verify state before retrying. For critical operations (payments), query for existing record before creating new one.

The silent data change

The external API starts returning a field in a different format. No error, no warning. Your code parses it into garbage and stores it. You discover the problem a week later when someone notices corrupt data.

Defence: Schema validation on responses. Log warnings for unexpected formats. Monitoring for data quality (null rates, format consistency).

The cascading failure

An external API slows down. Your application's threads block waiting for responses. Thread pool exhausts. Other, unrelated requests start failing. The entire application becomes unresponsive.

Defence: Aggressive timeouts. Separate connection pools per integration. Circuit breakers. Bulkheads (isolate external calls from critical paths).

The credential expiry

OAuth tokens expire. API keys get rotated by the provider. A team member leaves and their personal API key stops working. Integration fails with 401 errors until someone manually fixes it.

Defence: Use service accounts, not personal credentials. Automated token refresh. Monitoring for authentication failures. Documented credential rotation procedures.

The rate limit death spiral

You hit a rate limit. Your retry logic retries immediately. You hit the limit again. You retry more aggressively. The provider extends your cooldown period. You have now created an outage that lasts hours instead of seconds.

Defence: Respect Retry-After headers. Implement exponential backoff with maximum. Queue requests during rate limiting rather than retrying aggressively.

The webhook flood

Something triggers a large number of webhooks in a short period. Your webhook handler processes them synchronously. Database connections exhaust. Background job queue fills. Other operations start failing.

Defence: Acknowledge webhooks immediately, process asynchronously. Rate limit your own processing. Use separate queues or workers for webhook processing.


Testing Integration Code

Integration code is notoriously difficult to test. The external dependency makes tests slow, flaky, and dependent on external state. But untested integration code is a liability. We use multiple testing strategies in combination.

Unit tests with mocks

Mock the HTTP client to return predetermined responses. Test your parsing logic, error handling, and business logic without network calls. Fast, deterministic, run in CI.

The limitation: mocks only return what you tell them to. They do not catch cases where the real API behaves differently than you expected.
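
With Laravel's HTTP client, mocking is built in via Http::fake(). A sketch inside a test case, where AccountingClient and its result object stand in for your own wrapper:

```php
use Illuminate\Support\Facades\Http;

public function test_gateway_timeout_is_classified_as_transient(): void
{
    Http::fake([
        'api.accounting.example.com/*' => Http::response(['error' => 'upstream timeout'], 504),
    ]);

    $result = app(AccountingClient::class)->fetchInvoice('INV-1001');

    $this->assertTrue($result->isTransientFailure());

    // Assert what was sent, not just how the response was handled.
    Http::assertSent(fn ($request) => $request->hasHeader('Authorization'));
}
```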

Contract tests

Record real API responses and replay them in tests. Periodically refresh recordings to catch API changes. Tools like VCR, Betamax, or Polly automate this.

The limitation: recordings become stale. Real APIs return dynamic data (timestamps, IDs) that must be handled.

Sandbox testing

Most API providers offer sandbox or test environments. Use them for integration testing that exercises real network calls. Test both success paths and error paths (many sandboxes support triggering specific error conditions).

The limitation: sandboxes may not perfectly mirror production behaviour. Rate limits may be different. Some edge cases may not be reproducible.

Chaos testing

Intentionally introduce failures in a controlled environment. Inject network latency. Return error responses. Drop connections. Verify that fallback behaviour works as designed.

  • Add 5 second delay to responses. Does your timeout trigger?
  • Return 500 errors randomly. Does retry logic engage?
  • Close connections mid-response. Do you handle partial data?
  • Return 429 for all requests. Does backoff work?

What You Get

Integrations built with these patterns behave predictably in production. When external services fail (and they will), your system degrades gracefully rather than catastrophically.

  • Work reliably: Failure handling, retry logic, and idempotency built in from the start. Transient failures resolve automatically.
  • Fail gracefully: Circuit breakers and bulkheads prevent cascading failures. One bad integration does not bring down the system.
  • Are debuggable: Comprehensive structured logging tells you exactly what happened. Correlation IDs link requests to business operations.
  • Handle change: Abstraction layers isolate you from external API changes. Defensive parsing handles unexpected responses.
  • Are secure: Credentials managed properly, permissions minimised, webhooks verified, transport encrypted.
  • Are observable: Monitoring and alerting catch problems before users report them. Health checks verify end-to-end functionality.

Data flows between your systems reliably. You do not get woken up when an external service has a bad night. When problems do occur, you have the logs, metrics, and tools to diagnose and resolve them quickly.


Further Reading

  • Laravel HTTP Client - Official documentation for Laravel's Guzzle wrapper with retry and timeout configuration.
  • Stripe Idempotency Keys - Reference implementation of idempotency from one of the best-designed APIs in the industry.
  • Circuit Breaker Pattern - Martin Fowler's canonical explanation of the circuit breaker pattern.

Connect Your Systems

We build API integrations that connect your systems reliably. Payment processors, accounting software, CRMs, email services, custom APIs. Integrations designed for the real world where networks fail, APIs change, and services go down.

Not brittle connections that break at the first timeout. Robust integrations that handle failure gracefully and give you visibility into what is happening.

Let's talk about your integration needs →