Automation, Workers, and Event-Driven Workflows
A user clicks "Export Report." The request hits your controller, which queries 200,000 rows, compiles them into a spreadsheet, writes the file to storage, and returns a download link. That request takes 45 seconds. The load balancer times out at 30. The user sees a 504. They click the button again. Now two export jobs are running against the same dataset, fighting for memory.
Background jobs exist because HTTP requests are the wrong place for slow work. Any non-trivial Laravel application eventually needs a queue. The question is never whether you need background job processing. It is how you design it so jobs do not pile up, fail silently, or corrupt data when they retry.
This page covers the architectural patterns we use in production: queue topology, driver selection, idempotent job design, failure handling with dead letter queues, monitoring through Horizon, and the specific failure modes that tutorials never mention. If you have already read the Laravel queue documentation and want to know what changes when you move from tutorial to production, this is where that conversation starts.
Five decisions that define your queue infrastructure:
1. Queue driver (Redis, SQS, or database).
2. Queue topology (named queues with priority ordering).
3. Idempotency strategy (how retried jobs avoid duplicating work).
4. Failure handling (retry logic, dead letter queues, alerting).
5. Deployment safety (how workers restart without losing in-flight jobs).
The Constraint: Why Synchronous Processing Fails
Every HTTP request in a Laravel application runs inside a process with finite memory, a configured timeout, and a user waiting on the other end. For most requests (fetching a page, saving a form, returning JSON from an API), the lifecycle is fine. The work finishes in milliseconds. The problems start when the work is slow, unpredictable, or both.
PDF generation for a complex invoice might take 3 seconds or 30, depending on how many line items it contains. Sending a batch of 500 emails through an SMTP relay takes as long as the relay takes. Importing a CSV with 50,000 rows means 50,000 database writes, each with validation, event dispatch, and potential webhook callbacks to external systems.
Running any of these inside a web request creates three categories of failure.
Timeouts
PHP's max_execution_time, Nginx's proxy_read_timeout, and any load balancer timeout all compete. The strictest one wins, and the user gets a blank error page.
Memory exhaustion
Processing 50,000 rows in a single request can exceed PHP's memory limit. The process dies. No cleanup runs. Partial data sits in the database.
Worker blocking
While one request grinds through a data import, that PHP-FPM worker is unavailable. With enough concurrent slow requests, the entire application becomes unresponsive for everyone.
The naive fix is to increase timeouts, raise memory limits, and add more workers. This delays the problem. It does not fix it. The fix is to move slow work out of the request lifecycle entirely: accept the request, dispatch a job to a queue, return a response immediately, and let a separate worker process handle the heavy lifting. Background jobs are not a feature. They are infrastructure.
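In Laravel terms, the pattern looks like this sketch, where ExportReportJob is a hypothetical job class standing in for the slow work:

```php
use App\Jobs\ExportReportJob; // hypothetical job class
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Request;

class ReportController
{
    // Accept the request, queue the slow work, respond immediately.
    public function export(Request $request): JsonResponse
    {
        ExportReportJob::dispatch($request->user()->id, $request->input('filters', []));

        // 202 Accepted: the work is queued, not done.
        return response()->json(['status' => 'queued'], 202);
    }
}
```

The client polls for completion or receives a notification when the export is ready; either way, the HTTP request itself finishes in milliseconds.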
Queue Architecture and Driver Selection
Laravel's queue system abstracts the transport layer behind a consistent API. You dispatch jobs the same way regardless of whether they end up in Redis, Amazon SQS, a database table, or Beanstalkd. The driver choice matters for operations, not for application code.
Redis (the default for production)
Redis is the most common queue driver for Laravel applications, and for good reason. It is fast (sub-millisecond latency in typical configurations), supports blocking pops (workers do not poll), and integrates with Laravel Horizon for monitoring and metrics. We use Redis for the majority of our queue infrastructure.
The trade-off: Redis is an in-memory store. If the Redis instance restarts without persistence configured, queued jobs disappear. In production, this means either Redis with AOF persistence, Redis Sentinel for failover, or a managed service like AWS ElastiCache with Multi-AZ.
Amazon SQS
SQS is the right choice when durability matters more than latency. Messages are replicated across multiple availability zones. You do not manage the infrastructure. The trade-off is that SQS does not support blocking pops (workers poll), message ordering is best-effort (FIFO queues exist but add complexity), and Horizon does not support SQS. We use SQS for jobs where message loss is unacceptable, such as webhook-triggered jobs from payment providers: you cannot afford to lose a Stripe webhook because Redis restarted.
Database queues
The database driver stores jobs in a jobs table. No additional infrastructure is needed. For applications with low job volume (fewer than 1,000 jobs per day), this is a reasonable starting point. It fails at scale because every job dispatch and every job pickup is a database write, competing with your application's transactional queries for connection pool capacity. On PostgreSQL, we have seen connection pool contention begin at around 5,000 jobs per day, though the exact threshold depends on your application's transactional load and pool size.
Driver comparison at a glance
| Factor | Redis | SQS | Database |
|---|---|---|---|
| Durability | Volatile without AOF persistence | Replicated across availability zones | As durable as your database |
| Latency | Sub-millisecond latency, blocking pops | Polling-based, higher latency | Database write per dispatch and pickup |
| Horizon support | Full dashboard and auto-scaling | Not supported | Not supported |
| Infrastructure | Requires Redis server | Managed by AWS | No additional services |
| Scale ceiling | High (memory-bound) | Effectively unlimited | Connection pool contention above ~5k jobs/day |
Queue topology
A single default queue is where most applications start, and too often where queue design stops without anyone having made a deliberate decision. In production, we typically configure three to five named queues with explicit priorities.
| Queue | Purpose | Example Jobs |
|---|---|---|
| critical | User-facing, time-sensitive | Password reset emails, payment confirmations |
| webhooks | External system callbacks | Stripe webhooks, CRM sync events |
| default | Standard processing | Notification dispatch, cache warming |
| bulk | High-volume batch work | CSV imports, report generation |
| scheduled | Time-triggered jobs | Daily digest emails, data cleanup |
Workers are then assigned to queues with priority ordering. A password reset email will never wait behind a 50,000-row CSV import. The critical queue drains first. Always.
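That priority ordering is expressed on the worker command line, where queues are drained left to right (queue names match the table above):

```shell
# critical always empties before webhooks, webhooks before
# default, and default before bulk.
php artisan queue:work redis --queue=critical,webhooks,default,bulk
```

A common refinement is a second worker pool dedicated to bulk alone, so long-running imports never consume the capacity reserved for user-facing jobs.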
Job Dispatching Patterns
Laravel provides several dispatch mechanisms, each suited to different situations. The choice affects execution timing, failure isolation, and how jobs relate to each other.
Synchronous dispatch. dispatch_sync() runs the job inline, in the current process. Use this for testing and for local development. Running jobs synchronously in feature tests catches integration bugs that mocking hides: database state, event dispatch ordering, and exception handling. The speed trade-off is real, but the coverage gain is worth it for critical job classes. Never use synchronous dispatch in production for slow work.
Standard async dispatch. The job serialises onto the configured queue and returns immediately. A worker picks it up when capacity is available. This is the default and the right choice for most jobs.
Delayed dispatch. The job sits on the queue but is not available for processing until the delay expires. Useful for scheduled follow-ups, rate-limited API calls, and retry windows. Be aware that delayed jobs on Redis use sorted sets, which consume memory proportional to the number of delayed jobs.
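Both async variants as a sketch, with SendFollowUpEmail as a hypothetical job class:

```php
use App\Jobs\SendFollowUpEmail; // hypothetical job class

// Standard async dispatch onto a named queue.
SendFollowUpEmail::dispatch($user)->onQueue('default');

// Delayed dispatch: the job is invisible to workers for ten minutes.
SendFollowUpEmail::dispatch($user)->delay(now()->addMinutes(10));
```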
Job chaining
Each job in a chain runs only if the previous one succeeded. If the first step of, say, an import-validate-notify chain fails, the validation and notification jobs never execute. This is the correct pattern for multi-step workflows where later steps depend on earlier results, applying a form of command-query separation where the dispatch side (commands) is decoupled from the processing side. Compare this with workflow engines, which handle more complex branching and conditional logic.
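A minimal chain, assuming hypothetical job classes for an import workflow:

```php
use Illuminate\Support\Facades\Bus;

// Each job runs only after the previous one succeeds; a failure
// in ImportCsvRows means the later two jobs never run.
Bus::chain([
    new ImportCsvRows($path),
    new ValidateImportedData($importId),
    new NotifyImportComplete($importId),
])->onQueue('bulk')->dispatch();
```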
Job batching
Batching runs jobs concurrently with aggregate callbacks. Use it for parallelisable work: processing rows in a CSV, generating thumbnails for uploaded images, or sending notifications to a list of recipients. The batch tracks progress, so you can show the user a percentage complete via a real-time dashboard.
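A sketch using a hypothetical ProcessCsvChunk job (jobs in a batch must use the Batchable trait):

```php
use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Throwable;

$batch = Bus::batch(
    collect($chunks)->map(fn ($chunk) => new ProcessCsvChunk($chunk))->all()
)->then(function (Batch $batch) {
    // All jobs completed successfully.
})->catch(function (Batch $batch, Throwable $e) {
    // First failure detected.
})->onQueue('bulk')->dispatch();

// For a progress dashboard: $batch->progress() returns a percentage.
```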
Job middleware
Middleware wraps job execution with cross-cutting concerns. The two we use most frequently:
Rate limiting
RateLimited::class prevents jobs from exceeding external API limits. If your Stripe account allows 100 requests per second, rate-limiting middleware ensures your payment sync jobs respect that cap.
Preventing overlaps
WithoutOverlapping::class ensures only one instance of a job runs for a given key at a time. Critical for jobs that modify the same resource, such as recalculating an account balance.
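Inside a job class, both are attached via a middleware() method. This sketch assumes a named rate limiter called 'stripe' has already been registered via RateLimiter::for():

```php
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Queue\Middleware\WithoutOverlapping;

public function middleware(): array
{
    return [
        // Respect the external API's cap via the 'stripe' limiter.
        new RateLimited('stripe'),
        // One job at a time per account; overlapping jobs are
        // released back to the queue and retried after 30 seconds.
        (new WithoutOverlapping($this->accountId))->releaseAfter(30),
    ];
}
```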
Designing Idempotent Jobs
A job is idempotent if running it twice with the same input produces the same result as running it once. This property is not optional. It is a requirement for any job that might be retried. Queue systems provide at-least-once delivery, not exactly-once semantics, so every job must be safe to re-execute.
Retries happen constantly in production. A worker crashes mid-job. Redis fails over. A deployment restarts all workers. The queue system re-dispatches the job because job acknowledgement never reached the broker. If that job already wrote half its data to the database, the retry writes it again. Without idempotency, you get duplicate records, double-charged customers, or emails sent twice.
The rule: Jobs run "at least once", not "exactly once". Every job that performs a side effect must be designed so that running it multiple times produces the same outcome as running it once.
Unique job identifiers
Assign each job a UUID at dispatch time. Before processing, check whether a job with that UUID has already completed. Store completed UUIDs in a cache or database table with a TTL.
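A sketch of the check inside handle(), using the cache as the completion store. Note the small window between finishing the work and recording the UUID: that window is why the underlying work should still be idempotent in its own right.

```php
use Illuminate\Support\Facades\Cache;

public function handle(): void
{
    $key = "job-completed:{$this->jobUuid}"; // UUID assigned at dispatch time

    if (Cache::has($key)) {
        return; // a previous attempt already completed this work
    }

    // ... perform the job's (idempotent) work ...

    // Record completion; the TTL bounds how long retries are suppressed.
    Cache::put($key, true, now()->addDay());
}
```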
Database transactions with constraints
Wrap job work in a transaction and rely on unique constraints to prevent duplicates. If the job creates an invoice, a unique constraint on [order_id, invoice_type] ensures the retry fails gracefully rather than creating a second invoice.
Upserts over inserts
Use updateOrCreate() instead of create() when the job writes records that might already exist. The second execution updates the existing record rather than failing or duplicating.
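Both patterns in one sketch, assuming a hypothetical Invoice model with a unique index on (order_id, invoice_type):

```php
use App\Models\Invoice; // hypothetical model
use Illuminate\Support\Facades\DB;

DB::transaction(function () {
    // A retry matches the existing row and updates it instead of
    // inserting a duplicate; the unique index is the safety net.
    Invoice::updateOrCreate(
        ['order_id' => $this->orderId, 'invoice_type' => 'standard'],
        ['total' => $this->total, 'issued_at' => now()]
    );
});
```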
Separating side effects
Move non-idempotent side effects (sending emails, calling external APIs) to the end of the job, after the idempotent database work. If the job fails before reaching the side effect, no email is sent. If it fails after, the retry skips the database work and re-sends the email, which is typically acceptable.
Idempotency is not a library you install. It is a design discipline applied to every job class. We review job idempotency during code review the same way we review database migrations: as infrastructure that must be correct. The deduplication logs and completion records that idempotent jobs produce also feed into audit trails, providing a verifiable history of what work was performed, when, and whether it was a first execution or a safe retry.
Failed Job Handling and Dead Letter Queues
Jobs fail. Connections drop. External APIs return 500s. A CSV contains a row with a malformed date that crashes the parser. The question is not whether jobs will fail but what happens when they do.
Retry strategies
Laravel's default retry behaviour is configurable per job. You can set the number of attempts and a backoff schedule (for example, 10 seconds, then 60, then 300). The backoff prevents a job from hammering an external service that is already struggling.
For transient failures (network timeouts, rate limits), retries with backoff usually resolve the issue. For permanent failures (invalid data, missing dependencies), retries waste resources. The job needs to distinguish between the two.
| Failure Type | Example | Response |
|---|---|---|
| Transient | Network timeout, rate limit, temporary API outage | Retry with exponential backoff |
| Permanent | Invalid email address, deleted record, bad data | Fail immediately, log for review |
| Resource | Out of memory, disk full, connection pool exhausted | Release job back to queue, alert operations |
| Dependency | External service down, API returning 500s | Delay retry, activate circuit breaker |
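A job class that encodes the retry schedule and the transient/permanent distinction might look like the following sketch (SyncPaymentJob and InvalidPaymentDataException are hypothetical names):

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;

class SyncPaymentJob implements ShouldQueue
{
    use InteractsWithQueue;

    public int $tries = 4; // four attempts in total

    // Backoff between retries: 10s, then 60s, then 300s.
    public function backoff(): array
    {
        return [10, 60, 300];
    }

    public function handle(): void
    {
        try {
            // ... call the external payment API ...
        } catch (InvalidPaymentDataException $e) {
            $this->fail($e); // permanent: no retries, straight to failed_jobs
        }
        // Transient exceptions bubble up and trigger the backoff schedule.
    }
}
```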
Dead letter queues
When a job exhausts all retries, Laravel moves it to the failed_jobs table. This is Laravel's built-in failed job store, distinct from a true dead letter queue (such as an SQS DLQ), though it serves a similar purpose. These jobs need attention. They represent work the system could not complete. The dead letter queue pattern, borrowed from enterprise integration patterns, treats failed jobs as messages that require human or automated intervention.
In our systems, we implement three layers of handling.
Automated triage
A scheduled job scans failed_jobs hourly. Jobs that failed due to known transient issues (a third-party API outage that has since resolved) are automatically retried.
Alerting
When the failed job count exceeds a threshold (we typically set this at 10 failures per hour), an alert fires to Slack or PagerDuty. This catches systemic failures: a bad deployment, a database connection leak, or an external dependency that is down.
Manual review
Jobs that cannot be automatically retried are reviewed by a developer. The failed_jobs table stores the serialised job payload and the exception trace, providing everything needed to diagnose and replay.
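The replay workflow uses Laravel's built-in artisan commands:

```shell
# List failed jobs with IDs, queue names, and exception summaries.
php artisan queue:failed

# Replay one job by ID, or everything at once.
php artisan queue:retry <job-id>
php artisan queue:retry all

# Remove a failed job that should not be replayed.
php artisan queue:forget <job-id>
```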
The worst failure mode is silent failure. A job fails, nobody notices, and the customer never receives their invoice. Dead letter queue monitoring prevents this.
Monitoring, Horizon, and Operational Visibility
A queue without monitoring is a queue where problems go undetected until a customer reports them.
Laravel Horizon
Laravel Horizon provides a dashboard and configuration layer for Redis-based queues. It shows real-time metrics for every queue and worker: jobs per minute, job runtime with percentile breakdowns, failed jobs with full exception traces, wait time before a worker picks up a job, and worker status across all processes.
Horizon also manages worker processes through its supervisor configuration, automatically scaling workers up or down based on queue depth. This is more reliable than manually managing queue:work processes with Supervisor.
What to monitor and alert on
Beyond Horizon's dashboard, we set up alerts for specific conditions that require human intervention. These thresholds integrate with the broader security and operations monitoring stack, where queue metrics sit alongside uptime checks, slow query alerts, and application performance data.
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Queue depth | 500 pending jobs | 2,000 pending jobs |
| Wait time | 30 seconds | 120 seconds |
| Failure rate | 1% of processed | 5% of processed |
| Worker count | Below expected | Zero workers running |
| Memory per worker | 100 MB | 200 MB |
Process management
Queue workers are long-running PHP processes. They do not restart between jobs. This makes them susceptible to memory leaks, stale database connections, and accumulated state. In production, we configure workers to restart regularly: process up to 1,000 jobs or run for one hour, whichever comes first, then exit cleanly. Supervisor or systemd restarts the worker immediately. This limits the impact of memory leaks and ensures workers pick up code changes after deployments.
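The worker flags for that restart policy:

```shell
# Exit cleanly after 1,000 jobs or one hour, whichever comes first;
# Supervisor or systemd restarts the process with fresh memory.
php artisan queue:work redis --max-jobs=1000 --max-time=3600
```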
Production Failure Modes
Tutorials show you how to dispatch a job and process it. They do not show you what happens when things go wrong at scale. These are the failure modes that surface repeatedly in production Laravel applications.
Memory leaks in long-running workers
PHP was designed for request-response cycles where memory is freed after each request. Queue workers break this assumption. Common sources: Eloquent model events that accumulate listeners, logging handlers that buffer output, and image processing libraries that do not release resources. The fix: limit worker lifetime with --max-jobs and --max-time.
Job timeouts and the --timeout flag
A job that hangs blocks the worker indefinitely. Set the worker timeout higher than the individual job's $timeout property. The job-level timeout raises a MaxAttemptsExceededException. The worker-level timeout kills the process. If they are equal, the worker dies before the exception handler can run.
Race conditions between concurrent jobs
Two workers pick up two jobs that both modify the same account balance. Without locking, one writes an outdated value. Database-level locking (lockForUpdate()) prevents this but adds contention. For high-throughput scenarios, use atomic cache operations or redesign the job to append events rather than mutate state directly.
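A sketch of the database-level approach, assuming a hypothetical Account model:

```php
use App\Models\Account; // hypothetical model
use Illuminate\Support\Facades\DB;

DB::transaction(function () {
    // The row lock is held until commit: a concurrent job blocks here
    // instead of reading a balance that is about to change.
    $account = Account::whereKey($this->accountId)
        ->lockForUpdate()
        ->first();

    $account->balance += $this->amount;
    $account->save();
});
```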
Tenant isolation in multi-tenant queues
In multi-tenant Laravel applications, jobs must execute within the correct tenant context. Capture the tenant identifier at dispatch time, restore it at the start of handle(). Without this, a queue worker processing jobs from multiple tenants will retain the context from the previous job.
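A sketch of the capture-and-restore pattern; Tenancy stands in for whatever tenant-context service your application uses:

```php
use Illuminate\Contracts\Queue\ShouldQueue;

class SyncTenantData implements ShouldQueue // hypothetical job class
{
    // Captured at dispatch time and serialised with the payload.
    public function __construct(public int $tenantId)
    {
    }

    public function handle(): void
    {
        Tenancy::initialize($this->tenantId); // restore context first

        try {
            // ... tenant-scoped work ...
        } finally {
            Tenancy::end(); // reset so the next job starts clean
        }
    }
}
```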
The timeout triad
Three timeout values interact in Laravel queue processing, and misconfiguring any one of them creates failures that are difficult to diagnose.
| Setting | Controls | Correct Configuration |
|---|---|---|
| $timeout (job property) | Maximum seconds a single job may run before a MaxAttemptsExceededException is thrown | Set per job class based on expected execution time |
| --timeout (worker flag) | Maximum seconds before the worker process is killed | Must be higher than the longest job $timeout on that queue |
| retry_after (queue config) | Seconds before an unacknowledged job becomes available again | Must be higher than the longest possible job execution time |
The critical rule: worker --timeout must exceed job $timeout. If they are equal (or the worker timeout is lower), the worker process dies before the job's exception handler can run. No failed() method fires. No cleanup happens. The job simply vanishes from the worker's perspective and is eventually retried by the queue, potentially causing duplicate execution.
The second critical rule: retry_after must exceed maximum job execution time. If a job takes 90 seconds but retry_after is set to 60, the queue makes the job available again while the first worker is still processing it. Two workers now process the same job concurrently. Without idempotency, this causes data corruption.
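Putting illustrative numbers to both rules, for a queue whose slowest job runs about five minutes:

```php
// In the job class: hard cap for this job's execution.
public int $timeout = 300;

// In config/queue.php, on the redis connection: longer than any
// possible job runtime, so an in-flight job is never re-released.
'retry_after' => 600,

// Worker invocation: above the longest job $timeout on this queue.
//   php artisan queue:work redis --timeout=330
```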
Webhook-triggered job patterns
External systems (Stripe, Xero, CRM platforms) send webhooks to your application. Each webhook should dispatch a job rather than processing inline. This ensures the webhook endpoint returns a 200 quickly and isolates the processing from the HTTP request.
The challenge with webhook jobs is idempotency. Stripe sends the same webhook event multiple times as a reliability measure. Your job must handle receiving the same event three times without creating three payment records. Store the webhook event ID and check it before processing, following the same idempotency patterns described above. This pattern connects to our broader approach to API integrations, where incoming data from external systems flows through queued jobs with validation, deduplication, and error handling at each stage.
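A sketch of the deduplication check, assuming a hypothetical WebhookEvent model with a unique index on the event ID column:

```php
use App\Models\WebhookEvent; // hypothetical model

public function handle(): void
{
    // If Stripe delivers the same event twice, the second call
    // finds the existing row instead of creating another.
    $event = WebhookEvent::firstOrCreate(
        ['stripe_event_id' => $this->payload['id']],
        ['type' => $this->payload['type']]
    );

    if ($event->processed_at !== null) {
        return; // already handled on a previous delivery
    }

    // ... process the event: create the payment record, etc. ...

    $event->update(['processed_at' => now()]);
}
```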
Symptom-to-cause diagnostic reference
When something goes wrong with queue processing, the symptom rarely points directly at the cause. This reference maps common production symptoms to their root causes and fixes.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Worker crashes after N jobs | Memory leak (Eloquent listeners, image libraries, logging buffers) | Set --max-jobs and --max-time |
| Job runs twice | Missing idempotency key, or retry_after shorter than execution time | Add UUID deduplication or unique constraint; increase retry_after |
| Queue depth growing steadily | Producers outpacing consumers | Scale workers, add rate-limited dispatch, enable Horizon auto-scaling |
| Jobs fail after deployment | Payload incompatibility (changed class signatures or properties) | Drain queue before deploying, or version job classes |
| Wrong tenant data in job output | Tenant context leakage between jobs | Capture tenant ID at dispatch, restore in handle() via job middleware |
| Worker dies without logging | Worker --timeout equal to or lower than job $timeout | Set worker --timeout higher than the longest job timeout on that queue |
Deployment Safety and Queue Worker Restarts
Deploying new code to an application with active queue workers introduces a failure mode that tutorials never cover: payload compatibility between code versions.
When a job is dispatched, Laravel serialises the job class and its properties. When a worker picks the job up, it deserialises that payload and calls handle(). If the code changed between dispatch and processing (because a deployment happened in between), the deserialisation can fail. Renamed classes throw ClassNotFoundException. Changed constructor signatures cause property mismatches. Removed or renamed properties produce silent null values that cascade into application errors.
The risk window is small but real: any job sitting on the queue at the moment of deployment was serialised by the old code and will be processed by the new code.
Graceful worker shutdown
Laravel's queue:restart command signals all workers to finish their current job and then exit. Supervisor or systemd restarts them with the new code. This prevents a worker from being killed mid-job. Run queue:restart as part of every deployment script, after the new code is live.
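In a deployment script, the ordering matters: signal the restart only after the new release is live.

```shell
# ... pull code, install dependencies, run migrations ...

# Tell every worker to finish its current job and exit;
# Supervisor or systemd restarts them on the new code.
php artisan queue:restart
```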
Queue draining for breaking changes
If a deployment changes a job class's constructor signature, namespace, or serialised properties, drain the affected queues before deploying. Stop dispatching new jobs, wait for existing jobs to process, then deploy. For high-volume queues, this means timing deployments during low-traffic windows.
Backward-compatible job changes
The safest approach is to make job changes backward-compatible. Add new constructor parameters with defaults. Keep old property names as aliases during a transition period. This is the same discipline applied to database migrations: never remove a column that running code still references.
For applications where job loss is unacceptable (payment processing, order fulfilment), we version job classes explicitly. The old class stays in the codebase until all pending jobs have processed, and new dispatches use the updated class. This adds complexity but eliminates the deployment risk window entirely.
Backpressure and Queue Depth Management
Most queue documentation covers dispatching and processing but ignores what happens when producers outpace consumers. Queue depth grows. Memory consumption rises. Eventually the system degrades, and the degradation pattern depends on your driver.
With Redis, delayed jobs use sorted sets. Each delayed job consumes memory proportional to its serialised payload. A burst of 100,000 delayed jobs with large payloads can push Redis memory usage past its configured limit, triggering eviction policies that silently drop queued jobs. With the database driver, a growing jobs table increases query time for both dispatch and pickup operations, and the table's indexes bloat. With SQS, the default in-flight message limit (120,000 messages per standard queue) acts as a constraint, though this quota can be increased via AWS support.
Backpressure strategies prevent these failure modes.
Rate-limited dispatch
Use rate-limiting middleware on the dispatch side, not just the processing side. If a bulk import generates 50,000 jobs, dispatch them in batches of 1,000 with delays between batches rather than flooding the queue.
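A sketch of staggered dispatch, reusing the hypothetical ProcessCsvChunk job:

```php
// Dispatch 50,000 rows as 50 chunked jobs, each batch becoming
// visible to workers 30 seconds after the previous one.
collect($rows)
    ->chunk(1000)
    ->values()
    ->each(function ($chunk, int $index) {
        ProcessCsvChunk::dispatch($chunk->all())
            ->onQueue('bulk')
            ->delay(now()->addSeconds($index * 30));
    });
```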
Dynamic worker scaling
Horizon's auto-scaling adjusts worker count based on queue depth. Configure a minimum worker count for baseline throughput and a maximum for burst capacity. Without Horizon (SQS or database drivers), use external scaling based on queue depth metrics.
Queue depth alerting
Monitor queue depth as a leading indicator. A steadily growing queue means consumers cannot keep pace. Alert before the queue reaches a level where memory, disk, or message limits become a problem.
When Background Jobs Change How a Business Operates
The technical patterns above are infrastructure. The business impact of well-designed background jobs is what makes them worth the engineering investment.
- Reports that generate themselves. Users click "Generate" and receive an email with the finished report. No waiting, no timeouts, no 504 errors.
- Data imports with progress tracking. Batch jobs with progress bars replace cron jobs that blocked other scheduled tasks and left operations guessing.
- Payment webhooks that never go missing. Dedicated queues with SQS durability, dead letter monitoring, and automatic retry. Revenue stops leaking through infrastructure gaps.
- Responsive applications under load. Heavy work happens in the background. Users never wait for email servers or report generation. Pages respond instantly.
These are patterns we have implemented across order management systems, financial operations platforms, and service delivery tools, deployed via Laravel Forge or Laravel Vapor depending on the infrastructure requirements. Background job processing connects to workflow engines for complex multi-step processes, to real-time dashboards for operational visibility, and to infrastructure decisions about how workers are deployed and scaled.
Build Reliable Queue Infrastructure
If your Laravel application is running slow work inside HTTP requests, or if your background jobs work in development but cause problems in production, we are happy to talk it through.
Discuss your queue architecture →