Automation, Workers, and Event-Driven Workflows
A user clicks "Export Report." The request hits your controller, which queries 200,000 rows, compiles them into a spreadsheet, writes the file to storage, and returns a download link. That request takes 45 seconds. The load balancer times out at 30. The user sees a 504. They click the button again. Now two export jobs are running against the same dataset, fighting for memory.
Background jobs exist because HTTP requests are the wrong place for slow work. Any non-trivial Laravel application eventually needs a queue. The question is never whether you need background job processing. It is how you design it so jobs do not pile up, fail silently, or corrupt data when they retry.
This page covers the architectural patterns we use in production: queue topology, driver selection, idempotent job design, failure handling with dead letter queues, monitoring through Horizon, and the specific failure modes that tutorials never mention. If you have already read the Laravel queue documentation and want to know what changes when you move from tutorial to production, this is where that conversation starts.
Five decisions that define your queue infrastructure:
1. Queue driver (Redis, SQS, or database).
2. Queue topology (named queues with priority ordering).
3. Idempotency strategy (how retried jobs avoid duplicating work).
4. Failure handling (retry logic, dead letter queues, alerting).
5. Deployment safety (how workers restart without losing in-flight jobs).
The Constraint: Why Synchronous Processing Fails
Every HTTP request in a Laravel application runs inside a process with finite memory, a configured timeout, and a user waiting on the other end. For most requests (fetching a page, saving a form, returning JSON from an API), the lifecycle is fine. The work finishes in milliseconds. The problems start when the work is slow, unpredictable, or both.
PDF generation for a complex invoice might take 3 seconds or 30, depending on how many line items it contains. Sending a batch of 500 emails through an SMTP relay takes as long as the relay takes. Importing a CSV with 50,000 rows means 50,000 database writes, each with validation, event dispatch, and potential webhook callbacks to external systems.
Running any of these inside a web request creates three categories of failure.
Timeouts
PHP's max_execution_time, Nginx's proxy_read_timeout, and any load balancer timeout all compete. The strictest one wins, and the user gets a blank error page.
Memory exhaustion
Processing 50,000 rows in a single request can exceed PHP's memory limit. The process dies. No cleanup runs. Partial data sits in the database.
Worker blocking
While one request grinds through a data import, that PHP-FPM worker is unavailable. With enough concurrent slow requests, the entire application becomes unresponsive for everyone.
The naive fix is to increase timeouts, raise memory limits, and add more workers. This delays the problem. It does not fix it. The fix is to move slow work out of the request lifecycle entirely: accept the request, dispatch a job to a queue, return a response immediately, and let a separate worker process handle the heavy lifting. Background jobs are not a feature. They are infrastructure.
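In Laravel terms, the pattern looks like this sketch, where ExportReportJob is a hypothetical job class standing in for the slow work:

```php
use App\Jobs\ExportReportJob; // hypothetical job class
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Request;

class ReportController
{
    // Accept the request, queue the slow work, respond immediately.
    public function export(Request $request): JsonResponse
    {
        ExportReportJob::dispatch($request->user()->id, $request->input('filters', []));

        // 202 Accepted: the work is queued, not done.
        return response()->json(['status' => 'queued'], 202);
    }
}
```

The client polls for completion or receives a notification when the export is ready; either way, the HTTP request itself finishes in milliseconds.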
Queue Architecture and Driver Selection
Laravel's queue system abstracts the transport layer behind a consistent API. You dispatch jobs the same way regardless of whether they end up in Redis, Amazon SQS, a database table, or Beanstalkd. The driver choice matters for operations, not for application code.
Redis (the default for production)
Redis is the most common queue driver for Laravel applications, and for good reason. It is fast (sub-millisecond latency in typical configurations), supports blocking pops (workers do not poll), and integrates with Laravel Horizon for monitoring and metrics. We use Redis for the majority of our queue infrastructure.
The trade-off: Redis is an in-memory store. If the Redis instance restarts without persistence configured, queued jobs disappear. In production, this means either Redis with AOF persistence, Redis Sentinel for failover, or a managed service like AWS ElastiCache with Multi-AZ.
Amazon SQS
SQS is the right choice when durability matters more than latency. Messages are replicated across multiple availability zones. You do not manage the infrastructure. The trade-off is that SQS does not support blocking pops (workers poll), message ordering is best-effort (FIFO queues exist but add complexity), and Horizon does not support SQS. We use SQS for jobs where message loss is unacceptable, such as webhook-triggered jobs from payment providers: you cannot afford to lose a Stripe webhook because Redis restarted.
Database queues
The database driver stores jobs in a jobs table. No additional infrastructure is needed. For applications with low job volume (fewer than 1,000 jobs per day), this is a reasonable starting point. It fails at scale because every job dispatch and every job pickup is a database write, competing with your application's transactional queries for connection pool capacity. On PostgreSQL, we have seen connection pool contention begin at around 5,000 jobs per day, though the exact threshold depends on your application's transactional load and pool size.
Driver comparison at a glance
| Factor | Redis | SQS | Database |
|---|---|---|---|
| Durability | Volatile without AOF persistence | Replicated across availability zones | As durable as your database |
| Latency | Sub-millisecond latency, blocking pops | Polling-based, higher latency | Database write per dispatch and pickup |
| Horizon support | Full dashboard and auto-scaling | Not supported | Not supported |
| Infrastructure | Requires Redis server | Managed by AWS | No additional services |
| Scale ceiling | High (memory-bound) | Effectively unlimited | Connection pool contention above ~5k jobs/day |
Queue topology
A single default queue is where most applications start, and too often where queue design stops without anyone having made a deliberate decision. In production, we typically configure three to five named queues with explicit priorities.
| Queue | Purpose | Example Jobs |
|---|---|---|
| critical | User-facing, time-sensitive | Password reset emails, payment confirmations |
| webhooks | External system callbacks | Stripe webhooks, CRM sync events |
| default | Standard processing | Notification dispatch, cache warming |
| bulk | High-volume batch work | CSV imports, report generation |
| scheduled | Time-triggered jobs | Daily digest emails, data cleanup |
Workers are then assigned to queues with priority ordering. A password reset email will never wait behind a 50,000-row CSV import. The critical queue drains first. Always.
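That priority ordering is expressed on the worker command line, where queues are drained left to right (queue names match the table above):

```shell
# critical always empties before webhooks, webhooks before
# default, and default before bulk.
php artisan queue:work redis --queue=critical,webhooks,default,bulk
```

A common refinement is a second worker pool dedicated to bulk alone, so long-running imports never consume the capacity reserved for user-facing jobs.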
Job Dispatching Patterns
Laravel provides several dispatch mechanisms, each suited to different situations. The choice affects execution timing, failure isolation, and how jobs relate to each other.
Synchronous dispatch. dispatch_sync() runs the job inline, in the current process. Use this for testing and for local development. Running jobs synchronously in feature tests catches integration bugs that mocking hides: database state, event dispatch ordering, and exception handling. The speed trade-off is real, but the coverage gain is worth it for critical job classes. Never use synchronous dispatch in production for slow work.
Standard async dispatch. The job serialises onto the configured queue and returns immediately. A worker picks it up when capacity is available. This is the default and the right choice for most jobs.
Delayed dispatch. The job sits on the queue but is not available for processing until the delay expires. Useful for scheduled follow-ups, rate-limited API calls, and retry windows. Be aware that delayed jobs on Redis use sorted sets, which consume memory proportional to the number of delayed jobs.
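Both async variants as a sketch, with SendFollowUpEmail as a hypothetical job class:

```php
use App\Jobs\SendFollowUpEmail; // hypothetical job class

// Standard async dispatch onto a named queue.
SendFollowUpEmail::dispatch($user)->onQueue('default');

// Delayed dispatch: the job is invisible to workers for ten minutes.
SendFollowUpEmail::dispatch($user)->delay(now()->addMinutes(10));
```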
Job chaining
Each job in a chain runs only if the previous one succeeded. If the first step of, say, an import-validate-notify chain fails, the validation and notification jobs never execute. This is the correct pattern for multi-step workflows where later steps depend on earlier results, applying a form of command-query separation where the dispatch side (commands) is decoupled from the processing side. Compare this with workflow engines, which handle more complex branching and conditional logic.
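A minimal chain, assuming hypothetical job classes for an import workflow:

```php
use Illuminate\Support\Facades\Bus;

// Each job runs only after the previous one succeeds; a failure
// in ImportCsvRows means the later two jobs never run.
Bus::chain([
    new ImportCsvRows($path),
    new ValidateImportedData($importId),
    new NotifyImportComplete($importId),
])->onQueue('bulk')->dispatch();
```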
Job batching
Batching runs jobs concurrently with aggregate callbacks. Use it for parallelisable work: processing rows in a CSV, generating thumbnails for uploaded images, or sending notifications to a list of recipients. The batch tracks progress, so you can show the user a percentage complete via a real-time dashboard.
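A sketch using a hypothetical ProcessCsvChunk job (jobs in a batch must use the Batchable trait):

```php
use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Throwable;

$batch = Bus::batch(
    collect($chunks)->map(fn ($chunk) => new ProcessCsvChunk($chunk))->all()
)->then(function (Batch $batch) {
    // All jobs completed successfully.
})->catch(function (Batch $batch, Throwable $e) {
    // First failure detected.
})->onQueue('bulk')->dispatch();

// For a progress dashboard: $batch->progress() returns a percentage.
```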
Job middleware
Middleware wraps job execution with cross-cutting concerns. The two we use most frequently:
Rate limiting
RateLimited::class prevents jobs from exceeding external API limits. If your Stripe account allows 100 requests per second, rate-limiting middleware ensures your payment sync jobs respect that cap.
Preventing overlaps
WithoutOverlapping::class ensures only one instance of a job runs for a given key at a time. Critical for jobs that modify the same resource, such as recalculating an account balance.
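Inside a job class, both are attached via a middleware() method. This sketch assumes a named rate limiter called 'stripe' has already been registered via RateLimiter::for():

```php
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Queue\Middleware\WithoutOverlapping;

public function middleware(): array
{
    return [
        // Respect the external API's cap via the 'stripe' limiter.
        new RateLimited('stripe'),
        // One job at a time per account; overlapping jobs are
        // released back to the queue and retried after 30 seconds.
        (new WithoutOverlapping($this->accountId))->releaseAfter(30),
    ];
}
```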
Designing Idempotent Jobs
A job is idempotent if running it twice with the same input produces the same result as running it once. This property is not optional. It is a requirement for any job that might be retried. Queue systems provide at-least-once delivery, not exactly-once semantics, so every job must be safe to re-execute.
Retries happen constantly in production. A worker crashes mid-job. Redis fails over. A deployment restarts all workers. The queue system re-dispatches the job because job acknowledgement never reached the broker. If that job already wrote half its data to the database, the retry writes it again. Without idempotency, you get duplicate records, double-charged customers, or emails sent twice.
The rule: Jobs run "at least once", not "exactly once". Every job that performs a side effect must be designed so that running it multiple times produces the same outcome as running it once.
Unique job identifiers
Assign each job a UUID at dispatch time. Before processing, check whether a job with that UUID has already completed. Store completed UUIDs in a cache or database table with a TTL.
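A sketch of the check inside handle(), using the cache as the completion store. Note the small window between finishing the work and recording the UUID: that window is why the underlying work should still be idempotent in its own right.

```php
use Illuminate\Support\Facades\Cache;

public function handle(): void
{
    $key = "job-completed:{$this->jobUuid}"; // UUID assigned at dispatch time

    if (Cache::has($key)) {
        return; // a previous attempt already completed this work
    }

    // ... perform the job's (idempotent) work ...

    // Record completion; the TTL bounds how long retries are suppressed.
    Cache::put($key, true, now()->addDay());
}
```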
Database transactions with constraints
Wrap job work in a transaction and rely on unique constraints to prevent duplicates. If the job creates an invoice, a unique constraint on [order_id, invoice_type] ensures the retry fails gracefully rather than creating a second invoice.
Upserts over inserts
Use updateOrCreate() instead of create() when the job writes records that might already exist. The second execution updates the existing record rather than failing or duplicating.
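Both patterns in one sketch, assuming a hypothetical Invoice model with a unique index on (order_id, invoice_type):

```php
use App\Models\Invoice; // hypothetical model
use Illuminate\Support\Facades\DB;

DB::transaction(function () {
    // A retry matches the existing row and updates it instead of
    // inserting a duplicate; the unique index is the safety net.
    Invoice::updateOrCreate(
        ['order_id' => $this->orderId, 'invoice_type' => 'standard'],
        ['total' => $this->total, 'issued_at' => now()]
    );
});
```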
Separating side effects
Move non-idempotent side effects (sending emails, calling external APIs) to the end of the job, after the idempotent database work. If the job fails before reaching the side effect, no email is sent. If it fails after, the retry skips the database work and re-sends the email, which is typically acceptable.
Idempotency is not a library you install. It is a design discipline applied to every job class. We review job idempotency during code review the same way we review database migrations: as infrastructure that must be correct. The deduplication logs and completion records that idempotent jobs produce also feed into audit trails, providing a verifiable history of what work was performed, when, and whether it was a first execution or a safe retry.
Failed Job Handling and Dead Letter Queues
Jobs fail. Connections drop. External APIs return 500s. A CSV contains a row with a malformed date that crashes the parser. The question is not whether jobs will fail but what happens when they do.
Retry strategies
Laravel's default retry behaviour is configurable per job. You can set the number of attempts and a backoff schedule (for example, 10 seconds, then 60, then 300). The backoff prevents a job from hammering an external service that is already struggling.
For transient failures (network timeouts, rate limits), retries with backoff usually resolve the issue. For permanent failures (invalid data, missing dependencies), retries waste resources. The job needs to distinguish between the two.
| Failure Type | Example | Response |
|---|---|---|
| Transient | Network timeout, rate limit, temporary API outage | Retry with exponential backoff |
| Permanent | Invalid email address, deleted record, bad data | Fail immediately, log for review |
| Resource | Out of memory, disk full, connection pool exhausted | Release job back to queue, alert operations |
| Dependency | External service down, API returning 500s | Delay retry, activate circuit breaker |
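A job class that encodes the retry schedule and the transient/permanent distinction might look like the following sketch (SyncPaymentJob and InvalidPaymentDataException are hypothetical names):

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;

class SyncPaymentJob implements ShouldQueue
{
    use InteractsWithQueue;

    public int $tries = 4; // four attempts in total

    // Backoff between retries: 10s, then 60s, then 300s.
    public function backoff(): array
    {
        return [10, 60, 300];
    }

    public function handle(): void
    {
        try {
            // ... call the external payment API ...
        } catch (InvalidPaymentDataException $e) {
            $this->fail($e); // permanent: no retries, straight to failed_jobs
        }
        // Transient exceptions bubble up and trigger the backoff schedule.
    }
}
```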
Dead letter queues
When a job exhausts all retries, Laravel moves it to the failed_jobs table. This is Laravel's built-in failed job store, distinct from a true dead letter queue (such as an SQS DLQ), though it serves a similar purpose. These jobs need attention. They represent work the system could not complete. The dead letter queue pattern, borrowed from enterprise integration patterns, treats failed jobs as messages that require human or automated intervention.
In our systems, we implement three layers of handling.
Automated triage
A scheduled job scans failed_jobs hourly. Jobs that failed due to known transient issues (a third-party API outage that has since resolved) are automatically retried.
Alerting
When the failed job count exceeds a threshold (we typically set this at 10 failures per hour), an alert fires to Slack or PagerDuty. This catches systemic failures: a bad deployment, a database connection leak, or an external dependency that is down.
Manual review
Jobs that cannot be automatically retried are reviewed by a developer. The failed_jobs table stores the serialised job payload and the exception trace, providing everything needed to diagnose and replay.
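The replay workflow uses Laravel's built-in artisan commands:

```shell
# List failed jobs with IDs, queue names, and exception summaries.
php artisan queue:failed

# Replay one job by ID, or everything at once.
php artisan queue:retry <job-id>
php artisan queue:retry all

# Remove a failed job that should not be replayed.
php artisan queue:forget <job-id>
```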
The worst failure mode is silent failure. A job fails, nobody notices, and the customer never receives their invoice. Dead letter queue monitoring prevents this.
Monitoring, Horizon, and Operational Visibility
A queue without monitoring is a queue where problems go undetected until a customer reports them.
Laravel Horizon
Laravel Horizon provides a dashboard and configuration layer for Redis-based queues. It shows real-time metrics for every queue and worker: jobs per minute, job runtime with percentile breakdowns, failed jobs with full exception traces, wait time before a worker picks up a job, and worker status across all processes.
Horizon also manages worker processes through its supervisor configuration, automatically scaling workers up or down based on queue depth. This is more reliable than manually managing queue:work processes with Supervisor.
What to monitor and alert on
Beyond Horizon's dashboard, we set up alerts for specific conditions that require human intervention. These thresholds integrate with the broader security and operations monitoring stack, where queue metrics sit alongside uptime checks, slow query alerts, and application performance data.
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Queue depth | 500 pending jobs | 2,000 pending jobs |
| Wait time | 30 seconds | 120 seconds |
| Failure rate | 1% of processed | 5% of processed |
| Worker count | Below expected | Zero workers running |
| Memory per worker | 100 MB | 200 MB |
Process management
Queue workers are long-running PHP processes. They do not restart between jobs. This makes them susceptible to memory leaks, stale database connections, and accumulated state. In production, we configure workers to restart regularly: process up to 1,000 jobs or run for one hour, whichever comes first, then exit cleanly. Supervisor or systemd restarts the worker immediately. This limits the impact of memory leaks and ensures workers pick up code changes after deployments.
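The worker flags for that restart policy:

```shell
# Exit cleanly after 1,000 jobs or one hour, whichever comes first;
# Supervisor or systemd restarts the process with fresh memory.
php artisan queue:work redis --max-jobs=1000 --max-time=3600
```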
Production Failure Modes
Tutorials show you how to dispatch a job and process it. They do not show you what happens when things go wrong at scale. These are the failure modes that surface repeatedly in production Laravel applications.
Memory leaks in long-running workers
PHP was designed for request-response cycles where memory is freed after each request. Queue workers break this assumption. Common sources: Eloquent model events that accumulate listeners, logging handlers that buffer output, and image processing libraries that do not release resources. The fix: limit worker lifetime with --max-jobs and --max-time.
Job timeouts and the --timeout flag
A job that hangs blocks the worker indefinitely. Set the worker timeout higher than the individual job's $timeout property. The job-level timeout raises a MaxAttemptsExceededException. The worker-level timeout kills the process. If they are equal, the worker dies before the exception handler can run.
Race conditions between concurrent jobs
Two workers pick up two jobs that both modify the same account balance. Without locking, one writes an outdated value. Database-level locking (lockForUpdate()) prevents this but adds contention. For high-throughput scenarios, use atomic cache operations or redesign the job to append events rather than mutate state directly.
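A sketch of the database-level approach, assuming a hypothetical Account model:

```php
use App\Models\Account; // hypothetical model
use Illuminate\Support\Facades\DB;

DB::transaction(function () {
    // The row lock is held until commit: a concurrent job blocks here
    // instead of reading a balance that is about to change.
    $account = Account::whereKey($this->accountId)
        ->lockForUpdate()
        ->first();

    $account->balance += $this->amount;
    $account->save();
});
```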
Tenant isolation in multi-tenant queues
In multi-tenant Laravel applications, jobs must execute within the correct tenant context. Capture the tenant identifier at dispatch time, restore it at the start of handle(). Without this, a queue worker processing jobs from multiple tenants will retain the context from the previous job.
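A sketch of the capture-and-restore pattern; Tenancy stands in for whatever tenant-context service your application uses:

```php
use Illuminate\Contracts\Queue\ShouldQueue;

class SyncTenantData implements ShouldQueue // hypothetical job class
{
    // Captured at dispatch time and serialised with the payload.
    public function __construct(public int $tenantId)
    {
    }

    public function handle(): void
    {
        Tenancy::initialize($this->tenantId); // restore context first

        try {
            // ... tenant-scoped work ...
        } finally {
            Tenancy::end(); // reset so the next job starts clean
        }
    }
}
```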
The timeout triad
Three timeout values interact in Laravel queue processing, and misconfiguring any one of them creates failures that are difficult to diagnose.
| Setting | Controls | Correct Configuration |
|---|---|---|
| $timeout (job property) | Maximum seconds a single job may run before a MaxAttemptsExceededException is thrown | Set per job class based on expected execution time |
| --timeout (worker flag) | Maximum seconds before the worker process is killed | Must be higher than the longest job $timeout on that queue |
| retry_after (queue config) | Seconds before an unacknowledged job becomes available again | Must be higher than the longest possible job execution time |
The critical rule: worker --timeout must exceed job $timeout. If they are equal (or the worker timeout is lower), the worker process dies before the job's exception handler can run. No failed() method fires. No cleanup happens. The job simply vanishes from the worker's perspective and is eventually retried by the queue, potentially causing duplicate execution.
The second critical rule: retry_after must exceed maximum job execution time. If a job takes 90 seconds but retry_after is set to 60, the queue makes the job available again while the first worker is still processing it. Two workers now process the same job concurrently. Without idempotency, this causes data corruption.
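Putting illustrative numbers to both rules, for a queue whose slowest job runs about five minutes:

```php
// In the job class: hard cap for this job's execution.
public int $timeout = 300;

// In config/queue.php, on the redis connection: longer than any
// possible job runtime, so an in-flight job is never re-released.
'retry_after' => 600,

// Worker invocation: above the longest job $timeout on this queue.
//   php artisan queue:work redis --timeout=330
```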
Webhook-triggered job patterns
External systems (Stripe, Xero, CRM platforms) send webhooks to your application. Each webhook should dispatch a job rather than processing inline. This ensures the webhook endpoint returns a 200 quickly and isolates the processing from the HTTP request.
The challenge with webhook jobs is idempotency. Stripe sends the same webhook event multiple times as a reliability measure. Your job must handle receiving the same event three times without creating three payment records. Store the webhook event ID and check it before processing, following the same idempotency patterns described above. This pattern connects to our broader approach to API integrations, where incoming data from external systems flows through queued jobs with validation, deduplication, and error handling at each stage.
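A sketch of the deduplication check, assuming a hypothetical WebhookEvent model with a unique index on the event ID column:

```php
use App\Models\WebhookEvent; // hypothetical model

public function handle(): void
{
    // If Stripe delivers the same event twice, the second call
    // finds the existing row instead of creating another.
    $event = WebhookEvent::firstOrCreate(
        ['stripe_event_id' => $this->payload['id']],
        ['type' => $this->payload['type']]
    );

    if ($event->processed_at !== null) {
        return; // already handled on a previous delivery
    }

    // ... process the event: create the payment record, etc. ...

    $event->update(['processed_at' => now()]);
}
```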
Symptom-to-cause diagnostic reference
When something goes wrong with queue processing, the symptom rarely points directly at the cause. This reference maps common production symptoms to their root causes and fixes.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Worker crashes after N jobs | Memory leak (Eloquent listeners, image libraries, logging buffers) | Set --max-jobs and --max-time |
| Job runs twice | Missing idempotency key, or retry_after shorter than execution time | Add UUID deduplication or unique constraint; increase retry_after |
| Queue depth growing steadily | Producers outpacing consumers | Scale workers, add rate-limited dispatch, enable Horizon auto-scaling |
| Jobs fail after deployment | Payload incompatibility (changed class signatures or properties) | Drain queue before deploying, or version job classes |
| Wrong tenant data in job output | Tenant context leakage between jobs | Capture tenant ID at dispatch, restore in handle() via job middleware |
| Worker dies without logging | Worker --timeout equal to or lower than job $timeout | Set worker --timeout higher than the longest job timeout on that queue |
Deployment Safety and Queue Worker Restarts
Deploying new code to an application with active queue workers introduces a failure mode that tutorials never cover: payload compatibility between code versions.
When a job is dispatched, Laravel serialises the job class and its properties. When a worker picks the job up, it deserialises that payload and calls handle(). If the code changed between dispatch and processing (because a deployment happened in between), the deserialisation can fail. Renamed classes throw ClassNotFoundException. Changed constructor signatures cause property mismatches. Removed or renamed properties produce silent null values that cascade into application errors.
The risk window is small but real: any job sitting on the queue at the moment of deployment was serialised by the old code and will be processed by the new code.
Graceful worker shutdown
Laravel's queue:restart command signals all workers to finish their current job and then exit. Supervisor or systemd restarts them with the new code. This prevents a worker from being killed mid-job. Run queue:restart as part of every deployment script, after the new code is live.
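In a deployment script, the ordering matters: signal the restart only after the new release is live.

```shell
# ... pull code, install dependencies, run migrations ...

# Tell every worker to finish its current job and exit;
# Supervisor or systemd restarts them on the new code.
php artisan queue:restart
```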
Queue draining for breaking changes
If a deployment changes a job class's constructor signature, namespace, or serialised properties, drain the affected queues before deploying. Stop dispatching new jobs, wait for existing jobs to process, then deploy. For high-volume queues, this means timing deployments during low-traffic windows.
Backward-compatible job changes
The safest approach is to make job changes backward-compatible. Add new constructor parameters with defaults. Keep old property names as aliases during a transition period. This is the same discipline applied to database migrations: never remove a column that running code still references.
For applications where job loss is unacceptable (payment processing, order fulfilment), we version job classes explicitly. The old class stays in the codebase until all pending jobs have processed, and new dispatches use the updated class. This adds complexity but eliminates the deployment risk window entirely.
Backpressure and Queue Depth Management
Most queue documentation covers dispatching and processing but ignores what happens when producers outpace consumers. Queue depth grows. Memory consumption rises. Eventually the system degrades, and the degradation pattern depends on your driver.
With Redis, delayed jobs use sorted sets. Each delayed job consumes memory proportional to its serialised payload. A burst of 100,000 delayed jobs with large payloads can push Redis memory usage past its configured limit, triggering eviction policies that silently drop queued jobs. With the database driver, a growing jobs table increases query time for both dispatch and pickup operations, and the table's indexes bloat. With SQS, the default in-flight message limit (120,000 messages per standard queue) acts as a constraint, though this quota can be increased via AWS support.
Backpressure strategies prevent these failure modes.
Rate-limited dispatch
Use rate-limiting middleware on the dispatch side, not just the processing side. If a bulk import generates 50,000 jobs, dispatch them in batches of 1,000 with delays between batches rather than flooding the queue.
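A sketch of staggered dispatch, reusing the hypothetical ProcessCsvChunk job:

```php
// Dispatch 50,000 rows as 50 chunked jobs, each batch becoming
// visible to workers 30 seconds after the previous one.
collect($rows)
    ->chunk(1000)
    ->values()
    ->each(function ($chunk, int $index) {
        ProcessCsvChunk::dispatch($chunk->all())
            ->onQueue('bulk')
            ->delay(now()->addSeconds($index * 30));
    });
```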
Dynamic worker scaling
Horizon's auto-scaling adjusts worker count based on queue depth. Configure a minimum worker count for baseline throughput and a maximum for burst capacity. Without Horizon (SQS or database drivers), use external scaling based on queue depth metrics.
Queue depth alerting
Monitor queue depth as a leading indicator. A steadily growing queue means consumers cannot keep pace. Alert before the queue reaches a level where memory, disk, or message limits become a problem.
When Background Jobs Change How a Business Operates
The technical patterns above are infrastructure. The business impact of well-designed background jobs is what makes them worth the engineering investment.
- Reports that generate themselves. Users click "Generate" and receive an email with the finished report. No waiting, no timeouts, no 504 errors.
- Data imports with progress tracking. Batch jobs with progress bars replace cron jobs that blocked other scheduled tasks and left operations guessing.
- Payment webhooks that never go missing. Dedicated queues with SQS durability, dead letter monitoring, and automatic retry. Revenue stops leaking through infrastructure gaps.
- Responsive applications under load. Heavy work happens in the background. Users never wait for email servers or report generation. Pages respond instantly.
These are patterns we have implemented across order management systems, financial operations platforms, and service delivery tools, deployed via Laravel Forge or Laravel Vapor depending on the infrastructure requirements. Background job processing connects to workflow engines for complex multi-step processes, to real-time dashboards for operational visibility, and to infrastructure decisions about how workers are deployed and scaled.
Build Reliable Queue Infrastructure
If your Laravel application is running slow work inside HTTP requests, or if your background jobs work in development but cause problems in production, we are happy to talk it through.
Discuss your queue architecture →