Published On: Jan 22 2026
Written By: Krishnan Sethuraman
Category: Infrastructure

Modern web applications are rarely simple. They are composed of multiple moving parts like load balancers, horizontally scaled web servers, optimized databases, caching layers, background workers, and asynchronous queues. When designed correctly, these components work together to absorb traffic spikes, isolate failures, and keep latency predictable.
Yet, even in well-architected systems, performance issues can creep in silently.
This article walks through a real production issue we encountered with one of our customers in ProdSure, where server load kept increasing despite all the “right” optimizations being in place. The root cause turned out to be something deceptively simple and surprisingly common: misused cron jobs.
More importantly, we’ll cover how we redesigned the execution model using RabbitMQ, workers, clustering, and proper isolation, resulting in a 40% improvement in application performance and a 60% reduction in server load.
The customer reached out to us with a familiar yet frustrating problem:
The web servers were experiencing gradually increasing load, yet when we looked into the servers and logs, everything seemed fine.
At first glance, the infrastructure looked solid: a load balancer distributing traffic evenly, horizontally scaled web servers, an optimized database, a healthy caching layer, RabbitMQ with a dedicated worker server, and cron jobs split across the web servers.
And yet, server load kept climbing slowly over time, eventually affecting application responsiveness.
This is one of the hardest types of problems to diagnose: death by a thousand small cuts rather than a single catastrophic failure.
Before jumping to conclusions, we systematically ruled out the common causes:
Load balancing: traffic was evenly distributed, and no server was unfairly overloaded.
Database: slow query logs were clean, execution plans were stable, and there was no sudden growth in query volume.
Caching: the cache hit ratio was healthy, with no mass cache invalidations.
Queue workers: messages were being consumed as expected, with no backlog accumulation.
Cron jobs: already split across servers, so in theory no single server should have been overloaded.
Yet glances kept reporting increased load on the servers.
At this point, we shifted focus from infrastructure configuration to execution behavior.
Using tools like glances, htop, and process-level inspection, a pattern emerged: the cron job processes themselves were consuming a significant share of server resources during their runs.
This was unexpected because cron jobs were supposedly lightweight.
So we did what often yields the biggest insights: we read the cron job code. In the meantime, we also disabled two non-critical cron jobs on one of the web servers.
The issue became obvious almost immediately.
Although the cron jobs were split across servers, they were running heavy SQL queries, performing expensive calculations, and writing updates directly, all inside the cron process itself.
In short, the cron jobs were doing far more than they should.
This was a critical design mistake: the cron jobs were being treated as workers, not triggers. It was a fundamental architectural flaw.
Cron has a very specific role in system design: it should trigger processes, not execute heavy business logic.
Its responsibilities should be limited to lightweight scheduling work: kicking off a task, publishing a message to a queue, and exiting quickly.
It is not meant for heavy SQL queries, long-running calculations, or bulk data updates.
When cron jobs violate this principle, they compete with regular application traffic for CPU and memory, load builds up gradually, and the cause becomes hard to trace.
That is exactly what was happening in this case, and the disabled cron jobs confirmed it: performance on that server improved.
What made this case especially interesting was that the customer already had RabbitMQ in place as a message broker, along with a dedicated worker server consuming from it.
Yet, the cron jobs had bypassed this entire architecture and were executing heavy logic directly.
So instead of introducing new tools, our approach was simple: use the existing architecture correctly.
The following solution was implemented in phases and was gradually pushed into production.
The first thing we addressed was resilience.
RabbitMQ was running on a single server, which posed a clear risk: if RabbitMQ went down, all background processing would stop and cron-triggered tasks would fail silently or pile up. That would have been a catastrophic failure.
We converted the single RabbitMQ instance into a two-node cluster. RabbitMQ supports clustering out of the box, so implementing it was not difficult.
Key benefits: the broker was no longer a single point of failure, queues remained available if one node went down, and maintenance could be performed on one node without halting background processing.
This ensured that the new execution model would be production-grade, not just performant.
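As a rough illustration, here is a minimal sketch of how an application might publish to a two-node RabbitMQ cluster from Python with pika. The host names, queue name, and the use of a quorum queue are assumptions for the example, not details from the customer's setup.

```python
# Sketch: publishing to a two-node RabbitMQ cluster with pika.
# "mq1"/"mq2" and "background_tasks" are placeholder names.
import pika

# Try each cluster node in turn, so the loss of one node does not stop publishing.
params = [
    pika.ConnectionParameters(host="mq1"),
    pika.ConnectionParameters(host="mq2"),
]
connection = pika.BlockingConnection(params)
channel = connection.channel()

# A quorum queue is replicated across cluster nodes, so queued messages
# survive the failure of a single node.
channel.queue_declare(
    queue="background_tasks",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)
connection.close()
```

Passing both nodes as connection parameters lets clients fail over if one node is unreachable, while the replicated queue keeps the messages themselves available across the cluster.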
Next, we examined the worker layer. There was only one worker server handling all background jobs. While it was performing well, it was another single point of failure: if the worker server went down, all async processing would halt, and there was no redundancy during deployments or outages.
We cloned the existing worker server to create a second identical worker node.
This achieved two things: redundancy, so background processing continued even if one worker node failed, and extra capacity, since both nodes could consume from the same queues in parallel.
Now, both RabbitMQ and the worker layer were highly available.
This was the most critical step. We guided the development team to rewrite the cron jobs entirely.
Old Model (Problematic)
Cron Job → Heavy SQL → Calculations → Updates → Completion
New Model (Correct)
Cron Job → Publish Message → Exit
Worker → Heavy SQL → Calculations → Updates
In practice, this meant the cron jobs became thin publishers: each one connected to RabbitMQ, published a message describing the work to be done, and exited immediately, while the workers consumed those messages and performed the heavy SQL, calculations, and updates. A sketch of both sides is shown below.
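The article does not include the customer's code, but a minimal sketch of the two sides, assuming Python and pika, could look like this; the queue name, broker host, and task payload are illustrative.

```python
import json
import pika

QUEUE = "cron_tasks"  # illustrative queue name
MQ_HOST = "mq1"       # illustrative broker host


def run_heavy_task(task):
    """Placeholder for the heavy SQL, calculations, and updates moved out of cron."""


def cron_trigger():
    """Runs from cron: publish a message and exit without doing heavy work."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=MQ_HOST))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=json.dumps({"task": "daily_report"}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()


def worker():
    """Runs on the worker servers: consume messages and do the heavy lifting."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=MQ_HOST))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_qos(prefetch_count=1)  # fair dispatch across worker nodes

    def handle(ch, method, properties, body):
        run_heavy_task(json.loads(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=handle)
    channel.start_consuming()
```

The basic_qos call makes RabbitMQ hand each worker one unacknowledged message at a time, so the two worker nodes share the load evenly.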
Previously, the cron jobs were running on the application servers. It is good practice to run cron jobs on a separate server to avoid issues like resource contention.
Drawbacks of running cron jobs on the web servers: scheduled tasks compete with user requests for CPU and memory, load spikes during cron windows, and response times suffer precisely when both kinds of work overlap.
We moved cron execution entirely to the worker servers. To prevent duplicate execution, we implemented a lock system.
When you run cron jobs on multiple servers, you must ensure that each scheduled task runs exactly once per cycle, not once per server.
The lock system ensured that whichever worker node acquired the lock first ran the job, while the other skipped that cycle; a sketch of such a lock is shown below.
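The article does not say which lock mechanism was used; one common approach is a shared key with an expiry in Redis, sketched below. The Redis host, key naming, and timeout are assumptions for the example.

```python
import uuid

import redis

r = redis.Redis(host="cache1")  # illustrative Redis host


def run_once_across_servers(job_name, job_fn, ttl_seconds=300):
    """Run job_fn only on the server that wins the lock for this cron cycle."""
    token = str(uuid.uuid4())
    # SET with nx=True succeeds on only one server; ex=ttl_seconds makes the
    # lock expire automatically so a crashed job cannot hold it forever.
    if r.set(f"cron-lock:{job_name}", token, nx=True, ex=ttl_seconds):
        try:
            job_fn()
        finally:
            # Release the lock only if we still own it (a Lua script would make
            # this check-and-delete atomic; the sketch keeps it simple).
            if r.get(f"cron-lock:{job_name}") == token.encode():
                r.delete(f"cron-lock:{job_name}")
    # Otherwise another server acquired the lock and is handling this cycle.
```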
Once the code changes were complete, we updated the Jenkins pipeline so that cron jobs were no longer configured on the web servers. From this point on, the cron jobs ran only from the worker servers.
The pipeline was modified to install the crontab entries only on the worker servers and to remove any existing cron configuration from the web servers during deployment.
This ensured that every deployment produced a consistent state: web servers only served requests, while all scheduled work originated from the worker layer.
In the production environment we also decided not to use Supervisor to run the jobs. Instead, we used systemd and configured the jobs as systemd services. This removed an unnecessary layer and ensured the services restarted automatically after every server reboot.
This final step completed the separation of concerns: web servers handle user requests, worker servers handle scheduled triggers and background processing, and RabbitMQ connects the two. A sketch of a systemd unit for a worker process is shown below.
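For illustration, a systemd unit for a worker process could look roughly like this; the service name, user, and paths are placeholders, not the customer's actual configuration.

```ini
# /etc/systemd/system/queue-worker.service (illustrative)
[Unit]
Description=Background queue worker
After=network.target

[Service]
User=appuser
ExecStart=/usr/bin/python3 /opt/app/worker.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enabling the unit with systemctl enable --now queue-worker starts it immediately and on every boot, which gives the restart-after-reboot behaviour mentioned above.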
A few days after deploying the changes into production, we reviewed the system metrics again using glances, and the difference was clearly visible. The web servers that were earlier constantly under pressure now had plenty of breathing room - even during peak cron windows.
The results were striking: roughly a 40% improvement in application performance and a 60% reduction in server load.
More importantly, the system became predictable again.
That predictability matters a lot in production because it restores confidence. Developers can now deploy without fear that a scheduled task will unexpectedly spike the servers. Operations teams can monitor worker queues and scale consumers if needed. And the business no longer has to worry about performance degradation slowly building up throughout the day.
By moving the heavy lifting away from the web servers and into the worker layer, we didn’t just reduce load—we made the platform more scalable, resilient, and easier to operate long-term.
This case reinforces several critical lessons:
1. Cron should trigger work, not do the work.
If your cron job contains heavy database operations, loops over large datasets, or expensive calculations, it will eventually become a bottleneck—especially as your data grows.
2. Heavy Logic Must Be Asynchronous
Queues and workers exist for a reason—use them.
RabbitMQ (or any message broker) is built exactly for handling these workloads reliably, with retries, scaling options, and better control over execution.
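As one concrete example of that control, RabbitMQ can dead-letter messages that a worker rejects, so failed jobs are parked for inspection or later re-processing instead of being lost. The queue names below are illustrative, assuming pika.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="mq1"))
channel = connection.channel()

# Messages rejected by a worker (basic_nack with requeue=False) are routed to
# "report_tasks.failed" instead of disappearing.
channel.queue_declare(queue="report_tasks.failed", durable=True)
channel.queue_declare(
    queue="report_tasks",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": "report_tasks.failed",
    },
)
connection.close()
```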
3. Isolation Improves Performance
Separating web, cron, and worker responsibilities dramatically reduces contention.
Once web servers are only handling requests and not background computations, user traffic becomes more stable and latency stops fluctuating randomly.
4. High Availability Matters Everywhere
Single points of failure often hide in “secondary” systems like queues and workers.
Clustering RabbitMQ and running multiple worker nodes ensured the system stayed reliable even if one instance went down.
5. Architecture Discipline Pays Off
Even a well-built system can fail if execution boundaries are ignored.
A queue and worker setup won’t help if the application bypasses it and pushes heavy workloads back into cron jobs.

Performance issues are not always caused by missing components. Often, they stem from misusing the components you already have.
In this case, the fix did not require new databases, new caching layers, or expensive scaling. It required correct architectural discipline.
If your servers are under load and everything looks optimized, take a closer look at your cron jobs. You may find that they are quietly doing far more damage than you expect.
Founder & CTO of Geedesk. Passionate about building software from scratch, launching SaaS products, and helping teams deliver enterprise-grade solutions.