
What Cloudflare’s Second Oopsie in Three Weeks Taught Us About Your Fragile Stack
It wasn’t a state-sponsored cyberattack. It was a Lua exception on a killswitch code path that had never been tested. Welcome to the terrifyingly mundane reality of modern internet infrastructure.
Let’s be honest. When your phone lit up with alerts at 08:47 am on December 5th, you weren’t surprised. You were probably just exhausted. Another titan of internet infrastructure had face-planted. Twenty-five minutes of HTTP 500 errors affecting 28% of Cloudflare’s traffic. And here’s the kicker: this was Cloudflare’s second major outage in three weeks, following their November 18th incident.
For those keeping score at home: AWS, Azure, and now Cloudflare (twice) have all stumbled spectacularly this quarter. The unholy trinity of hyperscale failures is complete, with Cloudflare earning bonus points for an encore performance.
For us in the trenches, it’s a grimly satisfying ‘I told you so’ moment. But for CTOs and Managing Directors staring at dashboards full of angry red lines, it’s another terrifying reminder that ‘the cloud’ is just someone else’s computer. And sometimes, that computer crashes because someone disabled an internal testing tool they thought wasn’t required.
At Digital Craftsmen, we’ve been banging the drum on true digital resilience for 24 years. This isn’t scaremongering—it’s architectural reality. If you’re still banking on a single provider’s SLA to save your skin, December 5th was your final wake-up call.
The Anatomy of a Mundane Disaster (Again)
Contrary to breathless speculation on tech Twitter, Cloudflare wasn’t taken down by sophisticated cyberattacks or malicious activity of any kind. The culprit was far more pedestrian, and frankly, more terrifying for anyone who manages complex systems.
Here’s what actually happened:
The Chain of Events
Cloudflare was rolling out changes to their Web Application Firewall to protect customers against CVE-2025-55182, a critical vulnerability in React Server Components disclosed that week. As part of this security improvement, they increased their HTTP request body buffer size from 128KB to 1MB.
During the gradual rollout, an internal testing tool started throwing errors. Since it was ‘just’ an internal tool and the security fix was important, they decided to temporarily disable it via their global configuration system. That’s when things went sideways.
The Technical Reality
When the killswitch was applied to a rule with an action of ‘execute’ (something they had never done before), the code correctly skipped evaluating the rule, but then stumbled over this code:
if rule_result.action == "execute" then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
The code assumed that whenever the action was "execute", the rule_result.execute object would exist. But because the rule had been skipped, that object was nil, and Lua threw an exception the moment the code tried to index it.
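To see just how small the failure surface was, here is a minimal sketch of the kind of nil guard that would have turned the skipped rule into a no-op instead of an exception. It is illustrative only, not Cloudflare’s actual remediation:

-- Illustrative sketch: guard against the rule having been skipped by a killswitch.
if rule_result.action == "execute" and rule_result.execute ~= nil then
    -- Only dereference the execute table when the rule was actually evaluated.
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end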
The result? Every customer running Cloudflare’s older FL1 proxy with the Managed Ruleset deployed started serving HTTP 500 errors for everything.
The irony? This straightforward error had sat undetected for years and would have been prevented by a language with a strong type system: their newer FL2 proxy, written in Rust, didn’t have the bug.
The Uncomfortable Truths for Your Stack
So what does this mean for you, the battle-weary tech leader who’s seen it all?
1. The ‘Too Big to Fail’ Vendor Just Failed. Again.
The assumption that infrastructure giants have immaculate processes and infallible systems is demonstrably false. They’re run by humans who push code that breaks in edge cases nobody tested.
Cloudflare admitted: ‘We have never before applied a killswitch to a rule with an action of "execute".’ Translation: they had an untested code path in production for years, and it detonated spectacularly when finally triggered.
If your own change management processes haven’t tripped over similar landmines, you’re either incredibly lucky or haven’t pushed hard enough yet.
2. A Single Point of Failure Remains a Ticking Time Bomb
The change propagated to Cloudflare’s entire network within seconds using their global configuration system—the same system under review after their November 18th outage.
You wouldn’t run your entire data centre on a single power supply. Why are you running your entire public-facing infrastructure through a single digital pipe? When that pipe gets clogged, you have zero recourse.
3. Resilience Isn’t a Product. It’s a Strategy.
You can’t buy your way out of this with another SaaS dashboard. Cloudflare acknowledged they’re working on enhanced rollouts, versioning, fail-open error handling, and streamlined break-glass capabilities—but admitted ‘we have not finished deploying them yet’.
Even the giants are still figuring this out. Which means you absolutely need to architect resilience into your own systems, because no vendor can do it for you.
The Grown-Up Approach to Infrastructure
The lesson isn’t to abandon Cloudflare or any other provider. They build incredible technology that powers the modern web, and these outages weren’t caused by malicious activity; they were triggered by changes intended to improve security.
The lesson is to stop treating them as infallible gods and start treating them as what they are: critical but fallible vendors in your supply chain.
Your Digital Resilience Checklist
This is where we get practical. Digital Craftsmen’s Digital Resilience Checklist breaks this down into immediate actions and long-term strategy. Here’s what actually matters:
Immediate: Critical Dependency Review
Right now, today, map every single point of failure:
- Identify all critical services – Applications, APIs, payment gateways, CDNs, communication platforms, and middleware. Everything.
- Map third-party dependencies – Which vendors could take you offline if they go down?
- Confirm primary cloud regions – Where does each critical component actually live?
- Assess single-region risk – What happens when (not if) that region fails?
If you can’t answer these questions in under 10 minutes, your dependency mapping is inadequate. Full stop.
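If it helps to make this concrete, a dependency map doesn’t need a fancy tool. Here is a deliberately simple sketch in the same Lua that stars in this story; every service name, vendor and region in it is hypothetical:

-- Hypothetical entries: swap in your real services, vendors and regions.
local critical_services = {
    {
        name = "checkout-api",
        vendors = { "Cloudflare", "AWS" },  -- third parties that can take this service offline
        primary_region = "eu-west-2",
        failover_region = nil,              -- nil = single-region risk, no documented recourse
    },
    {
        name = "payments",
        vendors = { "AcmePayments" },       -- made-up vendor name
        primary_region = "vendor-managed",
        failover_region = "vendor-managed",
    },
}

-- Anything without a failover story is a single point of failure. Name it out loud.
for _, service in ipairs(critical_services) do
    if service.failover_region == nil then
        print(service.name .. ": single point of failure in " .. service.primary_region)
    end
end

The format doesn’t matter. What matters is that the four questions above become answerable in seconds rather than mid-incident.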
Immediate: Backup & Recovery Verification
When did you last actually test restoring from backup? Not ‘confirmed backups exist’. Actually restored them and verified functionality.
The 3-2-1-1-0 Rule isn’t optional anymore:
- 3 copies of data (production + 2 backups)
- 2 different types of media
- 1 copy off-site
- 1 immutable/air-gapped copy (ransomware protection)
- 0 errors (regular verification)
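The rule is simple enough to encode as a sanity check against your own backup inventory. A rough sketch with made-up entries, just to show how little it takes:

-- Made-up inventory: one on-site disk copy, one immutable off-site copy, plus production.
local backups = {
    { location = "on-site",  media = "disk",           immutable = false },
    { location = "off-site", media = "object-storage", immutable = true  },
}

local copies = 1 + #backups   -- production plus each backup copy
local media_types = {}
local off_site, immutable = false, false
for _, b in ipairs(backups) do
    media_types[b.media] = true
    if b.location == "off-site" then off_site = true end
    if b.immutable then immutable = true end
end
local media_count = 0
for _ in pairs(media_types) do media_count = media_count + 1 end

print("3 copies:         ", copies >= 3)
print("2 media types:    ", media_count >= 2)
print("1 off-site copy:  ", off_site)
print("1 immutable copy: ", immutable)
-- The final 0 (zero errors) only comes from actually running restore tests.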
Define your tolerances:
- RPO (Recovery Point Objective): How much data loss can you actually survive?
- RTO (Recovery Time Objective): How long can you actually be offline?
Be honest. Not aspirational. Honest.
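Honesty is easier when the measured numbers sit next to the promised ones. A trivial illustration; every figure here is invented:

-- Invented figures: replace with what you have actually measured.
local stated_rto_minutes    = 60    -- what the business has been told
local measured_restore_mins = 210   -- how long the last real restore test took

local stated_rpo_minutes    = 15
local backup_interval_mins  = 240   -- four-hourly snapshots cannot meet a 15-minute RPO

if measured_restore_mins > stated_rto_minutes then
    print("RTO is aspirational: the last restore took " .. measured_restore_mins .. " minutes")
end
if backup_interval_mins > stated_rpo_minutes then
    print("RPO is aspirational: worst-case data loss is " .. backup_interval_mins .. " minutes")
end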
Long-Term: Architect for High Availability
Goal: Minimise downtime from localised failures.
Key strategies:
- Multi-Availability Zone deployment across geographically separate data centres
- Elastic load balancing and auto-scaling to distribute traffic and resources automatically
- Static stability – systems that keep operating even when dependencies fail (see the sketch below)
This isn’t exotic enterprise architecture anymore. This is baseline professional practice.
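Static stability is the least familiar of the three, so here is a minimal sketch of the idea: when a dependency fails, keep serving the last known-good state instead of failing closed. The fetch_remote_config function is a hypothetical stand-in for whatever call reaches your dependency:

-- Hypothetical stand-in for a dependency call; here it always fails, for illustration.
local function fetch_remote_config()
    error("dependency unavailable")
end

-- Seeded from the last successful fetch before the dependency went away.
local last_known_good = { ruleset_version = "2025-12-04" }

local function get_config()
    local ok, fresh = pcall(fetch_remote_config)
    if ok and fresh ~= nil then
        last_known_good = fresh   -- refresh the cache on success
        return fresh
    end
    return last_known_good        -- dependency down: keep operating on cached state
end

print(get_config().ruleset_version)  -- still answers, even though the fetch just failed

Contrast that with a change that propagates globally in seconds and fails closed. Fail-open behaviour and cached state are design choices, and they have to be made before the bad day.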
Long-Term: Implement Robust Disaster Recovery
Choose your DR strategy based on actual business requirements:
Pilot Light: Minimal core infrastructure running in DR region (lower cost, slower recovery)
Warm Standby: Scaled-down but fully functional stack always running (faster recovery, higher cost)
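One way to keep that choice grounded is to let the RTO you can actually tolerate drive the strategy, rather than budget habit. A toy sketch; the thresholds are invented:

-- Invented thresholds, purely to show the decision should be driven by RTO.
local function suggest_dr_strategy(rto_minutes)
    if rto_minutes <= 30 then
        return "Warm Standby"   -- always-on, scaled-down stack: fastest recovery, highest cost
    else
        return "Pilot Light"    -- core infrastructure only: cheaper, slower to bring up
    end
end

print(suggest_dr_strategy(15))   --> Warm Standby
print(suggest_dr_strategy(240))  --> Pilot Light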
Critical elements:
- Documented DR plan with clear roles and technical steps
- Regular testing – annually minimum, quarterly for critical systems
- Automation via Infrastructure as Code – manual DR is fantasy DR
Always: Modern Backup as Last Line of Defence
Cloud services like Amazon S3 are ideal for implementing the 3-2-1-1-0 rule properly. The immutable/air-gapped copy is your ransomware insurance policy.
Test restoration regularly. Untested backups are Schrödinger’s backups—simultaneously working and not working until you actually need them.
The Partnership Imperative: Why Going It Alone Is the Real Risk
Building this level of resilience is hard. It requires:
- Multi-cloud expertise across AWS, Azure, GCP, and bespoke platforms
- 24/7 monitoring and response capabilities
- Enterprise-grade operational discipline that doesn’t sleep
- Continuous testing and improvement of disaster scenarios
Your internal teams are already stretched thin designing, building and shipping new features while keeping production running. Most don’t have the bandwidth to become full-time resilience architects across multiple cloud platforms.
Digital Craftsmen: 24 Years of Battle-Tested Expertise
We’ve seen every type of failure the internet can throw at a business. We know how to design systems that stay standing when others fall, not because we’re smarter, but because we’ve made different architectural choices based on decades of accumulated scar tissue.
What sets us apart:
Cloud-Agnostic Expertise: We’re not tied to any single vendor. We design across AWS, Azure, GCP, and bespoke platforms based on what your business actually needs, not what earns us referral fees.
24/7 Proactive Monitoring: Dedicated incident response teams ensure immediate action when things go wrong, minimising RTO and business impact.
Enterprise-Grade Resilience, SMB Pricing: Sophisticated HA/DR strategies shouldn’t require enterprise budgets. We make them accessible.
White-Label Support: We seamlessly extend your existing team and enhance your offerings to clients without stepping on your brand.
Sustainable Infrastructure: Green hosting backed by our commitment to carbon neutrality. Increasingly required for enterprise procurement.
Free Up Your Talent: Your internal teams focus on innovation and growth, not infrastructure firefighting at 3 am.
The Uncomfortable Questions
When was the last time you:
- Actually tested your disaster recovery plan with a realistic simulation?
- Verified you can restore from backup within your stated RTO?
- Mapped all critical dependencies and identified single points of failure?
- Evaluated whether your current architecture could survive your primary provider going down for 25 minutes? Four hours? A day?
If the answers make you uncomfortable, you’re not alone. Most organisations discover their resilience gaps the hard way—during an actual outage.
The Cloudflare Reality Check
Cloudflare publicly acknowledged: ‘These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours’.
They’re right. It’s not acceptable.
But here’s the thing: They’re locking down all changes to their network until they have better mitigation and rollback systems. Even Cloudflare, with their resources and talent, is still building the resilience infrastructure they need.
If Cloudflare is still working on this, what makes you think your five-person ops team has it covered?
Stop Firefighting. Start Future-Proofing.
The next outage is coming. The only question is whether your business will be one of the casualties or one of the survivors.
Don’t wait for your own December 5th moment to expose the cracks in your foundation. Download our Digital Resilience Checklist and start the honest conversation about where your vulnerabilities are.
Better yet, let’s talk about building Fort Knox-level resilience around your critical applications. Not someday. Now.
Contact Digital Craftsmen: call us on 020 3745 7706, email [email protected], or find us on LinkedIn at www.linkedin.com/company/digital-craftsmen
Because in 2025, hoping your provider doesn’t break twice in three weeks isn’t a strategy. It’s a liability.
Glossary for the Acronym-Weary:
RTO (Recovery Time Objective): Maximum acceptable downtime after a disaster
RPO (Recovery Point Objective): Maximum acceptable data loss after a disaster
HA (High Availability): Systems designed to operate continuously without interruption
DR (Disaster Recovery): The process of restoring operations after a major incident
AZ (Availability Zone): An isolated location within a cloud region, used for fault tolerance
IaC (Infrastructure as Code): Managing infrastructure through machine-readable definition files

