Post-Mortem of a Cloudflare incident that resulted in site downtime

For the last year or so, we’ve used the Business Plan at Cloudflare. This is my review of that experience. BUT, in the end, you, the user, will be the one who tells us how the experience is.

Our server infrastructure is simple: a physical server all to ourselves, hosted at Softlayer/IBM, plus redundancies. I made sure that end of the equation was good. I’ve refactored it several times and can spin up a new server instance in no time (though it’s a bit stressful if done on-demand).

When I discovered Cloudflare, I was amazed. I thought I had found the solution to lots of problems. They delivered more than 90% of our bandwidth, cached at the edge nodes. That’s not bad! They ensured our server was compatible with the latest standards and protocols, some experimental. They made sure our content was optimal and secure.

But the price you pay is heavy. Over the last year, I’ve paid them $2400 (12 months at $200), for those of you who need a quick calculator. This bought me no special treatment, believe me. The speed penalty is what got me in the end. To be honest, I am sure a free plan would have served almost as well, and the $20 plan certainly would have. I thought, “You get what you pay for,” though, and I always *must* keep our infrastructure stable. I mean, it’s a darn critical day if the server is down! So I invest in that portion of the business rather than being cheap. In this case, I should have gone cheap.

A traditional CDN serves up your site’s static content, like images, from edge nodes around the world. This makes those objects faster. Cloudflare is a different beast. It routes *all* traffic through its servers, in the process decrypting it, parsing it, modifying it, re-encrypting it, then passing it along. That works great if most of your content is cacheable, but for sites like ours that run entirely over HTTPS it is problematic, because you have to be careful about what you cache. Some sites simply don’t cache any encrypted content to be safe, but I designed ours so that static encrypted content is cacheable and dynamic encrypted content is not.
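As a rough illustration of that static/dynamic split, here is a minimal sketch of picking a `Cache-Control` header per request path. The extension list, the one-day `max-age`, and the function name `cache_control_for` are assumptions made up for this example, not the site’s actual configuration.

```python
from pathlib import PurePosixPath

# Hypothetical set of static asset extensions; a real site would tune this.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".gif", ".ico", ".woff2"}


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for a request path."""
    # Drop any query string before looking at the file extension.
    suffix = PurePosixPath(path.split("?", 1)[0]).suffix.lower()
    if suffix in STATIC_EXTENSIONS:
        # Static files are identical for every visitor, so shared caches
        # (a CDN edge node, for instance) may store them. One day is an
        # assumed lifetime.
        return "public, max-age=86400"
    # Everything else is treated as per-user content that an intermediary
    # must not store.
    return "private, no-store"


print(cache_control_for("/img/logo.png"))    # static asset
print(cache_control_for("/account/orders"))  # dynamic page
```

Defaulting to `private, no-store` means a path that is not recognized is never cached by mistake, which is the safer failure mode for an all-HTTPS site.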

Anyway, the improved load speed on static, cacheable content let me accept a slight delay in page load times, since it sped up downloads and dramatically reduced server load and bandwidth. This was fine, if a bit wasteful (in hindsight the $200/mo plan offered few advantages; I had hoped for better service and traffic prioritization), until one day…

Ironically, it happened right after I made a remark about a critical parsing bug they had last month on an ‘answer’ the CEO had written. Now, I’m not saying it is related. No way what I have to say is that important to anyone. BUT, nonetheless, the next day I found server performance was crippled through Cloudflare. Page load times had doubled. Everything was timing out. It was a disaster.

A network engineer at Cloudflare quickly *validated* the issue (from the UK, a fresh, full page load took 30 seconds direct and 60 seconds through them; both high, but point made) and traced it to an ‘upstream provider’. Now, what that means is the Internet backbone. Had I then gone to Softlayer/IBM, they’d have told me it was a ‘downstream provider’ issue. Not my server, not the routers in my data center; according to him, it was part of the Internet backbone. If that is the case, I have absolutely no leverage over them and they have no accountability to me. Cloudflare sure would have some leverage, but they could not, or would not, lobby on my behalf, ASSUMING that was even true, as we saw it worldwide.

Here are the test results from the edge node in the UK, run while verifying what I was seeing in the USA and in China. They are archived in my Gmail, from the support ticket I opened when I contacted them.

Overall Comparison  CloudFlare  Origin     Difference  Percentage
------------------  ----------  ---------  ----------  ----------
Total Requests      111         111        0           0.00%
onLoad Time         62.09s      30.04s     -32.05s     -106.68%
Total Size          2262.90KB   2337.80KB  74.89KB     3.20%
Total Time          718.61s     62.02s     -656.59s    -1058.62%

CF Page Weight  #    Size       %
--------------  ---  ---------  -------
Cache HIT       68   1477.76KB  65.30%
Cache EXPIRED   14   43.46KB    1.92%
NOT Caching     8    33.85KB    1.50%
External        21   707.84KB   31.28%
Total           111  2262.90KB  100.00%

TTFB     CloudFlare  Origin  Difference  Percentage
-------  ----------  ------  ----------  ----------
Minimum  6           7       1           14.29%
Maximum  35077       689     -34388      -4991.00%
Average  6456.0      54.0    -6402.0     -11855.56%
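For anyone who wants to check the tables, the Percentage column appears to be the difference expressed relative to the origin figure. That formula is inferred from the numbers themselves, not stated by the test tool, and the function name below is made up for this sketch. The TTFB units are not given in the report, so the values are used as-is.

```python
def percent_difference(cloudflare: float, origin: float) -> float:
    """Difference (origin minus Cloudflare) as a percentage of the origin value.

    Negative results mean the request was slower through Cloudflare.
    This formula is an assumption inferred from the report's numbers.
    """
    return (origin - cloudflare) / origin * 100


# TTFB rows copied from the table above: (CloudFlare, Origin).
ttfb = {
    "Minimum": (6, 7),
    "Maximum": (35077, 689),
    "Average": (6456.0, 54.0),
}

for label, (cf, origin) in ttfb.items():
    print(f"{label}: {percent_difference(cf, origin):.2f}%")
```

Run against the three TTFB rows, this reproduces the 14.29%, -4991.00%, and -11855.56% shown in the table. The rows in the first table can differ in the last decimal place, presumably because the tool computed them from unrounded values.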

Direct to Origin Tests

Through Cloudflare 2x delay

The damage done was to my time. I had wasted a day I could have been coding, lost revenue, and almost (had the timing been a bit worse) had even bigger problems.

Cloudflare’s Impact on Google Bots

Another graph is now available, showing the damage done to our click-through rate at Google:

Impact on Google Search

Who knows the truth? Maybe some other Cloudflare site was under DDoS attack, or they filled one of their internal pipes a little too close to capacity (they brag about filling ‘pipes’ [fiber] to capacity). Then again, the site was equally inaccessible worldwide as best we could tell; we ran simultaneous tests in the UK, USA, and China.

In any event, I found disabling Cloudflare and switching back to a simple image-based CDN improved site performance. If/when we are under attack, I’ll enable a similar service, like Sucuri, who I’ve also used and liked. But until then, there’s just no reason to. All those optimizations they sell you on just do not make up for the performance penalties, AND there is the critical issue of privacy. Others have argued against Cloudflare on ideological grounds because it decrypts data in the middle without the user ever knowing, creating many possible points of interception throughout the world. Some web sites don’t even encrypt on the ‘other side’ (Cloudflare to origin server), as it’s cheaper. They call this ‘Flexible’ HTTPS, and it’s definitely very problematic, which is why we never used it.

Naturally, in downgrading (to the $20 a month plan), I got no credit for time remaining and got charged again <sigh>. I’ve now downgraded to the Free Plan and asked for a refund because we’ve all validated Cloudflare will not work with my site. Update: They did refund the last month’s payment.

And to be both honest and unbiased, here is the additional server load now that our server has taken over Cloudflare’s responsibilities. It is unfortunate that Cloudflare was untenable for us. Any issue allowed to persist for more than a day is unacceptable for any paid web service provider! I mean, this is my family’s livelihood!

This is a dual CPU (SMP) server, hence the two graphs. They are not duplicates, if you look closely, lol. Because computing resources are distributed, both processors end up pretty close in stats. NOTE: THIS IS NOT ACCURATE, as the period before we disabled Cloudflare had almost no traffic getting through. A longer range picture shows less of a difference, but still a remarkable savings in server resources. That’s why we were drawn there; their potential is so tempting, so sweet. It is almost like a Honeypot.

Server Load after Disabling Cloudflare

So, we say goodbye, at least for now. But what I need to know is: how is the site experience for you? Yes, I know it will never compete with generation 1 of the site, but times change, guys. We are comparing it to last week, not 2 years ago. How is it working for YOU? Please comment below.
