When the policeman becomes the criminal – how Cloudflare attacks my machines. 

On the Internet you are nobody until someone attacks you.

It gets even more interesting when the attack comes from someone with practically unlimited resources and when these are the same people that are supposed to protect you.

This article is the story of how Cloudflare started an “attack” on a machine at the FLLCasts platform. This increased the traffic of the machine about 10x and AWS started charging the account 10x more. I managed to stop it, and I hope my experience is useful for all CTOs, sysadmins, devops and others who would like to understand more and look out for such cases.

TL;DR

The current, up-to-date status is: after all the investigation it turns out that when a client makes a HEAD request for a file, the request hits the Cloudflare infrastructure. Cloudflare then sends a GET request to the account machine and caches the file. This behavior changed on 28 August. Before 28 August, when clients sent HEAD requests, Cloudflare forwarded HEAD requests (which don’t generate traffic). After 28 August clients are still sending HEAD requests, but Cloudflare now sends GET requests, generating terabytes of additional traffic that is not needed.

Increase of the Bill

On 28 August 2021 I got a notification from AWS that the account was close to surpassing its budget for the month. That is not surprising at the end of the month, but nevertheless I decided to check. It turned out that the traffic to one of the machines had increased 10x in a day. Nothing else had increased. No visits, no other resources, just the traffic to this one particular machine.
That was strange. By now it had been going on for 7 days, and this is the increase in traffic.

AWS increase of the bill

Limit billing on AWS

My first thought was: “How can I set a global limit on AWS spending for this account? I don’t want to wake up with $50K in traffic charges the next day.”

The answer is “You can’t”. There is no way to set a global spending limit for an AWS account. This was something I already knew, but I decided to check again with support and yes, you can’t set such a limit. This means that AWS provides all the tools for a third party to bankrupt you, and it is not willing to limit that.
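The closest thing AWS offers is a budget alert, which is how I got the notification in the first place. A minimal sketch of setting one up with the AWS CLI, assuming a hypothetical account id and e-mail address:

# Alert at 80% of a $100 monthly cost budget. It only notifies, nothing gets stopped.
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName":"monthly-cost","BudgetLimit":{"Amount":"100","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"ops@example.com"}]}]'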

Limit billing on Digital Ocean

I have some machines on Digital Ocean and I checked there: “Can I set a global spending limit for my account, so that I am no longer charged and all my services stop if my spending goes above X amount of dollars?”
The answer was again: “No. Digital Ocean does not provide it”.

Should there be a global limit on spending on cloud providers?

My understanding is: yes. There is a break-even point where users are coming to your service and generating revenue, while you are delivering the service and this is costing you money. Once it costs you more to deliver the service than the revenue the service generates, I would personally prefer to stop the service. There is no need for it to be running. Otherwise you could wake up with a $50K bill.

AWS monitoring

I had the bill from AWS, so I looked at the monitoring.
There is a spike every day between 03:00 AM UTC and 05:00 AM UTC. This spike increases the traffic by hundreds of gigabytes. It could easily be terabytes next time.
The conclusion is that the machine is heavily loaded during this time.

AWS monitoring
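The same numbers can be pulled from the command line. A sketch with the AWS CLI, assuming a hypothetical instance id; the sums are reported in bytes per hour:

# Hourly outbound traffic of the instance for the week of the spike
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2021-08-28T00:00:00Z \
  --end-time 2021-09-04T00:00:00Z \
  --period 3600 \
  --statistics Sum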

Nginx access.log

Looking at the access log, I see a lot of requests from machines using a user agent called ‘curl’. ‘curl’ is a popular tool for accessing files over HTTP and is heavily used by different bots. But bots tend to identify themselves.

This is what the access.log looks like:

172.68.65.227 - - [30/Aug/2021:03:26:02 +0000] "GET /f9a13214d1d16a7fb2ebc0dce9ee496e/file1.webm HTTP/1.1" 200 27755976 "-" "curl/7.58.0"

Parsing the log file

I have years of bash experience, and a couple of commands later I had a list of all the IPs and how many requests we’ve received from each of them.

grep curl access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -n

The result is 547 machines. The full list is available at – Full list of Cloudflare IPs attacking my machine. The top 20 are below (a few of the IPs are not from Cloudflare). The first column is the number of requests, the second is the IP of the machine.

NumberOfRequests IP
    113 172.69.63.18
    117 172.68.65.107
    150 172.70.42.135
    158 172.70.42.161
    164 172.69.63.82
    167 172.70.42.129
    169 172.69.63.116
    170 172.68.65.231
    173 172.68.65.101
    178 172.69.63.16
    178 172.70.42.143
    188 173.245.54.236
    264 172.70.134.69
    268 172.70.134.117
    269 172.70.134.45
    287 172.70.134.153
    844 172.70.34.131
    866 172.70.34.19
    904 172.70.34.61
    912 172.70.34.69

These are Cloudflare machines!

Looking at the machines making the requests, there are 547 different machines and most of them are Cloudflare machines – servers that Cloudflare itself appears to be running.
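Cloudflare publishes the IP ranges its servers use, so this is easy to verify. A sketch, assuming the IPs from the command above are saved in suspect-ips.txt and the grepcidr tool is installed:

# Download the published Cloudflare IPv4 ranges and print every suspect IP that falls inside them
curl -s https://www.cloudflare.com/ips-v4 > cloudflare-ranges.txt
grepcidr -f cloudflare-ranges.txt suspect-ips.txt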

How does Cloudflare work?

For this particular FLLCasts account with this particular machine, years ago I set up Cloudflare to sit in front of the machine and help protect the account from internet attacks.

The way Cloudflare works is that only Cloudflare knows the IP address of our machine. This is the promise Cloudflare makes. Because only they know the IP address of the machine, only they know which IP address a given domain points to. So when a user points their browser at “http://domainname”, the internet directs the request to Cloudflare, Cloudflare checks whether the request is OK, and only then forwards it to our machine. In the meantime Cloudflare tries to help businesses like the platform by caching the content. This means that when Cloudflare receives a request for a file, it checks its own infrastructure for a cached copy and sends a request to the account machine only if there is no cache.

In a nutshell Cloudflare maintains a cache for the content the platform is delivering.

Image is from Cloudflare support at https://support.cloudflare.com/hc/en-us/articles/205177068-How-does-Cloudflare-work-
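You can see the first half of this from the outside: resolving the domain returns only Cloudflare edge addresses, never the IP of the AWS machine. A sketch, with illustrative addresses rather than real output:

$ dig +short domain.com
104.21.x.x
172.67.x.x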

What is broken?

Cloudflare maintains a cache of the platform resources. Every night between 03:00 AM UTC and 05:00 AM UTC some 547 Cloudflare machines decide to update their cache and start sending requests to our server. These are 10x more requests than the machine generally receives from all users. The content on the server does not change. It has been the same content for years. Yet for the last 7 days Cloudflare has been re-caching the same content every night on 547 machines.

And AWS bills us for this.

Can Cloudflare help?

I created a ticket. The response was along the lines of “You are not subscribed for our support, you can get only community support”. Fine.
I called them on the phone early in the morning.
I called enterprise sales and I asked them.

Me - "Hi, I am under attack. Can you help?"
They - "Yes, we can help. Who is attacking you?"
Me - "Well, you are. Is there an enterprise package I could buy so that you can protect me against your attack?"

Luckily the guy on the phone caught my sense of humor and urgency and quickly organized a meeting with a product representative. Regrettably there were no solution engineers on this call.

Both guys were very knowledgeable, but I had difficulty explaining that it was actually Cloudflare causing the traffic increase. I had all the data from AWS and from the access.log files, but the support agents still had some difficulty accepting it.

To be clear – I don’t think that Cloudflare is maliciously causing this. There is no point. What I think has happened is some misconfiguration on their side that caused this for the last 7 days.

What do I think has happened?

I tried to explain to the support agents that there are four possible scenarios, three of which Cloudflare would be responsible for.

1. Option 1 – “someone with 547 machines is trying to attack the FLLCasts account and Cloudflare is failing to stop it”. First, this is very unlikely. Nobody will invest in starting 547 machines just to make the platform pay a few dollars more this month. And even if this were the case, this is exactly what Cloudflare should prevent, right? Option 1: “Cloudflare is failing in preventing attacks” (unlikely)


2. Option 2 – “only Cloudflare knows the IP behind this domain name and they have been compromised”. The connection between domain name and IP address is something that only Cloudflare knows about. If a third party knows the domain name and is able to find the IP address, this means they have compromised Cloudflare. Option 2: “Cloudflare is compromised” (possible, but again, unlikely)

3. Option 3 – “there is a misconfiguration in some of the Cloudflare servers”. I don’t like looking for malicious activity where everything could be explained with simple ignorance or a mistake. Most likely there is a misconfiguration in the Cloudflare infrastructure that is causing these servers to behave in this way. Option 3: “There is a misconfiguration in Cloudflare infrastructure”

4. Option 4 – “there is a mistake on our end”. As there is basically nothing on our end, and this nothing has not changed in years, the possibility for this to be the case is minimal.

On a support call we agreed on a plan with the support agents to investigate: I would change the public IP of the AWS machine and reconfigure it in Cloudflare. In this way we hoped to stop some of the requests. We had no plan for what to do after that.

Can I block it at the Nginx level?

Nginx is an HTTP server, serving files. There are a couple of options to explore there, but the most reasonable one was to stop all curl requests to the Nginx server. This was the shortest path. There was no need to protect against other attacks, only against the Cloudflare attack, and the Cloudflare attack was using “curl” as a tool. So I decided to stop ‘curl’:

  # Surely not the best solution, but the simplest, and it will get the job done for now.
  # Goes inside the server (or location) block that serves the files.
  if ($http_user_agent ~ 'curl') {
      return 444; # 444 is a non-standard Nginx code that drops the connection without sending a response.
  }
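A quick way to check the rule after reloading Nginx: curl identifies itself as “curl/<version>” by default, so a plain request should no longer get the file back (domain and file name are placeholders; if the request goes through Cloudflare, you may see a cached response or a Cloudflare error rather than a dropped connection):

curl -I https://domain.com/file1.webm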

Resolution

I am now waiting to see whether the change of the public IP of the AWS machine will have any impact, and if not, I am simply rejecting all “curl” requests, which seem to be what Cloudflare is using.

Update 1

The first solution we decided to implement was to:

Change the public IP of the AWS machine and update it in the DNS settings at Cloudflare. In this way we would make sure that only Cloudflare really knows this IP.

Resolution is – It did not work!

I knew it wouldn’t, because it was just another way for support to get me to do something without really looking into the issue, but I went along with it. Better to exhaust this option and be sure.

The traffic of a Cloudflare-attacked machine. Changing the IP address on 3 September had no effect.

Update 2

Adding the CF-Connecting-IP header

Cloudflare support was really helpful. They asked me to include CF-Connecting-IP in the logs. In this way we would know the real IP that is making the requests and whether these are in fact Cloudflare machines.

The header is described at https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-

I went ahead and updated the Nginx configuration:

log_format  cloudflare_debug     '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" "$http_cf_connecting_ip"';

access_log /var/log/nginx/access.log cloudflare_debug;
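Then test the configuration and reload Nginx so the new log format takes effect (assuming a systemd-based setup):

sudo nginx -t && sudo systemctl reload nginx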

Now the log file contained the original IP.

Cloudflare is making a GET request when a client makes a HEAD request

This is what I found out. The platform has a daily job that checks the machine and makes sure the files are OK. This integrity check was left there from times when we had to do it, years ago. It is still running, starting every night and checking the machine with HEAD requests. But on 28 August 2021 Cloudflare started making GET requests for them, and this increases the traffic to the machine.
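For context, the nightly check boils down to something like the sketch below. The real job is older and differs in the details; urls.txt is a hypothetical list of the file URLs:

# Send a HEAD request for every file and report anything that does not answer with 200
while read -r url; do
  code=$(curl -o /dev/null -s -w '%{http_code}' -I "$url")
  [ "$code" = "200" ] || echo "BROKEN: $url returned $code"
done < urls.txt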

Steps to reproduce

Here are the steps to reproduce:

1. I am sending a HEAD request with ‘curl -I’

2. Cloudflare has not cached the file so there is “cf-cache-status: MISS”

3. Cloudflare sends a GET request and gets the whole file

4. Cloudflare responds to the HEAD request.

5. I send a HEAD request again with ‘curl -I’

6. Cloudflare has the file cached and there is a “cf-cache-status: HIT”

7. The account server is not hit.

The problem here is that I am sending a HEAD request for my file and Cloudflare is sending a GET request for the whole file in order to cache it.

Commands to reproduce

This is a HEAD request:

$ curl -I https://domain.com/file1.webm
HTTP/2 200
date: Sat, 04 Sep 2021 07:09:11 GMT
content-type: video/webm
content-length: 2256504
last-modified: Sat, 04 Jan 2014 14:24:01 GMT
etag: "52c81981-226e78"
cache-control: max-age=14400
cf-cache-status: MISS
accept-ranges: bytes
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=Xg9TLgssa5Gm6j1fRlJZH8VahaoY21LdCE1W1JqVueu49mzdiTmh9MZp4pFZDsVeSmRg%2Bc%2FMryoN7tgmKUmdxhWzE7UZdVvgG%2FRxHSZ%2FYS6pDtxLwpXSD71jo5ADNyT4TSpKXtE%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 689564111e594ee0-FRA
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

This is the log right after the HEAD request. Note that I am sending a HEAD request to domain.com and Cloudflare is sending a GET request for the file:

162.158.94.236 - - [04/Sep/2021:07:09:12 +0000] "GET /file1.webm HTTP/1.1" 200 2256504 "-" "curl/7.68.0" "188.254.161.195" "188.254.161.195"

Then I send a second HEAD request:

$ curl -I https://domain.com/file1.webm
HTTP/2 200
date: Sat, 04 Sep 2021 07:09:53 GMT
content-type: video/webm
content-length: 2256504
last-modified: Sat, 04 Jan 2014 14:24:01 GMT
etag: "52c81981-226e78"
cache-control: max-age=14400
cf-cache-status: HIT
age: 42
accept-ranges: bytes
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=CKSvpGGHoj5LfV6xXpPUK5kHJtdsX3fylgt%2F2%2B6G94oUsdAd8FnHmUgEUIgnj5dd2Vvsv%2BKQxxgsHdHA0RvpjTxATakFKFuirMeI%2FS3lAdDX5VA0tY74z0CRYEHM2rS%2Fld6K738%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 689565175dffc29f-FRA
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, h3=":443"; ma=86400

And then there is NOTHING in the log file.

Note that for the last HEAD request there is a “cf-cache-status: HIT”.

Status and how it could be resolved?

Yes, we send HEAD requests to the files every day in order to check that they are all working – one HEAD request per file, to make sure all files are up to date. This has been going on for years and is a leftover of an integrity check we implemented in 2015.

What changed on 28 August 2021 is that when Cloudflare receives a HEAD request for a file, it now sends a GET request to our machine in order to cache the file. This is what changed and this is what is generating all the traffic.

We send the HEAD requests with ‘curl -I’.

I have 30 weeks of log files showing that, until then, Cloudflare was forwarding the HEAD requests as HEAD requests.

I have asked Cloudflare

Could you please roll back this change in the infrastructure and not send a GET request to our machine when you receive a HEAD request from a client?

Let’s see how this will be resolved.

Up-to-date conclusion

Check your machines from time to time. I hope you don’t get in this situation.

Want to keep in touch? – find me on LinkedIn or Twitter