Memcached's weak default configuration may be abused for DDoS attacks!

It was Thursday, March 1st 2018, 09:30 AM. Someone from the data center called us (the on-call guys in our IT operations team) and told us about a massive transfer from one of our servers to several destinations. One of the destinations was nmap!

I, or better say we, started to investigate whether they were right. First of all, I checked the VM's important graphs on the Zabbix dashboard and on Grafana. The important ones looked like these:

Massive network traffic caused by Memcached.
Massive CPU load caused by Memcached.


There is no default trigger on interface network traffic in Zabbix, and we hadn't set such a trigger on that item. The processor-load trigger wasn't fired by this Memcached problem either, because its default threshold was set to >5, while the server's processor load rarely even reached 4.
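In hindsight, a simple trigger on outgoing interface traffic would have paged us much earlier. A minimal sketch in classic (pre-4.0) Zabbix trigger syntax, assuming an eth0 interface, HOSTNAME_OF_THE_VM as a placeholder, and a purely illustrative threshold:

{HOSTNAME_OF_THE_VM:net.if.out[eth0].avg(5m)}>50M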


I asked my colleagues if they had transferred anything from this VM. They all denied it!

While investigating this problem and looking at the Zabbix dashboard, I realized that the Zabbix agent was unreachable on another server and that one of our services had stopped, and, strangely enough, none of our monitoring guys had called us about it.

I saw:

  1. At that moment, the problem was gone!
  2. The data center had limited our network ports to 100 Mbps, so the abuse would not seriously affect anyone else anymore.
  3. Our usual transfer rate doesn't need a gigabit port, and 100 Mbps is enough.
  4. Two other teammates of mine were working on this problem.
  5. No one else had noticed the other problems, and
  6. Although the first problem didn't affect the business directly, these ones would.

So, I informed my teammates and started investigating these problems. After enough investigation, I realized that a glitch during a VM backup of one of our database servers had locked that specific VM, which made the Zabbix agent unreachable on it and piled up too many processes on the application server, because the application kept retrying to connect to its DB server. These are the sources from which I realized this problem could affect our business directly:

A disk I/O overload, then the Zabbix agent unreachable on the DB, and a backup glitch on the hypervisor.
We are behind compared to the same time yesterday.
A backup glitch on the DB.


Let’s get back to the main problem! Well, it took some time to find its source. As we were blind on this problem, we had to investigate more to pin it down. Since it shows itself as massive network transfer and CPU load, we watched iftop and htop to see if it happened again and which process caused it. Fortunately, we caught it!

iftop shows anomalous transfer on the VM’s public network port.
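For reference, the watching boiled down to something like this; the interface name eth0 is an assumption here:

iftop -i eth0 -P   # live per-connection traffic on the public interface, with port numbers shown
htop               # interactive process viewer, to see which process eats the CPU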

With the ss tool, which is the next generation of the obsolete netstat tool, we can find out which application and process is listening on port 11211:

ss shows that memcached is listening on port 11211!
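For the record, the check was roughly this (any equivalent combination of ss flags works):

ss -tulpn | grep 11211   # -t/-u TCP and UDP, -l listening sockets, -p owning process, -n numeric output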


And a quick Google search showed us that it’s a problem related to Memcached! Thanks to this post on Cloudflare’s engineering blog, we understood it’s a kind of amplification vulnerability in Memcached and that we have to limit its listening IP address.
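If you want to verify that your own Memcached answers strangers over UDP (the amplification vector the Cloudflare post describes), a quick test looks something like this; the first 8 bytes are the Memcached UDP frame header, and PUBLIC_IP_OF_THE_SERVER is a placeholder:

echo -en "\x00\x00\x00\x00\x00\x01\x00\x00stats\r\n" | nc -u -w1 PUBLIC_IP_OF_THE_SERVER 11211   # ask for stats over UDP

If you get a wall of STAT lines back, the instance can be abused as an amplifier.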

Unfortunately, limiting Memcached’s listening IP address might cause unexpected behavior in the application, but we could still solve the problem easily! There are probably several solutions, but we preferred iptables:

iptables -A INPUT -m tcp -p tcp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP
iptables -A INPUT -m udp -p udp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP

With this solution, we limited access to Memcached to just our LAN and loopback interfaces. Phew, that’s it!
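If the application only talks to Memcached over loopback or the LAN, another option is to fix Memcached’s own configuration instead of (or on top of) the firewall rules. A sketch, assuming a Debian/Ubuntu-style /etc/memcached.conf; adjust the address to your LAN interface if needed:

-l 127.0.0.1   # listen on loopback only instead of all interfaces
-U 0           # disable the UDP listener entirely

After a systemctl restart memcached, re-check with ss that nothing is left listening on 11211 on the public IP.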

The strange thing for me is that neither CVE nor Netcraft has addressed this problem yet. You may say CVE only tracks vulnerabilities and this one is not a vulnerability, but I believe they should record this issue as an entry for Memcached.

All credit for the research on this problem goes to this post on Cloudflare’s engineering blog!