It was Thursday, March 1st 2018, 09:30 AM. Someone from the data center called us (the on-call guys on our IT operations team) and told us about a massive transfer from one of our servers to several destinations. One of the destinations was nmap!
I, or better said, we, started to investigate whether they were right. First of all, I checked the VM’s important graphs on the Zabbix dashboard and on Grafana. The important ones looked like these:
There is no default trigger on network traffic on interfaces, and we hadn’t set such a trigger on that item either. The processor load trigger also wasn’t fired by this Memcached problem, because the default trigger value was set to >5, while the processor load of the server rarely reached 4.
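For reference, triggers that would have caught this earlier could look like the following sketch, in the classic (pre-4.0) Zabbix trigger syntax. The host name, interface name, and thresholds here are hypothetical, not taken from our setup:

```
# Hypothetical host "app-vm-01", interface "eth0":
# fire when outgoing traffic on eth0 averages above 100 Mbps for 5 minutes
{app-vm-01:net.if.out[eth0].avg(5m)}>100M

# fire on sustained processor load above 4 instead of the default 5
{app-vm-01:system.cpu.load[percpu,avg1].avg(5m)}>4
```

With a lowered load threshold and any traffic trigger at all, this incident would have paged us instead of the data center.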
I asked my colleagues if they had transferred anything from this VM. They all denied it!
While I was investigating this problem and looking at the Zabbix dashboard, I realized that the Zabbix agent was not reachable on another server, and also that one of our services had stopped. Strangely enough, none of our monitoring guys had called us about it.
- At that moment, the problem was gone!
- The data center had limited our network ports to 100 Mbps, so the abuse would no longer seriously affect anyone else.
- Our usual transfer rate does not need a gigabit port; 100 Mbps is enough.
- Two other teammates of mine were already working on this problem.
- No one saw other problems, and
- Although the first problem did not affect the business directly, these new ones would.
So, I informed my teammates and started investigating these problems. After enough digging, I realized that a glitch during a VM backup on one of our database servers had locked that specific VM; this made the Zabbix agent on it unreachable and caused too many processes on the application server, because it kept retrying to get a connection to its DB server. These are the sources from which I realized this problem could affect our business directly:
Let’s get back to the main problem! Well, it took some time to find its source. As we were blind on this one, we had to investigate further. Since the problem showed itself as massive network transfer and CPU load, we watched iftop and htop to see if it happened again and which process caused it. Fortunately, we caught it!
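The watching itself is nothing fancy. A sketch of the kind of commands we sat on (the interface name `eth0` is an assumption; iftop and htop are interactive, so they are for eyeballing, not scripting):

```shell
# Which connections are moving the most data right now, with port numbers shown?
iftop -i eth0 -P

# Which local process owns sockets on memcached's default port, 11211?
ss -tunap | grep 11211

# One batch snapshot of the busiest processes (htop is the interactive equivalent)
top -b -n 1 | head -20
```

Once the burst reappeared, iftop pointed at the remote endpoints and `ss -tunap` tied port 11211 back to the memcached process.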
A quick Google search showed us that it was a problem related to Memcached! Thanks to this post on Cloudflare’s engineering blog, we understood it is a sort of vulnerability in Memcached and that we had to limit its listening IP address.
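The vector is easy to probe yourself: Memcached’s UDP protocol prepends an 8-byte frame header to each request, and a tiny unauthenticated `stats` query can produce a response thousands of times larger, which is exactly what makes it attractive for reflection. A sketch, assuming bash and netcat are available and `SERVER_IP` stands in for the address you want to test:

```shell
# 8-byte memcached UDP frame header (request id 1, sequence 0, datagram count 1,
# reserved), followed by the ASCII "stats" command. Any reply means UDP port
# 11211 answers unauthenticated queries from this address and can be abused.
printf '\x00\x01\x00\x00\x00\x01\x00\x00stats\r\n' | nc -u -w1 SERVER_IP 11211
```

If this prints a wall of statistics back at you from the public address, your server is part of the problem.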
Unfortunately, limiting the listening IP address of Memcached might cause unexpected behavior in the application, but still, we could solve the problem easily! There are probably several solutions, but we preferred iptables:
iptables -A INPUT -m tcp -p tcp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP
iptables -A INPUT -m udp -p udp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP
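Besides dropping the packets with iptables, Memcached itself can be told not to listen publicly. A sketch of the relevant lines in Debian/Ubuntu’s /etc/memcached.conf, assuming the application and Memcached share a host (on other distributions the same flags go into the daemon’s OPTIONS):

```
# Listen on loopback only
-l 127.0.0.1

# Disable the UDP listener entirely (the amplification vector);
# available in reasonably recent Memcached versions
-U 0
```

After editing, restart the service (e.g. `systemctl restart memcached`). This fixes the root cause rather than just filtering its symptoms.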
The strange thing for me is that neither CVE nor Netcraft has addressed this problem yet. You may say CVE only addresses vulnerabilities and this one is not a vulnerability, but I believe they should record this issue as an entry for Memcached.
All research about this problem leads back to that post on Cloudflare’s engineering blog!