Memcached's weak default configuration may be abused for DDoS attacks!

It was Thursday, March 1st 2018, 09:30 AM. Someone from the data center called us (the on-call guys in our IT operations team) and told us about a massive transfer from one of our servers to several destinations. One of the destinations was nmap!

I, or better say we, started to investigate whether they were right. First of all, I checked that VM's graphs on the Zabbix dashboard and on Grafana. The important ones looked like these:

Massive network traffic caused by Memcached.
Massive CPU load caused by Memcached.

 

There is no default trigger on network interface traffic and we hadn't set such a trigger on that item. Also, the processor load trigger did not fire because of this Memcached problem, since its default threshold is >5, while the server's processor load rarely exceeded 4.

 

I asked my colleagues if they had transferred anything from this VM. They all denied it!

While I was investigating this problem and looking at the Zabbix dashboard, I realized that the Zabbix agent was not accessible on another server, and one of our services had stopped as well. Strangely enough, none of our monitoring guys had called us about this.

I saw:

  1. At that moment, the problem was gone!
  2. The data center had limited our network port to 100 Mbps, so the abuse would not seriously affect anyone else anymore.
  3. Our usual transfer rate does not need a gigabit port, so 100 Mbps is enough.
  4. Two of my other teammates were already working on this problem.
  5. No one had seen any other problems, and
  6. Although the first problem did not affect the business directly, these new ones would.

So, I informed my teammates and started investigating these problems. After enough digging, I realized that a glitch during the VM backup of one of our databases had locked that specific VM, which made the Zabbix agent unreachable on it and spawned too many processes on the application server, because it kept retrying to connect to its DB server. These are the sources from which I realized this problem could affect our business directly:

A disk I/O overload, then the Zabbix agent unreachable on the DB, and a backup glitch on the hypervisor.
We were behind compared to the same time yesterday.
Backup glitch on DB.

 

Let's get back to the main problem! Well, it took some time to find its source. As we were blind on this one, we had to investigate more before we could address it. Since the problem showed itself as massive network transfer and CPU load, we watched iftop and htop to see if it happened again and which process caused it. Fortunately, we caught it!

iftop shows anomalous transfer on the VM's public network interface.

With the ss tool, the successor of the obsolete netstat tool, we can find out which application and process is listening on port 11211:

ss shows that memcached is listening on port 11211!
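For reference, a command along these lines shows the listener together with the owning process (these particular flags are my choice here, not necessarily exactly what appears in the screenshot):

# -t/-u: TCP/UDP sockets, -l: listening only, -n: numeric output, -p: show the owning process
ss -tulnp | grep 11211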

 

And a Google search showed us that it's a problem related to Memcached! Thanks to this post on Cloudflare's engineering blog, we understood it's a sort of vulnerability in Memcached and that we have to limit its listening IP address.
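As a side note, a quick way to check whether an instance answers "stats" over UDP from outside is to wrap the request in Memcached's 8-byte UDP frame header and send it with netcat (run from another host; 203.0.113.10 below is just a placeholder for the server's public IP):

# 8-byte UDP frame header: request id 0, sequence 0, datagram count 1, reserved 0
echo -en "\x00\x00\x00\x00\x00\x01\x00\x00stats\r\n" | nc -u -w1 203.0.113.10 11211

If it prints STAT lines back, the port is reachable over UDP and can be abused for amplification.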

Unfortunately, limiting the listening IP address of Memcached might cause unexpected behavior in the application, but still, we could solve the problem easily! There are probably several solutions, but we preferred iptables:

iptables -A INPUT -m tcp -p tcp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP
iptables -A INPUT -m udp -p udp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP
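Note that rules added this way do not survive a reboot; on a Debian-style system with the iptables-persistent package installed (an assumption about the setup), they can be saved like this:

# persists the current ruleset so iptables-persistent restores it at boot
iptables-save > /etc/iptables/rules.v4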

With this solution, we limited access to Memcached to just our LAN and loopback interfaces. Phew, that's it!

The strange thing for me is that neither CVE nor Netcraft has addressed this problem yet. You may say CVE only addresses vulnerabilities and this one is not a vulnerability, but I believe they should record this issue as an entry for Memcached.

All the research about this problem traces back to this post on Cloudflare's engineering blog!

Resolve recurring problems automatically using Zabbix!

Situation

We use Proxmox as a hypervisor, along with other technologies like VMware ESXi, for virtualization in our company. About a year ago, we purchased a new server, an HP ProLiant DL380p G8. I installed the latest version of Proxmox (it was proxmox-ve-4.2 in those days) on it.

Problem

After a few months, we had several containers and virtual machines on this server, but unfortunately, problems started a few months ago. First, Zabbix complained about disk I/O overload on the physical server and some of its virtual machines. Then we saw that none of the VMs and containers responded to requests!

All containers and VMs grayed out when we faced the memory allocation problem on Proxmox.

None of the containers and virtual machines were available via SSH. So, we checked the hypervisor's logs (thank God, SSH and the web panel were still accessible on the hypervisor) and guess what we found in its logs?

A weird message that I had never seen before. It was repeating every 2 seconds. It was something like this:

Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:58 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:56 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:54 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:53 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:51 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:49 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:47 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:45 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:43 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:41 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:39 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:37 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:35 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:33 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:31 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:29 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:27 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:25 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:23 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:21 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:19 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)

It seemed this was what caused that huge problem with the inaccessibility of all containers and virtual machines on this physical server. After Googling it, we realized that this version of Proxmox (or rather, that version of the Debian Linux kernel) has this problem with XFS filesystem memory management. The solution was quite easy: you just need to drop the memory caches. In order to do that, invoke this command as root on Proxmox:

/bin/echo 3 > /proc/sys/vm/drop_caches
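For context, writing 3 drops both the page cache and the reclaimable dentry/inode caches (1 and 2 drop them separately); the same thing can be done through sysctl:

# equivalent form: 1 = page cache, 2 = dentries and inodes, 3 = both
sysctl vm.drop_caches=3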

We thought this bug would be solved in the next update of Proxmox and that we would not face the problem again.

After a couple of weeks, we faced a new problem on this physical server running Proxmox 4.2: high usage of swap space. High swap usage on the hypervisor caused a lack of free swap space in the containers and VMs running on this server.

Usually, the Linux kernel will handle this problem too, but unfortunately, the version of the Linux kernel used in Proxmox 4.2 could not solve it even after a few hours. So, we had to solve it the manual Linux sysadmin way! The solution wasn't too hard or too weird. We just needed to do a swapoff and swapon:

/sbin/swapoff -a && /sbin/swapon -a
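One precaution of our own worth mentioning (not part of the fix itself): swapoff has to pull everything currently in swap back into RAM, so it is safer to confirm there is enough free memory first:

free -m    # available memory should comfortably exceed the swap currently in use before running swapoff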

Just like the XFS memory allocation problem, we never thought we would see this one again, but as you can probably guess, we ended up facing not one but both of these problems every couple of weeks.

Needless to say, the Proxmox installer chose the filesystem and the amount of swap space automatically.

Permanent solution

I know, I know, the permanent solution is to update Proxmox, which probably includes updating the Linux kernel. Unfortunately, there is no Linux kernel update in the Proxmox update list. Re-installing Proxmox, or installing a newer version of it, might solve the problem, but since we do not want any downtime and we do not have enough resources to move all containers and virtual machines to another server and re-install Proxmox, we needed to choose another solution.

I had prior experience with the action feature of Zabbix from when I was working for TENA. So, I chose to use Zabbix actions to solve these problems.

I defined two new actions on Zabbix, one per problem. Defining a new action is not hard. I added two operations for each problem: one for informing the sysadmin guys and one for the actual command execution. I usually also enable recovery operations, which inform the sysadmin guys that the earlier-mentioned problem has been solved successfully.
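As a rough sketch of the command-execution side (assuming a Zabbix 3.x-era setup where the remote command is executed by the Zabbix agent on the Proxmox host), the agent has to allow remote commands, and the two actions simply run the fixes shown earlier:

# in /etc/zabbix/zabbix_agentd.conf on the hypervisor (assumption: commands run through the agent)
EnableRemoteCommands=1
# operation command for the XFS memory allocation deadlock action
/bin/echo 3 > /proc/sys/vm/drop_caches
# operation command for the swap exhaustion action
/sbin/swapoff -a && /sbin/swapon -a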

Thanks to Zabbix, we no longer need to make any effort to solve such recurring problems manually.