Resolve recurring problems automatically using Zabbix!

Situation

We use Proxmox as a hypervisor in our company, alongside other virtualization technologies like VMware ESXi. About a year ago, we purchased a new server, an HP ProLiant DL380p G8, and I installed the latest version of Proxmox on it (proxmox-ve-4.2 at the time).

Problem

After a few months, we had several containers and virtual machines running on this server, but unfortunately, problems started a few months ago. First, Zabbix complained about disk I/O overload on the physical server and some of its virtual machines. Then we noticed that none of the VMs and containers were responding to requests!

All containers and VMs grayed out when we hit the memory allocation problem on Proxmox.

None of the containers or virtual machines were reachable via SSH. So, we checked the hypervisor logs (thankfully, SSH and the web panel were still accessible on the hypervisor), and guess what we found there?

A weird message I had never seen before, repeating every two seconds. It looked like this:

Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:58 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:56 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:54 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:53 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:51 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:49 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:47 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:45 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:43 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:41 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:39 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:37 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:35 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:33 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:31 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:29 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:27 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:25 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:23 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:21 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:19 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)

It seemed this was what made every container and virtual machine on this physical server inaccessible. After Googling the problem, we realized that this version of Proxmox (or rather, the Debian Linux kernel it ships) has a problem with XFS filesystem memory management. The workaround was quite easy: you just need to drop the memory caches. To do that, run this command as root on the Proxmox host:

/bin/echo 3 > /proc/sys/vm/drop_caches
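
As a side note, writing 3 to drop_caches frees the page cache as well as dentries and inodes. Below is a minimal sketch of how one could verify the symptom and the effect of the command; the log path is an assumption and may be /var/log/syslog or /var/log/kern.log depending on the setup:

# count how many deadlock messages have hit the log so far (path is an assumption)
grep -c "possible memory allocation deadlock" /var/log/syslog

# flush dirty pages first, then drop page cache, dentries and inodes
sync
/bin/echo 3 > /proc/sys/vm/drop_caches

# compare free memory before and after the drop
free -m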

We thought this bug would be solved in the next Proxmox update and that we would not face the problem again.

After a couple of weeks, we faced a new problem on this physical server running Proxmox 4.2: high swap usage. High swap usage on the hypervisor left the containers and VMs running on this server without free swap space.

Usually, the Linux kernel handles this situation on its own, but unfortunately, the kernel version used by Proxmox 4.2 had not resolved it even after a few hours. So, we had to solve it the manual Linux sysadmin way! The fix wasn't hard or weird either; we just needed to do a swapoff followed by a swapon:

/sbin/swapoff -a && /sbin/swapon -a
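
Note that swapoff moves everything currently in swap back into RAM, so it is safer to check that there is enough available memory first. A minimal sketch of such a guard, assuming /proc/meminfo exposes MemAvailable (it does on the 4.x kernels shipped with Proxmox 4):

#!/bin/sh
# Only cycle swap if the swapped-out data fits into available RAM,
# otherwise swapoff could push the host into an OOM situation.
swap_used_kb=$(awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
if [ "$mem_avail_kb" -gt "$swap_used_kb" ]; then
    /sbin/swapoff -a && /sbin/swapon -a
else
    echo "Not enough free RAM to absorb swap, skipping" >&2
fi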

Just like the XFS memory allocation problem, we never expected to see this one again, but as you can probably guess, we ended up facing not one but both of these problems every couple of weeks.

Needless to say, the Proxmox installer had chosen the filesystem and the amount of swap space automatically.

Permanent solution

I know, I know, the permanent solution is to update Proxmox, which would probably include a Linux kernel update. Unfortunately, there is no kernel update in the Proxmox update list. Re-installing Proxmox or installing a newer version might solve the problem, but since we do not want any downtime and do not have enough resources to move all containers and virtual machines to another server and re-install Proxmox, we had to choose another solution.

I had prior experience with Zabbix's action feature from when I was working for TENA, so I chose to use Zabbix actions to solve these problems.

I defined two new actions in Zabbix, one for each problem. Defining a new action is not hard. I added two operations to every action: one to notify the sysadmin team and one to execute the actual remote command. I also usually enable recovery operations, which inform the sysadmin team that the problem mentioned earlier has been resolved successfully.
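
To give an idea of what feeds those actions, here is a rough sketch of the items, triggers and remote commands involved. The host name, log path and thresholds are illustrative assumptions, and the trigger syntax shown is the classic Zabbix 3.x style, not a copy of our production configuration:

# Detect the XFS deadlock message (active-agent log item)
Item:    log[/var/log/syslog,"possible memory allocation deadlock"]
Trigger: {prx3:log[/var/log/syslog,"possible memory allocation deadlock"].nodata(300)}=0

# Detect low free swap on the hypervisor
Item:    system.swap.size[,pfree]
Trigger: {prx3:system.swap.size[,pfree].last()}<10

# Action operations (one action per problem):
#   1. send a message to the sysadmin user group
#   2. run a remote command on the host, e.g.
#      /bin/echo 3 > /proc/sys/vm/drop_caches      (XFS deadlock action)
#      /sbin/swapoff -a && /sbin/swapon -a         (swap action)

Keep in mind that, on Zabbix agents of that era (2.x/3.x), remote commands only work when EnableRemoteCommands=1 is set in zabbix_agentd.conf, and that the agent runs commands as the zabbix user, so the commands above need sudo rights on the host.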

Thanks to Zabbix, we no longer have to put in any effort to solve such recurring problems manually.