Resolve recurring problems automatically using Zabbix!

Situation

We use Proxmox as a hypervisor in our company, alongside other virtualization technologies like VMware ESXi. About a year ago we purchased a new server, an HP ProLiant DL380p G8, and I installed the latest version of Proxmox on it (proxmox-ve-4.2 in those days).

Problem

After a few months we had several containers and virtual machines on this server, but unfortunately, problems started a few months ago. First, Zabbix complained about disk I/O overload on the physical server and some of its virtual machines. Then we saw that none of the VMs and containers were responding to requests!

All containers and VMs grayed out when we hit the memory allocation problem on Proxmox.

None of the containers and virtual machines were reachable via SSH. So we checked the hypervisor's logs (thank God, SSH and the web panel were still accessible on the hypervisor itself), and guess what we found there?

A message so weird that I had never seen it before, repeating every two seconds. It looked like this:

Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:58 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:56 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:54 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:53 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)

It seemed this was the cause of the huge problem: the inaccessibility of all containers and virtual machines on this physical server. After googling the message, we realized that this version of Proxmox (or rather, that version of the Debian Linux kernel) has a known problem with XFS filesystem memory management. The fix was quite easy: you just need to drop the memory caches (writing 3 to drop_caches frees the page cache plus dentries and inodes). To do that, run this command as root on the Proxmox host:

/bin/echo 3 > /proc/sys/vm/drop_caches

We thought this bug would be solved in the next Proxmox update and that we would not face the problem again.

After a couple of weeks, we faced a new problem on this physical server running Proxmox 4.2: high swap usage. High swap usage on the hypervisor left no free swap space for the containers and VMs running on it.

Usually the Linux kernel handles this problem on its own, but unfortunately the kernel used in Proxmox 4.2 could not resolve it even after a few hours. So we had to solve it the manual Linux-sysadmin way! The solution wasn't too hard or too weird either; we just needed a swapoff followed by a swapon:

/sbin/swapoff -a && /sbin/swapon -a
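One caveat worth noting before automating this: `swapoff -a` has to pull every swapped-out page back into RAM, so it can hang or thrash the host when there is not enough free memory to absorb it. A minimal guard could look like the sketch below, assuming a Linux `/proc/meminfo`; the helper names are my own invention, not part of any tool:

```shell
#!/bin/sh
# Sketch: only cycle swap when MemAvailable can absorb the swap in use.
# All values from /proc/meminfo are in kB.

swap_used_kb() {
    # SwapTotal - SwapFree, read from the meminfo-style file given as $1
    awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t-f}' "$1"
}

mem_available_kb() {
    awk '/^MemAvailable:/ {print $2}' "$1"
}

cycle_swap() {
    if [ "$(mem_available_kb /proc/meminfo)" -gt "$(swap_used_kb /proc/meminfo)" ]; then
        /sbin/swapoff -a && /sbin/swapon -a
    else
        echo "not enough free memory to drain swap safely" >&2
        return 1
    fi
}
```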

Just like the XFS memory allocation problem, we never expected to see this one again, but as you can probably guess, we ended up facing not one but both of these problems every couple of weeks.

Needless to say, the Proxmox installer had chosen the filesystem and the amount of swap space automatically.

Permanent solution

I know, I know, the permanent solution is to update Proxmox, which would presumably include a Linux kernel update. Unfortunately, there was no kernel update in the Proxmox update list. Reinstalling Proxmox, or installing a newer version, might solve the problem, but since we did not want any downtime and did not have enough resources to move all containers and virtual machines to another server and reinstall Proxmox, we had to choose another solution.

I had prior experience with the action feature of Zabbix from when I worked at TENA, so I chose to use Zabbix actions to solve these problems.

I defined two new actions in Zabbix, one per problem. Defining a new action is not hard. For each problem I added two operations: one to notify the sysadmin team, and one to actually execute the remediation command. I usually also enable recovery operations, which notify the sysadmins that the previously reported problem has been resolved.
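Since a Zabbix action's remote-command operation simply runs whatever command you give it on the target host, the remediation itself can live in one small script on the hypervisor. Here is a hedged sketch (the `fix_proxmox` name is my own invention, not part of Zabbix or Proxmox) combining both fixes behind a single entry point; the agent must allow remote commands and run them as root:

```shell
#!/bin/sh
# Hypothetical remediation helper for the two recurring Proxmox problems,
# meant to be invoked from a Zabbix action's remote-command operation.
fix_proxmox() {
    case "$1" in
        xfs)
            # Break the XFS kmem_alloc deadlock loop by dropping the
            # page cache plus dentries and inodes.
            /bin/echo 3 > /proc/sys/vm/drop_caches
            ;;
        swap)
            # Cycle swap so swapped-out pages move back into RAM.
            /sbin/swapoff -a && /sbin/swapon -a
            ;;
        *)
            echo "usage: fix_proxmox xfs|swap" >&2
            return 1
            ;;
    esac
}
```

The two actions would then call `fix_proxmox xfs` and `fix_proxmox swap` respectively.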

Thanks to Zabbix, we no longer need to make any effort to solve these recurring problems manually.

God bless Swift!

I chose Swift as the next programming language to learn after finishing Python 3. Right now, I'm reading and practicing Python 3 with Python Crash Course, and I even submitted a pull request to it for a likely new revision of the book.

I started digging into Swift after The Next Web's article on Google considering Swift as a first-class programming language for Android (although I'm almost sure that is nearly impossible). A few months after that article, I forked an orphaned repository on GitHub to bring Swift 4 to my favorite desktop distro, Fedora. Now I have the swift4-on-fedora public repository on GitHub, and I'll test building a Swift 4 RPM package on Fedora 27 as soon as I install it on my PC.

I read The Verge's article on Swift code running on Google's Fuchsia OS, and now I'm more confident than ever in the bright future of Apple's Swift programming language.

UPDATE: I recently moved to the development team as a DevOps engineer, and since our stack is based on Golang, I stopped studying Swift and started learning Golang.

My first serious coding with Python 3!

I made some free time to update my blog, and I'm dedicating this post to my first serious piece of code.

The situation:

Our company has a database of users who are MTNIrancell subscribers and have signed up for at least one of our services. On the other side, MTNIrancell has a database of their users who are subscribed to one or more of our services.

The problem:

The problem was that MTNIrancell DB and our DB were out of sync.

MTNIrancell gave us their user list as a .csv file and asked us to compare the list with our DB and finally hand over the diff.

Abraham had been struggling with our DB and MTNIrancell's csv file for a couple of days. In those days, I was reading the Files and Exceptions chapter of Eric Matthes' Python Crash Course. Abe asked me to help him, and I was glad, because I could finally develop a serious operational piece of code.

Finally, I developed this code using Python 3 and IntelliJ IDEA:

filename = 'ghosts.csv'

with open('irancell.csv') as mtn:
    mtn_lines = mtn.readlines()

# readlines(), not read(): read() returns one big string, and iterating
# over a string yields single characters instead of lines
with open('vada.csv') as vada:
    vada_lines = vada.readlines()

line_number = 0

for line in vada_lines:
    line_number += 1
    print("Now reading line number: " + str(line_number) + " of vada.csv")
    if line not in mtn_lines:
        with open(filename, 'a') as file_diff:
            file_diff.write(line)


The code worked: it compared MTNIrancell's CSV with our CSV file and exported the users that exist in our DB but not in MTNIrancell's. MTNIrancell's DB had about 2 M records, while ours had about 1.1 M.

My code was working but it was too slow. It compares those csv files about 5 records per second. Aberaham and Ali didn’t have enough free time to wait for the code to finish. So they googled and found comm tool. Comm could solve a problem in a very very short time rather than my Python3 code. It’s because my python3 code is a single thread/single process code while comm could use more thread/process. Although I am not sure I can use multi thread/multi processing to compare those files in a shorter time, but who knows, maybe I refactor the code someday to support multi process processing.