Disk I/O is overloaded on a PostgreSQL server

For a couple of weeks now, Zabbix has occasionally been showing the message “Disk I/O is overloaded on db” for a PostgreSQL database server. The alert would stay active anywhere from a few minutes to a few hours and then clear on its own.

The other day, I checked all of that VM’s graphs in Zabbix and then logged in to the VM to check its health manually. Everything seemed OK. I checked the PostgreSQL logs, and at first sight they looked fine too.

As this problem didn’t have any impact on performance or business revenue and I had more important things to do, I didn’t spend much time on it. Today, I finally made time to investigate it further.

Since this VM is dedicated to the PostgreSQL server, doesn’t serve anything else, and all operating system metrics looked OK, I went directly to the PostgreSQL service. First of all, I realized slow query logging was not enabled on this server. So I uncommented the line containing log_min_duration_statement and gave it a value of 30000. Note that this number is in milliseconds, so PostgreSQL will log any query that runs for more than 30 seconds.
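For reference, the relevant part of the config looks roughly like this (the path assumes a Debian-style PostgreSQL 9.6 install, matching the log path below):

# /etc/postgresql/9.6/main/postgresql.conf
# Log any statement that runs longer than 30 seconds (the value is in milliseconds)
log_min_duration_statement = 30000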

Thankfully, this setting doesn’t require a restart; a reload is enough for PostgreSQL to pick up the new configuration. So I executed systemctl reload postgresql and tailed the PostgreSQL logs with tail -f /var/log/postgresql/postgresql-9.6-main.log. I watched for a few minutes, but there was nothing to indicate what was wrong. I poured a cup of coffee, did a flow task, and came back to see whether postgres had logged anything related to the problem. I saw this:

2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME ERROR:  integer out of range
2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME STATEMENT:  INSERT INTO user_comments (delivery_id,services,timestamp) VALUES (2376584071,'{255}',NOW())


I had found the problem: every time a customer tried to leave a comment, that INSERT overflowed the integer range of delivery_id and failed. We didn’t need to change the type of delivery_id this time, since we no longer use it that way; instead, I asked Abraham to completely remove the code related to user comments.

Hint: this is a massive database, about 660 GB in size. A few months ago we had a serious problem on this DB server: we couldn’t insert any rows into the main table, and after investigation we found out that the id column of that table was defined as integer and we had hit its limit of about 2.1 billion. This database’s tables are quite old, and its original architect didn’t foresee such a day!


We changed the type of the id column from integer to bigint with a lot of hassle and a few hours of unwanted downtime.
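For reference, the migration itself boils down to a single ALTER TABLE; the table and database names below are placeholders, not our real schema. On a table of this size the statement rewrites the whole table under an exclusive lock, which is where the hours of downtime came from:

# Widen the primary key column from integer to bigint (placeholder table/DB names)
psql -U postgres -d DB_NAME -c "ALTER TABLE main_table ALTER COLUMN id TYPE bigint;"

The backing sequence does not need to change, since PostgreSQL sequences are already 64-bit; only the column itself (and any foreign key columns referencing it) does.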

P.S.

| Name | Storage Size | Description | Range |
|---|---|---|---|
| smallint | 2 bytes | small-range integer | -32768 to +32767 |
| integer | 4 bytes | typical choice for integer | -2147483648 to +2147483647 |
| bigint | 8 bytes | large-range integer | -9223372036854775808 to +9223372036854775807 |
| decimal | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
| numeric | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
| real | 4 bytes | variable-precision, inexact | 6 decimal digits precision |
| double precision | 8 bytes | variable-precision, inexact | 15 decimal digits precision |
| smallserial | 2 bytes | small autoincrementing integer | 1 to 32767 |
| serial | 4 bytes | autoincrementing integer | 1 to 2147483647 |
| bigserial | 8 bytes | large autoincrementing integer | 1 to 9223372036854775807 |

The table above shows the ranges of the numeric types in PostgreSQL 9.6, borrowed from the PostgreSQL documentation.

Problems after restarting a mission-critical physical server

This big problem, which caused 28 minutes of downtime, started yesterday morning when Zabbix informed us that free disk space was below 20% on one of the disk arrays of a physical server. I logged into the hypervisor’s web interface to see which machines had filled up the disk, but the Proxmox VE panel showed the node itself as offline and all VMs and containers on it were grayed out, even though every machine on it was still serving requests successfully.

I googled the symptoms and tried restarting some PVE services such as pvestatd, but it didn’t help.
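For the record, that amounted to commands like the following (pvestatd is the Proxmox status daemon; restarting it is harmless, but it changed nothing in our case):

# Restart the Proxmox status daemon; pvedaemon and pveproxy can be restarted the same way
systemctl restart pvestatd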

I also couldn’t execute some simple commands like df on the physical server. I had wanted to run it manually because Zabbix had told us this server had less than 20% free space on one of its storage arrays, but the command hung and never returned. At that point I suspected not only Proxmox VE but the Linux kernel itself. I executed journalctl -f, but there was nothing useful there, and tail -f /var/log/pveam.log gave the same result. I thought about rebooting the server because of its almost two years of uptime, but that was risky and we could not afford downtime.

This screenshot shows a Proxmox VE node with 684 days of uptime!

Invoking /etc/init.d/pve-manager restart stopped all virtual machines and containers on this server. I spent a few minutes trying to recover them and bring them back to operational status, but nothing worked. I had to reboot the server! We had two mission-critical database servers, one mission-critical application server, and a couple of non-mission-critical machines on it.

Unfortunately, issuing the reboot command couldn’t restart the server, so, thanks to HPE’s iLO technology, I restarted it from the server’s iLO web interface.

The bigger problem arose when all the containers and virtual machines came back to operational mode: my colleagues couldn’t log in to the application web panel. Abraham checked the application server’s nginx logs and told me he got a database connection error whenever he tried to open the panel. At first I guessed it could be a database access/permission problem. I tried to ssh to the database server but couldn’t, although I could ping it. So I attached to that machine with lxc-attach --name 105 from the physical server. Luckily, I got into the container and found out that the postgresql and ssh services had not started. I started them and linked their startup scripts from /etc/init.d/ssh and /etc/init.d/postgresql to /etc/rc2.d/S99ssh and /etc/rc2.d/S99postgresql. Since it was an old-stable version of Debian, it still used SysV init and did not support systemd.
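Roughly, the recovery steps looked like this (container ID 105 and runlevel 2 as in our setup; a sketch, not a literal copy of my shell history):

# On the Proxmox host: attach to the database container
lxc-attach --name 105

# Inside the container: start the services that did not come up after the reboot
/etc/init.d/ssh start
/etc/init.d/postgresql start

# Make them start automatically at the next boot (SysV init, runlevel 2)
ln -s /etc/init.d/ssh /etc/rc2.d/S99ssh
ln -s /etc/init.d/postgresql /etc/rc2.d/S99postgresql

On a SysV Debian, update-rc.d would be the more conventional way to create these links, but the manual symlinks get the job done.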

Unfortunately, starting the PostgreSQL server did not solve the problem. I tailed the PostgreSQL logs and didn’t see any error or info messages about connection or permission problems coming from the application server.

We have pgbouncer installed on the application server (the one that had just rebooted and couldn’t connect to the database server, which had also just rebooted). So I sshed to the application server and found out that pgbouncer was not running. I started the pgbouncer service, but when I checked its status, it reported FAILED!

I executed ss -tlnp and confirmed that pgbouncer really wasn’t running. I checked the pgbouncer log, but the latest entries were from before the server restart; starting or restarting pgbouncer didn’t record any new log lines.
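For reference, this is the kind of check I mean (6432 is pgbouncer’s default listen port; adjust it if your pgbouncer.ini says otherwise):

ss -tlnp | grep 6432   # no output means nothing is listening on pgbouncer's port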

Running pgbouncer manually was the first thing I thought of. pgbouncer /etc/pgbouncer/pgbouncer.ini -v -u postgres ran successfully and our colleagues could open the application web panel again. So pgbouncer’s configuration file at /etc/pgbouncer/pgbouncer.ini was fine and I had to look for something else. My next thought was that the problem could be in pgbouncer’s init script. But how could I make sure whether that init script was OK or not?

It was easy! I had to download the original pgbouncer .deb package and compare its init script with the one on our server. I found the original .deb package of pgbouncer here, downloaded it, and extracted it somewhere.
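Something along these lines; the exact package filename depends on your Debian release, so treat it as a placeholder:

# Extract the package without installing it, then diff the init scripts
dpkg-deb -x pgbouncer_1.7.2-2_amd64.deb /tmp/pgbouncer-orig
diff /tmp/pgbouncer-orig/etc/init.d/pgbouncer /etc/init.d/pgbouncer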

Unfortunately, when I compared them, they were practically identical! I had to execute plan C. Plan C was to read pgbouncer’s init script closely and find out what actually happens when we execute /etc/init.d/pgbouncer start. I saw that the script checks a file before starting anything. I opened that file in vim and saw what the problem was! Here is that file, and you will see what ailed me for almost 8 minutes!
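The original file contents did not survive in this post, so here is a reconstruction of what it most likely was: on the Debian packaging of pgbouncer from that era, the init script sources /etc/default/pgbouncer, which ships with startup disabled, so the init script silently refuses to start until you flip the flag:

# /etc/default/pgbouncer (reconstruction; check your own copy)
# The init script exits immediately unless this is set to 1
START=0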


Resolve repetitive problems automatically using Zabbix!

Situation

We use Proxmox as a hypervisor in our company, along with other virtualization technologies like VMware ESXi. About a year ago, we purchased a new server, an HP ProLiant DL380p G8, and I installed the latest version of Proxmox on it (proxmox-ve 4.2 in those days).

Problem

After a few months we had several containers and virtual machines on this server, but unfortunately, the problems started a few months ago. First, Zabbix complained about disk I/O overload on the physical server and on some of its virtual machines. Then we noticed that none of the VMs and containers were responding to requests!

All containers and VMs grayed out when we hit the memory allocation problem on Proxmox.

None of the containers and virtual machines were reachable via SSH. So we checked the hypervisor’s logs (thank God, SSH and the web panel were still accessible on the hypervisor), and guess what we found there?

A weird message I had never seen before, repeating every 2 seconds. It was something like this:

Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:58 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:56 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:54 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:53 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:51 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:49 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:47 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:45 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:43 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:41 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:39 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:37 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:35 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:33 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:31 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:29 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:27 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:25 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:23 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:21 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:19 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)

It seemed this was what had made all the containers and virtual machines on this physical server inaccessible. After googling, we realized that this version of Proxmox (or rather, that version of the Debian Linux kernel) has this problem with XFS filesystem memory management. The workaround was quite easy: you just need to drop the memory caches. To do that, invoke this command as root on the Proxmox host:

/bin/echo 3 > /proc/sys/vm/drop_caches

We thought this bug would be solved in the next Proxmox update and we would not face the problem again.

After a couple of weeks, we faced a new problem on this physical server running Proxmox 4.2: high swap usage. High swap usage on the hypervisor caused a lack of free swap space in the containers and VMs running on it.

Usually, the Linux kernel handles this too, but unfortunately, the kernel version used by Proxmox 4.2 could not resolve it even after a few hours. So we had to fix it the manual Linux sysadmin way! The solution wasn’t too hard or too weird: we just needed to do a swapoff and swapon:

/sbin/swapoff -a && /sbin/swapon -a

Just like the XFS memory allocation problem, we never thought we would see this one again, but as you can probably guess, we ended up facing not one but both of these problems every couple of weeks.

Needless to say, the Proxmox installer chose the filesystem and the amount of swap space automatically.

Permanent solution

I know, I know: the permanent solution is to update Proxmox, which probably includes a Linux kernel update. Unfortunately, there is no kernel update in Proxmox’s update list. Re-installing Proxmox, or installing a newer version of it, might solve the problem, but since we do not want any downtime and we do not have enough resources to move all containers and virtual machines to another server and re-install Proxmox, we had to choose another solution.

I had prior experience with Zabbix’s action feature from when I was working for TENA, so I chose to use Zabbix actions to work around these problems.

I defined two new actions in Zabbix, one per problem. Defining a new action is not hard. I added two operations to each action: one to inform the sysadmin team and one to execute the actual command. I also enable recovery operations, which inform the sysadmins that the previously reported problem has been resolved.
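The command-execution operations are simply the two workarounds described above, configured as remote commands that run on the Proxmox host via the Zabbix agent (which, in the Zabbix version we use, requires EnableRemoteCommands=1 in zabbix_agentd.conf):

# Remote command for the XFS memory allocation deadlock action
/bin/echo 3 > /proc/sys/vm/drop_caches

# Remote command for the high swap usage action
/sbin/swapoff -a && /sbin/swapon -a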

Thanks to Zabbix, we no longer need to put in any effort to solve these recurring problems manually.