Disk I/O is overloaded on a PostgreSQL server

For a couple of weeks now, Zabbix has occasionally been showing the message “Disk I/O is overloaded on db” for a PostgreSQL database server. The alert would last anywhere from a few minutes to a few hours and then vanish.

The other day, I checked all of that VM’s graphs in Zabbix and then logged in to the VM to check its health status manually. Everything seemed OK. I checked the PostgreSQL logs and, at first sight, they looked OK too.

As this problem didn’t have any impact on performance or business revenue and I had more important things to do, I didn’t spend much time on it. Today, I finally made enough time to investigate it properly.

Since this VM is dedicated to a PostgreSQL server, serves nothing else, and all of the operating-system metrics looked OK, I went directly to the PostgreSQL service! First of all, I realized that slow-query logging was not enabled on this PostgreSQL server. So I uncommented the line containing log_min_duration_statement and gave it a value of 30,000. Note that this number is in milliseconds, so PostgreSQL will log any query with a run time of more than 30 seconds.
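For reference, the change is a single line in postgresql.conf (on this Debian box it lives at /etc/postgresql/9.6/main/postgresql.conf; your path may differ):

# log every statement that runs longer than 30 seconds (the value is in milliseconds)
log_min_duration_statement = 30000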

Thankfully, PostgreSQL doesn’t need a restart for this change; a reload is enough to pick up the new configuration. So I executed systemctl reload postgresql and tailed the PostgreSQL logs with tail -f /var/log/postgresql/postgresql-9.6-main.log. I spent a few minutes watching, but nothing there indicated what was wrong. I poured a cup of coffee, did a flow task and came back to see whether Postgres had logged anything related to the problem. I saw this:

2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME ERROR:  integer out of range
2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME STATEMENT:  INSERT INTO user_comments (delivery_id,services,timestamp) VALUES (2376584071,'{255}',NOW())

I had figured out the problem. Whenever a customer tried to leave a comment, that INSERT failed because the delivery_id value no longer fit into an integer column. We didn’t need to change the type of delivery_id this time, since we no longer use it that way; instead, I asked Abraham to completely remove the code related to user comments.
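You can see why the INSERT fails without touching the application: the inserted value is simply larger than 2147483647, the maximum of PostgreSQL’s 4-byte integer type (DB_NAME below is a placeholder):

psql -d DB_NAME -c "SELECT 2376584071::integer;"   # ERROR:  integer out of range
psql -d DB_NAME -c "SELECT 2376584071::bigint;"    # works: bigint is 8 bytes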

Hint: this is a massive database, around 660GB in size. A few months ago we had a serious problem on this DB server: we couldn’t insert any rows into the main table, and after some investigation we found out that the id column of that table was defined as integer and we had hit its limit of roughly 2.1 billion. This DB’s tables are quite old and its original architect didn’t predict such a day!

We changed the type of the id column from integer to bigint, with a lot of hassle and a few hours of unwanted downtime.
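The change itself boils down to one statement (the table name below is illustrative, not our real one); the painful part is that on PostgreSQL 9.6 this rewrites the entire table under an ACCESS EXCLUSIVE lock, which on a table this size is where the hours of downtime came from:

# illustrative sketch of the migration, not our exact command
psql -U postgres -d DB_NAME -c "ALTER TABLE main_table ALTER COLUMN id TYPE bigint;"
# the backing sequence doesn't need touching: sequences are already 64-bit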

P.S.

Name              Storage Size  Description                      Range
smallint          2 bytes       small-range integer              -32768 to +32767
integer           4 bytes       typical choice for integer       -2147483648 to +2147483647
bigint            8 bytes       large-range integer              -9223372036854775808 to +9223372036854775807
decimal           variable      user-specified precision, exact  up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
numeric           variable      user-specified precision, exact  up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
real              4 bytes       variable-precision, inexact      6 decimal digits precision
double precision  8 bytes       variable-precision, inexact      15 decimal digits precision
smallserial       2 bytes       small autoincrementing integer   1 to 32767
serial            4 bytes       autoincrementing integer         1 to 2147483647
bigserial         8 bytes       large autoincrementing integer   1 to 9223372036854775807

The table above shows the ranges of the numeric types in PostgreSQL 9.6, borrowed from here.

Problems after restarting a mission-critical physical server

This big problem, which caused 28 minutes of downtime, started yesterday morning when Zabbix informed us that free disk space was below 20% on one of the disk arrays of a physical server. I logged into the hypervisor web interface to see which machines had filled up the disk, but the Proxmox VE panel showed the node itself as offline and all VMs and containers on it were grayed out! Strangely enough, all machines on it were still serving requests successfully.

I googled the issue and tried restarting some PVE services such as pvestatd, but it didn’t help!
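That attempt was nothing fancier than a restart and a status check via systemd (pvestatd shown here; I went through a couple of other pve services the same way):

systemctl restart pvestatd    # the Proxmox VE status daemon
systemctl status pvestatd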

Also, I couldn’t execute some simple commands like df on the physical server. I had tried running it manually because Zabbix had told us this physical server had less than 20% free space on one of its storage volumes, but the command just hung and never returned. That made me suspicious not only of Proxmox VE, but of the Linux kernel itself. So I executed journalctl -f, but there was nothing useful there. tail -f /var/log/pveam.log gave the same result. I was thinking of rebooting this server, given its almost 2 years of uptime, but it was risky and we could not afford downtime.

This screenshot shows a Proxmox VE node with 684 days of uptime!

Invoking /etc/init.d/pve-manager restart stopped all virtual machines and containers on this server. I spent a few minutes trying to recover them and bring them back to operational status, but nothing worked. I had to reboot the server! We had 2 mission-critical databases, 1 mission-critical application server and a couple of non-mission-critical machines on this host.

Unfortunately, issuing the reboot command couldn’t restart the server. So, thanks to HPE’s iLO technology, I restarted it using the server’s iLO web interface.

The bigger problem arose when all the containers and virtual machines came back to operational mode. My colleagues couldn’t log in to the application web panel. Abraham checked the application server’s nginx logs and told me he got a database connection error whenever he tried to open the panel link. At first, I guessed it could be a database access/permission problem. I tried to ssh to the database server but couldn’t, although I could ping it. So I tried to get access to that machine using lxc-attach --name 105 from the physical server. Luckily, I got into the container and found out that the postgresql and ssh services had not started. I started them and linked their startup scripts from /etc/init.d/ssh and /etc/init.d/postgresql to /etc/rc2.d/S99ssh and /etc/rc2.d/S99postgresql so they would come up on boot. Since the container runs an old-stable version of Debian, it still uses SysV init and does not support systemd.
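Roughly, the recovery inside the container looked like this (reconstructed from memory; update-rc.d would have been the cleaner way to register the links, but hand-made symlinks did the job):

lxc-attach --name 105                      # get a shell inside container 105 from the Proxmox host

# inside the container: start the services that had not come up
/etc/init.d/ssh start
/etc/init.d/postgresql start

# make them start on boot under SysV init (runlevel 2 is Debian's default)
ln -s /etc/init.d/ssh /etc/rc2.d/S99ssh
ln -s /etc/init.d/postgresql /etc/rc2.d/S99postgresql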

Unfortunately, starting the PostgreSQL server did not solve the problem. I tailed the PostgreSQL logs and didn’t see any error or informational entries about connection or permission problems coming from the application server.

We have pgbouncer installed on the application server (the one that had just rebooted and couldn’t connect to the database server, which had rebooted too). So I sshed to the application server and found out pgbouncer was not running. I started the pgbouncer service, but when I checked its status, it told me FAILED!

I executed ss -tlnp and confirmed that pgbouncer was really not listening. I checked the pgbouncer log, but the latest entry was from before the server restart; starting or restarting pgbouncer didn’t record anything new.
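The check itself is a one-liner (pgbouncer listens on port 6432 by default; ours may be configured differently):

ss -tlnp | grep pgbouncer    # no output: no pgbouncer process is listening
ss -tlnp | grep 6432         # same check by port number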

Running pgbouncer manually was the first thing that came to mind. pgbouncer /etc/pgbouncer/pgbouncer.ini -v -u postgres ran successfully and our colleagues could open the application web panel again. So pgbouncer’s configuration file at /etc/pgbouncer/pgbouncer.ini was OK and I had to look for something else. My next thought was that the problem could be related to pgbouncer’s init script. But how could I make sure whether that init script was OK or not?

It was easy! I had to download the original pgbouncer .deb package and compare its init script with the one we had on our server. I found the original .deb package of pgbouncer here, downloaded it and extracted it somewhere.
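Something along these lines (the package filename is just an example; use whatever version matches the installed one):

dpkg-deb -x pgbouncer_1.7.2-2_amd64.deb /tmp/pgbouncer-orig
diff /tmp/pgbouncer-orig/etc/init.d/pgbouncer /etc/init.d/pgbouncer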

Unfortunately, when I compared them, they turned out to be practically identical! I had to execute plan C. Plan C was to read pgbouncer’s init script carefully and find out what actually happens when we execute /etc/init.d/pgbouncer start. I saw that the script checks a file before it starts. I opened that file with vim and there was the problem! Here is that file, and you will see what ailed me for almost 8 minutes!
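On Debian, the file that the init script sources before starting is typically /etc/default/pgbouncer, and the stock script quietly refuses to start pgbouncer unless START is set to 1. As a minimal sketch of what a disabling file looks like (an assumption on my part, not the actual file from our server):

# /etc/default/pgbouncer -- illustrative guess, not the real file
# the init script only starts pgbouncer when START=1
START=0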