Problems after restarting a mission-critical physical server

This incident, which caused 28 minutes of downtime, started yesterday morning when Zabbix informed us that free disk space had dropped below 20% on one of the disk arrays of a physical server. I logged into the hypervisor's web interface to see which virtual machines were filling up the disk. But the Proxmox VE panel showed the node itself as offline, and all VMs and containers on it were grayed out! Strangely, all machines on it were still serving requests successfully.
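
In a healthy state, a couple of commands on the node would have answered the "which guest is eating the disk?" question; here is a rough sketch (local-lvm is just an example storage name, not necessarily ours):

    # overall filesystem usage on the host
    df -h

    # Proxmox's own view of each configured storage
    pvesm status

    # per-volume usage on one storage, e.g. local-lvm
    pvesm list local-lvm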

I googled the symptoms and tried restarting some PVE services such as pvestatd, but it didn't help!
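
For reference, the restarts were along these lines; pvestatd is the daemon that reports node and guest status, so it is the usual suspect when everything shows up gray (this assumes a systemd-based PVE node, and the exact service set depends on the PVE version):

    # restart the status/API daemons
    systemctl restart pvestatd
    systemctl restart pvedaemon
    systemctl restart pveproxy

    # check whether they actually came back up
    systemctl status pvestatd pvedaemon pveproxy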

Also, I couldn't execute simple commands like df on the physical server. I had wanted to run that command manually because Zabbix had told us this physical server had less than 20% free space on one of its storage volumes. Even after a long wait, the command never completed. By then I was suspicious not only of Proxmox VE, but of the Linux kernel itself. So I executed journalctl -f, but there was nothing useful there. Executing tail -f /var/log/pveam.log had the same result. I was thinking of rebooting this server because of its almost 2 years of uptime, but it was risky and we could not afford downtime.
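
A hanging df usually points at stuck I/O rather than at Proxmox. A sketch of checks that tell this quickly (not what was run at the time):

    # give df ten seconds instead of waiting forever
    timeout 10 df -h

    # processes stuck in uninterruptible sleep (D state) are a strong hint of I/O trouble
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

    # the kernel log often shows the real cause (controller resets, filesystem errors)
    dmesg -T | tail -n 50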

This screenshot shows a Proxmox VE node with 684 days of uptime!

Invoking /etc/init.d/pve-manager restart stopped all virtual machines and containers on this server. I spent a few minutes trying to recover them and bring them back to operational status, but none of them came up. I had to reboot the server! We had two mission-critical database servers, one mission-critical application server, and a couple of non-mission-critical machines on this host.
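
The recovery attempt boiled down to commands like these (guest IDs are examples; 105 is the database container mentioned later):

    # list guests and their current state
    qm list        # virtual machines
    pct list       # LXC containers

    # try to start them again
    qm start 101
    pct start 105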

Unfortunately, issuing the reboot command didn't restart the server. So, thanks to HPE's iLO technology, I restarted it through the server's iLO web interface.
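
The web interface was enough here, but the same hard power cycle can be scripted over IPMI if that is enabled on the iLO (address and credentials below are placeholders):

    # hard power cycle via iLO's IPMI-over-LAN interface
    ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis power cycle

    # confirm the power state afterwards
    ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis power status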

The bigger problem arose when all of the containers and virtual machines came back to operational mode. My colleagues couldn't log in to the application web panel. Abraham checked the application server's nginx logs and told me he was getting a database connection error whenever he tried to open the panel link. At first, I guessed it could be a database access/permission problem. I tried to SSH to the database server but I couldn't, even though I could ping it. So I tried to get into that machine using lxc-attach --name 105 from the physical server. Luckily enough, I could get into the container and found out that the postgresql and ssh services had not started. I started them and linked their startup scripts from /etc/init.d/ssh and /etc/init.d/postgresql to /etc/rc2.d/S99ssh and /etc/rc2.d/S99postgresql. Since the container runs an old stable version of Debian, it still uses SysV init and does not support systemd.
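
Roughly, the fix inside the container looked like this (container ID 105 as above; on SysV-based Debian, update-rc.d is the cleaner alternative to creating the rc2.d links by hand):

    # from the physical server, get a shell inside the container
    lxc-attach --name 105

    # inside the container: start the services that did not come up
    /etc/init.d/ssh start
    /etc/init.d/postgresql start

    # make them start on the next boot: either link manually ...
    ln -s /etc/init.d/ssh /etc/rc2.d/S99ssh
    ln -s /etc/init.d/postgresql /etc/rc2.d/S99postgresql

    # ... or let update-rc.d manage the runlevel links
    update-rc.d ssh defaults
    update-rc.d postgresql defaults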

Unfortunately, starting the PostgreSQL server did not solve the problem. I tailed the PostgreSQL logs and didn't see any error or informational entries about connection or permission problems coming from the application server.
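
Checking this was along these lines (the log path depends on the PostgreSQL version installed in the container; 9.4 here is only a guess):

    # follow the cluster log and watch for connection/authentication errors
    tail -f /var/log/postgresql/postgresql-9.4-main.log

    # confirm PostgreSQL is actually accepting local connections
    su - postgres -c "psql -c 'SELECT 1;'"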

We have pgbouncer installed on the application server (the one that had just been rebooted and couldn't connect to the database server, which had also just been rebooted). So, I SSHed to the application server and found out pgbouncer was not running. I started the pgbouncer service, but when I checked its status, it reported FAILED!

I executed the ss -tlnp command and confirmed that pgbouncer was not really running. I checked the pgbouncer log, but the latest entry was from before the server restart; starting or restarting pgbouncer didn't record anything new.
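
The check itself is simple: pgbouncer listens on port 6432 by default, so an empty result here means nothing is actually serving connections (sketch):

    # is anything listening on pgbouncer's default port?
    ss -tlnp | grep 6432

    # cross-check against the process list
    ps aux | grep '[p]gbouncer'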

Running pgbouncer manually was the first thing I thought of. pgbouncer /etc/pgbouncer/pgbouncer.ini -v -u postgres ran successfully, and our colleagues could open the application web panel again. So pgbouncer's configuration file, /etc/pgbouncer/pgbouncer.ini, was OK and I had to look for something else. The next thing I suspected was pgbouncer's init script. But how could I make sure whether that init script was OK or not?
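
For the record, running pgbouncer by hand while debugging looks roughly like this; the user and config path match what is above, and -d daemonizes it, which is only a stopgap until the init script is fixed:

    # run in the foreground with verbose logging, as the postgres user
    pgbouncer -v -u postgres /etc/pgbouncer/pgbouncer.ini

    # or detach into the background once the config is known to be good
    pgbouncer -d -u postgres /etc/pgbouncer/pgbouncer.ini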

It was easy! I had to download the original pgbouncer .deb package and compare its init script with the init script we had on our server. I found the original .deb package of pgbouncer here, downloaded it, and extracted it somewhere.
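
Extracting the package and diffing the scripts takes two commands (the package filename below stands for whatever version was downloaded):

    # unpack the .deb without installing it
    dpkg-deb -x pgbouncer_*_amd64.deb ./pgbouncer-orig

    # compare the shipped init script with the one on the server
    diff ./pgbouncer-orig/etc/init.d/pgbouncer /etc/init.d/pgbouncer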

Unfortunately, when I compared them, I found them essentially identical! I had to execute plan C: read pgbouncer's init script carefully and figure out what actually happens when we execute /etc/init.d/pgbouncer start. I saw that the script checks a file before it starts anything. I opened that file in vim and found the problem! Here is that file, and you will see what ailed me for almost 8 minutes!