Meltdown and Spectre vulnerabilities

I know! I know! I have been blogging a lot these days! That’s because of important international and domestic events! In my 2 previous posts, I covered 2 important things that happened to me at work during these 2 days.

I’ve read more about the Meltdown and Spectre vulnerabilities these days. In recent years, we have had a bunch of vulnerabilities in the computer software industry, like Shellshock, Heartbleed, the GRUB 2 authentication bypass by pressing backspace 28 times, and last but not least and probably the most important one, a critical bug in the Microsoft Windows SMB server which finally led to WannaCry. Now, it’s time for a critical vulnerability in CPU hardware!

Unlike the bugs mentioned earlier, Spectre and Meltdown are hardware-related vulnerabilities. These kinds of bugs are very hard to find and very hard to fix. As people cannot replace all their affected hardware, we need to address these bugs in the operating system’s kernel. Patching the kernel to work around such bugs makes OS kernels more complicated and moves bugs from hardware into software.

Almost all computer people are talking about these hot bugs and I do not want to repeat them. I am just here to draw three conclusions from these vulnerabilities:

  1. I’ve studied Computer Organization and Design by David A. Patterson and John L. Hennessy and Modern Operating Systems by Andrew S. Tanenbaum during the last 2 years. Thanks to these scientists and my prior studies in electronic engineering technology, I can understand how such a thing is possible. I also know how hard it is to discover such a bug in hardware; this cannot be done by a single person, but by a group of security researchers spending years on study and research. I suggest you take a look at these books. You will love them. I promise.
  2. I made sure that independent people can participate in and have a great effect on such important papers. Paul Kocher did great work on Spectre and was a great help in finding and addressing Meltdown. He is cited in those papers as an independent researcher. Also, for-profit corporations like Cyberus Technology GmbH and Rambus, Cryptography Research Division, contributed to the papers.
  3. This is the first time that I have seen such great cooperation from a German company in a very important affair in computer science.

Disk I/O is overloaded on a PostgreSQL server

For a couple of weeks, Zabbix has occasionally shown the message “Disk I/O is overloaded on db” for a PostgreSQL database server. The message stayed for anywhere from a few minutes to a few hours and then vanished.

The other day, I checked all of that VM’s graphs in Zabbix and then logged in to the VM to check its health manually. Everything seemed OK. I checked the PostgreSQL logs and they looked OK too at first sight.

As this problem didn’t have any impact on performance or business revenue and I had more important things to do, I didn’t spend a lot of time on that. Today, I made enough time to investigate it more.

As this VM is dedicated to a PostgreSQL server, does not serve anything else, and all operating system items seemed OK, I went directly to the PostgreSQL service! First of all, I realized that logging of slow queries was not enabled on this server. So, I un-commented the line containing log_min_duration_statement and gave it a value of 30000. Note that this number is in milliseconds, so PostgreSQL will log any query that runs for at least 30 seconds.

Thanks to postgres, it didn’t need a restart; a reload was enough to pick up the new configuration. So I executed systemctl reload postgresql and followed the logs with tail -f /var/log/postgresql/postgresql-9.6-main.log. I spent a few minutes there, but there was nothing to indicate what was wrong. I poured a cup of coffee, did some other work and returned to see if postgres had logged anything related to the problem. I saw this:

2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME ERROR:  integer out of range
2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME STATEMENT:  INSERT INTO user_comments (delivery_id,services,timestamp) VALUES (2376584071,'{255}',NOW())

 

I figured out the problem. The delivery_id value 2376584071 is larger than 2147483647, the maximum of PostgreSQL’s integer type, so whenever a customer tries to leave a comment with such an id, the INSERT fails with this error. We don’t need to change the type of delivery_id this time, since we do not use it in that way anymore. I asked Abraham to completely remove the code related to user comments.

Hint: This is a massive database of about 660GB. A few months ago we had a serious problem on this DB server. We couldn’t insert any rows into the main table of this DB, and after investigation we found out that the id column of that table was defined as integer and had hit its limit of about 2.1 billion. This DB’s tables are quite old and its original architect didn’t foresee such a day!

We changed the type of the id column from integer to bigint with a lot of hassle and a few hours of unwanted downtime.

P.S.

Name | Storage Size | Description | Range
smallint | 2 bytes | small-range integer | -32768 to +32767
integer | 4 bytes | typical choice for integer | -2147483648 to +2147483647
bigint | 8 bytes | large-range integer | -9223372036854775808 to +9223372036854775807
decimal | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
numeric | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
real | 4 bytes | variable-precision, inexact | 6 decimal digits precision
double precision | 8 bytes | variable-precision, inexact | 15 decimal digits precision
smallserial | 2 bytes | small autoincrementing integer | 1 to 32767
serial | 4 bytes | autoincrementing integer | 1 to 2147483647
bigserial | 8 bytes | large autoincrementing integer | 1 to 9223372036854775807

The table above shows the ranges of the numeric types in PostgreSQL 9.6, borrowed from the PostgreSQL documentation.
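
The ranges in the table make the error from the log easy to check. Here is a small sketch in Python; the helper name is made up for illustration and has nothing to do with PostgreSQL itself. It finds the narrowest integer type that can hold a value:

```python
# Ranges of the PostgreSQL integer types, taken from the table above.
INT_RANGES = {
    "smallint": (-2**15, 2**15 - 1),
    "integer": (-2**31, 2**31 - 1),
    "bigint": (-2**63, 2**63 - 1),
}


def smallest_fitting_type(value):
    """Return the narrowest integer type that can hold value, or None."""
    # Dicts keep insertion order, so the types are tried from narrow to wide.
    for name, (low, high) in INT_RANGES.items():
        if low <= value <= high:
            return name
    return None


# The delivery_id from the failed INSERT in the log.
print(smallest_fitting_type(2376584071))  # bigint
```

So the logged delivery_id only fits into a bigint column, which is why the integer column rejected it.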

Problems after restarting a mission critical physical server

This big problem, which caused 28 minutes of downtime, started yesterday morning when Zabbix informed us that free disk space was below 20% on one of the disk arrays of a physical server. I logged into the hypervisor web interface to see which machines had filled up the disk. But the Proxmox VE panel showed the node itself as offline, and all VMs and containers on it were grayed out! Even so, all machines on it were serving requests successfully.

I googled and tried restarting some PVE services like pvestatd, but it didn’t help!

Also, I couldn’t execute even simple commands like df on the physical server. I wanted to run that command manually because Zabbix had told us this server had less than 20% free space on one of its storages. The command did not return even after a long time. I became suspicious of not only Proxmox VE, but the Linux kernel itself. So, I executed journalctl -f, but there was nothing useful there. Executing tail -f /var/log/pveam.log gave the same result. I was thinking of rebooting this server because of its almost 2 years of uptime. But it was risky and we could not afford downtime.

This screenshot shows a Proxmox-ve with 684 days uptime!

Invoking /etc/init.d/pve-manager restart stopped all virtual machines and containers on this server. I spent a few minutes trying to recover them and bring them back to operational status, but none of them worked. I had to reboot the server! We had 2 mission critical databases, 1 mission critical application server and a couple of non mission critical machines on this server.

Unfortunately, issuing the reboot command didn’t restart the server. So, thanks to HPE’s iLO technology, I restarted it using the server’s iLO web interface.

The bigger problem arose when all the containers and virtual machines were back in operation. My colleagues couldn’t log in to the application web panel. Abraham checked the application server’s nginx logs and told me he got a database connection error when he tried to open the panel link. First, I guessed it could be a database access/permission problem. I tried to ssh to the database server but I couldn’t! I could ping that server though. So, I tried to get access to that machine using lxc-attach --name 105 from the physical server. Luckily, I could get into the container and found out that the postgresql and ssh services had not started. I started them and linked the startup scripts /etc/init.d/ssh and /etc/init.d/postgresql to /etc/rc2.d/S99ssh and /etc/rc2.d/S99postgresql. Since it was an old-stable version of Debian, it still used SysV init and did not support systemd.

Unfortunately, starting the PostgreSQL server did not solve the problem. I tailed the postgresql logs and didn’t see any error or information entries related to a connection or permission problem coming from the application server.

We have pgbouncer installed on the application server (the one which had just rebooted and couldn’t connect to the database server, which had just rebooted too). So, I sshed to the application server and found out pgbouncer was not running. I started the pgbouncer service, but when I checked its status, it told me FAILED!

I executed ss -tlnp and confirmed that pgbouncer was really not running. I checked the pgbouncer log, but the latest entry was from before the restart. Starting or restarting pgbouncer didn’t record anything in the log.

Running pgbouncer manually was the first thing I thought about. pgbouncer /etc/pgbouncer/pgbouncer.ini -v -u postgres ran successfully and our colleagues could open the application web panel. So, pgbouncer’s configuration file, /etc/pgbouncer/pgbouncer.ini, was OK and I had to look for something else. The next thing I thought was that the problem could be related to pgbouncer’s init script. But how could I make sure whether that init script was OK or not?

It was easy! I had to download an original pgbouncer .deb package and compare its init script with the one on our server. I found the original .deb package of pgbouncer, downloaded it and extracted it somewhere.

Unfortunately, when I compared them, I found no meaningful difference! I had to execute plan C. Plan C was to read pgbouncer’s init script closely and find out what happens when we execute /etc/init.d/pgbouncer start. I saw that the script checks a file before it starts the service. I opened that file with vim and saw what the problem was! Here is that file and you will find out what ailed me for almost 8 minutes!

 

How to solve recurring problems with Zabbix automatically!

Situation

We use Proxmox as a hypervisor, along with other technologies like VMware ESXi, for virtualization in our company. About a year ago, we purchased a new server, an HP ProLiant DL380p G8. I installed the latest version of Proxmox (proxmox-ve-4.2 in those days) on it.

Problem

After a few months, we had several containers and virtual machines on this server, but unfortunately, problems started a few months ago. First, Zabbix complained about disk I/O overload on the physical server and some of its virtual machines. Then we saw that none of the VMs and containers responded to requests!

All containers and VMs were grayed out when we faced the memory allocation problem on Proxmox.

None of the containers and virtual machines were available via SSH. So, we checked the hypervisor’s logs (thank God, SSH and the web panel were accessible on the hypervisor), and guess what we found in its logs?

A message so weird that I had never seen it before. It was repeating every 2 seconds. It was something like this:

Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:58 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:56 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:54 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:53 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:51 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:49 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:47 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:45 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:43 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:41 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:39 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:37 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:35 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:33 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:31 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:29 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:27 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:25 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:23 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:21 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:19 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)

It seemed this was what made all containers and virtual machines on this physical server inaccessible. After googling this problem, we realized that this version of Proxmox (that is, the Linux kernel it ships) has this problem with XFS filesystem memory management. The workaround was quite easy: you just need to drop the memory caches. In order to do that, invoke this command as root on Proxmox:

/bin/echo 3 > /proc/sys/vm/drop_caches
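
Spotting this message automatically is straightforward. The sketch below is only an illustration in Python and not the check we actually configured; the function name and the second sample line are made up.

```python
import re

# Matches the kernel message shown in the log excerpt above.
XFS_DEADLOCK = re.compile(r"XFS: .*possible memory allocation deadlock")


def count_xfs_deadlocks(log_lines):
    """Count kernel log lines that report the XFS allocation deadlock."""
    return sum(1 for line in log_lines if XFS_DEADLOCK.search(line))


sample = [
    "Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory "
    "allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)",
    "Dec 18 05:48:01 prx3 kernel: some unrelated message",  # invented line
]
print(count_xfs_deadlocks(sample))  # 1
```

A non-zero count over the last few minutes of the kernel log would be the signal to drop the caches.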

We thought this bug would be solved in the next update of Proxmox and we would not face that problem again.

After a couple of weeks, we faced a new problem on this physical server running Proxmox 4.2: high usage of swap space. High swap usage on the hypervisor caused a lack of free swap space in the containers and VMs running on this server.

Usually, the Linux kernel handles this by itself, but unfortunately, the kernel version used in Proxmox 4.2 could not resolve it even after a few hours. So, we had to solve it in the manual Linux sysadmin way! The solution wasn’t too hard or too weird. We just needed to do a swapoff and a swapon:

/sbin/swapoff -a && /sbin/swapon -a
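
To see how much swap is in use before and after this, the SwapTotal and SwapFree fields of /proc/meminfo are enough. Here is a minimal sketch in Python, assuming the usual "Key: value kB" layout of that file; the function name and the sample numbers are hypothetical.

```python
def swap_used_percent(meminfo_text):
    """Compute swap usage in percent from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # sizes are given in kB
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - free) / total


# Made-up numbers; on a real host, read the text from /proc/meminfo.
sample = "MemTotal: 4000 kB\nSwapTotal: 1000 kB\nSwapFree: 250 kB\n"
print(swap_used_percent(sample))  # 75.0
```

Keep in mind that swapoff -a has to move everything that is in swap back into RAM, so it only works when there is enough free memory.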

Just like the XFS memory allocation problem, we never thought we would see this problem again, but as you have probably guessed, we faced not just one of them but both of these problems every couple of weeks.

Needless to say, the Proxmox installer chooses the filesystem and the amount of swap space automatically.

Permanent solution

I know, I know, the permanent solution is to update Proxmox, which probably includes updating the Linux kernel. Unfortunately, there is no Linux kernel update in the Proxmox update list. Re-installing Proxmox or installing a newer version may solve the problem, but as we do not want any downtime and do not have enough resources to move all containers and virtual machines to another server and re-install Proxmox, we needed to choose another solution.

I had prior experience with the action feature of Zabbix from when I was working for TENA. So, I chose to use Zabbix actions to solve these problems.

I defined two new actions in Zabbix, one for each problem. Defining a new action is not hard. I added two operations to every action, one for informing the sysadmins and one for the actual command execution. I usually also enable recovery operations, which inform the sysadmins that the problem has been resolved.

Thanks to Zabbix, we no longer need to make any effort to solve such recurring problems manually.

God bless Swift!

I chose Swift as the next programming language to learn after finishing Python 3. For now, I’m reading and practicing Python 3 with Python Crash Course, and I even sent a pull request for it, for a likely new revision of that book.

I started digging into Swift after thenextweb‘s article on Google considering Swift as a first class programming language for Android (although I am almost sure it is almost impossible). A few months after that article, I forked an orphaned repository on GitHub to bring Swift 4 to my favorite desktop distro, Fedora. Now, I have this swift4-on-fedora public repository on GitHub and I’ll test building the Swift 4 RPM package on Fedora 27 as soon as I install it on my PC.

I read theverge‘s article saying that Swift code will run on Google’s Fuchsia OS, and now I’m more confident than ever in the bright future of Apple’s Swift programming language.

Another power supply unit defect!

Today, Mehrnoosh brought her PC to the company and asked me to take a look at it. She said she could not turn her PC on anymore.

The first thing I did was plug it in and make sure its power cord was OK! Then I opened the case and detached the power supply unit connectors from the motherboard, HDD and ODD. Then I connected the green wire to a black wire with a jumper, to make sure the power button was not the problem. The power supply did not turn on when the green wire was connected to a black one. So, the power supply was damaged (it was still possible that the power button had a problem too). We completely detached the power supply and replaced it with a spare one. Everything was OK and the operating system booted. So, the only problem was the power supply itself. As I know from experience, a lot of power supply makers use poor capacitors in their products, so they blow up after a few years of work.

I opened the power supply and saw I was right! Some of the capacitors had blown up!

The main problem was a 10 volt / 2200 microfarad capacitor, so I suggested that Mehrnoosh either replace it with an electrolytic (preferably solid) capacitor of the same capacitance rated for 16 volts or more, or buy a reliable PSU like the Cooler Master products. She chose the second option!

History of the United States (book)

I am not going to talk about Project Gutenberg, but about a book I am reading these days. The book is History of the United States by Helen Pierson. Thanks to Project Gutenberg, I can read books on my old Android phone. I searched for books about people I have always admired, like Benjamin Franklin, Abraham Lincoln and some other great people, but unfortunately there were not a lot of books about them. I cannot hide my passion for the United States, so I started reading the EPUB version of the History of the United States in Words of One Syllable on my phone.

There are several editions of this book, like the 2010 paperback edition and the latest edition (as of the day I am writing this post!), but the edition I’m reading now is the 1889 edition.

Now, I’m on page 33 and I have really enjoyed reading these 33 pages. This is a thin book and I suggest you read it.

My first serious coding with python3!

I made some free time to update my blog. I dedicate this post to my first serious code.

The situation:

Our company has a database of users who have subscribed to at least one of our services and are MTNIrancell users. On the other side, MTNIrancell has a database of their users who are subscribed to one or more of our services.

The problem:

The problem was that MTNIrancell DB and our DB were out of sync.

MTNIrancell gave us their user list as a .csv file and asked us to compare the list with our DB and give them the diff.

Abraham had been struggling with our DB and MTNIrancell’s csv file for a couple of days. In those days, I was reading the Files and Exceptions chapter of Eric Matthes‘ Python Crash Course. Abe asked me to help him. I was glad because I could develop serious, operational code.

Finally I developed this code using python3 and IntelliJ IDEA:

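# Output file: users that are in our DB but not in MTNIrancell's list
filename = 'ghosts.csv'

# Read MTNIrancell's list into memory, one entry per line
with open('irancell.csv') as mtn:
  mtn_lines = mtn.readlines()

# Read our own list the same way; readlines() returns a list of lines,
# while read() would return one string and the loop below would then
# walk over single characters
with open('vada.csv') as vada:
  vada_lines = vada.readlines()

line_number = 0

for line in vada_lines:
  line_number += 1
  print("Now reading line number: " + str(line_number) + " of vada.csv")
  # A list lookup scans mtn_lines from the start on every iteration
  if line not in mtn_lines:
    with open(filename, 'a') as file_diff:
      file_diff.write(line)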


The code worked: it compared MTNIrancell’s csv with our csv file and exported the users that exist in our DB but not in MTNIrancell’s DB. MTNIrancell’s DB had about 2 M records while our DB had about 1.1 M records.

My code worked, but it was too slow: it compared about 5 records per second. Abraham and Ali didn’t have enough free time to wait for it to finish. So they googled and found the comm tool. comm solved the problem in a very short time compared to my Python 3 code. I blamed my single thread/single process code for this, but comm itself is a single-threaded tool. The more likely reason is that comm works on sorted files and finds the differences in a single pass, while my code searches the whole MTNIrancell list again for every line of our file. Who knows, maybe I will refactor the code someday to make it faster.
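
One likely reason for the slowness is the lookup `if line not in mtn_lines`, which scans a list of about 2 M entries for each of our 1.1 M lines. Below is a minimal sketch of the same comparison with a set. The function names are hypothetical and this is not the script we actually ran.

```python
def lines_missing_from(ours, theirs):
    """Return the lines of `ours` that do not appear in `theirs`."""
    seen = set(theirs)  # average O(1) membership test instead of a list scan
    return [line for line in ours if line not in seen]


def write_diff(our_path, their_path, out_path):
    """Write every line of our file that is absent from their file."""
    with open(their_path) as their_file:
        theirs = their_file.readlines()
    with open(our_path) as our_file:
        ours = our_file.readlines()
    # Open the output once instead of once per missing line.
    with open(out_path, "w") as out_file:
        out_file.writelines(lines_missing_from(ours, theirs))


if __name__ == "__main__":
    ours = ["a\n", "b\n", "c\n"]
    theirs = ["b\n", "d\n"]
    print(lines_missing_from(ours, theirs))
```

With the file names from the post, the call would be write_diff('vada.csv', 'irancell.csv', 'ghosts.csv'). Both files still have to fit in memory, as in the original code.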

Disassemble Lenovo IdeaPad 300

Yesterday, Farshid asked me to take a look at a Lenovo IdeaPad 300 that, as Mr. Ahmadi complained, would not turn on. Farshid had spent some time on it and finally asked me for help. I checked the items Farshid thought might cause the problem. I thought it might be something related to the software, so we had to force a reboot, but the laptop was fully charged and waiting for the battery to deplete might take a long time (because the device was not drawing on the battery). Finally, as the battery is not removable in the IdeaPad 300, we decided to disassemble it! At this point Mehrnoosh joined us to see what was happening. As I studied electronics and had some prior experience in disassembling and mending devices, I started the work! Disassembling was fairly easy. It took me back to the old days of hardware and electronics.

P.S.: The problem was something more serious, so we decided to send it to one of Lenovo’s authorized repair shops.

What are the minimum features of a 2017 smartphone?

Before installing LineageOMS 14.1 on my Sony Xperia M phone, I had decided to buy a new smartphone. On one side, I didn’t want to spend a lot of money on a high end smartphone. On the other side, I wanted to buy something that would get updates for at least 2 years, so I could focus on other things rather than finding custom ROMs for updating the OS. Apple’s iPhones are very high quality smartphones, but they are very expensive. I thought about the iPhone SE because it runs iOS and was not as expensive as the iPhone 7 or even the iPhone 6S. The only problem with the iPhone SE was its 1.2 megapixel front camera. I continued googling, but this time looking for Android phones with the minimum features I like. Here are the most important ones:

LED technology display

The LCD era is over. Now, we are in the era of LED (more precisely, OLED) displays. In short, LCD pixels do not emit light themselves: they need a backlight, and colors are generated by passing that light through red, green and blue filters. LED pixels emit colored light by themselves (using different semiconductors) and do not need any backlight. So they can use less energy and have vivid colors. AMOLED will be the dominant display technology very soon. Sony, LG, HTC, Huawei and others have been waiting for Apple to introduce an iPhone with an AMOLED display, so they can follow its way and build their new phones with AMOLED displays.

Notifications

These days, mobile phone manufacturers try to find efficient ways of notifying people, from the traditional LED notification light to a second display like the one the LG V20 has. I think the most efficient way is AOD (Always On Display), which Samsung uses on its flagship smartphones and even on mid range phones like the Galaxy A5 2017.

Finger print sensor

Does it take too much time to unlock your smartphone? Don’t want everybody to watch your password or pattern? A fingerprint sensor is a good way to solve both.

Fast Charge

Our usage of smartphones is increasing, so it is better to buy a smartphone with a fast charge feature.

USB type C

Everybody in this world has trouble plugging in USB cables and gadgets (Type A, B, Micro USB and all other types except Type C). The plug is not reversible and we try it the wrong way around almost 50% of the time. It is annoying and, in the worst case, you may damage the port. Reversible USB Type C is the solution!

Good selfie camera

On a trip and finding it hard to get someone to take a good, sharp photo of you? You have to buy a smartphone with a good selfie camera. It’s better to have a front flash, but it’s not a big deal.

Price

Last but not least, price is an important item. If you want to stay on the latest updates, you can use your smartphone for a maximum of 3 years if it is a flagship, and 2 years if it is a mid level phone. So, I recommend not spending a lot of money on a very high end phone.

My suggestion

With all this in mind, in Aug 2017 my recommended mobile phone is the Samsung Galaxy C5 Pro. Its main features are a 5.2″ Super AMOLED full HD display, 4GB of RAM, 64GB of internal storage, USB Type C, AOD, fast charge, a fingerprint sensor, 16 megapixel f/1.9 rear and front cameras, and Android Nougat, at a price of about 350 USD on Amazon.

If you can live without an LED display, the Huawei Nova 2 Plus can be an alternative.