Memcached's weak default configuration may be abused for DDoS attacks!

It was Thursday, March 1st 2018, 09:30 AM. Someone from the data center called us (the on-call folks on our IT operations team) and told us about a massive transfer from one of our servers to several destinations. One of the destinations was nmap!

I, or better say we, started to investigate whether they were right. First of all, I checked that VM's important graphs on the Zabbix and Grafana dashboards. The important ones looked like these:

Massive network traffic caused by Memcached.
Massive CPU load caused by Memcached.


There is no default trigger on interface network traffic, and we hadn't set one on that item. The processor load trigger didn't fire either, because its default threshold was >5, while the server's processor load rarely even reached 4.


I asked my colleagues if they had transferred anything from this VM. They all denied it!

While investigating this problem and looking at the Zabbix dashboard, I realized that the Zabbix agent was unreachable on another server and that one of our services had stopped — and strangely enough, none of our monitoring guys had called us about it.

I saw:

  1. At that moment, the problem was gone.
  2. The data center had limited our network port to 100 Mbps, so the abuse would no longer seriously affect anyone else.
  3. Our usual transfer rate does not need a gigabit port; 100 Mbps is enough.
  4. Two other teammates of mine were already working on this problem.
  5. No one saw any other problems, and
  6. Although the first problem did not affect the business directly, these new ones would.

So, I informed my teammates and started investigating these new problems. After enough digging, I realized that a glitch during a VM backup had locked one of our database VMs, which made the Zabbix agent unreachable on that VM and spawned too many processes on the application server, since it kept retrying connections to its DB server. These are the alerts that showed me the problem that could affect our business directly:

A disk I/O overload, then the Zabbix agent unreachable on the DB, and a backup glitch on the hypervisor.
We were behind compared to the same time yesterday.
A backup glitch on the DB.


Let's get back to the main problem! Well, it took some time to find its source. As we were blind on this one, we had to investigate more. Since the problem showed itself as massive network transfer and CPU load, we watched iftop and htop to catch it in the act and see which process was responsible. Fortunately, we caught it!

iftop shows anomalous transfer on the VM's public network port.

With the ss tool, the successor to the obsolete netstat, we can see which application and process is listening on port 11211:

ss shows that memcached is listening on port 11211!


And a quick Google search showed us it's a known Memcached problem! Thanks to this post on Cloudflare's engineering blog, we understood it's a sort of vulnerability in Memcached and that we have to limit its listening IP address.

Unfortunately, limiting Memcached's listening IP address might cause unexpected behavior in the application, but we could still solve the problem easily! There are probably several solutions, but we preferred iptables:

iptables -A INPUT -m tcp -p tcp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP
iptables -A INPUT -m udp -p udp -d PUBLIC_IP_OF_THE_SERVER --dport 11211 -j DROP

With these rules, we limited access to Memcached to just our LAN and loopback interfaces. Phew, there it is!
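Firewalling the public IP works, but binding Memcached away from public interfaces is the more durable fix the Cloudflare post points toward. A sketch of the relevant lines, assuming a Debian-style /etc/memcached.conf (the path and exact flags may differ on your distro):

```
# /etc/memcached.conf (Debian-style path — an assumption; adjust for your distro)
-l 127.0.0.1     # listen on loopback only; add a LAN address here if the app needs it
-U 0             # disable the UDP listener, which is the amplification vector
```

Restart memcached afterwards. Newer Memcached releases (reportedly 1.5.6 and later) ship with UDP disabled by default for exactly this reason.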

The strange thing for me is that neither CVE nor Netcraft has addressed this problem yet. You may say CVE only addresses vulnerabilities and this one is not a vulnerability, but I believe they should record this issue as an entry for Memcached.

All research about this problem points to this post on Cloudflare's engineering blog!

Meltdown and Spectre vulnerabilities

I know! I know! I've been very active in blogging these days! That's because of important international and domestic events! In my two previous posts, I covered two important things that happened to me at work during these two days.

I've been reading more about the Meltdown and Spectre vulnerabilities these days. In recent years, we have had a bunch of vulnerabilities in the computer software industry, like Shellshock, Heartbleed, the GRUB 2 authentication bypass by pressing backspace 28 times, and, last but not least and probably the most important one, a critical bug in Microsoft Windows' SMB server that finally led to WannaCry. Now, it's time for a critical vulnerability in CPU hardware!

Unlike the bugs mentioned earlier, Spectre and Meltdown are hardware vulnerabilities. These kinds of bugs are very, very hard to find and very, very hard to fix. As people cannot replace all their affected hardware, we need to address these bugs in operating system kernels. Patching kernels for such bugs complicates them further and effectively moves the bugs from hardware into software.

Almost everyone in computing is talking about these hot bugs, and I do not want to repeat them. I am just here to draw three conclusions from these vulnerabilities:

  1. I've studied Computer Organization and Design by David A. Patterson and John L. Hennessy and Modern Operating Systems by Andrew S. Tanenbaum over the last two years. Thanks to these scientists and my prior studies in electronics engineering technology, I can understand how such a thing is possible. I also know how hard it is to discover such a bug in hardware — it cannot be done by a single person, only by groups of security researchers spending years on study and research. I suggest you take a look at these books. You will love them. I promise.
  2. I learned that independent people can participate in, and have a great effect on, such important papers. Paul Kocher did great work on Spectre and was a great help in finding and addressing Meltdown; he is cited in those papers as an independent researcher. For-profit corporations like Cyberus Technology GmbH and Rambus' Cryptography Research Division also contributed to the papers.
  3. This is the first time I have seen such great cooperation from German companies in a very important computer science affair.

Disk I/O is overloaded on a PostgreSQL server

For a couple of weeks, Zabbix has occasionally shown the message "Disk I/O is overloaded on db" on a PostgreSQL database server. The message appeared for anywhere from a few minutes to a few hours and then vanished.

The other day, I checked all of that VM's graphs on Zabbix and then logged in to the VM to check its health status manually. Everything seemed OK. I checked the PostgreSQL logs, and at first sight they were OK too.

As this problem didn't have any impact on performance or business revenue and I had more important things to do, I didn't spend a lot of time on it. Today, I made enough time to investigate it further.

As this VM is dedicated to a PostgreSQL server and serves nothing else, and all the operating system items seemed OK, I went directly to the PostgreSQL service! First of all, I realized slow query logging was not enabled on this server. So I un-commented the line containing log_min_duration_statement and gave it a value of 30,000. Note that the number is in milliseconds, so PostgreSQL will log any query with a run time of more than 30 seconds.

Thanks to Postgres, this didn't need a restart; a reload was enough to pick up the new configuration. So I executed systemctl reload postgresql and tailed the logs with tail -f /var/log/postgresql/postgresql-9.6-main.log. I spent a few minutes there, but there was nothing to indicate what was wrong. I poured a cup of coffee, did a flow task, and returned to see if Postgres had logged anything related to the problem. I saw this:

2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME ERROR:  integer out of range
2018-01-17 08:26:11.337 +0330 [19994] USER@DB_NAME STATEMENT:  INSERT INTO user_comments (delivery_id,services,timestamp) VALUES (2376584071,'{255}',NOW())


I had figured out the problem: whenever a customer tried to leave a comment, we would hit it. We didn't need to change the type of delivery_id this time, since we no longer use it that way. I asked Abraham to completely remove the code related to user comments.

Hint: this is a massive database, about 660 GB in size. A few months ago we had a serious problem on this DB server: we couldn't insert any rows into the main table of this DB, and after investigation we found out that the table's id column was defined as integer and we had hit its limit of about 2.1 billion. This DB's tables are quite old, and its original architect didn't foresee such a day!


We changed the type of the id column from integer to bigint, with a lot of hassle and a few hours of unwanted downtime.
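To see why inserts suddenly started failing, compare the offending value from the log above with the 4-byte integer ceiling. The shell arithmetic below is just an illustration; the table and column names in the commented migration are assumptions, not our real schema:

```shell
# A PostgreSQL integer is 4 bytes: its maximum value is 2^31 - 1.
int_max=$(( (1 << 31) - 1 ))
echo "integer max: $int_max"    # integer max: 2147483647

# The delivery_id from the failing INSERT is past that ceiling:
[ 2376584071 -gt "$int_max" ] && echo "2376584071 overflows a 4-byte integer"

# The fix we applied, roughly (rewrites and locks the table, hence the downtime):
#   ALTER TABLE some_table ALTER COLUMN id TYPE bigint;
```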


Name              Storage size  Description                        Range
smallint          2 bytes       small-range integer                -32768 to +32767
integer           4 bytes       typical choice for integer         -2147483648 to +2147483647
bigint            8 bytes       large-range integer                -9223372036854775808 to +9223372036854775807
decimal           variable      user-specified precision, exact    up to 131072 digits before the decimal point; up to 16383 digits after it
numeric           variable      user-specified precision, exact    up to 131072 digits before the decimal point; up to 16383 digits after it
real              4 bytes       variable-precision, inexact        6 decimal digits precision
double precision  8 bytes       variable-precision, inexact        15 decimal digits precision
smallserial       2 bytes       small autoincrementing integer     1 to 32767
serial            4 bytes       autoincrementing integer           1 to 2147483647
bigserial         8 bytes       large autoincrementing integer     1 to 9223372036854775807

The table above shows the ranges of numeric types in PostgreSQL 9.6, borrowed from here.

Problems after restarting a mission-critical physical server

This big problem, which caused 28 minutes of downtime, started yesterday morning when Zabbix informed us that free disk space was below 20% on one of the disk arrays of a physical server. I logged into the hypervisor web interface to see which machines had filled up the disk, but the Proxmox VE panel showed the node itself as offline, with all its VMs and containers grayed out! Strangely, all the machines on it were still serving requests successfully.

I googled and tried restarting some PVE services like pvestatd, but it didn't work!

I also couldn't execute some simple commands like df on the physical server. I had decided to run df manually because Zabbix had reported less than 20% free space on one of its storages, but the command hadn't completed even after a long time. I was suspicious of not only Proxmox VE, but the Linux kernel itself. So I executed journalctl -f, but there was nothing useful there. tail -f /var/log/pveam.log gave the same result. I was thinking of rebooting this server, given its almost two years of uptime, but it was risky and we could not afford downtime.

This screenshot shows a Proxmox VE node with 684 days of uptime!

Invoking /etc/init.d/pve-manager restart stopped all virtual machines and containers on this server. I spent a few minutes trying to recover them and bring them back to operational status, but none of them worked. I had to reboot the server! It hosted two mission-critical databases, one mission-critical application server, and a couple of non-mission-critical machines.

Unfortunately, issuing the reboot command couldn't restart the server. So, thanks to HPE's iLO technology, I restarted it through the server's iLO web interface.

The bigger problem arose when all the containers and virtual machines came back up: my colleagues couldn't log in to the application web panel. Abraham checked the application server's nginx logs and told me he got a database connection error when opening the panel link. First, I guessed it could be a database access/permission problem. I tried to SSH to the database server but couldn't, although I could ping it. So I attached to that machine with lxc-attach --name 105 from the physical server. Luckily, I got into the container and found out the postgresql and ssh services had not started. I started them and linked their startup scripts from /etc/init.d/ssh and /etc/init.d/postgresql to /etc/rc2.d/S99ssh and /etc/rc2.d/S99postgresql. Since this was an old-stable version of Debian, it still used SysV init and did not support systemd.
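On SysV-style Debian, the rc2.d links are what start services at boot, and `update-rc.d ssh defaults` is the idiomatic way to create them. The hand-made links amount to this — a throwaway-directory sketch, not the real system paths:

```shell
# Mimic /etc/init.d and /etc/rc2.d in a scratch directory
mkdir -p /tmp/rc-demo/init.d /tmp/rc-demo/rc2.d
printf '#!/bin/sh\necho started\n' > /tmp/rc-demo/init.d/postgresql
chmod +x /tmp/rc-demo/init.d/postgresql

# S99 = start late in the runlevel; the link target is the init script itself
ln -sf ../init.d/postgresql /tmp/rc-demo/rc2.d/S99postgresql
/tmp/rc-demo/rc2.d/S99postgresql    # prints "started"
```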

Unfortunately, starting the PostgreSQL server did not solve the problem. I tailed the PostgreSQL logs and saw no errors or messages related to connection or permission problems coming from the application server.

We have pgbouncer installed on the application server (the one that had just rebooted and couldn't connect to the freshly rebooted database server). So I SSHed to the application server and found out pgbouncer was not running. I started the pgbouncer service, but when I checked its status, it said FAILED!

I executed ss -tlnp and confirmed pgbouncer was not really running. I checked the pgbouncer log, but the latest entry predated the restart; starting or restarting pgbouncer didn't record any log at all.

Running pgbouncer manually was the first thing I tried. pgbouncer /etc/pgbouncer/pgbouncer.ini -v -u postgres ran successfully and our colleagues could open the application web panel. So pgbouncer's configuration file at /etc/pgbouncer/pgbouncer.ini was OK, and I had to look for something else. The next candidate was pgbouncer's init script. But how could I make sure whether that init script was OK or not?

It was easy! I had to download the original pgbouncer .deb package and compare its init script with the one on our server. I found the original .deb package of pgbouncer here, downloaded it, and extracted it somewhere.

Unfortunately, when I compared them, I found them practically identical! I had to execute plan C: read pgbouncer's init script closely and find out what happens when we execute /etc/init.d/pgbouncer start. I saw that the script checks a file before it starts. I opened that file in vim and found the problem! Here is that file — you will see what ailed me for almost 8 minutes!


Resolve repetitive problems automatically using Zabbix!


We use Proxmox as a hypervisor, along with other technologies like VMware ESXi, for virtualization in our company. About a year ago, we purchased a new server, an HP ProLiant DL380p G8, and I installed the latest version of Proxmox on it (proxmox-ve-4.2 in those days).


After a few months, we had several containers and virtual machines on this server, but unfortunately, problems started a few months ago. First, Zabbix complained about disk I/O overload on the physical server and some of its virtual machines. Then we saw that none of the VMs and containers were responding to requests!

All containers and VMs grayed out when we faced the memory allocation problem on Proxmox.

None of the containers and virtual machines were reachable via SSH. So we checked the hypervisor's logs (thank God, SSH and the web panel were still accessible on the hypervisor) and guess what we found there?

A message so weird I had never seen it before, repeating every two seconds. It was something like this:

Dec 18 05:48:00 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:58 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:56 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:54 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:53 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:51 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:49 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:47 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:45 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:43 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:41 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:39 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:37 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:35 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:33 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:31 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:29 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:27 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:25 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:23 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:21 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)
Dec 18 05:47:19 prx3 kernel: XFS: loop0(1739) possible memory allocation deadlock size 37344 in kmem_alloc (mode:0x2400240)

It seemed this was what made all the containers and virtual machines on this physical server inaccessible. After googling the problem, we realized that this version of Proxmox (that is, that version of the Debian Linux kernel) has this problem with XFS filesystem memory management. The solution was quite easy: just drop the memory caches by invoking this command as root on the Proxmox host:

/bin/echo 3 > /proc/sys/vm/drop_caches

We thought this bug would be solved in the next Proxmox update and that we would not face the problem again.

After a couple of weeks, we faced a new problem on this physical server running Proxmox 4.2: high swap usage. High swap usage on the hypervisor caused a lack of free swap space in the containers and VMs running on it.

Usually, the Linux kernel handles this on its own, but unfortunately, the kernel used in Proxmox 4.2 could not resolve it even after a few hours. So we had to solve it the manual Linux sysadmin way! The solution wasn't too hard or too weird: we just needed a swapoff and swapon:

/sbin/swapoff -a && /sbin/swapon -a

Just like the XFS memory allocation problem, we never expected to see this problem again, but as you can probably guess, we went on to face not one but both of these problems every couple of weeks.

Needless to say, the Proxmox installer chooses the filesystem and the amount of swap space automatically.

Permanent solution

I know, I know — the permanent solution is to update Proxmox, which would probably include a Linux kernel update. Unfortunately, there is no kernel update in the Proxmox update list. Re-installing Proxmox, or installing a newer version, might solve the problem, but since we do not want any downtime and do not have enough resources to move all containers and virtual machines to another server and re-install, we had to choose another solution.

I had prior experience with Zabbix's action feature from when I worked at TENA, so I chose Zabbix actions to solve these problems.

I defined two new actions in Zabbix, one per problem. Defining a new action is not hard. I added two operations to each: one to inform the sysadmins and one to execute the actual remediation command. I usually also enable recovery operations, which inform the sysadmins that the problem has been resolved successfully.
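Concretely, the remote-command operations of the two actions boil down to the same one-liners we had been running by hand. Both need root on the hypervisor, so the Zabbix agent there has to allow remote commands (EnableRemoteCommands=1 in the agent config) and have the needed privileges — those details depend on your agent setup:

```shell
# Action for the XFS memory-allocation deadlock: flush dirty pages, then drop caches
/bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches

# Action for hypervisor swap exhaustion: move swapped pages back into RAM
/sbin/swapoff -a && /sbin/swapon -a
```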

Thanks to Zabbix, we no longer need to make any manual effort to resolve such recurring problems.

God bless Swift!

I chose Swift as the next programming language to learn after finishing Python 3. Right now, I'm still reading and practicing Python 3 with Python Crash Course, and I even submitted a pull request for a likely new revision of that book.

I started digging into Swift after thenextweb's article on Google considering Swift as a first-class programming language for Android (although I am almost sure that is nearly impossible). A few months after that article, I forked an orphaned repository on GitHub to bring Swift 4 to my favorite desktop distro, Fedora. Now I have this swift4-on-fedora public repository on GitHub, and I'll test building a Swift 4 RPM package on Fedora 27 as soon as I install it on my PC.

I read theverge's article about Swift code running on Google's Fuchsia OS, and now I'm more confident than ever in the bright future of Apple's Swift programming language.



UPDATE: I recently moved to the development team as a DevOps engineer, and since our stack is based on Golang, I stopped studying Swift and started Golang.

Another power supply unit defect!

Today, Mehrnoosh brought her PC to the company and asked me to take a look at it. She said she can no longer turn it on.

The first thing I did was plug it in and make sure its power cord was OK! Then I opened the case doors and detached the power supply unit's connectors from the motherboard, HDD, and ODD. Then I jumped the green wire to a black wire to rule out the power button. The power supply would not turn on even with the green wire shorted to black, so the power supply was damaged (it's possible the power button has a problem too). We completely removed the power supply and replaced it with a spare. Everything was OK and the operating system booted, so the only problem was the power supply itself. As I know from experience, a lot of power supply makers use poor capacitors in their products, so they blow up after a few years of use.

I opened the power supply and saw I was right! Some of the capacitors had blown!

The main culprit was a 10 V / 2200 µF capacitor, so I suggested Mehrnoosh replace it with a 16 V (or higher) capacitor of the same capacitance — electrolytic, or preferably solid — or buy a reliable PSU such as a Cooler Master product. She chose the second option!

History of the United States (book)

I am not going to talk about Project Gutenberg, but about a book I am reading these days: History of the United States by Helen Pierson. Thanks to Project Gutenberg, I can read books on my old Android phone. I searched for books about people I have always admired, like Benjamin Franklin, Abraham Lincoln, and other great figures, but unfortunately there were not many books about them. I cannot hide my passion for the United States, so I started reading the EPUB version of History of the United States in Words of One Syllable on my phone.

There are several editions of this book, like the 2010 paperback edition and the latest edition (as of the day I am writing this post!), but the one I'm reading now is the 1889 edition.

Now I'm on page 33, and I have really enjoyed these 33 pages. It is a thin book, and I suggest you read it.

My first serious coding with Python 3!

I made some free time to update my blog. I dedicate this post to my first serious code.

The situation:

Our company has a database of users who subscribed to at least one of our services and are MTNIrancell users. On the other side, MTNIrancell has a database of their users who are subscribed to one or more of our services.

The problem:

The problem was that MTNIrancell DB and our DB were out of sync.

MTNIrancell gave us their user list as a .csv file and asked us to compare it with our DB and finally give them the diff.


Abraham had been struggling with our DB and MTNIrancell's csv file for a couple of days. At the time, I was reading the Files and Exceptions chapter of Eric Matthes' Python Crash Course. Abe asked me to help him, and I was glad, because I could finally write some serious operational code.


Finally, I developed this code using Python 3 and IntelliJ IDEA:

filename = 'ghosts.csv'

with open('irancell.csv') as mtn:
    mtn_lines = mtn.readlines()

with open('vada.csv') as vada:
    vada_lines = vada.readlines()

line_number = 0

for line in vada_lines:
    line_number += 1
    print("Now reading line number: " + str(line_number) + " of vada.csv")
    if line not in mtn_lines:
        with open(filename, 'a') as file_diff:
            file_diff.write(line)
The code worked: it compared MTNIrancell's csv with our csv file and exported the users that exist in our DB but not in MTNIrancell's. MTNIrancell's DB had about 2 M records, while ours had about 1.1 M.

My code worked, but it was too slow — it compared about 5 records per second. Abraham and Ali didn't have enough free time to wait for it to finish, so they googled and found the comm tool, which solved the problem in a very short time compared to my Python 3 code. That's not because of threading: my code rescans the whole mtn_lines list for every line of vada.csv (roughly 1.1 M × 2 M comparisons), while comm works on sorted input and needs only a single linear merge pass. Loading mtn_lines into a Python set would give a similar speedup; maybe I'll refactor the code that way someday.
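For the record, the comm approach they used looks roughly like this — comm wants both inputs sorted, and -23 keeps the lines unique to the first file. Tiny made-up numbers stand in for the real ~1–2 M-row files:

```shell
# Sample data (stand-ins for vada.csv and irancell.csv)
printf '111\n222\n333\n' > vada.csv
printf '222\n444\n' > irancell.csv

# comm requires sorted input; one linear merge pass then does the diff
sort -o vada.sorted vada.csv
sort -o irancell.sorted irancell.csv

# Lines only in vada.sorted: subscribers in our DB missing from MTNIrancell's
comm -23 vada.sorted irancell.sorted > ghosts.csv
cat ghosts.csv    # 111 and 333
```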

Disassemble Lenovo IdeaPad 300

Yesterday, Farshid asked me to take a look at a Lenovo IdeaPad 300 that Mr. Ahmadi complained he could not turn on. Farshid had spent some time on it and finally asked me for help. I checked the items Farshid thought might be causing the problem. I suspected it might be something software-related, so we had to reboot it; but the laptop was fully charged and (since the device was off and not draining the battery) waiting for the battery to deplete would take a long time. Finally, as the battery is not removable in the IdeaPad 300, we decided to disassemble it! At this point Mehrnoosh joined us to see what was happening. As I had studied electronics and had some prior experience disassembling and mending devices, I started the work! Disassembly was fairly easy. It took me back to the old days of hardware and electronics.


P.S.: The problem was something more serious, so we decided to send it to one of Lenovo's authorized repair shops.