Sometime I wish you an happy new year with music mixes, this time, this new year start by an issue, with redis database server, again. Looks like 2 years back with this post.
And we just lost 4.5 months of community data on the NodeBB community forum. That’s insane and a great lost for everybody.
How all started
On August 2017 I moved personal community forum from ovh VPS (debian 8.6) to Azure CH2i VM (debian 8.10). I spend one day for full migration (nodebb, redis and some other things) and all worked fine. Then I spent another 1/2 day to make redis-database backup working. I was unable to make this working even if redis-server was same version. Finally I got it working (not really sure what I done, typed so much command in this day) but backup file was created, so I though all was OK
Then Microsoft Azure maintenance email came
On December I received a maintenance update for my azure VM, no problem, got it
Dear Azure customer, Your Azure Virtual Machines (VMs) require an important security and maintenance update. The vast majority of Azure updates are performed without impact but, for this specific update, a reboot of your VMs is necessary. A maintenance window has been scheduled starting January 10th 2018 (00:00 UTC) during which, Azure will automatically perform the required VM reboot. An affected VM will be unavailable for several minutes, as it reboots. For any VM in an availability set or a VM scale set, Azure will reboot the VMs one Update Domain at a time to limit the impact to your environments. Additionally, operating system and data disks will be retained during this maintenance. You have one or more VMs that are eligible to initiate self-service maintenance proactively at any time between now and January 9th, 2018 (12:00 UTC). You can see which VMs are eligible for a self-service maintenance and initiate this step using the Azure Portal Also see the how-to guides for Windows/Linux VMs to learn more. If you complete this self-service maintenance step before January 9th, your VMs will be marked as updated and will not be impacted by the scheduled maintenance window. If you initiate self-service maintenance, the temporary disk will not be maintained.
Reboot does not mean all is okay
2 days ago while I was working on the VM, I remember that this one needed a reboot, so as usual I issued a sudo reboot then checked CH2i web site was back and left all alone, it was working (at least at first view)
Happy new year, start fine
Today I needed to read back a article I wrote on this blog to put a customer Raspberry PI Zero into read-only mode and I noticed that the comments on bottom of the article was missing (and some was very useful).
So I went to community forum web site and it was down (makes sense why I had no comments). No problem ssh on the server and look at nodeBB logs. It’s indicate nodeBB can’t start because it need a database schema update for current running nodeBB version. Humm, that’s strange but issue the famous ./nodebb upgrade , and then ./nodebb start and nodeBB came back to life.
But, … WTF?
Last forum post is dated from August 10th 2017? whoo, ok maybe a hacker got a security hole in nodeBB or redis-database and played with post (it already happened), so no problem I’ve got backup every day. restore the redis-database backup and yes it is, start again !!!
But why again nodeBB ask for upgrade database, it was working 2 days ago, I didn’t upgraded nodeBB version so what ?
You know what, WTF again
Hey, with database restore of last day, same things, data are 4.5 month old !!!
Ok go to server again, and take a look at those backup
root@srv:/var/lib/redis# l backup/ total 459920 drwxr-xr-x 2 redis redis 4096 Jan 3 21:40 . drwxrwxr-x 3 redis redis 4096 Jan 3 22:03 .. -rw-r--r-- 1 root root 15394559 Dec 4 11:13 6379-dump.201712041113.rdb -rw-r--r-- 1 root root 15394559 Dec 5 11:13 6379-dump.201712051113.rdb -rw-r--r-- 1 root root 15394559 Dec 6 11:13 6379-dump.201712061113.rdb -rw-r--r-- 1 root root 15394559 Dec 7 11:13 6379-dump.201712071113.rdb -rw-r--r-- 1 root root 15394559 Dec 8 11:13 6379-dump.201712081113.rdb -rw-r--r-- 1 root root 15394559 Dec 9 11:13 6379-dump.201712091113.rdb -rw-r--r-- 1 root root 15394559 Dec 10 11:13 6379-dump.201712101113.rdb -rw-r--r-- 1 root root 15394559 Dec 11 11:13 6379-dump.201712111113.rdb -rw-r--r-- 1 root root 15394559 Dec 12 11:13 6379-dump.201712121113.rdb -rw-r--r-- 1 root root 15394559 Dec 13 11:13 6379-dump.201712131113.rdb -rw-r--r-- 1 root root 15394559 Dec 14 11:13 6379-dump.201712141113.rdb -rw-r--r-- 1 root root 15394559 Dec 15 11:13 6379-dump.201712151113.rdb -rw-r--r-- 1 root root 15394559 Dec 16 11:13 6379-dump.201712161113.rdb -rw-r--r-- 1 root root 15394559 Dec 17 11:13 6379-dump.201712171113.rdb -rw-r--r-- 1 root root 15394559 Dec 18 11:13 6379-dump.201712181113.rdb -rw-r--r-- 1 root root 15394559 Dec 19 11:13 6379-dump.201712191113.rdb -rw-r--r-- 1 root root 15394559 Dec 20 11:13 6379-dump.201712201113.rdb -rw-r--r-- 1 root root 15394559 Dec 21 11:13 6379-dump.201712211113.rdb -rw-r--r-- 1 root root 15394559 Dec 22 11:13 6379-dump.201712221113.rdb -rw-r--r-- 1 root root 15394559 Dec 23 11:13 6379-dump.201712231113.rdb -rw-r--r-- 1 root root 15394559 Dec 24 11:13 6379-dump.201712241113.rdb -rw-r--r-- 1 root root 15394559 Dec 25 11:13 6379-dump.201712251113.rdb -rw-r--r-- 1 root root 15394559 Dec 26 11:13 6379-dump.201712261113.rdb -rw-r--r-- 1 root root 15394559 Dec 27 11:13 6379-dump.201712271113.rdb -rw-r--r-- 1 root root 15394559 Dec 28 11:13 6379-dump.201712281113.rdb -rw-r--r-- 1 root root 15394559 Dec 29 11:13 6379-dump.201712291113.rdb -rw-r--r-- 1 root root 15394559 Dec 30 11:13 6379-dump.201712301113.rdb -rw-r--r-- 1 root root 15394559 Dec 31 11:13 6379-dump.201712311113.rdb -rw-r--r-- 1 root root 15394559 Jan 3 19:37 6379-dump.201801011113.rdb -rw-r--r-- 1 root root 6088333 Jan 2 11:13 6379-dump.201801021113.rdb -rw-r--r-- 1 root root 6088333 Jan 3 11:13 6379-dump.201801031113.rdb -rw-r--r-- 1 root root 6089159 Jan 1 11:13 6379-dump.201801011113.rdb -rw-r--r-- 1 root root 6166505 Jan 3 21:40 6379-dump.201801032140.rdb root@srv:/var/lib/redis#
As you can see there is a backup every day at 11H13, but all the backup have the same size (except the last one after reboot)
What’s wrong, does this mean that all backup are with data from August, sounds like so !!!
What happened on reboot
Even if backup are wrong, why a clean shutdown/reboot break the current redis database ?
Let’s look at the log when rebooted the VM, what DB saved to disk ? that’s what redis says
[39644] 01 Jan 22:38:28.021 * Background saving started by pid 30262 [30262] 01 Jan 22:38:30.770 * DB saved on disk [30262] 01 Jan 22:38:30.771 * RDB: 1 MB of memory used by copy-on-write [39644] 01 Jan 22:38:30.827 * Background saving terminated with success [39644] 01 Jan 22:42:34.313 # User requested shutdown... [39644] 01 Jan 22:42:34.313 * Saving the final RDB snapshot before exiting. [39644] 01 Jan 22:42:38.170 * DB saved on disk [39644] 01 Jan 22:42:38.170 * Removing the pid file. [39644] 01 Jan 22:42:38.170 # Redis is now ready to exit, bye bye... [333] 01 Jan 22:43:02.402 # You requested maxclients of 10000 requiring at least 10032 max file descriptors. [333] 01 Jan 22:43:02.668 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted. [333] 01 Jan 22:43:02.668 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'. _._ _.-``__ ''-._ _.-`` `. `_. ''-._ Redis 2.8.17 (00000000/0) 64 bit .-`` .-```. ```\/ _.,_ ''-._ ( ' , .-` | `, ) Running in stand alone mode |`-._`-...-` __...-.``-._|'` _.-'| Port: 6379 | `-._ `._ / _.-' | PID: 333 `-._ `-._ `-./ _.-' _.-' |`-._`-._ `-.__.-' _.-'_.-'| | `-._`-._ _.-'_.-' | http://redis.io `-._ `-._`-.__.-'_.-' _.-' |`-._`-._ `-.__.-' _.-'_.-'| | `-._`-._ _.-'_.-' | `-._ `-._`-.__.-'_.-' _.-' `-._ `-.__.-' _.-' `-._ _.-' `-.__.-' [333] 01 Jan 22:43:02.669 # Server started, Redis version 2.8.17 [333] 01 Jan 22:43:05.663 * DB loaded from disk: 2.993 seconds [333] 01 Jan 22:43:05.663 * The server is now ready to accept connections on port 6379 [333] 02 Jan 11:13:01.342 * Background saving started by pid 41017 [41017] 02 Jan 11:13:01.591 * DB saved on disk [41017] 02 Jan 11:13:01.591 * RDB: 0 MB of memory used by copy-on-write [333] 02 Jan 11:13:01.684 * Background saving terminated with success [333] 03 Jan 11:13:02.009 * Background saving started by pid 50778 [50778] 03 Jan 11:13:02.486 * DB saved on disk [50778] 03 Jan 11:13:02.486 * RDB: 0 MB of memory used by copy-on-write [333] 03 Jan 11:13:02.498 * Background saving terminated with success [333] 03 Jan 19:33:18.722 * 1 changes in 900 seconds. Saving... [333] 03 Jan 19:33:18.722 * Background saving started by pid 11705 [11705] 03 Jan 19:33:19.689 * DB saved on disk
So what, looks like something went wrong and I don’t know what. But what I know, for sure, is that we lost more than 4 months of data, and I apologize and feeling bad for this, how can this happen ? and how to prevent to happen again ?
Conclusion
I’m just stuck, For now I opened back a ticket on github because of course I’m not the only one who encounter this issue.
Any comments or help to understand this problem and avoid it again are welcome!
By the way I apologize for this, I’ve lost 2 month of living data and interesting posts and topics. Users account registered in the interval are also lost, it’s a big loss for my community and all fans.
To avoid this happening again, I’ll investigate on backup and check them.
While writing this post, a user posted a new message on forum, so after that, I rebooted the server, and the message is still there, unbelievable it seems to work and I didn’t changed anything.