Summary: what happened and what we are doing about it
- On March 4, our Ceph storage cluster suffered a catastrophic failure caused by disk corruption.
- We had recently added eight new U.2 NVMe drives, and soon after, corruption appeared in our erasure-coded pool.
- Attempts to repair or remove the faulty disks didn’t help; the corruption spread to previously healthy disks.
- With multiple OSDs failing and the cluster going offline, we had no choice but to reprovision everything.
- We changed the disk block size to 4096 bytes, removed erasure coding, switched fully to replication-3, and restored from backups.
- Unfortunately, some of our customers did not have automatic scheduled backups, leading to permanent data loss for them.
- Going forward, we will:
- Set up automatic backups for all customers by default
- Hire consultants, invest in staff training
- Use replication-3 to ensure higher redundancy and reduce the risk of similar failures
- We understand your frustration and sincerely apologize. We are committed to making sure this does not happen again.
- The root cause of the issue was a critical bug in Ceph Squid 19.2.
Hello, my name is David and I’m the CEO of HostUp. On the 4th of March at 14:30, our storage system experienced a catastrophic failure. This is an account of the events that followed and of what we have done to ensure this never happens again.
The root cause of the issue was disk corruption. For storage in each cluster, we use Ceph, a highly redundant solution that replicates data to multiple servers. We’ve been using Ceph for years without issues.
On the 2nd of March, our technician performed a routine task: adding more disks to our storage cluster. In total, we added eight U.2 NVMe drives. Just a few hours after the rebalancing process started, we received notices that OSDs were crashing. Upon investigation, we found that the OSD software panicked as soon as it read the corrupted data.
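For readers unfamiliar with the workflow, the sketch below shows roughly what adding an OSD and reviewing daemon crash reports looks like; the device path and crash ID are illustrative, not taken from our cluster, and the exact deployment method may differ.

```bash
# Illustrative only: deploy a new NVMe device as an OSD with ceph-volume.
ceph-volume lvm create --data /dev/nvme8n1

# List the crash reports the cluster has collected, then inspect one of them.
ceph crash ls
ceph crash info <crash-id>
```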
Checking with `ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-X`, we confirmed fsck errors with lextent overlaps, indicating that parts of the disk were corrupt.
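For context, this is the shape of the check referenced above; the OSD ID is illustrative, and the daemon has to be stopped before fsck can open its BlueStore data.

```bash
# Offline consistency check for a single OSD (OSD ID is illustrative).
systemctl stop ceph-osd@12
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12
```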
In the following hours, we tried to fix this issue, but it proved more troublesome than we anticipated. We first removed the new OSDs, but the corruption continued appearing on previously functional OSDs.
Since we had never seen such an issue before adding the new disks, we believed that as data was rebalanced onto them, corrupted data was also being backfilled from them onto the older disks, corrupting those as well.
`ceph-bluestore-tool repair` wouldn’t work, so we had to delete the OSD and let Ceph replicate healthy data onto it again. This happened repeatedly: another OSD would crash, we’d remove it, and Ceph would rebuild.
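The remove-and-rebuild cycle looked roughly like the sketch below; the OSD ID and device path are illustrative, and the exact steps varied with the state of each OSD.

```bash
# Rough sketch of the removal/redeploy cycle for a crashed OSD
# (OSD ID and device path are illustrative).
ceph osd out 12
ceph osd purge 12 --yes-i-really-mean-it
ceph-volume lvm zap /dev/nvme8n1 --destroy
ceph-volume lvm create --data /dev/nvme8n1
```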
Each time, the corruption only showed up in our erasure-coded pool, even though most of our data was in the replicated pool (`fsck` showed errors like `2#5:a6000b40:::rbd_data.2.`, where 2 is the ID of our erasure-coded 4+2 pool).
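To map an object name like that back to a pool, the leading number is the pool ID, which can be cross-checked against the pool listing; the pool name and profile in the sample output below are illustrative.

```bash
# The leading number in an error like 2#5:... is the pool ID; listing pools
# with their details lets you match it to a specific pool.
ceph osd pool ls detail
# illustrative output line:
# pool 2 'ec-data' erasure profile ec-4-2 size 6 min_size 5 ...
```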
This led us to suspect a firmware incompatibility between the new disks and erasure coding. We had previously used the same disk model without issue, but this batch shipped with newer firmware. `smartctl` showed no errors, so the disks themselves seemed healthy. We briefly considered a bad backplane, but because the corruption appeared on multiple servers, that seemed less likely.
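The drive health checks were along these lines; the device paths are illustrative.

```bash
# SMART/NVMe health checks on a new drive (device paths are illustrative).
smartctl -a /dev/nvme0n1
nvme smart-log /dev/nvme0
```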
Two days later, we hit a real emergency: multiple OSDs went offline on three servers at once. As a result, PGs went down and became unreadable. We had been working for two days to solve this without success, and with the storage cluster offline and OSDs failing to start, we felt we were out of options.
We tried to dump the offline PG from the crashed OSD, but it failed as soon as it hit the corrupted data. We called 45Drives requesting emergency support, but never heard back.
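The recovery attempt looked roughly like this; the OSD ID, PG ID, and file path are illustrative, and in our case the export aborted when it hit the corrupted data.

```bash
# Identify the down/inactive PGs, then try to export one from a stopped OSD
# so it could be imported elsewhere (IDs and paths are illustrative).
ceph health detail
ceph pg dump_stuck inactive

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op export --pgid 2.1a --file /root/pg.2.1a.export
```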
With no way to bring the cluster back online and increasing corruption issues, we decided to reprovision everything to get services running again.
Upon reprovisioning, we changed three major things:
- We changed the logical block size on the disks from 512 bytes to 4096 bytes (see the sketch after this list).
- Since the corruption occurred in the small erasure-coded (EC) pool, we decided to discontinue EC.
- Previously, most data was stored with replication-2, and replication-3 was offered as an optional add-on. We have now switched fully to replication-3.
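For the curious, the two mechanical changes look roughly like this; the device, pool name, and LBA-format index are illustrative rather than our exact configuration, and reformatting an NVMe namespace erases it.

```bash
# Reformat an NVMe namespace to a 4096-byte logical block size.
# WARNING: this erases the namespace. `nvme id-ns -H` shows which --lbaf
# index corresponds to the 4K format (the index 1 here is illustrative).
nvme id-ns /dev/nvme0n1 -H
nvme format /dev/nvme0n1 --lbaf=1

# Create a replication-3 pool and enforce size 3 / min_size 2.
ceph osd pool create rbd-data 128 128 replicated
ceph osd pool set rbd-data size 3
ceph osd pool set rbd-data min_size 2
```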
Once the cluster was reprovisioned, we restored backups and addressed the 200 tickets in our queue related to the offline cluster. I personally stayed awake for about 35–40 hours working on the issue.
By March 7, most of our customers were back online with backups restored. They lost one day of progress because we only run scheduled backups once per day. Sadly, some of our customers did not have automatic scheduled backups, so they lost all of their data. This was very upsetting for us.
One major oversight was not automatically enabling backup schedules for our customers, despite backups being included at no extra cost. Customers had to manually choose which days their backups should run. Generally, we consider ourselves thorough with backups, but this time, we fell short.
Looking back, we believe this could have been avoided if we had handled things differently. As of today, we are implementing changes to ensure it does not happen again.
Later, our developer identified the root cause: a critical bug in Ceph Squid 19.2.
The bug is triggered when new OSDs are added to a cluster that contains an erasure-coded pool. Others have reported similar issues over the past one to two weeks.
Because we have moved everything away from erasure coding 4+2 to replication-3 for redundancy, we are confident this specific failure mode cannot recur.
Additionally, to reduce the chance of hitting such a bug again, we will always stay one major version behind: if the current release is 19.2.1, we will stay on 18.2.4 until version 20 is released, and only then upgrade to 19.x. This gives the newer version more time to be battle-tested by others first.
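Assuming Debian/Ubuntu-based hosts, one simple way to enforce this policy is to hold the Ceph packages so a routine upgrade cannot pull in a newer release; the package list below is illustrative, and upgrades then have to be performed deliberately.

```bash
# Hold the installed Ceph packages at their current (18.2.x) versions so that
# `apt upgrade` cannot move them to a newer release; upgrades must then be
# done deliberately after `apt-mark unhold`.
apt-mark hold ceph ceph-common ceph-osd ceph-mon ceph-mgr
```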
I am truly sorry about this entire situation—customers lost data due to our failure, and it was compounded by not having automatic backups enabled. We know how awful it would feel if our own vital data was lost.
Going forward, we will hire consultants for every change we make to our cluster. We have already reached out to Croit (https://www.croit.io/solutions/ceph-storage-solution) to sign a contract. Our team will receive training, and we plan to put third-party SLAs in place for emergency support.
We fully understand that as a customer, you may be angry and feel the damage is already done. We are deeply sorry. We’re taking every step possible to strengthen our procedures so nothing like this happens again, and we remain committed to providing a reliable storage solution moving forward.