It’s late on a clear and cold Saturday night. You had a few with friends at the pub and now you’re heading home. You can’t wait to get back and snuggle with the cats. And then the email comes in.
“We have a problem. There’s something wrong with the database clusters’ latencies. I’m not sure what’s going on but the tail keeps rising. If this continues for much longer, we are going to be in violation of our SLAs and the brown stuff will hit the fan. Can you take an urgent look?”
Sigh, the cats will have to wait. Good thing you didn’t go for that last round at the pub. Let’s see. The database clusters look OK, no nodes have failed recently, CPU utilization is fine, query processing times are within acceptable bounds. So what the hell is going on?
And then you see it. SSDs from the new batch are failing on some of the nodes. When they fail, Linux resets them and they come back, until they fail again. They have been jittering like this for the last few hours, slowing down the nodes they’re on. And every time an SSD fails, latency on that node spikes, dragging the cluster’s entire tail latency up with it. It’s either kill those nodes, cutting capacity to a dangerous level, or make a midnight trip to the data center to deal with those drives.
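That spike is worth a second look, because the arithmetic is unforgiving: even when only one node in sixteen is limping, the handful of requests that land on it are more than enough to own the cluster’s 99th-percentile latency. The sketch below is a toy simulation of that effect, with made-up numbers (a hypothetical 1 ms mean on a healthy node, 50 ms while an SSD is being reset, 16 nodes with one affected); it is only meant to illustrate why a single flapping drive shows up in the cluster-wide tail, not to model any real system.

```python
import random
import statistics

# Toy model with made-up numbers, purely to illustrate the tail-latency effect.
NODES = 16            # nodes serving requests
BASELINE_MS = 1.0     # mean per-request latency on a healthy node
DEGRADED_MS = 50.0    # mean latency on the node whose SSD keeps resetting
REQUESTS = 100_000

def request_latency(degraded_node=None):
    """Latency of one request, routed uniformly across the cluster."""
    node = random.randrange(NODES)
    mean = DEGRADED_MS if node == degraded_node else BASELINE_MS
    return random.expovariate(1.0 / mean)   # exponential jitter around the mean

def p99(samples):
    return statistics.quantiles(samples, n=100)[98]   # 99th percentile

random.seed(1)
healthy = [request_latency() for _ in range(REQUESTS)]
one_bad_drive = [request_latency(degraded_node=0) for _ in range(REQUESTS)]

print(f"p99, all nodes healthy:    {p99(healthy):6.1f} ms")
print(f"p99, one drive resetting:  {p99(one_bad_drive):6.1f} ms")
```

In this toy model, barely six percent of requests ever touch the bad drive, yet the cluster-wide p99 jumps by more than an order of magnitude. That is exactly the kind of number that gets someone paged on a Saturday night.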
As you navigate the quiet, empty streets on the way to the data center, it occurs to you: drives have always failed and will keep failing. What you need is for those drives to just fail in place while everything else keeps working, no slowdown, no tail latency increase. Then you could be home with the cats right now. You keep driving.
At home, if anyone were listening, they might or might not hear the cats quietly meowing “LightOS… use LightOS. From Lightbits. Coming soon.”