Muli Ben-Yehuda's journal

March 4, 2019

Data center vignettes #5

Filed under: Uncategorized — Muli Ben-Yehuda @ 8:55 AM

It’s Monday. The rain drums on the roof. You are worried about next week’s upgrade to the storage servers. They are struggling. Linux, bless its little heart, is not keeping up with the new NVMe SSDs. With RAID5 and compression, it’s slower than a three-legged turtle. The plan is to put in stronger CPUs and more RAM across the entire storage server fleet.

You sit up straight. An idea has just occurred to you. This could be big. Really big. You know how no one does machine learning in software anymore? How the big cloud guys build custom ASICs? They do it because at scale, a 20% reduction in CPU utilization is huge.

What if you could achieve the same efficiency as the big guys for your storage servers? What if there were a way to accelerate common storage operations in hardware and offload them from the CPU? The performance improvements would be nice, but the TCO savings, beginning with avoiding that messy data-center-wide upgrade next week, are going to be huge. It will delight your boss. And your CFO.

As the rain continues drumming, you realize that you don’t need to build the storage accelerator. Lightbits already built it. You put that upgrade on hold and give them a call to order a batch of LightFields. Your day just got a whole lot better. Even the rain has stopped.




March 2, 2019

Data center vignettes #4

Filed under: Uncategorized — Muli Ben-Yehuda @ 11:25 PM

It’s late on a clear and cold Saturday night. You had a few with friends at the pub and now you’re heading home. You can’t wait to get back and snuggle with the cats. And then the email comes in.

“We have a problem. There’s something wrong with the database clusters’ latencies. I’m not sure what’s going on but the tail keeps rising. If this continues for much longer, we are going to be in violation of our SLAs and the brown stuff will hit the fan. Can you take an urgent look?”

Sigh, the cats will have to wait. Good thing you didn’t go for that last round at the pub. Let’s see. The database clusters look OK, no nodes have failed recently, CPU utilization is OK, query processing time is within acceptable bounds. What the hell is going on?

And then you see it. Some of the new batch of SSDs are failing on some of the nodes. When they fail, Linux resets them and they come back until they fail again. They have been jittering for the last few hours, slowing down the nodes they’re on. And every time an SSD fails, the latency on that node spikes, dragging up the tail latency of the entire cluster. It’s either kill those nodes, reducing capacity to a dangerous level, or make a midnight trip to the data center and take care of those drives.
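To see why one jittering node is enough to move the whole cluster's tail, here is a minimal sketch in Python. Every number in it is made up for illustration (ten nodes, 100 requests each, 1 ms when healthy, 50 ms while a drive resets), and `percentile` is a hypothetical nearest-rank helper, not something taken from a real monitoring stack.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # ceil(pct * n / 100), done in integer math to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# All values below are illustrative assumptions, not measurements.
NODES = 10
REQUESTS_PER_NODE = 100
HEALTHY_MS = 1.0      # latency of a request on a healthy node
RESET_MS = 50.0       # latency of a request stuck behind an SSD reset
SLOW_REQUESTS = 20    # requests on the flaky node that hit a reset

# Baseline: every request on every node is fast.
healthy_cluster = [HEALTHY_MS] * (NODES * REQUESTS_PER_NODE)

# Same cluster, but one node has a drive that keeps resetting.
flaky_node = ([RESET_MS] * SLOW_REQUESTS
              + [HEALTHY_MS] * (REQUESTS_PER_NODE - SLOW_REQUESTS))
flaky_cluster = [HEALTHY_MS] * ((NODES - 1) * REQUESTS_PER_NODE) + flaky_node

print("healthy p99:", percentile(healthy_cluster, 99))
print("flaky   p50:", percentile(flaky_cluster, 50))
print("flaky   p99:", percentile(flaky_cluster, 99))
```

Under these assumptions only 2% of all requests are slow, so the median does not move at all, but the 99th percentile lands squarely on the slow ones. That is roughly the shape of the problem in the email: the averages look fine while the tail keeps rising.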

As you navigate the quiet and empty streets on the way to the data center, it occurs to you. Drives have always failed and will continue failing. What you need is for those drives to just fail in place while everything continues working, no slowdown, no tail latency increase. Then you could be home with the cats right now. You keep driving.

At home, if anyone were listening, they might or might not hear the cats quietly meowing “LightOS… use LightOS. From Lightbits. Coming soon.”

March 1, 2019

Data center vignettes #3

Filed under: Uncategorized — Muli Ben-Yehuda @ 10:36 AM

It’s Friday. You’ve been hacking on this cool bit of code for a while. It will be such a pleasure to deploy and see the user engagement numbers go up. The CI is green. The code is tight. You take a deep breath and deploy.

Ten minutes later, everything is fine. Ten minutes after that, still good. An hour passes. You check out the Grafana dashboard, and everything looks OK, except… why does the size of one of your key data stores continually increase with the new code?

This is not yet an emergency but it will become one if it keeps up. Each of your servers is limited to two SSDs. The infrastructure guys wanted to keep SKU sprawl to a minimum, and most of the CPU cycles are used for computation anyway, so they decided to “right size” the storage on each server to two SSDs per node. You crunch some numbers and realize that the data store is going to stop growing and stabilize — exactly 100GB after it exhausts all available space on your servers. You curse and roll back the code until they can install more SSDs, sometime next decade.
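The number crunching might look something like the sketch below, in Python. It assumes the data store grows linearly, and every figure in it (drive size, current usage, growth rate, eventual stable size) is a hypothetical stand-in, since the real ones would come from your Grafana dashboard. The function names are made up for illustration too.

```python
def hours_until_full(capacity_gb, used_gb, growth_gb_per_hour):
    """Hours until the disks fill up, assuming linear growth."""
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing, so it never fills up
    return max(capacity_gb - used_gb, 0) / growth_gb_per_hour


def shortfall_gb(capacity_gb, stable_size_gb):
    """How much bigger the data store wants to be than the disks allow."""
    return max(stable_size_gb - capacity_gb, 0)


# Hypothetical numbers: two 4000 GB SSDs per server, 6000 GB already used,
# the new code adding 50 GB per hour, leveling off at 8100 GB.
SSDS_PER_SERVER = 2
SSD_SIZE_GB = 4000
capacity = SSDS_PER_SERVER * SSD_SIZE_GB

print("hours until full:", hours_until_full(capacity, 6000, 50))
print("shortfall in GB: ", shortfall_gb(capacity, 8100))
```

With these made-up inputs the server fills up in well under two days, and the data store would have leveled off just past the limit. A linear model is the simplest thing that could work here; if growth is bursty or tapering, the estimate is only a rough lower or upper bound.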

Wouldn’t it have been nice if there were no limit on the amount of storage your application could use, while still enjoying the benefits of direct attached SSDs? Enter LightOS, coming soon from Lightbits Labs.
