It’s a slow Wednesday afternoon. The rain drips outside, collecting in large puddles. The data centers are humming along, the developers are drinking coffee and writing code, the customers’ orders keep coming in through the websites. All is well in the world of Foo Corp’s infrastructure.
Suddenly, PagerDuty starts paging. The dashboards are turning red. There has been a massive spike in demand, and Foo Corp’s databases are struggling to meet it. Request latency is shooting through the roof. Demand keeps climbing, and the systems are unable to handle it.
The ops team jolts into action and the database guys start flooding the relevant Slack channels. In a few minutes, you see what happened: the storage systems everything is built on are no longer serving storage. It might be a network issue with the expensive RDMA network you put in; it might be an issue with the new NVMe SSDs you bought that take a looong tiiiime to run their garbage collection cycles. Maybe you’ll figure it out later and write a nice postmortem no one will read. But right now, whatever it is, it’s painful to leave customer orders on the floor because the infrastructure just can’t serve them.
Sounds painful? We think so too. Good thing Lightbits LightOS is coming soon.