Online gaming and metaverse platform Roblox is expanding its infrastructure in the wake of a 73-hour outage in October that left its 50 million daily users offline.
The company will add a data center and expand its availability zones, hoping to address the October downtime. The company’s CEO said a key factor in the outage was “the growth in the number of servers in our data centers.” The downtime cost Roblox an estimated $25 million in lost bookings, the company said.
The outage was reviewed in an incident report released last week, which outlined how several software services contended for resources, making it harder to diagnose a bug in a database. The incident illustrates how the growing complexity of online applications can make automated infrastructure harder to troubleshoot, leading to lengthier outages.
The Roblox downtime was one of several extended cloud-level outages in 2021, which are driving a renewed focus on reliability engineering for complex infrastructures. DCF highlighted this issue in our 2022 Forecast, noting that “uptime is becoming more complex, requiring backup and failover strategies that span cloud, colo, on-premise facilities and edge infrastructure.”
The expansion by Roblox also underscores how metaverse-style applications will rely on significant amounts of infrastructure – a reality that could generate additional demand for digital infrastructure like data centers and network connectivity, as was noted in our recent Data Center Executive Roundtable (The Metaverse Will Need A Lot of Data Centers).
The Challenges of Growth
Roblox is an online platform for games and virtual experiences, available across multiple operating systems and devices. Along with Minecraft and Fortnite, it has been cited as an early example of a metaverse – a collection of virtual worlds, landscapes and characters available through an immersive online environment.
Roblox is free to download and play, but operates a large in-world economy based on the Robux currency. More than 9.5 million developers have deployed games and apps, and can make money by selling items (such as clothing or avatars) in online storefronts. Roblox went public through a direct listing last year, and had $509 million in revenue in the third quarter of 2021.
That’s why the lengthy outage in October became a significant business event for Roblox, which reported that the downtime led to $25 million in lost bookings and prompted the company to issue $6.8 million in credits to developers as compensation for lost sales.
“This was not due to any peak in external traffic or any particular experience,” founder and CEO David Baszucki wrote on Oct 31. “Rather the failure was caused by the growth in the number of servers in our data centers. The result was that most services at Roblox were unable to effectively communicate and deploy.”
Roblox operates several data centers with more than 18,000 servers, which support more than 170,000 software containers. The company uses a software suite from HashiCorp to manage its infrastructure. The software issues that contributed to the Roblox outage are complex, but one clear response was the need for more diverse infrastructure.
“Running all Roblox backend services on one Consul (service mesh) cluster left us exposed to an outage of this nature,” Roblox engineer Daniel Sturman wrote in a detailed blog post. “We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services.
“We have efforts underway to move to multiple availability zones within these data centers,” he added. “We have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.”
Within the last month, Roblox has advertised for a new data center manager position based in Ashburn, the leading cloud and connectivity hub in Northern Virginia.
Uncovering A ‘Pathological Performance Issue’
As is the case in many extended outages, the Roblox issues were difficult to diagnose because of confusion about the root cause of the problems. This passage from the incident report provides a high-level description:
“The root cause was due to two issues. Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance. In addition, our particular load conditions triggered a pathological performance issue in BoltDB. The open source BoltDB system is used within Consul to manage write-ahead-logs for leader election and data replication.
- A single Consul cluster supporting multiple workloads exacerbated the impact of these issues.
- Challenges in diagnosing these two primarily unrelated issues buried deep in the Consul implementation were largely responsible for the extended downtime.
- Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.
For additional technical details, see the Roblox incident report.
Even as Roblox works to improve and expand its infrastructure, the service continues to experience periodic shorter outages, most recently on Saturday.