
Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.¹ Fifty million players regularly use Roblox every day and, to create the experience our players expect, our scale involves hundreds of internal online services. As with any large-scale service, we have service interruptions from time to time, but the extended length of this outage makes it particularly noteworthy. We sincerely apologize to our community for the downtime.

We’re sharing these technical details to give our community an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. We would like to reiterate there was no user data loss or access by unauthorized parties of any information during the incident.

Roblox Engineering and technical staff from HashiCorp combined efforts to return Roblox to service. We want to acknowledge the HashiCorp team, who brought on board incredible resources and worked with us tirelessly until the issues were resolved.

The outage was unique in both duration and complexity. The team had to address a number of challenges in sequence to understand the root cause and bring the service back up.

- Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance (see the sketch after this list).
- In addition, our particular load conditions triggered a pathological performance issue in BoltDB. The open source BoltDB system is used within Consul to manage write-ahead-logs for leader election and data replication.
- A single Consul cluster supporting multiple workloads exacerbated the impact of these issues.
- Challenges in diagnosing these two primarily unrelated issues buried deep in the Consul implementation were largely responsible for the extended downtime.
- Critical monitoring systems that would have provided better visibility into the cause of the outage relied on affected systems, such as Consul. This combination severely hampered the triage process.
- We were thoughtful and careful in our approach to bringing Roblox up from an extended fully-down state, which also took notable time.
- We have accelerated engineering efforts to improve our monitoring, remove circular dependencies in our observability stack, and accelerate our bootstrapping process.
- We are working to move to multiple availability zones and data centers.
- We are remediating the issues in Consul that were the root cause of this event.
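
For context on where that read load comes from: Consul clients typically discover services through blocking (“long poll”) queries, and the streaming backend mentioned above replaces that long-poll mechanism with server-pushed updates. The Go sketch below shows a minimal, hypothetical watch loop of this kind; the service name, timings, and agent address are illustrative assumptions, not Roblox’s actual configuration.

```go
package main

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default address); error handling kept minimal.
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query: the call returns when the set of healthy instances changes,
		// or when WaitTime elapses. "player-session" is an illustrative service name.
		entries, meta, err := client.Health().Service("player-session", "", true,
			&consul.QueryOptions{WaitIndex: lastIndex, WaitTime: 5 * time.Minute})
		if err != nil {
			log.Printf("consul query failed: %v", err)
			time.Sleep(time.Second) // brief back-off before retrying
			continue
		}
		lastIndex = meta.LastIndex
		log.Printf("%d healthy instances of player-session", len(entries))
	}
}
```

At the scale described below, many such watch loops running concurrently can account for a large share of the read load on the Consul servers.
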
Preamble: Our Cluster Environment and HashiStack

Roblox’s core infrastructure runs in Roblox data centers. We deploy and manage our own hardware, as well as our own compute, storage, and networking systems on top of that hardware. The scale of our deployment is significant, with over 18,000 servers and 170,000 containers.

In order to run thousands of servers across multiple sites, we leverage a technology suite commonly known as the “HashiStack.” Nomad, Consul and Vault are the technologies that we use to manage servers and services around the world, and that allow us to orchestrate containers that support Roblox services.

Nomad is used for scheduling work. It decides which containers are going to run on which nodes and on which ports they’re accessible.

To make Nomad’s scheduling role concrete, the following is a minimal sketch of a job submitted through Nomad’s Go API client, assuming a local Nomad agent; the job name, container image, and port label are hypothetical and not taken from Roblox’s actual workloads.

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	// Talk to the local Nomad agent; all names below are illustrative, not Roblox's.
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// One service job with one task group running one Docker container.
	job := nomad.NewServiceJob("example-web", "example-web", "global", 50)
	job.AddDatacenter("dc1")

	group := nomad.NewTaskGroup("web", 1)
	// Ask the scheduler for a dynamically assigned host port labeled "http".
	group.Networks = []*nomad.NetworkResource{
		{DynamicPorts: []nomad.Port{{Label: "http"}}},
	}

	task := nomad.NewTask("server", "docker").
		SetConfig("image", "nginx:1.25").
		SetConfig("ports", []string{"http"})
	group.AddTask(task)
	job.AddTaskGroup(group)

	// Nomad decides which node runs this container and which host port backs "http".
	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("submitted, evaluation:", resp.EvalID)
}
```

In this sketch, Nomad evaluates the job, places the container on a node with available resources, and assigns a host port for the “http” label, which is how containers end up on particular nodes and ports.