Pokémon GO: Architecture of the #1 AR Game in the World

How Pokémon GO started, its architecture, and the individual components driving this popular game.

Young And Old Enjoying Pokémon GO

My mother-in-law is absolutely obsessed with this game.

As a developer, I had a burning desire to understand how Pokémon GO started, its architecture, and the individual components driving this popular game.

This is Pokémon GO as I know it.

Origin Story

In the last 30 days, there have been more than 81 million active players. Astronomical numbers, but it didn’t start off this way. In fact, Pokémon GO is the result of many iterations.

Niantic, the company behind Pokémon GO, was founded by John Hanke and Phil Keslin. Not only were they Google veterans, but they were originally the co-founders of a company called Keyhole.

Keyhole was the company Google acquired to start Google Earth and later Google Maps. The core technology behind Keyhole was the ability to stream large databases of mapping data across the internet.

As experts in 3D mapping, John and Phil went on to leverage this technology to create Niantic’s first augmented-reality multiplayer game, Ingress. Ingress played a crucial role in the development of Pokémon GO by forming the pool of locations that later became the game’s PokéStops and gyms.

A combination of historical markers, geo-tagged photos from Google, and popular player-submitted places were chosen as the starting point for Pokémon GO.

High-Level Architecture

Today, Pokémon GO handles millions of player requests per second through everyday gameplay.

With its founders and many early engineers coming from Google, Google Cloud Platform (GCP) was the logical choice for the game’s cloud infrastructure.

Pokémon GO: High-Level Architecture

At a high level, Pokémon GO’s architecture consists of the following key components and services:

  • Google Kubernetes Engine (GKE): Manages and scales workloads and services needed for the game.

  • Cloud Load Balancing: Distributes traffic across Kubernetes clusters and services; the entry point for all requests.

  • Cloud CDN: Caches and serves content to users.

  • Spanner: Stores player, map and Pokémon data.

  • Pub/Sub: Handles event data used for analytics.

  • Bigtable: Used for logging and tracking user actions.

  • Dataflow: Processes player logs.

  • BigQuery: Stores game data for analyzing player behavior and verifying features.

  • Frontend Service: Handles player interactions in the app.

  • Spatial Query Backend: Caches map data based on location.

In order to handle such high traffic, the engineering team leveraged two key GCP services: Google Kubernetes Engine (GKE) and Spanner.

GKE is a Kubernetes cluster manager and orchestration system for running Docker containers. Not only did Google create Kubernetes, but GKE implements the full Kubernetes API, with robust autoscaling and support for clusters of up to 15,000 nodes!

With cluster autoscaling, the cluster’s node pool automatically resizes based on workload demands. Through GKE, pods can scale out horizontally based on CPU utilization or custom metrics (e.g., high network traffic), and can also scale up vertically based on CPU and memory usage.
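As a sketch of the horizontal piece, here’s roughly what a CPU-based autoscaling rule could look like through the official Kubernetes Python client. The deployment name, replica bounds, and threshold are hypothetical, not Niantic’s actual configuration.

```python
# Hypothetical sketch: a HorizontalPodAutoscaler that scales a game-server
# deployment on CPU utilization (kubernetes Python client, autoscaling/v2).
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="game-frontend-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="game-frontend",
        ),
        min_replicas=10,
        max_replicas=1000,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    # Add pods when average CPU utilization exceeds 70%.
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=70
                    ),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```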

From Datastore to Spanner

Originally, Niantic decided to store Pokémon GO data in Datastore (since succeeded by Firestore), a fully managed NoSQL database service. This managed service allowed the engineering team to meet the initial demands of Pokémon GO.

Though Datastore is considered highly scalable and can theoretically handle up to 740k operations per second, that figure assumes a ramp-up schedule following the 500/50/5 rule: start at a maximum of 500 operations per second, then increase traffic by no more than 50% every 5 minutes.
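To put that rule in perspective, here’s a quick back-of-the-envelope calculation of how long a compliant ramp-up takes to approach that theoretical ceiling:

```python
# Back-of-the-envelope: starting at 500 ops/sec under the 500/50/5 rule,
# how long until traffic may exceed Datastore's theoretical ~740k ops/sec?
rate, minutes = 500.0, 0
while rate < 740_000:
    rate *= 1.5   # grow traffic by at most 50%...
    minutes += 5  # ...every 5 minutes
print(f"~{minutes} minutes (~{minutes / 60:.1f} hours) of ramp-up")
# -> roughly 95 minutes of sustained, rule-abiding growth
```

In other words, even under ideal conditions, Datastore needs about an hour and a half of gradual ramp-up to absorb that peak, a poor fit for traffic that spikes in minutes.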

Based on these limitations as well as the natural growth of the game, the engineering team decided to migrate to another managed service called Spanner.

What’s particularly interesting is the transition from a NoSQL database to a relational database, most likely driven by the need for global consistency and strong transactional guarantees. Spanner also boasts unlimited write and read scalability by decoupling compute resources from data storage.

At any given time, there are more than 5,000 Spanner nodes and thousands of Kubernetes nodes running for Pokémon GO. And that’s not counting the GKE nodes needed for the other microservices that augment the gaming experience.

Navigating the Pokémon World

When a user first opens the Pokémon GO app, all static media such as Pokémon images and music files are downloaded to the user’s phone.

Photo by David Grandmougin

These files are stored in Cloud Storage and also cached with Cloud CDN at the Cloud Load Balancing level. By caching these files, content loads faster, traffic to the web servers is reduced, and the overall user experience is improved.
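As an illustration, making an asset cacheable is largely a matter of setting its Cache-Control metadata in Cloud Storage; the bucket and object names below are hypothetical:

```python
# Hypothetical sketch: marking a static asset in Cloud Storage as publicly
# cacheable so Cloud CDN (fronting the load balancer) can serve it from edge.
from google.cloud import storage

bucket = storage.Client().bucket("pgo-static-assets")  # made-up bucket name
blob = bucket.blob("sprites/pikachu.png")              # made-up object name

# Cache-Control tells Cloud CDN (and browsers) this object may be cached
# publicly for up to 24 hours.
blob.cache_control = "public, max-age=86400"
blob.patch()  # persist the metadata change
```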

As users walk around in the physical world, the app uses the phone’s GPS to determine their location. This is important because a player’s location determines the Pokémon they see on the map, as well as the PokéStops, gyms, and landmarks around them.

All of the mapping data is fetched from Spanner and handled by a service called the Spatial Query Backend. This service caches all of the mapping data, which is sharded based on location.
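The article’s sources don’t detail Niantic’s exact sharding scheme, but a common way to key geospatial data is with S2 cells, Google’s hierarchical grid over the sphere. Here’s a purely illustrative sketch using the s2sphere library:

```python
# Illustrative only: deriving a location-based shard/cache key from an S2 cell.
# This shows the general technique, not Niantic's actual scheme.
import s2sphere

def shard_key(lat: float, lng: float, level: int = 12) -> int:
    """Map a GPS coordinate to a coarse S2 cell ID usable as a shard key."""
    point = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(point).parent(level).id()

# Nearby players typically land in the same cell, so their map queries
# hit the same shard's cache.
print(shard_key(40.7580, -73.9855))  # Times Square
print(shard_key(40.7587, -73.9860))  # a short walk away -> usually the same cell
```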

As the user interacts with the app, requests are reverse proxied through NGINX to the frontend service, where the appropriate data is served back to the user, either from Spanner or from cache.

In a nutshell, the frontend handles player interactions within the game, while the Spatial Query Backend handles the geospatial data needed to render the map, what the user sees (e.g. Pokémon, Gyms, etc), and any other features that are location-based.

So what exactly happens behind the scenes when a user tries to catch a Pokémon?

Catching a Pokémon

Pokémon can be seen on the phone through Augmented Reality (AR). Once the user interacts with a Pokémon and catches it, an event is sent from the frontend service to Spanner through the Spanner API.

Catching Pokémon in AR+ Mode

This is where the player entity and player data are normally stored. The difference is that data pertaining to catching a Pokémon is not cached in the Spatial Query Backend, unlike the data for gym battles and PokéStops.

These events are handled as write requests, and since Spanner is strongly consistent, all nodes see the same data at the same time, regardless of which node is accessed. As a result, all subsequent read operations from any node will return the most recent write value.
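As a rough illustration, a catch event recorded through the Python Spanner client might look something like the following. The instance, database, table, and column names are all hypothetical:

```python
# Hedged sketch: recording a catch event as a transactional Spanner write.
# All identifiers below are made up for illustration.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("pgo-instance").database("pgo-db")

def record_catch(transaction):
    # Committed atomically; because Spanner is strongly consistent, any
    # subsequent read from any node will see this row.
    transaction.insert(
        table="CaughtPokemon",
        columns=("PlayerId", "PokemonId", "CaughtAt"),
        # CaughtAt is assumed to be a column with allow_commit_timestamp=true.
        values=[("player-123", "pikachu-025", spanner.COMMIT_TIMESTAMP)],
    )

database.run_in_transaction(record_catch)
```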

For gym battles and PokéStops, once an update is received, the spatial data is updated in memory for caching, and then used to serve future requests from the frontend service. It’s important to note that these updates apply not just to a single user, but to all players in the same location.

All user actions are also stored in Bigtable for logging and tracking (e.g. business telemetry), and messages are published to GCP’s Pub/Sub messaging service for analytics.
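A hedged sketch of what publishing one of these analytics events to Pub/Sub could look like with the Python client; the project, topic, and payload fields are made up:

```python
# Hypothetical sketch: publishing a user action to a Pub/Sub topic feeding
# the analytics pipeline (Dataflow -> BigQuery). Names are illustrative.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("pgo-project", "player-actions")

event = {"player_id": "player-123", "action": "catch", "pokemon": "pikachu"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print(f"published message {future.result()}")  # blocks until the publish succeeds
```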

Live Events

In addition to normal gameplay, there are regular live events. Events like Raid Battles can see transactions jump from 400k per second to close to a million within minutes.

Players Teaming Up For Raid Battles

Niantic found that their GCP servers would get overloaded during the initial phase of these events, when players were forming teams. This was due to many players trying to join the Raid lobby in preparation for the event, causing a large influx of traffic.

Even though Pokémon GO operates in a multi-server environment and players are typically evenly distributed across all the servers, Raids require all players to be on the same server in order for everyone to access the game data. As a result, a large amount of data needs to be shared across the team of players participating in the Raid, leading to increased server traffic and latency.

Not only does this cause lag and delays for players joining the Raid, but also for other players in the vicinity who may not even be participating, simply because they’re on the same server.

So what can be done to handle live events that require high volumes of data? Enter Redis.

Niantic initially designed Pokémon GO around a stateful architecture, making scaling and server restarts challenging.

Since Pokémon GO runs on GKE, nodes need to be cordoned and existing sessions allowed to expire before servers can be restarted to add more players. Cordoning is a Kubernetes operation that marks a node as unschedulable; it’s usually done when a node needs to be taken offline for maintenance or an upgrade.
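For reference, here’s roughly what the cordon step looks like programmatically with the Kubernetes Python client (the equivalent of `kubectl cordon`); the node name is hypothetical:

```python
# Sketch of cordoning a node: mark it unschedulable so no new pods (and thus
# no new player sessions) land on it, while existing pods drain naturally.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

v1.patch_node("gke-pgo-node-example", {"spec": {"unschedulable": True}})
```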

Not only is this process time-consuming, sometimes taking as long as 30 minutes, but it’s costly as well. The architecture team realized that to improve the player experience, the Pokémon GO servers needed to be stateless. This would allow them to scale up and down quickly to match increased load.

The team chose Redis on Google Cloud as their solution, providing zero-downtime scaling.

With the new Raid architecture, players no longer need to be on the same server. The data is stored in a centralized Redis cache, where all servers have access to the same information.

As players join a Raid lobby, data such as player information and the time for each group is stored as JSON keys in Redis. Statistics such as player locations, tendencies, and performance are also stored there.

Because Raid events typically last only 15 minutes, there’s no need for persistent data storage, and all data can be cached in memory.
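Here’s a minimal sketch of that pattern with redis-py. The key layout and fields are assumptions rather than Niantic’s actual schema (they may well use RedisJSON instead of serialized strings):

```python
# Hedged sketch: a shared Raid lobby in Redis with a TTL matching the event.
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

lobby = {
    "raid_id": "raid-8675309",
    "players": [{"id": "player-123", "team": "Mystic"}],
    "starts_at": "2024-01-01T12:00:00Z",
}

# Any game server can read the same lobby state, so players no longer need
# to share a server. The 15-minute TTL matches a Raid's lifetime, so nothing
# needs to hit persistent storage.
r.set("raid:lobby:raid-8675309", json.dumps(lobby), ex=900)

state = json.loads(r.get("raid:lobby:raid-8675309"))
```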

Max Latency Decreased from 1 Second to ~250 Milliseconds

Not only have server hot spots been reduced significantly, but max latency has dropped from over 1 second to around 250 milliseconds. This is a 75 percent reduction!

Niantic not only greatly improved the player experience, but did so while maintaining high availability and performance and reducing operational costs.

Real-Time Experience

When Raids were introduced into the game, an influx of players wanted to participate. Since these events happen in real time, delayed notifications are unacceptable: players could miss the event entirely. With an average notification latency of 9 seconds, a better real-time notification system was needed.

Notification System: High-Level Architecture

Originally, the Pokémon GO client would poll the player’s message inbox every 15 seconds to retrieve new notifications. This led to redundant polling and, more importantly, the possibility of high latency if a notification was sent right after the client polled.

The team decided to pivot to a new pub-sub service called PushGateway. This service sits between the backend servers and the game clients, pushing messages to clients over WebSockets in real time.
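PushGateway itself isn’t public, but the underlying mechanism of holding a WebSocket open per client and pushing to it is straightforward to sketch. A minimal, purely illustrative server using the Python websockets library:

```python
# Illustrative only: server-side WebSocket push, the general mechanism behind
# a service like PushGateway. Not Niantic's implementation.
import asyncio
import json
import websockets

connected = set()  # one entry per connected game client

async def handler(websocket):
    connected.add(websocket)
    try:
        await websocket.wait_closed()  # keep the connection open
    finally:
        connected.discard(websocket)

async def push_notification(payload: dict):
    # Deliver to every client immediately -- no 15-second polling window.
    message = json.dumps(payload)
    for ws in connected:
        await ws.send(message)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```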

In-App Notification Latency With PushGateway

These architecture changes led to huge improvements. Pokémon GO’s in-app notification latency dropped from 9 seconds to 1 second, and the load from querying player inboxes for notifications was reduced by 85%.

If you made it this far, thank you for reading! I hope you enjoyed it.

If I made a mistake, please let me know.

Resources

[1] “Optimizing Pokémon GO,” redis.com.
https://redis.com/customers/niantic/.

[2] “How Pokémon GO Scales to Millions of Requests,” cloud.google.com.
https://cloud.google.com/blog/topics/developers-practitioners/how-pokémon-go-scales-millions-requests.

[3] “Improving the Pokémon GO Player Experience with PushGateway,” nianticlabs.com.
https://nianticlabs.com/news/improving-pokemon-go-player-experience?hl=en.

[4] “Using Spanner for Non-Relational Workloads,” cloud.google.com.
https://cloud.google.com/blog/products/databases/using-spanner-for-non-relational-workloads.

[5] “How the Gurus Behind Google Earth Created Pokémon GO,” mashable.com.
https://mashable.com/article/john-hanke-pokemon-go.

[6] “Pokémon GO’s Wild First Year: A Timeline,” theverge.com.
https://www.theverge.com/2017/7/6/15888210/pokemon-go-one-year-anniversary-timeline.

[7] “Pokémon GO Catches $6 Billion in Lifetime Player Spending,” sensortower.com.
https://sensortower.com/blog/pokemon-go-6-billion-revenue.

P.S. If you’re enjoying the content of this newsletter, please share it with your network: https://www.fullstackexpress.io/subscribe
