The Evolution of SoundCloud's Architecture: Part 1

Sponsored by

Stephen here.

I spend my weekends researching, learning, and creating content for this newsletter.

It would mean the world to me if you took a few seconds of your time to check out Clumio.

At least see why they were able to raise $75 million. Even Atlassian is using it for Jira.

Doing so supports my work. Now let’s get on with the show.

With rising costs for Amazon S3 storage and potentially devastating business consequences from data loss, you need a holistic approach to cutting unnecessary spending and guarding against risks. Lawrence Miller, a consultant to multinational corporations who holds numerous networking certifications, has authored a concise volume that lays out the path to success in managing backup and compliance for S3 data lakes.

Remember how difficult it used to be to find music?

People used to use record players. And then cassette players. And then MP3 players.

People pirated music through Napster and LimeWire.

Separate devices had to be carried around in order to listen to music on the go.

And then everything changed.

The Apple iPhone was released in 2007, transforming the way we access and listen to music. iPod, phone, and Internet all-in-one.

This revolutionary product paved the way for companies like SoundCloud.

As engineers, we can learn many valuable lessons by analyzing the evolution of SoundCloud’s architecture over the years.

Scaling is a Luxury Problem

From the start, the engineering team optimized for opportunity.

Instead of designing an architecture that could support millions of users, they started with a simple setup: Ruby on Rails application (called Mothership), Apache web server, and MySQL database.

SoundCloud’s Initial Architecture

Simple, right?

SoundCloud launched in 2008. There was no high availability. In fact, the architecture wasn’t even asynchronous.

If a new comment was posted on a track, communication was blocked until all followers were notified.

What would be the reason for this?

The answer goes back to optimizing for opportunity. They leveraged a simple tech stack the team knew well, and focused on delivering value to their users.

As a result, SoundCloud was able to move fast and build a “sticky” platform with strong product-market fit.

One way of showing this was by eating their own dog food.

From the beginning, SoundCloud’s public API was developed alongside their website.

Third-party applications integrating with SoundCloud used the same exact APIs used by the engineering team’s internal application.

Shortly after, Apache was switched out for Nginx (incremental changes).

SoundCloud Changes Web Servers

Nginx provided better connection pooling and simplified routing configurations between different environments.

The Grocery Store and the Post Office

As SoundCloud grew, traffic grew.

And as traffic grew, there was a growing problem: some workloads took much longer than others (hundreds of milliseconds).

This was a limitation of Nginx at the time. More specifically, of HTTP/1.

In computer networking, there’s something called Head-of-Line (HoL) blocking. This happens when slower requests clog up connections, and other requests are stuck waiting to be processed.

Imagine waiting in line at the grocery store to check out. Because the first customer has a cart full of items, everyone behind them is delayed.

Grocery Store HoL Blocking Example

Today, HoL blocking is partially solved by HTTP/2, which multiplexes requests over a single connection. However, this only addresses blocking at the HTTP level; blocking can still occur at the TCP level.
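To make the effect concrete, here is a toy Python sketch of requests sharing one serial connection. The service times are illustrative, not SoundCloud’s numbers:

```python
def total_wait(service_times_ms):
    """Serial (HTTP/1-style) processing on a single connection:
    each request waits for everything queued ahead of it."""
    waits, elapsed = [], 0
    for t in service_times_ms:
        waits.append(elapsed)  # time spent stuck in line
        elapsed += t
    return waits

# One slow 800 ms request at the head of the line delays
# every fast 10 ms request behind it.
print(total_wait([800, 10, 10, 10]))  # -> [0, 800, 810, 820]
```

The three fast requests each pay for the slow one ahead of them, which is exactly the checkout-line situation above.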

This raises the question: how could the architecture process requests concurrently?

When SoundCloud’s architecture was initially developed (2008), concurrent request processing in Rails was still considered immature.

Instead of investing more time auditing dependencies, the engineering team decided to stay with the existing model:

  1. One request at a time per application server process.

  2. Multiple processes per host.

Even with multiple processes per host, SoundCloud was already experiencing high traffic, and several long-running requests could easily recreate the HoL blocking problem.

For example, with 5 processes instead of 1, the system would theoretically be able to handle an average of 5 times as many slow requests.
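The intuition above can be sketched with a small simulation in which requests are assigned, first-come-first-served, to the earliest-free process. The numbers are illustrative:

```python
import heapq

def finish_times(service_times_ms, workers):
    """Assign each request, in order, to the earliest-free worker."""
    free_at = [0] * workers  # when each worker next becomes free
    heapq.heapify(free_at)
    finishes = []
    for t in service_times_ms:
        start = heapq.heappop(free_at)  # earliest-free worker
        heapq.heappush(free_at, start + t)
        finishes.append(start + t)
    return finishes

six_slow = [500] * 6  # six 500 ms requests
print(finish_times(six_slow, 1))  # -> [500, 1000, 1500, 2000, 2500, 3000]
print(finish_times(six_slow, 5))  # -> [500, 500, 500, 500, 500, 1000]
```

Five processes absorb five slow requests at once, but the sixth still queues: more processes raise the threshold, they don’t remove it.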

Again, picture yourself at the post office, waiting in line for an available worker to help you with your package. Even with several workers, customers with multiple packages raise the chance of delays.

Post Office Example

How could this HoL problem be solved?

The engineering team came to a realization. What they wanted was a system that never queued. Or at least a queue with minimal wait time.

To accomplish this, they had to make sure each Rails application server never received more than one request at a time.

They made the following changes:

  1. Ensured servers were stateless.

  2. Added HAProxy to infrastructure.

  3. Configured backend with a maximum connection count of 1.

With a multi-server queueing model (M/M/c) and HAProxy as the queueing load balancer, any temporary back-pressure would be buffered.
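As a hypothetical haproxy.cfg fragment (the server names, addresses, and timeout values here are invented for illustration), the setup might look like:

```
frontend web
    bind *:80
    default_backend rails

backend rails
    balance leastconn        # prefer the least-busy process
    timeout queue 5s         # buffer temporary back-pressure
    server app1 10.0.0.1:3000 maxconn 1
    server app2 10.0.0.2:3000 maxconn 1
```

With maxconn 1 on each server line, HAProxy holds excess requests in its own queue rather than letting them pile up behind a busy Rails process.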

Again, simple design choices.

Synchronous to Asynchronous

These new changes may have solved the HoL issue, but long-running requests were still a problem in and of themselves.

One example is user notifications.

When a user uploads a new track to SoundCloud, the user’s followers are notified. This may be fine for users with few followers, but for more popular users, there were huge delays.

In fact, the fan-out of notifications would frequently take tens of seconds. These long-running requests needed to become queued background jobs instead.

Remember how everything was still synchronous in this architecture?

Enter RabbitMQ.

Since storage was also growing rapidly for sounds and images, the team decided to offload assets to Amazon S3. Storage scaled nicely, while transcoding compute stayed in Amazon EC2.

To connect everything, RabbitMQ, which implements the Advanced Message Queuing Protocol (AMQP), was used as middleware to manage the lifecycle of these jobs.

“One broker to queue them all.”

Jobs fell into two main categories based on work time:

  1. Interactive - less than 250ms work time

  2. Batch - everything else
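That classification rule is simple enough to sketch in a few lines of Python (the job names and the route helper are hypothetical, not SoundCloud’s actual code):

```python
INTERACTIVE_BUDGET_MS = 250  # threshold from the article

def route(job_name, estimated_ms):
    """Pick a queue for a job based on its expected work time."""
    return "interactive" if estimated_ms < INTERACTIVE_BUDGET_MS else "batch"

print(route("resize-avatar", 120))      # -> interactive
print(route("notify-followers", 8000))  # -> batch
```

Keeping interactive jobs on their own queue prevents a burst of batch work from starving quick, user-visible tasks.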

Identifying Points of Scale

At this point, SoundCloud had hundreds of thousands of users.

In order for the architecture to continue to evolve, the engineering team knew they needed to decouple the read and write paths.

The read and write paths could then be individually optimized.

One area of focus was the widget.

As it turns out, SoundCloud’s highest volume request was a single endpoint delivering data for the widget.

Memcached and Varnish were added to cache the following:

  1. Full pages

  2. DOM fragments

  3. Partially rendered templates

  4. Read-only API responses

These performance improvements solved CPU issues (rendering engine and runtime) in the application tier.
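The caching pattern here is read-through: render once, then serve from cache. A minimal Python sketch, using a plain dict as a stand-in for Memcached:

```python
cache = {}  # stand-in for Memcached / Varnish

def cached_fragment(key, render):
    """Read-through cache: render on a miss, serve hits from cache."""
    if key not in cache:
        cache[key] = render()
    return cache[key]

render_count = 0
def render_widget():
    global render_count
    render_count += 1  # count how often the expensive render runs
    return "<div>widget</div>"

cached_fragment("widget:42", render_widget)
cached_fragment("widget:42", render_widget)
print(render_count)  # -> 1 (second request never hits the renderer)
```

Every cache hit is CPU the Rails rendering tier no longer spends, which is why the list above targets pages, fragments, and templates.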

Another area of focus was the Dashboard, a user’s personalized view of activities.

When the Dashboard receives an update, the appropriate users are notified across all devices and third-party applications.

The read path needed to be optimized for sequential access per user over a time range.

The write path needed to be optimized for random access where one event can affect the indexing of millions of users.

To account for these optimizations, Cassandra was chosen as the storage system and Elasticsearch was chosen to enhance search. These solutions provided persistence and scaling.
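One way such a read path could be modeled in Cassandra is to partition events by user and cluster them by time, so reading a user’s dashboard becomes a sequential range scan. This CQL is a hypothetical sketch, not SoundCloud’s actual schema:

```sql
-- Events partitioned per user, clustered newest-first.
CREATE TABLE dashboard_events (
    user_id    bigint,
    event_time timeuuid,
    payload    text,
    PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Read path: one user's recent activity, a sequential slice.
SELECT payload FROM dashboard_events
WHERE user_id = ? LIMIT 50;
```

The write path fans a single event out into many users’ partitions, which matches the random-access write pattern described above.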

The final monolithic architecture looked like this:

SoundCloud’s Monolithic Architecture

The Story Continues

SoundCloud was known as the “YouTube for audio”, with unique features such as the waveform player, track comments, and collaborations.

The engineering team’s architecture decisions in its early years allowed the company to be adaptive and product-driven.

The key takeaways from SoundCloud’s early success can be summarized in three points:

  1. Optimize for opportunity. Focus on the product.

  2. Scaling is a luxury problem. Architect for growth over time.

  3. Identify points of scale. Define integration points well for organic growth.

In the next part of this series, we’ll learn about SoundCloud’s transition to a microservices architecture, and their challenges of scaling to hundreds of millions of users.

If you made it this far, thank you for reading! I hope you enjoyed it.

If I made a mistake, please let me know.

P.S. If you’re enjoying the content of this newsletter, please share it with your network and subscribe:


