The Evolution of SoundCloud's Architecture: Part 2

This is Part 2 of the SoundCloud series. If you haven’t read Part 1, I encourage you to do so first.

Previously, we learned SoundCloud wasn’t initially built for scale. The platform was built for opportunity.

The team was adaptive and product-driven, and it kept the architecture simple.

To SoundCloud, scaling was a luxury problem. A problem that could be addressed over time.

At this point, SoundCloud was a combination of a large social network and a music distribution platform.

Every minute, about 12 hours of music was uploaded, and the platform served hundreds of millions of users.

The Mothership had been battle-tested under peak load, and great tooling had been built to support the next level of growth.

But as it turns out, technical challenges were only a small piece of the puzzle.

The core issue holding SoundCloud back was its development process.

Visualizing the Development Process

The engineering team at the time was really made up of two core teams:

  1. App team: Responsible for the Mothership and the existing user interface (soon to be replaced by v2).

  2. Web team: Responsible for v2, branded as The Next SoundCloud, a single-page JavaScript web application.

There was a big disconnect as both teams worked in isolation (separate buildings) and communicated mainly through issue trackers and IRC.

In order to identify problems in the development process, the teams used a method called value-stream mapping.

This tool visualizes the critical steps in a process and quantifies the time spent in each stage.

In this case, the steps were the stages a feature had to move through, from idea to delivery.

Value-Stream Mapping: Original Process

With the original process, it took over two months for a feature to go live.

More than half of that time was spent waiting for the feature to be picked up by the next person responsible.

The flow looked something like this:

  1. There’s an idea for a new feature. Spec and screen mockups are written and stored in Google Drive.

  2. Waiting

  3. The design team gets the spec and designs the user experience. A development card is added to the Web team’s Trello board.

  4. Waiting

  5. A front-end engineer converts the design into client-side code using mock data. A story is created in Pivotal Tracker with the necessary API (Rails) changes.

  6. Waiting

  7. A back-end engineer writes the code, integration tests, and any changes needed to get the API live. Card in Trello is updated.

  8. Waiting

  9. A front-end engineer integrates the new back-end changes and gives the green light for deployment.

  10. Deployments are risky and painful, so the App team waits for several features to accumulate before deploying to production. Rollbacks happen frequently due to unrelated code issues.

Out of a grand total of 66 days, only 11 were actually spent on development.
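To see how value-stream mapping makes that split visible, here’s a small Ruby sketch that tallies working time versus waiting time across the stages. The stage names and durations are hypothetical placeholders (the exact breakdown wasn’t published), chosen only so the totals line up with the 66-day and 11-day figures above.

```ruby
# Toy value-stream tally. Stage durations are hypothetical, picked only so
# the totals match the 66-day lead time / 11 days of work described above.
STAGES = [
  { name: "Write spec + mockups",            days: 1,  type: :work },
  { name: "Wait for design",                 days: 10, type: :wait },
  { name: "Design UX",                       days: 2,  type: :work },
  { name: "Wait for front-end",              days: 12, type: :wait },
  { name: "Front-end against mock data",     days: 3,  type: :work },
  { name: "Wait for back-end",               days: 10, type: :wait },
  { name: "Back-end + integration tests",    days: 3,  type: :work },
  { name: "Wait for front-end integration",  days: 9,  type: :wait },
  { name: "Integrate real API",              days: 1,  type: :work },
  { name: "Wait in review queue",            days: 7,  type: :wait },
  { name: "Deploy (batched, risky)",         days: 1,  type: :work },
  { name: "Wait for release batch",          days: 7,  type: :wait },
]

lead_time = STAGES.sum { |s| s[:days] }
work_time = STAGES.select { |s| s[:type] == :work }.sum { |s| s[:days] }
waiting   = lead_time - work_time

puts "Lead time: #{lead_time} days"   # => 66
puts "Work time: #{work_time} days"   # => 11
puts "Waiting:   #{waiting} days (#{(100.0 * waiting / lead_time).round}%)"
```

The point the map makes is brutal: most of the lead time is queueing between hand-offs, not engineering work.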

From 66 Days to 24 Days

In order to improve the development process, fundamental changes were needed.

Value-stream mapping made it clear that letting features accumulate in the main branch before deployment was not ideal.

Instead, the engineering team adopted an Agile Release Train (ART) approach for the Mothership: regardless of how many features were sitting in the main branch, a deployment would happen every day after standup.

As a result, features were deployed to production when they were ready, not several days later.

Another problem was the disconnect between the Web and App teams. Not only were the teams working in isolation, but the back-end developers felt they didn’t have a say in the product.

The process was extremely frontend-driven.

To address this issue head-on, the process was modified to pair a front-end and a back-end engineer on each feature.

Value-Stream Mapping: Pairing Front-End and Back-End Engineers

This change ensured constant communication between the two and made the pair jointly responsible for delivering the feature.

Last but not least, the team added a mandatory code review to the process. Every pull request had to be approved by at least one other engineer before the changes could be merged into the main branch.

The collaboration-driven approach worked so well for the engineering team that it was also adopted by the design and product teams.

Value-Stream Mapping: Final Process

The end result was 24 days.

A 42-day improvement in the time it takes a feature to go from idea to live!

Building Alongside the Mothership

Even though the process improved significantly, the previous diagram still showed features sitting in GitHub for 7 days before being reviewed.

Why would a new feature sit in queue for so long?

The Mothership codebase was massive and complex.

Not only did large changes take a long time to review, they were also extremely risky because of the tightly coupled code that had accumulated over the years.

The first instinct was to create smaller pull requests.

Breaking large code changes into smaller pull requests would make reviews more manageable, but it could also lead to unforeseen architectural mistakes, since no single reviewer would see the change as a whole.

Well, why does a single codebase need to implement so many features and components?

Why not break the monolith into multiple, smaller systems?

Now that the development process had been addressed, it was time to tackle the technical challenges.

Even though the Mothership’s architecture was battle-tested, a new strategy was needed for the team and the platform to continue to scale.

A strategy that could be implemented incrementally and could deliver value right away.

Enter microservices.

Adding Microservices Into the Mix

Instead of splitting up the monolith immediately, the engineering team made the decision to first build new features as microservices.

Nothing new would be added to the Mothership, and any feature in the monolith that needed significant refactoring would be extracted into its own service.

But as more microservices were created, a new problem arose.

Since the Mothership still held the majority of the logic, most microservices had to communicate with it in some shape or form.

A common approach would be to have the microservices access the Mothership’s database directly. The issue with this approach is the public versus published interfaces problem.

Having both the Mothership and microservices share the same tables would make it difficult to make table structure changes down the road.

Instead, the engineering team decided to have the microservices consume the Public API, a published interface.

This meant that SoundCloud’s own internal microservices would behave exactly like the third-party applications integrating with the platform.
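As a rough illustration of what that meant in practice, an internal microservice would talk to the published Public API over plain HTTP with an OAuth token, exactly like a third-party integration would. The client below is a hedged sketch: the /tracks endpoint mirrors SoundCloud’s public API, but the class names, auth handling, and error handling are assumptions, not SoundCloud’s actual internal code.

```ruby
# Hypothetical sketch: an internal microservice fetching a track the same way
# a third-party app would -- through the published Public API, never by
# reading the Mothership's database directly.
require "net/http"
require "json"
require "uri"

class PublicApiClient
  BASE = URI("https://api.soundcloud.com")

  def initialize(oauth_token)
    @oauth_token = oauth_token
  end

  def track(id)
    get("/tracks/#{id}")
  end

  private

  def get(path)
    uri = BASE.dup
    uri.path = path
    req = Net::HTTP::Get.new(uri)
    req["Authorization"] = "OAuth #{@oauth_token}"  # same auth as external clients

    res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
    raise "Public API error: #{res.code}" unless res.is_a?(Net::HTTPSuccess)

    JSON.parse(res.body)
  end
end

# Usage: an internal service looks up track metadata like any external
# integration would. The token and track id are placeholders.
# client = PublicApiClient.new(ENV.fetch("SC_OAUTH_TOKEN"))
# puts client.track(123)["title"]
```

Treating internal consumers as just another API client kept the contract stable, but as the next paragraph shows, it also left a gap.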

This model had a major flaw: the microservices needed to be aware of user activity as it happened.

For example, if a track received a new comment, the push notifications system needed to know about it.

With the architecture as it stood, there was no way for these microservices to be notified, and polling the API would be far too expensive at this scale.

The engineering team decided on two major architecture changes:

  1. A new model called Semantic Events was created.

  2. A new Internal API was created using Rails engines.

SoundCloud’s Transition to Microservices Architecture

Whenever a domain object changed, a corresponding message would be dispatched to the broker.

This message could then be consumed by any of the microservices, allowing them to react appropriately to the event.
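Here’s a minimal sketch of the Semantic Events idea on the publishing side, assuming a RabbitMQ broker accessed through the Bunny gem; the exchange name, routing keys, and payload schema are my own illustrative choices, not SoundCloud’s published design. Whenever a comment is created, the monolith publishes a small, meaningful event after the database transaction commits.

```ruby
# Hypothetical sketch of Semantic Events: on a domain change, the monolith
# publishes a small event to a broker (RabbitMQ via the Bunny gem is an
# assumption here, as are the exchange name and payload fields).
require "bunny"
require "json"
require "time"

module SemanticEvents
  def self.exchange
    @exchange ||= begin
      conn = Bunny.new(ENV.fetch("BROKER_URL", "amqp://localhost"))
      conn.start
      conn.create_channel.topic("semantic_events", durable: true)
    end
  end

  def self.publish(event_type, payload)
    exchange.publish(
      JSON.generate(payload.merge(event: event_type, emitted_at: Time.now.utc.iso8601)),
      routing_key: event_type,  # e.g. "comment.created"
      persistent: true
    )
  end
end

class Comment < ActiveRecord::Base
  belongs_to :track
  belongs_to :user

  # Publish only after the transaction commits, so consumers never see an
  # event for data that was rolled back.
  after_commit :publish_created_event, on: :create

  private

  def publish_created_event
    SemanticEvents.publish("comment.created",
      comment_id: id, track_id: track_id, user_id: user_id)
  end
end
```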

These new changes also enabled Event Sourcing, allowing the microservices to deal with shared data.
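On the consuming side, a service like push notifications could bind its own queue to that exchange and react as events arrive instead of polling. Again a hedged sketch: EventRecord and PushNotifier are hypothetical stand-ins, and persisting each message before reacting to it is just one simple way to build the append-only history that event sourcing relies on.

```ruby
# Hypothetical consumer in the push notifications service. Queue/exchange
# names match the publisher sketch above; EventRecord and PushNotifier are
# invented placeholders for this service's own storage and delivery logic.
require "bunny"
require "json"

conn = Bunny.new(ENV.fetch("BROKER_URL", "amqp://localhost"))
conn.start

channel  = conn.create_channel
exchange = channel.topic("semantic_events", durable: true)
queue    = channel.queue("push_notifications.comment_events", durable: true)
queue.bind(exchange, routing_key: "comment.created")

queue.subscribe(block: true, manual_ack: true) do |delivery_info, _properties, body|
  event = JSON.parse(body)

  # Keep our own append-only copy of the event (event sourcing flavour)...
  EventRecord.create!(event_type: event["event"], payload: body)

  # ...then react to it, e.g. notify the track owner about the new comment.
  PushNotifier.comment_created(
    track_id: event["track_id"],
    comment_id: event["comment_id"]
  )

  channel.ack(delivery_info.delivery_tag)
end
```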

Since the Public API was built for external applications, it limited which data could be accessed. By creating an Internal API, microservices could access private data, for example to notify users about information only they should see.
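Building the Internal API as a Rails engine means packaging it as an isolated, namespaced mini-application mounted inside the Mothership. Below is a hedged sketch of that shape; the module names, routes, and exposed fields are invented for illustration, not taken from SoundCloud’s codebase.

```ruby
# Hypothetical sketch of an Internal API packaged as a Rails engine and
# mounted inside the monolith. Names, routes, and fields are illustrative.

# lib/internal_api/engine.rb
module InternalApi
  class Engine < ::Rails::Engine
    isolate_namespace InternalApi
  end
end

# The engine's config/routes.rb
InternalApi::Engine.routes.draw do
  resources :users, only: [:show]  # exposes fields the Public API never would
end

# app/controllers/internal_api/users_controller.rb
module InternalApi
  class UsersController < ActionController::Base
    # Reachable only from trusted internal services, so it can return private
    # attributes (e.g. email) that notification services need.
    def show
      user = User.find(params[:id])
      render json: { id: user.id, email: user.email, locale: user.locale }
    end
  end
end

# The Mothership's config/routes.rb: mount the engine under an internal path.
Rails.application.routes.draw do
  mount InternalApi::Engine => "/internal"
end
```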

The main idea behind the new architecture was to break the coupling between the Mothership and the new microservices by giving them well-defined push (events) and pull (API) interfaces.

Contrary to what most people assume, SoundCloud’s move toward a microservices architecture was driven mostly by the need to improve team productivity, not by technical scaling limits.

Without streamlining the development process first, the team wouldn’t be able to produce and scale fast enough to build out the features the platform needed.

These new changes led to shorter feedback cycles, and allowed the team to push out production-ready features at a faster pace.

In Part 3 of this series, we’ll learn about SoundCloud’s new challenges and design patterns with the microservices architecture.

If you made it this far, thank you for reading! I hope you enjoyed it.

If I made a mistake, please let me know.

P.S. If you’re enjoying the content of this newsletter, please share it with your network and subscribe: https://www.fullstackexpress.io/subscribe

