DoorDash's Holiday Forecasting Strategy

Together with Quick Byte

Good morning and welcome back to another edition of Full Stack Express, your weekly newsletter on web development, software architecture, and system design.

We’ll also do deep dives into:

  • DoorDash’s Holiday Forecasting Strategy

  • How PayPal Scales Kafka to Trillions of Messages

And finish off with:

  • The latest tech news and articles

  • Interesting tools and packages created by the community

  • A useful but lesser-known Linux tip

  • A spicy JavaScript meme

ELEVATE YOUR TECH JOURNEY WITH QUICK BYTE (SPONSOR)

Stay ahead of the curve with Quick Byte! Dive into the latest tech trends 📈, discover sizzling startups 📱, and master AI tips & tricks 🤖.

Complement your Full Stack Express insights with a daily dose of tech treats from Quick Byte. Don't miss out!

DOORDASH’S HOLIDAY FORECASTING STRATEGY

One of the many challenges DoorDash faces is forecasting supply and demand during holidays.

For a company that services thousands of markets, maintaining high-quality experiences at scale for both customers and Dashers (delivery drivers) is extremely challenging.

Operations must be planned proactively to ensure that enough Dashers are available, and that additional compensation is offered when fewer Dashers are anticipated.

DoorDash’s Mobilization System

DoorDash’s mobilization system is a collaborative effort between operations, finance, engineering, and machine learning teams.

DoorDash Mobilization System

To run flawlessly, reliable forecasts and predictions are needed to accommodate supply and demand for all days of the year, including holidays.

Limitations to DoorDash’s Current Model

Traditional tree-based models such as random forest and the gradient boosting machine (GBM) are popular in the industry due to their ease of use and high accuracy.

While these models typically excel at time-series forecasting, they struggle with high variance and anomalies, a common pattern in holiday demand.

Tree-Based Model Generating Inaccurate Forecasts For Holidays

In this example, the model predicted a -35% change for every holiday, the average effect across all of them.

In other words, the Fourth of July (-10%) and Christmas (-5%) were over-forecasted, while Thanksgiving was under-forecasted (+15%), a pretty significant margin of error.
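The arithmetic above can be reproduced with illustrative numbers. The per-holiday effects below are assumptions chosen to be consistent with the stated errors, not figures from DoorDash:

```python
# Hypothetical holiday effects (% change in demand) consistent with the
# errors quoted above: July 4th -25%, Christmas -30%, Thanksgiving -50%.
holiday_effects = {"july_4th": -25, "christmas": -30, "thanksgiving": -50}

# A model that only learns an average holiday effect predicts the mean.
prediction = sum(holiday_effects.values()) / len(holiday_effects)  # -35

# Per-holiday error: prediction minus actual effect.
errors = {h: prediction - actual for h, actual in holiday_effects.items()}
# july_4th: -10 (over-forecasted), christmas: -5 (over-forecasted),
# thanksgiving: +15 (under-forecasted)
```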

Cascade Modeling Approach

To account for these discrepancies, DoorDash adopted a cascade modeling approach through a series of steps:

  1. Calculating holiday multipliers

  2. Preprocessing holiday multipliers and model training

  3. Generating forecasts and post-processing

Cascade Model in Multiple Steps
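In code, the three steps might look something like the following minimal sketch. The function names and the multiplier definition are illustrative assumptions; DoorDash's actual pipeline is considerably more involved:

```python
# Minimal sketch of a cascade modeling pipeline (illustrative, not
# DoorDash's code).

def holiday_multipliers(actuals, baselines, holiday_days):
    """Step 1: one multiplier per holiday, defined here as actual demand
    divided by a holiday-free baseline forecast for that day."""
    return {day: actuals[day] / baselines[day] for day in holiday_days}

def preprocess(actuals, multipliers):
    """Step 2: divide holiday observations by their multipliers so the
    core GBM trains on a series with holiday effects removed."""
    adjusted = list(actuals)
    for day, m in multipliers.items():
        adjusted[day] = adjusted[day] / m
    return adjusted

def postprocess(forecast, multipliers):
    """Step 3: reapply the multipliers to the core model's forecast
    on future holiday dates."""
    out = list(forecast)
    for day, m in multipliers.items():
        out[day] = out[day] * m
    return out
```

The key idea is that the core model never sees the anomalous holiday values, so it no longer averages them away; the holiday-specific adjustment lives entirely in the multipliers.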

These changes led to a significant improvement in the GBM model’s accuracy, measured in terms of weighted mean absolute percentage error (wMAPE).
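For reference, wMAPE weights each day's absolute error by actual volume, so high-volume days dominate the score. A minimal sketch:

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total actual volume,
    so a day with 10x the volume contributes 10x the weight."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    total_actual = sum(abs(a) for a in actuals)
    return total_error / total_actual
```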

Model Performance

The results speak for themselves.

Cascade Approach vs Traditional Feature Engineering

The wMAPE saw a dramatic reduction, dropping from a range of 60-70% to just 10-20% around Christmas. Thanksgiving showed a comparable improvement.

When averaged over an entire year of holidays, the wMAPE improved by an absolute 10%.

In conclusion, DoorDash's cascade modeling approach offers a promising solution to the challenges of holiday forecasting, balancing the need for accuracy with the complexities of real-world operations.

Key Takeaways

  • Adaptability is Key: Traditional models may not be suitable for all scenarios. Being flexible and open to adopting new techniques can yield better results.

  • Complexity vs. Gain: While adding complexity can improve model performance, it's crucial to weigh the benefits against the added complexity and computational resources needed.

  • Multi-Disciplinary Approach: Collaboration between operations, finance, engineering, and machine learning teams is essential for implementing a robust forecasting system.

  • Validation Challenges: In cases where traditional experimental designs are not feasible, alternative validation methods like backtests and limited A/B tests can be useful.

  • Stakeholder Communication: Transparently communicating the benefits and limitations of a new approach can help in gaining stakeholder buy-in, which is crucial for successful implementation.
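On the validation point above: a backtest can be as simple as rolling-origin evaluation, where you repeatedly train on history up to a cutoff, forecast the next period, and score against what actually happened. A minimal sketch, with the model interface as a placeholder assumption:

```python
def rolling_backtest(series, fit_predict, window, horizon):
    """Walk forward through the series: at each cutoff, fit on all data
    up to the cutoff and score the next `horizon` points by mean
    absolute error."""
    scores = []
    for cutoff in range(window, len(series) - horizon + 1):
        train = series[:cutoff]
        actual = series[cutoff:cutoff + horizon]
        forecast = fit_predict(train, horizon)
        scores.append(sum(abs(a - f) for a, f in zip(actual, forecast)) / horizon)
    return scores

# Example: a naive "repeat the last value" forecaster as a baseline.
naive = lambda train, h: [train[-1]] * h
```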

HOW PAYPAL SCALES KAFKA TO TRILLIONS OF MESSAGES

PayPal, a giant in the fintech space with over 350 million customers, processes trillions of messages per day on their payments platform.

How exactly does PayPal handle all of these messages at such a high scale while also ensuring high availability, fault tolerance, and optimal performance?

PayPal at Scale

Since its introduction at PayPal in 2015, Kafka has scaled dramatically to meet the company's growing data needs.

Initially starting with a few isolated clusters, Kafka now powers a wide range of applications at PayPal, from first-party tracking to analytics, together handling over 100 billion messages per day.

The platform boasts over 1,500 brokers, 20,000 topics, and nearly 2,000 MirrorMaker nodes (which mirror data among clusters), achieving a remarkable 99.99% availability.

During peak times like Black Friday in 2022, Kafka at PayPal handled a staggering 1.3 trillion messages in a single day.

Incoming Messages Per Second Over Holiday Season

The system continues to scale seamlessly, especially during high-traffic holiday seasons, without impacting business operations.

Total Messages Per Day Over Holiday Season

How Kafka is Used at PayPal

At a high level, PayPal’s infrastructure consists of multiple geographically distributed data centers and security zones.

Kafka clusters are deployed across these zones depending on their data classification and business requirements.

Kafka Cluster Deployments in Security Zones Within a Data Center

Data is then mirrored across the data centers, helping with disaster recovery and zone communication.

But what libraries and components do these Kafka clusters interact with?

At PayPal, a wide array of clients are supported:

  • Java

  • Python

  • Spark

  • Node

  • Golang

  • Internal frameworks

Kafka Libraries and Components

With multiple frameworks, tech stacks, and applications that need to be supported, reducing and managing operational overhead is always a challenge.

Operational Challenges and Solutions

Over the years, PayPal has identified four key areas of improvement in order to maintain security, consistency, and platform stability:

  • Cluster management

  • Monitoring and alerting

  • Configuration management

  • Enhancements and automation

To improve cluster management, PayPal introduced a Kafka Config service to manage broker IPs and configurations, cutting down on the maintenance required for upgrades and patching.

Furthermore, Access Control Lists (ACLs) were added to secure the platform for business-critical workflows.

To enhance security and streamline integration across various frameworks, PayPal has developed its own set of Kafka libraries.

The Resilient Client Library simplifies connectivity and configuration, improving developer efficiency and business resilience.
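The article doesn't detail the Resilient Client Library's internals, but one pattern such a wrapper commonly implements is retry with exponential backoff around transient failures. Purely as an illustration (not PayPal's code):

```python
import time

def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Call `operation`, retrying on exceptions with exponential backoff
    (0.1s, 0.2s, 0.4s, ...). Re-raises the last error once attempts
    are exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In a real client library, a wrapper like this would sit around produce/consume calls, add jitter to the delays, and distinguish retriable errors (broker temporarily unavailable) from fatal ones (authorization failure).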

Meanwhile, the Monitoring Library enables real-time health checks and alerts, and the Security Library automates SSL authentication, significantly reducing operational overhead for managing certificates and keys.
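Again purely as an illustration of the Monitoring Library's job, a health check boils down to comparing reported metrics against thresholds and emitting alerts. The metric names and thresholds below are assumptions, not PayPal's:

```python
def broker_alerts(metrics, max_lag=10_000, min_isr=2):
    """Compare a broker's reported metrics (hypothetical names) against
    thresholds and return a list of alert strings; empty means healthy."""
    alerts = []
    if metrics.get("consumer_lag", 0) > max_lag:
        alerts.append(f"consumer lag {metrics['consumer_lag']} exceeds {max_lag}")
    if metrics.get("in_sync_replicas", min_isr) < min_isr:
        alerts.append("under-replicated partition: ISR below minimum")
    if not metrics.get("reachable", True):
        alerts.append("broker unreachable")
    return alerts
```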

Supported Tech Stack for Kafka Libraries

Wrapping Up

Kafka has been instrumental in supporting PayPal's massive data streaming needs, and the journey has been filled with challenges and learnings.

Through strategic investments in tooling, monitoring, and automation, PayPal has managed to scale Kafka efficiently while maintaining high availability and performance.

As the platform continues to evolve, PayPal remains committed to enhancing its Kafka infrastructure to provide a seamless and robust experience for its end-users.

Key Takeaways

  • Tooling for Scalable Kafka Management: Custom tools and automation are indispensable for efficiently managing large-scale Kafka deployments. These tools help in reducing manual overhead and errors, making it easier to scale the system as the data traffic grows.

  • Precision Monitoring for Availability: Fine-tuning the monitoring metrics and setting up a responsive alerting system are crucial for maintaining high availability. This ensures that any issues can be quickly identified and resolved, minimizing downtime and impact on mission-critical applications.

  • Securing Access with ACLs: Implementing ACLs has proven to be effective in enhancing the security of Kafka clusters. ACLs also provide better control over which applications can access specific Kafka resources, thereby making application management more structured and secure.

  • Benchmarking for Insight and Efficiency: Conducting performance benchmarks across different operational environments, such as on-premises and cloud, provides valuable insights. These insights help in making informed decisions about optimizing performance and cost-efficiency.

BYTE-SIZED TOPICS

INTERESTING PRODUCTS, TOOLS & PACKAGES

TIP OF THE WEEK

A lesser-known but useful Linux tip is the use of the !! (double exclamation mark) to repeat the last command you executed.

This is particularly handy when you forget to run a command with sudo.

For example, let's say you try to update the package list with apt:

apt update

You'll likely get a permission denied error because you didn't use sudo. Instead of retyping the whole command, you can simply type:

sudo !!

This will execute the last command (apt update) but with sudo prepended, effectively running sudo apt update.

MEME OF THE WEEK
