DoorDash's Holiday Forecasting Strategy

Together with Quick Byte

Good morning and welcome back to another edition of Full Stack Express, your weekly newsletter on web development, software architecture, and system design.

We’ll also do deep dives into:

  • DoorDash’s Holiday Forecasting Strategy

  • How PayPal Scales Kafka to Trillions of Messages

And finish off with:

  • The latest tech news and articles

  • Interesting tools and packages created by the community

  • A useful but lesser-known Linux tip

  • A spicy JavaScript meme

ELEVATE YOUR TECH JOURNEY WITH QUICK BYTE (SPONSOR)

Stay ahead of the curve with Quick Byte! Dive into the latest tech trends 📈, discover sizzling startups 📱, and master AI tips & tricks 🤖.

Complement your Full Stack Express insights with a daily dose of tech treats from Quick Byte. Don't miss out!

DOORDASH’S HOLIDAY FORECASTING STRATEGY

One of the many challenges DoorDash faces is forecasting supply and demand during holidays.

For a company that services thousands of markets, maintaining high-quality experiences at scale for both customers and Dashers (delivery drivers) is extremely challenging.

Operations must be planned proactively to ensure that enough Dashers are available, and that additional compensation is offered when fewer Dashers are anticipated.

DoorDash’s Mobilization System

DoorDash’s mobilization system is a collaborative effort between operations, finance, engineering, and machine learning teams.

DoorDash Mobilization System

To run flawlessly, reliable forecasts and predictions are needed to accommodate supply and demand for all days of the year, including holidays.

Limitations to DoorDash’s Current Model

Traditional tree-based models such as random forest and the gradient boosting machine (GBM) are popular in the industry due to their ease of use and high accuracy.

While these models typically excel at time-series forecasting, they struggle with high variance and anomalies, a common pattern in holiday demand.

Tree-Based Model Generating Inaccurate Forecasts For Holidays

In this example, the model predicted a -35% change for every holiday, the average effect across all of them.

In other words, the Fourth of July (-10%) and Christmas (-5%) were over-forecasted, while Thanksgiving was under-forecasted (+15%), a pretty significant margin of error.
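The arithmetic above can be reproduced with illustrative numbers. The per-holiday effects below are assumptions chosen to be consistent with the stated errors, not figures from DoorDash:

```python
# Hypothetical holiday effects (% change in demand) consistent with the
# errors quoted above: July 4th -25%, Christmas -30%, Thanksgiving -50%.
holiday_effects = {"july_4th": -25, "christmas": -30, "thanksgiving": -50}

# A model that only learns an average holiday effect predicts the mean.
prediction = sum(holiday_effects.values()) / len(holiday_effects)  # -35

# Per-holiday error: prediction minus actual effect.
errors = {h: prediction - actual for h, actual in holiday_effects.items()}
# july_4th: -10 (over-forecasted), christmas: -5 (over-forecasted),
# thanksgiving: +15 (under-forecasted)
```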

Cascade Modeling Approach

To account for these discrepancies, DoorDash adopted a cascade modeling approach through a series of steps:

  1. Calculating holiday multipliers

  2. Preprocessing holiday multipliers and model training

  3. Generating forecasts and post-processing

Cascade Model in Multiple Steps
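In code, the three steps might look something like the following minimal sketch. The function names and the multiplier definition are illustrative assumptions; DoorDash's actual pipeline is considerably more involved:

```python
# Minimal sketch of a cascade modeling pipeline (illustrative, not
# DoorDash's code).

def holiday_multipliers(actuals, baselines, holiday_days):
    """Step 1: one multiplier per holiday, defined here as actual demand
    divided by a holiday-free baseline forecast for that day."""
    return {day: actuals[day] / baselines[day] for day in holiday_days}

def preprocess(actuals, multipliers):
    """Step 2: divide holiday observations by their multipliers so the
    core GBM trains on a series with holiday effects removed."""
    adjusted = list(actuals)
    for day, m in multipliers.items():
        adjusted[day] = adjusted[day] / m
    return adjusted

def postprocess(forecast, multipliers):
    """Step 3: reapply the multipliers to the core model's forecast
    on future holiday dates."""
    out = list(forecast)
    for day, m in multipliers.items():
        out[day] = out[day] * m
    return out
```

The key idea is that the core model never sees the anomalous holiday values, so it no longer averages them away; the holiday-specific adjustment lives entirely in the multipliers.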

These changes led to a significant improvement in the GBM model’s accuracy, measured in terms of weighted mean absolute percentage error (wMAPE).
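For reference, wMAPE weights each day's absolute error by actual volume, so high-volume days dominate the score. A minimal sketch:

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total actual volume,
    so a day with 10x the volume contributes 10x the weight."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    total_actual = sum(abs(a) for a in actuals)
    return total_error / total_actual
```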

Model Performance

The results speak for themselves.

Cascade Approach vs Traditional Feature Engineering

The wMAPE saw a dramatic reduction, dropping from a range of 60-70% to just 10-20% around Christmas. Thanksgiving showed a comparable improvement.

When averaged over an entire year of holidays, the wMAPE improved by an absolute 10%.

In conclusion, DoorDash's cascade modeling approach offers a promising solution to the challenges of holiday forecasting, balancing the need for accuracy with the complexities of real-world operations.

Key Takeaways

  • Adaptability is Key: Traditional models may not be suitable for all scenarios. Being flexible and open to adopting new techniques can yield better results.

  • Complexity vs. Gain: While adding complexity can improve model performance, it's crucial to weigh the benefits against the added complexity and computational resources needed.

  • Multi-Disciplinary Approach: Collaboration between operations, finance, engineering, and machine learning teams is essential for implementing a robust forecasting system.

  • Validation Challenges: In cases where traditional experimental designs are not feasible, alternative validation methods like backtests and limited A/B tests can be useful.

  • Stakeholder Communication: Transparently communicating the benefits and limitations of a new approach can help in gaining stakeholder buy-in, which is crucial for successful implementation.
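On the validation point above: a backtest can be as simple as rolling-origin evaluation, where you repeatedly train on history up to a cutoff, forecast the next period, and score against what actually happened. A minimal sketch, with the model interface as a placeholder assumption:

```python
def rolling_backtest(series, fit_predict, window, horizon):
    """Walk forward through the series: at each cutoff, fit on all data
    up to the cutoff and score the next `horizon` points by mean
    absolute error."""
    scores = []
    for cutoff in range(window, len(series) - horizon + 1):
        train = series[:cutoff]
        actual = series[cutoff:cutoff + horizon]
        forecast = fit_predict(train, horizon)
        scores.append(sum(abs(a - f) for a, f in zip(actual, forecast)) / horizon)
    return scores

# Example: a naive "repeat the last value" forecaster as a baseline.
naive = lambda train, h: [train[-1]] * h
```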

HOW PAYPAL SCALES KAFKA TO TRILLIONS OF MESSAGES

PayPal, a giant in the fintech space with over 350 million customers, processes trillions of messages per day on their payments platform.

How exactly does PayPal handle all of these messages at such a high scale while also ensuring high availability, fault tolerance, and optimal performance?

PayPal at Scale

Since its introduction at PayPal in 2015, Kafka has scaled dramatically to meet the company's growing data needs.

Initially starting with a few isolated clusters, Kafka now powers a wide range of applications at PayPal, from first-party tracking to analytics, together handling over 100 billion messages per day.

The platform boasts over 1,500 brokers, 20,000 topics, and nearly 2,000 MirrorMaker nodes (which mirror data among clusters), achieving a remarkable 99.99% availability.

During peak times like Black Friday in 2022, Kafka at PayPal handled a staggering 1.3 trillion messages in a single day.

Incoming Messages Per Second Over Holiday Season

The system continues to scale seamlessly, especially during high-traffic holiday seasons, without impacting business operations.

Total Messages Per Day Over Holiday Season

How Kafka is Used at PayPal

At a high level, PayPal’s infrastructure consists of multiple geographically distributed data centers and security zones.

Kafka clusters are deployed across these zones depending on their data classification and business requirements.

Kafka Cluster Deployments in Security Zones Within a Data Center

Data is then mirrored across the data centers, helping with disaster recovery and zone communication.

But what libraries and components do these Kafka clusters interact with?

At PayPal, a wide array of clients are supported:

  • Java

  • Python

  • Spark

  • Node

  • Golang

  • Internal frameworks

Kafka Libraries and Components

With multiple frameworks, tech stacks, and applications that need to be supported, reducing and managing operational overhead is always a challenge.

Operational Challenges and Solutions

Over the years, PayPal has identified four key areas of improvement in order to maintain security, consistency, and platform stability:

  • Cluster management

  • Monitoring and alerting

  • Configuration management

  • Enhancements and automation

To improve cluster management, PayPal introduced a Kafka Config service to manage broker IPs and configurations, cutting down on the maintenance required for upgrades and patching.

Furthermore, Access Control Lists (ACLs) were added to secure the platform for business-critical workflows.

To enhance security and streamline integration across various frameworks, PayPal has developed its own set of Kafka libraries.

The Resilient Client Library simplifies connectivity and configuration, improving developer efficiency and business resilience.
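The article doesn't detail the Resilient Client Library's internals, but one pattern such a wrapper commonly implements is retry with exponential backoff around transient failures. Purely as an illustration (not PayPal's code):

```python
import time

def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Call `operation`, retrying on exceptions with exponential backoff
    (0.1s, 0.2s, 0.4s, ...). Re-raises the last error once attempts
    are exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In a real client library, a wrapper like this would sit around produce/consume calls, add jitter to the delays, and distinguish retriable errors (broker temporarily unavailable) from fatal ones (authorization failure).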

Meanwhile, the Monitoring Library enables real-time health checks and alerts, and the Security Library automates SSL authentication, significantly reducing operational overhead for managing certificates and keys.
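Again purely as an illustration of the Monitoring Library's job, a health check boils down to comparing reported metrics against thresholds and emitting alerts. The metric names and thresholds below are assumptions, not PayPal's:

```python
def broker_alerts(metrics, max_lag=10_000, min_isr=2):
    """Compare a broker's reported metrics (hypothetical names) against
    thresholds and return a list of alert strings; empty means healthy."""
    alerts = []
    if metrics.get("consumer_lag", 0) > max_lag:
        alerts.append(f"consumer lag {metrics['consumer_lag']} exceeds {max_lag}")
    if metrics.get("in_sync_replicas", min_isr) < min_isr:
        alerts.append("under-replicated partition: ISR below minimum")
    if not metrics.get("reachable", True):
        alerts.append("broker unreachable")
    return alerts
```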

Supported Tech Stack for Kafka Libraries

Wrapping Up

Kafka has been instrumental in supporting PayPal's massive data streaming needs, and the journey has been filled with challenges and learnings.

Through strategic investments in tooling, monitoring, and automation, PayPal has managed to scale Kafka efficiently while maintaining high availability and performance.

As the platform continues to evolve, PayPal remains committed to enhancing its Kafka infrastructure to provide a seamless and robust experience for its end-users.

Key Takeaways

  • Tooling for Scalable Kafka Management: Custom tools and automation are indispensable for efficiently managing large-scale Kafka deployments. These tools help in reducing manual overhead and errors, making it easier to scale the system as the data traffic grows.

  • Precision Monitoring for Availability: Fine-tuning the monitoring metrics and setting up a responsive alerting system are crucial for maintaining high availability. This ensures that any issues can be quickly identified and resolved, minimizing downtime and impact on mission-critical applications.

  • Securing Access with ACLs: Implementing ACLs has proven to be effective in enhancing the security of Kafka clusters. ACLs also provide better control over which applications can access specific Kafka resources, thereby making application management more structured and secure.

  • Benchmarking for Insight and Efficiency: Conducting performance benchmarks across different operational environments, such as on-premises and cloud, provides valuable insights. These insights help in making informed decisions about optimizing performance and cost-efficiency.

BYTE-SIZED TOPICS

INTERESTING PRODUCTS, TOOLS & PACKAGES

TIP OF THE WEEK

A lesser-known but useful Linux tip is the use of the !! (double exclamation mark) to repeat the last command you executed.

This is particularly handy when you forget to run a command with sudo.

For example, let's say you try to update the package list with apt:

apt update

You'll likely get a permission denied error because you didn't use sudo. Instead of retyping the whole command, you can simply type:

sudo !!

This will execute the last command (apt update) but with sudo prepended, effectively running sudo apt update.

MEME OF THE WEEK
