DoorDash's Holiday Forecasting Strategy


Together with Quick Byte
Good morning and welcome back to another edition of Full Stack Express, your weekly newsletter on web development, software architecture, and system design.
Here are the notable announcements for the week:
Bun 1.0, the all-in-one JS/TS toolkit, released on September 8.
React Aria Components is now in beta.
OpenAI’s first developer conference in San Francisco.
Turbo 8 is dropping TypeScript.
Visual Studio Code adds port forwarding.
We’ll also do deep dives into:
DoorDash’s Holiday Forecasting Strategy
How PayPal Scales Kafka to Trillions of Messages
And finish off with:
The latest tech news and articles
Interesting tools and packages created by the community
A useful but lesser-known Linux tip
A spicy JavaScript meme
ELEVATE YOUR TECH JOURNEY WITH QUICK BYTE (SPONSOR)
Stay ahead of the curve with Quick Byte! Dive into the latest tech trends 📈, discover sizzling startups 📱, and master AI tips & tricks 🤖.
Complement your Full Stack Express insights with a daily dose of tech treats from Quick Byte. Don't miss out!
DOORDASH’S HOLIDAY FORECASTING STRATEGY

One of the many challenges DoorDash faces is forecasting supply and demand during holidays.
For a company that services thousands of markets, maintaining high-quality experiences at scale for both customers and Dashers (delivery drivers) is extremely challenging.
Operations must be planned proactively to ensure that enough Dashers are available, with additional compensation offered when fewer drivers are anticipated.
DoorDash’s Mobilization System
DoorDash’s mobilization system is a collaborative effort between operations, finance, engineering, and machine learning teams.

DoorDash Mobilization System
To run flawlessly, reliable forecasts and predictions are needed to accommodate supply and demand for all days of the year, including holidays.
Limitations to DoorDash’s Current Model
Traditional tree-based models such as random forest and gradient boosting machine (GBM) are popular in the industry due to their ease of use and high accuracy.
While they typically excel at time-series forecasting, they struggle with high variance and anomalies, a pattern common during holiday demand.

Tree-Based Model Generating Inaccurate Forecasts For Holidays
In this example, the model predicted -35% for every holiday, roughly the average across all of them.
As a result, the Fourth of July and Christmas were over-forecasted by 10% and 5% respectively, while Thanksgiving was under-forecasted by 15%, a significant margin of error.
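In miniature, the failure mode looks like this: when a single "is holiday" signal covers very different holidays, the model effectively outputs one value for all of them, the mean effect. This is a toy illustration with invented numbers, not DoorDash's data:

```python
# Toy illustration (numbers invented): a single is_holiday split can only
# output one value for every holiday, so it predicts the mean effect.
holiday_effects = {"july4": -10, "christmas": -5, "thanksgiving": 15}  # % vs baseline
leaf_prediction = sum(holiday_effects.values()) / len(holiday_effects)
print(leaf_prediction)  # 0.0 — each holiday's distinct effect is averaged away
```

A prediction of 0% is badly wrong for all three holidays, even though it minimizes the average error across them.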
Cascade Modeling Approach
To account for these discrepancies, DoorDash adopted a cascade modeling approach through a series of steps:
Calculating holiday multipliers
Preprocessing holiday multipliers and model training
Generating forecasts and post-processing

Cascade Model in Multiple Steps
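The three steps above can be sketched in miniature. This is a hypothetical illustration of the cascade idea, not DoorDash's implementation; all function names and numbers are invented:

```python
# Hypothetical sketch of a cascade approach (names and numbers invented):
# 1) compute a per-holiday multiplier from historical actuals vs. baseline,
# 2) divide out the holiday effect before training/forecasting,
# 3) re-apply the multiplier to the model's raw output.

def holiday_multiplier(actuals, baselines):
    """Average ratio of holiday-day actuals to the non-holiday baseline."""
    ratios = [a / b for a, b in zip(actuals, baselines)]
    return sum(ratios) / len(ratios)

def preprocess(series, holidays, multipliers):
    """Divide out the holiday effect so the model sees 'normal' demand."""
    return [
        value / multipliers[day] if day in multipliers else value
        for day, value in zip(holidays, series)
    ]

def postprocess(forecasts, holidays, multipliers):
    """Re-apply the holiday effect to the model's raw forecasts."""
    return [
        value * multipliers[day] if day in multipliers else value
        for day, value in zip(holidays, forecasts)
    ]

# Toy example: Thanksgiving historically runs at ~75% of baseline demand.
m = {"thanksgiving": holiday_multiplier([750.0], [1000.0])}
raw_forecast = [1000.0, 1200.0]            # model output for two days
days = ["thanksgiving", None]
print(postprocess(raw_forecast, days, m))  # [750.0, 1200.0]
```

The key design choice is that the model never has to learn each holiday's idiosyncratic effect; the multipliers carry that information, so the model can focus on the regular demand pattern.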
These changes led to a significant improvement in the GBM model’s accuracy, measured in terms of weighted mean absolute percentage error (wMAPE).
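wMAPE weights each day's error by its actual volume, so busy days count for more than quiet ones. A minimal sketch of the metric (the weighting scheme here is the standard volume-weighted form, which may differ in detail from DoorDash's exact definition):

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total actual volume."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return total_error / sum(actuals)

# Two days of demand: the busier day dominates the score.
actuals = [1000.0, 100.0]
forecasts = [900.0, 150.0]
print(wmape(actuals, forecasts))  # (100 + 50) / 1100 ≈ 0.136
```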
Model Performance
The results speak for themselves.

Cascade Approach vs Traditional Feature Engineering
The wMAPE dropped dramatically, from a range of 60-70% to just 10-20% over the Christmas period. Thanksgiving showed a comparable improvement.
When averaged over an entire year of holidays, the wMAPE improved by an absolute 10%.
In conclusion, DoorDash's cascade modeling approach offers a promising solution to the challenges of holiday forecasting, balancing the need for accuracy with the complexities of real-world operations.
Key Takeaways
Adaptability is Key: Traditional models may not be suitable for all scenarios. Being flexible and open to adopting new techniques can yield better results.
Complexity vs. Gain: While adding complexity can improve model performance, it's crucial to weigh the benefits against the added complexity and computational resources needed.
Multi-Disciplinary Approach: Collaboration between operations, finance, engineering, and machine learning teams is essential for implementing a robust forecasting system.
Validation Challenges: In cases where traditional experimental designs are not feasible, alternative validation methods like backtests and limited A/B tests can be useful.
Stakeholder Communication: Transparently communicating the benefits and limitations of a new approach can help in gaining stakeholder buy-in, which is crucial for successful implementation.
HOW PAYPAL SCALES KAFKA TO TRILLIONS OF MESSAGES

PayPal, a giant in the fintech space with over 350 million customers, processes trillions of messages per day on their payments platform.
How exactly does PayPal handle all of these messages at such a high scale while also ensuring high availability, fault tolerance, and optimal performance?
PayPal at Scale
Since its introduction at PayPal in 2015, Kafka has scaled dramatically to meet the company's growing data needs.
Starting with a few isolated clusters, Kafka now powers a wide range of applications at PayPal, from first-party tracking to analytics, collectively handling over 100 billion messages per day.
The platform boasts over 1,500 brokers, 20,000 topics, and nearly 2,000 MirrorMaker nodes (which mirror data among clusters), achieving a remarkable 99.99% availability.
During peak times like Black Friday in 2022, Kafka at PayPal handled a staggering 1.3 trillion messages in a single day.
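A quick back-of-the-envelope calculation puts that daily figure in per-second terms:

```python
# 1.3 trillion messages in one day, expressed as an average rate.
messages_per_day = 1.3e12
per_second = messages_per_day / 86_400      # seconds in a day
print(f"{per_second / 1e6:.1f}M msgs/sec")  # ~15.0M msgs/sec on average
```

And that is the average; the instantaneous peak rate would be higher still.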

Incoming Messages Per Second Over Holiday Season
The system continues to scale seamlessly, especially during high-traffic holiday seasons, without impacting business operations.

Total Messages Per Day Over Holiday Season
How Kafka is Used at PayPal
At a high-level, PayPal’s infrastructure consists of multiple geographically distributed data centers and security zones.
Kafka clusters are deployed across these zones depending on their data classification and business requirements.

Kafka Cluster Deployments in Security Zones Within a Data Center
Data is then mirrored across the data centers, helping with disaster recovery and zone communication.
But what libraries and components do these Kafka clusters interact with?
At PayPal, a wide array of clients are supported:
Java
Python
Spark
Node
Golang
Internal frameworks

Kafka Libraries and Components
With multiple frameworks, tech stacks, and applications that need to be supported, reducing and managing operational overhead is always a challenge.
Operational Challenges and Solutions
Over the years, PayPal has identified four key areas of improvement in order to maintain security, consistency, and platform stability:
Cluster management
Monitoring and alerting
Configuration management
Enhancements and automation
To improve cluster management, PayPal introduced a Kafka Config service to manage broker IPs and configurations, which cut down on maintenance from upgrades, patching, etc.
Furthermore, Access Control Lists (ACLs) were added to secure the platform for business-critical workflows.
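The article doesn't show the Kafka Config service's interface, but the general idea, resolving broker addresses from a central service instead of hardcoding them, can be sketched as follows. The payload shape, field names, and hostnames here are all invented for illustration:

```python
# Hypothetical sketch of a config-service lookup (payload fields invented):
# clients resolve broker addresses from a central service rather than
# hardcoding IPs, so broker replacements don't require app redeploys.

def build_client_config(service_response):
    """Turn a config-service payload into Kafka client settings."""
    return {
        "bootstrap_servers": ",".join(service_response["brokers"]),
        "security_protocol": service_response.get("protocol", "SSL"),
    }

payload = {"brokers": ["kafka-a1:9092", "kafka-a2:9092"], "protocol": "SSL"}
print(build_client_config(payload))
```

Centralizing this lookup means upgrades and patching that replace brokers only change the service's answer, not every application's deployment.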

To enhance security and streamline integration across various frameworks, PayPal has developed its own set of Kafka libraries.
The Resilient Client Library simplifies connectivity and configuration, improving developer efficiency and business resilience.
Meanwhile, the Monitoring Library enables real-time health checks and alerts, and the Security Library automates SSL authentication, significantly reducing operational overhead for managing certificates and keys.
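PayPal's libraries are internal, but one thing a "resilient client" wrapper plausibly does is retry transient send failures with backoff so every application doesn't reimplement that logic. A hedged sketch of that pattern (class name, parameters, and the injected send function are all invented):

```python
import time

# Hypothetical sketch of a resilient-client pattern (names invented):
# retry transient send failures with exponential backoff.

class ResilientSender:
    def __init__(self, send_fn, retries=3, backoff_s=0.0):
        self.send_fn = send_fn    # underlying producer send, injected
        self.retries = retries
        self.backoff_s = backoff_s

    def send(self, topic, message):
        last_error = None
        for attempt in range(self.retries):
            try:
                return self.send_fn(topic, message)
            except ConnectionError as err:
                last_error = err
                time.sleep(self.backoff_s * (2 ** attempt))
        raise last_error

# Simulate a broker that fails once, then succeeds.
calls = {"n": 0}
def flaky_send(topic, message):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("broker unavailable")
    return f"ack:{topic}"

sender = ResilientSender(flaky_send, retries=3)
print(sender.send("payments", b"event"))  # ack:payments
```

Injecting the send function keeps the wrapper testable without a live broker, which is presumably part of why such libraries pay off across many tech stacks.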

Supported Tech Stack for Kafka Libraries
Wrapping Up
Kafka has been instrumental in supporting PayPal's massive data streaming needs, and the journey has been filled with challenges and learnings.
Through strategic investments in tooling, monitoring, and automation, PayPal has managed to scale Kafka efficiently while maintaining high availability and performance.
As the platform continues to evolve, PayPal remains committed to enhancing its Kafka infrastructure to provide a seamless and robust experience for its end-users.
Key Takeaways
Tooling for Scalable Kafka Management: Custom tools and automation are indispensable for efficiently managing large-scale Kafka deployments. These tools help in reducing manual overhead and errors, making it easier to scale the system as the data traffic grows.
Precision Monitoring for Availability: Fine-tuning the monitoring metrics and setting up a responsive alerting system are crucial for maintaining high availability. This ensures that any issues can be quickly identified and resolved, minimizing downtime and impact on mission-critical applications.
Securing Access with ACLs: Implementing ACLs has proven to be effective in enhancing the security of Kafka clusters. ACLs also provide better control over which applications can access specific Kafka resources, thereby making application management more structured and secure.
Benchmarking for Insight and Efficiency: Conducting performance benchmarks across different operational environments, such as on-premises and cloud, provides valuable insights. These insights help in making informed decisions about optimizing performance and cost-efficiency.
BYTE-SIZED TOPICS
INTERESTING PRODUCTS, TOOLS & PACKAGES
TIP OF THE WEEK
A lesser-known but useful Linux tip is the use of !! (double exclamation mark) to repeat the last command you executed.
This is particularly handy when you forget to run a command with sudo.
For example, let's say you try to update the package list with apt:
apt update
You'll likely get a permission denied error because you didn't use sudo. Instead of retyping the whole command, you can simply type:
sudo !!
This will execute the last command (apt update) but with sudo prepended, effectively running sudo apt update.
MEME OF THE WEEK
