Deep Dive Into Instacart's Machine Learning Platform

🌙 Hello world ☀️ 

The 2023 State of JS survey is still open for another week (it ends December 12) and helps the JavaScript community measure the popularity of current features and track new trends. Take the survey here.

In this week’s email:

  • Machine Learning: Deep dive into Griffin 2.0, the core of Instacart’s ML platform.

  • JavaScript: Learn hands-on how modern JavaScript frameworks work.

  • React: A router-driven approach to building React applications.

  • HTML/CSS: Five new modern web development features.

  • Internet Trends: How Black Friday and Cyber Monday affect internet traffic.

The best way to predict the future is to invent it.

Alan Kay

Created with Midjourney


Machine learning is essential to Instacart’s platform and sits at the core of nearly every product and operational innovation at the company. The technology is used to:

  1. Enhance customer experience by matching customer preferences with suitable choices from a catalog containing over 1 billion products.

  2. Optimize the efficiency of over 600,000 shoppers, enabling swift delivery to millions of customers across the US and Canada.

  3. Integrate artificial intelligence into the Instacart platform, enhancing support for over 800 retailers across 70,000 stores in more than 5,500 cities throughout the US and Canada.

  4. Facilitate connections between over 5,000 brand partners and potential customers.

To create a seamless shopping experience that also handles day-to-day scalability challenges, the machine learning infrastructure team created Griffin, Instacart’s MLOps platform.

Shortcomings of Griffin 1.0

Griffin 1.0 at Instacart, while comprehensive for end-to-end ML, revealed several limitations that led to the development of Griffin 2.0.

These included a steep learning curve with complex in-house command-line tools, a complicated deployment process involving AWS ECS, and the need for specialized system tuning.

It also lacked standardization, relying heavily on GitHub PRs for routine tasks, and scaled poorly: horizontal scaling was limited, and the model registry could not keep up with high query volumes.

Furthermore, Griffin 1.0 offered a fragmented user experience due to multiple third-party integrations and had inadequate metadata management for training and deployment.

Development of Griffin 2.0

In developing Griffin 2.0's Machine Learning Training Platform (MLTP), the goal was to create a unified, centralized platform for machine learning engineers (MLEs) to easily create, track, and manage training workloads.

The platform was designed to support distributed machine learning, including distributed training, batch inference, and fine-tuning of Large Language Models (LLMs), while addressing the limitations of Griffin 1.0.

Key strategic design decisions for Griffin 2.0 included:

  1. A Singular Interface: Unlike Griffin 1.0, which required navigating multiple systems, Griffin 2.0 integrates all tools into one unified web interface, simplifying user experience and streamlining model training development.

  2. Centralized Unity: Griffin 2.0 consolidates various training backends into a single Kubernetes platform, reducing maintenance overhead compared to the fragmented approach in Griffin 1.0.

  3. Standard ML Runtime: Addressing the lack of standardized modeling frameworks in Griffin 1.0, Griffin 2.0 introduces standard runtimes across different ML frameworks, ensuring consistency in building blocks and package versions.

  4. Horizontal Scalability: Griffin 2.0 uses Ray to enable horizontal scalability for distributed workloads, overcoming the vertical scaling limitations of Griffin 1.0.

  5. Metadata Store for All: Griffin 2.0 implements a centralized metadata store, enhancing model lineage management and lifecycle oversight, an area where Griffin 1.0 was deficient.
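Griffin's code is not public, so as a rough stdlib stand-in for the data-parallel pattern Ray provides, the sketch below shards a workload across workers and aggregates partial results. In Ray, the thread pool would be replaced by `@ray.remote` tasks scheduled across cluster nodes, so scaling out means adding nodes rather than buying a bigger single machine.

```python
# Conceptual sketch of Ray-style horizontal scaling: shard a workload
# across workers and aggregate the results. The per-shard function is a
# stand-in for real work such as batch inference on one data partition.
from concurrent.futures import ThreadPoolExecutor


def score_shard(shard: list[float]) -> float:
    """Stand-in for per-shard work (e.g. inference over one partition)."""
    return sum(x * 2.0 for x in shard)


def run_distributed(data: list[float], num_workers: int = 4) -> float:
    # One shard per worker: more data is handled by adding workers
    # (horizontal scaling), not by growing a single machine (vertical).
    shards = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(score_shard, shards))
    return sum(partials)
```

The same shape maps onto Ray directly: `score_shard` becomes a remote task, and `pool.map` becomes a list of task handles resolved with `ray.get`.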

System Architecture

ML Training Platform Architecture

The Machine Learning Training Platform (MLTP) at Instacart is built with several key components to provide a centralized service with distributed computation capabilities:

  1. Metadata Store: Includes Model Store for untrained models, Offline Feature Store for training metadata, Workflow Run for managing training jobs, and Model Registry for post-training model information.

  2. API Endpoints: Provides RESTful APIs for interacting with the Metadata Store, managing model architectures, registries, features, datasets, and training jobs.

  3. Workflow Orchestrator: Comprises the MLTP API service for customizing training jobs and the ISC worker integrated with Kubernetes and Ray for orchestrating and managing training workloads.
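The article does not show the API itself, so the sketch below imagines what a client payload for a training-job endpoint might look like. The path `/api/v1/training-jobs` and every field name here are hypothetical, not Instacart's actual schema.

```python
# Hypothetical sketch of a request body a client might POST to an
# MLTP-style REST endpoint such as /api/v1/training-jobs.
import json


def build_training_job_request(model_id: str, dataset_id: str,
                               hyperparams: dict) -> dict:
    """Assemble a training-job payload from the user's selections."""
    return {
        "model_id": model_id,      # untrained model from the Model Store
        "dataset_id": dataset_id,  # reference into the Training Dataset
        "config": hyperparams,     # user-supplied training settings
    }


payload = build_training_job_request("ranker-v3", "orders-2023-q4",
                                     {"epochs": 5, "lr": 1e-3})
body = json.dumps(payload)  # serialized request body
```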

Workflow Orchestrator

MLTP Training Process

The process of creating a training job in the Machine Learning Training Platform (MLTP) involves several steps:

  1. Customization: Users start by customizing their inputs, including organizing features and experimenting with model designs. They select a model from the Model Store, choose data from the Training Dataset, and configure training settings.

  2. Initiation Options: To start a training workload, users can either use the Griffin UI or Python SDKs to send requests to the workflow services.

  3. Resource Creation: The workflow services then generate Kubernetes resources based on the user's inputs, which can range from a simple single-container job to a complex multi-node Ray cluster.

  4. Post-Training: Upon completion, training results like MLflow metrics and Datadog logs are displayed, and the model weights and other relevant artifacts are stored in the Model Registry for future use in evaluation and inference.
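As a minimal illustration of the resource-creation step, the sketch below builds the simplest case the article mentions, a single-container Kubernetes `batch/v1` Job, from user inputs. The image name and command are hypothetical; Griffin's real manifests (up to multi-node Ray clusters) are far richer.

```python
# Illustrative sketch: turn user inputs into a minimal Kubernetes Job
# manifest (the simple single-container case; a multi-node Ray cluster
# would require a far richer spec).
def build_k8s_training_job(job_name: str, image: str,
                           command: list[str]) -> dict:
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "trainer",
                        "image": image,      # standard ML runtime image
                        "command": command,  # user's training entrypoint
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }


manifest = build_k8s_training_job("ranker-train-001",
                                  "registry.example.com/ml-runtime:latest",
                                  ["python", "train.py"])
```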

Screenshot of Griffin Model

The design of MLTP emphasizes a streamlined and standardized approach:

  • A centralized service and APIs provide a consistent interface for managing the entire model development lifecycle.

  • Users can prototype using Ray clusters from their laptops or Jupyter servers, leveraging distributed computation.

  • For production-ready models, the Griffin UI is used to create a production workflow definition.

  • In cases with an existing Airflow production pipeline, Griffin's task and sensor operators are used for integrating with workflow APIs.

  • Throughout the lifecycle, users interact with MLTP through the same API interface.

Lessons Learned

During the development of Instacart’s next-generation ML training infrastructure, the team drew several lessons:

  1. Unified Solutions: By unifying ML training solutions, the team achieved a more consistent training job process and user experience, especially after transitioning to Kubernetes as the sole orchestration platform. This unification not only simplified management but also brought benefits like distributed computation and better metadata management.

  2. Balancing Flexibility and Standardization: The platform was designed to be highly flexible to support a broad spectrum of ML applications, while also incorporating standardization to cater to the majority of use cases and enhance development speed.

  3. Considering the Bigger Picture: The redesign of the ML Training Platform (MLTP) went beyond just training; it included model serving and feature engineering. This holistic approach involved collaboration with various Griffin 2.0 stakeholders, leading to co-designed data models for training jobs that integrate seamlessly into the end-to-end ML process, simplifying deployment and improving metadata management.

P.S. If you’re enjoying the content of this newsletter, please share it with your network.


Big picture: This article provides an in-depth exploration of modern JavaScript frameworks by guiding the reader through the process of building a simple framework.

What you’ll learn: It emphasizes key concepts like reactivity, DOM rendering, and using modern JavaScript APIs, illustrating the underlying principles and techniques that power frameworks like React, Vue, and Svelte.

Why this matters: Understanding these concepts is crucial for JavaScript developers, as it deepens their knowledge of client-side technologies, improves their ability to choose or build suitable frameworks for their projects, and enhances their skills in solving complex front-end challenges.


Big picture: The TanStack Router is a modern React routing solution that integrates advanced features like type-safe routing, coordinated data loading, and suspense-first design, representing a significant improvement in how web applications are structured and managed.

What you’ll learn: React developers gain perspective on a router-driven approach that leads to more efficient UI state management, enabling smoother and more responsive user experiences with less coding effort.

Why this matters: It signifies a shift towards more intuitive and powerful web development practices, aligning with higher demands for dynamic, user-centric web applications.


Big picture: This article introduces five modern web development features that are transforming the field: Native HTML Dialog, Native HTML Popovers, Container Queries for Responsiveness, CSS Color Mix, and CSS Nesting.

What you’ll learn: It teaches how these features simplify coding, enhance design capabilities, and improve the user experience, without the need for extensive custom coding or external libraries.

Why this matters: The significance lies in their potential to revolutionize web projects, making applications more efficient, visually appealing, and responsive, thus empowering developers to stay ahead in the rapidly evolving field of web development.


Big picture: The article presents a comprehensive analysis of internet traffic and e-commerce trends during key shopping events like Black Friday and Cyber Monday, offering insights into global and regional online behavior patterns.

What you’ll learn: Developers will learn about the significant shifts in traffic on major e-commerce dates, the varying device usage trends (mobile vs. desktop), and the cybersecurity landscape, particularly regarding DDoS attacks around these peak times.

Why this matters: This information is crucial for developers in optimizing website performance, enhancing user experience, and bolstering cybersecurity measures during high-traffic periods, which are pivotal for the success and resilience of e-commerce platforms.

Maximum Subarray

Missed the solution to the latest coding challenge?

This question is asked by Amazon and Microsoft. Learn the algorithm used to solve this problem here.
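The linked solution isn't reproduced here; for reference, Maximum Subarray is classically solved with Kadane's algorithm, a single O(n) pass:

```python
# Kadane's algorithm: find the largest sum of any contiguous subarray.
def max_subarray(nums: list[int]) -> int:
    best = cur = nums[0]
    for x in nums[1:]:
        # Either extend the running subarray or start fresh at x.
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4])  # 6, from [4, -1, 2, 1]
```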

JS Weekly Pulse

  • 📢 State of JS Survey: The JavaScript user survey aims to track JavaScript feature and library trends, and will openly publish results post-December 12 for community and company use.

  • 📢 2023 State of the API Report: APIs are a significant revenue driver, API pricing is a growing concern, sentiment around API investment remains positive, AI is widely used in API development, and more.

  • 📢 Announcing Deno Cron: Deno Cron simplifies web development by offering a streamlined way to create and manage scheduled jobs within Deno's serverless platform, Deno Deploy, ensuring efficient and non-overlapping task execution.

  • 📜 Using Infer to Unpack Nested Types: How to use TypeScript's infer keyword for efficient type extraction in nested structures, simplifying the development of type-safe REST APIs.

  • 📜 Debugging Cloudflare Workers: Cloudflare Workers now offers enhanced debugging with breakpoints, integrating console logs, Chrome DevTools, and IDE support for easier development and troubleshooting.

  • 📜 How Marketing Changed OOP: How JavaScript's prototypal inheritance, influenced by the Self language and marketed as "Java for the web," has shaped current JavaScript development practices, particularly in the use and understanding of prototypes.

  • 🚀 Fresh 1.6: Tailwind CSS plugin, partials with forms and error pages, improved islands bundling strategy, support for pre-generated assets, and more.

  • 🚀 Bun 1.0.15: 2x faster start for tsc, 40% speed increase for Prettier, stable WebSocket client, syntax-highlighted errors, and more.

  • 🚀 XState 5: Actor as the main unit of abstraction, inspect API, deep persistence, stronger type inference, dynamic parameters, and enqueue actions.

  • 🚀 esbuild 0.19.7: Standardization of import attributes and deprecation of import assertions.

To-Do List

  • Interesting: Global business and finance systems still rely heavily on the 60-year-old COBOL programming language, now understood by few programmers, with IBM positioning Watson as a potential aid.

  • Big Tech: Google's code review tool Critique, featuring AI enhancements and efficient guidelines, earns 97% developer satisfaction by simplifying and improving the code review process.

  • Learn: Matteo dives into Node.js's event loop, exploring its role in handling asynchronous operations efficiently and offering best practices for optimizing its performance.

  • Regex: How Regular Expressions simplify pattern matching and data cleaning in everyday engineering tasks, illustrated through code searches.

  • Side Projects: Building a Dropbox-like application, creating MySQL solutions, writing about Regular Expressions, and designing themes for apps like Obsidian are all side projects that have helped individuals land jobs.

Tools and Packages

📦️ WatermelonDB: Reactive & asynchronous database for React and React Native apps. Built for performance and scale.

📦️ Package Majors: Enhanced tool for analyzing and comparing major version downloads of a package within the last week. Here’s an example with Node.

📦️ Win11React: Recreation of the Windows 11 desktop experience in React. Pretty awesome…for Windows.

📦️ Zx: Write more complex scripts with JavaScript.


What'd you think of today's edition?
