    How I Built a Data Cleaning Pipeline Using One Messy DoorDash Dataset
    Image by Editor

     

    # Introduction

     
    According to CrowdFlower’s survey, data scientists spend 60% of their time organizing and cleaning data.

    In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each of which includes dozens of features such as delivery time, total items, and store category (e.g., Mexican, Thai, or American cuisine).

     

    # Predicting Food Delivery Times with DoorDash Data

     
    Predicting Food Delivery Times with DoorDash Data
     
    DoorDash aims to accurately estimate how long it takes to deliver food, from the moment a customer places an order to the time it arrives at their door. In this data project, we’re tasked with developing a model that predicts the total delivery duration based on historical delivery data.

    However, we won’t do the whole project, i.e., we won’t build a predictive model. Instead, we’ll use the dataset provided in the project and create a data cleaning pipeline.

    Our workflow consists of two main steps.

     
    Data Cleaning Pipeline
     

     

    # Data Exploration

     
    Data Cleaning Pipeline
     

    Let’s start by loading and viewing the first few rows of the dataset.

     

    // Load and Preview the Dataset

    import pandas as pd
    df = pd.read_csv("historical_data.csv")
    df.head()

     

    Here is the output.

     
    Data Cleaning Pipeline
     

    This dataset includes datetime columns that capture the order creation time and the actual delivery time, which can be used to calculate delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various kinds of data analysis. We can already see that there are some NaN values, which we’ll explore more closely in the following step.

     

    // Explore the Columns With info()

    Let’s examine all column names with the info() method. We’ll use this method throughout the article to see the changes in column value counts; it’s an indicator of missing data and overall data health.
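
    The call itself is a one-liner on the df we loaded above:

    df.info()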

     

    Here is the output.

     
    Data Cleaning Pipeline
     

    As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which can affect our analysis if not handled properly. One last thing: the created_at and actual_delivery_time data types are objects; these should be datetime.

     

    # Building the Data Cleaning Pipeline

     
    In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses common issues such as timestamp formatting, missing values, and irrelevant features.
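
    As a rough sketch of where we are headed (the clean_doordash_data helper name is just illustrative; each stage is detailed below), the whole pipeline could be wired together like this:

    import pandas as pd

    def clean_doordash_data(path: str) -> pd.DataFrame:
        # Stage 1: load the raw CSV
        df = pd.read_csv(path)
        # Stage 2: convert the timestamp columns to datetime (next subsection)
        for col in ["created_at", "actual_delivery_time"]:
            df[col] = pd.to_datetime(df[col], errors="coerce")
        # Stage 3: impute store_primary_category from each store's most frequent value (shown later)
        # Stage 4: drop any rows that still contain NaNs
        return df.dropna()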
     
    Building the Data Cleaning Pipeline
     

    // Fixing the Date and Time Columns’ Data Types

    Before doing data analysis, we need to fix the columns that record time. Otherwise, the calculation we mentioned (actual_delivery_time - created_at) will go wrong.

    What we’re fixing:

    • created_at: when the order was placed
    • actual_delivery_time: when the food arrived

    These two columns are stored as objects, so to be able to do calculations correctly, we have to convert them to the datetime format. To do that, we can use the datetime functions in pandas. Here is the code.

    import pandas as pd
    df = pd.read_csv("historical_data.csv")
    # Convert timestamp strings to datetime objects
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
    df.info()

     

    Here is the output.

     
    Building the Data Cleaning Pipeline
     

    As you can see from the screenshot above, created_at and actual_delivery_time are now datetime objects.

     
    Building the Data Cleaning Pipeline
     

    Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we’ll focus on cleaning it first.
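
    If you prefer exact numbers to eyeballing the info() output, a quick optional check ranks the columns by missing values:

    # Count NaNs per column, most-missing first
    df.isna().sum().sort_values(ascending=False).head()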

     

    // Data Imputation With mode()

    One of the messiest columns in the dataset, evident from its high number of missing values, is store_primary_category. It tells us what kind of food stores are available, like Mexican, American, and Thai. However, many rows are missing this information, which is a problem. For instance, it could limit how we can group or analyze the data. So how can we fix it?

    We’ll fill these rows instead of dropping them. To do that, we will use smarter imputation.

    We build a mapping from each store_id to its most frequent category, and then use that mapping to fill in missing values. Let’s see the dataset before doing that.

     
    Data Imputation With mode
     

    Here is the code.

    import numpy as np

    # Global most-frequent category as a fallback
    global_mode = df["store_primary_category"].mode().iloc[0]

    # Build a store-level mapping to the most frequent category (fast and robust)
    store_mode = (
        df.groupby("store_id")["store_primary_category"]
          .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
    )

    # Fill missing categories using the store-level mode, then fall back to the global mode
    df["store_primary_category"] = (
        df["store_primary_category"]
          .fillna(df["store_id"].map(store_mode))
          .fillna(global_mode)
    )

    df.info()

     

    Here is the output.

     
    Data Imputation With mode
     

    As you can see from the screenshot above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

    df["store_primary_category"].isna().sum()

     

    Here is the output showing the number of NaN values. It’s zero; we got rid of all of them.

     
    Data Imputation With mode
     

    And let’s see the dataset after the imputation.

     
    Data Imputation With mode

     

    // Dropping Remaining NaNs

    In the previous step, we corrected store_primary_category, but did you notice something? The non-null counts across the columns still don’t match!

    This is a clear sign that we’re still dealing with missing values in some part of the dataset. Now, when it comes to data cleaning, we have two options:

    • Fill the missing values
    • Drop them

    Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you’d have to be more careful. In that case, it’s advisable to analyze each column, establish standards (decide how missing values will be filled, using the mean, median, most frequent value, or domain-specific defaults), and then fill them.
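
    For illustration only, on a smaller dataset you might fill instead of drop; here is a minimal sketch (the blanket column selection is an assumption, not something we apply to this pipeline):

    # Sketch: fill numeric columns with the median, categorical ones with the mode
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    categorical_cols = df.select_dtypes(include="object").columns
    for col in categorical_cols:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])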

    To remove the NaNs, we will use the dropna() method from the pandas library. We set inplace=True to apply the changes directly to the DataFrame without needing to assign it again. Let’s see the dataset at this point.

     
    Dropping NaNs
     

    Here is the code.

    df.dropna(inplace=True)
    df.info()

     

    Here is the output.

     
    Dropping NaNs
     

    As you can see from the screenshot above, every column now has the same number of non-null values.

    Let’s see the dataset after all the changes.

     
    Dropping NaNs
     

     

    // What Can You Do Next?

    Now that we have a clean dataset, here are a few things you can do next:

    • Perform EDA to understand delivery patterns.
    • Engineer new features like delivery hours or a busy-dashers ratio to add more meaning to your analysis (see the sketch after this list).
    • Analyze correlations between variables to improve your model’s performance.
    • Build different regression models and find the best-performing one.
    • Predict the delivery duration with the best-performing model.
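
    As a starting point for the feature-engineering item above, here is a minimal sketch. The derived column names (delivery_duration_min, order_hour, busy_dashers_ratio) are illustrative, and the dasher columns (total_busy_dashers, total_onshift_dashers) are assumed to exist in this dataset, so adjust them to the actual schema.

    # Target: total delivery duration in minutes
    df["delivery_duration_min"] = (
        df["actual_delivery_time"] - df["created_at"]
    ).dt.total_seconds() / 60

    # Example features: the hour the order was placed and the busy-dashers ratio
    # (total_busy_dashers / total_onshift_dashers are assumed column names)
    df["order_hour"] = df["created_at"].dt.hour
    df["busy_dashers_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]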

     

    # Final Thoughts

     
    In this article, we cleaned a real-life dataset from DoorDash by addressing common data quality issues, such as fixing incorrect data types and handling missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.

    Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these issues. Thanks for reading!
     
     

    Nate Rosidi is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


