    UK Tech Insider

    Working with Billion-Row Datasets in Python (Using Vaex)

    By Oliver Chambers, February 3, 2026



    Image by Author

     

    # Introduction

     
    Handling massive datasets containing billions of rows is a major challenge in data science and analytics. Traditional tools like Pandas work well for small to medium datasets that fit in system memory, but as dataset sizes grow, they become slow, consume large amounts of random access memory (RAM), and often crash with out-of-memory (OOM) errors.

    This is where Vaex, a high-performance Python library for out-of-core data processing, comes in. Vaex lets you inspect, modify, visualize, and analyze large tabular datasets efficiently and with a small memory footprint, even on a standard laptop.

     

    # What Is Vaex?

     
    Vaex is a Python library for lazy, out-of-core DataFrames (similar to Pandas) designed for data larger than your RAM.

    Key characteristics:

    • Vaex is designed to handle massive datasets efficiently by working directly with data on disk and reading only the portions needed, avoiding loading entire files into memory.
    • Vaex uses lazy evaluation, meaning operations are only computed when results are actually requested, and it can open columnar file formats (which store data by column instead of by row) like HDF5, Apache Arrow, and Parquet instantly via memory mapping.
    • Built on optimized C/C++ backends, Vaex can compute statistics and perform operations on billions of rows per second, making large-scale analysis fast even on modest hardware.
    • It has a Pandas-like application programming interface (API) that makes the transition smoother for users already familiar with Pandas, helping them leverage big data capabilities without a steep learning curve.

     

    # Comparing Vaex And Dask

    Vaex is not similar to Dask as a whole but is similar to Dask DataFrames, which are built on top of Pandas DataFrames. This means that Dask inherits certain Pandas issues, such as the requirement that data be loaded completely into RAM to be processed in some contexts. This is not the case for Vaex. Vaex does not make a DataFrame copy, so it can process larger DataFrames on machines with less main memory. Both Vaex and Dask use lazy processing. The primary difference is that Vaex calculates the field only when needed, whereas with Dask, we need to explicitly call the compute() function. Data needs to be in HDF5 or Apache Arrow format to take full advantage of Vaex.

     

    # Why Traditional Tools Struggle

    Tools like Pandas load the entire dataset into RAM before processing. For datasets larger than memory, this leads to:

    • Slow performance
    • System crashes (OOM errors)
    • Limited interactivity

    Vaex never loads the entire dataset into memory; instead, it:

    • Streams data from disk
    • Uses virtual columns and lazy evaluation to delay computation
    • Only materializes results when explicitly needed

    This enables analysis of huge datasets even on modest hardware.

     

    # How Vaex Works Under The Hood


    // Out-of-Core Execution

    Vaex reads data from disk as needed using memory mapping. This allows it to operate on data files much larger than RAM can hold.
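
To make the idea concrete, here is a minimal sketch of memory mapping using only Python's standard library. It is not how Vaex is implemented internally; the file name `column.bin` and its layout (raw little-endian float64 values stored back to back) are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical file: one column of float64 values stored back to back.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
values = [float(i) for i in range(1_000)]
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))


def read_value(mm, index):
    """Read the float64 at `index` without loading the rest of the file."""
    (value,) = struct.unpack_from("<d", mm, index * 8)
    return value


with open(path, "rb") as f:
    # The operating system pages in only the parts of the file we touch.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = read_value(mm, 0)
        last = read_value(mm, len(values) - 1)

print(first, last)
```

Opening the mapping costs roughly the same regardless of file size, because no data is read until a specific offset is accessed.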

     

    // Lazy Evaluation

    Instead of performing each operation immediately, Vaex builds a computation graph. Calculations are only executed when you request a result (e.g. when printing or plotting).
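
The following toy sketch shows the general pattern of deferring work until a result is requested. The `Lazy` class and `load_column` function are hypothetical names invented for this example; they do not mirror Vaex's own classes.

```python
class Lazy:
    """A node in a tiny expression graph; nothing runs until evaluate()."""

    def __init__(self, func, *parents):
        self.func = func
        self.parents = parents

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, as_lazy(other))

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, as_lazy(other))

    def evaluate(self):
        # Walk the graph depth-first and compute only now.
        args = [parent.evaluate() for parent in self.parents]
        return self.func(*args)


def as_lazy(value):
    """Wrap a plain value as a leaf node."""
    return value if isinstance(value, Lazy) else Lazy(lambda: value)


calls = []


def load_column():
    calls.append("load")  # record when the "expensive" read happens
    return 10


x = Lazy(load_column)
expr = (x + 5) * 2            # builds a graph; load_column has not run yet
calls_before = list(calls)
result = expr.evaluate()      # the work happens here
print(calls_before, result)
```

Building `expr` only records what should be computed; the simulated read runs once, at the point where `evaluate()` is called.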

     

    // Virtual Columns

    Virtual columns are expressions defined on the dataset that do not occupy memory until computed. This saves RAM and speeds up workflows.
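
As a rough illustration of the concept (not Vaex's implementation), the toy `Table` class below stores only a formula for a derived column and computes values on the fly when the column is iterated. All names here are hypothetical.

```python
class Table:
    """Toy table: real columns hold data, virtual columns hold a formula."""

    def __init__(self, columns):
        self.columns = columns   # name -> list of values
        self.virtual = {}        # name -> function of a row dict

    def add_virtual_column(self, name, formula):
        # Only the formula is stored; no per-row values are allocated.
        self.virtual[name] = formula

    def iter_column(self, name):
        """Yield values one at a time, computing virtual ones on the fly."""
        n_rows = len(next(iter(self.columns.values())))
        for i in range(n_rows):
            if name in self.columns:
                yield self.columns[name][i]
            else:
                row = {key: col[i] for key, col in self.columns.items()}
                yield self.virtual[name](row)


table = Table({"tip": [2.0, 1.0, 3.0], "fare": [10.0, 20.0, 12.0]})
table.add_virtual_column("tip_pct", lambda r: r["tip"] / r["fare"] * 100)
tip_pcts = list(table.iter_column("tip_pct"))
print(tip_pcts)
```

The derived column never appears in `table.columns`, which is the sense in which it costs no storage until it is evaluated.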

     

    # Getting Started With Vaex


    // Installing Vaex

    Create a clean virtual environment:

    conda create -n vaex_demo python=3.9
    conda activate vaex_demo

     

    Install Vaex with pip:

    pip install vaex-core vaex-hdf5 vaex-viz


    Upgrade Vaex:

    pip install --upgrade vaex


    Install supporting libraries:

    pip install pandas numpy matplotlib

     

     

    // Opening Large Datasets

    Vaex supports various modern storage formats for handling large datasets. It can work directly with HDF5, Apache Arrow, and Parquet files, all of which are optimized for efficient disk access and fast analytics. While Vaex can also read CSV files, it first needs to convert them to a more efficient format to improve performance when working with large datasets.
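
For the CSV case, a hedged sketch of the conversion step might look like the following. It assumes Vaex is installed and that your version of `vaex.from_csv` accepts the `convert` and `chunk_size` arguments; check the documentation for your installed release. The function name and default chunk size are hypothetical.

```python
def open_csv_as_hdf5(csv_path, chunk_size=5_000_000):
    """Open a CSV through Vaex, converting it to HDF5 on first use.

    Assumes Vaex is installed; `csv_path` is a hypothetical file name.
    """
    import vaex  # imported here so the sketch can be defined without Vaex

    # convert=True asks Vaex to write an HDF5 copy of the CSV and open that
    # instead; chunk_size bounds how many rows are parsed at a time.
    return vaex.from_csv(csv_path, convert=True, chunk_size=chunk_size)
```

A call such as `open_csv_as_hdf5("your_huge_dataset.csv")` would pay the parsing cost once; later sessions can open the converted file directly.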

    How to open a Parquet file:

    import vaex
    
    df = vaex.open("your_huge_dataset.parquet")
    print(df)

     

    Now you can inspect the dataset structure without loading it into memory.

     

    // Core Operations In Vaex

    Filtering data:

    filtered = df[df.sales > 1000]


    This does not compute the result immediately; instead, the filter is registered and applied only when needed.

    Group-by and aggregations:

    result = df.groupby("category", agg=vaex.agg.mean("sales"))
    print(result)


    Vaex computes aggregations efficiently using parallel algorithms and minimal memory.
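
To see why a group-by can run with little memory, consider this single-threaded sketch. It is not Vaex's parallel algorithm; it only illustrates that per-group running sums and counts are enough to produce means, so memory grows with the number of groups rather than the number of rows. The data and function name are made up.

```python
from collections import defaultdict


def grouped_mean(chunks):
    """Compute per-group means from an iterable of (key, value) chunks."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chunk in chunks:              # one chunk in memory at a time
        for key, value in chunk:
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical data split into two chunks.
chunks = [
    [("credit", 10.0), ("cash", 20.0)],
    [("credit", 30.0), ("cash", 40.0), ("mobile", 5.0)],
]
means = grouped_mean(chunks)
print(means)
```

Because the accumulators from separate chunks can simply be added together, the same idea extends naturally to processing chunks in parallel.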

    Computing statistics:

    mean_price = df["price"].mean()
    print(mean_price)

     

    Vaex computes this on the fly by scanning the dataset in chunks.
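
A chunked scan can be sketched in a few lines of plain Python. This is an illustration of the general technique under simplified assumptions, not Vaex code; `read_prices` stands in for whatever reads a slice of a column from disk.

```python
def chunked_mean(read_chunk, n_rows, chunk_size):
    """Mean of n_rows values fetched chunk by chunk via read_chunk(start, stop)."""
    total = 0.0
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        total += sum(read_chunk(start, stop))  # only this slice is in memory
    return total / n_rows


# Hypothetical "column": prices 0, 1, ..., 9_999 produced on demand.
def read_prices(start, stop):
    return range(start, stop)


mean_price = chunked_mean(read_prices, n_rows=10_000, chunk_size=1_024)
print(mean_price)
```

Peak memory depends on `chunk_size`, not on the total number of rows, which is what makes the approach usable on files larger than RAM.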

     

    // Demonstrating With A Taxi Dataset

    We will create a realistic 50 million row taxi dataset to demonstrate Vaex's capabilities:

    import vaex
    import numpy as np
    import pandas as pd
    import time

     

    Set random seed for reproducibility:

    np.random.seed(42)
    print("Creating 50 million row dataset...")
    n = 50_000_000

     

    Generate realistic taxi trip data:

    data = {
        'passenger_count': np.random.randint(1, 7, n),
        'trip_distance': np.random.exponential(3, n),
        'fare_amount': np.random.gamma(10, 1.5, n),
        'tip_amount': np.random.gamma(2, 1, n),
        'total_amount': np.random.gamma(12, 1.8, n),
        'payment_type': np.random.choice(['credit', 'cash', 'mobile'], n),
        'pickup_hour': np.random.randint(0, 24, n),
        'pickup_day': np.random.randint(1, 8, n),
    }

     

    Create Vaex DataFrame:

    df_vaex = vaex.from_dict(data)

     

    Export to HDF5 format (efficient for Vaex):

    df_vaex.export_hdf5('taxi_50M.hdf5')
    print(f"Created dataset with {n:,} rows")

     

    Output:

    Shape: (50000000, 8)
    Created dataset with 50,000,000 rows

     

    We now have a 50 million row dataset with 8 columns.

     

    // Vaex vs. Pandas Performance

    Opening large files with Vaex (memory-mapped):

    start = time.time()
    df_vaex = vaex.open('taxi_50M.hdf5')
    vaex_time = time.time() - start
    
    print(f"Vaex opened {df_vaex.shape[0]:,} rows in {vaex_time:.4f} seconds")
    print("Memory usage: ~0 MB (memory-mapped)")

     

    Output:

    Vaex opened 50,000,000 rows in 0.0199 seconds
    Memory usage: ~0 MB (memory-mapped)

     

    Pandas: Load into memory (do not try this with 50M rows!):

    # This may fail on most machines
    df_pandas = pd.read_hdf('taxi_50M.hdf5')

     

    This can result in a memory error! Vaex opens files almost instantly, regardless of size, because it does not load data into memory.

    Basic aggregations: Calculate statistics on 50 million rows:

    start = time.time()
    stats = {
        'mean_fare': df_vaex.fare_amount.mean(),
        'mean_distance': df_vaex.trip_distance.mean(),
        'total_revenue': df_vaex.total_amount.sum(),
        'max_fare': df_vaex.fare_amount.max(),
        'min_fare': df_vaex.fare_amount.min(),
    }
    agg_time = time.time() - start
    
    print(f"\nComputed 5 aggregations in {agg_time:.4f} seconds:")
    print(f"  Mean fare: ${stats['mean_fare']:.2f}")
    print(f"  Mean distance: {stats['mean_distance']:.2f} miles")
    print(f"  Total revenue: ${stats['total_revenue']:,.2f}")
    print(f"  Fare range: ${stats['min_fare']:.2f} - ${stats['max_fare']:.2f}")

     

    Output:

    Computed 5 aggregations in 0.8771 seconds:
      Mean fare: $15.00
      Mean distance: 3.00 miles
      Total revenue: $1,080,035,827.27
      Fare range: $1.25 - $55.30

     

    Filtering operations: Filter long trips:

    start = time.time()
    long_trips = df_vaex[df_vaex.trip_distance > 10]
    filter_time = time.time() - start
    
    print(f"\nFiltered for trips > 10 miles in {filter_time:.4f} seconds")
    print(f"  Found: {len(long_trips):,} long trips")
    print(f"  Percentage: {(len(long_trips)/len(df_vaex)*100):.2f}%")

     

    Output:

    Filtered for trips > 10 miles in 0.0486 seconds
    Found: 1,784,122 long trips
    Percentage: 3.57%

     

    Multiple conditions:

    start = time.time()
    premium_trips = df_vaex[(df_vaex.trip_distance > 5) & 
                            (df_vaex.fare_amount > 20) & 
                            (df_vaex.payment_type == 'credit')]
    multi_filter_time = time.time() - start
    
    print(f"\nMultiple condition filter in {multi_filter_time:.4f} seconds")
    print(f"  Premium trips (>5mi, >$20, credit): {len(premium_trips):,}")

     

    Output:

    Multiple condition filter in 0.0582 seconds
    Premium trips (>5mi, >$20, credit): 457,191

     

    Group-by operations:

    start = time.time()
    by_payment = df_vaex.groupby('payment_type', agg={
        'mean_fare': vaex.agg.mean('fare_amount'),
        'mean_tip': vaex.agg.mean('tip_amount'),
        'total_trips': vaex.agg.count(),
        'total_revenue': vaex.agg.sum('total_amount')
    })
    groupby_time = time.time() - start
    
    print(f"\nGroupBy operation in {groupby_time:.4f} seconds")
    print(by_payment.to_pandas_df())

     

    Output:

    GroupBy operation in 5.6362 seconds
      payment_type  mean_fare  mean_tip  total_trips  total_revenue
    0       credit  15.001817  2.000065     16663623   3.599456e+08
    1       mobile  15.001200  1.999679     16667691   3.600165e+08
    2         cash  14.999397  2.000115     16668686   3.600737e+08

     

    More complex group-by:

    start = time.time()
    by_hour = df_vaex.groupby('pickup_hour', agg={
        'avg_distance': vaex.agg.mean('trip_distance'),
        'avg_fare': vaex.agg.mean('fare_amount'),
        'trip_count': vaex.agg.count()
    })
    complex_groupby_time = time.time() - start
    
    print(f"\nGroupBy by hour in {complex_groupby_time:.4f} seconds")
    print(by_hour.to_pandas_df().head(10))

     

    Output:

    GroupBy by hour in 1.6910 seconds
       pickup_hour  avg_distance   avg_fare  trip_count
    0            0      2.998120  14.997462     2083481
    1            1      3.000969  14.998814     2084650
    2            2      3.003834  15.001777     2081962
    3            3      3.001263  14.998196     2081715
    4            4      2.998343  14.999593     2083882
    5            5      2.997586  15.003988     2083421
    6            6      2.999887  15.011615     2083213
    7            7      3.000240  14.996892     2085156
    8            8      3.002640  15.000326     2082704
    9            9      2.999857  14.997857     2082284

     

    // Advanced Vaex Features

    Virtual columns (computed columns) allow adding columns with no data copying:

    df_vaex['tip_percentage'] = (df_vaex.tip_amount / df_vaex.fare_amount) * 100
    df_vaex['is_generous_tipper'] = df_vaex.tip_percentage > 20
    # Parentheses let the expression span two lines and make the grouping explicit
    df_vaex['rush_hour'] = (
        ((df_vaex.pickup_hour >= 7) & (df_vaex.pickup_hour <= 9))
        | ((df_vaex.pickup_hour >= 17) & (df_vaex.pickup_hour <= 19))
    )

     

    These are computed on the fly with no memory overhead:

    print("Added 3 virtual columns with zero memory overhead")
    generous_tippers = df_vaex[df_vaex.is_generous_tipper]
    print(f"Generous tippers (>20% tip): {len(generous_tippers):,}")
    
    rush_hour_trips = df_vaex[df_vaex.rush_hour]
    print(f"Rush hour trips: {len(rush_hour_trips):,}")

     

    Output:

    VIRTUAL COLUMNS
    Added 3 virtual columns with zero memory overhead
    Generous tippers (>20% tip): 11,997,433
    Rush hour trips: 12,498,848
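
The rush-hour expression combines `&` and `|`, so operator precedence matters. The sketch below uses plain Python integers rather than Vaex expressions (the function names are hypothetical) to show why each comparison needs its own parentheses.

```python
def is_rush_hour(hour):
    # & binds tighter than |, so this reads as (morning) | (evening).
    return (hour >= 7) & (hour <= 9) | (hour >= 17) & (hour <= 19)


def is_morning_unparenthesised(hour):
    # Without parentheses, & is applied before the comparisons, so this
    # is parsed as the chained comparison: hour >= (7 & hour) <= 9.
    return hour >= 7 & hour <= 9


rush_hours = [h for h in range(24) if is_rush_hour(h)]
print(rush_hours)
print(is_rush_hour(12), is_morning_unparenthesised(12))
```

Six of the 24 hours qualify, which is consistent with roughly a quarter of the uniformly distributed pickup hours being flagged in the output above. The unparenthesised variant wrongly accepts hour 12, which is the kind of silent bug the parentheses prevent.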

     

    Correlation analysis:

    corr = df_vaex.correlation(df_vaex.trip_distance, df_vaex.fare_amount)
    print(f"Correlation (distance vs fare): {corr:.4f}")

     

    Percentiles:

    try:
        percentiles = df_vaex.percentile_approx('fare_amount', [25, 50, 75, 90, 95, 99])
    except AttributeError:
        percentiles = [
            df_vaex.fare_amount.quantile(0.25),
            df_vaex.fare_amount.quantile(0.50),
            df_vaex.fare_amount.quantile(0.75),
            df_vaex.fare_amount.quantile(0.90),
            df_vaex.fare_amount.quantile(0.95),
            df_vaex.fare_amount.quantile(0.99),
        ]
    
    print(f"\nFare percentiles:")
    print(f"25th: ${percentiles[0]:.2f}")
    print(f"50th (median): ${percentiles[1]:.2f}")
    print(f"75th: ${percentiles[2]:.2f}")
    print(f"90th: ${percentiles[3]:.2f}")
    print(f"95th: ${percentiles[4]:.2f}")
    print(f"99th: ${percentiles[5]:.2f}")

     

    Standard deviation:

    std_fare = df_vaex.fare_amount.std()
    print(f"\nStandard deviation of fares: ${std_fare:.2f}")

     

    Additional useful statistics:

    print(f"\nAdditional statistics:")
    print(f"Mean: ${df_vaex.fare_amount.mean():.2f}")
    print(f"Min: ${df_vaex.fare_amount.min():.2f}")
    print(f"Max: ${df_vaex.fare_amount.max():.2f}")

     

    Output:

    Correlation (distance vs fare): -0.0001
    
    Fare percentiles:
      25th: $11.57
      50th (median): $nan
      75th: $nan
      90th: $nan
      95th: $nan
      99th: $nan
    
    Standard deviation of fares: $4.74
    
    Additional statistics:
      Mean: $15.00
      Min: $1.25
      Max: $55.30
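
Approximate percentiles on large data are commonly computed from a fixed-size histogram rather than by sorting every value. The sketch below shows that general technique in plain Python; whether it matches what Vaex does internally is an assumption, and the function name, bin count, and value range are chosen purely for illustration.

```python
import random


def histogram_percentile(values, q, lo, hi, bins=1024):
    """Approximate the q-th percentile (0-100) of values assumed to lie in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    n = 0
    for v in values:                      # single pass, O(bins) memory
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
        n += 1
    target = q / 100 * n
    running = 0
    for idx, count in enumerate(counts):
        running += count
        if running >= target:
            return lo + (idx + 0.5) * width   # bin centre as the estimate
    return hi


random.seed(0)
data = [random.uniform(0, 100) for _ in range(100_000)]
median = histogram_percentile(data, 50, lo=0, hi=100)
print(round(median, 1))
```

The accuracy of such an estimate is limited by the bin width, so the chosen value range and number of bins directly affect the result.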

     

     

    // Data Export

    # Export filtered data
    high_value_trips = df_vaex[df_vaex.total_amount > 50]

     

    Exporting to different formats:

    start = time.time()
    high_value_trips.export_hdf5('high_value_trips.hdf5')
    export_time = time.time() - start
    print(f"Exported {len(high_value_trips):,} rows to HDF5 in {export_time:.4f}s")

     

    You can also export to CSV, Parquet, etc.:

    high_value_trips.export_csv('high_value_trips.csv')
    high_value_trips.export_parquet('high_value_trips.parquet')

     

    Output:

    Exported 13,054 rows to HDF5 in 5.4508s

     

    // Performance Summary Dashboard

    print("VAEX PERFORMANCE SUMMARY")
    print(f"Dataset size:           {n:,} rows")
    print(f"File size on disk:      ~2.4 GB")
    print(f"RAM usage:              ~0 MB (memory-mapped)")
    print()
    print(f"Open time:              {vaex_time:.4f} seconds")
    print(f"Single aggregation:     {agg_time:.4f} seconds")
    print(f"Simple filter:          {filter_time:.4f} seconds")
    print(f"Complex filter:         {multi_filter_time:.4f} seconds")
    print(f"GroupBy operation:      {groupby_time:.4f} seconds")
    print()
    print(f"Throughput:             ~{n/groupby_time:,.0f} rows/second")

     

    Output:

    VAEX PERFORMANCE SUMMARY
    Dataset size:           50,000,000 rows
    File size on disk:      ~2.4 GB
    RAM usage:              ~0 MB (memory-mapped)
    
    Open time:              0.0199 seconds
    Single aggregation:     0.8771 seconds
    Simple filter:          0.0486 seconds
    Complex filter:         0.0582 seconds
    GroupBy operation:      5.6362 seconds
    
    Throughput:             ~8,871,262 rows/second
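
The throughput figure is simply the row count divided by the elapsed time. As a quick check, the sketch below redoes that arithmetic from the timings reported in the summary; the dictionary keys are labels chosen for this example.

```python
n_rows = 50_000_000

# Timings copied from the summary above (seconds).
timings = {
    "five aggregations": 0.8771,
    "group-by": 5.6362,
}

throughput = {name: n_rows / seconds for name, seconds in timings.items()}
for name, rows_per_second in throughput.items():
    print(f"{name}: ~{rows_per_second:,.0f} rows/second")
```

The group-by value computed this way differs slightly from the one in the summary because the timing used here is rounded to four decimal places.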

     

     

    # Concluding Thoughts

    Vaex is ideal when you are working with large datasets that exceed 1 GB and do not fit in RAM, exploring big data, performing feature engineering with millions of rows, or building data preprocessing pipelines.

    You should not use Vaex for datasets smaller than 100 MB. For those, using Pandas is simpler. If you are dealing with complex joins across multiple tables, structured query language (SQL) databases may be a better fit. If you need the full Pandas API, note that Vaex has limited compatibility. For real-time streaming data, other tools are more appropriate.

    Vaex fills a gap in the Python data science ecosystem: the ability to work on billion-row datasets efficiently and interactively without loading everything into memory. Its out-of-core architecture, lazy execution model, and optimized algorithms make it a powerful tool for big data exploration even on a laptop. Whether you are exploring massive logs, scientific surveys, or high-frequency time series, Vaex helps bridge the gap between ease of use and big data scalability.
     
     

    Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.


