    How To Use Synthetic Data To Build a Portfolio Project

    By Oliver Chambers | September 23, 2025 | 11 Mins Read
    [Image: How To Use Synthetic Data To Build a Portfolio Project]
    Image by Author | Canva

     

    # Introduction

     
    Finding real-world datasets can be challenging because they are often private (protected), incomplete (missing features), or expensive (behind a paywall). Synthetic datasets can solve these problems by letting you generate the data based on your project needs.

    Synthetic data is artificially generated information that mimics real-life datasets. You can control the size, complexity, and realism of the synthetic dataset to tailor it to your data needs.

    In this article, we’ll explore synthetic data generation methods. We’ll then build a portfolio project by analyzing the data, creating a machine learning model, and using AI to develop a complete portfolio project with a Streamlit app.

     

    # How to Generate Synthetic Data

    Synthetic data is often created randomly, using simulations, rules, or AI.

    [Image: How to Generate Synthetic Data]

    // Method 1: Random Data Generation

    To generate data randomly, we’ll use simple functions to create values without any specific rules.

    It’s useful for testing, but it won’t capture realistic relationships between features. We’ll do it using NumPy’s random module and create a Pandas DataFrame.

    import numpy as np
    import pandas as pd

    np.random.seed(42)  # fix the seed so the values are reproducible
    df_random = pd.DataFrame({
        "feature_a": np.random.randint(1, 100, 5),          # integers in [1, 100)
        "feature_b": np.random.rand(5),                     # floats in [0, 1)
        "feature_c": np.random.choice(["X", "Y", "Z"], 5)   # random categories
    })
    df_random.head()

     

    Here is the output.

    [Image: How to Generate Synthetic Data]
     

    // Method 2: Rule-Based Data Generation

    Rule-based data generation is a smarter and more realistic method than random data generation. It follows an exact formula or set of rules, which makes the output purposeful and consistent.

    In our example, the size of a house is directly linked to its price. To show this clearly, we’ll create a dataset with both size and price. We’ll define the relationship with a formula:

    Price = size × 300 + ε (random noise)

    This way, you can see the correlation while keeping the data fairly realistic.

    np.random.seed(42)
    n = 5
    size = np.random.randint(500, 3500, n)                    # square feet
    price = size * 300 + np.random.randint(5000, 20000, n)    # $300 per sqft plus noise

    df_rule = pd.DataFrame({
        "size_sqft": size,
        "price_usd": price
    })
    df_rule.head()

     

    Here is the output.

    [Image: How to Generate Synthetic Data]
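The difference between the first two methods can be checked numerically. The sketch below is an illustration rather than part of the original walkthrough: it assumes 1,000 rows instead of 5 so the estimate is stable, and the price range used for the purely random column is made up. It compares the correlation between size and price for independently drawn columns and for the rule-based formula.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # more rows than the 5 used above, so the correlation estimate is stable

# Random generation: the two columns are drawn independently of each other
random_df = pd.DataFrame({
    "size_sqft": rng.integers(500, 3500, n),
    "price_usd": rng.integers(150_000, 1_050_000, n),  # made-up price range
})

# Rule-based generation: price = size * 300 + noise
size = rng.integers(500, 3500, n)
rule_df = pd.DataFrame({
    "size_sqft": size,
    "price_usd": size * 300 + rng.integers(5_000, 20_000, n),
})

corr_random = random_df["size_sqft"].corr(random_df["price_usd"])
corr_rule = rule_df["size_sqft"].corr(rule_df["price_usd"])
print(f"random:     {corr_random:.3f}")
print(f"rule-based: {corr_rule:.3f}")
```

The random columns should show a correlation close to zero, while the rule-based columns should be almost perfectly correlated because the noise is small relative to size × 300.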
     

    // Method 3: Simulation-Based Data Generation

    The simulation-based data generation method combines random variation with rules from the real world. This mix creates datasets that behave like real ones.

    What do we know about housing?

    • Bigger homes usually cost more
    • Some cities cost more than others
    • There is a baseline price

    How do we build the dataset?

    1. Pick a city at random
    2. Draw a home size
    3. Set bedrooms between 1 and 5
    4. Compute the price with a clear rule

    Price rule: We start with a base price, scale it by a city price bump, and then add sqft × rate.

    price_usd = base_price × city_bump + sqft × rate

    Here is the code.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    CITIES = ["los_angeles", "san_francisco", "san_diego"]
    # City price bump: higher means pricier city
    CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}

    def make_data(n_rows=10):
        city = rng.choice(CITIES, size=n_rows)
        # Most homes are near 1,500 sqft, some smaller or larger
        sqft = rng.normal(1500, 600, n_rows).clip(350, 4500).round()
        beds = rng.integers(1, 6, n_rows)  # upper bound is exclusive, so 1 to 5

        base = 220_000
        rate = 350  # dollars per sqft

        bump = np.array([CITY_BUMP[c] for c in city])
        price = base * bump + sqft * rate

        return pd.DataFrame({
            "city": city,
            "sqft": sqft.astype(int),
            "beds": beds,
            "price_usd": price.round(0).astype(int),
        })

    df = make_data()
    df.head()

     

    Here is the output.

    [Image: How do we build the synthetic dataset]
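To make the price rule concrete, the arithmetic for a single home can be worked through directly. The sketch below reuses the base price, rate, and city bumps from the code above; the helper name `price_rule` is introduced here only for illustration.

```python
# Constants copied from the make_data example above
BASE_PRICE = 220_000
RATE_PER_SQFT = 350
CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}

def price_rule(city: str, sqft: int) -> int:
    """price_usd = base_price * city_bump + sqft * rate"""
    return round(BASE_PRICE * CITY_BUMP[city] + sqft * RATE_PER_SQFT)

# The same 1,500 sqft home priced in each city
same_home = {city: price_rule(city, 1_500) for city in CITY_BUMP}
print(same_home)
```

For a 1,500 sqft home, the size term is 1,500 × 350 = 525,000 in every city, so the only difference comes from the scaled base price: 242,000 in Los Angeles, 297,000 in San Francisco, and 220,000 in San Diego.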
     

    // Method 4: AI-Powered Data Generation

    To have AI create your dataset, you need a clear prompt. AI is powerful, but it works best when you set simple, sensible rules.

    In the following prompt, we’ll include:

    • Domain: What is the data about?
    • Features: Which columns do we want?
      • City, neighborhood, sqft, bedrooms, bathrooms
    • Relationships: How do the features connect?
      • Price depends on city, sqft, bedrooms, and crime index
    • Format: How should AI return it?

    Here is the prompt.

     

    Generate Python code that creates a synthetic California real estate dataset.
    The dataset should have 10,000 rows with columns: city, neighborhood, latitude, longitude, sqft, bedrooms, bathrooms, lot_sqft, year_built, property_type, has_garage, condition, school_score, crime_index, dist_km_center, price_usd.
    Cities: Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
    Price should depend on city premium, sqft, bedrooms, bathrooms, lot size, school score, crime index, and distance from city center.
    Include some random noise, missing values, and a few outliers.
    Return the result as a Pandas DataFrame and save it to ‘ca_housing_synth.csv’

     

    Let’s use this prompt with ChatGPT.

    [Image: How do we build the synthetic dataset]

    It returned the dataset as a CSV. Here is the process that shows how ChatGPT created it.

    [Image: How do we build the synthetic dataset]

    This is by far the most complex dataset we have generated. Let’s see the first few rows of this dataset.

    [Image: How do we build the synthetic dataset]
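The code ChatGPT wrote is not shown in the article. As a rough idea of what a prompt like this tends to produce, here is a reduced sketch covering only a subset of the requested columns. Every coefficient, premium, and percentage in it is an assumption invented for illustration; it is not the code behind ca_housing_synth.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical city premiums (price multipliers), invented for this sketch
CITY_PREMIUM = {
    "Los Angeles": 1.20, "San Francisco": 1.45, "San Diego": 1.10,
    "San Jose": 1.30, "Sacramento": 0.90,
}

def make_ca_housing(n_rows=10_000):
    city = rng.choice(list(CITY_PREMIUM), size=n_rows)
    sqft = rng.normal(1600, 600, n_rows).clip(350, 5000).round()
    bedrooms = rng.integers(1, 6, n_rows)              # 1 to 5
    bathrooms = rng.integers(1, 4, n_rows)             # 1 to 3
    school_score = rng.uniform(1, 10, n_rows).round(1)
    crime_index = rng.uniform(0, 100, n_rows).round(1)
    dist_km_center = rng.exponential(8, n_rows).clip(0.2, 40).round(2)

    # Relationships: the premium scales a linear mix of features (made-up weights)
    premium = np.array([CITY_PREMIUM[c] for c in city])
    price = premium * (
        150_000 + 350 * sqft + 15_000 * bedrooms + 10_000 * bathrooms
        + 8_000 * school_score - 1_000 * crime_index - 3_000 * dist_km_center
    )
    price *= rng.normal(1.0, 0.08, n_rows)             # random noise, about ±8%

    # A few outliers: roughly 1% of homes priced at 2x to 3x the rule
    is_outlier = rng.random(n_rows) < 0.01
    price[is_outlier] *= rng.uniform(2, 3, is_outlier.sum())

    df = pd.DataFrame({
        "city": city, "sqft": sqft.astype(int), "bedrooms": bedrooms,
        "bathrooms": bathrooms, "school_score": school_score,
        "crime_index": crime_index, "dist_km_center": dist_km_center,
        "price_usd": price.round().astype(int),
    })

    # Missing values: blank out about 3% of two feature columns
    for col in ["school_score", "crime_index"]:
        df.loc[rng.random(n_rows) < 0.03, col] = np.nan
    return df

df_sketch = make_ca_housing()
print(df_sketch.shape)
print(df_sketch.isna().mean().round(3))
```

The structure mirrors the prompt: features are drawn first, the price is derived from them, and noise, outliers, and missing values are layered on afterwards so that the underlying relationships stay recoverable by a model.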

     

    # Building a Portfolio Project from Synthetic Data

    We used four different methods to create a synthetic dataset. We’ll use the AI-generated data to build a portfolio project.

    First, we’ll explore the data, and then build a machine learning model. Next, we’ll visualize the results with Streamlit by leveraging AI, and in the final step, we’ll look at which steps to follow to deploy the model to production.

    [Image: Building a Portfolio Project from Synthetic Data]

     

    // Step 1: Exploring and Understanding the Synthetic Dataset

    We’ll start exploring the data by first reading it with pandas and showing the first few rows.

    df = pd.read_csv("ca_housing_synth.csv")
    df.head()

     

    Here is the output.

    [Image: Building a Portfolio Project from Synthetic Data]

    The dataset includes location (city, neighborhood, latitude, longitude) and property details (size, rooms, year, condition), as well as the target price. Let’s check the column names, size, and length by using the info method.

    [Image: Building a Portfolio Project from Synthetic Data]

    We have 15 columns, with some, like has_garage or dist_km_center, being quite specific.
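The info call itself appears only in the screenshot. A minimal sketch of this kind of first-pass inspection is shown below on a tiny stand-in frame; the rows are made up and only a few of the dataset’s columns are included, so in the project you would run the same calls on the frame loaded from ca_housing_synth.csv.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with a few of the dataset's columns (values are made up)
df = pd.DataFrame({
    "city": ["San Diego", "San Jose", "Sacramento", "Los Angeles"],
    "sqft": [1450, 2100, 980, 1720],
    "school_score": [7.5, np.nan, 6.1, 8.3],
    "has_garage": [1, 0, 1, 1],
    "price_usd": [812_000, 1_240_000, 455_000, 990_000],
})

df.info()                                   # column names, dtypes, non-null counts
missing = df.isna().sum()                   # missing values per column
print(missing[missing > 0])
print(df.describe().loc[["min", "max"]])    # quick range check for numeric columns
```

Because the prompt asked for missing values and outliers, the missing-value counts and the min/max ranges are worth checking before modeling.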

     

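    // Step 2: Model Building

    The next step is to build a machine learning model that predicts home prices.

    We’ll follow these steps:

    1. Define the numeric and categorical columns
    2. Split the data into training and test sets
    3. Build the preprocessing pipelines
    4. Define the model
    5. Train the pipeline
    6. Evaluate the predictions
    7. (Optional) Compute permutation importance
    8. Plot actual vs. predicted prices

    Here is the code.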

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.inspection import permutation_importance
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    # --- Step 1: Define columns based on the generated dataset
    num_cols = ["sqft", "bedrooms", "bathrooms", "lot_sqft", "year_built", 
                "school_score", "crime_index", "dist_km_center", "latitude", "longitude"]
    cat_cols = ["city", "neighborhood", "property_type", "condition", "has_garage"]
    
    # --- Step 2: Split the data
    X = df.drop(columns=["price_usd"])
    y = df["price_usd"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # --- Step 3: Preprocessing pipelines
    num_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
    cat_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ])
    
    preprocessor = ColumnTransformer([
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols)
    ])
    
    # --- Step 4: Model
    model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
    
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model)
    ])
    
    # --- Step 5: Train
    pipeline.fit(X_train, y_train)

    # --- Step 6: Evaluate
    y_pred = pipeline.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    # RMSE as the square root of MSE (squared=False is deprecated in newer scikit-learn releases)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    print(f"MAE:  {mae:,.0f}")
    print(f"RMSE: {rmse:,.0f}")
    print(f"R²:   {r2:.3f}")
    
    # --- Step 7: (Optional) Permutation importance on a subset for speed
    pi = permutation_importance(
        pipeline, X_test.iloc[:1000], y_test.iloc[:1000],
        n_repeats=3, random_state=42, scoring="r2"
    )
    
    # --- Step 8: Plot Actual vs Predicted
    plt.figure(figsize=(6, 5))
    plt.scatter(y_test, y_pred, alpha=0.25)
    vmin, vmax = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
    plt.plot([vmin, vmax], [vmin, vmax], linestyle="--", color="purple")  # perfect-prediction line
    plt.xlabel("Actual Price (USD)")
    plt.ylabel("Predicted Price (USD)")
    plt.title(f"Actual vs Predicted (MAE={mae:,.0f}, RMSE={rmse:,.0f}, R²={r2:.3f})")
    plt.tight_layout()
    plt.show()

     

    Here is the output.

    [Image: Building a Portfolio Project from Synthetic Data]
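Step 7 computes the permutation importance but the listing never displays it. One way to read the result is to pair each mean importance with its input column and sort. The sketch below does this on a small stand-in dataset with a plain RandomForestRegressor instead of the full pipeline; the columns, coefficients, and the deliberately useless noise_col are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Stand-in features: two that drive the price and one that does not
X = pd.DataFrame({
    "sqft": rng.normal(1500, 500, n).clip(350, 4500),
    "bedrooms": rng.integers(1, 6, n),
    "noise_col": rng.normal(0, 1, n),
})
y = 300 * X["sqft"] + 10_000 * X["bedrooms"] + rng.normal(0, 20_000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle one column at a time and measure how much R² drops
pi = permutation_importance(
    model, X_test, y_test, n_repeats=3, random_state=42, scoring="r2"
)

# One importance value per input column, largest drop first
importances = pd.Series(pi.importances_mean, index=X_test.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the article’s pipeline, the same pattern should apply by indexing pi.importances_mean with X_test.columns, since the importance is measured on the raw input columns before preprocessing.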
     

    Model Performance:

    • MAE (85,877 USD): On average, predictions are off by about $86K, which is reasonable given the variability in housing prices
    • RMSE (113,512 USD): Larger errors are penalized more; RMSE confirms the model handles large deviations fairly well
    • R² (0.853): The model explains ~85% of the variance in home prices, showing strong predictive power for synthetic data
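To see how the three metrics relate, they can be computed by hand on a few values. The four homes below are made up (prices in thousands of USD), and the formulas are the standard definitions, shown as an illustration rather than the scikit-learn implementation.

```python
import numpy as np

# Four made-up homes, prices in thousands of USD
y_true = np.array([300.0, 500.0, 700.0, 900.0])
y_pred = np.array([320.0, 480.0, 650.0, 1000.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))                    # average size of a miss
rmse = np.sqrt(np.mean(errors ** 2))             # squares first, so big misses weigh more
ss_res = np.sum(errors ** 2)                     # variation the predictions fail to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.1f}")    # 47.5
print(f"RMSE: {rmse:.1f}")   # 57.7
print(f"R²:   {r2:.4f}")     # 0.9335
```

The single 100K miss pulls RMSE above MAE, which is the same pattern as in the model results above (113,512 vs 85,877).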

     

    // Step 3: Visualize the Data

    In this step, we’ll present our process, including EDA and model building, in a Streamlit dashboard. Why are we using Streamlit? You can build a Streamlit dashboard quickly and easily deploy it for others to view and interact with.
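The dashboard prompt later in this step refers to a saved model file, real_estate_model.pkl, but the article does not show how it was written. A minimal sketch with Python’s pickle module follows. It uses a small stand-in pipeline trained on made-up numbers and writes to a temporary folder, both of which are assumptions for illustration; in the project you would dump the fitted pipeline from Step 2 next to app.py.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline on made-up numbers; in the project this would be the
# fitted `pipeline` object from Step 2
X = pd.DataFrame({"sqft": [800, 1200, 1500, 2100, 3000]})
y = np.array([480_000, 610_000, 720_000, 930_000, 1_250_000])
pipeline = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipeline.fit(X, y)

# Written to a temporary folder here; in the project, save it next to app.py
model_path = Path(tempfile.mkdtemp()) / "real_estate_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)

# Later (for example inside the Streamlit app), load it back and predict
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

new_home = pd.DataFrame({"sqft": [1800]})
print(f"{loaded.predict(new_home)[0]:,.0f}")
```

Saving the whole pipeline, rather than only the model, keeps the imputation and encoding steps attached to it, so the app can pass raw inputs straight to predict. Only unpickle files you created yourself or otherwise trust.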

    Using Gemini CLI

    To build the Streamlit application, we’ll use Gemini CLI.

    Gemini CLI is an AI-powered open-source command-line agent. You can write code and build applications using Gemini CLI. It’s easy and free.

    To install it, use the following command in your terminal.

    npm install -g @google/gemini-cli

    After installing, use this command to start it.

    gemini

    It will ask you to log in to your Google account, and then you’ll see the screen where you’ll build this Streamlit app.

    [Image: Building a Portfolio Project from Synthetic Data]
     

    Building a Dashboard

    To build a dashboard, we need to create a prompt that is tailored to our specific data and goal. In the following prompt, we explain everything AI needs to build a Streamlit dashboard.

    Build a Streamlit app for the California Real Estate dataset by using this dataset ( path-to-dataset )
    Here is the dataset information:
    • Domain: California housing in Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
    • Location: city, neighborhood, lat, lon, and dist_km_center (haversine to city center).
    • Home features: sqft, beds, baths, lot_sqft, year_built, property_type, has_garage, condition.
    • Context: school_score, crime_index.
    • Target: price_usd.
    • Price logic: city premium + size + rooms + lot size + school/crime + distance to center + property type + condition + noise.
    • Files you have: ca_housing_synth.csv (data) and real_estate_model.pkl (trained pipeline).

    The Streamlit app should have:
    • A short dataset overview section (shape, column list, small preview).
    • Sidebar inputs for every model feature except the target:
      - Categorical dropdowns: city, neighborhood, property_type, condition, has_garage.
      - Numeric inputs/sliders: lat, lon, sqft, beds, baths, lot_sqft, year_built, school_score, crime_index.
      - Auto-compute dist_km_center from the chosen city using the haversine formula and that city’s center.
    • A Predict button that:
      - Builds a one-row DataFrame with the exact training columns (order-safe).
      - Calls pipeline.predict(...) from real_estate_model.pkl.
      - Displays Estimated Price (USD) with thousands separators.
    • One chart only: What-if: sqft vs price line chart (all other inputs fixed to the sidebar values).
    • Quality of life: cache model load, basic input validation, clear labels/tooltips, English UI.
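The prompt asks the app to compute dist_km_center with the haversine formula. For reference, here is a minimal sketch of that calculation. The city-center coordinates are approximate values chosen for illustration, and the helper names are hypothetical; the generated app may structure this differently.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate downtown coordinates, chosen here for illustration only
CITY_CENTERS = {
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def dist_km_center(city, lat, lon):
    """Distance from a property to the center of its city."""
    center_lat, center_lon = CITY_CENTERS[city]
    return haversine_km(lat, lon, center_lat, center_lon)

# A property a few kilometers west of downtown San Francisco
richmond_km = dist_km_center("San Francisco", 37.7800, -122.4800)
print(round(richmond_km, 2))

# Sanity check on a longer distance: San Francisco to Los Angeles
sf_la_km = haversine_km(*CITY_CENTERS["San Francisco"], *CITY_CENTERS["Los Angeles"])
print(round(sf_la_km))
```

Computing the distance inside the app, instead of asking the user for it, keeps the input consistent with the latitude and longitude chosen in the sidebar.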

     

    Next, Gemini will ask your permission to create this file.

    [Image: Building a Portfolio Project from Synthetic Data]

    Let’s approve and proceed. Once it has finished coding, it will automatically open the Streamlit dashboard.

    If not, go to the working directory of the app.py file and run streamlit run app.py to start this Streamlit app.

    Here is our Streamlit dashboard.

    [Image: Building a Portfolio Project from Synthetic Data]
     

    If you click on the data overview, you can see a section representing the data exploration.

    [Image: Building a Portfolio Project from Synthetic Data]

    From the property features on the left-hand side, we can customize the property and make predictions accordingly. This part of the dashboard represents what we did in model building, but with a more responsive look.

    Let’s select Richmond, San Francisco, single-family, excellent condition, 1500 sqft, and click on the “Predict Price” button:

    [Image: Building a Portfolio Project from Synthetic Data]

    The predicted price is $1.24M. Also, you can see the actual vs predicted price in the second graph for the entire dataset if you scroll down.

    [Image: Building a Portfolio Project from Synthetic Data]

    You can adjust more features in the left panel, like the year built, crime index, or the number of bathrooms.

    [Image: Building a Portfolio Project from Synthetic Data]
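Behind the Predict button and the what-if chart, the logic boils down to two prediction calls. The sketch below illustrates that logic without Streamlit. It trains a small stand-in pipeline on made-up data instead of loading real_estate_model.pkl, and the three-column feature list is an assumption; the generated app.py may look quite different.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in training data and pipeline (the real app would load real_estate_model.pkl)
rng = np.random.default_rng(0)
n = 400
train = pd.DataFrame({
    "city": rng.choice(["San Francisco", "San Diego"], n),
    "sqft": rng.integers(500, 3500, n),
    "bedrooms": rng.integers(1, 6, n),
})
premium = np.where(train["city"] == "San Francisco", 1.35, 1.0)
train["price_usd"] = 220_000 * premium + 350 * train["sqft"]

feature_cols = ["city", "sqft", "bedrooms"]  # exact training column order
pipeline = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
        remainder="passthrough",
    )),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipeline.fit(train[feature_cols], train["price_usd"])

# One-row frame built from the "sidebar" values, reordered to the training columns
sidebar = {"bedrooms": 3, "sqft": 1500, "city": "San Francisco"}
row = pd.DataFrame([sidebar])[feature_cols]
estimate = pipeline.predict(row)[0]
print(f"Estimated Price (USD): {estimate:,.0f}")

# What-if sweep: vary sqft, hold every other input at the sidebar value
sweep = pd.concat([row] * 7, ignore_index=True)
sweep["sqft"] = np.linspace(800, 3200, 7).astype(int)
sweep["predicted_price"] = pipeline.predict(sweep[feature_cols])
print(sweep[["sqft", "predicted_price"]])
```

Selecting the columns by name before calling predict is what makes the one-row frame order-safe: the sidebar values can arrive in any order and still line up with what the pipeline was trained on.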
     

    // Step 4: Deploy the Model

    The next step is deploying your model to production. To do that, you can follow the deployment steps for Streamlit Community Cloud.

     

    # Final Thoughts

    In this article, we explored different methods for creating synthetic datasets: random, rule-based, simulation-based, and AI-powered. Next, we built a portfolio data project, starting from data exploration and building a machine learning model.

    We also used an open-source command-line AI agent (Gemini CLI) to develop a dashboard that explores the dataset and predicts house prices based on selected features, including the number of bedrooms, crime index, and square footage.

    Creating your own synthetic data lets you avoid privacy hurdles, balance your examples, and move fast without costly data collection. The downside is that it can reflect your assumptions and miss real-world quirks. If you’re looking for more inspiration, check out this list of machine learning projects that you can adapt for your portfolio.

    Finally, we looked at how to upload your model to production using Streamlit Community Cloud. Go ahead and follow these steps to build and showcase your portfolio project today!
     
     

    Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


