In this article, you'll learn how to use Python's itertools module to simplify common feature engineering tasks with clean, efficient patterns.
Topics we'll cover include:
- Generating interaction, polynomial, and cumulative features with itertools.
- Building lookup grids, lag windows, and grouped aggregates for structured data workflows.
- Using iterator-based tools to write cleaner, more composable feature engineering code.
On we go.
7 Essential Python Itertools for Feature Engineering
Introduction
Feature engineering is where most of the real work in machine learning happens. A good feature often improves a model more than switching algorithms. Yet this step usually produces messy code full of nested loops, manual indexing, hand-built combinations, and the like.
Python's itertools module is a standard library toolkit that most data scientists know exists but rarely reach for when building features. That's a missed opportunity, because itertools is designed for working with iterators efficiently. A lot of feature engineering, at its core, is structured iteration: over pairs of variables, sliding windows, grouped sequences, or every possible subset of a feature set.
In this article, you'll work through seven itertools functions that solve common feature engineering problems. We'll spin up sample e-commerce data and cover interaction features, lag windows, category combinations, and more. By the end, you'll have a set of patterns you can drop directly into your own feature engineering pipelines.
You can get the code on GitHub.
1. Generating Interaction Features with combinations
Interaction features capture the relationship between two variables, something neither variable expresses alone. Manually listing every pair from a multi-column dataset is tedious. combinations in the itertools module does it in one line.
Let's code an example to create interaction features using combinations:
```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "avg_order_value": [142.5, 89.0, 210.3, 67.8, 185.0],
    "discount_rate": [0.10, 0.25, 0.05, 0.30, 0.15],
    "days_since_signup": [120, 45, 380, 12, 200],
    "items_per_order": [3.2, 1.8, 5.1, 1.2, 4.0],
    "return_rate": [0.05, 0.18, 0.02, 0.22, 0.08],
})

numeric_cols = df.columns.tolist()

# Multiply every unique pair of columns to create interaction features
for col_a, col_b in itertools.combinations(numeric_cols, 2):
    feature_name = f"{col_a}_x_{col_b}"
    df[feature_name] = df[col_a] * df[col_b]

interaction_cols = [c for c in df.columns if "_x_" in c]
print(df[interaction_cols].head())
```
Truncated output:
```
   avg_order_value_x_discount_rate  avg_order_value_x_days_since_signup
0                           14.250                              17100.0
1                           22.250                               4005.0
2                           10.515                              79914.0
3                           20.340                                813.6
4                           27.750                              37000.0

   avg_order_value_x_items_per_order  avg_order_value_x_return_rate
0                             456.00                          7.125
1                             160.20                         16.020
2                            1072.53                          4.206
3                              81.36                         14.916
4                             740.00                         14.800
...

   days_since_signup_x_return_rate  items_per_order_x_return_rate
0                             6.00                          0.160
1                             8.10                          0.324
2                             7.60                          0.102
3                             2.64                          0.264
4                            16.00                          0.320
```
combinations(numeric_cols, 2) generates every unique pair exactly once, without duplicates. With 5 columns, that's 10 pairs; with 10 columns, it's 45. This approach scales as you add columns.
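If you want to sanity-check those counts, the standard library's `math.comb` gives the number of unique pairs directly. A small verification snippet (not part of the pipeline itself):

```python
import itertools
import math

# Number of unique pairs from n columns is C(n, 2) = n * (n - 1) / 2
for n in (5, 10, 20):
    cols = [f"col_{i}" for i in range(n)]
    pairs = list(itertools.combinations(cols, 2))
    print(n, len(pairs))  # 5 -> 10, 10 -> 45, 20 -> 190
    assert len(pairs) == math.comb(n, 2)
```

This is a handy guardrail before expanding a wide frame: the pair count grows quadratically, so 50 columns would already produce 1,225 interaction features.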
2. Building Cross-Category Feature Grids with product
itertools.product gives you the Cartesian product of two or more iterables, that is, every possible combination across them, including repeats across different groups.
In the e-commerce sample we're working with, this is useful when you want to build a feature matrix across customer segments and product categories.
```python
import itertools

import numpy as np
import pandas as pd

customer_segments = ["new", "returning", "vip"]
product_categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]

# All segment × category × channel combinations
combos = list(itertools.product(customer_segments, product_categories, channels))

grid_df = pd.DataFrame(combos, columns=["segment", "category", "channel"])

# Simulate a conversion rate lookup per combination
np.random.seed(7)
grid_df["avg_conversion_rate"] = np.round(
    np.random.uniform(0.02, 0.18, size=len(grid_df)), 3
)

print(grid_df.head(12))
print(f"\nTotal combinations: {len(grid_df)}")
```
Output:
|
section class channel avg_conversion_charge 0 new electronics cellular 0.032 1 new electronics desktop 0.145 2 new attire cellular 0.090 3 new attire desktop 0.136 4 new home_goods cellular 0.176 5 new home_goods desktop 0.106 6 new magnificence cellular 0.100 7 new magnificence desktop 0.032 8 returning electronics cellular 0.063 9 returning electronics desktop 0.100 10 returning attire cellular 0.129 11 returning attire desktop 0.149
Complete mixtures: 24 |
This grid can then be merged back onto your main transaction dataset as a lookup feature: every row gets the expected conversion rate for its specific segment × category × channel bucket. product ensures you haven't missed any valid combination when building that grid.
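As a quick sketch of that merge (the `transactions` frame below is hypothetical, invented just to illustrate the join):

```python
import itertools

import numpy as np
import pandas as pd

# Rebuild the lookup grid from the section above
segments = ["new", "returning", "vip"]
categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]
grid_df = pd.DataFrame(
    list(itertools.product(segments, categories, channels)),
    columns=["segment", "category", "channel"],
)
np.random.seed(7)
grid_df["avg_conversion_rate"] = np.round(
    np.random.uniform(0.02, 0.18, size=len(grid_df)), 3
)

# A made-up transactions table with the same key columns
transactions = pd.DataFrame({
    "order_id": ["ORD-1", "ORD-2", "ORD-3"],
    "segment": ["new", "vip", "returning"],
    "category": ["apparel", "beauty", "electronics"],
    "channel": ["mobile", "desktop", "mobile"],
})

# Left-merge so every transaction picks up its bucket's rate
enriched = transactions.merge(
    grid_df, on=["segment", "category", "channel"], how="left"
)
print(enriched)
```

A left merge keeps every transaction even if a bucket is somehow missing from the grid, which would surface as a NaN you can catch in validation.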
3. Flattening Multi-Source Feature Sets with chain
In most pipelines, features come from multiple sources: a customer profile table, a product metadata table, and a browsing history table. You often need to flatten these into a single feature list for column selection or validation.
```python
import itertools

customer_features = [
    "customer_age", "days_since_signup", "lifetime_value",
    "total_orders", "avg_order_value",
]

product_features = [
    "category", "brand_tier", "avg_rating",
    "review_count", "is_sponsored",
]

behavioral_features = [
    "pages_viewed_last_7d", "search_queries_last_7d",
    "cart_abandonment_rate", "wishlist_size",
]

# Flatten all feature groups into one list
all_features = list(itertools.chain(
    customer_features, product_features, behavioral_features
))

print(f"Total features: {len(all_features)}")
print(all_features)
```
Output:
```
Total features: 14
['customer_age', 'days_since_signup', 'lifetime_value', 'total_orders', 'avg_order_value', 'category', 'brand_tier', 'avg_rating', 'review_count', 'is_sponsored', 'pages_viewed_last_7d', 'search_queries_last_7d', 'cart_abandonment_rate', 'wishlist_size']
```
This might look like just using + to concatenate lists, and for simple cases it is. But chain is especially useful when you have many sources, when sources are generators rather than lists, or when you're building the feature list conditionally, with some feature groups optional depending on data availability. It keeps the code readable and composable.
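Here's a sketch of that conditional case (the group names and the `has_browsing_history` flag are made up for illustration). `chain.from_iterable` flattens an iterable of feature groups, and the groups themselves can be lists or generators:

```python
import itertools

def available_feature_groups(has_browsing_history: bool):
    # Yield feature groups conditionally; each group can be
    # a plain list or a lazy generator -- chain doesn't care
    yield ["customer_age", "days_since_signup"]
    yield (f"orders_last_{n}d" for n in (7, 30, 90))
    if has_browsing_history:
        yield ["pages_viewed_last_7d", "cart_abandonment_rate"]

# chain.from_iterable flattens one level of nesting lazily
features = list(itertools.chain.from_iterable(available_feature_groups(True)))
print(features)
```

Plain `+` concatenation can't splice in a generator or skip a group this cleanly; the chain version stays one flat expression no matter how many optional sources you add.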
4. Creating Windowed Lag Features with islice
Lag features matter in many datasets. In e-commerce, for example, what a customer spent last month, their order count over the last 3 purchases, and their average basket size over the last 5 transactions can all be important features. Building these manually with index arithmetic is error-prone.
islice lets you slice an iterator without converting it to a list first. This is useful when processing ordered transaction histories row by row.
```python
import itertools

import pandas as pd

# Transaction history for customer C-10482, ordered chronologically
transactions = [
    {"order_id": "ORD-8821", "amount": 134.50, "items": 3},
    {"order_id": "ORD-8934", "amount": 89.00, "items": 2},
    {"order_id": "ORD-9102", "amount": 210.75, "items": 5},
    {"order_id": "ORD-9341", "amount": 55.20, "items": 1},
    {"order_id": "ORD-9488", "amount": 178.90, "items": 4},
    {"order_id": "ORD-9601", "amount": 302.10, "items": 7},
]

# Build lag-3 features for each transaction (using the 3 most recent prior orders)
window_size = 3
features = []

for i in range(window_size, len(transactions)):
    window = list(itertools.islice(transactions, i - window_size, i))
    current = transactions[i]

    lag_amounts = [t["amount"] for t in window]
    features.append({
        "order_id": current["order_id"],
        "current_amount": current["amount"],
        "lag_1_amount": lag_amounts[-1],
        "lag_2_amount": lag_amounts[-2],
        "lag_3_amount": lag_amounts[-3],
        "rolling_mean_3": round(sum(lag_amounts) / len(lag_amounts), 2),
        "rolling_max_3": max(lag_amounts),
    })

print(pd.DataFrame(features).to_string(index=False))
```
Output:
```
order_id  current_amount  lag_1_amount  lag_2_amount  lag_3_amount  rolling_mean_3  rolling_max_3
ORD-9341            55.2        210.75         89.00        134.50          144.75         210.75
ORD-9488           178.9         55.20        210.75         89.00          118.32         210.75
ORD-9601           302.1        178.90         55.20        210.75          148.28         210.75
```
islice(transactions, i - window_size, i) gives you exactly the preceding window_size transactions without building intermediate lists for the full history.
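One thing worth seeing concretely: plain `stream[:3]` slicing fails on a generator, while `islice` consumes only the items it needs and leaves the rest of the stream intact. A minimal sketch with a made-up lazy transaction source:

```python
import itertools

def transaction_stream():
    # Simulate reading an ordered transaction log lazily,
    # e.g. from a file or database cursor (amounts are made up)
    for amount in [134.50, 89.00, 210.75, 55.20, 178.90, 302.10]:
        yield amount

# islice takes the first 3 items without materializing the rest;
# stream[:3] would raise TypeError on a generator
stream = transaction_stream()
first_window = list(itertools.islice(stream, 3))
print(first_window)  # [134.5, 89.0, 210.75]

# The stream resumes exactly where islice stopped
next_item = next(stream)
print(next_item)  # 55.2
```

This is the property that makes islice-based windows practical on histories too large to hold in memory at once.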
5. Aggregating Per-Category Features with groupby
groupby lets you group a sorted iterable and compute per-group statistics cleanly.
Going back to our example, a customer's behavior often varies significantly by product category. Their average spend on electronics might be 4× their spend on accessories. Treating all orders as one pool loses that signal.
Here's an example:
```python
import itertools

import pandas as pd

orders = [
    {"customer": "C-10482", "category": "electronics", "amount": 349.99},
    {"customer": "C-10482", "category": "electronics", "amount": 189.00},
    {"customer": "C-10482", "category": "apparel", "amount": 62.50},
    {"customer": "C-10482", "category": "apparel", "amount": 88.00},
    {"customer": "C-10482", "category": "apparel", "amount": 45.75},
    {"customer": "C-10482", "category": "home_goods", "amount": 124.30},
]

# Must be sorted by the grouping key before using groupby
orders_sorted = sorted(orders, key=lambda x: x["category"])

category_features = {}
for category, group in itertools.groupby(orders_sorted, key=lambda x: x["category"]):
    amounts = [o["amount"] for o in group]
    category_features[category] = {
        "order_count": len(amounts),
        "total_spend": round(sum(amounts), 2),
        "avg_spend": round(sum(amounts) / len(amounts), 2),
        "max_spend": max(amounts),
    }

cat_df = pd.DataFrame(category_features).T
cat_df.index.name = "category"
print(cat_df)
```
Output:
```
             order_count  total_spend  avg_spend  max_spend
category
apparel              3.0       196.25      65.42      88.00
electronics          2.0       538.99     269.50     349.99
home_goods           1.0       124.30     124.30     124.30
```
These per-category aggregates become features on the customer row: electronics_avg_spend, apparel_order_count, and so on. The important thing to remember with itertools.groupby is that you must sort by the key first. Unlike pandas groupby, it only groups consecutive elements.
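A quick demonstration of that caveat: on unsorted data, groupby emits the same key multiple times, once per consecutive run:

```python
import itertools

categories = ["apparel", "electronics", "apparel", "electronics"]

# Without sorting, groupby only merges *consecutive* equal keys,
# so each category appears once per run, not once overall
unsorted_groups = [k for k, _ in itertools.groupby(categories)]
print(unsorted_groups)  # ['apparel', 'electronics', 'apparel', 'electronics']

# After sorting, each key appears exactly once
sorted_groups = [k for k, _ in itertools.groupby(sorted(categories))]
print(sorted_groups)  # ['apparel', 'electronics']
```

Forgetting the sort is the classic itertools.groupby bug; the silent fragmentation means your aggregates quietly undercount each group.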
6. Building Polynomial Features with combinations_with_replacement
Polynomial features (squares, cubes, and cross-products) are a standard way to give linear models the ability to capture non-linear relationships.
Scikit-learn's PolynomialFeatures does this, but combinations_with_replacement gives you the same result with full control over which features get expanded and how.
```python
import itertools

import pandas as pd

df_poly = pd.DataFrame({
    "avg_order_value": [142.5, 89.0, 210.3, 67.8],
    "discount_rate": [0.10, 0.25, 0.05, 0.30],
    "items_per_order": [3.2, 1.8, 5.1, 1.2],
})

cols = df_poly.columns.tolist()

# Degree-2: includes col^2 and col_a × col_b
for col_a, col_b in itertools.combinations_with_replacement(cols, 2):
    feature_name = f"{col_a}^2" if col_a == col_b else f"{col_a}_x_{col_b}"
    df_poly[feature_name] = df_poly[col_a] * df_poly[col_b]

poly_cols = [c for c in df_poly.columns if "^2" in c or "_x_" in c]
print(df_poly[poly_cols].round(3))
```
Output:
```
   avg_order_value^2  avg_order_value_x_discount_rate
0           20306.25                           14.250
1            7921.00                           22.250
2           44226.09                           10.515
3            4596.84                           20.340

   avg_order_value_x_items_per_order  discount_rate^2
0                             456.00            0.010
1                             160.20            0.062
2                            1072.53            0.003
3                              81.36            0.090

   discount_rate_x_items_per_order  items_per_order^2
0                            0.320            10.24
1                            0.450             3.24
2                            0.255            26.01
3                            0.360             1.44
```
The difference from combinations is in the name: combinations_with_replacement allows the same element to appear twice. That's what gives you the squared terms (avg_order_value^2). Use this when you want polynomial expansion without pulling in scikit-learn just for preprocessing.
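If you want to go beyond degree 2, the same function extends naturally. The snippet below is a sketch, not part of the original pipeline; the `term_name` helper is a hypothetical naming scheme that mirrors the `^2`/`_x_` convention above:

```python
import itertools

cols = ["avg_order_value", "discount_rate", "items_per_order"]

# Degree-3 expansion: triples with repetition give cubes (a*a*a),
# square-times-other terms (a*a*b), and three-way products (a*b*c)
degree3 = list(itertools.combinations_with_replacement(cols, 3))
print(len(degree3))  # 10 terms for 3 features

def term_name(term):
    # Name each term the way the section names degree-2 features,
    # e.g. ('a', 'a', 'b') -> 'a^2_x_b' (hypothetical helper)
    parts = []
    for col in sorted(set(term), key=term.index):
        power = term.count(col)
        parts.append(f"{col}^{power}" if power > 1 else col)
    return "_x_".join(parts)

print(term_name(("avg_order_value",) * 3))  # avg_order_value^3
print(term_name(("avg_order_value", "discount_rate", "discount_rate")))
```

For n features, degree-d expansion yields C(n + d - 1, d) terms, so higher degrees blow up quickly; the explicit loop makes it easy to whitelist only the columns worth expanding.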
7. Accumulating Cumulative Behavioral Features with accumulate
itertools.accumulate computes running aggregates over a sequence without needing pandas or NumPy.
Cumulative features such as running total spend, cumulative order count, and running average basket size are useful signals for lifetime value modeling and churn prediction. A customer's cumulative spend at order 5 says something different than their spend at order 15. Here's an example:
```python
import itertools

import pandas as pd

# Customer C-20917: chronological order amounts
order_amounts = [56.80, 123.40, 89.90, 245.00, 67.50, 310.20, 88.75]

# Cumulative spend
cumulative_spend = list(itertools.accumulate(order_amounts))

# Cumulative max spend (highest single order so far)
cumulative_max = list(itertools.accumulate(order_amounts, func=max))

# Cumulative order count (just using addition on 1s)
cumulative_count = list(itertools.accumulate([1] * len(order_amounts)))

features_df = pd.DataFrame({
    "order_number": range(1, len(order_amounts) + 1),
    "order_amount": order_amounts,
    "cumulative_spend": cumulative_spend,
    "cumulative_max_order": cumulative_max,
    "order_count_so_far": cumulative_count,
})

features_df["avg_spend_so_far"] = (
    features_df["cumulative_spend"] / features_df["order_count_so_far"]
).round(2)

print(features_df.to_string(index=False))
```
Output:
```
 order_number  order_amount  cumulative_spend  cumulative_max_order  order_count_so_far  avg_spend_so_far
            1         56.80             56.80                  56.8                   1             56.80
            2        123.40            180.20                 123.4                   2             90.10
            3         89.90            270.10                 123.4                   3             90.03
            4        245.00            515.10                 245.0                   4            128.78
            5         67.50            582.60                 245.0                   5            116.52
            6        310.20            892.80                 310.2                   6            148.80
            7         88.75            981.55                 310.2                   7            140.22
```
accumulate takes an optional func argument, which can be any two-argument function. The default is addition, but max, min, operator.mul, or a custom lambda all work. In this example, each row in the output is a snapshot of the customer's history at that point in time. This is useful when building features for sequential models, or training data where you want to avoid leakage.
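For instance, a custom lambda can compute an exponentially decayed running spend, where recent orders count more than old ones (the 0.5 decay factor here is an arbitrary choice for illustration):

```python
import itertools

order_amounts = [56.80, 123.40, 89.90, 245.00]

# Each step halves the previous running value before adding the new
# order, so older orders contribute progressively less
decayed = list(itertools.accumulate(
    order_amounts, lambda running, amount: 0.5 * running + amount
))
print([round(v, 2) for v in decayed])  # [56.8, 151.8, 165.8, 327.9]
```

The same pattern works for any recency-weighted signal; the lambda receives the running value first and the next element second, just like functools.reduce.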
Wrapping Up
I hope you found this article on using Python's itertools module for feature engineering helpful. Here's a quick reference for when to reach for each function:
| Function | Feature Engineering Use Case |
|---|---|
| `combinations` | Pairwise interaction features |
| `product` | Cross-category feature grids |
| `chain` | Merging feature lists from multiple sources |
| `islice` | Lag and rolling window features |
| `groupby` | Per-group aggregation features |
| `combinations_with_replacement` | Polynomial / squared features |
| `accumulate` | Cumulative behavioral features |
A useful habit to build here is recognizing when a feature engineering problem is, at its core, an iteration problem. When it is, itertools almost always has a cleaner answer than a custom function with hard-to-maintain loops. In the next article, we'll focus on building features for time series data. Until then, happy coding!

