Table of Contents
- Why Use Python’s Standard Library for ML Prototyping?
- Key Standard Library Modules for ML
- Implementing Basic ML Algorithms from Scratch
- Evaluation Metrics with Standard Library Tools
- Case Study: Predicting House Prices with Linear Regression
- Limitations and When to Switch to External Libraries
- Conclusion
- References
Why Use Python’s Standard Library for ML Prototyping?
Before diving into code, let’s clarify when and why the standard library shines for ML prototyping:
- No External Dependencies: No pip install required. This is ideal for environments with restricted internet access or for sharing code with minimal setup.
- Lightweight and Fast: Avoids the overhead of loading large libraries like pandas or numpy, making prototyping snappier for small datasets.
- Educational Value: Implementing models from scratch forces you to understand core concepts (e.g., gradient descent, distance metrics) rather than relying on black-box functions.
- Portability: Standard library code runs on any Python installation with no extra setup. The examples in this article assume Python 3.8+, since they use f-strings (added in 3.6) and math.dist (added in 3.8).
Key Standard Library Modules for ML
The standard library offers a surprising range of tools for ML. Let’s focus on the most critical modules:
Data Handling: csv, json, and io
Most ML projects start with loading data. The csv and json modules simplify reading structured data, while io helps with in-memory data manipulation.
Example: Loading Data from a CSV
Suppose we have a CSV file house_data.csv with columns: square_footage and price (target). We can load it using csv.DictReader:
import csv

def load_csv(file_path):
    data = []
    with open(file_path, mode='r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            # Convert string values to floats
            data.append({
                'square_footage': float(row['square_footage']),
                'price': float(row['price'])
            })
    return data

# Usage
house_data = load_csv('house_data.csv')
print(f"Loaded {len(house_data)} samples.")
Example: Loading JSON Data
For JSON datasets (e.g., titanic.json), use json.load:
import json

def load_json(file_path):
    with open(file_path, mode='r') as file:
        return json.load(file)

titanic_data = load_json('titanic.json')
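The io module is mentioned above but not demonstrated. As a minimal sketch, io.StringIO can wrap a CSV string so that csv.DictReader reads it like a file, which is convenient for tests that should not touch the disk. The sample rows and the function name load_csv_text are illustrative assumptions, not part of the original example.

```python
import csv
import io

# Hypothetical in-memory CSV content (same columns as house_data.csv)
CSV_TEXT = """square_footage,price
1000,200000
1200,240000
"""

def load_csv_text(text):
    """Parse CSV text held in memory into a list of float-valued dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {'square_footage': float(row['square_footage']),
         'price': float(row['price'])}
        for row in reader
    ]

samples = load_csv_text(CSV_TEXT)
print(f"Loaded {len(samples)} samples.")
```

The same parsing logic then works for files and strings alike, since csv.DictReader only needs an iterable of lines.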
Data Preprocessing: math, statistics, and random
Cleaning and transforming data is critical for ML. The math module provides mathematical functions (e.g., sqrt, exp, dist), statistics computes summary stats, and random handles shuffling/splitting data.
Example: Normalizing Data
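Normalization (scaling features to [0,1] or standardizing to mean=0, std=1) often helps gradient-based and distance-based models, because features on very different scales can slow convergence or dominate distance calculations. Here’s how to standardize data using statistics: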
import statistics

def standardize(data, feature_name):
    # Extract feature values
    values = [sample[feature_name] for sample in data]
    mean = statistics.mean(values)
    # stdev needs at least two values; fall back to 1 for a single sample
    std = statistics.stdev(values) if len(values) > 1 else 1
    # A constant feature has stdev 0; use 1 to avoid division by zero
    if std == 0:
        std = 1
    # Standardize each value in place: (x - mean) / std
    for sample in data:
        sample[f"{feature_name}_std"] = (sample[feature_name] - mean) / std

# Usage: Standardize 'square_footage' in house_data
standardize(house_data, 'square_footage')
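The text above also mentions scaling features to [0,1]. A minimal sketch of that min-max variant follows; the function name min_max_scale, the "_scaled" key suffix, and the sample rows are assumptions made for illustration.

```python
def min_max_scale(data, feature_name):
    """Add a '<feature_name>_scaled' key with values rescaled to [0, 1]."""
    values = [sample[feature_name] for sample in data]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # Constant feature: avoid division by zero
    for sample in data:
        sample[f"{feature_name}_scaled"] = (sample[feature_name] - lo) / span

# Illustrative rows, not from a real dataset
rows = [
    {'square_footage': 1000.0},
    {'square_footage': 1500.0},
    {'square_footage': 2000.0},
]
min_max_scale(rows, 'square_footage')
print([r['square_footage_scaled'] for r in rows])  # [0.0, 0.5, 1.0]
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range, so standardization is often the safer default.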
Example: Train-Test Split
Use random.shuffle to split data into training and testing sets:
import random

def train_test_split(data, test_size=0.2):
    shuffled = data[:]  # Copy so the caller's list is not reordered
    random.shuffle(shuffled)  # Shuffle to avoid ordering bias
    split_idx = int(len(shuffled) * (1 - test_size))
    return shuffled[:split_idx], shuffled[split_idx:]

train_data, test_data = train_test_split(house_data, test_size=0.2)
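The split above depends on the global random state, so each run produces a different partition. If reproducible experiments matter, one option is a dedicated random.Random instance with a fixed seed. The sketch below assumes a hypothetical seed parameter and function name; neither is part of the original function.

```python
import random

def seeded_train_test_split(data, test_size=0.2, seed=42):
    """Return (train, test) lists; the same seed gives the same split."""
    rng = random.Random(seed)   # Private generator, leaves global state alone
    shuffled = list(data)       # Copy so the input order is preserved
    rng.shuffle(shuffled)
    split_idx = int(len(shuffled) * (1 - test_size))
    return shuffled[:split_idx], shuffled[split_idx:]

items = list(range(10))
train_a, test_a = seeded_train_test_split(items, seed=0)
train_b, test_b = seeded_train_test_split(items, seed=0)
print(train_a == train_b and test_a == test_b)  # True
```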
Implementing Basic ML Algorithms from Scratch
With data loaded and preprocessed, let’s implement two foundational ML algorithms using only standard library modules.
Linear Regression with Gradient Descent
Linear regression predicts a continuous target (e.g., house prices) using a linear equation: \( \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n \), where \( w_1, \dots, w_n \) are the feature weights and \( w_0 \) is the intercept (called bias in the code). We’ll use gradient descent to optimize these parameters.
Step 1: Define the Model
class LinearRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None  # Coefficients for features
        self.bias = None     # Intercept term

    def fit(self, X, y):
        # Initialize weights and bias to 0
        n_samples, n_features = len(X), len(X[0])
        self.weights = [0.0] * n_features
        self.bias = 0.0

        # Gradient descent
        for _ in range(self.epochs):
            y_pred = [self._predict(x) for x in X]  # Predictions

            # Compute gradients
            dw = [0.0] * n_features
            db = 0.0
            for i in range(n_samples):
                error = y_pred[i] - y[i]
                for j in range(n_features):
                    dw[j] += error * X[i][j]
                db += error

            # Update weights and bias (average gradients)
            dw = [d / n_samples for d in dw]
            db /= n_samples
            self.weights = [w - self.learning_rate * d for w, d in zip(self.weights, dw)]
            self.bias -= self.learning_rate * db

    def _predict(self, x):
        # Single sample prediction: y = bias + sum(weights * x)
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def predict(self, X):
        # Predict for multiple samples
        return [self._predict(x) for x in X]
How It Works:
- fit: Optimizes weights and bias using gradient descent. The gradient of the mean squared error (MSE) loss function guides updates.
- _predict: Computes the linear combination of features and weights.
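For reference, the updates in fit can be written out as follows. This is a sketch that assumes the loss is half the MSE, so the factor of 2 from differentiation cancels; the code averages error times feature without that factor, which amounts to absorbing it into the learning rate. The symbols \( m \) (number of samples), \( b \) (the bias, i.e. \( w_0 \)), and \( \alpha \) (learning rate) are notation introduced here.

```latex
L(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2,
\qquad
\hat{y}^{(i)} = b + \sum_{j=1}^{n} w_j x_j^{(i)}

\frac{\partial L}{\partial w_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)},
\qquad
\frac{\partial L}{\partial b}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)

w_j \leftarrow w_j - \alpha \, \frac{\partial L}{\partial w_j},
\qquad
b \leftarrow b - \alpha \, \frac{\partial L}{\partial b}
```

In the code, dw[j] and db hold the two partial derivatives after division by n_samples, and learning_rate plays the role of \( \alpha \).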
K-Nearest Neighbors (K-NN) for Classification
K-NN is a simple classification algorithm: predict the class of a sample by majority voting among its k nearest neighbors.
Step 1: Define the Model
import math  # math.dist requires Python 3.8+
from collections import Counter

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        # K-NN has no training phase; just store data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        return [self._predict(x) for x in X]

    def _predict(self, x):
        # Compute distances between x and all training samples
        distances = [
            (math.dist(x, x_train), y_train)
            for x_train, y_train in zip(self.X_train, self.y_train)
        ]
        # Sort by distance only (so tied distances never compare labels)
        # and take the first k neighbors
        distances.sort(key=lambda pair: pair[0])
        k_nearest = [label for (dist, label) in distances[:self.k]]
        # Majority vote
        most_common = Counter(k_nearest).most_common(1)
        return most_common[0][0]
How It Works:
- fit: Stores training data (K-NN is lazy: no computation happens during training).
- _predict: Uses math.dist to compute Euclidean distance, sorts neighbors, and returns the majority class.
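Sorting every distance costs more than necessary when only the k smallest are needed. As a hedged alternative, heapq.nsmallest from the standard library selects the k nearest pairs without a full sort. The standalone function knn_predict and the toy points below are assumptions for illustration, not part of the class above.

```python
import heapq
import math
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    pairs = ((math.dist(x, xt), label) for xt, label in zip(X_train, y_train))
    # nsmallest keeps only k candidates; key avoids comparing labels on ties
    nearest = heapq.nsmallest(k, pairs, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2D points, made up for illustration
X_train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y_train = ['a', 'a', 'b', 'b']
print(knn_predict([0.2, 0.1], X_train, y_train, k=3))  # a
```

For the small datasets this article targets, the difference is negligible; the point is only that the selection step does not require sorting all training samples.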
Evaluation Metrics with Standard Library Tools
To assess model performance, we’ll implement metrics like Mean Squared Error (MSE) for regression and accuracy for classification using math and basic loops.
Example: MSE for Regression
import math

def mean_squared_error(y_true, y_pred):
    return sum(math.pow(y_t - y_p, 2) for y_t, y_p in zip(y_true, y_pred)) / len(y_true)
Example: Accuracy for Classification
def accuracy(y_true, y_pred):
    correct = sum(1 for y_t, y_p in zip(y_true, y_pred) if y_t == y_p)
    return correct / len(y_true)
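MSE is expressed in squared target units, which can be hard to interpret. Two common companions, sketched here under the same standard-library-only constraint, are the root mean squared error (same units as the target) and the coefficient of determination R². The function names and sample values are illustrative assumptions.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot  # Undefined if all y_true values are equal

# Made-up values for illustration
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(rmse(y_true, y_pred))
print(r_squared(y_true, y_pred))  # 0.5
```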
Case Study: Predicting House Prices with Linear Regression
Let’s tie it all together with a complete prototype: predicting house prices from square footage using only the standard library.
Step 1: Prepare the Data
Create a simple CSV file house_data.csv (or use in-memory data for testing):
square_footage,price
1000,200000
1200,240000
1500,300000
1800,360000
2000,400000
Step 2: Load and Preprocess Data
import csv
import statistics
import random

# Load data
def load_data(file_path):
    with open(file_path, 'r') as f:
        reader = csv.DictReader(f)
        return [
            {'square_footage': float(row['square_footage']), 'price': float(row['price'])}
            for row in reader
        ]

data = load_data('house_data.csv')

# Standardize features (square_footage)
square_footages = [d['square_footage'] for d in data]
mean_sqft = statistics.mean(square_footages)
std_sqft = statistics.stdev(square_footages) if len(square_footages) > 1 else 1
for d in data:
    d['square_footage_std'] = (d['square_footage'] - mean_sqft) / std_sqft

# Split into train/test (train_test_split is defined in the Train-Test Split example)
train_data, test_data = train_test_split(data, test_size=0.2)

# Extract features (X) and target (y)
X_train = [[d['square_footage_std']] for d in train_data]  # 2D list for model compatibility
y_train = [d['price'] for d in train_data]
X_test = [[d['square_footage_std']] for d in test_data]
y_test = [d['price'] for d in test_data]
Step 3: Train and Evaluate the Model
# Initialize and train model
model = LinearRegression(learning_rate=0.01, epochs=1000)
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse:.2f}")
print(f"Predictions: {y_pred}")
print(f"True Values: {y_test}")
Expected Output
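With our small dataset, the relationship is perfectly linear (price = 200 * square_footage), so the test MSE should be small relative to the squared price scale. How close it gets to zero depends on the random split and on whether gradient descent has converged after 1,000 epochs at this learning rate; more epochs tighten the fit. Because the model is trained on the standardized feature, the learned parameters are in standardized units: the bias approaches 200 * mean_sqft = 300,000 and the weight approaches 200 * std_sqft, rather than 200 itself.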
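To read a model trained on a standardized feature in the original units, the weight and bias can be converted back algebraically: substituting z = (x - mean) / std into y = bias + weight * z gives a slope of weight / std and an intercept of bias - weight * mean / std. The sketch below uses a hypothetical helper, to_original_units, and plugs in the parameters of an exact fit rather than values from an actual training run.

```python
def to_original_units(weight, bias, mean, std):
    """Convert a fit on z = (x - mean) / std into slope and intercept on x."""
    slope = weight / std
    intercept = bias - weight * mean / std
    return slope, intercept

# Assumed values: mean and sample stdev of the five square footages,
# and the weight and bias of an exact fit to price = 200 * square_footage
mean_sqft = 1500.0
std_sqft = 170000 ** 0.5          # sample variance is 680000 / 4
weight, bias = 200 * std_sqft, 300000.0

slope, intercept = to_original_units(weight, bias, mean_sqft, std_sqft)
print(round(slope, 6), round(intercept, 6))
```

With a trained model, model.weights[0] and model.bias would take the place of the assumed weight and bias, and the recovered slope should land close to 200 once training has converged.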
Limitations and When to Switch to External Libraries
While the standard library is powerful for prototyping, it has clear limitations:
- No Vectorization: Loops are slow for large datasets (e.g., 100k+ samples). Use numpy for vectorized operations.
- Limited Data Tools: No built-in DataFrames (use pandas for complex data manipulation).
- Advanced Algorithms: No support for neural networks, random forests, or SVMs (use scikit-learn, TensorFlow, or PyTorch).
- Scalability: Not designed for parallel processing or GPU acceleration.
Switch to external libraries when:
- Your dataset exceeds 10k samples.
- You need advanced preprocessing (e.g., one-hot encoding, PCA).
- Prototyping evolves into production.
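One way to judge when pure-Python loops become a bottleneck is to measure them. The sketch below times a dot product over 100,000 elements with timeit; the array size and repeat count are arbitrary choices for illustration, and the result depends entirely on the machine, so no expected figure is given.

```python
import timeit

def dot(a, b):
    """Pure-Python dot product: one interpreter-level multiply-add per element."""
    return sum(x * y for x, y in zip(a, b))

n = 100_000  # Arbitrary size chosen for illustration
a = [float(i) for i in range(n)]
b = [1.0] * n

# Average wall-clock time of one call over 10 runs
seconds = timeit.timeit(lambda: dot(a, b), number=10) / 10
print(f"dot product over {n} elements: {seconds * 1000:.2f} ms per call")
```

Multiplying that per-call cost by the number of epochs and features gives a rough estimate of training time, which helps decide whether a vectorized library is worth the dependency.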
Conclusion
Python’s standard library is a hidden gem for ML prototyping. By leveraging modules like csv, math, and random, you can build functional models with minimal dependencies, deepen your ML knowledge, and iterate quickly. While it’s not a replacement for heavyweights like scikit-learn, it’s an invaluable tool for learning, small-scale projects, and environments with restrictions.
Next time you want to test an ML idea, try starting with the standard library—you might be surprised by how far it can take you!