Table of Contents
- Prerequisites
- Linear Regression
- 2.1 Theory & Math
- 2.2 Implementation Steps
- 2.3 Python Code & Explanation
- 2.4 Example & Visualization
- Logistic Regression
- 3.1 Theory & Math
- 3.2 Implementation Steps
- 3.3 Python Code & Explanation
- 3.4 Example & Visualization
- Decision Trees
- 4.1 Theory & Math
- 4.2 Implementation Steps
- 4.3 Python Code & Explanation
- 4.4 Example & Visualization
- K-Means Clustering
- 5.1 Theory & Math
- 5.2 Implementation Steps
- 5.3 Python Code & Explanation
- 5.4 Example & Visualization
- Conclusion
- References
Prerequisites
To follow along, you’ll need:
- Basic Python knowledge (classes, functions, loops).
- Familiarity with NumPy (arrays, matrix operations) and Matplotlib (plotting).
- Optional: High-school math (algebra, derivatives) for understanding gradients.
Linear Regression
2.1 Theory & Math
Linear Regression models the relationship between a dependent variable ( y ) and one or more independent variables ( X ). For simple linear regression (one feature), the model is:
[ \hat{y} = wx + b ]
Where:
- ( \hat{y} ): Predicted value.
- ( w ): Weight (slope of the line).
- ( b ): Bias (y-intercept).
Our goal is to find ( w ) and ( b ) that minimize the Mean Squared Error (MSE) between predictions ( \hat{y} ) and true values ( y ):
[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - wx_i - b)^2 ]
To minimize MSE, we use Gradient Descent:
- Compute gradients of MSE with respect to ( w ) and ( b ).
- Update ( w ) and ( b ) iteratively:
[ w = w - \alpha \frac{\partial \text{MSE}}{\partial w}, \quad b = b - \alpha \frac{\partial \text{MSE}}{\partial b} ]
- ( \alpha ): Learning rate (controls step size).
Derivatives (gradients):
[ \frac{\partial \text{MSE}}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} -x_i(y_i - \hat{y}_i) ]
[ \frac{\partial \text{MSE}}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} -(y_i - \hat{y}_i) ]
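Before trusting hand-derived gradients, it helps to compare them against a numerical approximation. The sketch below does that for the two MSE derivatives above; the helper names (`mse`, `mse_gradients`), the tiny dataset, and the starting values of `w` and `b` are illustrative assumptions, not part of the implementation that follows.

```python
import numpy as np

def mse(w, b, x, y):
    """Mean squared error of the line y_hat = w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def mse_gradients(w, b, x, y):
    """Analytic gradients of the MSE with respect to w and b."""
    residual = y - (w * x + b)
    dw = (2 / len(x)) * np.sum(-x * residual)
    db = (2 / len(x)) * np.sum(-residual)
    return dw, db

# Tiny hand-made dataset (illustrative values): exactly y = 2x + 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 7.0, 9.0, 11.0])
w, b = 0.5, 0.0  # arbitrary starting point

dw, db = mse_gradients(w, b, x, y)

# Central finite differences approximate the same derivatives
eps = 1e-6
dw_numeric = (mse(w + eps, b, x, y) - mse(w - eps, b, x, y)) / (2 * eps)
db_numeric = (mse(w, b + eps, x, y) - mse(w, b - eps, x, y)) / (2 * eps)

print(f"dw: analytic={dw:.4f}, numeric={dw_numeric:.4f}")
print(f"db: analytic={db:.4f}, numeric={db_numeric:.4f}")
```

Both gradients are negative here, so the update rule increases ( w ) and ( b ), moving them toward the true values of 2 and 3.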
2.2 Implementation Steps
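- Initialize Parameters: Choose starting values for ( w ) and ( b ). Zeros work fine here (( w=0, b=0 )) because the MSE of a linear model is convex, so the starting point does not change where Gradient Descent ends up.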
- Predict: Compute ( \hat{y} = wx + b ).
- Compute Gradients: Use the MSE derivatives above.
- Update Parameters: Adjust ( w ) and ( b ) using gradients and ( \alpha ).
- Repeat: Iterate until the MSE stops changing significantly, or simply for a fixed number of epochs (the approach taken in the code in 2.3).
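The "repeat until MSE converges" step can be made explicit with a tolerance on the change in loss. Here is a minimal sketch of that idea; the function name `fit_until_converged` and the parameters `tol` and `max_epochs` are assumptions chosen for illustration.

```python
import numpy as np

def fit_until_converged(x, y, lr=0.01, tol=1e-9, max_epochs=100_000):
    """Gradient descent on MSE that stops once the loss changes by less than tol."""
    w, b = 0.0, 0.0
    prev_loss = np.inf
    for epoch in range(max_epochs):
        error = (w * x + b) - y          # predictions minus targets
        loss = np.mean(error ** 2)       # current MSE
        if abs(prev_loss - loss) < tol:  # converged: loss barely moved
            break
        prev_loss = loss
        w -= lr * (2 / len(x)) * np.sum(x * error)
        b -= lr * (2 / len(x)) * np.sum(error)
    return w, b, epoch

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 3  # noise-free line, so the best fit is w=2, b=3
w, b, epochs_used = fit_until_converged(x, y)
print(f"w={w:.3f}, b={b:.3f}, stopped after {epochs_used} epochs")
```

A tolerance-based stop avoids wasting iterations on easy problems, while `max_epochs` still guards against a learning rate that never settles.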
2.3 Python Code & Explanation
We’ll use synthetic data for demonstration. Here’s the full implementation:
import numpy as np
import matplotlib.pyplot as plt
class LinearRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.lr = learning_rate  # Step size for updates
        self.epochs = epochs     # Number of iterations
        self.w = None            # Weight (slope)
        self.b = None            # Bias (intercept)

    def fit(self, X, y):
        # Flatten both inputs so (n_samples, 1) and (n_samples,) shapes both work.
        # Mixing the two shapes would otherwise broadcast to (n_samples, n_samples).
        X = np.asarray(X, dtype=float).ravel()
        y = np.asarray(y, dtype=float).ravel()
        n_samples = X.shape[0]
        self.w = 0.0  # Initialize weight
        self.b = 0.0  # Initialize bias
        # Gradient Descent loop
        for _ in range(self.epochs):
            y_pred = self.w * X + self.b  # Predictions
            error = y_pred - y            # Prediction error (y_pred - y)
            # Compute gradients
            dw = (2 / n_samples) * np.sum(X * error)  # d(MSE)/dw
            db = (2 / n_samples) * np.sum(error)      # d(MSE)/db
            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        # Inference: output has the same shape as X
        return self.w * np.asarray(X, dtype=float) + self.b
2.4 Example & Visualization
Let’s test with synthetic data:
# Generate data: y = 2x + 3 + noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10 # Features (0-10)
y = 2 * X + 3 + np.random.randn(100, 1) * 1.5 # True relation + noise
# Train model
model = LinearRegression(learning_rate=0.01, epochs=1000)
model.fit(X, y)
# Predict and plot
y_pred = model.predict(X)
plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, color="red", label=f"Fit: y = {model.w:.2f}x + {model.b:.2f}")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
Output: A scatter plot with a red line fitting the data. The learned ( w ) and ( b ) land close to the true values of 2 and 3; the noise in the data and the finite number of epochs keep them from matching exactly.
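One way to sanity-check the Gradient Descent result is to compare it with the closed-form least-squares line for the same data. The sketch below regenerates the synthetic data with the same seed and uses `np.polyfit` as the reference; using `np.polyfit` here is our choice for the check, not something the implementation above relies on.

```python
import numpy as np

# Same synthetic data as in the example: y = 2x + 3 + noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X + 3 + np.random.randn(100, 1) * 1.5

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)
print(f"least-squares fit: y = {slope:.2f}x + {intercept:.2f}")
```

If the values printed by the Gradient Descent model differ noticeably from these, more epochs or a different learning rate are the first things to try.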
Logistic Regression
3.1 Theory & Math
Logistic Regression is for binary classification (predicting 0 or 1). Unlike Linear Regression, it uses the sigmoid function to squish predictions between 0 and 1:
[ \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = wx + b ]
Thus, the predicted probability of class 1 is:
[ \hat{y} = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}} ]
Loss Function: Use Binary Cross-Entropy (BCE) instead of MSE (BCE penalizes confident wrong predictions more heavily):
[ \text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]
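To see what "penalizes confident wrong predictions more heavily" means in numbers, the sketch below evaluates BCE and squared error for a single positive sample at several predicted probabilities. The helper `binary_cross_entropy` and its `eps` clipping parameter are assumptions added for illustration (clipping avoids `log(0)`).

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0])  # the true class is 1
for p in (0.9, 0.6, 0.1, 0.01):
    bce = binary_cross_entropy(y_true, np.array([p]))
    squared_error = (1.0 - p) ** 2
    print(f"p={p:<5} BCE={bce:.3f}  squared error={squared_error:.3f}")
```

Squared error can never exceed 1 for a probability, while BCE keeps growing as the predicted probability of the true class approaches 0.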
Gradients for Gradient Descent (derived using chain rule):
[ \frac{\partial \text{BCE}}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} x_i (\hat{y}_i - y_i) ]
[ \frac{\partial \text{BCE}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) ]
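These gradients look almost too simple after the chain rule collapses, so a numerical check is reassuring. The sketch below compares them with central finite differences for the single-feature case; the function names, data, and starting parameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(w, b, x, y):
    """Binary cross-entropy for the single-feature model sigmoid(w*x + b)."""
    p = sigmoid(w * x + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_gradients(w, b, x, y):
    """Analytic gradients: mean of x*(p - y) and mean of (p - y)."""
    p = sigmoid(w * x + b)
    return np.mean(x * (p - y)), np.mean(p - y)

# Small illustrative dataset with overlapping classes
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
w, b = 0.3, -0.1

dw, db = bce_gradients(w, b, x, y)

eps = 1e-6
dw_numeric = (bce(w + eps, b, x, y) - bce(w - eps, b, x, y)) / (2 * eps)
db_numeric = (bce(w, b + eps, x, y) - bce(w, b - eps, x, y)) / (2 * eps)

print(f"dw: analytic={dw:.6f}, numeric={dw_numeric:.6f}")
print(f"db: analytic={db:.6f}, numeric={db_numeric:.6f}")
```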
3.2 Implementation Steps
- Initialize ( w ) and ( b ).
- Predict Probabilities: ( \hat{y} = \sigma(wx + b) ).
- Compute BCE Loss and gradients.
- Update ( w ) and ( b ) via Gradient Descent.
- Threshold Predictions: For class labels, use ( \hat{y} \geq 0.5 \rightarrow 1 ), else 0.
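The last step, turning probabilities into labels, can be sketched on a handful of raw scores. The values of `z` and the stricter threshold of 0.7 are illustrative assumptions.

```python
import numpy as np

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # raw scores wx + b
probs = 1 / (1 + np.exp(-z))               # sigmoid maps them into (0, 1)

labels = (probs >= 0.5).astype(int)          # default threshold
labels_strict = (probs >= 0.7).astype(int)   # stricter threshold (assumed value)

print(np.round(probs, 3))
print(labels)         # [0 0 1 1 1]
print(labels_strict)  # [0 0 0 0 1]
```

Because ( \sigma(0) = 0.5 ), thresholding the probability at 0.5 is the same as checking whether ( wx + b \geq 0 ). Raising the threshold trades fewer false positives for more false negatives.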
3.3 Python Code & Explanation
class LogisticRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.lr = learning_rate
        self.epochs = epochs
        self.w = None
        self.b = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))  # Sigmoid function

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)  # Initialize weights (for multiple features)
        self.b = 0                     # Initialize bias
        for _ in range(self.epochs):
            z = np.dot(X, self.w) + self.b  # z = wx + b
            y_pred = self.sigmoid(z)        # Probabilities
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))  # d(BCE)/dw
            db = (1 / n_samples) * np.sum(y_pred - y)         # d(BCE)/db
            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict_prob(self, X):
        z = np.dot(X, self.w) + self.b
        return self.sigmoid(z)  # Return probabilities

    def predict(self, X, threshold=0.5):
        return (self.predict_prob(X) >= threshold).astype(int)  # Class labels
3.4 Example & Visualization
Test with Iris dataset (binary classification: Setosa vs. Versicolor):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data (binary classification: 0=Setosa, 1=Versicolor)
data = load_iris()
X = data.data[:100, :2] # First 2 features, first 100 samples (0/1)
y = data.target[:100]
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(learning_rate=0.01, epochs=10000)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}") # ~1.0 (perfect separation!)
# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Decision Boundary")
plt.show()
Output: A plot with a decision boundary separating Setosa (0) and Versicolor (1) perfectly.
Decision Trees
4.1 Theory & Math
Decision Trees split data into subsets based on feature values, creating a tree-like structure for classification/regression. For classification, we use Gini Impurity to measure node “purity” (how mixed classes are):
[ \text{Gini}(p) = 1 - \sum_{i=1}^{C} p_i^2 ]
Where ( p_i ) is the proportion of class ( i ) in the node. A node with Gini=0 is “pure” (all samples belong to one class).
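A few small label sets make the formula concrete. The standalone `gini` helper below is a sketch written for this illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = gini([0, 0, 0, 0])       # one class only
balanced = gini([0, 0, 1, 1])   # two classes, 50/50
three_way = gini([0, 1, 2])     # three classes, equal shares

print(f"pure: {pure:.3f}")            # 0.000
print(f"balanced: {balanced:.3f}")    # 0.500
print(f"three-way: {three_way:.3f}")  # 0.667
```

With ( C ) equally likely classes the impurity is ( 1 - 1/C ), so the maximum grows toward 1 as the number of classes increases.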
Splitting Logic: For each feature and threshold, split the data into left/right subsets and compute the weighted Gini impurity of the split:
[ \text{Gini}_{\text{split}} = \frac{n_{\text{left}}}{n} \text{Gini}_{\text{left}} + \frac{n_{\text{right}}}{n} \text{Gini}_{\text{right}} ]
Choose the split with the lowest Gini impurity.
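The split search can be traced by hand on one feature with six samples. In this sketch the helper names (`gini`, `weighted_gini`) and the toy data are assumptions; the largest feature value is skipped as a threshold because it would leave the right side empty.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(feature, labels, threshold):
    """Weighted Gini impurity of splitting at feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Score every candidate threshold that leaves both sides non-empty
scores = {float(t): weighted_gini(feature, labels, t) for t in feature[:-1]}
for t, s in scores.items():
    print(f"threshold {t:.0f}: weighted Gini = {s:.3f}")

best_threshold = min(scores, key=scores.get)
print("best threshold:", best_threshold)  # 3.0
```

The threshold 3 separates the classes completely, so both children are pure and the weighted impurity drops to 0.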
4.2 Implementation Steps
- Recursive Splitting: Start with the root node (all data). For each node:
  - If pure (Gini=0) or max depth reached, stop (leaf node: predict majority class).
  - Else, find the best feature/threshold split (min Gini impurity).
  - Split data into left/right subsets and repeat for child nodes.
4.3 Python Code & Explanation
We’ll implement a simple classification tree:
class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # Stop splitting after this depth
        self.tree = None            # Nested dicts (internal nodes) with class labels as leaves

    def gini(self, y):
        # Compute Gini impurity
        classes, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return 1 - np.sum(p ** 2)

    def best_split(self, X, y):
        # Find best feature and threshold to split
        best_gini = float("inf")
        best_feature = None
        best_threshold = None
        for feature in range(X.shape[1]):
            thresholds = np.unique(X[:, feature])  # Unique values in feature
            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask
                if len(y[left_mask]) == 0 or len(y[right_mask]) == 0:
                    continue  # Skip splits with empty subsets
                gini_split = (len(y[left_mask]) / len(y)) * self.gini(y[left_mask]) + \
                             (len(y[right_mask]) / len(y)) * self.gini(y[right_mask])
                if gini_split < best_gini:
                    best_gini = gini_split
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold, best_gini

    def build_tree(self, X, y, depth=0):
        # Recursively build tree (y must hold non-negative integer labels for np.bincount)
        if (self.max_depth is not None and depth >= self.max_depth) or self.gini(y) == 0:
            # Leaf node: return majority class
            return np.argmax(np.bincount(y))
        # Split
        feature, threshold, gini = self.best_split(X, y)
        if feature is None:  # No valid split
            return np.argmax(np.bincount(y))
        # Recurse on children
        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask
        left_subtree = self.build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = self.build_tree(X[right_mask], y[right_mask], depth + 1)
        return {"feature": feature, "threshold": threshold,
                "left": left_subtree, "right": right_subtree}

    def fit(self, X, y):
        self.tree = self.build_tree(X, y)

    def predict_sample(self, x, tree):
        # Predict single sample
        if not isinstance(tree, dict):  # Leaf node: anything that is not a split dict
            return tree
        feature = tree["feature"]
        if x[feature] <= tree["threshold"]:
            return self.predict_sample(x, tree["left"])
        else:
            return self.predict_sample(x, tree["right"])

    def predict(self, X):
        return np.array([self.predict_sample(x, self.tree) for x in X])
4.4 Example & Visualization
Test with Iris data:
from sklearn.datasets import load_iris
data = load_iris()
X = data.data[:, :2] # First 2 features
y = data.target
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)
# Plot decision boundary (similar to Logistic Regression example)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Decision Tree Boundary")
plt.show()
Output: A plot with regions separated by axis-aligned splits (the tree’s decision boundaries).
K-Means Clustering
5.1 Theory & Math
K-Means is an unsupervised algorithm that groups data into ( K ) clusters. It minimizes the inertia (sum of squared distances from samples to their cluster centroid).
Steps:
- Initialize Centroids: Randomly select ( K ) data points as initial centroids.
- Assign Clusters: Assign each sample to the nearest centroid (using Euclidean distance).
- Update Centroids: Compute new centroids as the mean of samples in each cluster.
- Repeat: Until centroids stop changing (convergence).
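Inertia, the quantity K-Means minimizes, is easy to compute directly. The sketch below compares two centroid placements for the same assignment; the `inertia` helper and the four sample points are illustrative assumptions.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])  # two samples per cluster

at_means = np.array([[0.0, 1.0], [10.0, 1.0]])   # centroids at the cluster means
off_center = np.array([[0.0, 0.0], [10.0, 0.0]]) # centroids sitting on a sample

inertia_means = inertia(X, at_means, labels)
inertia_off = inertia(X, off_center, labels)
print(inertia_means)  # 4.0
print(inertia_off)    # 8.0
```

For a fixed assignment, placing each centroid at the mean of its cluster gives the lowest inertia, which is why the update step uses the mean.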
5.2 Implementation Steps
- Choose ( K ): Number of clusters (user-specified).
- Initialize Centroids: Randomly pick ( K ) samples from ( X ).
- Cluster Assignment: For each sample, compute distance to all centroids; assign to closest.
- Update Centroids: For each cluster, set centroid to the mean of its samples.
- Check Convergence: If centroids don’t change, stop. Otherwise, repeat the cluster assignment and centroid update steps.
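The assignment and update loop can be traced on six points before wrapping it in a class. In this sketch the initial centroids are picked by fixed indices rather than at random, purely so the trace is reproducible; that choice and the toy data are assumptions.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
centroids = X[[0, 3]].copy()  # fixed picks instead of random ones (assumption)

for step in range(10):
    # distances[i, k] = Euclidean distance from sample i to centroid k
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    labels = distances.argmin(axis=1)  # closest centroid per sample
    # New centroid = mean of the samples assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):  # nothing moved: converged
        break
    centroids = new_centroids

print("labels:", labels)  # [0 0 0 1 1 1]
print("centroids:\n", np.round(centroids, 3))
print("stopped at step", step)
```

The first pass moves each centroid to the mean of its group, and the second pass confirms that nothing changes, so the loop exits.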
5.3 Python Code & Explanation
class KMeans:
    def __init__(self, n_clusters=2, max_iter=100, random_state=42):
        self.n_clusters = n_clusters      # K
        self.max_iter = max_iter          # Max iterations
        self.random_state = random_state  # Seed for centroid initialization
        self.centroids = None             # Cluster centers

    def _distances(self, X):
        # Euclidean distance from every centroid to every sample: shape (K, n_samples)
        return np.sqrt(((X - self.centroids[:, np.newaxis]) ** 2).sum(axis=2))

    def fit(self, X):
        # Initialize centroids: K distinct random samples from X.
        # A local generator avoids reseeding NumPy's global random state.
        rng = np.random.RandomState(self.random_state)
        self.centroids = X[rng.choice(X.shape[0], self.n_clusters, replace=False)]
        for _ in range(self.max_iter):
            # Assign clusters: index of the closest centroid for each sample
            labels = np.argmin(self._distances(X), axis=0)
            # Update centroids: mean of each cluster (an empty cluster keeps its old centroid)
            new_centroids = np.array([
                X[labels == i].mean(axis=0) if np.any(labels == i) else self.centroids[i]
                for i in range(self.n_clusters)
            ])
            # Check convergence
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    def predict(self, X):
        # Assign new samples to clusters
        return np.argmin(self._distances(X), axis=0)
5.4 Example & Visualization
Test with synthetic blobs:
from sklearn.datasets import make_blobs
# Generate 3 clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
# Fit K-Means
model = KMeans(n_clusters=3, max_iter=100)
model.fit(X)
y_pred = model.predict(X)
# Plot
plt.scatter(X[:, 0], X[:,1], c=y_pred, cmap="viridis", label="Clusters")
plt.scatter(model.centroids[:, 0], model.centroids[:, 1], s=300, c="red", marker="X", label="Centroids")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
Output: Data points colored by cluster, with red X’s marking centroids. K-Means successfully groups the blobs!
Conclusion
Implementing ML algorithms from scratch transforms abstract concepts into tangible code. You’ve now built Linear Regression (regression), Logistic Regression (classification), Decision Trees (non-parametric learning), and K-Means (clustering), with the algorithms themselves written in NumPy alone (scikit-learn and Matplotlib appear only for datasets, evaluation, and plotting).
This foundation lets you:
- Debug models when they fail (e.g., why Gradient Descent diverges).
- Customize algorithms (e.g., add L2 regularization to Linear Regression).
- Innovate (e.g., modify K-Means to use Manhattan distance).
Next steps: Explore Random Forests (ensembles of Decision Trees), or deep learning (neural networks from scratch)!
References
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. O’Reilly Media.
- James, G., et al. (2021). An Introduction to Statistical Learning. Springer.
- Ng, A. (2018). Machine Learning Specialization. Coursera (Stanford).
- Scikit-Learn Documentation: LinearRegression, LogisticRegression, DecisionTreeClassifier, KMeans.