py4u guide

Implementing Machine Learning Algorithms in Python from Scratch

Machine learning (ML) has revolutionized industries from healthcare to finance, but much of its power is hidden behind "black-box" libraries like scikit-learn or TensorFlow. While these tools are indispensable for building production systems, **implementing algorithms from scratch** is a critical step in mastering ML. It demystifies the math, strengthens intuition about how models learn, and equips you to debug, customize, or innovate on existing methods. In this post, we’ll build four foundational ML algorithms from the ground up using Python and NumPy: Linear Regression (regression), Logistic Regression (classification), Decision Trees (non-parametric classification/regression), and K-Means Clustering (unsupervised clustering). We’ll break down the theory, math, and code step by step, so you understand *why* each part works.

Table of Contents

  1. Prerequisites
  2. Linear Regression
    • 2.1 Theory & Math
    • 2.2 Implementation Steps
    • 2.3 Python Code & Explanation
    • 2.4 Example & Visualization
  3. Logistic Regression
    • 3.1 Theory & Math
    • 3.2 Implementation Steps
    • 3.3 Python Code & Explanation
    • 3.4 Example & Visualization
  4. Decision Trees
    • 4.1 Theory & Math
    • 4.2 Implementation Steps
    • 4.3 Python Code & Explanation
    • 4.4 Example & Visualization
  5. K-Means Clustering
    • 5.1 Theory & Math
    • 5.2 Implementation Steps
    • 5.3 Python Code & Explanation
    • 5.4 Example & Visualization
  6. Conclusion
  7. References

Prerequisites

To follow along, you’ll need:

  • Basic Python knowledge (classes, functions, loops).
  • Familiarity with NumPy (arrays, matrix operations) and Matplotlib (plotting).
  • Optional: High-school math (algebra, derivatives) for understanding gradients.

Linear Regression

2.1 Theory & Math

Linear Regression models the relationship between a dependent variable ( y ) and one or more independent variables ( X ). For simple linear regression (one feature), the model is:

[ \hat{y} = wx + b ]

Where:

  • ( \hat{y} ): Predicted value.
  • ( w ): Weight (slope of the line).
  • ( b ): Bias (y-intercept).

Our goal is to find ( w ) and ( b ) that minimize the Mean Squared Error (MSE) between predictions ( \hat{y} ) and true values ( y ):

[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - wx_i - b)^2 ]

To minimize MSE, we use Gradient Descent:

  • Compute gradients of MSE with respect to ( w ) and ( b ).
  • Update ( w ) and ( b ) iteratively:
    [ w = w - \alpha \frac{\partial \text{MSE}}{\partial w}, \quad b = b - \alpha \frac{\partial \text{MSE}}{\partial b} ]
  • ( \alpha ): Learning rate (controls step size).
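
To see what "controls step size" means in practice, here is a minimal sketch that runs the same update rule on a toy one-parameter loss, f(w) = (w - 3)^2. The helper name `descend` and the two learning rates are illustrative assumptions, not part of the implementation later in this section.

```python
def descend(lr, steps=25, w=0.0):
    """Run gradient descent on f(w) = (w - 3)**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)**2
        w -= lr * grad       # same update rule as in the text
    return w

small = descend(lr=0.1)   # each step shrinks the distance to 3 by a factor of 0.8
large = descend(lr=1.1)   # each step overshoots: the distance grows by a factor of 1.2

print(f"lr=0.1 -> w = {small:.4f}")
print(f"lr=1.1 -> w = {large:.4f}")
```

With the small rate, w settles near the minimum at 3. With the large rate, every update jumps past the minimum by more than it started with, so the iterates diverge.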

Derivatives (gradients):
[ \frac{\partial \text{MSE}}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} -x_i(y_i - \hat{y}_i) ]
[ \frac{\partial \text{MSE}}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} -(y_i - \hat{y}_i) ]
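
A quick way to convince yourself the two derivatives are right is to compare them against finite differences. The sketch below does that on a tiny made-up dataset; the function names `mse` and `analytic_grads`, the data, and the test point are all assumptions chosen for illustration.

```python
import numpy as np

def mse(w, b, x, y):
    """Mean squared error of the line w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def analytic_grads(w, b, x, y):
    """Gradients of MSE computed from the two formulas above."""
    y_hat = w * x + b
    dw = (2 / len(x)) * np.sum(-x * (y - y_hat))
    db = (2 / len(x)) * np.sum(-(y - y_hat))
    return dw, db

# Tiny made-up dataset and an arbitrary point (w, b) to test at
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.5, 4.5, 7.0])
w, b, eps = 0.5, -1.0, 1e-6

dw, db = analytic_grads(w, b, x, y)
# Central differences: nudge one parameter at a time and watch the loss change
dw_num = (mse(w + eps, b, x, y) - mse(w - eps, b, x, y)) / (2 * eps)
db_num = (mse(w, b + eps, x, y) - mse(w, b - eps, x, y)) / (2 * eps)

print(f"dw: analytic={dw:.6f}, numeric={dw_num:.6f}")
print(f"db: analytic={db:.6f}, numeric={db_num:.6f}")
```

If the analytic and numeric values disagree beyond rounding error, the usual culprit is a sign error or a missing factor of 2/n.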

2.2 Implementation Steps

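  1. Initialize Parameters: Choose starting values for ( w ) and ( b ). Zeros work here (( w=0, b=0 )) because MSE is convex in both parameters, so any starting point leads toward the same minimum.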
  2. Predict: Compute ( \hat{y} = wx + b ).
  3. Compute Gradients: Use the MSE derivatives above.
  4. Update Parameters: Adjust ( w ) and ( b ) using gradients and ( \alpha ).
  5. Repeat: Iterate until MSE converges (stops changing significantly).

2.3 Python Code & Explanation

We’ll use synthetic data for demonstration. Here’s the full implementation:

import numpy as np
import matplotlib.pyplot as plt

class LinearRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.lr = learning_rate  # Step size for updates
        self.epochs = epochs      # Number of iterations
        self.w = None             # Weight (slope)
        self.b = None             # Bias (intercept)

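    def fit(self, X, y):
        # X: (n_samples, 1) or (n_samples,); y: (n_samples, 1) or (n_samples,)
        # Flatten both so X * error stays elementwise instead of broadcasting to (n, n)
        X = np.asarray(X, dtype=float).ravel()
        y = np.asarray(y, dtype=float).ravel()
        n_samples = X.shape[0]
        self.w = 0.0  # Initialize weight
        self.b = 0.0  # Initialize bias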

        # Gradient Descent loop
        for _ in range(self.epochs):
            y_pred = self.w * X + self.b  # Predictions
            error = y_pred - y            # Residuals

            # Compute gradients
            dw = (2 / n_samples) * np.sum(X * error)  # d(MSE)/dw
            db = (2 / n_samples) * np.sum(error)       # d(MSE)/db

            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        return self.w * X + self.b  # Inference
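
The class above always runs for a fixed number of epochs, whereas step 5 of the implementation steps asks us to stop once MSE stops changing. One way that could look is sketched below. The function `fit_until_converged` and its `tol` and `max_epochs` parameters are hypothetical additions, not part of the class.

```python
import numpy as np

def fit_until_converged(x, y, lr=0.01, max_epochs=10_000, tol=1e-9):
    """Gradient descent on MSE that stops once the loss improves by less than tol.

    x and y are 1-D arrays. Returns (w, b, epochs_run).
    """
    w, b = 0.0, 0.0
    prev_loss = np.inf
    for epoch in range(1, max_epochs + 1):
        error = (w * x + b) - y
        loss = np.mean(error ** 2)
        if prev_loss - loss < tol:   # loss has stopped changing significantly
            break
        prev_loss = loss
        w -= lr * (2 / len(x)) * np.sum(x * error)
        b -= lr * (2 / len(x)) * np.sum(error)
    return w, b, epoch

# Noise-free toy data on the line y = 2x + 3
x = np.linspace(0, 5, 50)
y = 2 * x + 3
w, b, epochs_run = fit_until_converged(x, y)
print(f"w={w:.3f}, b={b:.3f}, stopped after {epochs_run} epochs")
```

The tolerance is a trade-off: a looser `tol` stops earlier with a rougher fit, a tighter one runs longer for diminishing gains.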

2.4 Example & Visualization

Let’s test with synthetic data:

# Generate data: y = 2x + 3 + noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Features (0-10)
y = 2 * X + 3 + np.random.randn(100, 1) * 1.5  # True relation + noise

# Train model
model = LinearRegression(learning_rate=0.01, epochs=1000)
model.fit(X, y)

# Predict and plot
y_pred = model.predict(X)
plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, color="red", label=f"Fit: y = {model.w:.2f}x + {model.b:.2f}")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Output: A scatter plot with a red line fitting the data. The learned ( w ) and ( b ) land close to the true values of 2 and 3; the added noise keeps them from matching exactly.
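
As a sanity check, MSE for a straight line also has an exact minimizer (ordinary least squares), so the Gradient Descent result should sit close to it. The sketch below computes that reference solution with `np.linalg.lstsq` on the same synthetic data; the variable names `A`, `w_exact`, and `b_exact` are assumptions introduced here.

```python
import numpy as np

# Same synthetic data as the example above
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X + 3 + np.random.randn(100, 1) * 1.5

# Design matrix [x, 1] so that A @ [w, b] = w*x + b
A = np.hstack([X, np.ones_like(X)])
(w_exact, b_exact), *_ = np.linalg.lstsq(A, y.ravel(), rcond=None)

print(f"Least-squares solution: w = {w_exact:.3f}, b = {b_exact:.3f}")
```

If your Gradient Descent values differ noticeably from these, try more epochs before suspecting the code: the bias in particular converges slowly when the feature is not centered.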


Logistic Regression

3.1 Theory & Math

Logistic Regression is for binary classification (predicting 0 or 1). Unlike Linear Regression, it uses the sigmoid function to squish predictions between 0 and 1:

[ \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = wx + b ]

Thus, the predicted probability of class 1 is:
[ \hat{y} = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}} ]

Loss Function: Use Binary Cross-Entropy (BCE) instead of MSE (BCE penalizes confident wrong predictions more heavily):

[ \text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] ]
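
To make "penalizes confident wrong predictions more heavily" concrete, the sketch below evaluates the loss term for a single sample whose true label is 1, at several predicted probabilities. The helper name `bce_single` is an assumption for illustration.

```python
import math

def bce_single(y_true, y_prob):
    """Binary cross-entropy contribution of one sample."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

# True label is 1 in every case; only the predicted probability changes
for p in (0.9, 0.6, 0.1, 0.01):
    print(f"predicted P(class 1) = {p:<4} -> loss = {bce_single(1, p):.3f}")
```

The loss is -log(p) here, so it grows without bound as the predicted probability of the true class approaches 0: a prediction of 0.01 costs far more than one of 0.1, which in turn costs far more than 0.6.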

Gradients for Gradient Descent (derived using chain rule):
[ \frac{\partial \text{BCE}}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} x_i (\hat{y}_i - y_i) ]
[ \frac{\partial \text{BCE}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) ]
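
The chain-rule step can be spelled out for a single sample. This is a sketch of the derivation, writing ( z = wx + b ) and using ( \ell ) for that sample's loss term (the symbol ( \ell ) is introduced here for convenience):

```latex
\begin{aligned}
\ell &= -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right], \qquad \hat{y} = \sigma(z), \quad z = wx + b \\
\frac{\partial \ell}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= \sigma(z)\,(1 - \sigma(z)) = \hat{y}(1 - \hat{y}) \\
\frac{\partial \ell}{\partial w} &= \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = (\hat{y} - y)\, x, \qquad
\frac{\partial \ell}{\partial b} = \hat{y} - y
\end{aligned}
```

The ( \hat{y}(1 - \hat{y}) ) factors cancel, which is why the final gradients look so simple. Averaging over all ( n ) samples gives the two formulas above.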

3.2 Implementation Steps

  1. Initialize ( w ) and ( b ).
  2. Predict Probabilities: ( \hat{y} = \sigma(wx + b) ).
  3. Compute BCE Loss and gradients.
  4. Update ( w ) and ( b ) via Gradient Descent.
  5. Threshold Predictions: For class labels, use ( \hat{y} \geq 0.5 \rightarrow 1 ), else 0.

3.3 Python Code & Explanation

class LogisticRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.lr = learning_rate
        self.epochs = epochs
        self.w = None
        self.b = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))  # Sigmoid function

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)  # Initialize weights (for multiple features)
        self.b = 0                     # Initialize bias

        for _ in range(self.epochs):
            z = np.dot(X, self.w) + self.b  # z = wx + b
            y_pred = self.sigmoid(z)        # Probabilities

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))  # d(BCE)/dw
            db = (1 / n_samples) * np.sum(y_pred - y)         # d(BCE)/db

            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict_prob(self, X):
        z = np.dot(X, self.w) + self.b
        return self.sigmoid(z)  # Return probabilities

    def predict(self, X, threshold=0.5):
        return (self.predict_prob(X) >= threshold).astype(int)  # Class labels
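
One practical caveat with the `sigmoid` method above: for large negative `z`, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. A commonly used workaround is to pick the formula by sign so that only non-positive numbers are ever exponentiated. The helper `stable_sigmoid` below is a hypothetical drop-in sketch, not part of the class.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for an array z that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)); here exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(stable_sigmoid(z))
```

Both branches compute the same function; they only differ in which intermediate values they create.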

3.4 Example & Visualization

Test with the Iris dataset (binary classification: Setosa vs. Versicolor):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data (binary classification: 0=Setosa, 1=Versicolor)
data = load_iris()
X = data.data[:100, :2]  # First 2 features, first 100 samples (0/1)
y = data.target[:100]

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(learning_rate=0.01, epochs=10000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")  # ~1.0 (perfect separation!)

# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Decision Boundary")
plt.show()

Output: A plot with a decision boundary separating Setosa (0) and Versicolor (1) perfectly.


Decision Trees

4.1 Theory & Math

Decision Trees split data into subsets based on feature values, creating a tree-like structure for classification/regression. For classification, we use Gini Impurity to measure node “purity” (how mixed classes are):

[ \text{Gini}(p) = 1 - \sum_{i=1}^{C} p_i^2 ]

Where ( p_i ) is the proportion of class ( i ) in the node. A node with Gini=0 is “pure” (all samples belong to one class).
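
A few hand-checkable cases make the formula concrete. The sketch below uses plain Python and made-up label lists; the standalone function `gini` here is for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini([0, 0, 0, 0]))        # one class only: pure node
print(gini([0, 0, 1, 1]))        # two classes, evenly mixed
print(gini([0, 0, 1, 1, 2, 2]))  # three classes, evenly mixed
```

An evenly mixed node with ( C ) classes has impurity ( 1 - 1/C ), so 0.5 for two classes and about 0.667 for three. That is the worst case for a given number of classes.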

Splitting Logic: For each feature and threshold, split the data into left/right subsets and compute the weighted Gini impurity of the split:

[ \text{Gini}_{\text{split}} = \frac{n_{\text{left}}}{n} \text{Gini}_{\text{left}} + \frac{n_{\text{right}}}{n} \text{Gini}_{\text{right}} ]

Choose the split with the lowest Gini impurity.
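
Here is a small sketch of that search on a made-up one-feature dataset, so you can watch the weighted impurity fall to zero at the threshold that separates the classes. The function `split_gini` and the data are assumptions for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def split_gini(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Made-up one-feature dataset: small values are class 0, large values class 1
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])

# The largest value (8.0) is skipped: it would leave the right side empty
for t in (1.0, 2.0, 3.0, 6.0, 7.0):
    print(f"threshold {t}: weighted Gini = {split_gini(x, y, t):.3f}")
```

The threshold 3.0 puts all of class 0 on the left and all of class 1 on the right, so both children are pure and the weighted impurity is 0. Thresholds further from that boundary leave one child mixed and score worse.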

4.2 Implementation Steps

  1. Recursive Splitting: Start with the root node (all data). For each node:
    • If pure (Gini=0) or max depth reached, stop (leaf node: predict majority class).
    • Else, find the best feature/threshold split (min Gini impurity).
    • Split data into left/right subsets and repeat for child nodes.

4.3 Python Code & Explanation

We’ll implement a simple classification tree:

class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # Stop splitting after this depth

    def gini(self, y):
        # Compute Gini impurity
        classes, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return 1 - np.sum(p ** 2)

    def best_split(self, X, y):
        # Find best feature and threshold to split
        best_gini = float("inf")
        best_feature = None
        best_threshold = None

        for feature in range(X.shape[1]):
            thresholds = np.unique(X[:, feature])  # Unique values in feature
            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask
                if len(y[left_mask]) == 0 or len(y[right_mask]) == 0:
                    continue  # Skip splits with empty subsets
                gini_split = (len(y[left_mask])/len(y)) * self.gini(y[left_mask]) + \
                             (len(y[right_mask])/len(y)) * self.gini(y[right_mask])
                if gini_split < best_gini:
                    best_gini = gini_split
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold, best_gini

    def build_tree(self, X, y, depth=0):
        # Recursively build tree
        if (self.max_depth is not None and depth >= self.max_depth) or self.gini(y) == 0:
            # Leaf node: return majority class
            return np.argmax(np.bincount(y))

        # Split
        feature, threshold, gini = self.best_split(X, y)
        if feature is None:  # No valid split
            return np.argmax(np.bincount(y))

        # Recurse on children
        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask
        left_subtree = self.build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = self.build_tree(X[right_mask], y[right_mask], depth + 1)
        return {"feature": feature, "threshold": threshold, 
                "left": left_subtree, "right": right_subtree}

    def fit(self, X, y):
        self.tree = self.build_tree(X, y)

    def predict_sample(self, x, tree):
        # Predict single sample
        if not isinstance(tree, dict):  # Leaf node: a class label rather than a split
            return tree
        feature = tree["feature"]
        if x[feature] <= tree["threshold"]:
            return self.predict_sample(x, tree["left"])
        else:
            return self.predict_sample(x, tree["right"])

    def predict(self, X):
        return np.array([self.predict_sample(x, self.tree) for x in X])
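
Because the fitted tree is just nested dictionaries with class labels at the leaves, it is easy to inspect. The helper below is a hypothetical sketch (the name `describe_tree` and the `feature_names` argument are not part of the class) that renders such a structure as indented if/else text. It is shown on a hand-written toy tree so it runs on its own.

```python
def describe_tree(node, feature_names, indent=""):
    """Return an indented text rendering of a nested-dict tree.

    `node` is either a dict with keys "feature", "threshold", "left", "right"
    (the format returned by build_tree above) or a bare class label (a leaf).
    """
    if not isinstance(node, dict):
        return f"{indent}predict class {node}\n"
    name = feature_names[node["feature"]]
    text = f"{indent}if {name} <= {node['threshold']}:\n"
    text += describe_tree(node["left"], feature_names, indent + "    ")
    text += f"{indent}else:\n"
    text += describe_tree(node["right"], feature_names, indent + "    ")
    return text

# Hand-written tree in the same format, used here only for illustration
toy_tree = {
    "feature": 0, "threshold": 5.4,
    "left": 0,
    "right": {"feature": 1, "threshold": 3.0, "left": 1, "right": 2},
}
print(describe_tree(toy_tree, ["sepal_length", "sepal_width"]))
```

After fitting, you could pass `model.tree` in place of `toy_tree` to read off the splits the algorithm chose.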

4.4 Example & Visualization

Test with Iris data:

from sklearn.datasets import load_iris

data = load_iris()
X = data.data[:, :2]  # First 2 features
y = data.target

model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)

# Plot decision boundary (similar to Logistic Regression example)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Decision Tree Boundary")
plt.show()

Output: A plot with regions separated by axis-aligned splits (the tree’s decision boundaries).


K-Means Clustering

5.1 Theory & Math

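K-Means is an unsupervised algorithm that groups data into ( K ) clusters. It tries to minimize the inertia (the sum of squared distances from samples to their cluster centroid). The procedure only guarantees a local minimum, so the result depends on the initial centroids.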
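
Inertia is simple to compute once every sample has a label. The sketch below defines a hypothetical helper `inertia` and compares two candidate clusterings of four made-up points.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared Euclidean distances from each sample to its assigned centroid."""
    diffs = X - centroids[labels]   # each row minus the centroid it is assigned to
    return float(np.sum(diffs ** 2))

# Four made-up points forming two tight pairs
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good = inertia(X, np.array([[0.0, 1.0], [10.0, 1.0]]), np.array([0, 0, 1, 1]))
bad = inertia(X, np.array([[5.0, 0.0], [5.0, 2.0]]), np.array([0, 1, 0, 1]))
print(f"centroids on the pairs: {good}, centroids between the pairs: {bad}")
```

Placing a centroid in the middle of each pair leaves every point 1 unit away (inertia 4), while centroids placed between the pairs leave every point 5 units away (inertia 100). Lower inertia means tighter clusters.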

Steps:

  1. Initialize Centroids: Randomly select ( K ) data points as initial centroids.
  2. Assign Clusters: Assign each sample to the nearest centroid (using Euclidean distance).
  3. Update Centroids: Compute new centroids as the mean of samples in each cluster.
  4. Repeat: Until centroids stop changing (convergence).

5.2 Implementation Steps

  1. Choose ( K ): Number of clusters (user-specified).
  2. Initialize Centroids: Randomly pick ( K ) samples from ( X ).
  3. Cluster Assignment: For each sample, compute distance to all centroids; assign to closest.
  4. Update Centroids: For each cluster, set centroid to the mean of its samples.
  5. Check Convergence: If centroids don’t change, stop. Else, repeat steps 3–4.
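
Steps 3 to 5 can be traced on a handful of points before wrapping them in a class. The sketch below uses made-up data with ( K = 2 ) and a deliberately poor start (both initial centroids come from the same group), printing the labels and centroids at each pass.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],    # a group near (1, 1)
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])   # a group near (8.5, 9)
centroids = X[[0, 1]].copy()   # deliberately poor start: both from the first group

for step in range(1, 11):
    # Step 3: squared distance from every sample to every centroid, shape (n, K)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned samples
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print(f"step {step}: labels={labels}, centroids={new_centroids.round(2).tolist()}")
    # Step 5: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

Even from the poor start, the second centroid is pulled toward the far group on the first update, and the assignments settle within a few passes. Squared distances are enough for the assignment step because taking the square root does not change which centroid is closest.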

5.3 Python Code & Explanation

class KMeans:
    def __init__(self, n_clusters=2, max_iter=100):
        self.n_clusters = n_clusters  # K
        self.max_iter = max_iter      # Max iterations
        self.centroids = None         # Cluster centers

    def fit(self, X):
        # Initialize centroids: random samples from X
        np.random.seed(42)
        self.centroids = X[np.random.choice(X.shape[0], self.n_clusters, replace=False)]

        for _ in range(self.max_iter):
            # Assign clusters: distance from each sample to centroids
            distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
            labels = np.argmin(distances, axis=0)  # Closest centroid

            # Update centroids: mean of each cluster (an empty cluster keeps its old centroid)
            new_centroids = np.array([
                X[labels == i].mean(axis=0) if np.any(labels == i) else self.centroids[i]
                for i in range(self.n_clusters)
            ])

            # Check convergence
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    def predict(self, X):
        # Assign new samples to clusters
        distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)

5.4 Example & Visualization

Test with synthetic blobs:

from sklearn.datasets import make_blobs

# Generate 3 clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit K-Means
model = KMeans(n_clusters=3, max_iter=100)
model.fit(X)
y_pred = model.predict(X)

# Plot
plt.scatter(X[:, 0], X[:,1], c=y_pred, cmap="viridis", label="Clusters")
plt.scatter(model.centroids[:, 0], model.centroids[:, 1], s=300, c="red", marker="X", label="Centroids")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

Output: Data points colored by cluster, with red X’s marking centroids. K-Means successfully groups the blobs!


Conclusion

Implementing ML algorithms from scratch transforms abstract concepts into tangible code. You’ve now built Linear Regression (regression), Logistic Regression (classification), Decision Trees (non-parametric learning), and K-Means (clustering), with NumPy doing all the numerical work. Scikit-learn and Matplotlib appeared only to load data, score results, and draw plots.

This foundation lets you:

  • Debug models when they fail (e.g., why Gradient Descent diverges).
  • Customize algorithms (e.g., add L2 regularization to Linear Regression).
  • Innovate (e.g., modify K-Means to use Manhattan distance).

Next steps: Explore Random Forests (ensembles of Decision Trees), or deep learning (neural networks from scratch)!

References

  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. O’Reilly Media.
  • James, G., et al. (2021). An Introduction to Statistical Learning. Springer.
  • Ng, A. (2018). Machine Learning Specialization. Coursera (Stanford).
  • Scikit-Learn Documentation: LinearRegression, LogisticRegression, DecisionTreeClassifier, KMeans.