Creational Design Patterns
Creational patterns focus on object creation mechanisms, ensuring objects are created in a way that aligns with the project’s requirements (e.g., limiting instances, standardizing creation logic).
1. Singleton Pattern
What is it?
The Singleton pattern ensures a class has only one instance and provides a global point of access to it. This is critical for resources that should not be duplicated (e.g., database connections, configuration managers, or logging services).
Why Data Science Needs It
In data science, you often work with expensive resources: a single database connection pool, a shared configuration file, or a trained model loaded into memory. Creating multiple instances of these resources wastes memory, slows down execution, or causes conflicts (e.g., concurrent writes to a file).
Python Example: Singleton for a Database Client
Suppose you’re building a data pipeline that reads from a PostgreSQL database. You want to ensure only one database client is created to avoid connection bloat.
import psycopg2
from psycopg2.extras import RealDictCursor


class DatabaseClient:
    _instance = None  # Class-level variable to store the single instance

    def __new__(cls, *args, **kwargs):
        # Create the instance and open the connection only on the first call;
        # arguments passed on later calls are ignored.
        if cls._instance is None:
            instance = super().__new__(cls)
            instance.connection = psycopg2.connect(
                dbname=kwargs.get("dbname"),
                user=kwargs.get("user"),
                password=kwargs.get("password"),
                host=kwargs.get("host"),
            )
            instance.cursor = instance.connection.cursor(cursor_factory=RealDictCursor)
            # Store the instance only after the connection succeeded, so a failed
            # connect does not leave a half-initialized singleton behind.
            cls._instance = instance
        return cls._instance

    def query(self, sql: str):
        """Execute a SQL query and return results as a list of dictionaries."""
        self.cursor.execute(sql)
        return self.cursor.fetchall()


# Usage
client1 = DatabaseClient(dbname="mydb", user="admin", password="secret", host="localhost")
client2 = DatabaseClient(dbname="mydb", user="admin", password="secret", host="localhost")
print(client1 is client2)  # Output: True (both are the same instance)
data = client1.query("SELECT * FROM sensor_data LIMIT 10;")
Key Takeaway
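Use Singleton for resources that should be globally unique and are expensive to create. Avoid overusing it: a Singleton is global state, so tests can leak into one another and mocking it is tricky. Note also that in the example above, arguments passed after the first call are silently ignored, and instance creation is not thread-safe.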
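The testing concern above can be eased by putting the "create once" logic behind a cached accessor function instead of inside the class. The following is a minimal sketch, assuming a hypothetical Settings class and get_settings function (neither is part of the example above); it relies only on functools.lru_cache from the standard library.

```python
from functools import lru_cache


class Settings:
    """Stand-in for an expensive shared resource (hypothetical)."""

    def __init__(self, env: str):
        self.env = env


@lru_cache(maxsize=None)
def get_settings(env: str = "dev") -> Settings:
    # The body runs once per distinct argument; later calls return the cached object.
    return Settings(env)


a = get_settings()
b = get_settings()
print(a is b)  # True: both names point at the same cached instance

# In a test, clearing the cache forces a fresh instance on the next call.
get_settings.cache_clear()
c = get_settings()
print(a is c)  # False: a new instance was built after the reset
```

Because Settings itself stays an ordinary class, a test can also construct its own instance directly without touching the shared one.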
2. Factory Method Pattern
What is it?
The Factory Method pattern defines an interface for creating objects but lets subclasses decide which class to instantiate. It delegates object creation to subclasses, promoting loose coupling and simplifying the addition of new object types.
Why Data Science Needs It
Data science workflows often involve multiple data sources (CSV, JSON, Parquet, APIs) or model types (classification, regression). The Factory Method centralizes object creation logic in one place, so adding a new source or type does not require changes to the code that consumes the objects.
Python Example: Data Loader Factory
Imagine you need to load data from CSV, JSON, or Parquet files. The example below uses a static factory function, a lightweight variant of the pattern, to create the appropriate loader based on the file extension.
from abc import ABC, abstractmethod

import pandas as pd


# Abstract Product: Defines the interface for data loaders
class DataLoader(ABC):
    @abstractmethod
    def load(self, file_path: str) -> pd.DataFrame:
        pass


# Concrete Products: Loaders for specific file types
class CSVLoader(DataLoader):
    def load(self, file_path: str) -> pd.DataFrame:
        return pd.read_csv(file_path)


class JSONLoader(DataLoader):
    def load(self, file_path: str) -> pd.DataFrame:
        return pd.read_json(file_path)


class ParquetLoader(DataLoader):
    def load(self, file_path: str) -> pd.DataFrame:
        return pd.read_parquet(file_path)


# Factory: Creates the appropriate loader based on file extension
class DataLoaderFactory:
    @staticmethod
    def get_loader(file_path: str) -> DataLoader:
        if file_path.endswith(".csv"):
            return CSVLoader()
        elif file_path.endswith(".json"):
            return JSONLoader()
        elif file_path.endswith(".parquet"):
            return ParquetLoader()
        else:
            raise ValueError(f"Unsupported file type for {file_path}")


# Usage
if __name__ == "__main__":
    csv_data = DataLoaderFactory.get_loader("data.csv").load("data.csv")
    json_data = DataLoaderFactory.get_loader("data.json").load("data.json")
    parquet_data = DataLoaderFactory.get_loader("data.parquet").load("data.parquet")
Key Takeaway
Use Factory Method when you need to create objects of related types and want to decouple the client from concrete classes. Adding a new loader (e.g., Excel) only requires a new ExcelLoader class and one new branch in the factory; existing loaders stay untouched.
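If editing the factory for every new format becomes a nuisance, one common alternative is a registry that loaders add themselves to. The sketch below is an illustration under stated assumptions, not part of the example above: the names register_parser, get_parser, and _PARSERS are hypothetical, and the parsers work on in-memory text with the standard library rather than on files with pandas.

```python
import csv
import io
import json
from pathlib import Path
from typing import Callable, Dict, List

# Maps a lowercase file extension to a parser function (hypothetical registry).
_PARSERS: Dict[str, Callable[[str], List[dict]]] = {}


def register_parser(extension: str):
    """Decorator that registers a parser for one file extension."""
    def decorator(func):
        _PARSERS[extension.lower()] = func
        return func
    return decorator


@register_parser(".csv")
def parse_csv(text: str) -> List[dict]:
    return list(csv.DictReader(io.StringIO(text)))


@register_parser(".json")
def parse_json(text: str) -> List[dict]:
    return json.loads(text)


def get_parser(file_path: str) -> Callable[[str], List[dict]]:
    """Look up the parser by extension; raise if none is registered."""
    suffix = Path(file_path).suffix.lower()
    try:
        return _PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type for {file_path}") from None


rows = get_parser("data.csv")("a,b\n1,2\n")
print(rows)
```

With this layout, supporting a new format means writing one decorated function; get_parser itself never changes.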
3. Builder Pattern
What is it?
The Builder pattern separates the construction of a complex object from its representation, allowing the same construction process to create different representations. It’s ideal for objects with many optional components or steps (e.g., data pipelines with configurable preprocessing steps).
Why Data Science Needs It
Data preprocessing pipelines are often complex and multi-step (e.g., scaling, encoding, imputation, feature selection). The Builder pattern lets you construct pipelines step-by-step, reusing steps across projects and ensuring consistency.
Python Example: Data Preprocessing Pipeline Builder
Build a pipeline with optional steps like handling missing values, scaling features, and encoding categorical variables.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class PreprocessingPipelineBuilder:
    def __init__(self):
        self.numeric_features = []
        self.categorical_features = []
        self.pipeline = None

    def with_numeric_features(self, features: list[str]) -> "PreprocessingPipelineBuilder":
        self.numeric_features = features
        return self  # Enable method chaining

    def with_categorical_features(self, features: list[str]) -> "PreprocessingPipelineBuilder":
        self.categorical_features = features
        return self

    def build(self) -> Pipeline:
        # Define preprocessing for numeric features
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
        ])
        # Define preprocessing for categorical features
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])
        # Combine preprocessing steps
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, self.numeric_features),
                ('cat', categorical_transformer, self.categorical_features),
            ])
        # Create the full pipeline
        self.pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
        return self.pipeline


# Usage
if __name__ == "__main__":
    # Sample data
    data = pd.DataFrame({
        "age": [25, None, 30, 35],
        "income": [50000, 60000, None, 75000],
        "city": ["NY", "CA", None, "TX"],
    })
    # Build pipeline with numeric (age, income) and categorical (city) features
    pipeline = (
        PreprocessingPipelineBuilder()
        .with_numeric_features(["age", "income"])
        .with_categorical_features(["city"])
        .build()
    )
    # Fit and transform the data
    processed_data = pipeline.fit_transform(data)
    # 4 rows; 2 scaled numeric columns plus one column per city category the encoder sees
    print("Processed data shape:", processed_data.shape)
Key Takeaway
Use Builder to construct complex objects with optional components. It improves readability (via method chaining) and flexibility (e.g., add/remove preprocessing steps without rewriting the entire pipeline).
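To make the "optional components" point concrete, here is a small standard-library sketch in which each builder method adds a step only when it is called. The class ColumnPipelineBuilder and its method names are hypothetical and operate on a plain list of numbers rather than on a scikit-learn pipeline.

```python
from statistics import mean, pstdev
from typing import Callable, List


class ColumnPipelineBuilder:
    """Collects optional steps, then builds one callable that applies them in order."""

    def __init__(self):
        self._steps: List[Callable[[list], list]] = []

    def with_mean_imputation(self) -> "ColumnPipelineBuilder":
        def impute(values):
            # Replace None with the mean of the known values.
            known = [v for v in values if v is not None]
            fill = mean(known)
            return [fill if v is None else v for v in values]
        self._steps.append(impute)
        return self

    def with_standard_scaling(self) -> "ColumnPipelineBuilder":
        def scale(values):
            # Population standard deviation; assumes the values are not all equal.
            mu, sigma = mean(values), pstdev(values)
            return [(v - mu) / sigma for v in values]
        self._steps.append(scale)
        return self

    def build(self) -> Callable[[list], list]:
        steps = list(self._steps)  # Freeze so later builder calls do not affect this pipeline

        def run(values):
            for step in steps:
                values = step(values)
            return values
        return run


impute_only = ColumnPipelineBuilder().with_mean_imputation().build()
print(impute_only([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]

impute_and_scale = (
    ColumnPipelineBuilder().with_mean_imputation().with_standard_scaling().build()
)
scaled = impute_and_scale([1.0, None, 3.0])
```

Leaving out a with_ call simply leaves out that step, which is the flexibility the pattern is meant to provide.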
Structural Design Patterns
Structural patterns focus on how objects and classes are composed to form larger structures, ensuring flexibility and efficiency.
1. Adapter Pattern
What is it?
The Adapter pattern converts the interface of a class into another interface that clients expect. It lets classes work together that couldn’t otherwise because of incompatible interfaces.
Why Data Science Needs It
Data scientists often work with disparate data sources (APIs, databases, legacy systems) that return data in incompatible formats (e.g., JSON from an API, XML from a legacy system). Adapters standardize these formats into a unified structure (e.g., pandas DataFrame).
Python Example: API Data Adapter
Suppose you fetch data from a weather API that returns JSON, but your pipeline expects a pandas DataFrame. An adapter can bridge this gap.
from abc import ABC, abstractmethod

import pandas as pd
import requests


# Legacy/External Interface: Returns JSON data
class WeatherAPI:
    @staticmethod
    def fetch_weather(city: str) -> dict:
        """Fetch weather data from an external API (returns JSON)."""
        # Illustrative URL only
        response = requests.get(f"https://api.weather.com/{city}", timeout=10)
        response.raise_for_status()  # Fail loudly on HTTP errors
        return response.json()  # Example response: {"city": "NY", "temp": 72, "humidity": 65}


# Target Interface: Expects a DataFrame
class DataFrameProvider(ABC):
    @abstractmethod
    def get_dataframe(self, city: str) -> pd.DataFrame:
        pass


# Adapter: Converts WeatherAPI's JSON output to a DataFrame
class WeatherAPIAdapter(DataFrameProvider):
    def __init__(self, api: WeatherAPI):
        self.api = api

    def get_dataframe(self, city: str) -> pd.DataFrame:
        json_data = self.api.fetch_weather(city)
        # Convert the JSON object to a one-row DataFrame
        return pd.DataFrame([json_data])


# Usage
if __name__ == "__main__":
    weather_api = WeatherAPI()
    adapter = WeatherAPIAdapter(weather_api)
    df = adapter.get_dataframe("NY")
    print(df)
    # Output (for the example response above):
    #   city  temp  humidity
    # 0   NY    72        65
Key Takeaway
Use Adapter to integrate legacy systems, external APIs, or third-party libraries with incompatible interfaces. It keeps your pipeline clean by hiding conversion logic behind a standard interface.
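The same idea applies to the XML-from-a-legacy-system case mentioned earlier. The sketch below uses only the standard library and returns a list of dicts instead of a DataFrame; LegacyXMLFeed, its export method, and the XML layout are all hypothetical stand-ins for whatever the real legacy system provides.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Dict, List


class RecordSource(ABC):
    """Target interface: every source yields a list of flat dicts."""

    @abstractmethod
    def records(self) -> List[Dict[str, str]]:
        ...


class LegacyXMLFeed:
    """Hypothetical legacy system that only speaks XML."""

    def export(self) -> str:
        return "<readings><r city='NY' temp='72'/><r city='TX' temp='95'/></readings>"


class LegacyXMLAdapter(RecordSource):
    """Adapter: wraps the XML feed and exposes it through RecordSource."""

    def __init__(self, feed: LegacyXMLFeed):
        self._feed = feed

    def records(self) -> List[Dict[str, str]]:
        root = ET.fromstring(self._feed.export())
        # Each <r> element's attributes become one record (values stay strings).
        return [dict(element.attrib) for element in root.iter("r")]


class JSONSource(RecordSource):
    """A source that already matches the target shape."""

    def __init__(self, payload: str):
        self._payload = payload

    def records(self) -> List[Dict[str, str]]:
        return json.loads(self._payload)


sources = [
    LegacyXMLAdapter(LegacyXMLFeed()),
    JSONSource('[{"city": "CA", "temp": "68"}]'),
]
all_records = [record for source in sources for record in source.records()]
print(all_records)
```

Downstream code iterates over sources without knowing which ones started life as XML.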
2. Decorator Pattern
What is it?
The Decorator pattern dynamically adds new behavior to objects without changing their structure. It wraps objects in “decorator” classes that add functionality before/after the object’s methods.
Why Data Science Needs It
Data science workflows often require cross-cutting concerns like logging, caching, or validation (e.g., logging the time taken for a preprocessing step, caching API responses to avoid redundant calls). Decorators add these concerns without cluttering the core logic.
Python Example: Timing Decorator for Preprocessing
Add logging to a preprocessing function to track execution time.
import time
from functools import wraps

import pandas as pd


# Decorator to log function execution time
def timing_decorator(func):
    @wraps(func)  # Preserve original function metadata
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()  # Monotonic clock suited to measuring durations
        result = func(*args, **kwargs)  # Execute the original function
        elapsed = time.perf_counter() - start_time
        print(f"Function {func.__name__} took {elapsed:.4f} seconds")
        return result
    return wrapper


# Core preprocessing function
@timing_decorator  # Apply the decorator
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and preprocess raw data."""
    df = df.dropna().copy()  # Copy so the caller's DataFrame is left untouched
    df["age"] = df["age"].astype(int)  # Note: truncates fractional ages (35.5 -> 35)
    return df


# Usage
if __name__ == "__main__":
    data = pd.DataFrame({
        "age": [25.0, None, 30.0, 35.5],
        "income": [50000, 60000, 70000, 80000],
    })
    processed_data = preprocess_data(data)
    # Example output (the exact time varies): Function preprocess_data took 0.0002 seconds
Key Takeaway
Use Decorator for adding reusable, modular functionality (logging, caching, validation) to functions or classes. Python’s functools.wraps ensures decorators don’t break function introspection (e.g., help(preprocess_data) still works).
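Caching, the other cross-cutting concern mentioned above, follows the same wrapper shape. This is a minimal sketch, assuming a hypothetical fetch_weather function that stands in for a slow API call; for real projects, functools.lru_cache from the standard library covers the same ground.

```python
from functools import wraps


def cached(func):
    """Cache results by positional arguments (arguments must be hashable)."""
    store = {}

    @wraps(func)
    def wrapper(*args):
        if args not in store:
            store[args] = func(*args)  # Only call through on a cache miss
        return store[args]

    wrapper.cache = store  # Exposed so callers can inspect or clear it
    return wrapper


calls = []  # Records every real invocation, for demonstration


@cached
def fetch_weather(city: str) -> dict:
    # Stand-in for a slow API call (hypothetical).
    calls.append(city)
    return {"city": city, "temp": 72}


fetch_weather("NY")
fetch_weather("NY")
print(len(calls))  # 1: the second call was served from the cache
```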
3. Composite Pattern
What is it?
The Composite pattern treats individual objects and collections of objects uniformly. It lets you compose objects into tree structures to represent part-whole hierarchies (e.g., an ensemble model with base models).
Why Data Science Needs It
Data science uses hierarchical structures like ensemble models (e.g., a Random Forest composed of decision trees) or nested data pipelines (e.g., a pipeline that includes sub-pipelines for text and numeric features). The Composite pattern simplifies interacting with these structures by treating them as single entities.
Python Example: Ensemble Model Composite
Treat individual models and ensembles uniformly, allowing prediction with a single predict method.
from abc import ABC, abstractmethod

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


# Component: Defines the interface for all models (leaf and composite)
class ModelComponent(ABC):
    @abstractmethod
    def predict(self, X: np.ndarray) -> np.ndarray:
        pass


# Leaf: Individual model (e.g., Logistic Regression)
class LogisticRegressionModel(ModelComponent):
    def __init__(self):
        self.model = LogisticRegression()

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.model.fit(X, y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict(X)


# Leaf: Another individual model (e.g., Random Forest)
class RandomForestModel(ModelComponent):
    def __init__(self):
        self.model = RandomForestClassifier()

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.model.fit(X, y)
        return self

    # Required: without predict, the abstract base class cannot be instantiated
    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict(X)


# Composite: Ensemble model containing multiple ModelComponents
class EnsembleModel(ModelComponent):
    def __init__(self):
        self.models = []

    def add_model(self, model: ModelComponent):
        self.models.append(model)

    def remove_model(self, model: ModelComponent):
        self.models.remove(model)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict by majority voting across all models."""
        predictions = [model.predict(X) for model in self.models]
        # Majority vote per sample (axis=0); labels must be non-negative integers,
        # and ties resolve to the lowest label
        return np.apply_along_axis(
            lambda x: np.bincount(x).argmax(), axis=0, arr=np.array(predictions)
        )


# Usage
if __name__ == "__main__":
    # Sample data (binary classification)
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([0, 0, 1, 1])
    # Create individual models
    lr = LogisticRegressionModel().fit(X, y)
    rf = RandomForestModel().fit(X, y)
    # Create ensemble and add models
    ensemble = EnsembleModel()
    ensemble.add_model(lr)
    ensemble.add_model(rf)
    # Predict with individual models and ensemble
    # (the random forest is unseeded, so its output can vary between runs)
    print("LR Predictions:", lr.predict(X))  # e.g. [0 0 1 1]
    print("RF Predictions:", rf.predict(X))  # e.g. [0 0 1 1]
    print("Ensemble Predictions:", ensemble.predict(X))  # e.g. [0 0 1 1] (majority vote)
Key Takeaway
Use Composite to work with hierarchical object structures (e.g., ensembles, nested pipelines). It simplifies code by allowing uniform treatment of individual objects and their compositions.
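The nested-pipeline case mentioned earlier has the same shape: a leaf step and a pipeline of steps both expose one method. The sketch below is a standard-library illustration with hypothetical names (PipelineComponent, Step, SubPipeline); it is not how scikit-learn implements its own pipelines.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class PipelineComponent(ABC):
    """Component interface shared by single steps and whole pipelines."""

    @abstractmethod
    def run(self, values: List[float]) -> List[float]:
        ...


class Step(PipelineComponent):
    """Leaf: applies one function to every value."""

    def __init__(self, func: Callable[[float], float]):
        self.func = func

    def run(self, values: List[float]) -> List[float]:
        return [self.func(v) for v in values]


class SubPipeline(PipelineComponent):
    """Composite: runs its children in order; children may themselves be pipelines."""

    def __init__(self, *children: PipelineComponent):
        self.children = list(children)

    def run(self, values: List[float]) -> List[float]:
        for child in self.children:
            values = child.run(values)
        return values


clean = SubPipeline(Step(abs), Step(lambda v: v + 1))
full = SubPipeline(clean, Step(lambda v: v * 2))  # A pipeline nested inside a pipeline
print(full.run([-3, 4]))  # [8, 10]
```

The caller invokes run the same way whether it holds a single step or an arbitrarily deep tree.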
Behavioral Design Patterns
Behavioral patterns focus on communication between objects, defining how objects interact and distribute responsibility.
1. Observer Pattern
What is it?
The Observer pattern defines a one-to-many dependency between objects: when one object (the “subject”) changes state, all its dependents (the “observers”) are notified and updated automatically.
Why Data Science Needs It
Data pipelines often involve event-driven workflows: e.g., notifying a monitoring system when a model finishes training, or triggering a downstream process when new data is ingested. Observers decouple the subject (e.g., a training job) from its dependents (e.g., a logger, a deployment service).
Python Example: Model Training Observer
Notify a logger and a deployment service when a model finishes training.
import time
from abc import ABC, abstractmethod


# Observer Interface: Defines the update method for dependents
class Observer(ABC):
    @abstractmethod
    def update(self, message: str):
        pass


# Concrete Observers: Logger and Deployment Service
class TrainingLogger(Observer):
    def update(self, message: str):
        print(f"[LOG] {message}")


class DeploymentService(Observer):
    def update(self, message: str):
        # Compare case-insensitively: the trainer sends "Training completed ..."
        if "training completed" in message.lower():
            print("[DEPLOY] Deploying model...")


# Subject: Model Trainer (notifies observers on state changes)
class ModelTrainer:
    def __init__(self):
        self.observers = []  # List of observers to notify

    def attach(self, observer: Observer):
        self.observers.append(observer)

    def detach(self, observer: Observer):
        self.observers.remove(observer)

    def notify(self, message: str):
        """Notify all observers with a message."""
        for observer in self.observers:
            observer.update(message)

    def train(self):
        """Simulate model training and notify observers."""
        self.notify("Training started...")
        # Simulate training work
        time.sleep(2)
        self.notify("Training completed successfully!")


# Usage
if __name__ == "__main__":
    trainer = ModelTrainer()
    logger = TrainingLogger()
    deployer = DeploymentService()
    # Attach observers
    trainer.attach(logger)
    trainer.attach(deployer)
    # Start training (triggers notifications)
    trainer.train()
    # Output:
    # [LOG] Training started...
    # [LOG] Training completed successfully!
    # [DEPLOY] Deploying model...
Key Takeaway
Use Observer for event-driven systems where objects need to react to state changes in another object. It promotes loose coupling (subject doesn’t need to know about specific observers) and scalability (easily add new observers).
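Matching on message text, as the deployment observer does, is brittle: a small wording change silently disables the observer. One alternative is to subscribe to named events. The sketch below assumes a hypothetical EventBus class and event names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """Subject that routes payloads to handlers subscribed to a named event."""

    def __init__(self):
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        # .get avoids creating empty entries for events nobody subscribed to.
        for handler in list(self._handlers.get(event, [])):
            handler(payload)


log = []
bus = EventBus()
bus.subscribe("training_started", lambda p: log.append(f"start {p['model']}"))
bus.subscribe("training_completed", lambda p: log.append(f"deploy {p['model']}"))

bus.publish("training_started", {"model": "rf-v1"})
bus.publish("training_completed", {"model": "rf-v1"})
bus.publish("unknown_event", {"model": "rf-v1"})  # No subscribers: nothing happens
print(log)  # ['start rf-v1', 'deploy rf-v1']
```

Handlers receive a structured payload rather than a sentence, so they no longer depend on how a message happens to be phrased.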
2. Strategy Pattern
What is it?
The Strategy pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable. It lets the algorithm vary independently from the clients that use it.
Why Data Science Needs It
Data science involves experimenting with multiple algorithms (e.g., different models for classification, or different metrics for evaluation). The Strategy pattern lets you swap algorithms dynamically without changing client code.
Python Example: Model Strategy for Classification
Encapsulate different classification algorithms (e.g., SVM, Random Forest) as strategies, allowing dynamic selection.
from abc import ABC, abstractmethod

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


# Strategy Interface: Defines the contract for all classification models
class ClassificationStrategy(ABC):
    @abstractmethod
    def train(self, X_train, y_train):
        pass

    @abstractmethod
    def predict(self, X):
        pass


# Concrete Strategies: Different classification algorithms
class SVMStrategy(ClassificationStrategy):
    def __init__(self, C=1.0):
        self.model = SVC(C=C)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X):
        return self.model.predict(X)


class RandomForestStrategy(ClassificationStrategy):
    def __init__(self, n_estimators=100):
        self.model = RandomForestClassifier(n_estimators=n_estimators)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X):
        return self.model.predict(X)


# Context: Uses a strategy to perform classification
class Classifier:
    def __init__(self, strategy: ClassificationStrategy):
        self.strategy = strategy

    def set_strategy(self, strategy: ClassificationStrategy):
        """Dynamically change the strategy."""
        self.strategy = strategy

    def train(self, X_train, y_train):
        self.strategy.train(X_train, y_train)

    def predict(self, X):
        return self.strategy.predict(X)


# Usage
if __name__ == "__main__":
    # Generate sample classification data
    X, y = make_classification(n_samples=100, n_features=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Initialize classifier with SVM strategy
    classifier = Classifier(SVMStrategy(C=0.5))
    classifier.train(X_train, y_train)
    print("SVM Predictions:", classifier.predict(X_test))  # Prints 20 predicted labels (0 or 1)
    # Switch to Random Forest strategy dynamically
    classifier.set_strategy(RandomForestStrategy(n_estimators=50))
    classifier.train(X_train, y_train)
    # Prints 20 predicted labels; the two models need not agree on every sample
    print("RF Predictions:", classifier.predict(X_test))
Key Takeaway
Use Strategy to encapsulate interchangeable algorithms (models, metrics, preprocessing steps). It simplifies experimentation and allows dynamic strategy changes (e.g., A/B testing models).
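Strategies do not have to be classes. In Python, plain functions are often enough, which suits the "different metrics for evaluation" case mentioned earlier. A minimal sketch, assuming hand-written accuracy and recall functions and a hypothetical evaluate helper (in practice you would likely use sklearn.metrics):

```python
from typing import Callable, Sequence

Metric = Callable[[Sequence[int], Sequence[int]], float]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives found; assumes at least one positive label."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for _, p in positives) / len(positives)


def evaluate(y_true: Sequence[int], y_pred: Sequence[int], metric: Metric) -> float:
    # The context: it knows nothing about the metric beyond its call signature.
    return metric(y_true, y_pred)


y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(evaluate(y_true, y_pred, accuracy))  # 0.75
print(evaluate(y_true, y_pred, recall))    # 0.6666666666666666
```

Swapping the metric is a matter of passing a different function; evaluate never changes.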
3. Command Pattern
What is it?
The Command pattern encapsulates a request as an object, allowing you to parameterize clients with different requests, queue or log requests, and support undoable operations.
Why Data Science Needs It
Data pipelines often involve reusable, queueable tasks: e.g., logging preprocessing steps for auditing, undoing a failed transformation, or queuing model training jobs. Commands encapsulate these tasks as objects, making them easier to manage.
Python Example: Preprocessing Command Queue
Encapsulate preprocessing steps as commands, allowing them to be queued, executed, and undone.
from abc import ABC, abstractmethod

import pandas as pd


# Command Interface: Defines execute and undo methods
class Command(ABC):
    @abstractmethod
    def execute(self, df: pd.DataFrame) -> pd.DataFrame:
        pass

    @abstractmethod
    def undo(self, df: pd.DataFrame) -> pd.DataFrame:
        pass


# Concrete Commands: Preprocessing steps
class DropNACommand(Command):
    def __init__(self):
        self.dropped_rows = None  # Store state for undo

    def execute(self, df: pd.DataFrame) -> pd.DataFrame:
        self.dropped_rows = df[df.isna().any(axis=1)]  # Save rows to restore later
        return df.dropna()

    def undo(self, df: pd.DataFrame) -> pd.DataFrame:
        """Restore dropped rows."""
        return pd.concat([df, self.dropped_rows]).sort_index()


class ScaleCommand(Command):
    def __init__(self, column: str):
        self.column = column
        self.original_mean = None  # Store state for undo
        self.original_std = None

    def execute(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()  # Work on a copy so the input DataFrame is not mutated
        self.original_mean = df[self.column].mean()
        self.original_std = df[self.column].std()  # Sample standard deviation (ddof=1)
        # Standard scaling: (x - mean) / std
        df[self.column] = (df[self.column] - self.original_mean) / self.original_std
        return df

    def undo(self, df: pd.DataFrame) -> pd.DataFrame:
        """Restore original scale."""
        df = df.copy()
        df[self.column] = (df[self.column] * self.original_std) + self.original_mean
        return df


# Invoker: Manages a queue of commands
class CommandInvoker:
    def __init__(self):
        self.command_queue = []
        self.history = []  # To track executed commands for undo

    def add_command(self, command: Command):
        self.command_queue.append(command)

    def execute_commands(self, df: pd.DataFrame) -> pd.DataFrame:
        """Execute all queued commands and track history."""
        current_df = df.copy()
        for command in self.command_queue:
            current_df = command.execute(current_df)
            self.history.append(command)
        self.command_queue = []  # Clear queue after execution
        return current_df

    def undo_last_command(self, df: pd.DataFrame) -> pd.DataFrame:
        """Undo the last executed command."""
        if not self.history:
            raise ValueError("No commands to undo.")
        last_command = self.history.pop()
        return last_command.undo(df)


# Usage
if __name__ == "__main__":
    # Sample data
    data = pd.DataFrame({
        "age": [25, None, 30, 35],
        "income": [50000, 60000, 70000, 80000],
    })
    # Create commands
    drop_na = DropNACommand()
    scale_income = ScaleCommand(column="income")
    # Queue and execute commands
    invoker = CommandInvoker()
    invoker.add_command(drop_na)
    invoker.add_command(scale_income)
    processed_data = invoker.execute_commands(data)
    print("After execution:")
    print(processed_data)
    # Output (mean and std are computed on the three remaining rows):
    #     age    income
    # 0  25.0 -1.091089
    # 2  30.0  0.218218
    # 3  35.0  0.872872
    # Undo scaling
    data_after_undo = invoker.undo_last_command(processed_data)
    print("\nAfter undoing scale:")
    print(data_after_undo)
    # Output:
    #     age   income
    # 0  25.0  50000.0
    # 2  30.0  70000.0
    # 3  35.0  80000.0
    # Undo dropping NA
    data_after_undo_drop = invoker.undo_last_command(data_after_undo)
    print("\nAfter undoing drop NA:")
    print(data_after_undo_drop)
    # Output (original data with NaN restored):
    #     age   income
    # 0  25.0  50000.0
    # 1   NaN  60000.0
    # 2  30.0  70000.0
    # 3  35.0  80000.0
Key Takeaway
Use Command to encapsulate tasks as objects, enabling queuing, logging, and undo/redo functionality. It’s ideal for complex workflows where traceability and reversibility are important (e.g., regulatory-compliant pipelines).
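For the traceability side, the invoker is a natural place to keep an audit trail. Below is a minimal standard-library sketch; AddConstant and AuditedInvoker are hypothetical names, and the "data" is a plain list of numbers rather than a DataFrame.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AddConstant:
    """Command: shifts every value by a fixed amount; undo shifts it back."""
    amount: float

    def execute(self, values: List[float]) -> List[float]:
        return [v + self.amount for v in values]

    def undo(self, values: List[float]) -> List[float]:
        return [v - self.amount for v in values]


@dataclass
class AuditedInvoker:
    """Invoker that records every execute and undo in a text log."""
    history: List[AddConstant] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def run(self, command: AddConstant, values: List[float]) -> List[float]:
        result = command.execute(values)
        self.history.append(command)
        self.audit_log.append(f"execute {command!r}")
        return result

    def undo(self, values: List[float]) -> List[float]:
        if not self.history:
            raise ValueError("No commands to undo.")
        command = self.history.pop()
        self.audit_log.append(f"undo {command!r}")
        return command.undo(values)


invoker = AuditedInvoker()
shifted = invoker.run(AddConstant(10), [1.0, 2.0])
restored = invoker.undo(shifted)
print(shifted)            # [11.0, 12.0]
print(restored)           # [1.0, 2.0]
print(invoker.audit_log)  # ['execute AddConstant(amount=10)', 'undo AddConstant(amount=10)']
```

Because each command is a dataclass, its repr carries the parameters, so the log shows what was run and with which settings.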
Anti-Patterns in Data Science
Anti-patterns are common solutions that seem effective but lead to problems in the long run. Avoid these in data science projects:
1. Spaghetti Code
What it is: Unstructured, tangled code with no clear organization (e.g., a single Jupyter notebook with thousands of lines, global variables, and duplicated logic).
Why it’s bad: Hard to debug, test, or extend. New team members struggle to understand the code.
Solution: Break code into functions/classes, use design patterns (e.g., Factory, Builder) for structure, and modularize components (data loading, preprocessing, modeling).
2. God Object
What it is: A single class or function that does everything (e.g., a DataProcessor class with methods for loading, cleaning, modeling, and visualization).
Why it’s bad: Violates the Single Responsibility Principle (SRP), making the code rigid and hard to maintain. Changes to one part risk breaking others.
Solution: Split responsibilities into separate classes (e.g., DataLoader, Preprocessor, ModelTrainer) and use patterns like Strategy or Composite to coordinate them.
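As a rough illustration of that split, the sketch below gives loading, cleaning, and summarizing each their own small class and wires them together in one function. All names (Loader, Cleaner, Summarizer, run) are hypothetical, and the CSV text is made-up sample data.

```python
import csv
import io
from statistics import mean
from typing import Dict, List

Row = Dict[str, str]


class Loader:
    """Only knows how to turn CSV text into rows."""

    def load(self, text: str) -> List[Row]:
        return list(csv.DictReader(io.StringIO(text)))


class Cleaner:
    """Only knows how to drop rows with a missing income."""

    def clean(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if row["income"] != ""]


class Summarizer:
    """Only knows how to compute the mean income."""

    def summarize(self, rows: List[Row]) -> float:
        return mean(float(row["income"]) for row in rows)


def run(text: str, loader: Loader, cleaner: Cleaner, summarizer: Summarizer) -> float:
    # Coordination lives here; each collaborator can be swapped or tested alone.
    return summarizer.summarize(cleaner.clean(loader.load(text)))


raw = "age,income\n25,50000\n30,\n35,70000\n"
print(run(raw, Loader(), Cleaner(), Summarizer()))  # 60000.0
```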
3. Premature Optimization
What it is: Optimizing code for performance before identifying bottlenecks (e.g., spending days optimizing a preprocessing step that takes 1% of total runtime).
Why it’s bad: Wastes time and complicates code. Prematurely optimized code is often harder to read and modify.
Solution: Profile first (use cProfile, line_profiler) to identify slow parts, then optimize strategically. Prioritize readability and correctness first.
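A quick way to profile first is the standard library's cProfile together with pstats. The sketch below uses two made-up functions, slow_step and fast_step, to stand in for pipeline stages; the timings it prints will differ from machine to machine.

```python
import cProfile
import io
import pstats


def slow_step(n: int) -> int:
    # Deliberately heavy: sums squares in a Python-level loop.
    return sum(i * i for i in range(n))


def fast_step(n: int) -> int:
    return n * 2


def pipeline() -> None:
    slow_step(200_000)
    fast_step(10)


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Sort by cumulative time and show the top five entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the report, slow_step should dominate the cumulative time, which tells you where optimization effort would actually pay off.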
Conclusion
Design patterns are not just for software engineers—they are powerful tools for data scientists to write clean, scalable, and maintainable code. By adopting patterns like Singleton (for shared resources), Factory Method (for data loaders), Decorator (for cross-cutting concerns), and Strategy (for algorithm experimentation), you can transform messy scripts into robust pipelines.
Remember: design patterns are guidelines, not rules. Choose patterns that solve your specific problem, and avoid over-engineering. Start with simple solutions, then refactor to use patterns as your project grows.
By integrating these patterns into your workflow, you’ll reduce technical debt, improve collaboration, and build data science systems that stand the test of time.