Python Standard Library for Data Science: A Practical Guide

When most people think of Python for data science, libraries like Pandas, NumPy, or Scikit-learn come to mind. These tools are powerful, but they often require installation and can add bloat to lightweight projects. What if you need to analyze data without external dependencies? Or prototype a solution quickly without waiting for package installations?

Enter Python’s **Standard Library**: a robust collection of modules built into Python, requiring no extra setup. The Standard Library is Python’s "batteries-included" ecosystem, offering tools for file handling, data parsing, numerical computation, and more. While it lacks the advanced features of specialized libraries, it provides a foundation for data science workflows, especially for small-to-medium datasets or scripts where minimalism is key.

In this guide, we’ll explore the most useful Standard Library modules for data science, with practical examples to help you integrate them into your workflows. Whether you’re cleaning data, parsing files, or prototyping analyses, these tools will become indispensable.

Table of Contents

  1. File Handling: os & pathlib
  2. Data Input/Output: csv & json
  3. Time Series: datetime
  4. Advanced Data Structures: collections
  5. Efficient Iteration: itertools
  6. Numerical Computation: math & statistics
  7. Scripting & Automation: sys & argparse
  8. Debugging & Monitoring: logging
  9. Practical Example: A Mini Data Pipeline
  10. Conclusion
  11. References

1. File Handling: os & pathlib

Before analyzing data, you’ll often need to interact with files and directories (e.g., locating datasets, creating output folders). The os module and its modern counterpart pathlib simplify this.

os: Operating System Interactions

os provides functions to navigate directories, check file existence, and modify paths.

Key Functions:

  • os.getcwd(): Get current working directory.
  • os.listdir(path): List files in a directory.
  • os.path.exists(path): Check if a path exists.
  • os.makedirs(path): Create nested directories.

Example: Locate and validate a dataset path.

import os

# Get current directory
current_dir = os.getcwd()
print(f"Current Directory: {current_dir}")

# Define dataset path
data_path = os.path.join(current_dir, "data", "raw", "sales.csv")

# Check if file exists
if os.path.exists(data_path):
    print(f"Dataset found: {data_path}")
else:
    print(f"Error: Dataset not found at {data_path}")
    # Create directory if missing
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    print(f"Created missing directories: {os.path.dirname(data_path)}")

pathlib: Object-Oriented Paths

pathlib (Python 3.4+) replaces string-based paths with objects, making code cleaner and more readable.

Key Methods:

  • Path.cwd(): Get current directory (returns a Path object).
  • Path.glob(pattern): Search for files matching a pattern (e.g., *.csv).
  • Path.exists(): Check if the path exists.
  • Path.mkdir(parents=True, exist_ok=True): Create directories (with parents).

Example: Using pathlib to find CSV files in a folder.

from pathlib import Path

# Define data directory
data_dir = Path("data/raw")

# Find all CSV files
csv_files = list(data_dir.glob("*.csv"))
print(f"Found {len(csv_files)} CSV files:")
for file in csv_files:
    print(f"- {file.name}")  # Access filename via .name attribute

When to Use: Use pathlib for new projects (cleaner syntax) and os for compatibility with older code.
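
To make the comparison concrete, here is a minimal side-by-side sketch that builds the same output path with both modules and creates its parent folder. The `reports/2023` folder and the `summary.csv` filename are made-up names for illustration, and a temporary directory stands in for a real project root.

```python
import os
import tempfile
from pathlib import Path

# A temporary directory stands in for a real project root (assumption).
with tempfile.TemporaryDirectory() as root:
    # os style: paths are plain strings joined with os.path.join
    os_path = os.path.join(root, "reports", "2023", "summary.csv")
    os.makedirs(os.path.dirname(os_path), exist_ok=True)

    # pathlib style: the / operator joins path components
    pl_path = Path(root) / "reports" / "2023" / "summary.csv"
    pl_path.parent.mkdir(parents=True, exist_ok=True)

    # Both approaches point at the same location
    same_location = Path(os_path) == pl_path
    parent_exists = pl_path.parent.exists()
    print(same_location, parent_exists)

    # Path objects expose common pieces as attributes
    print(pl_path.name, pl_path.stem, pl_path.suffix)
```

Both styles interoperate: most Standard Library functions that take a path, including open(), accept a Path object as well as a string.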

2. Data Input/Output: csv & json

Most data science workflows start with loading data. The csv and json modules handle two of the most common formats.

csv: Comma-Separated Values

The csv module parses and writes CSV files, avoiding manual string splitting (which fails with quoted commas).
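
The quoted-comma problem is easy to reproduce. The sketch below uses a small in-memory CSV (invented sample data) to compare naive splitting with csv.reader.

```python
import csv
import io

# One header row and one record whose product name contains a comma
raw = 'product,revenue\n"Laptop, 15-inch",999.99\n'

# Naive splitting cuts the quoted product name in half (3 fields instead of 2)
naive_fields = raw.splitlines()[1].split(",")

# csv.reader respects the quotes and returns the expected 2 fields
rows = list(csv.reader(io.StringIO(raw)))
parsed_fields = rows[1]

print(naive_fields)   # ['"Laptop', ' 15-inch"', '999.99']
print(parsed_fields)  # ['Laptop, 15-inch', '999.99']
```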

Key Classes/Functions:

  • csv.reader(file): Read CSV as rows (lists).
  • csv.DictReader(file): Read CSV as rows (dictionaries, with headers as keys).
  • csv.writer(file): Write rows to CSV.
  • csv.DictWriter(file, fieldnames): Write dictionaries to CSV with specified headers.

Example: Load a CSV into a list of dictionaries.

import csv
from pathlib import Path

data_path = Path("data/raw/sales.csv")

# Load CSV with DictReader
with open(data_path, "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # Uses first row as headers
    sales_data = list(reader)  # Convert to list of dicts

# Inspect first row
print("First row:", sales_data[0])
# Example output: {'date': '2023-01-01', 'product': 'Laptop', 'revenue': '999.99'}

Example: Write filtered data to a new CSV.

# Filter rows where revenue > 500
high_revenue = [row for row in sales_data if float(row["revenue"]) > 500]

# Write to CSV with DictWriter
output_path = Path("data/processed/high_revenue.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure the output folder exists
with open(output_path, "w", newline="", encoding="utf-8") as f:
    fieldnames = ["date", "product", "revenue"]  # Explicit headers
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()  # Write headers
    writer.writerows(high_revenue)  # Write all rows

print(f"Saved {len(high_revenue)} rows to {output_path}")

json: JavaScript Object Notation

JSON is ideal for nested or semi-structured data (e.g., API responses). The json module serializes/deserializes Python objects.

Key Functions:

  • json.load(file): Load JSON from a file into a Python dict/list.
  • json.loads(string): Load JSON from a string.
  • json.dump(obj, file): Write Python object to a JSON file.
  • json.dumps(obj): Convert Python object to a JSON string.

Example: Load and parse nested JSON data.

import json
from pathlib import Path

# Load JSON data
json_path = Path("data/raw/user_sessions.json")
with open(json_path, "r", encoding="utf-8") as f:
    sessions = json.load(f)  # List of session dicts

# Extract user IDs and session durations
user_sessions = []
for session in sessions:
    user_sessions.append({
        "user_id": session["user"]["id"],
        "duration_minutes": session["duration"] / 60  # Convert seconds to minutes
    })

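# Save cleaned data
output_path = Path("data/processed/user_sessions_clean.json")
output_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure the output folder exists
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(user_sessions, f, indent=2)  # indent for readability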
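
The string-based functions json.loads and json.dumps work the same way without touching the file system, which is useful for API responses. The sketch below uses an invented payload; the field names are assumptions for illustration only.

```python
import json

# Hypothetical API payload received as text
payload = '{"user": {"id": 42, "tags": ["new", "mobile"]}, "duration": 180}'

session = json.loads(payload)       # str -> dict
minutes = session["duration"] / 60  # Convert seconds to minutes

# dict -> str; sort_keys gives a stable key order
summary = json.dumps(
    {"user_id": session["user"]["id"], "minutes": minutes},
    sort_keys=True,
)
print(summary)  # {"minutes": 3.0, "user_id": 42}

# Malformed input raises json.JSONDecodeError (a subclass of ValueError)
try:
    json.loads("{not valid json}")
    decode_failed = False
except json.JSONDecodeError:
    decode_failed = True
print(decode_failed)  # True
```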

3. Time Series: datetime

Time series data (e.g., sales dates, sensor timestamps) requires parsing, formatting, and arithmetic. The datetime module handles this with datetime, date, time, and timedelta classes.

Key Classes/Functions:

  • datetime.datetime(year, month, day, hour, minute, second): Represents a timestamp.
  • datetime.strptime(date_str, format): Parse a string into a datetime object (string parse time).
  • datetime.strftime(format): Format a datetime object into a string (string format time).
  • timedelta(days, hours, ...): Represents a time interval.

Example: Parse dates and calculate time differences.

from datetime import datetime, timedelta

# Sample date strings from sales_data (from earlier CSV example)
date_str = sales_data[0]["date"]  # '2023-01-01'

# Parse string to datetime object
date_obj = datetime.strptime(date_str, "%Y-%m-%d")  # %Y=4-digit year, %m=2-digit month, %d=2-digit day
print(f"Parsed date: {date_obj} (type: {type(date_obj)})")  # 2023-01-01 00:00:00 (type: <class 'datetime.datetime'>)

# Calculate days since date
today = datetime.today()
days_since = (today - date_obj).days  # timedelta object's .days attribute
print(f"Days since {date_str}: {days_since}")

# Format datetime back to string (e.g., "Jan 01, 2023")
formatted_date = date_obj.strftime("%b %d, %Y")  # %b=abbreviated month, %d=day, %Y=year
print(f"Formatted date: {formatted_date}")  # Jan 01, 2023
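
timedelta objects can also be constructed directly and added to a datetime. The sketch below uses made-up order and shipping timestamps to show both directions of the arithmetic.

```python
from datetime import datetime, timedelta

order_date = datetime.strptime("2023-01-01", "%Y-%m-%d")

# Adding a timedelta shifts a datetime forward
due_date = order_date + timedelta(days=30)

# Subtracting two datetimes yields a timedelta
shipped = datetime(2023, 1, 3, 12, 0)  # Hypothetical shipping timestamp
lead_time = shipped - order_date

# .days drops the partial day; total_seconds() keeps it
lead_days = lead_time.days
lead_hours = lead_time.total_seconds() / 3600

print(due_date.date())  # 2023-01-31
print(lead_days)        # 2
print(lead_hours)       # 60.0
```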

4. Advanced Data Structures: collections

The collections module extends Python’s built-in data structures with tools for common tasks like counting, grouping, and efficient lookups.

Key Structures:

  • defaultdict: Dictionary with default values for missing keys (avoids KeyError).
  • Counter: Counts hashable objects (e.g., categorical data).
  • deque: Double-ended queue for fast appends/pops from both ends.
  • namedtuple: Lightweight class for immutable data (e.g., rows with named fields).

Example 1: Counter for Categorical Data
Count product frequencies in sales data:

from collections import Counter

# Extract product names from sales_data (list of dicts)
products = [row["product"] for row in sales_data]

# Count occurrences
product_counts = Counter(products)

# Most common products
print("Top 3 products:")
for product, count in product_counts.most_common(3):
    print(f"- {product}: {count} sales")
# Example output:
# - Laptop: 150 sales
# - Phone: 120 sales
# - Tablet: 80 sales

Example 2: defaultdict for Grouping
Group sales by product category:

from collections import defaultdict

# Sample product categories (could load from a lookup file)
categories = {
    "Laptop": "Electronics",
    "Phone": "Electronics",
    "Tablet": "Electronics",
    "Desk": "Furniture"
}

# Group sales by category using defaultdict(list)
sales_by_category = defaultdict(list)
for row in sales_data:
    category = categories.get(row["product"], "Other")  # Default to "Other"
    sales_by_category[category].append(float(row["revenue"]))

# Calculate total revenue per category
total_revenue = {cat: sum(revenues) for cat, revenues in sales_by_category.items()}
print("Revenue by category:", total_revenue)
# Example output: {'Electronics': 349999.75, 'Furniture': 4999.95}
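
The other two structures in the list, namedtuple and deque, can be sketched together. The example below is illustrative only: the Sale record type and its rows are invented, and the three-row window size is an arbitrary choice.

```python
from collections import deque, namedtuple

# A lightweight, immutable record type with named fields
Sale = namedtuple("Sale", ["date", "product", "revenue"])

sales = [
    Sale("2023-01-01", "Laptop", 1000.0),
    Sale("2023-01-02", "Phone", 500.0),
    Sale("2023-01-03", "Tablet", 300.0),
    Sale("2023-01-04", "Desk", 100.0),
]

# deque(maxlen=3) silently discards the oldest value once it is full,
# which makes it a convenient sliding window
window = deque(maxlen=3)
moving_averages = []
for sale in sales:
    window.append(sale.revenue)  # Access the field by name, not by index
    if len(window) == window.maxlen:
        moving_averages.append(sum(window) / len(window))

print(moving_averages)  # [600.0, 300.0]
```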

5. Efficient Iteration: itertools

For large datasets, building intermediate lists just to loop over them wastes memory. The itertools module provides lazy, memory-efficient tools to generate and manipulate iterables: values are produced one at a time instead of being stored all at once.

Key Functions:

  • itertools.chain(*iterables): Combine multiple iterables into one.
  • itertools.groupby(iterable, key): Group consecutive items that share a key (sort by the key first to get behavior like SQL’s GROUP BY).
  • itertools.product(*iterables): Cartesian product of input iterables (e.g., feature combinations).

Example 1: chain for Combining Data Sources
Process two CSV files as one stream without loading either file fully into memory:

import itertools
import csv
from pathlib import Path

# Define file paths
file1 = Path("data/raw/sales_2023.csv")
file2 = Path("data/raw/sales_2024.csv")

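# Open both files and chain readers (newline="" is recommended by the csv docs)
with open(file1, "r", newline="", encoding="utf-8") as f1, \
     open(file2, "r", newline="", encoding="utf-8") as f2: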
    reader1 = csv.DictReader(f1)
    reader2 = csv.DictReader(f2)
    combined_reader = itertools.chain(reader1, reader2)  # Lazy evaluation

    # Process row-by-row (memory-efficient)
    for row_num, row in enumerate(combined_reader, 1):
        if row_num % 1000 == 0:  # Progress update
            print(f"Processed {row_num} rows...")

Example 2: product for Feature Engineering
Generate all combinations of categorical features (e.g., for a grid search):

from itertools import product

# Categorical features
regions = ["North", "South", "East", "West"]
segments = ["Retail", "Corporate"]

# Generate all combinations
feature_combinations = list(product(regions, segments))
print("Feature combinations:", feature_combinations)
# Output: [('North', 'Retail'), ('North', 'Corporate'), ..., ('West', 'Corporate')]
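
groupby deserves its own sketch because it only groups adjacent items, so unsorted input produces split groups. The rows below are invented sample data.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"product": "Laptop", "revenue": 1000.0},
    {"product": "Phone", "revenue": 500.0},
    {"product": "Laptop", "revenue": 1200.0},
]

# Sort by the grouping key first, then group and sum
rows_sorted = sorted(rows, key=itemgetter("product"))
totals = {
    product: sum(r["revenue"] for r in group)
    for product, group in groupby(rows_sorted, key=itemgetter("product"))
}
print(totals)  # {'Laptop': 2200.0, 'Phone': 500.0}

# Without sorting, "Laptop" shows up as two separate groups
unsorted_keys = [key for key, _ in groupby(rows, key=itemgetter("product"))]
print(unsorted_keys)  # ['Laptop', 'Phone', 'Laptop']
```

When the data cannot be sorted cheaply, a defaultdict-based grouping avoids the sort at the cost of holding all groups in memory.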

6. Numerical Computation: math & statistics

For basic numerical tasks, the math (low-level math) and statistics (descriptive stats) modules can stand in for NumPy.

math: Basic Arithmetic & Trigonometry

Includes functions like sqrt, log, sin, and constants like pi.

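Example: Calculate profit margin as a percentage, (revenue - cost) / revenue * 100:

import math

def profit_margin(revenue, cost):
    # Guard against division by zero (isclose is safer than == for floats)
    if math.isclose(revenue, 0.0, abs_tol=1e-9):
        return 0.0
    margin = (revenue - cost) / revenue
    return round(margin * 100, 2)  # round() is a built-in; math has no round function

print(profit_margin(999.99, 500.00))  # Output: 50.0 (about a 50% margin)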
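
A few of the functions mentioned above can be sketched on small, invented numbers: sqrt for a root-mean-square error, log10 for a log transform, and isclose for tolerant float comparison. The error values are hypothetical.

```python
import math

# Hypothetical forecast errors (actual minus predicted)
errors = [3.0, -4.0]

# Root-mean-square error: square, average, then take the square root
rmse = math.sqrt(math.fsum(e * e for e in errors) / len(errors))

# Log-transform a skewed value such as revenue
log_revenue = math.log10(1000.0)

# Floats rarely compare exactly, so compare with a tolerance instead
exact_match = (0.1 + 0.2 == 0.3)
close_match = math.isclose(0.1 + 0.2, 0.3)

print(round(rmse, 4), log_revenue)
print(exact_match, close_match)  # False True
```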

statistics: Descriptive Statistics

Compute mean, median, standard deviation, and more on numerical data.

Example: Descriptive stats for revenue:

import statistics

# Extract revenues as floats
revenues = [float(row["revenue"]) for row in sales_data]

# Compute stats
mean_rev = statistics.mean(revenues)
median_rev = statistics.median(revenues)
stdev_rev = statistics.stdev(revenues)  # Sample standard deviation

print(f"Revenue stats:\nMean: {mean_rev:.2f}\nMedian: {median_rev:.2f}\nStdev: {stdev_rev:.2f}")
# Example output:
# Mean: 750.50
# Median: 699.99
# Stdev: 230.25
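
To see why it helps to report more than the mean, the sketch below runs the same functions on a tiny invented sample with one outlier, and contrasts the sample (stdev) and population (pstdev) standard deviations.

```python
import statistics

# Invented revenues; the last value is a deliberate outlier
revenues = [100.0, 200.0, 200.0, 300.0, 1200.0]

mean_rev = statistics.mean(revenues)      # Pulled upward by the outlier
median_rev = statistics.median(revenues)  # Robust to the outlier
mode_rev = statistics.mode(revenues)      # Most frequent value

# stdev divides by n - 1 (sample); pstdev divides by n (population)
sample_sd = statistics.stdev(revenues)
population_sd = statistics.pstdev(revenues)

print(mean_rev, median_rev, mode_rev)  # 400.0 200.0 200.0
print(round(sample_sd, 2), round(population_sd, 2))
```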

7. Scripting & Automation: sys & argparse

To turn data processing into reusable tools, use sys (access command-line arguments) and argparse (parse arguments cleanly).

sys: System-Specific Parameters

sys.argv is a list of the command-line arguments passed to the script; sys.argv[0] is the script name itself, so user-supplied arguments start at index 1.

Example: Simple script with sys.argv

import sys
from pathlib import Path

def main():
    # Check if input path is provided
    if len(sys.argv) != 2:
        print("Usage: python process_data.py <input_file>")
        sys.exit(1)  # Exit with error code 1

    input_path = Path(sys.argv[1])
    if not input_path.exists():
        print(f"Error: File {input_path} not found.")
        sys.exit(1)

    print(f"Processing data from: {input_path}")
    # Add data processing logic here...

if __name__ == "__main__":
    main()  # Run when script is executed directly

argparse: Structured Argument Parsing

For complex scripts, argparse adds help messages, type checking, and optional arguments.

Example: Script with argparse

import argparse
from pathlib import Path

def main():
    parser = argparse.ArgumentParser(description="Process sales data.")
    # Required input path
    parser.add_argument("input", type=Path, help="Path to input CSV file")
    # Optional output path (default: ./output.csv)
    parser.add_argument("-o", "--output", type=Path, default=Path("output.csv"), 
                        help="Path to output CSV file (default: output.csv)")
    # Flag to enable verbose mode
    parser.add_argument("-v", "--verbose", action="store_true", help="Print detailed logs")

    args = parser.parse_args()  # Parse arguments

    # Validate input
    if not args.input.exists():
        parser.error(f"Input file not found: {args.input}")

    if args.verbose:
        print(f"Starting processing...\nInput: {args.input}\nOutput: {args.output}")
    # Add processing logic here...

if __name__ == "__main__":
    main()

Run with: python process_data.py data/raw/sales.csv -o data/processed/output.csv -v
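
The same parser can be exercised without a shell by passing a list to parse_args, which is handy in tests and notebooks. In the sketch below, build_parser is a hypothetical helper that mirrors the arguments defined above.

```python
import argparse
from pathlib import Path

def build_parser():
    # Hypothetical helper that rebuilds the parser from the script above
    parser = argparse.ArgumentParser(description="Process sales data.")
    parser.add_argument("input", type=Path, help="Path to input CSV file")
    parser.add_argument("-o", "--output", type=Path, default=Path("output.csv"),
                        help="Path to output CSV file (default: output.csv)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print detailed logs")
    return parser

# An explicit list is parsed instead of sys.argv[1:]
args = build_parser().parse_args(["data/raw/sales.csv", "-v"])

print(args.input)    # Converted to a Path by type=Path
print(args.output)   # Falls back to the default
print(args.verbose)  # True because -v was passed
```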

8. Debugging & Monitoring: logging

The logging module tracks script execution, making it easier to debug failures (e.g., missing files, data errors).

Key Features:

  • Log levels: DEBUG (detailed), INFO (progress), WARNING, ERROR, CRITICAL.
  • Log to files, console, or external services.
  • Format logs with timestamps, module names, and messages.

Example: Logging a Data Pipeline

import csv
import logging
from pathlib import Path

# Configure logging to file and console
logging.basicConfig(
    level=logging.INFO,  # Capture INFO and above
    format="%(asctime)s - %(levelname)s - %(message)s",  # Include timestamp and level
    handlers=[
        logging.FileHandler("data_pipeline.log"),  # Log to file
        logging.StreamHandler()  # Also log to console
    ]
)

def load_data(path):
    logging.info(f"Loading data from {path}")
    if not path.exists():
        logging.error(f"File not found: {path}")
        raise FileNotFoundError(f"Missing file: {path}")
    # Read every row into a list of dicts
    with open(path, "r", newline="", encoding="utf-8") as f:
        data = list(csv.DictReader(f))
    logging.debug(f"Loaded {len(data)} rows")  # DEBUG-level (only shown if level=DEBUG)
    return data

# Run pipeline
try:
    data_path = Path("data/raw/sales.csv")
    data = load_data(data_path)
    logging.info("Data loaded successfully")
    # Add processing steps...
except Exception as e:
    logging.critical(f"Pipeline failed: {str(e)}", exc_info=True)  # Log traceback

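Logs will appear in data_pipeline.log and the console. The default timestamp format includes milliseconds, so the output looks like this (timestamps are illustrative):

2023-10-01 14:30:00,000 - INFO - Loading data from data/raw/sales.csv
2023-10-01 14:30:02,000 - INFO - Data loaded successfully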
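
The effect of log levels can be checked in isolation. The sketch below captures output in memory rather than in a file; the logger name pipeline_demo is an arbitrary choice for illustration.

```python
import io
import logging

# Capture log output in memory so it can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

logger = logging.getLogger("pipeline_demo")  # Hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # Keep demo messages out of the root logger

logger.setLevel(logging.INFO)
logger.debug("row details")     # Dropped: DEBUG is below INFO
logger.info("loading started")  # Kept

logger.setLevel(logging.DEBUG)
logger.debug("row details")     # Kept now that the level is DEBUG

captured = stream.getvalue().splitlines()
print(captured)  # ['INFO - loading started', 'DEBUG - row details']
```

A named logger like this one is a common alternative to calling logging.info directly, since each module can then be tuned to its own level.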

9. Practical Example: A Mini Data Pipeline

Let’s combine the modules above into a pipeline that:

  1. Loads raw sales data (CSV).
  2. Cleans dates (using datetime).
  3. Aggregates revenue by product (using collections.Counter and defaultdict).
  4. Logs steps (using logging).
  5. Saves results to JSON (using json).

import csv
import json
from datetime import datetime
from collections import defaultdict, Counter
import logging
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

def main(input_path, output_path):
    # Step 1: Load CSV data
    logging.info(f"Loading data from {input_path}")
    with open(input_path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        sales_data = list(reader)
    logging.info(f"Loaded {len(sales_data)} rows")

    # Step 2: Clean dates (filter 2023 data)
    logging.info("Filtering 2023 data...")
    valid_rows = []
    for row in sales_data:
        try:
            date = datetime.strptime(row["date"], "%Y-%m-%d")
            if date.year == 2023:
                valid_rows.append(row)
        except ValueError:
            logging.warning(f"Skipping invalid date: {row['date']}")
    logging.info(f"Retained {len(valid_rows)} valid 2023 rows")

    # Step 3: Aggregate revenue by product
    logging.info("Aggregating revenue...")
    product_revenue = defaultdict(float)
    for row in valid_rows:
        product_revenue[row["product"]] += float(row["revenue"])

    # Step 4: Count sales and calculate stats
    product_counts = Counter([row["product"] for row in valid_rows])
    results = {
        "product_stats": {
            product: {
                "total_revenue": round(revenue, 2),
                "total_sales": product_counts[product]
            } for product, revenue in product_revenue.items()
        },
        "total_2023_revenue": round(sum(product_revenue.values()), 2)
    }

    # Step 5: Save results to JSON
    output_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure the output folder exists
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    logging.info(f"Saved results to {output_path}")

if __name__ == "__main__":
    input_path = Path("data/raw/sales.csv")
    output_path = Path("data/processed/2023_sales_summary.json")
    main(input_path, output_path)

10. Conclusion

Python’s Standard Library is a hidden gem for data science. While specialized libraries like Pandas and NumPy excel at scale, the Standard Library offers:

  • Lightweight workflows: No external dependencies (ideal for edge devices or restricted environments).
  • Foundational skills: Understanding core modules makes learning advanced tools easier.
  • Speed of setup: Nothing to install and little import overhead, which avoids overkill for small-to-medium datasets.

By mastering modules like csv, collections, itertools, and logging, you can build robust data pipelines with minimal overhead. The next time you reach for Pandas, ask: Can the Standard Library handle this task more simply?

11. References