Table of Contents
- File Handling: os & pathlib
- Data Input/Output: csv & json
- Time Series: datetime
- Advanced Data Structures: collections
- Efficient Iteration: itertools
- Numerical Computation: math & statistics
- Scripting & Automation: sys & argparse
- Debugging & Monitoring: logging
- Practical Example: A Mini Data Pipeline
- Conclusion
- References
1. File Handling: os & pathlib
Before analyzing data, you’ll often need to interact with files and directories (e.g., locating datasets, creating output folders). The os module and its modern counterpart pathlib simplify this.
os: Operating System Interactions
os provides functions to navigate directories, check file existence, and modify paths.
Key Functions:
- os.getcwd(): Get the current working directory.
- os.listdir(path): List the entries in a directory.
- os.path.exists(path): Check whether a path exists.
- os.makedirs(path): Create nested directories.
Example: Locate and validate a dataset path.
import os
# Get current directory
current_dir = os.getcwd()
print(f"Current Directory: {current_dir}")
# Define dataset path
data_path = os.path.join(current_dir, "data", "raw", "sales.csv")
# Check if file exists
if os.path.exists(data_path):
    print(f"Dataset found: {data_path}")
else:
    print(f"Error: Dataset not found at {data_path}")
    # Make sure the parent directory exists (no error if it already does)
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    print(f"Ensured directory exists: {os.path.dirname(data_path)}")
pathlib: Object-Oriented Paths
pathlib (Python 3.4+) replaces string-based paths with objects, making code cleaner and more readable.
Key Methods:
- Path.cwd(): Get the current directory (returns a Path object).
- Path.glob(pattern): Search for files matching a pattern (e.g., *.csv).
- Path.exists(): Check whether the path exists.
- Path.mkdir(parents=True, exist_ok=True): Create a directory (including missing parents).
Example: Using pathlib to find CSV files in a folder.
from pathlib import Path
# Define data directory
data_dir = Path("data/raw")
# Find all CSV files
csv_files = list(data_dir.glob("*.csv"))
print(f"Found {len(csv_files)} CSV files:")
for file in csv_files:
    print(f"- {file.name}")  # Access filename via .name attribute
When to Use: Use pathlib for new projects (cleaner syntax) and os for compatibility with older code.
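As a rough sketch of that advice, the snippet below runs the common pathlib operations end to end inside a throwaway temporary directory. The folder layout and file names (data/raw, sales.csv, notes.txt) are illustrative assumptions, not part of any real project.

```python
import tempfile
from pathlib import Path

# Work inside a temporary directory so nothing is left on disk afterwards.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"        # "/" joins path segments
    raw_dir.mkdir(parents=True, exist_ok=True)  # like os.makedirs(..., exist_ok=True)

    # Create two small files (hypothetical names) to search through.
    (raw_dir / "sales.csv").write_text("date,product,revenue\n", encoding="utf-8")
    (raw_dir / "notes.txt").write_text("not a dataset\n", encoding="utf-8")

    # Collect results while the directory still exists.
    csv_names = sorted(p.name for p in raw_dir.glob("*.csv"))
    found = (raw_dir / "sales.csv").exists()
    suffix = (raw_dir / "sales.csv").suffix

print(csv_names, found, suffix)
```

Each step here has a direct os equivalent (os.path.join, os.makedirs, os.listdir with manual filtering), which is why mixing the two modules in older code bases is usually painless.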
2. Data Input/Output: csv & json
Most data science workflows start with loading data. The csv and json modules handle two of the most common formats.
csv: Comma-Separated Values
The csv module parses and writes CSV files, avoiding manual string splitting (which fails with quoted commas).
Key Classes/Functions:
- csv.reader(file): Read CSV as rows (lists).
- csv.DictReader(file): Read CSV as rows (dictionaries, with headers as keys).
- csv.writer(file): Write rows to CSV.
- csv.DictWriter(file, fieldnames): Write dictionaries to CSV with specified headers.
Example: Load a CSV into a list of dictionaries.
import csv
from pathlib import Path
data_path = Path("data/raw/sales.csv")
# Load CSV with DictReader
with open(data_path, "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # Uses first row as headers
    sales_data = list(reader)  # Convert to list of dicts
# Inspect first row
print("First row:", sales_data[0])
# Example output: {'date': '2023-01-01', 'product': 'Laptop', 'revenue': '999.99'}
Example: Write filtered data to a new CSV.
# Filter rows where revenue > 500
high_revenue = [row for row in sales_data if float(row["revenue"]) > 500]
# Write to CSV with DictWriter
output_path = Path("data/processed/high_revenue.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure the output folder exists
with open(output_path, "w", newline="", encoding="utf-8") as f:
    fieldnames = ["date", "product", "revenue"]  # Explicit headers
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()  # Write headers
    writer.writerows(high_revenue)  # Write all rows
print(f"Saved {len(high_revenue)} rows to {output_path}")
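To illustrate the earlier point about quoted commas, here is a minimal in-memory sketch. The sample row and its product name are made up for the demonstration; io.StringIO stands in for a real file.

```python
import csv
import io

# One data row whose product name contains a comma, so the field is quoted.
raw_text = 'date,product,revenue\n2023-01-01,"Laptop, 15-inch",999.99\n'

# Naive splitting breaks the quoted field into two pieces.
naive_fields = raw_text.splitlines()[1].split(",")

# csv.DictReader respects the quotes and keeps the field intact.
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Writing the row back re-applies quoting where it is needed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "product", "revenue"])
writer.writeheader()
writer.writerows(rows)
round_trip = buffer.getvalue()

print(len(naive_fields))   # 4 pieces instead of 3 fields
print(rows[0]["product"])  # Laptop, 15-inch
```

Note that every value read by the csv module is a string, which is why the filtering example above converts revenue with float() before comparing.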
json: JavaScript Object Notation
JSON is ideal for nested or semi-structured data (e.g., API responses). The json module serializes/deserializes Python objects.
Key Functions:
- json.load(file): Load JSON from a file into a Python dict/list.
- json.loads(string): Load JSON from a string.
- json.dump(obj, file): Write a Python object to a JSON file.
- json.dumps(obj): Convert a Python object to a JSON string.
Example: Load and parse nested JSON data.
import json
from pathlib import Path
# Load JSON data
json_path = Path("data/raw/user_sessions.json")
with open(json_path, "r", encoding="utf-8") as f:
    sessions = json.load(f)  # List of session dicts
# Extract user IDs and session durations
user_sessions = []
for session in sessions:
    user_sessions.append({
        "user_id": session["user"]["id"],
        "duration_minutes": session["duration"] / 60  # Convert seconds to minutes
    })
# Save cleaned data
with open("data/processed/user_sessions_clean.json", "w", encoding="utf-8") as f:
    json.dump(user_sessions, f, indent=2)  # indent for readability
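The string-based functions json.dumps and json.loads behave the same way as their file-based counterparts. The sketch below round-trips a made-up record to show two things worth knowing: tuples come back as lists, and types JSON does not know (here a date) need a fallback passed via the default parameter. The encode_extra helper is a hypothetical name chosen for this example.

```python
import json
from datetime import date

# Hypothetical record mixing JSON-friendly and JSON-unfriendly types.
record = {"user": {"id": 42}, "tags": ("new", "mobile"), "signup": date(2023, 1, 1)}

def encode_extra(obj):
    """Fallback for types json cannot serialize natively (here: dates)."""
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

text = json.dumps(record, default=encode_extra, sort_keys=True)
restored = json.loads(text)

print(text)
print(restored["tags"])    # ['new', 'mobile'] (tuple became a list)
print(restored["signup"])  # 2023-01-01 (a plain string, not a date)
```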
3. Time Series: datetime
Time series data (e.g., sales dates, sensor timestamps) requires parsing, formatting, and arithmetic. The datetime module handles this with datetime, date, time, and timedelta classes.
Key Classes/Functions:
- datetime.datetime(year, month, day, hour, minute, second): Represents a timestamp.
- datetime.strptime(date_str, format): Parse a string into a datetime object (string parse time).
- datetime.strftime(format): Format a datetime object into a string (string format time).
- timedelta(days, hours, ...): Represents a time interval.
Example: Parse dates and calculate time differences.
from datetime import datetime, timedelta
# Sample date strings from sales_data (from earlier CSV example)
date_str = sales_data[0]["date"] # '2023-01-01'
# Parse string to datetime object
date_obj = datetime.strptime(date_str, "%Y-%m-%d") # %Y=4-digit year, %m=2-digit month, %d=2-digit day
print(f"Parsed date: {date_obj} (type: {type(date_obj)})") # 2023-01-01 00:00:00 (type: <class 'datetime.datetime'>)
# Calculate days since date
today = datetime.today()
days_since = (today - date_obj).days # timedelta object's .days attribute
print(f"Days since {date_str}: {days_since}")
# Format datetime back to string (e.g., "Jan 01, 2023")
formatted_date = date_obj.strftime("%b %d, %Y") # %b=abbreviated month, %d=day, %Y=year
print(f"Formatted date: {formatted_date}") # Jan 01, 2023
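Because the example above depends on today's date, its output changes from run to run. The following sketch uses fixed, made-up timestamps instead, so the timedelta arithmetic can be checked by hand.

```python
from datetime import datetime, timedelta

order_date = datetime.strptime("2023-01-01", "%Y-%m-%d")

# Adding a timedelta shifts a timestamp forward.
due_date = order_date + timedelta(days=30)

# Subtracting two timestamps yields a timedelta.
shipped = datetime(2023, 1, 4, 18, 30)  # hypothetical shipping time
elapsed = shipped - order_date

print(due_date.strftime("%Y-%m-%d"))   # 2023-01-31
print(elapsed.days)                    # 3 (whole days only)
print(elapsed.total_seconds() / 3600)  # 90.5 (full interval in hours)
print(order_date.weekday())            # 6 (Monday is 0, so this is a Sunday)
```

The .days attribute drops the partial day, so total_seconds() is the safer choice whenever hours or minutes matter.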
4. Advanced Data Structures: collections
The collections module extends Python’s built-in data structures with tools for common tasks like counting, grouping, and efficient lookups.
Key Structures:
- defaultdict: Dictionary with default values for missing keys (avoids KeyError).
- Counter: Counts hashable objects (e.g., categorical data).
- deque: Double-ended queue for fast appends/pops from both ends.
- namedtuple: Lightweight class for immutable data (e.g., rows with named fields).
Example 1: Counter for Categorical Data
Count product frequencies in sales data:
from collections import Counter
# Extract product names from sales_data (list of dicts)
products = [row["product"] for row in sales_data]
# Count occurrences
product_counts = Counter(products)
# Most common products
print("Top 3 products:")
for product, count in product_counts.most_common(3):
    print(f"- {product}: {count} sales")
# Example output:
# - Laptop: 150 sales
# - Phone: 120 sales
# - Tablet: 80 sales
Example 2: defaultdict for Grouping
Group sales by product category:
from collections import defaultdict
# Sample product categories (could load from a lookup file)
categories = {
    "Laptop": "Electronics",
    "Phone": "Electronics",
    "Tablet": "Electronics",
    "Desk": "Furniture"
}
# Group sales by category using defaultdict(list)
sales_by_category = defaultdict(list)
for row in sales_data:
    category = categories.get(row["product"], "Other")  # Default to "Other"
    sales_by_category[category].append(float(row["revenue"]))
# Calculate total revenue per category
total_revenue = {cat: sum(revenues) for cat, revenues in sales_by_category.items()}
print("Revenue by category:", total_revenue)
# Example output: {'Electronics': 349999.75, 'Furniture': 4999.95}
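The other two structures from the list, namedtuple and deque, can be sketched together. The example below assumes a small made-up list of sales and uses a deque with maxlen=3 as a sliding window to compute a 3-row rolling mean; the Sale type and its field names are chosen for illustration only.

```python
from collections import deque, namedtuple

# A lightweight, immutable row type with named fields.
Sale = namedtuple("Sale", ["date", "product", "revenue"])

# Hypothetical sample data.
sales = [
    Sale("2023-01-01", "Laptop", 1000.0),
    Sale("2023-01-02", "Phone", 500.0),
    Sale("2023-01-03", "Tablet", 300.0),
    Sale("2023-01-04", "Desk", 100.0),
]

# A deque with maxlen discards the oldest item automatically when full.
window = deque(maxlen=3)
rolling_means = []
for sale in sales:
    window.append(sale.revenue)
    if len(window) == window.maxlen:  # only report once the window is full
        rolling_means.append(sum(window) / len(window))

print(rolling_means)     # [600.0, 300.0]
print(sales[0].product)  # Laptop (access by name instead of index)
```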
5. Efficient Iteration: itertools
For large datasets, building full intermediate lists can waste memory. The itertools module provides lazy, memory-efficient tools for generating and combining iterables, producing one item at a time instead of storing every value.
Key Functions:
- itertools.chain(*iterables): Combine multiple iterables into one.
- itertools.groupby(iterable, key): Group consecutive items that share a key (similar to SQL’s GROUP BY, but the input must already be sorted by that key).
- itertools.product(*iterables): Cartesian product of input iterables (e.g., feature combinations).
Example 1: chain for Combining Data Sources
Merge two CSV files without loading both into memory:
import itertools
import csv
from pathlib import Path
# Define file paths
file1 = Path("data/raw/sales_2023.csv")
file2 = Path("data/raw/sales_2024.csv")
# Open both files and chain readers
with open(file1, "r", newline="", encoding="utf-8") as f1, \
     open(file2, "r", newline="", encoding="utf-8") as f2:
    reader1 = csv.DictReader(f1)
    reader2 = csv.DictReader(f2)
    combined_reader = itertools.chain(reader1, reader2)  # Lazy evaluation
    # Process row-by-row while both files are still open (memory-efficient)
    for row_num, row in enumerate(combined_reader, 1):
        if row_num % 1000 == 0:  # Progress update
            print(f"Processed {row_num} rows...")
Example 2: product for Feature Engineering
Generate all combinations of categorical features (e.g., for a grid search):
from itertools import product
# Categorical features
regions = ["North", "South", "East", "West"]
segments = ["Retail", "Corporate"]
# Generate all combinations
feature_combinations = list(product(regions, segments))
print("Feature combinations:", feature_combinations)
# Output: [('North', 'Retail'), ('North', 'Corporate'), ..., ('West', 'Corporate')]
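The groupby function from the list above is easy to misuse, because it only groups consecutive items with the same key. The sketch below, using three made-up rows, shows what happens on unsorted input and how sorting by the same key first gives one group per product.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; note that "Laptop" appears in two non-adjacent positions.
rows = [
    {"product": "Laptop", "revenue": 1000.0},
    {"product": "Phone", "revenue": 500.0},
    {"product": "Laptop", "revenue": 1200.0},
]

by_product = itemgetter("product")

# Unsorted input: "Laptop" shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, key=by_product)]

# Sorting by the same key first yields one group per product.
rows_sorted = sorted(rows, key=by_product)
totals = {
    product: sum(r["revenue"] for r in group)
    for product, group in groupby(rows_sorted, key=by_product)
}

print(unsorted_keys)  # ['Laptop', 'Phone', 'Laptop']
print(totals)         # {'Laptop': 2200.0, 'Phone': 500.0}
```

When sorting is too expensive or the data arrives as a stream, the defaultdict pattern from the previous section is often the simpler choice.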
6. Numerical Computation: math & statistics
For basic numerical tasks, the math (low-level math) and statistics (descriptive stats) modules are often enough, so small scripts may not need NumPy at all.
math: Basic Arithmetic & Trigonometry
Includes functions like sqrt, log, sin, and constants like pi.
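Example: Calculate profit margin as a percentage of revenue, (revenue - cost) / revenue. Note that math has no round function; rounding uses the built-in round():
import math
def profit_margin(revenue, cost):
    if math.isclose(revenue, 0.0, abs_tol=1e-9):  # Avoid dividing by (near-)zero revenue
        return 0.0
    margin = (revenue - cost) / revenue
    return round(margin * 100, 2)  # Built-in round(); round to 2 decimals
print(profit_margin(999.99, 500.00))  # Output: 50.0 (50% margin)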
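For a closer look at what math itself offers, the following sketch uses a few functions that come up often in data work: fsum for accurate float summation, isclose for tolerant comparison, and sqrt and hypot for distances. The input values are arbitrary and chosen so the results are easy to verify.

```python
import math

values = [0.1] * 10

# Plain sum() accumulates floating-point error; math.fsum() compensates for it.
naive_total = sum(values)
exact_total = math.fsum(values)

print(naive_total == 1.0)                # False
print(exact_total == 1.0)                # True
print(math.isclose(naive_total, 1.0))    # True (equal within a relative tolerance)

# Square root and Euclidean distance.
print(math.sqrt(16.0))        # 4.0
print(math.hypot(3.0, 4.0))   # 5.0
```

Comparing floats with isclose rather than == avoids surprises like the first line of output.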
statistics: Descriptive Statistics
Compute mean, median, standard deviation, and more on numerical data.
Example: Descriptive stats for revenue:
import statistics
# Extract revenues as floats
revenues = [float(row["revenue"]) for row in sales_data]
# Compute stats
mean_rev = statistics.mean(revenues)
median_rev = statistics.median(revenues)
stdev_rev = statistics.stdev(revenues) # Sample standard deviation
print(f"Revenue stats:\nMean: {mean_rev:.2f}\nMedian: {median_rev:.2f}\nStdev: {stdev_rev:.2f}")
# Example output:
# Mean: 750.50
# Median: 699.99
# Stdev: 230.25
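Two details of the statistics module are worth a quick sketch: stdev computes the sample standard deviation (dividing by n - 1) while pstdev computes the population version (dividing by n), and stdev raises StatisticsError when given fewer than two values. The revenue figures below are made up.

```python
import statistics

revenues = [100.0, 200.0, 300.0, 400.0]  # hypothetical values

sample_sd = statistics.stdev(revenues)       # divides by n - 1
population_sd = statistics.pstdev(revenues)  # divides by n
median_rev = statistics.median(revenues)     # mean of the two middle values

# stdev needs at least two data points; guard against tiny inputs.
try:
    statistics.stdev([100.0])
    single_point_ok = True
except statistics.StatisticsError:
    single_point_ok = False

print(f"{sample_sd:.2f} {population_sd:.2f} {median_rev}")
print(single_point_ok)  # False
```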
7. Scripting & Automation: sys & argparse
To turn data processing into reusable tools, use sys (access command-line arguments) and argparse (parse arguments cleanly).
sys: System-Specific Parameters
sys.argv is a list of the command-line arguments passed to the script; sys.argv[0] is the script name itself, so user-supplied arguments start at index 1.
Example: Simple script with sys.argv
import sys
from pathlib import Path
def main():
    # Check if input path is provided
    if len(sys.argv) != 2:
        print("Usage: python process_data.py <input_file>")
        sys.exit(1)  # Exit with error code 1
    input_path = Path(sys.argv[1])
    if not input_path.exists():
        print(f"Error: File {input_path} not found.")
        sys.exit(1)
    print(f"Processing data from: {input_path}")
    # Add data processing logic here...
if __name__ == "__main__":
    main()  # Run when script is executed directly
argparse: Structured Argument Parsing
For complex scripts, argparse adds help messages, type checking, and optional arguments.
Example: Script with argparse
import argparse
from pathlib import Path
def main():
    parser = argparse.ArgumentParser(description="Process sales data.")
    # Required input path
    parser.add_argument("input", type=Path, help="Path to input CSV file")
    # Optional output path (default: ./output.csv)
    parser.add_argument("-o", "--output", type=Path, default=Path("output.csv"),
                        help="Path to output CSV file (default: output.csv)")
    # Flag to enable verbose mode
    parser.add_argument("-v", "--verbose", action="store_true", help="Print detailed logs")
    args = parser.parse_args()  # Parse arguments
    # Validate input
    if not args.input.exists():
        parser.error(f"Input file not found: {args.input}")
    if args.verbose:
        print(f"Starting processing...\nInput: {args.input}\nOutput: {args.output}")
    # Add processing logic here...
if __name__ == "__main__":
    main()
Run with: python process_data.py data/raw/sales.csv -o data/processed/output.csv -v
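One convenient property of argparse is that parse_args accepts an explicit list of strings, which makes a parser easy to try out without a shell. The sketch below rebuilds a parser with the same three arguments as above (the build_parser helper name is an assumption for this example) and parses two hand-written argument lists.

```python
import argparse
from pathlib import Path

def build_parser():
    """Return a parser mirroring the script above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Process sales data.")
    parser.add_argument("input", type=Path, help="Path to input CSV file")
    parser.add_argument("-o", "--output", type=Path, default=Path("output.csv"),
                        help="Path to output CSV file (default: output.csv)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print detailed logs")
    return parser

parser = build_parser()

# Passing a list bypasses sys.argv, which is handy for tests and notebooks.
defaults = parser.parse_args(["data/raw/sales.csv"])
custom = parser.parse_args(["data/raw/sales.csv", "-o", "out/summary.csv", "-v"])

print(defaults.output, defaults.verbose)  # default output path, verbose off
print(custom.output, custom.verbose)      # overridden output path, verbose on
```

Separating parser construction from main() in this way keeps the argument definitions testable on their own.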
8. Debugging & Monitoring: logging
The logging module tracks script execution, making it easier to debug failures (e.g., missing files, data errors).
Key Features:
- Log levels: DEBUG (detailed), INFO (progress), WARNING, ERROR, CRITICAL.
- Log to files, console, or external services.
- Format logs with timestamps, module names, and messages.
Example: Logging a Data Pipeline
import logging
from pathlib import Path
# Configure logging to file and console
logging.basicConfig(
    level=logging.INFO,  # Capture INFO and above
    format="%(asctime)s - %(levelname)s - %(message)s",  # Include timestamp and level
    handlers=[
        logging.FileHandler("data_pipeline.log"),  # Log to file
        logging.StreamHandler()  # Also log to console
    ]
)
def load_data(path):
    logging.info(f"Loading data from {path}")
    if not path.exists():
        logging.error(f"File not found: {path}")
        raise FileNotFoundError(f"Missing file: {path}")
    # Placeholder load logic: read the file as a list of lines
    with open(path, "r", encoding="utf-8") as f:
        data = f.read().splitlines()
    logging.debug(f"Loaded {len(data)} rows")  # DEBUG-level (only shown if level=DEBUG)
    return data
# Run pipeline
try:
    data_path = Path("data/raw/sales.csv")
    data = load_data(data_path)
    logging.info("Data loaded successfully")
    # Add processing steps...
except Exception as e:
    logging.critical(f"Pipeline failed: {str(e)}", exc_info=True)  # Log traceback
Logs will appear in data_pipeline.log and the console:
2023-10-01 14:30:00,123 - INFO - Loading data from data/raw/sales.csv
2023-10-01 14:30:02,456 - INFO - Data loaded successfully
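To see level filtering in isolation, the sketch below attaches a handler that writes to an in-memory buffer instead of a file. The logger name pipeline_demo and the messages are assumptions made for the demonstration; the timestamp is left out of the format so the captured lines are predictable.

```python
import io
import logging

# Capture log output in memory instead of a file or the console.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

logger = logging.getLogger("pipeline_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)                # INFO and above pass through
logger.addHandler(handler)
logger.propagate = False                     # keep demo output away from the root logger

logger.debug("row-level detail")             # below INFO, so it is dropped
logger.info("Loaded %d rows", 3)             # arguments are formatted lazily
logger.warning("Skipping invalid date: %s", "2023-13-01")

captured = stream.getvalue().splitlines()
print(captured)
# ['INFO - Loaded 3 rows', 'WARNING - Skipping invalid date: 2023-13-01']
```

Using a named logger rather than the root logger makes it possible to tune verbosity per module, which becomes useful once a pipeline spans several files.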
9. Practical Example: A Mini Data Pipeline
Let’s combine the modules above into a pipeline that:
- Loads raw sales data (CSV).
- Cleans dates (using datetime).
- Aggregates revenue by product (using collections.Counter and defaultdict).
- Logs steps (using logging).
- Saves results to JSON (using json).
import csv
import json
from datetime import datetime
from collections import defaultdict, Counter
import logging
from pathlib import Path
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
def main(input_path, output_path):
    # Step 1: Load CSV data
    logging.info(f"Loading data from {input_path}")
    with open(input_path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        sales_data = list(reader)
    logging.info(f"Loaded {len(sales_data)} rows")
    # Step 2: Clean dates (filter 2023 data)
    logging.info("Filtering 2023 data...")
    valid_rows = []
    for row in sales_data:
        try:
            date = datetime.strptime(row["date"], "%Y-%m-%d")
            if date.year == 2023:
                valid_rows.append(row)
        except ValueError:
            logging.warning(f"Skipping invalid date: {row['date']}")
    logging.info(f"Retained {len(valid_rows)} valid 2023 rows")
    # Step 3: Aggregate revenue by product
    logging.info("Aggregating revenue...")
    product_revenue = defaultdict(float)
    for row in valid_rows:
        product_revenue[row["product"]] += float(row["revenue"])
    # Step 4: Count sales and calculate stats
    product_counts = Counter([row["product"] for row in valid_rows])
    results = {
        "product_stats": {
            product: {
                "total_revenue": round(revenue, 2),
                "total_sales": product_counts[product]
            } for product, revenue in product_revenue.items()
        },
        "total_2023_revenue": round(sum(product_revenue.values()), 2)
    }
    # Step 5: Save results to JSON
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    logging.info(f"Saved results to {output_path}")
if __name__ == "__main__":
    input_path = Path("data/raw/sales.csv")
    output_path = Path("data/processed/2023_sales_summary.json")
    main(input_path, output_path)
10. Conclusion
Python’s Standard Library is a hidden gem for data science. While specialized libraries like Pandas and NumPy excel at scale, the Standard Library offers:
- Lightweight workflows: No external dependencies (ideal for edge devices or restricted environments).
- Foundational skills: Understanding core modules makes learning advanced tools easier.
- Simplicity: Avoids the setup and overhead of heavier tools when the dataset is small to medium in size.
By mastering modules like csv, collections, itertools, and logging, you can build robust data pipelines with minimal overhead. The next time you reach for Pandas, ask: Can the Standard Library handle this task more simply?