Table of Contents
- File & Path Handling: os and pathlib
- CSV Data Parsing: csv
- JSON Data Handling: json
- Time Series & Date Manipulation: datetime
- Advanced Data Structures: collections
- Efficient Iteration: itertools
- Basic Statistics: statistics
- Command-Line & I/O: sys
- Conclusion
- References
1. File & Path Handling: os and pathlib
Before analyzing data, you’ll often need to locate, read, or organize files (e.g., CSVs, logs, or JSON). The os module and its modern counterpart pathlib simplify path manipulation, directory traversal, and file metadata checks.
Key Functions & Use Cases
- os.path: Legacy module for path manipulation (e.g., joining paths, checking file existence).
- pathlib.Path: Object-oriented alternative to os.path (more intuitive and readable).
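To make the difference between the two APIs concrete, here is a minimal sketch that builds the same relative path both ways. The variable names legacy_path and modern_path are illustrative only.

```python
import os.path
from pathlib import Path

# Build the same relative path with each API.
legacy_path = os.path.join("data", "sales_data.csv")  # plain string
modern_path = Path("data") / "sales_data.csv"         # Path object

# A Path exposes its parts as attributes.
print(modern_path.name)    # sales_data.csv
print(modern_path.stem)    # sales_data
print(modern_path.suffix)  # .csv
print(modern_path.parent)  # data

# os.path offers string-based equivalents.
print(os.path.basename(legacy_path))     # sales_data.csv
print(os.path.splitext(legacy_path)[1])  # .csv
```

Both objects describe the same location, so Path(legacy_path) compares equal to modern_path.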
Example: Locating and Reading a Data File
Suppose you have a CSV file in a data/ subdirectory. Use pathlib to safely construct the file path and check if it exists:
from pathlib import Path

# Define the data directory and file name
data_dir = Path("data")
file_path = data_dir / "sales_data.csv"  # Uses '/' operator to join paths

# Check if the file exists
if file_path.exists() and file_path.is_file():
    print(f"Found file: {file_path}")
    with open(file_path, "r") as f:
        # Read file content (e.g., with csv module, covered next)
        pass
else:
    print(f"File not found: {file_path}")
Why it matters: Building paths from components avoids hardcoding separator-dependent strings (e.g., data\sales_data.csv), which break across operating systems. pathlib picks the correct separator (/ vs. \) for the platform automatically.
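Directory traversal and metadata checks can be sketched in the same style. The helper list_csv_files below is a hypothetical name, and the temporary directory exists only so the sketch runs anywhere without touching real data.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return (name, size_in_bytes) for each .csv file directly inside directory."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in Path(directory).glob("*.csv")
        if path.is_file()
    )


# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "sales_data.csv").write_bytes(b"date,product,revenue\n")
    (tmp_path / "notes.txt").write_bytes(b"not a CSV\n")
    found = list_csv_files(tmp_path)

print(found)  # [('sales_data.csv', 21)]
```

Path.glob("*.csv") only looks at the directory itself; Path.rglob("*.csv") would also descend into subdirectories.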
2. CSV Data Parsing: csv
Comma-Separated Values (CSV) is the most common format for tabular data. The csv module simplifies reading, writing, and manipulating CSV files without manual string splitting.
Key Functions
- csv.reader: Reads CSV rows as lists (index-based access).
- csv.DictReader: Reads CSV rows as dictionaries (column-name-based access).
- csv.writer / csv.DictWriter: Writes data to CSV files.
Example: Analyzing Sales Data with csv.DictReader
Suppose sales_data.csv has columns: date, product, revenue. Use DictReader to group revenue by product:
import csv
from collections import defaultdict  # Covered later!

revenue_by_product = defaultdict(float)

# newline="" lets the csv module handle line endings itself
with open("data/sales_data.csv", "r", newline="") as f:
    reader = csv.DictReader(f)  # Uses first row as column names
    for row in reader:
        product = row["product"]
        revenue = float(row["revenue"])
        revenue_by_product[product] += revenue

# Print results
for product, total in revenue_by_product.items():
    print(f"{product}: ${total:.2f}")
Output:
Laptop: $15000.00
Phone: $8000.00
Tablet: $3000.00
Why it matters: DictReader lets you access columns by name (e.g., row["product"]), making code readable and robust to column reordering.
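Writing works the same way in reverse. The sketch below uses csv.DictWriter to write the per-product totals from the output above into an in-memory buffer. The column name total_revenue is an assumption chosen for illustration, and io.StringIO stands in for a real file.

```python
import csv
import io

# Totals copied from the example output above.
totals = {"Laptop": 15000.0, "Phone": 8000.0, "Tablet": 3000.0}

buffer = io.StringIO()  # stands in for open("totals.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["product", "total_revenue"])
writer.writeheader()
for product, total in totals.items():
    writer.writerow({"product": product, "total_revenue": f"{total:.2f}"})

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with newline="" so the csv module controls the line endings.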
3. JSON Data Handling: json
JSON (JavaScript Object Notation) is ubiquitous for APIs, config files, and nested data. The json module parses JSON strings/files into Python dictionaries/lists and vice versa.
Key Functions
- json.load: Reads JSON from a file into a Python object.
- json.loads: Parses a JSON string into a Python object.
- json.dump / json.dumps: Writes Python objects to JSON files/strings.
Example: Parsing API Response Data
Suppose you fetch user activity data from an API (stored in user_activity.json):
{
  "users": [
    {"id": 1, "name": "Alice", "activity": [{"date": "2023-10-01", "duration": 120}]},
    {"id": 2, "name": "Bob", "activity": [{"date": "2023-10-01", "duration": 90}, {"date": "2023-10-02", "duration": 60}]}
  ]
}
Use json.load to extract total activity duration per user:
import json

with open("user_activity.json", "r") as f:
    data = json.load(f)  # Parses JSON into a Python dict

# Calculate total activity duration per user
for user in data["users"]:
    total_duration = sum(act["duration"] for act in user["activity"])
    print(f"{user['name']}: {total_duration} minutes")
Output:
Alice: 120 minutes
Bob: 150 minutes
Why it matters: APIs and modern data pipelines return JSON. json avoids manual parsing of curly braces/brackets.
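The string-based counterparts, json.loads and json.dumps, follow the same pattern. This sketch parses one user record from a string, writes a summary back out, and shows the exception raised for malformed input. The summary key total_minutes is an illustrative choice.

```python
import json

raw = '{"id": 1, "name": "Alice", "activity": [{"date": "2023-10-01", "duration": 120}]}'

# Parse a JSON string into a dict.
user = json.loads(raw)

# Build a small summary and serialize it back to a JSON string.
summary = {
    "name": user["name"],
    "total_minutes": sum(act["duration"] for act in user["activity"]),
}
encoded = json.dumps(summary, indent=2, sort_keys=True)
print(encoded)

# Malformed input raises json.JSONDecodeError rather than returning partial data.
try:
    json.loads("{not valid json}")
except json.JSONDecodeError as exc:
    print(f"Could not parse: {exc.msg}")
```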
4. Time Series & Date Manipulation: datetime
Time series data (e.g., stock prices, sensor logs) requires handling dates, times, and time zones. The datetime module provides classes for precise time manipulation.
Key Classes/Functions
- datetime.datetime: Combines date and time (e.g., 2023-10-01 14:30:00).
- datetime.date / datetime.time: Isolates date or time.
- datetime.strptime: Parses strings into datetime objects (e.g., "2023-10-01" → datetime).
- datetime.strftime: Formats datetime objects into strings (e.g., datetime → "Oct 1, 2023").
- datetime.timedelta: Represents time intervals (e.g., 3 days, 2 hours).
Example: Analyzing Daily User Logins
Suppose a log file has timestamps like "2023-10-01 08:30:45". Use datetime to count logins per day:
from datetime import datetime
from collections import defaultdict

login_counts = defaultdict(int)

with open("user_logs.txt", "r") as f:
    for line in f:
        # Line format: "user1,2023-10-01 08:30:45"
        _, timestamp_str = line.strip().split(",")
        # Parse string to datetime object
        timestamp = datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")
        # Extract date (ignore time)
        date = timestamp.date()
        login_counts[date] += 1

# Print results sorted by date
for date in sorted(login_counts):
    print(f"{date}: {login_counts[date]} logins")
Output:
2023-10-01: 45 logins
2023-10-02: 38 logins
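Why it matters: Avoids errors from manual date arithmetic (e.g., “Is February 29 a valid date?”). datetime handles calendar rules such as leap years for you. Time zones and daylight saving are only applied when you attach time zone information (a tzinfo) to your datetime objects; naive datetimes carry no zone at all.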
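The timedelta and strftime entries from the list above can be sketched as follows. All values are made up for illustration, and every datetime here is naive (no time zone attached).

```python
from datetime import date, datetime, timedelta

start = datetime.strptime("2023-10-01 08:30:45", "%Y-%m-%d %H:%M:%S")

# Add an interval to a timestamp.
one_week_later = start + timedelta(days=7)

# Subtracting two datetimes yields a timedelta.
elapsed = datetime(2023, 10, 2, 9, 0, 0) - start
elapsed_hours = elapsed.total_seconds() / 3600

# Format a datetime back into a string.
label = start.strftime("%Y-%m-%d %H:%M")

# Calendar rules are applied automatically: 2024 is a leap year.
leap_day = date(2024, 2, 28) + timedelta(days=1)

print(one_week_later)  # 2023-10-08 08:30:45
print(elapsed)         # 1 day, 0:29:15
print(label)           # 2023-10-01 08:30
print(leap_day)        # 2024-02-29
```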
5. Advanced Data Structures: collections
The collections module extends Python’s built-in data types (lists, dicts) with specialized structures for common data analysis tasks.
Key Classes
- defaultdict: A dict that auto-initializes missing keys (e.g., with list, int, or float).
- Counter: Counts hashable objects (e.g., word frequencies, category counts).
- namedtuple: Creates lightweight, immutable “records” (e.g., for structured data).
Example 1: Grouping Data with defaultdict
Group sales data by region using defaultdict(list):
from collections import defaultdict

sales = [
    {"region": "North", "product": "Laptop", "revenue": 1000},
    {"region": "South", "product": "Phone", "revenue": 500},
    {"region": "North", "product": "Tablet", "revenue": 300},
]

sales_by_region = defaultdict(list)
for sale in sales:
    sales_by_region[sale["region"]].append(sale)  # No KeyError!

print(sales_by_region["North"])
# Output: [{'region': 'North', 'product': 'Laptop', 'revenue': 1000}, ...]
Example 2: Counting Categories with Counter
Count product occurrences in sales data:
from collections import Counter
products = [sale["product"] for sale in sales] # Extract product names
product_counts = Counter(products)
print(product_counts) # Output: Counter({'Laptop': 1, 'Phone': 1, 'Tablet': 1})
print(product_counts.most_common(1)) # Most frequent: [('Laptop', 1)]
Why it matters: Eliminates boilerplate (e.g., if key not in my_dict: my_dict[key] = 0 for counting).
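namedtuple is listed above but not demonstrated, so here is a minimal sketch using the same sales figures. The record type Sale and its field names are assumptions made for this example.

```python
from collections import namedtuple

# A lightweight record type with named fields.
Sale = namedtuple("Sale", ["region", "product", "revenue"])

records = [
    Sale("North", "Laptop", 1000),
    Sale("South", "Phone", 500),
    Sale("North", "Tablet", 300),
]

# Fields are accessed by name instead of by index or dict key.
north_total = sum(sale.revenue for sale in records if sale.region == "North")
top_sale = max(records, key=lambda sale: sale.revenue)

print(north_total)       # 1300
print(top_sale.product)  # Laptop

# Records are immutable: assigning to a field raises AttributeError.
try:
    records[0].revenue = 0
    mutated = True
except AttributeError:
    mutated = False
print(mutated)  # False
```

Compared with plain dicts, a namedtuple fixes the set of fields up front, which catches misspelled field names early.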
6. Efficient Iteration: itertools
The itertools module provides tools for efficient iteration over data, critical for processing large datasets without loading everything into memory.
Key Functions
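- itertools.groupby: Groups consecutive items that share a key (similar to SQL’s GROUP BY, but only once the input is sorted by that key).
- itertools.chain: Concatenates iterables end to end; chain.from_iterable flattens one level of nesting (e.g., [[1,2], [3,4]] → [1,2,3,4]).
- itertools.product: Computes Cartesian products (e.g., every pairing of items from two lists).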
Example: Grouping Time-Series Data with groupby
Group sensor data by hour using groupby (requires sorted data):
from itertools import groupby
from datetime import datetime

# Sample sensor data: (timestamp_str, temperature)
sensor_data = [
    ("2023-10-01 08:15:00", 22.5),
    ("2023-10-01 08:30:00", 23.0),
    ("2023-10-01 09:05:00", 24.1),
]

def hour_of(reading):
    # Hour of day (0-23) parsed from a (timestamp_str, temperature) tuple
    return datetime.strptime(reading[0], "%Y-%m-%d %H:%M:%S").hour

# Sort by the same key used for grouping (groupby only merges consecutive items)
sorted_data = sorted(sensor_data, key=hour_of)

# Group by hour of day; readings from different days that share an hour land in one group
for hour, group in groupby(sorted_data, key=hour_of):
    temperatures = [temp for _, temp in group]
    avg_temp = sum(temperatures) / len(temperatures)
    print(f"Hour {hour}: Average temp = {avg_temp:.1f}°C")
Output:
Hour 8: Average temp = 22.8°C
Hour 9: Average temp = 24.1°C
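Why it matters: groupby replaces hand-written grouping loops and yields groups lazily, so sorted data can be summarized one group at a time. If the data cannot be sorted first, a defaultdict is usually the simpler choice.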
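The other two functions from the list, and the reason groupby needs sorted input, can be sketched briefly. The sample values are invented for illustration.

```python
from itertools import chain, groupby, product

# chain.from_iterable flattens one level of nesting.
batches = [[1, 2], [3, 4]]
flat = list(chain.from_iterable(batches))
print(flat)  # [1, 2, 3, 4]

# product yields every pairing of items from the inputs.
pairs = list(product(["North", "South"], ["Laptop", "Phone"]))
print(len(pairs))  # 4

# groupby only merges neighbours, so unsorted input splits a key into several runs.
regions = ["North", "South", "North"]
runs = [(key, len(list(group))) for key, group in groupby(regions)]
print(runs)  # [('North', 1), ('South', 1), ('North', 1)]

# Sorting first gives one group per key.
grouped = [(key, len(list(group))) for key, group in groupby(sorted(regions))]
print(grouped)  # [('North', 2), ('South', 1)]
```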
7. Basic Statistics: statistics
For quick statistical analysis (no need for Pandas/NumPy), the statistics module provides functions for mean, median, standard deviation, and more.
Key Functions
- mean: Arithmetic mean (average).
- median: Middle value of sorted data.
- mode: Most frequent value.
- stdev: Sample standard deviation.
Example: Analyzing Test Scores
Compute basic stats for student test scores:
import statistics

scores = [85, 92, 78, 90, 85, 95, 88]
print(f"Mean: {statistics.mean(scores):.2f}")    # Output: 87.57
print(f"Median: {statistics.median(scores)}")    # Output: 88
print(f"Mode: {statistics.mode(scores)}")        # Output: 85
print(f"Stdev: {statistics.stdev(scores):.2f}")  # Output: 5.56
Why it matters: Avoids writing custom statistical formulas (e.g., for standard deviation), reducing errors.
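For comparison, this is roughly what the hand-written version looks like for the same scores. It is a sketch of the textbook formulas, not of how the statistics module is implemented internally. The only difference between the sample and population versions is the divisor (n - 1 versus n).

```python
import math
import statistics

scores = [85, 92, 78, 90, 85, 95, 88]

n = len(scores)
mean = sum(scores) / n
squared_deviations = sum((x - mean) ** 2 for x in scores)

# Sample standard deviation divides by n - 1; population divides by n.
sample_stdev = math.sqrt(squared_deviations / (n - 1))
population_stdev = math.sqrt(squared_deviations / n)

print(f"{sample_stdev:.2f}")  # 5.56

# The module's functions agree with the manual formulas.
print(math.isclose(sample_stdev, statistics.stdev(scores)))       # True
print(math.isclose(population_stdev, statistics.pstdev(scores)))  # True
```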
8. Command-Line & I/O: sys
The sys module interacts with the Python interpreter, enabling scripts to accept command-line arguments, read input, or write output.
Key Features
- sys.argv: List of command-line arguments (e.g., python script.py data.csv → sys.argv = ["script.py", "data.csv"]).
- sys.stdin / sys.stdout: Read from/write to standard input/output (e.g., pipe data between scripts).
Example: Processing Data from Command Line
Write a script to calculate mean revenue from a CSV file passed as a command-line argument:
import sys
import csv
import statistics

def main():
    if len(sys.argv) != 2:
        print("Usage: python mean_revenue.py <csv_file>")
        sys.exit(1)  # Exit with error code

    csv_file = sys.argv[1]
    revenues = []
    with open(csv_file, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            revenues.append(float(row["revenue"]))

    print(f"Mean revenue: ${statistics.mean(revenues):.2f}")

if __name__ == "__main__":
    main()
Run with:
python mean_revenue.py data/sales_data.csv
# Example output (the value depends on the rows in your file): Mean revenue: $4500.00
Why it matters: Enables automation (e.g., integrating with shell scripts or cron jobs).
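Reading piped input through sys.stdin can be sketched along the same lines. The function mean_from_stream is a hypothetical helper that expects one number per line. It falls back to sys.stdin when no stream is passed, and the demo feeds it an in-memory stream so the sketch runs without a pipe.

```python
import io
import statistics
import sys


def mean_from_stream(stream=None):
    """Return the mean of one-number-per-line input; reads sys.stdin by default."""
    if stream is None:
        stream = sys.stdin
    values = [float(line) for line in stream if line.strip()]  # skip blank lines
    if not values:
        raise ValueError("no numeric input received")
    return statistics.mean(values)


# Demo with an in-memory stream standing in for piped input.
demo_input = io.StringIO("1000\n500\n\n300\n")
demo_mean = mean_from_stream(demo_input)
print(f"Mean: {demo_mean:.2f}")  # Mean: 600.00
```

In a real script you would call mean_from_stream() with no argument and pipe data in from another command.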
Conclusion
Python’s standard library is a powerful, underrated tool for data analysis. Modules like csv, datetime, collections, and statistics handle core tasks like file parsing, time manipulation, and basic stats without external dependencies. While libraries like Pandas simplify complex workflows, knowing the standard library keeps small scripts free of extra dependencies and gives you a clearer picture of what those higher-level libraries do for you.
Start small: Use pathlib for file paths, csv.DictReader for CSV parsing, and collections.Counter for quick counts. As you grow, combine these tools with higher-level libraries to build robust data pipelines.