Table of Contents
- Introduction
- Parsing CSV Files with csv
- Working with JSON Data using json
- XML Parsing with xml.etree.ElementTree
- Configuration Files (INI) with configparser
- HTML Parsing with html.parser
- Plain Text & Logs with re (Regular Expressions)
- Best Practices for Data Parsing
- Conclusion
- References
Parsing CSV Files with csv
When to Use
Comma-Separated Values (CSV) is one of the most common formats for tabular data (e.g., spreadsheets, database exports, or sensor logs). Use Python’s csv module for parsing CSV/TSV files instead of manual string splitting: it handles edge cases like quoted fields, newlines within cells, and varying delimiters.
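To make the edge cases concrete, here is a minimal sketch comparing manual splitting with csv.reader. The sample data is made up for illustration, and io.StringIO stands in for a real file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample: one field contains a comma, another a newline,
# both inside double quotes.
raw = (
    'name,comment\r\n'
    'Alice,"Great product, would buy again"\r\n'
    'Bob,"Line one\nLine two"\r\n'
)

# Manual splitting cuts the quoted field at the embedded comma.
naive_row = raw.splitlines()[1].split(',')

# csv.reader keeps each quoted field intact, including the embedded newline.
rows = list(csv.reader(io.StringIO(raw, newline='')))

print(naive_row)  # three pieces instead of two fields
print(rows[1])    # two fields
print(rows[2])    # two fields, the second spanning two lines
```

The naive split yields three pieces for a two-column row, while csv.reader returns exactly two fields per row.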
Example: Reading and Writing CSV
Step 1: Basic CSV Parsing
Suppose we have a sales_data.csv file:
date,product,revenue
2023-01-01,Laptop,999.99
2023-01-02,Phone,699.99
2023-01-03,Tablet,299.99
Use csv.reader to parse rows as lists:
import csv

with open('sales_data.csv', 'r', newline='', encoding='utf-8') as file:
    csv_reader = csv.reader(file)
    headers = next(csv_reader)  # Extract header row
    print(f"Headers: {headers}")
    for row in csv_reader:
        date, product, revenue = row
        print(f"Date: {date}, Product: {product}, Revenue: ${revenue}")
Step 2: Parsing with Named Columns
For easier access to columns by name, use csv.DictReader:
with open('sales_data.csv', 'r', newline='', encoding='utf-8') as file:
    csv_dict_reader = csv.DictReader(file)
    for row in csv_dict_reader:
        print(f"{row['date']}: {row['product']} sold for ${row['revenue']}")
Step 3: Advanced: Custom Delimiters or Quoting
The csv module handles non-comma delimiters (e.g., tabs) and quoted fields. For a tab-separated file (data.tsv):
with open('data.tsv', 'r', newline='') as file:
    tsv_reader = csv.reader(file, delimiter='\t', quotechar='"')
    for row in tsv_reader:
        print(row)
Key Features
- Handles edge cases (quoted fields, newlines in cells).
- Supports custom delimiters, quote characters, and escape sequences.
- csv.writer for writing CSV files (e.g., csv.writer(file).writerow(row)).
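The writing side can be sketched as follows. This example uses csv.DictWriter, the dictionary-based counterpart of csv.writer, and writes to an in-memory buffer instead of a file so it is self-contained. The rows are invented sample data.

```python
import csv
import io

# Hypothetical rows; the first product name contains a comma on purpose.
rows = [
    {"date": "2023-01-04", "product": "Monitor, 27-inch", "revenue": "249.99"},
    {"date": "2023-01-05", "product": "Keyboard", "revenue": "49.99"},
]

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=["date", "product", "revenue"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should reproduce the original rows.
round_trip = list(csv.DictReader(io.StringIO(csv_text, newline='')))
print(round_trip == rows)
```

The writer quotes the field that contains a comma automatically, so the data survives a round trip unchanged. With a real file, open it with newline='' in the same way.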
Working with JSON Data using json
When to Use
JavaScript Object Notation (JSON) is ubiquitous for APIs, configuration files, and data interchange. Python’s json module seamlessly converts between JSON strings/files and Python dictionaries/lists.
Example: Loading and Dumping JSON
Step 1: Parsing JSON from a File
Suppose config.json contains:
{
"app_name": "MyApp",
"version": "1.0.0",
"features": ["login", "dashboard", "reports"],
"limits": {
"max_users": 1000,
"max_storage_gb": 50
}
}
Load and parse it with json.load:
import json
with open('config.json', 'r', encoding='utf-8') as file:
    config = json.load(file)  # Returns a Python dict

print(f"App Name: {config['app_name']}")
print(f"Max Users: {config['limits']['max_users']}")
Step 2: Parsing JSON from a String
Use json.loads (load string) for JSON data in memory:
json_str = '{"name": "Alice", "age": 30, "hobbies": ["reading", "hiking"]}'
data = json.loads(json_str)
print(f"{data['name']} is {data['age']} years old.")
Step 3: Converting Python Objects to JSON
Use json.dump (to file) or json.dumps (to string) to serialize Python data:
python_data = {
    "name": "Bob",
    "is_student": True,
    "grades": [90, 85, 95]
}

# Write to file with indentation for readability
with open('output.json', 'w') as file:
    json.dump(python_data, file, indent=4)  # indent=4 for pretty-printing

# Convert to string
json_str = json.dumps(python_data, sort_keys=True)  # sort_keys alphabetizes keys
print(json_str)
Key Features
- json.load/json.loads: Parse JSON into Python objects.
- json.dump/json.dumps: Serialize Python objects to JSON.
- Handles nested structures, booleans, and null values (None in Python).
- Use default and object_hook for custom serialization/deserialization (e.g., datetime objects).
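As an illustration of default and object_hook, the sketch below round-trips a datetime value. The "__datetime__" marker key and the helper names encode_extra and decode_extra are conventions invented for this example, not part of the json module.

```python
import json
from datetime import datetime

def encode_extra(obj):
    """Called by json.dumps for objects it cannot serialize on its own."""
    if isinstance(obj, datetime):
        # Wrap the ISO string in a marker dict (a made-up convention).
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

def decode_extra(dct):
    """Called by json.loads for every decoded JSON object (dict)."""
    if "__datetime__" in dct:
        return datetime.fromisoformat(dct["__datetime__"])
    return dct

event = {"name": "deploy", "at": datetime(2023, 10, 1, 8, 30, 15)}

encoded = json.dumps(event, default=encode_extra)
decoded = json.loads(encoded, object_hook=decode_extra)

print(encoded)
print(decoded == event)
```

The marker dictionary is one possible design. Storing a plain ISO string and converting known fields after loading is an equally reasonable alternative.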
XML Parsing with xml.etree.ElementTree
When to Use
Extensible Markup Language (XML) is still used in legacy systems, RSS feeds, and enterprise tools. Python’s xml.etree.ElementTree (ET) provides a lightweight API for parsing and manipulating XML.
Example: Parsing XML
Step 1: Basic XML Parsing
Given books.xml:
<library>
    <book id="1">
        <title>Python Basics</title>
        <author>John Doe</author>
        <year>2020</year>
    </book>
    <book id="2">
        <title>Advanced Python</title>
        <author>Jane Smith</author>
        <year>2022</year>
    </book>
</library>
Parse and extract data with ET:
import xml.etree.ElementTree as ET

# Parse from file
tree = ET.parse('books.xml')
root = tree.getroot()  # Root element: <library>

# Iterate over child elements
for book in root.findall('book'):  # Find all <book> tags
    book_id = book.get('id')  # Get attribute "id"
    title = book.find('title').text  # Get text of <title>
    author = book.find('author').text
    year = book.find('year').text
    print(f"Book {book_id}: {title} by {author} ({year})")
Step 2: Parsing XML from a String
Use ET.fromstring for XML in memory:
xml_str = """<note>
<to>Alice</to>
<from>Bob</from>
<body>Hello!</body>
</note>"""
root = ET.fromstring(xml_str)
print(f"Note to: {root.find('to').text}") # Output: "Note to: Alice"
Key Features
- Supports XPath-like queries (e.g., root.findall("./book/author")).
- Iterates over elements with iter().
- Handles namespaces (via the namespaces parameter in find/findall).
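Namespaces are a common stumbling block, so here is a small sketch of the namespaces parameter. The feed below is invented sample data using the Atom and Dublin Core namespace URIs, and the prefixes in the ns dictionary are local aliases chosen for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample feed with a default namespace and a prefixed one.
xml_str = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <entry>
    <title>First post</title>
    <dc:creator>Alice</dc:creator>
  </entry>
  <entry>
    <title>Second post</title>
    <dc:creator>Bob</dc:creator>
  </entry>
</feed>"""

# Prefix-to-URI mapping; the prefixes only need to be consistent in this code.
ns = {
    "atom": "http://www.w3.org/2005/Atom",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(xml_str)

# Without a namespace, the query finds nothing.
unqualified = root.findall("entry")

posts = []
for entry in root.findall("atom:entry", ns):
    title = entry.findtext("atom:title", default="", namespaces=ns)
    creator = entry.findtext("dc:creator", default="", namespaces=ns)
    posts.append((title, creator))

print(unqualified)
print(posts)
```

Elements in a default namespace still need a prefix in the query, which is why the unqualified lookup returns an empty list.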
Configuration Files (INI) with configparser
When to Use
INI files are simple yet powerful for application settings (e.g., config.ini). The configparser module parses INI-style files with sections and key-value pairs.
Example: Reading and Writing INI Files
Step 1: Parsing an INI File
Sample app_config.ini:
[database]
host = localhost
port = 5432
user = admin
password = secret
db_name = mydb
[app]
debug = False
max_retries = 3
log_file = app.log
Load and access settings with configparser.ConfigParser:
import configparser
config = configparser.ConfigParser()
config.read('app_config.ini') # Read the INI file
# Access values by section and key
db_host = config['database']['host']
db_port = config.getint('database', 'port') # Get as integer
debug_mode = config.getboolean('app', 'debug') # Get as boolean
print(f"Connecting to {db_host}:{db_port} (Debug: {debug_mode})")
Step 2: Writing Configuration Files
Update and save settings:
config.set('app', 'debug', 'True')  # Update a value
config.add_section('new_section')
config.set('new_section', 'key', 'value')

with open('updated_config.ini', 'w') as file:
    config.write(file)  # Write changes to file
Key Features
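- Supports type conversion (getint, getboolean, getfloat).
- Section names are flat: a header like [section.subsection] is an ordinary section whose name happens to contain a dot, so any nesting is a naming convention rather than a parser feature.
- Merges multiple INI files with read([file1, file2]); values from later files override earlier ones.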
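The merging and type-conversion behavior can be sketched without touching the filesystem by using read_string, which parses INI text from a string. The two configuration snippets and the timeout option are hypothetical.

```python
import configparser

# Hypothetical base settings and a local override.
defaults_ini = """
[app]
debug = False
max_retries = 3
"""

overrides_ini = """
[app]
debug = True
"""

config = configparser.ConfigParser()
config.read_string(defaults_ini)
config.read_string(overrides_ini)  # Later input overrides earlier values

debug = config.getboolean("app", "debug")
retries = config.getint("app", "max_retries")
# 'timeout' is not defined anywhere, so the fallback is returned.
timeout = config.getfloat("app", "timeout", fallback=30.0)

print(debug, retries, timeout)
```

Passing a list of paths to read() layers files in the same way, and the fallback argument avoids a NoOptionError when a key is missing.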
HTML Parsing with html.parser
When to Use
For simple HTML extraction (e.g., scraping links or text), use html.parser.HTMLParser (no external dependencies like BeautifulSoup). It’s lightweight but limited—use it for basic tasks only.
Example: Extracting Links from HTML
Subclass HTMLParser to handle tags and data:
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Look for <a> tags and extract href attributes
        if tag == 'a':
            for attr_name, attr_value in attrs:
                if attr_name == 'href':
                    print(f"Found link: {attr_value}")

# Sample HTML
html_content = """
<html>
<body>
<a href="https://python.org">Python Homepage</a>
<a href="/docs">Documentation</a>
<p>Some text here.</p>
</body>
</html>
"""

parser = LinkExtractor()
parser.feed(html_content)  # Parse the HTML
Output:
Found link: https://python.org
Found link: /docs
Key Features
- Override methods like handle_starttag, handle_endtag, and handle_data to process HTML.
- Lightweight and built into Python.
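To show the three handler methods working together, the following sketch collects each link's URL along with its visible text. The LinkCollector class and the base URL are assumptions made for this example, and urllib.parse.urljoin is used to resolve relative links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects (absolute_url, link_text) pairs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current_href = None  # Set while inside an <a href=...> tag
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._current_href = urljoin(self.base_url, href)
                self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None

collector = LinkCollector("https://python.org")
collector.feed(
    '<a href="https://python.org">Python Homepage</a>'
    '<a href="/docs">Documentation</a>'
    '<p>Some text here.</p>'
)
collector.close()

print(collector.links)
```

Storing results on the instance rather than printing them inside the handlers makes the parser easier to reuse and test.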
Plain Text & Logs with re (Regular Expressions)
When to Use
For unstructured text (e.g., logs, emails, or custom formats), re (regular expressions) is indispensable. It lets you define patterns to match, extract, or replace text.
Example: Parsing Log Files
Suppose app.log contains:
2023-10-01 08:30:15 [INFO] User 'alice' logged in
2023-10-01 08:35:42 [ERROR] Database connection failed (timeout)
2023-10-01 08:40:00 [INFO] User 'bob' logged in
Use re to extract timestamps, log levels, and messages:
import re
# Define a regex pattern: timestamp, level, message
log_pattern = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(INFO|ERROR|WARNING)\] (.*)$'
compiled_pattern = re.compile(log_pattern) # Compile for performance
with open('app.log', 'r') as file:
    for line in file:
        match = compiled_pattern.match(line.strip())
        if match:
            timestamp, level, message = match.groups()
            print(f"[{level}] {timestamp}: {message}")
Key Features
- re.match/re.search: Match patterns at the start or anywhere in the string.
- Groups ((...)) to extract specific parts of the match.
- re.findall: Get all non-overlapping matches.
- Compile patterns with re.compile for repeated use (faster).
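The features listed above can be seen side by side in a short sketch. It reuses log lines in the same format as app.log; the named groups (timestamp, level, message) are a stylistic choice for this example, not something the earlier pattern requires.

```python
import re

log_line = "2023-10-01 08:35:42 [ERROR] Database connection failed (timeout)"

# Named groups make the extracted fields self-describing.
pattern = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<message>.*)$"
)
fields = pattern.match(log_line).groupdict()

# re.match only succeeds at the start of the string; re.search scans ahead.
at_start = re.match(r"\[ERROR\]", log_line)
anywhere = re.search(r"\[ERROR\]", log_line)

# re.findall returns the captured group for every non-overlapping match.
log_text = "User 'alice' logged in\nUser 'bob' logged in"
users = re.findall(r"User '(\w+)'", log_text)

print(fields)
print(at_start, anywhere is not None)
print(users)
```

Here re.match returns None because the line starts with a timestamp, while re.search finds the level marker further along.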
Best Practices for Data Parsing
- Choose the Right Tool: Use csv for CSV/TSV, json for JSON, etc. Avoid regex for structured formats (e.g., don’t parse JSON with re!).
- Handle Errors: Use try-except blocks to catch parsing errors (e.g., json.JSONDecodeError, csv.Error).
try:
    with open('data.json') as f:
        data = json.load(f)
except json.JSONDecodeError as e:
    print(f"Invalid JSON: {e}")
- Optimize Performance:
  - Use csv.reader instead of manual split(',') for CSV files.
  - Compile regex patterns with re.compile for repeated use.
  - Parse large files line-by-line (e.g., for line in file:) instead of loading the entire file into memory.
- Secure Parsing: Treat untrusted XML with care. ElementTree is not hardened against every malicious input; depending on the Python and Expat versions it can be exposed to entity-expansion attacks, so consider a hardened parser such as defusedxml for untrusted data. When building regex patterns from user input, escape the input with re.escape, and avoid patterns with nested quantifiers that can backtrack catastrophically.
- Test with Edge Cases: Validate parsing with messy data (e.g., CSV with missing values, JSON with extra fields).
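Several of these practices can be combined in one sketch: parsing line by line, catching errors per record, and escaping user input before using it in a pattern. The parse_json_lines helper and the one-JSON-object-per-line input format are assumptions made for this example.

```python
import io
import json
import re

def parse_json_lines(stream):
    """Parse one JSON object per line, collecting errors instead of aborting."""
    records, errors = [], []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((line_no, exc.msg))
    return records, errors

# In-memory stand-in for a large file; line 2 is deliberately malformed.
sample = io.StringIO('{"id": 1}\n{"id": 2\n\n{"id": 3}\n')
records, errors = parse_json_lines(sample)

# Escape user-supplied text so parentheses are matched literally.
user_term = "failed (timeout)"
safe_pattern = re.compile(re.escape(user_term))
found = safe_pattern.search("Database connection failed (timeout)") is not None

print(records)
print(errors)
print(found)
```

Because the stream is consumed one line at a time, memory use stays flat regardless of file size, and a single bad record does not stop the run.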
Conclusion
Python’s standard library is a treasure trove for data parsing. From CSV and JSON to XML, INI files, and plain text, modules like csv, json, xml.etree.ElementTree, configparser, html.parser, and re cover most real-world scenarios. By leveraging these tools, you avoid external dependencies, keep your code lightweight, and ensure compatibility across environments.
For complex tasks (e.g., large-scale CSV analysis or advanced HTML scraping), libraries like pandas or BeautifulSoup add value, but start with the standard library. It is often sufficient and reliable.