Table of Contents
1. What Are Regular Expressions?
2. Getting Started with Python’s re Module
    - 2.1 Key re Module Functions
    - 2.2 The Match Object
3. Basic Regex Syntax
    - 3.1 Literals
    - 3.2 Metacharacters
    - 3.3 Quantifiers
4. Character Classes
    - 4.1 Custom Character Classes
    - 4.2 Predefined Character Classes
5. Anchors: Matching Positions, Not Characters
6. Groups and Capturing
    - 6.1 Grouping with ()
    - 6.2 Non-Capturing Groups
    - 6.3 Backreferences
7. Greedy vs. Non-Greedy Matching
8. Practical Examples
    - 8.1 Validating Emails
    - 8.2 Extracting Phone Numbers
    - 8.3 Replacing Text with re.sub()
9. Common Pitfalls
10. Best Practices
11. Conclusion
12. References
1. What Are Regular Expressions?
A regular expression is a sequence of characters that defines a search pattern. Think of it as a “textual filter” that matches strings meeting specific criteria. For example:
- `\d{3}-\d{2}-\d{4}` matches US Social Security numbers (e.g., 123-45-6789).
- `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}` matches email addresses (e.g., user@example.com).
Core regex syntax is broadly shared across programming languages, although each engine has its own variations. Python’s re module provides the functions and objects for working with patterns in Python code.
2. Getting Started with Python’s re Module
Python’s re module provides functions to work with regular expressions. To use it, start by importing the module:
import re
### 2.1 Key re Module Functions
The re module offers several functions for pattern matching. Here are the most common ones:
| Function | Purpose |
|---|---|
| `re.match(pattern, string)` | Checks if the pattern matches **at the start** of the string. |
| `re.search(pattern, string)` | Searches the entire string for the **first occurrence** of the pattern. |
| `re.findall(pattern, string)` | Returns a list of **all non-overlapping matches** in the string. |
| `re.sub(pattern, repl, string)` | Replaces all matches of the pattern with `repl` (a string or function). |
| `re.split(pattern, string)` | Splits the string at each match of the pattern. |
| `re.compile(pattern)` | Compiles a pattern into a reusable pattern object (`re.Pattern`), which is convenient when the same pattern is used many times. |
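To make the table concrete, here is a minimal sketch that calls each function once on the same string. The sample text and variable names are invented for illustration.

```python
import re

text = "2 apples, 15 pears, 7 plums"  # hypothetical sample text

at_start = re.match(r"\d+", text)        # only succeeds at the start of the string
first_word = re.search(r"[a-z]+", text)  # first occurrence anywhere in the string
all_numbers = re.findall(r"\d+", text)   # every non-overlapping match
masked = re.sub(r"\d+", "#", text)       # replace each match with "#"
parts = re.split(r",\s*", text)          # split on commas plus optional spaces
number = re.compile(r"\d+")              # compiled pattern object, reusable

print(at_start.group())      # 2
print(first_word.group())    # apples
print(all_numbers)           # ['2', '15', '7']
print(masked)                # # apples, # pears, # plums
print(parts)                 # ['2 apples', '15 pears', '7 plums']
print(number.findall(text))  # ['2', '15', '7']
```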
### 2.2 The Match Object
Functions like re.match() and re.search() return a Match object if they find a match; otherwise, they return None. The Match object contains details about the match, including:
- `group()`: Returns the matched substring.
- `start()` / `end()`: Return the start/end indices of the match.
- `span()`: Returns a tuple `(start, end)` of the match’s position.
Example:
text = "Hello, Python!"
match = re.search(r"Python", text)
if match:  # Always check for None to avoid errors!
    print("Match found:", match.group())  # Output: Match found: Python
    print("Position:", match.span())  # Output: Position: (7, 13)
3. Basic Regex Syntax
Regex patterns are built from literals (ordinary characters) and metacharacters (special characters that modify behavior).
### 3.1 Literals
Literals are ordinary characters that match themselves exactly. For example, the pattern cat matches the substring "cat" in "The cat sat".
Example:
text = "The cat sat on the mat."
matches = re.findall(r"cat", text)
print(matches) # Output: ['cat']
### 3.2 Metacharacters
Metacharacters add logic to patterns. Here are the most essential ones:
| Metacharacter | Description | Example Pattern | Matches |
|---|---|---|---|
| `.` | Matches **any single character** (except newline `\n`). | `c.t` | "cat", "cot", "c3t" (but not "ct") |
| `^` | Matches the **start of the string**. | `^Hello` | "Hello world" (but not "world Hello") |
| `$` | Matches the **end of the string**. | `world$` | "Hello world" (but not "world Hello") |
| `*` | Matches the preceding element **0 or more times** (greedy). | `ab*c` | "ac", "abc", "abbc" |
| `+` | Matches the preceding element **1 or more times** (greedy). | `ab+c` | "abc", "abbc" (but not "ac") |
| `?` | Matches the preceding element **0 or 1 time** (optional). | `colou?r` | "color", "colour" |
| `\|` | Acts as a logical **OR**. | `cat\|…` | |
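The metacharacters in this table can be checked against a few candidate strings. The sketch below uses re.fullmatch(), a standard re function not covered above that requires the whole string to match; the input lists and the `cat|dog` alternation are assumptions chosen for illustration.

```python
import re

words = ["cat", "cot", "ct", "dog", "colour", "color"]  # hypothetical inputs

# "." needs exactly one character between "c" and "t"
dot_hits = [w for w in words if re.fullmatch(r"c.t", w)]

# "?" makes the "u" optional
optional_hits = [w for w in words if re.fullmatch(r"colou?r", w)]

# "|" accepts either alternative
either_hits = [w for w in words if re.fullmatch(r"cat|dog", w)]

# "*" allows zero or more "b" characters
star_hits = [s for s in ["ac", "abc", "abbc", "adc"] if re.fullmatch(r"ab*c", s)]

print(dot_hits)       # ['cat', 'cot']
print(optional_hits)  # ['colour', 'color']
print(either_hits)    # ['cat', 'dog']
print(star_hits)      # ['ac', 'abc', 'abbc']
```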
### 3.3 Quantifiers
Quantifiers specify how many times an element should appear. We already saw *, +, and ?, but you can also use curly braces {} for precise control:
| Quantifier | Description | Example Pattern | Matches |
|---|---|---|---|
| `{n}` | Exactly n times. | `\d{3}` | "123", "456" (3 digits) |
| `{n,}` | At least n times. | `\d{2,}` | "12", "123", "1234" (≥2 digits) |
| `{n,m}` | Between n and m times (inclusive). | `\d{2,4}` | "12", "123", "1234" (2-4 digits) |
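As a small sketch of how the three brace forms differ, the snippet below filters a list of digit strings with each quantifier. The sample list is made up, and re.fullmatch() is used so that the whole string must satisfy the count.

```python
import re

samples = ["7", "42", "123", "1234", "12345"]  # hypothetical inputs

exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
at_least_two = [s for s in samples if re.fullmatch(r"\d{2,}", s)]
two_to_four = [s for s in samples if re.fullmatch(r"\d{2,4}", s)]

print(exactly_three)  # ['123']
print(at_least_two)   # ['42', '123', '1234', '12345']
print(two_to_four)    # ['42', '123', '1234']
```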
4. Character Classes
Character classes ([]) let you match one character from a set of options. They simplify patterns by grouping related characters.
### 4.1 Custom Character Classes
- `[abc]`: Matches a, b, or c.
- `[a-z]`: Matches any lowercase letter (range).
- `[A-Z0-9]`: Matches uppercase letters or digits.
- `[^abc]`: Matches any character except a, b, or c (negation).
Example:
text = "The quick brown fox jumps over 123 lazy dogs."
vowels = re.findall(r"[aeiouAEIOU]", text) # Match all vowels
print(vowels) # Output: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'a', 'o']
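Ranges and negation can be sketched the same way. The sample sentence below is invented for the example.

```python
import re

text = "Order A7 shipped; order b9 pending."  # hypothetical sample text

# One uppercase letter immediately followed by one digit
upper_digit_codes = re.findall(r"[A-Z][0-9]", text)

# Negated class: anything that is neither a letter nor a space
non_letters = re.findall(r"[^A-Za-z ]", text)

print(upper_digit_codes)  # ['A7']
print(non_letters)        # ['7', ';', '9', '.']
```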
### 4.2 Predefined Character Classes
For common patterns, use these shortcuts (note: use **raw strings** r"" to avoid Python escaping backslashes). For str patterns, \d, \w, and \s also match Unicode digits, letters, and whitespace by default; the equivalent classes below describe ASCII-only matching (the re.ASCII flag):
| Shorthand | Description | Equivalent Class (ASCII) |
|---|---|---|
| `\d` | Any digit (0-9). | `[0-9]` |
| `\D` | Any non-digit. | `[^0-9]` |
| `\w` | Any “word character” (letters, digits, _). | `[a-zA-Z0-9_]` |
| `\W` | Any non-word character. | `[^a-zA-Z0-9_]` |
| `\s` | Any whitespace (space, tab `\t`, newline `\n`, etc.). | `[ \t\n\r\f\v]` |
| `\S` | Any non-whitespace. | `[^ \t\n\r\f\v]` |
Example:
text = "User_123 logged in at 9:30 AM."
# Match every run of word characters (letters, digits, underscores)
word_runs = re.findall(r"\w+", text)
print(word_runs) # Output: ['User_123', 'logged', 'in', 'at', '9', '30', 'AM']
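The other shorthands behave similarly. Here is a brief sketch on the same sentence, with variable names chosen for illustration.

```python
import re

text = "User_123 logged in at 9:30 AM."

digits = re.findall(r"\d+", text)           # runs of digits
tokens = re.split(r"\s+", text)             # split on runs of whitespace
punctuation = re.findall(r"[^\w\s]", text)  # neither word character nor whitespace

print(digits)       # ['123', '9', '30']
print(tokens)       # ['User_123', 'logged', 'in', 'at', '9:30', 'AM.']
print(punctuation)  # [':', '.']
```

Note that the underscore counts as a word character, so it does not appear in the punctuation list.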
5. Anchors: Matching Positions, Not Characters
Anchors don’t match characters; they match positions in the string (e.g., start/end of the string, word boundaries).
| Anchor | Description | Example Pattern | Matches |
|---|---|---|---|
| `^` | Start of the string (outside []). | `^Hello` | "Hello world" (starts with “Hello”) |
| `$` | End of the string. | `world$` | "Hello world" (ends with “world”) |
| `\b` | Word boundary (between a \w and a \W character, or at the start/end of the string). | `\bcat\b` | "cat" (but not "category" or "scat") |
| `\B` | Non-word boundary (opposite of \b). | `\Bcat\B` | the "cat" inside "concatenate" (but not standalone "cat" or "scat") |
Example:
text = "cat catapult scat"
# Match "cat" only as a standalone word
standalone_cat = re.findall(r"\bcat\b", text)
print(standalone_cat) # Output: ['cat'] (only the first "cat")
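For contrast, the following sketch shows \B together with ^ and $. The sample string is an assumption, and re.finditer() (a standard re function that yields Match objects) is used to report positions.

```python
import re

text = "cat concatenate scat"  # hypothetical sample text

# \Bcat\B only matches "cat" with word characters on both sides
inside_word = [m.span() for m in re.finditer(r"\Bcat\B", text)]

starts_with_cat = re.search(r"^cat", text) is not None
ends_with_cat = re.search(r"cat$", text) is not None

print(inside_word)      # [(7, 10)] -> the "cat" inside "concatenate"
print(starts_with_cat)  # True
print(ends_with_cat)    # True ("scat" ends in "cat")
```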
6. Groups and Capturing
Parentheses () let you group parts of a pattern and extract them later. This is called “capturing.”
### 6.1 Grouping with ()
Use () to split a pattern into logical groups. The Match object’s group(n) method returns the n-th group (0 = entire match, 1 = first group, etc.).
Example: Extracting a phone number
text = "Call me at (123) 456-7890 or (098) 765-4321."
pattern = r"\((\d{3})\) (\d{3})-(\d{4})" # 3 groups: area, central, line
match = re.search(pattern, text)
if match:
    print("Full match:", match.group(0))  # Output: (123) 456-7890
    print("Area code:", match.group(1))  # Output: 123
    print("Central office:", match.group(2))  # Output: 456
    print("Line number:", match.group(3))  # Output: 7890
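re.search() stops at the first phone number. As a short sketch of getting every number's groups at once, re.findall() returns one tuple per match when the pattern has several groups, and Match.groups() returns all groups of a single match.

```python
import re

text = "Call me at (123) 456-7890 or (098) 765-4321."
pattern = r"\((\d{3})\) (\d{3})-(\d{4})"

all_parts = re.findall(pattern, text)            # one tuple of groups per match
first_parts = re.search(pattern, text).groups()  # groups of the first match only

print(all_parts)    # [('123', '456', '7890'), ('098', '765', '4321')]
print(first_parts)  # ('123', '456', '7890')
```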
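### 6.2 Non-Capturing Groups
Use (?:...) to group without capturing. This keeps group numbering uncluttered and matters for re.findall(), which returns the captured groups rather than the full matches when the pattern contains capturing groups.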
Example:
text = "I like apples and oranges."
pattern = r"(?:appl|orang)es" # Match "apples" or "oranges" (no capture)
matches = re.findall(pattern, text)
print(matches) # Output: ['apples', 'oranges']
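The practical difference shows up most clearly with re.findall(). This minimal sketch runs the same alternation with and without capturing; the variable names are illustrative.

```python
import re

text = "I like apples and oranges."

# With a capturing group, findall returns only the group contents
capturing = re.findall(r"(appl|orang)es", text)

# With a non-capturing group, findall returns the full matches
non_capturing = re.findall(r"(?:appl|orang)es", text)

print(capturing)      # ['appl', 'orang']
print(non_capturing)  # ['apples', 'oranges']
```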
### 6.3 Backreferences
Use \N, where N is a group number (e.g., \1, \2), to refer to a previously captured group. This is useful for matching repeated patterns.
Example: Matching repeated words
text = "Hello hello world world"
# Match repeated words (case-insensitive)
repeated_words = re.findall(r"(\b\w+\b) \1", text, flags=re.IGNORECASE)
print(repeated_words) # Output: ['Hello', 'world']
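Backreferences also work in the replacement string of re.sub(). As a sketch with an invented sentence, the snippet below collapses each doubled word into a single copy.

```python
import re

text = "This is is a test test sentence."  # hypothetical sample text

# \1 in the pattern requires the same word again; \1 in the replacement keeps one copy
deduped = re.sub(r"\b(\w+) \1\b", r"\1", text, flags=re.IGNORECASE)

print(deduped)  # This is a test sentence.
```

The leading \b stops the pattern from starting in the middle of a word, so the "is" inside "This" is not treated as the first half of a repeat.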
7. Greedy vs. Non-Greedy Matching
By default, quantifiers are greedy: they match as much text as possible. Add ? after a quantifier to make it non-greedy (match as little as possible).
Example: Greedy vs. non-greedy HTML tags
html = "<div>First div</div><div>Second div</div>"
# Greedy: matches from first <div> to last </div>
greedy_match = re.search(r"<div>.*</div>", html).group()
print(greedy_match) # Output: <div>First div</div><div>Second div</div>
# Non-greedy: matches from first <div> to first </div>
non_greedy_match = re.search(r"<div>.*?</div>", html).group()
print(non_greedy_match) # Output: <div>First div</div>
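The same contrast appears when collecting every match. The sketch below captures the text between the tags with re.findall(); it is an illustration of quantifier behavior only, not a general way to parse HTML.

```python
import re

html = "<div>First div</div><div>Second div</div>"

# Non-greedy: each <div>...</div> pair is matched separately
lazy_contents = re.findall(r"<div>(.*?)</div>", html)

# Greedy: a single match runs from the first <div> to the last </div>
greedy_contents = re.findall(r"<div>(.*)</div>", html)

print(lazy_contents)    # ['First div', 'Second div']
print(greedy_contents)  # ['First div</div><div>Second div']
```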
8. Practical Examples
### 8.1 Validating Emails
A simplified regex for email validation (real-world email regex is more complex, but this works for basic cases):
email_pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$"
def is_valid_email(email):
    return re.match(email_pattern, email) is not None
print(is_valid_email("user@example.com")) # Output: True
print(is_valid_email("user@.com")) # Output: False (no domain name)
### 8.2 Extracting Phone Numbers
Extract all US phone numbers from a text:
text = "Reach me at 123-456-7890 or (987) 654-3210."
phone_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
phones = re.findall(phone_pattern, text)
print(phones) # Output: ['123-456-7890', '(987) 654-3210']
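If the extracted numbers should share one format, capturing groups can be added to the same pattern. This is a sketch under the assumption that a dash-separated form is wanted; the group layout and the `normalized` name are not part of the original example.

```python
import re

text = "Reach me at 123-456-7890 or (987) 654-3210."

# Same pattern as above, with the three digit runs captured
phone_pattern = re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})")

# Join the captured parts of each match with dashes
normalized = ["-".join(m.groups()) for m in phone_pattern.finditer(text)]

print(normalized)  # ['123-456-7890', '987-654-3210']
```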
### 8.3 Replacing Text with re.sub()
Use re.sub(pattern, repl, text) to replace matches. repl can be a string or a function.
Example: Censoring bad words
text = "This is a badword example."
censored = re.sub(r"badword", "***", text)
print(censored) # Output: This is a *** example.
Example: Using a function to modify matches
def square_numbers(match):
    num = int(match.group())
    return str(num ** 2) # Replace number with its square
text = "Numbers: 2, 3, 4"
result = re.sub(r"\d+", square_numbers, text)
print(result) # Output: Numbers: 4, 9, 16
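Two further re.sub() features are worth a quick sketch: group references such as \1 inside the replacement string, and the count argument that limits how many matches are replaced. The date strings below are invented for the example.

```python
import re

text = "2024-01-15 and 2023-12-31"  # hypothetical sample text

# Reorder captured groups: YYYY-MM-DD -> DD/MM/YYYY
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# count=1 replaces only the first match
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)

print(swapped)     # 15/01/2024 and 31/12/2023
print(first_only)  # YYYY-01-15 and 2023-12-31
```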
9. Common Pitfalls
- Forgetting raw strings: Use r"" for regex patterns to avoid Python escaping backslashes (e.g., r"\d" instead of "\d"; note that "\b" without the r prefix is a backspace character, not a word boundary).
- Confusing re.match() and re.search(): re.match() only checks the start of the string; use re.search() to find matches anywhere.
- Overusing greedy quantifiers: Greedy .* can overmatch (e.g., in HTML). Use non-greedy .*? instead.
- Neglecting edge cases: Test with empty strings, special characters, or unexpected inputs.
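The first two pitfalls are easy to reproduce. A minimal sketch, with a made-up sample string:

```python
import re

text = "say hello"  # hypothetical sample text

# re.match() only looks at the start of the string, so this finds nothing
match_result = re.match(r"hello", text)
# re.search() scans the whole string
search_result = re.search(r"hello", text)

# Raw string: \b reaches the regex engine as a word boundary
raw_hits = re.findall(r"\bhello\b", text)
# Plain string: Python turns "\b" into a backspace character first
plain_hits = re.findall("\bhello\b", text)

print(match_result)           # None
print(search_result.group())  # hello
print(raw_hits)               # ['hello']
print(plain_hits)             # []
```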
10. Best Practices
- Test with tools: Use regex101.com to debug patterns interactively.
- Precompile patterns: For repeated use, compile with re.compile() so the pattern is parsed once and reused:
    pattern = re.compile(r"\b\d{3}\b") # Precompile
    text = "123 45 6789"
    print(pattern.findall(text)) # Output: ['123']
- Comment complex patterns: Add comments to explain parts of long regex (use the re.VERBOSE flag for readability).
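As a sketch of the last point, here is a phone-number pattern written with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The layout and names are one possible arrangement, not a prescribed style.

```python
import re

phone = re.compile(
    r"""
    \(?(\d{3})\)?   # area code, with optional parentheses
    [-.\s]?         # optional separator
    (\d{3})         # central office code
    [-.\s]?         # optional separator
    (\d{4})         # line number
    """,
    re.VERBOSE,
)

match = phone.search("Call (987) 654-3210 today")  # hypothetical input
parts = match.groups() if match else None

print(parts)  # ('987', '654', '3210')
```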
11. Conclusion
Regular expressions are a powerful tool in Python, and mastering them opens up endless possibilities for text manipulation. Start with simple patterns, experiment with the re module, and gradually tackle more complex problems. With practice, regex will become second nature!
12. References
- Python re Module Documentation
- Regex101: Online Regex Tester
- Regular-Expressions.info (Comprehensive regex guide)