Table of Contents
- What is Pandas?
- Installing Pandas
- Core Data Structures: Series and DataFrame
- Creating DataFrames
- Data Inspection and Basic Operations
- Data Cleaning: Preparing Your Data
- Data Manipulation: Filtering, Sorting, and Grouping
- Advanced Operations
- Visualization with Pandas
- Real-World Use Cases
- Conclusion
- References
1. What is Pandas?
Pandas is an open-source Python library built on top of NumPy, designed for fast, flexible, and expressive data manipulation. It excels at handling structured data (e.g., tables, spreadsheets) and provides two primary data structures:
- Series: A 1-dimensional labeled array (like a column in a spreadsheet).
- DataFrame: A 2-dimensional labeled tabular structure (like a spreadsheet or SQL table), with rows and columns.
Pandas simplifies tasks like data loading, cleaning, transformation, aggregation, and visualization, enabling users to focus on analysis rather than low-level data wrangling.
2. Installing Pandas
Before diving in, install Pandas. If you use Python’s package manager pip, run:
pip install pandas
If you use Anaconda (a popular Python distribution for data science), Pandas is pre-installed. To update it:
conda update pandas
For Excel file support, install openpyxl (required for reading/writing .xlsx files):
pip install openpyxl
3. Core Data Structures: Series and DataFrame
3.1 Series: 1D Labeled Arrays
A Series is a 1D array in which every value has an associated label; together, the labels form the index. Think of it as a single column of data.
Example: Creating a Series
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data, name="Numbers")
print(s)
# Output:
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# Name: Numbers, dtype: int64
- The left column is the index (default: 0, 1, 2…).
- The right column is the data values.
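To make the role of the index concrete, here is a minimal sketch that replaces the default 0, 1, 2… labels with custom string labels. The labels, values, and the name "Scores" are made up for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
scores = pd.Series([10, 20, 30], index=["a", "b", "c"], name="Scores")

# Label-based access goes through the index; iloc uses integer positions
by_label = scores["b"]
by_position = scores.iloc[0]

print(by_label, by_position)
```

With custom labels, `scores["b"]` looks the value up by label, while `scores.iloc[0]` always means "the first element", whatever its label is.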
3.2 DataFrame: 2D Tabular Data
A DataFrame is a 2D structure with rows (observations) and columns (variables). It’s the most commonly used Pandas object.
Example: Creating a DataFrame
# Create a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "London", "Paris", "Sydney"]
}
df = pd.DataFrame(data)
print(df)
# Output:
#       Name  Age      City
# 0    Alice   25  New York
# 1      Bob   30    London
# 2  Charlie   35     Paris
# 3    Diana   28    Sydney
Here, each key in the dictionary becomes a column, and each value (a list) becomes the column’s data.
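A quick way to confirm how the dictionary was mapped is to look at a few attributes of the result. The sketch below rebuilds the same DataFrame and checks its shape, its column labels, and the type of a single column.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "London", "Paris", "Sydney"],
})

n_rows, n_cols = df.shape          # (number of rows, number of columns)
column_names = list(df.columns)    # column labels, in dictionary order
age_column = df["Age"]             # a single column is a Series

print(n_rows, n_cols, column_names, type(age_column).__name__)
```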
4. Creating DataFrames
Pandas supports creating DataFrames from various sources. Let’s explore the most common methods.
4.1 From Lists/Dictionaries
As shown earlier, dictionaries (keys = columns, values = data) are a simple way to create DataFrames. You can also use lists of lists with column names:
data = [
    ["Alice", 25, "New York"],
    ["Bob", 30, "London"],
    ["Charlie", 35, "Paris"]
]
columns = ["Name", "Age", "City"]
df = pd.DataFrame(data, columns=columns)
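Another common shape for in-memory data is a list of dictionaries, one per row (for example, records parsed from JSON). The sketch below assumes two hypothetical records, one of which deliberately lacks a "City" key, to show how Pandas fills the gap.

```python
import pandas as pd

# Hypothetical records; the second one omits "City" on purpose
records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30},
]

df_records = pd.DataFrame(records)

# Keys missing from a record show up as missing values in that row
missing_city = df_records["City"].isna().tolist()
print(df_records)
```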
4.2 From External Files (CSV, Excel)
Pandas shines at loading data from external files.
Reading a CSV File
Use pd.read_csv():
# Load data from a CSV file
df = pd.read_csv("sales_data.csv") # Replace with your file path
Reading an Excel File
Use pd.read_excel() (requires openpyxl):
# Load data from an Excel file (specify sheet name if needed)
df = pd.read_excel("customer_data.xlsx", sheet_name="Sheet1")
Other supported formats: JSON (pd.read_json()), SQL (pd.read_sql()), and more!
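pd.read_csv() accepts many optional parameters. As a rough sketch, the example below uses two of them, usecols and parse_dates, and reads from an in-memory string so it runs without a file on disk. The column names and values are invented stand-ins for something like sales_data.csv.

```python
import io

import pandas as pd

# Inline CSV text standing in for a real file (hypothetical columns and values)
csv_text = "Date,Product,Units\n2024-01-05,Widget,3\n2024-01-06,Gadget,5\n"

# usecols limits which columns are loaded; parse_dates converts "Date" to datetime
sales = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Date", "Units"],
    parse_dates=["Date"],
)

print(sales.dtypes)
```

With a real file, you would pass the path instead of the io.StringIO object and keep the other arguments the same.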
5. Data Inspection and Basic Operations
Once you’ve loaded data, the first step is to inspect it. Pandas provides tools to preview and summarize data.
5.1 Previewing Data
- df.head(n): Show the first n rows (default: 5).
- df.tail(n): Show the last n rows (default: 5).
- df.sample(n): Show a random sample of n rows.
print(df.head(3)) # First 3 rows
print(df.tail(2)) # Last 2 rows
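Here is a small self-contained sketch of all three preview methods on a made-up ten-row DataFrame. Passing random_state to sample() is optional; it is used here only so the "random" rows are the same on every run.

```python
import pandas as pd

# Hypothetical data: a single column holding 0..9
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)
last_two = df.tail(2)

# random_state makes the random sample repeatable between runs
random_three = df.sample(n=3, random_state=0)

print(first_three, last_two, random_three, sep="\n\n")
```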
5.2 Summary Statistics
- df.info(): Show data types, non-null counts, and memory usage.
- df.describe(): Generate summary statistics (count, mean, std, min/max, quartiles) for numeric columns.
df.info()
# Output (example):
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   Name    4 non-null      object
#  1   Age     4 non-null      int64
#  2   City    4 non-null      object
# dtypes: int64(1), object(2)
# memory usage: 224.0+ bytes
print(df.describe()) # Only for numeric columns (e.g., Age)
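The result of describe() is itself a DataFrame, so individual statistics can be pulled out with loc. The sketch below reuses the sample names and ages from earlier; the include="all" call is an optional extra that also summarizes non-numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
})

summary = df.describe()                      # numeric columns only by default
mean_age = summary.loc["mean", "Age"]
median_age = summary.loc["50%", "Age"]       # the 50% quantile is the median

full_summary = df.describe(include="all")    # also covers non-numeric columns

print(summary)
```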
5.3 Selecting Columns and Rows
Select Columns
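- Single column: df["Column_Name"] or df.Column_Name (attribute access works only when the name is a valid Python identifier, e.g., no spaces, and does not clash with an existing DataFrame attribute or method).
- Multiple columns: df[["Column1", "Column2"]].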
ages = df["Age"] # Series of ages
name_and_city = df[["Name", "City"]] # DataFrame with two columns
Select Rows
- Label-based selection: df.loc[row_labels, column_labels].
- Position-based selection: df.iloc[row_positions, column_positions].
# Select row with index label 1 (Bob's row)
bob_row = df.loc[1]
# Select rows labeled 0 through 2 (loc slices include the end label) and columns "Name" and "Age"
subset = df.loc[0:2, ["Name", "Age"]]
# Select the first 2 rows and first 2 columns (position-based)
subset_iloc = df.iloc[:2, :2]
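The difference between loc and iloc is easiest to see when the index is not the default 0, 1, 2… The sketch below assumes a hypothetical index of 10, 20, 30, 40 to make labels and positions visibly different.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Diana"], "Age": [25, 30, 35, 28]},
    index=[10, 20, 30, 40],  # hypothetical non-default labels
)

# loc slices by label and includes the end label: rows 10, 20 and 30
by_label = df.loc[10:30, "Name"]

# iloc slices by position and excludes the end position: rows 0 and 1
by_position = df.iloc[0:2, 0]

print(by_label.tolist(), by_position.tolist())
```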
6. Data Cleaning: Preparing Your Data
Real-world data is often messy (missing values, duplicates, incorrect types). Pandas simplifies cleaning.
6.1 Handling Missing Values
- df.isnull().sum(): Count missing values per column.
- df.dropna(): Drop rows with missing values (use axis=1 to drop columns instead).
- df.fillna(value): Replace missing values with a specific value (e.g., mean, median).
# Check for missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing "Age" values with the mean age
df["Age"] = df["Age"].fillna(df["Age"].mean())  # fillna() returns a new Series, so assign it back
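The following sketch puts the three methods together on a small made-up DataFrame with gaps in two columns. It also shows two optional variations: dropna(subset=...) to consider only certain columns, and fillna() with a dictionary to use a different replacement per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25.0, np.nan, 35.0],
    "City": ["New York", None, "Paris"],
})

missing_per_column = df.isnull().sum()

# Drop a row only when "Age" is missing, ignoring gaps in other columns
rows_with_age = df.dropna(subset=["Age"])

# Fill each column with its own replacement; the original df is left unchanged
filled = df.fillna({"Age": df["Age"].mean(), "City": "Unknown"})

print(missing_per_column, filled, sep="\n\n")
```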
6.2 Removing Duplicates
- df.duplicated(): Identify duplicate rows.
- df.drop_duplicates(): Remove duplicates (keep the first or last occurrence with keep="first"/"last").
# Check for duplicates
print(df.duplicated().sum())
# Remove duplicates
df_unique = df.drop_duplicates()
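By default, a row counts as a duplicate only if every column matches an earlier row. The sketch below uses made-up data to contrast that with the subset parameter, which compares only the listed columns.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 repeats only the name
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "City": ["New York", "London", "New York", "Paris"],
})

exact_duplicates = df.duplicated().sum()   # rows identical to an earlier row
unique_rows = df.drop_duplicates()         # keeps the first copy of each row

# Compare on "Name" only and keep the last occurrence of each name
last_per_name = df.drop_duplicates(subset=["Name"], keep="last")

print(exact_duplicates, unique_rows, last_per_name, sep="\n\n")
```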
6.3 Correcting Data Types
Use df.astype() to convert column data types (e.g., integer to float, or numeric strings to numbers). For dates, pd.to_datetime() is usually more robust than astype().
# Convert "Age" from int to float
df["Age"] = df["Age"].astype(float)
# Convert a "Date" column to datetime (common for time-series data; assumes df has such a column)
df["Date"] = pd.to_datetime(df["Date"]) # More robust than astype() for dates
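Messy columns often contain entries that cannot be converted at all, in which case astype() raises an error. One possible approach, sketched below on invented data, is pd.to_numeric() with errors="coerce", which turns unparseable entries into NaN so they can be handled as missing values.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a text file
raw = pd.DataFrame({
    "Age": ["25", "30", "unknown"],
    "Date": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

# errors="coerce" converts unparseable entries to NaN instead of raising
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

# ISO-formatted date strings are parsed into datetime values
raw["Date"] = pd.to_datetime(raw["Date"])

print(raw.dtypes)
```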
7. Data Manipulation: Filtering, Sorting, and Grouping
7.1 Filtering Data
Use boolean indexing to filter rows based on conditions.
# Filter rows where Age > 28
adults = df[df["Age"] > 28]
# Filter rows where City is "London" AND Age > 25
london_adults = df[(df["City"] == "London") & (df["Age"] > 25)]
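Boolean conditions can also be combined with | (OR) and ~ (NOT), and isin() checks membership in a list of values. The sketch below runs these on the sample data from earlier; the query() call at the end is an alternative spelling of a simple filter.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "London", "Paris", "Sydney"],
})

# Membership test: City is one of the listed values
in_europe = df[df["City"].isin(["London", "Paris"])]

# OR: each condition needs its own parentheses
young_or_paris = df[(df["Age"] < 28) | (df["City"] == "Paris")]

# NOT: invert a condition with ~
not_london = df[~(df["City"] == "London")]

# query() expresses the same kind of filter as a string
over_28 = df.query("Age > 28")

print(in_europe, young_or_paris, not_london, over_28, sep="\n\n")
```

The parentheses around each condition matter because & and | bind more tightly than comparison operators in Python.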
7.2 Sorting Data
Use df.sort_values(by=column, ascending=True/False) to sort by one or more columns.
# Sort by Age in ascending order (youngest to oldest)
df_sorted = df.sort_values(by="Age", ascending=True)
# Sort by City (A-Z) then Age (descending)
df_sorted_multi = df.sort_values(by=["City", "Age"], ascending=[True, False])
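One detail worth knowing: sort_values() returns a new DataFrame whose rows keep their original index labels. The sketch below shows this on the sample data, along with two optional follow-ups, reset_index(drop=True) to renumber the rows and nlargest() as a shortcut for "top N by a column".

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
})

df_sorted = df.sort_values(by="Age", ascending=False)
original_labels = df_sorted.index.tolist()     # labels travel with their rows

# Renumber the rows 0..n-1 and discard the old labels
renumbered = df_sorted.reset_index(drop=True)

# Shortcut for the two rows with the largest Age
oldest_two = df.nlargest(2, "Age")

print(df_sorted, renumbered, oldest_two, sep="\n\n")
```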
7.3 Grouping and Aggregation
Use df.groupby(column) to group data by a categorical column, then apply aggregations like sum(), mean(), or count().
# Group by City and calculate average Age per city
city_avg_age = df.groupby("City")["Age"].mean().reset_index()
# mean() returns a Series indexed by City; reset_index() turns it into a DataFrame with "City" and "Age" columns
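When one statistic per group is not enough, agg() can compute several at once. The sketch below uses invented ages for two cities so that each group has more than one row.

```python
import pandas as pd

# Hypothetical data with several people per city
df = pd.DataFrame({
    "City": ["London", "London", "Paris", "Paris", "Paris"],
    "Age": [30, 40, 35, 25, 45],
})

# One row per city, one column per aggregation
stats = (
    df.groupby("City")["Age"]
    .agg(["mean", "min", "max", "count"])
    .reset_index()
)

print(stats)
```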
7.4 Merging and Joining DataFrames
Like SQL joins, merge DataFrames using pd.merge().
# Sample DataFrames
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "City": ["London", "Paris", "Sydney"]})
# Inner join (only matching IDs: 2, 3)
merged_inner = pd.merge(df1, df2, on="ID", how="inner")
# Left join (all rows from df1, matching rows from df2)
merged_left = pd.merge(df1, df2, on="ID", how="left")
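To see what each join type actually keeps, the sketch below runs the inner and left joins from above on the same two sample DataFrames and adds an outer join. The indicator=True argument is optional; it adds a "_merge" column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "City": ["London", "Paris", "Sydney"]})

merged_inner = pd.merge(df1, df2, on="ID", how="inner")   # IDs present in both
merged_left = pd.merge(df1, df2, on="ID", how="left")     # every ID from df1

# Outer join keeps every ID from either side; "_merge" records the source
merged_outer = pd.merge(df1, df2, on="ID", how="outer", indicator=True)

print(merged_inner, merged_left, merged_outer, sep="\n\n")
```

In the left join, ID 1 has no match in df2, so its City is a missing value; in the outer join, the same happens to the Name of ID 4.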
8. Advanced Operations
8.1 Pivot Tables
Pivot tables summarize data by grouping rows and columns, with aggregated values. Use pd.pivot_table().
# Pivot table: average Age by City and Name (rows=City, columns=Name, values=Age)
pivot = pd.pivot_table(
    df,
    values="Age",
    index="City",
    columns="Name",
    aggfunc="mean"  # Aggregation function
)
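Pivot tables are most informative when the row and column categories repeat. The sketch below assumes a hypothetical sales table (the City, Product, and Units columns are invented) and sums units per city and product; fill_value=0 is an optional argument that replaces missing combinations with 0.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "City": ["London", "London", "Paris", "Paris"],
    "Product": ["Widget", "Gadget", "Widget", "Widget"],
    "Units": [3, 5, 2, 4],
})

pivot = pd.pivot_table(
    sales,
    values="Units",
    index="City",        # one row per city
    columns="Product",   # one column per product
    aggfunc="sum",
    fill_value=0,        # show 0 where a city/product combination has no rows
)

print(pivot)
```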
8.2 Applying Custom Functions with apply()
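Use apply() to run a custom function on each value of a Series (Series.apply(func)) or on each column or row of a DataFrame (df.apply(func); pass axis=1 to work row by row).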
# Example: Add a "Double Age" column by doubling each Age value
df["Double Age"] = df["Age"].apply(lambda x: x * 2) # Lambda for simple functions
# Custom function: Categorize age groups
def categorize_age(age):
    if age < 30:
        return "Young"
    else:
        return "Adult"

df["Age Group"] = df["Age"].apply(categorize_age)
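For simple arithmetic and two-way conditions, vectorized operations are usually a faster alternative to apply(). The sketch below reproduces the two columns above without apply(), using NumPy's np.where for the condition, and then shows a row-wise df.apply() with axis=1. The "Age Plus Double" column is a made-up example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35, 28]})

# Vectorized arithmetic: operates on the whole column at once
df["Double Age"] = df["Age"] * 2

# Vectorized condition: "Young" where Age < 30, otherwise "Adult"
df["Age Group"] = np.where(df["Age"] < 30, "Young", "Adult")

# Row-wise apply: the function receives one row at a time
df["Age Plus Double"] = df.apply(
    lambda row: row["Age"] + row["Double Age"], axis=1
)

print(df)
```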
9. Visualization with Pandas
Pandas integrates with Matplotlib to create quick visualizations directly from DataFrames using df.plot().
import matplotlib.pyplot as plt # Required for displaying plots
# Line plot of Age over time (if "Date" column exists)
df.plot(x="Date", y="Age", kind="line", title="Age Over Time")
plt.show()
# Bar plot of average Age by City
city_avg_age.plot(x="City", y="Age", kind="bar", color="skyblue")
plt.title("Average Age by City")
plt.show()
# Histogram of Age distribution
df["Age"].plot(kind="hist", bins=5, edgecolor="black")
plt.title("Age Distribution")
plt.show()
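df.plot() returns a Matplotlib Axes object, which can be customized and saved to a file instead of being shown on screen. The sketch below makes a few assumptions: it selects the non-interactive "Agg" backend so it runs without a display, uses invented averages for two cities, and writes to a hypothetical file name.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available, so render off-screen

import pandas as pd

# Hypothetical averages, standing in for the city_avg_age result above
city_avg_age = pd.DataFrame({"City": ["London", "Paris"], "Age": [30.0, 35.0]})

# The returned Axes object allows further customization
ax = city_avg_age.plot(x="City", y="Age", kind="bar", color="skyblue", legend=False)
ax.set_title("Average Age by City")
ax.set_ylabel("Average age")

# Save the figure to disk (hypothetical output file name)
ax.figure.tight_layout()
ax.figure.savefig("age_by_city.png")
```

In an interactive session or notebook, you would normally skip the backend line and call plt.show() as in the examples above.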
10. Real-World Use Cases
Pandas is used across industries:
- E-commerce: Analyze sales trends, customer behavior, and inventory levels.
- Finance: Clean and analyze stock prices, portfolio returns, or transaction data.
- Healthcare: Process patient data, clinical trials, or medical records.
- Marketing: Segment customers, track campaign performance, and measure ROI.
11. Conclusion
Pandas is a cornerstone of Python’s data science ecosystem, empowering users to transform raw data into actionable insights. From loading data to cleaning, manipulating, and visualizing it, Pandas simplifies every step of the analysis workflow.
To master Pandas, practice with real datasets (e.g., Kaggle, government open data) and explore advanced features like time-series analysis, window functions, and custom aggregations.
12. References
- Pandas Official Documentation
- McKinney, W. (2018). Python for Data Analysis (2nd ed.). O’Reilly Media.
- Kaggle Pandas Tutorials
- Real Python Pandas Guide