Table of Contents
- Prerequisites
- What is Time Series Data?
- Loading and Exploring Time Series Data
- Data Preprocessing for Time Series
- Exploratory Data Analysis (EDA)
- Time Series Forecasting Models
- Advanced Topics (Optional)
- Conclusion
- References
Prerequisites
Before diving in, ensure you have the following:
- Basic knowledge of Python (e.g., loops, functions, pandas).
- Familiarity with data visualization (matplotlib/seaborn).
- Python libraries installed:
pip install pandas numpy matplotlib seaborn statsmodels scikit-learn pmdarima
- Optional, only for the advanced topics (LSTM and Prophet):
pip install tensorflow prophet
What is Time Series Data?
Time series data is a sequence of observations recorded at regular time intervals (e.g., hourly, daily, monthly). Key characteristics include:
- Trend: Long-term upward/downward movement (e.g., increasing global temperatures).
- Seasonality: Repeating patterns at fixed intervals (e.g., holiday sales spikes).
- Cyclicity: Rises and falls that recur without a fixed period, often over several years (e.g., economic recessions every 5–10 years).
- Noise: Random, unpredictable variation.
Example datasets: Stock prices (minute-level), monthly rainfall, annual GDP, or hourly website traffic.
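To make these components concrete, here is a minimal sketch that builds a synthetic monthly series from a trend, a 12-month seasonal pattern, and noise. The date range, slope, amplitude, and seed are all made-up values chosen for illustration, and cyclicity is left out to keep the example short.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Ten years of month-start timestamps (a hypothetical range)
index = pd.date_range("2010-01-01", periods=120, freq="MS")
t = np.arange(len(index))

trend = 100 + 0.5 * t                                     # steady long-term growth
seasonality = 10 * np.sin(2 * np.pi * t / 12)             # pattern repeating every 12 months
noise = rng.normal(loc=0.0, scale=2.0, size=len(index))   # random variation

series = pd.Series(trend + seasonality + noise, index=index, name="value")
print(series.head())
```

Plotting `series` should show the same ingredients you will see in the Air Passengers data: a rising level with a repeating yearly wave on top.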
Loading and Exploring Time Series Data
We’ll use the classic Air Passengers dataset (monthly airline passenger totals, 1949–1960) for this tutorial. It can be fetched through statsmodels (its get_rdataset helper downloads it from the Rdatasets collection) or downloaded from Kaggle.
Step 1: Load the Data
First, load the data with pandas and ensure the timestamp column is parsed correctly:
import numpy as np # used later for NaN, log, and sqrt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset (replace with your file path if needed)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month') # Parse 'Month' as datetime and set as index
# Inspect the first 5 rows
print(df.head())
Output:
Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
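If you are offline, the same parsing behavior can be tried on an inline copy of the first few rows. This is only a sketch: the CSV text below is a stand-in assembled from the output above, not the full file.

```python
import io
import pandas as pd

# Inline stand-in for the CSV file (values copied from the output above)
csv_text = """Month,Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
1949-05,121
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Month"], index_col="Month")

print(sample.index.dtype)      # a datetime64 dtype rather than plain strings
print(sample.loc["1949-03"])   # partial-string indexing selects the rows in March 1949
```

Because the index holds real timestamps, date-based slicing and resampling work directly on it.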
Step 2: Initial Exploration
Check the dataset’s structure, summary statistics, and time range:
# Dataset info
df.info() # info() prints directly and returns None, so no print() is needed
# Summary statistics
print(df.describe())
# Time range
print(f"Time range: {df.index.min()} to {df.index.max()}")
Key Observations:
- The dataset has 144 monthly observations (12 years).
- The Passengers column is numeric (int64).
- The time index is correctly formatted as datetime64[ns].
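A correctly typed index does not guarantee that every period is present. The sketch below shows one way to check for gaps using a small made-up series with March deliberately left out; the dates and values are hypothetical.

```python
import pandas as pd

# Hypothetical monthly index with March deliberately left out
idx = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01", "2020-05-01"])
s = pd.Series([10, 12, 15, 16], index=idx)

# Build the full expected range and see which timestamps are absent
expected = pd.date_range(s.index.min(), s.index.max(), freq="MS")
missing = expected.difference(s.index)
print(missing)  # the timestamps that are missing from the index

# Reindex onto the regular grid; absent months become NaN rows
regular = s.asfreq("MS")
print(regular)
```

Running the same check on the Air Passengers index should report no missing months.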
Step 3: Plot the Time Series
Visualization is critical to identify trends/seasonality:
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x=df.index, y='Passengers')
plt.title('Monthly Airline Passengers (1949–1960)')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.grid(True)
plt.show()
Expected Plot: A clear upward trend (more passengers over time) and seasonal spikes (likely summer travel).
Data Preprocessing for Time Series
Classical models such as ARIMA work best on clean, stationary data. Let’s handle missing values and check stationarity.
Step 1: Handle Missing Values
Time series data may have gaps (e.g., missing months). Use pandas to detect and impute:
# Check for missing values (Air Passengers has none, so we simulate one below)
print(df.isnull().sum())
# Simulate a missing value on a float copy so the real data stays intact
demo = df['Passengers'].astype(float).copy()
demo.loc['1950-06-01'] = np.nan
# Impute missing values (forward fill, backward fill, or interpolation)
demo = demo.interpolate(method='time') # Time-based interpolation
print(demo.loc['1950-06-01']) # Imputed value for the simulated gap
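The imputation options named above can give noticeably different answers. Here is a small sketch on a made-up five-month series with one gap, comparing forward fill, backward fill, linear interpolation, and time-based interpolation.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=5, freq="MS")
s = pd.Series([100.0, 110.0, np.nan, 130.0, 140.0], index=idx)

filled = pd.DataFrame({
    "ffill": s.ffill(),                        # carry the last known value forward
    "bfill": s.bfill(),                        # pull the next known value backward
    "linear": s.interpolate(method="linear"),  # straight line by position
    "time": s.interpolate(method="time"),      # straight line weighted by elapsed time
})
print(filled.loc[pd.Timestamp("2021-03-01")])
```

Linear and time-based interpolation differ slightly here because February is shorter than March, so the gap is not exactly halfway between its neighbors in calendar time.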
Step 2: Check Stationarity
Many classical models, ARIMA in particular, assume stationarity: the mean, variance, and autocorrelation structure do not change over time. The Air Passengers data has a clear trend and growing seasonal swings, so it’s non-stationary.
Test Stationarity with Augmented Dickey-Fuller (ADF) Test:
The ADF test’s null hypothesis is that the series has a unit root (i.e., it is non-stationary). A p-value below 0.05 rejects the null at the 5% level, which is evidence of stationarity.
from statsmodels.tsa.stattools import adfuller
def adf_test(series):
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f' {key}: {value}')

adf_test(df['Passengers'])
Output for Air Passengers:
ADF Statistic: 0.8153688792060498
p-value: 0.991880243437641
Critical Values:
1%: -3.4816817173418295
5%: -2.8840418343195267
10%: -2.578770059171598
Interpretation: p-value = 0.99 > 0.05 → Fail to reject null hypothesis → Data is non-stationary.
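As an informal complement to the ADF test, you can look at how much a rolling mean wanders. The sketch below compares white noise with a trending series; both are synthetic, and the helper name rolling_mean_drift is made up for this example. It is a rough visual-style check, not a statistical test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 240

stationary = pd.Series(rng.normal(0, 1, n))                      # white noise: constant mean
trending = pd.Series(0.1 * np.arange(n) + rng.normal(0, 1, n))   # independent noise plus a linear trend

def rolling_mean_drift(series, window=24):
    """Range of the rolling mean; large when the level of the series wanders."""
    rolling_mean = series.rolling(window).mean().dropna()
    return rolling_mean.max() - rolling_mean.min()

print(rolling_mean_drift(stationary))
print(rolling_mean_drift(trending))
```

The trending series should show a far larger drift, which matches what the ADF test reports for the raw passenger counts.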
Step 3: Make Data Stationary
To stationarize, apply transformations:
- Log transformation: Stabilizes variance that grows with the level of the series (useful for multiplicative seasonality).
- Differencing: Subtract previous values to remove trends.
# Log transformation (optional but helpful for variance stabilization)
df['Log_Passengers'] = np.log(df['Passengers'])
# First-order differencing (remove trend)
df['Diff_Passengers'] = df['Log_Passengers'].diff() # First value is NaN (no previous observation)
# Plot differenced data
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x=df.index, y='Diff_Passengers')
plt.title('Differenced Log Passengers (Stationary?)')
plt.show()
# Re-run ADF test on differenced data
adf_test(df['Diff_Passengers'].dropna())
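ADF Test on Differenced Data: If the p-value is below 0.05, the differenced series can be treated as stationary. On this dataset a single difference of the log series tends to give a borderline result (a p-value slightly above 0.05 is common) because strong 12-month seasonality remains, so check the printed value; an additional seasonal difference, e.g. .diff(12), may help.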
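To see why the log transform and differencing work, here is a small worked sketch on made-up data: a series growing 5% per step, and a pure 12-step repeating pattern. Neither is the passenger data.

```python
import numpy as np
import pandas as pd

# Hypothetical series growing 5% per step (exponential trend)
values = pd.Series(100 * 1.05 ** np.arange(10))

raw_diff = values.diff().dropna()           # differences keep growing: 5.0, 5.25, ...
log_diff = np.log(values).diff().dropna()   # constant log(1.05) at every step

print(raw_diff.round(2).tolist())
print(log_diff.round(4).tolist())

# Seasonal differencing: subtract the value one full season (here 12 steps) earlier
monthly = pd.Series(np.tile(np.arange(12.0), 3))   # a pure repeating pattern, 3 "years"
seasonal_diff = monthly.diff(12).dropna()
print(seasonal_diff.unique())                      # the repeating pattern cancels out
```

Differencing the raw values leaves a growing series, while differencing the logs gives a constant. Seasonal differencing does the same job for a pattern that repeats every 12 steps.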
Exploratory Data Analysis (EDA)
Dig deeper into patterns with decomposition and autocorrelation plots.
Step 1: Decompose Time Series
Use statsmodels to split the series into trend, seasonality, and residual (noise):
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose (multiplicative model, since trend and seasonality grow together)
decomposition = seasonal_decompose(df['Passengers'], model='multiplicative', period=12) # 12 months = annual seasonality
# Plot components
fig = decomposition.plot()
plt.tight_layout()
plt.show()
Components:
- Trend: Upward slope.
- Seasonality: Regular spikes (summer months).
- Residual: Random noise (small if decomposition is good).
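For intuition about what a classical multiplicative decomposition does, here is a hand-rolled sketch on a synthetic series. It follows the textbook procedure (centered moving average, divide, average by position in the cycle) and is not a copy of the statsmodels implementation, whose details may differ.

```python
import numpy as np
import pandas as pd

# Synthetic multiplicative series: trend * seasonal factor (no noise), 6 "years"
period = 12
t = np.arange(6 * period)
true_seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / period)
series = pd.Series((100 + 2 * t) * true_seasonal)

# 1. Trend: centered moving average over one full period
#    (for an even period, average two adjacent 12-point windows to stay centered)
trend = series.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# 2. Detrend by division (multiplicative model)
detrended = series / trend

# 3. Seasonal factors: average the detrended values for each position in the cycle,
#    then rescale so the factors average to 1
position = t % period
seasonal_factors = detrended.groupby(position).mean()
seasonal_factors = seasonal_factors / seasonal_factors.mean()
seasonal = pd.Series(seasonal_factors.loc[position].to_numpy(), index=series.index)

# 4. Residual: whatever trend and seasonality do not explain
residual = series / (trend * seasonal)
print(residual.dropna().describe())
```

With no noise in the input, the residual stays very close to 1, which is what a good multiplicative decomposition looks like. The first and last six values are NaN because the centered window cannot be filled at the edges.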
Step 2: Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plots
ACF/PACF show correlation between a series and its lagged values (e.g., correlation between January 1950 and January 1949).
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# ACF plot (shows total correlation with lags)
plt.figure(figsize=(12, 4))
plot_acf(df['Diff_Passengers'].dropna(), lags=24, ax=plt.gca())
plt.title('ACF Plot (Differenced Data)')
plt.show()
# PACF plot (shows direct correlation with lags, removing intermediate lags)
plt.figure(figsize=(12, 4))
plot_pacf(df['Diff_Passengers'].dropna(), lags=24, ax=plt.gca())
plt.title('PACF Plot (Differenced Data)')
plt.show()
Key Insight: Significant lags in ACF/PACF help choose ARIMA parameters (see below).
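The quantity behind an ACF plot can be computed by hand. The sketch below uses the usual sample autocorrelation formula on a made-up series with a strict 12-step cycle; sample_acf is a helper written for this example, and plotting libraries may differ in details such as confidence bands.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

# Hypothetical series with a strict 12-step cycle
t = np.arange(120)
seasonal_series = np.sin(2 * np.pi * t / 12)

acf = sample_acf(seasonal_series, max_lag=24)
print(acf[[0, 6, 12]].round(2))  # lag 0 is 1; lag 6 is strongly negative; lag 12 strongly positive
```

A strong positive spike at lag 12 is the signature of annual seasonality in monthly data, which is what to look for in the ACF plot of the passenger series.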
Time Series Forecasting Models
6.1 Simple Models: Moving Average & Exponential Smoothing
Start with simple models before advanced ones.
Moving Average (MA)
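Predicts the next value as the average of the past n observations (not to be confused with the MA(q) error term in ARIMA):
# 12-month moving average (since seasonality is annual)
# shift(1) so each forecast uses only observations before that month
df['MA_12'] = df['Passengers'].rolling(window=12).mean().shift(1)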
# Plot actual vs MA forecast
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x=df.index, y='Passengers', label='Actual')
sns.lineplot(data=df, x=df.index, y='MA_12', label='12-Month MA')
plt.title('Actual vs Moving Average Forecast')
plt.legend()
plt.show()
Exponential Smoothing (SES/Holt-Winters)
Exponential smoothing assigns more weight to recent observations. Use statsmodels for Holt-Winters (handles trend + seasonality):
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Train-test split: first 10 years for training, last 2 years for testing (about 83% / 17%)
train = df.loc[:'1958-12-01'] # 1949–1958
test = df.loc['1959-01-01':] # 1959–1960
# Fit Holt-Winters model (multiplicative for growing trend/seasonality)
model = ExponentialSmoothing(train['Passengers'], trend='mul', seasonal='mul', seasonal_periods=12)
hw_fit = model.fit()
# Forecast on test data
hw_forecast = hw_fit.forecast(steps=len(test))
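To make the idea of "more weight to recent observations" concrete, here is simple exponential smoothing written out by hand. This is only the level recursion (no trend or seasonality), with a made-up input and an arbitrary alpha of 0.5; Holt-Winters adds further recursions on top of it.

```python
def simple_exp_smoothing(values, alpha):
    """Return the smoothed level after each observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, starting from the first value.
    """
    level = values[0]
    levels = [level]
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels

observations = [10.0, 12.0, 11.0, 15.0]
levels = simple_exp_smoothing(observations, alpha=0.5)
print(levels)  # [10.0, 11.0, 11.0, 13.0]

# Implied weights on the most recent observation, the one before it, and so on
alpha = 0.5
weights = [alpha * (1 - alpha) ** k for k in range(4)]
print(weights)  # [0.5, 0.25, 0.125, 0.0625]
```

The one-step-ahead forecast is simply the last level (13.0 here), and the weights show how quickly older observations fade.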
6.2 ARIMA: AutoRegressive Integrated Moving Average
ARIMA is a powerful model combining:
- AR(p): Autoregression (uses past p values).
- I(d): Integration (differencing to make stationary, d=1 for first-order differencing).
- MA(q): Moving average (uses past q forecast errors).
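The AR and I parts can be illustrated without any modeling library. The sketch below simulates a made-up series whose differences follow an AR(1) process, then recovers the coefficient by least squares and produces a one-step forecast. The MA(q) part is omitted, and real ARIMA fitting uses maximum likelihood rather than this shortcut.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a hypothetical ARIMA(1,1,0)-style series:
# the differences follow an AR(1) process with coefficient 0.6
phi_true = 0.6
n = 500
diffs = np.zeros(n)
for i in range(1, n):
    diffs[i] = phi_true * diffs[i - 1] + rng.normal()
series = 50 + np.cumsum(diffs)  # "integrating" the differences gives the level

# I(d=1): difference once to recover the stationary part
d = np.diff(series)

# AR(p=1): regress each difference on the previous one (least squares, no intercept)
phi_hat = np.dot(d[1:], d[:-1]) / np.dot(d[:-1], d[:-1])

# One-step forecast: predict the next difference, then add it back to the last level
next_diff = phi_hat * d[-1]
next_value = series[-1] + next_diff
print(round(phi_hat, 2), next_value)
```

The estimated coefficient should land near the true 0.6, and the forecast shows how differencing is undone by adding the predicted change back to the last observed level.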
Step 1: Choose ARIMA Parameters (p, d, q)
- d = 1 (we differenced once to make data stationary).
- p = Lag where PACF cuts off (e.g., PACF lag 1 is significant → p=1).
- q = Lag where ACF cuts off (e.g., ACF lag 1 → q=1).
Alternatively, use auto_arima from pmdarima to auto-select parameters:
from pmdarima import auto_arima
# Auto-select ARIMA parameters
arima_model = auto_arima(train['Passengers'], seasonal=False, trace=True, suppress_warnings=True)
print(arima_model.summary()) # Outputs best (p, d, q)
Example Output: Best model: ARIMA(2,1,1).
Step 2: Fit ARIMA Model and Forecast
from statsmodels.tsa.arima.model import ARIMA
# Fit ARIMA(2,1,1)
arima_fit = ARIMA(train['Passengers'], order=(2,1,1)).fit()
# Forecast on test data
arima_forecast = arima_fit.forecast(steps=len(test))
Step 3: Evaluate Models
Compare Holt-Winters and ARIMA using metrics like MAE, RMSE, and MAPE:
from sklearn.metrics import mean_absolute_error, mean_squared_error
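def evaluate_forecast(actual, forecast):
    # Convert to arrays so values are compared by position, not by index label
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    mae = mean_absolute_error(actual, forecast)
    mse = mean_squared_error(actual, forecast)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs((actual - forecast) / actual)) * 100
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'MAPE': mape}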
# Evaluate Holt-Winters
hw_metrics = evaluate_forecast(test['Passengers'], hw_forecast)
# Evaluate ARIMA
arima_metrics = evaluate_forecast(test['Passengers'], arima_forecast)
print("Holt-Winters Metrics:\n", hw_metrics)
print("\nARIMA Metrics:\n", arima_metrics)
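Interpretation: Lower MAE/RMSE/MAPE = better model. Note that the ARIMA above is non-seasonal (seasonal=False), so on this strongly seasonal series Holt-Winters may well score better; a seasonal ARIMA (SARIMA) would be the fairer comparison.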
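A tiny worked example helps when reading these metrics. The numbers below are made up so the arithmetic can be checked by hand.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 190.0, 360.0])

errors = actual - forecast                       # [-10, 10, 40]
mae = np.mean(np.abs(errors))                    # (10 + 10 + 40) / 3 = 20
rmse = np.sqrt(np.mean(errors ** 2))             # sqrt((100 + 100 + 1600) / 3) = sqrt(600)
mape = np.mean(np.abs(errors / actual)) * 100    # (10% + 5% + 10%) / 3

print(mae, round(rmse, 2), round(mape, 2))
```

RMSE comes out larger than MAE because squaring punishes the single large miss, while MAPE treats the 40-unit miss on 400 the same as the 10-unit miss on 100.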
Step 4: Plot Forecast vs Actual
Visualize predictions to assess fit:
plt.figure(figsize=(12, 6))
plt.plot(train.index, train['Passengers'], label='Train')
plt.plot(test.index, test['Passengers'], label='Test')
plt.plot(test.index, arima_forecast, label='ARIMA Forecast', color='red')
plt.title('ARIMA Forecast vs Actual Passengers')
plt.legend()
plt.show()
Advanced Topics (Optional)
7.1 LSTM for Time Series Forecasting
Deep learning models like LSTMs (Long Short-Term Memory) capture long-term dependencies. Use Keras to build an LSTM:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# Prepare data for LSTM (reshape to [samples, time steps, features])
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)
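# Normalize data (fit on the full series for simplicity; in practice, fit the scaler on the training portion only to avoid leakage)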
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df['Passengers'].values.reshape(-1, 1)).flatten()
# Create sequences (12-month lookback)
seq_length = 12
X, y = create_sequences(scaled_data, seq_length)
# Train-test split
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
# Reshape for LSTM: [samples, time steps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
# Build LSTM model
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(seq_length, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# Train
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))
# Predict and inverse transform
lstm_forecast = model.predict(X_test)
lstm_forecast = scaler.inverse_transform(lstm_forecast)
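To see what the windowing step produces, here is the same create_sequences logic run on a toy array. The window length of 3 is chosen only to keep the output readable.

```python
import numpy as np

def create_sequences(data, seq_length):
    """Slide a window of length seq_length over data; the value right after each window is the target."""
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length])
        y.append(data[i + seq_length])
    return np.array(X), np.array(y)

toy = np.arange(6)   # [0 1 2 3 4 5]
X, y = create_sequences(toy, seq_length=3)
print(X)   # rows: [0 1 2], [1 2 3], [2 3 4]
print(y)   # [3 4 5]

# Add the trailing "features" axis expected by recurrent layers
X_3d = X.reshape((X.shape[0], X.shape[1], 1))
print(X_3d.shape)  # (3, 3, 1)
```

Each row of X is one lookback window and the matching entry of y is the value the network learns to predict from it.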
7.2 Facebook Prophet
Prophet is a user-friendly tool for time series with strong trends/seasonality (developed by Meta):
from prophet import Prophet
# Format data for Prophet (columns: ds=date, y=value)
prophet_df = df.reset_index().rename(columns={'Month': 'ds', 'Passengers': 'y'})[['ds', 'y']]
# Fit model (changepoint_prior_scale controls how flexible the trend is; 0.05 is the default)
model = Prophet(yearly_seasonality=True, changepoint_prior_scale=0.05)
model.fit(prophet_df)
# Forecast 2 years ahead
future = model.make_future_dataframe(periods=24, freq='MS') # 24 months, month-start to match the data
forecast = model.predict(future)
# Plot forecast
fig = model.plot(forecast)
plt.title('Prophet Forecast (Air Passengers)')
plt.show()
# Plot components (trend + seasonality)
fig = model.plot_components(forecast)
plt.show()
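The data shaping around Prophet can be tried with plain pandas. The sketch below builds a two-row stand-in for the end of the series, renames it to the ds/y layout, and generates month-start dates after the last observation. It assumes, as I understand it, that future dates are produced from a pandas frequency string, so treat the date_range call as an approximation of that step rather than Prophet's own code.

```python
import pandas as pd

# Two-row stand-in for the end of the series, indexed by month start (illustrative values)
history = pd.DataFrame(
    {"Passengers": [390, 432]},
    index=pd.DatetimeIndex(["1960-11-01", "1960-12-01"], name="Month"),
)

# Prophet expects these two column names: ds (timestamp) and y (value)
prophet_ready = history.reset_index().rename(columns={"Month": "ds", "Passengers": "y"})
print(prophet_ready)

# Month-start dates following the last observation
last_date = prophet_ready["ds"].max()
future_dates = pd.date_range(start=last_date, periods=4, freq="MS")[1:]
print(future_dates)
```

The generated dates fall on the first of each month, which lines up with how the historical data is stamped.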
Conclusion
Time series analysis in Python involves loading/cleaning data, exploring patterns, and building models to forecast future trends. We covered:
- Data preprocessing (stationarity, missing values).
- EDA (decomposition, ACF/PACF).
- Models: Moving Average, Holt-Winters, ARIMA, LSTM, and Prophet.
Start with simple baselines (moving average, Holt-Winters) and classical models (ARIMA) before advanced ones (LSTM, Prophet). Always validate with metrics like RMSE and visualize forecasts. With practice, you’ll master forecasting for real-world datasets!
References
- Pandas Documentation
- Statsmodels Time Series Guide
- Prophet Documentation
- Air Passengers Dataset
- Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts.