
Step-by-Step Tutorial: Time Series Analysis with Python

Time series analysis is a statistical technique used to analyze data points collected or recorded over time. Unlike cross-sectional data (which captures a snapshot at a single time), time series data has a natural temporal ordering, making it ideal for forecasting future trends, detecting patterns (e.g., seasonality), and understanding underlying dynamics. From predicting stock prices and weather patterns to forecasting sales and energy consumption, time series analysis is ubiquitous in finance, healthcare, retail, and engineering. In this tutorial, we’ll demystify time series analysis using Python, covering everything from data loading and preprocessing to modeling and forecasting. By the end, you’ll be equipped to analyze your own time series datasets and build accurate forecasts.

Table of Contents

  1. Prerequisites
  2. What is Time Series Data?
  3. Loading and Exploring Time Series Data
  4. Data Preprocessing for Time Series
  5. Exploratory Data Analysis (EDA)
  6. Time Series Forecasting Models
  7. Advanced Topics (Optional)
  8. Conclusion
  9. References

Prerequisites

Before diving in, ensure you have the following:

  • Basic knowledge of Python (e.g., loops, functions, pandas).
  • Familiarity with data visualization (matplotlib/seaborn).
  • Python libraries installed:
    pip install pandas numpy matplotlib seaborn statsmodels scikit-learn pmdarima  
  • Optional, for the Advanced Topics section only:
    pip install tensorflow prophet  

What is Time Series Data?

Time series data is a sequence of observations recorded at regular time intervals (e.g., hourly, daily, monthly). Key characteristics include:

  • Trend: Long-term upward/downward movement (e.g., increasing global temperatures).
  • Seasonality: Repeating patterns at fixed intervals (e.g., holiday sales spikes).
  • Cyclicity: Rises and falls with no fixed period, usually spanning several years (e.g., business cycles that bring a recession every 5–10 years).
  • Noise: Random, unpredictable variation.

Example datasets: Stock prices (minute-level), monthly rainfall, annual GDP, or hourly website traffic.
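
To make these components concrete, the sketch below builds a synthetic monthly series from an assumed linear trend, a 12-month sine-wave seasonal term, and Gaussian noise. Every constant here (slope, amplitude, noise level, seed) is an illustrative choice, not a value taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Monthly index covering 10 years (month-start timestamps)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(len(index))

rng = np.random.default_rng(seed=0)                   # fixed seed for reproducibility
trend = 100 + 0.8 * t                                 # long-term upward movement
seasonality = 10 * np.sin(2 * np.pi * t / 12)         # pattern repeating every 12 months
noise = rng.normal(loc=0.0, scale=2.0, size=len(t))   # random variation

# Additive combination of the three components
synthetic = pd.Series(trend + seasonality + noise, index=index, name="value")
print(synthetic.head())
```

Changing the slope, amplitude, or noise scale and re-plotting is a quick way to build intuition for how each component shapes the series.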

Loading and Exploring Time Series Data

We’ll use the classic Air Passengers dataset (monthly airline passenger totals, 1949–1960) for this tutorial. The code below reads it from a public CSV on GitHub; it is also available on Kaggle and as an R dataset that statsmodels can fetch with get_rdataset.

Step 1: Load the Data

First, load the data with pandas and ensure the timestamp column is parsed correctly:

import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  

# Load dataset (replace with your file path if needed)  
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"  
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')  # Parse 'Month' as datetime and set as index  

# Inspect the first 5 rows  
print(df.head())  

Output:

            Passengers  
Month                  
1949-01-01         112  
1949-02-01         118  
1949-03-01         132  
1949-04-01         129  
1949-05-01         121  

Step 2: Initial Exploration

Check the dataset’s structure, summary statistics, and time range:

# Dataset info  
print(df.info())  

# Summary statistics  
print(df.describe())  

# Time range  
print(f"Time range: {df.index.min()} to {df.index.max()}")  

Key Observations:

  • The dataset has 144 monthly observations (12 years).
  • The Passengers column is numeric (int64).
  • The time index is correctly formatted as datetime64[ns].
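
One more check worth running before modeling is whether the index is evenly spaced, since the models used later expect a regular frequency. A minimal sketch, using a stand-in monthly index rather than the downloaded file (the demo frame and its values are placeholders):

```python
import numpy as np
import pandas as pd

# Stand-in for the Air Passengers index: 144 month-start timestamps
index = pd.date_range("1949-01-01", periods=144, freq="MS")
demo = pd.DataFrame({"Passengers": np.arange(144)}, index=index)

# Drop the stored frequency to mimic an index parsed from a CSV file
demo.index = pd.DatetimeIndex(demo.index.values)

# Infer the spacing from the timestamps themselves ('MS' = month start)
inferred = pd.infer_freq(demo.index)
print("Inferred frequency:", inferred)

# Reindex to that frequency; any absent month would show up as a NaN row
regular = demo.asfreq(inferred)
missing_months = int(regular["Passengers"].isna().sum())
print("Months missing from the index:", missing_months)
```

If infer_freq returns None, the spacing is irregular; in that case pass an explicit frequency such as 'MS' to asfreq and inspect the NaN rows it creates.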

Step 3: Plot the Time Series

Visualization is critical to identify trends/seasonality:

plt.figure(figsize=(12, 6))  
sns.lineplot(data=df, x=df.index, y='Passengers')  
plt.title('Monthly Airline Passengers (1949–1960)')  
plt.xlabel('Year')  
plt.ylabel('Number of Passengers')  
plt.grid(True)  
plt.show()  

Expected Plot: A clear upward trend (more passengers over time) and seasonal spikes (likely summer travel).

Data Preprocessing for Time Series

Time series models require clean, stationary data. Let’s handle missing values and check stationarity.

Step 1: Handle Missing Values

Time series data may have gaps (e.g., missing months). Use pandas to detect and impute:

import numpy as np  

# Check for missing values  
print(df.isnull().sum())  # Air Passengers has no missing values, so we simulate one below  

# Simulate a missing value (optional; skip this to reproduce the outputs shown later exactly)  
df['Passengers'] = df['Passengers'].astype(float)  # NaN needs a float column  
df.loc['1950-06-01', 'Passengers'] = np.nan  

# Impute the gap; alternatives are .ffill() (forward fill) and .bfill() (backward fill)  
df['Passengers'] = df['Passengers'].interpolate(method='time')  # Time-based interpolation  
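
The imputation options differ in what they assume about the gap. The toy comparison below uses a four-month series with the third value removed; the numbers are only for illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1949-01-01", periods=4, freq="MS")
s = pd.Series([112.0, 118.0, np.nan, 129.0], index=index)  # March is missing

filled = pd.DataFrame({
    "ffill": s.ffill(),                        # carry February forward
    "bfill": s.bfill(),                        # pull April backward
    "linear": s.interpolate(method="linear"),  # midpoint of the two neighbours
    "time": s.interpolate(method="time"),      # weighted by elapsed days
})

march = pd.Timestamp("1949-03-01")
print(filled.loc[march])
```

Forward and backward fill copy a neighbour, which creates flat steps; interpolation draws a straight line across the gap. The 'time' method weights by the actual number of days, so it lands slightly closer to February's value because February is the shorter month.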

Step 2: Check Stationarity

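Many classical time series models (ARIMA among them) assume stationarity: the mean, variance, and autocorrelation structure do not change over time. The Air Passengers data has an upward trend and seasonal swings that grow with the level, so it’s non-stationary.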
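
Before running a formal test, a rough way to see what “constant mean and variance” means is to compare summary statistics for the first and second half of a series. The sketch below does this for a hypothetical trending series and for its first difference; it is an informal eyeball check, not a substitute for a statistical test.

```python
import numpy as np
import pandas as pd

def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series.iloc[:mid], series.iloc[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())

t = np.arange(144)
# Hypothetical series: linear trend plus a fixed 12-step seasonal wave
trending = pd.Series(100 + 2.0 * t + 5 * np.sin(2 * np.pi * t / 12))
differenced = trending.diff().dropna()

for name, series in [("trending", trending), ("differenced", differenced)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name:12s} mean: {m1:8.2f} vs {m2:8.2f} | var: {v1:8.2f} vs {v2:8.2f}")
```

The trending series shows very different means in its two halves, while the differenced series does not, which is the behaviour a stationary series should have.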

Test Stationarity with Augmented Dickey-Fuller (ADF) Test:
The ADF test’s null hypothesis is that the series has a unit root (i.e., it is non-stationary). A p-value < 0.05 rejects the null at the 5% level, which is evidence of stationarity.

from statsmodels.tsa.stattools import adfuller  

def adf_test(series):  
    result = adfuller(series)  
    print(f'ADF Statistic: {result[0]}')  
    print(f'p-value: {result[1]}')  
    print('Critical Values:')  
    for key, value in result[4].items():  
        print(f'   {key}: {value}')  

adf_test(df['Passengers'])  

Output for Air Passengers:

ADF Statistic: 0.8153688792060498  
p-value: 0.991880243437641  
Critical Values:  
   1%: -3.4816817173418295  
   5%: -2.8840418343195267  
   10%: -2.578770059171598  

Interpretation: p-value = 0.99 > 0.05 → Fail to reject the null hypothesis → No evidence of stationarity, so treat the data as non-stationary.

Step 3: Make Data Stationary

To stationarize, apply transformations:

  • Log transformation: Stabilizes variance (useful when seasonal swings grow with the level, i.e., multiplicative seasonality).
  • Differencing: Subtract the previous value from each observation to remove trends.

# Log transformation (optional but helpful for variance stabilization)  
df['Log_Passengers'] = np.log(df['Passengers'])  

# First-order differencing (remove trend); the first row is NaN because it has no predecessor  
df['Diff_Passengers'] = df['Log_Passengers'].diff()  

# Plot differenced data  
plt.figure(figsize=(12, 6))  
sns.lineplot(data=df, x=df.index, y='Diff_Passengers')  
plt.title('Differenced Log Passengers (Stationary?)')  
plt.show()  

# Re-run ADF test on differenced data  
adf_test(df['Diff_Passengers'].dropna())  

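ADF Test on Differenced Data: Expect a far smaller p-value than before. On this dataset it typically lands near 0.07, just above the 0.05 cutoff, because the annual seasonal pattern is still present. The trend is gone, but the series is only approximately stationary; seasonal differencing (e.g., .diff(12)) is a common next step if you need a clear rejection.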
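
A model fitted on log-differenced data produces forecasts on that transformed scale, so the transformations have to be undone in reverse order. The sketch below round-trips the first five values from the dataset: it differences the logs, then restores the original numbers with a cumulative sum followed by an exponential.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1949-01-01", periods=5, freq="MS")
passengers = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0], index=index)

log_values = np.log(passengers)
log_diff = log_values.diff()          # first entry is NaN (no previous month)

# Undo differencing: cumulative sum of the changes, anchored at the first log value
restored_log = log_diff.fillna(0).cumsum() + log_values.iloc[0]

# Undo the log transform
restored = np.exp(restored_log)

print(np.allclose(restored, passengers))
```

For real forecasts the anchor would be the last observed log value rather than the first, with the predicted changes accumulated from there.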

Exploratory Data Analysis (EDA)

Dig deeper into patterns with decomposition and autocorrelation plots.

Step 1: Decompose Time Series

Use statsmodels to split the series into trend, seasonality, and residual (noise):

from statsmodels.tsa.seasonal import seasonal_decompose  

# Decompose (multiplicative model, since trend and seasonality grow together)  
decomposition = seasonal_decompose(df['Passengers'], model='multiplicative', period=12)  # 12 months = annual seasonality  

# Plot components  
fig = decomposition.plot()  
plt.tight_layout()  
plt.show()  

Components:

  • Trend: Upward slope.
  • Seasonality: Regular spikes (summer months).
  • Residual: Random noise (small if decomposition is good).
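
To show what a classical multiplicative decomposition does internally, here is a hand-rolled sketch on a hypothetical series. It follows the textbook recipe (centred moving average for the trend, per-month averages for the seasonal factors) and is not claimed to match the statsmodels implementation detail for detail.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend times a repeating seasonal factor
index = pd.date_range("2000-01-01", periods=72, freq="MS")
t = np.arange(72)
series = pd.Series((100 + 2 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)), index=index)

# 1. Trend: centred 2x12 moving average (13 weights, half weight at both ends)
weights = np.r_[0.5, np.ones(11), 0.5] / 12
trend = series.rolling(window=13, center=True).apply(lambda w: np.dot(w, weights), raw=True)

# 2. Seasonal factors: average detrended value per calendar month, rescaled to mean 1
detrended = series / trend
monthly = detrended.groupby(detrended.index.month).mean()
monthly = monthly / monthly.mean()
seasonal = pd.Series(monthly.loc[series.index.month].to_numpy(), index=series.index)

# 3. Residual: whatever trend and seasonality leave unexplained
residual = series / (trend * seasonal)

print(seasonal.iloc[:12].round(3).tolist())
print(residual.dropna().describe())
```

The trend is undefined for the first and last six months because the centred window needs data on both sides; the residual stays close to 1 here because the toy series contains no noise.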

Step 2: Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plots

ACF/PACF show correlation between a series and its lagged values (e.g., correlation between January 1950 and January 1949).

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf  

# ACF plot (shows total correlation with lags)  
plt.figure(figsize=(12, 4))  
plot_acf(df['Diff_Passengers'].dropna(), lags=24, ax=plt.gca())  
plt.title('ACF Plot (Differenced Data)')  
plt.show()  

# PACF plot (shows direct correlation with lags, removing intermediate lags)  
plt.figure(figsize=(12, 4))  
plot_pacf(df['Diff_Passengers'].dropna(), lags=24, ax=plt.gca())  
plt.title('PACF Plot (Differenced Data)')  
plt.show()  

Key Insight: Significant lags in ACF/PACF help choose ARIMA parameters (see below).
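
The quantity plotted in an ACF chart can be computed by hand. The sketch below implements the standard sample autocorrelation in NumPy and applies it to a hypothetical sine wave with a 12-step period, so the seasonal lags stand out.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: len(x) - k] * centered[k:]) / denom
        for k in range(max_lag + 1)
    ])

t = np.arange(144)
seasonal_wave = np.sin(2 * np.pi * t / 12)   # repeats every 12 steps
acf_values = autocorrelation(seasonal_wave, max_lag=24)

for lag in (0, 6, 12, 24):
    print(f"lag {lag:2d}: {acf_values[lag]: .3f}")
```

Lag 0 is always 1, lag 6 (half a period) is strongly negative, and lags 12 and 24 are strongly positive. Spikes at multiples of 12 in a monthly ACF plot are the usual signature of annual seasonality.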

Time Series Forecasting Models

6.1 Simple Models: Moving Average & Exponential Smoothing

Start with simple models before advanced ones.

Moving Average (MA)

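Predicts the next value as the average of the previous n observations. (This smoothing method is not the same thing as the MA(q) term in ARIMA, which models past forecast errors.)

# 12-month moving average (since seasonality is annual)  
# shift(1) so each forecast uses only the 12 months before it, not the month being predicted  
df['MA_12'] = df['Passengers'].rolling(window=12).mean().shift(1)  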

# Plot actual vs MA forecast  
plt.figure(figsize=(12, 6))  
sns.lineplot(data=df, x=df.index, y='Passengers', label='Actual')  
sns.lineplot(data=df, x=df.index, y='MA_12', label='12-Month MA')  
plt.title('Actual vs Moving Average Forecast')  
plt.legend()  
plt.show()  

Exponential Smoothing (SES/Holt-Winters)

Exponential smoothing assigns more weight to recent observations. Use statsmodels for Holt-Winters (handles trend + seasonality):

from statsmodels.tsa.holtwinters import ExponentialSmoothing  

# Chronological train-test split (first 10 years for training, last 2 for testing: about 83% / 17%)  
train = df.loc[:'1958-12-01']  # 1949–1958  
test = df.loc['1959-01-01':]   # 1959–1960  

# Fit Holt-Winters model (multiplicative for growing trend/seasonality)  
model = ExponentialSmoothing(train['Passengers'], trend='mul', seasonal='mul', seasonal_periods=12)  
hw_fit = model.fit()  

# Forecast on test data  
hw_forecast = hw_fit.forecast(steps=len(test))  
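
To see what “more weight to recent observations” means in practice, the simplest member of this family (simple exponential smoothing, with no trend or seasonal terms) can be written by hand. This is a sketch of the recursion only; the function name and the choice to initialise with the first observation are assumptions, and statsmodels estimates alpha and the initial level for you.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed level after each observation.

    level[t] = alpha * values[t] + (1 - alpha) * level[t - 1],
    initialised with the first observation (one common convention).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [float(values[0])]
    for value in values[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels

levels = simple_exponential_smoothing([10, 20, 30], alpha=0.5)
print(levels)                 # [10.0, 15.0, 22.5]
next_forecast = levels[-1]    # SES forecasts a flat line at the last level
```

A larger alpha tracks recent data more closely; a smaller alpha smooths more heavily. Holt-Winters adds analogous recursions for the trend and the seasonal factors.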

6.2 ARIMA: AutoRegressive Integrated Moving Average

ARIMA is a powerful model combining:

  • AR(p): Autoregression (uses past p values).
  • I(d): Integration (differencing to make stationary, d=1 for first-order differencing).
  • MA(q): Moving average (uses past q forecast errors).
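
The AR and I parts can be illustrated without any forecasting library. The sketch below uses hypothetical, noise-free data whose month-to-month changes follow an AR(1) process, differences the series once, estimates the AR coefficient by least squares, and then undoes the differencing to get a one-step forecast. The MA part is left out because estimating it needs an iterative fit.

```python
import numpy as np

def fit_ar1(x):
    """Least-squares estimate of (c, phi) in x[t] = c + phi * x[t-1] + error."""
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    (c, phi), *_ = np.linalg.lstsq(design, x[1:], rcond=None)
    return c, phi

# Hypothetical data: each change equals 1 + 0.5 * (previous change)
changes = [0.0]
for _ in range(11):
    changes.append(1.0 + 0.5 * changes[-1])
y = 100 + np.cumsum(changes)           # the observed (integrated) series

diffs = np.diff(y)                     # I(d=1): difference once
c, phi = fit_ar1(diffs)                # AR(p=1): regress each change on the previous one
next_change = c + phi * diffs[-1]      # one-step forecast of the change
next_value = y[-1] + next_change       # undo the differencing

print(round(c, 3), round(phi, 3), round(next_value, 3))
```

Because the toy data has no noise, the fit recovers the generating coefficients almost exactly; on real data the estimates carry uncertainty, which is why a library fit with standard errors is preferable.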

Step 1: Choose ARIMA Parameters (p, d, q)

  • d = 1 (we differenced once to make data stationary).
  • p = Lag where PACF cuts off (e.g., PACF lag 1 is significant → p=1).
  • q = Lag where ACF cuts off (e.g., ACF lag 1 → q=1).

Alternatively, use auto_arima from pmdarima to auto-select parameters:

from pmdarima import auto_arima  

# Auto-select ARIMA parameters  
arima_model = auto_arima(train['Passengers'], seasonal=False, trace=True, suppress_warnings=True)  
print(arima_model.summary())  # Outputs best (p, d, q)  

Example Output: Best model: ARIMA(2,1,1).

Step 2: Fit ARIMA Model and Forecast

from statsmodels.tsa.arima.model import ARIMA  

# Fit ARIMA(2,1,1)  
arima_fit = ARIMA(train['Passengers'], order=(2,1,1)).fit()  

# Forecast on test data  
arima_forecast = arima_fit.forecast(steps=len(test))  

Step 3: Evaluate Models

Compare Holt-Winters and ARIMA using metrics like MAE, RMSE, and MAPE:

from sklearn.metrics import mean_absolute_error, mean_squared_error  

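def evaluate_forecast(actual, forecast):  
    # Compare by position, not by index label, so mismatched indexes cannot produce NaNs  
    actual = np.asarray(actual, dtype=float)  
    forecast = np.asarray(forecast, dtype=float)  
    mae = mean_absolute_error(actual, forecast)  
    mse = mean_squared_error(actual, forecast)  
    rmse = np.sqrt(mse)  
    mape = np.mean(np.abs((actual - forecast) / actual)) * 100  # assumes no zero values in actual  
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'MAPE': mape}  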

# Evaluate Holt-Winters  
hw_metrics = evaluate_forecast(test['Passengers'], hw_forecast)  
# Evaluate ARIMA  
arima_metrics = evaluate_forecast(test['Passengers'], arima_forecast)  

print("Holt-Winters Metrics:\n", hw_metrics)  
print("\nARIMA Metrics:\n", arima_metrics)  

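Interpretation: Lower MAE/RMSE/MAPE = better model. Don’t assume the more complex model wins: the ARIMA fitted above is non-seasonal, so it cannot reproduce the 12-month pattern, and Holt-Winters may well score better on this dataset. A seasonal ARIMA (SARIMA) would be the fairer comparison.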
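
As a sanity check on what each metric measures, here is a small worked example with made-up numbers that can be verified by hand.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 190.0, 380.0])

errors = actual - forecast                       # [-10, 10, 20]
mae = np.mean(np.abs(errors))                    # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))             # penalises large misses more
mape = np.mean(np.abs(errors / actual)) * 100    # undefined if any actual value is 0

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%")
# MAE=13.33  RMSE=14.14  MAPE=6.67%
```

RMSE exceeds MAE here because the single 20-unit miss is weighted more heavily after squaring. MAPE is scale-free, which makes it convenient for comparing series of different magnitudes.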

Step 4: Plot Forecast vs Actual

Visualize predictions to assess fit:

plt.figure(figsize=(12, 6))  
plt.plot(train.index, train['Passengers'], label='Train')  
plt.plot(test.index, test['Passengers'], label='Test')  
plt.plot(test.index, arima_forecast, label='ARIMA Forecast', color='red')  
plt.title('ARIMA Forecast vs Actual Passengers')  
plt.legend()  
plt.show()  

Advanced Topics (Optional)

7.1 LSTM for Time Series Forecasting

Deep learning models like LSTMs (Long Short-Term Memory) capture long-term dependencies. Use Keras to build an LSTM:

from tensorflow.keras.models import Sequential  
from tensorflow.keras.layers import LSTM, Dense  

# Prepare data for LSTM (reshape to [samples, time steps, features])  
def create_sequences(data, seq_length):  
    X, y = [], []  
    for i in range(len(data) - seq_length):  
        X.append(data[i:i+seq_length])  
        y.append(data[i+seq_length])  
    return np.array(X), np.array(y)  

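# Normalize data (fit the scaler on the first 80% only, so the test period does not leak into the scaling)  
from sklearn.preprocessing import MinMaxScaler  
values = df['Passengers'].values.reshape(-1, 1)  
scaler = MinMaxScaler()  
scaler.fit(values[:int(0.8 * len(values))])  
scaled_data = scaler.transform(values).flatten()  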

# Create sequences (12-month lookback)  
seq_length = 12  
X, y = create_sequences(scaled_data, seq_length)  

# Train-test split  
train_size = int(0.8 * len(X))  
X_train, X_test = X[:train_size], X[train_size:]  
y_train, y_test = y[:train_size], y[train_size:]  

# Reshape for LSTM: [samples, time steps, features]  
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))  
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))  

# Build LSTM model  
model = Sequential()  
model.add(LSTM(50, activation='relu', input_shape=(seq_length, 1)))  
model.add(Dense(1))  
model.compile(optimizer='adam', loss='mse')  

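# Train (the test set is passed here only to monitor loss; hold out a separate validation split if you tune hyperparameters)  
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))  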

# Predict and inverse transform  
lstm_forecast = model.predict(X_test)  
lstm_forecast = scaler.inverse_transform(lstm_forecast)  
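
The windowing step is easier to follow on a tiny array. The sketch below repeats the create_sequences helper from above on six toy values with a 3-step lookback and shows the shapes that result.

```python
import numpy as np

def create_sequences(data, seq_length):
    """Slide a window over data: each X row holds seq_length values, y holds the next one."""
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length])
        y.append(data[i + seq_length])
    return np.array(X), np.array(y)

toy = np.arange(6)                     # [0 1 2 3 4 5]
X, y = create_sequences(toy, seq_length=3)
print(X)                               # rows: [0 1 2], [1 2 3], [2 3 4]
print(y)                               # [3 4 5]

# Add the trailing "features" axis that the LSTM layer expects
X_lstm = X.reshape((X.shape[0], X.shape[1], 1))
print(X_lstm.shape)                    # (3, 3, 1)
```

Each training sample is a window of past values and its target is the value that immediately follows, so a series of length n with lookback k yields n - k samples.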

7.2 Facebook Prophet

Prophet is a user-friendly tool for time series with strong trends/seasonality (developed by Meta):

from prophet import Prophet  

# Format data for Prophet (exactly two columns: ds=date, y=value)  
prophet_df = df.reset_index().rename(columns={'Month': 'ds', 'Passengers': 'y'})[['ds', 'y']]  

# Fit model (multiplicative seasonality, since the seasonal swings grow with the trend)  
model = Prophet(yearly_seasonality=True, seasonality_mode='multiplicative')  
model.fit(prophet_df)  

# Forecast 2 years ahead  
future = model.make_future_dataframe(periods=24, freq='MS')  # 24 months, month-start like the data  
forecast = model.predict(future)  

# Plot forecast  
fig = model.plot(forecast)  
plt.title('Prophet Forecast (Air Passengers)')  
plt.show()  

# Plot components (trend + seasonality)  
fig = model.plot_components(forecast)  
plt.show()  

Conclusion

Time series analysis in Python involves loading/cleaning data, exploring patterns, and building models to forecast future trends. We covered:

  • Data preprocessing (stationarity, missing values).
  • EDA (decomposition, ACF/PACF).
  • Models: Moving Average, Holt-Winters, ARIMA, LSTM, and Prophet.

Start with simple models (moving averages, Holt-Winters, ARIMA) before advanced ones (LSTM, Prophet). Always validate on held-out data with metrics like RMSE, and visualize forecasts against the actual values. With practice, you’ll be able to apply the same workflow to real-world datasets.

References