Module 1: Correlation Mining

Code Along

Correlation Mining

  • Correlation mining is used to identify relationships between variables in a dataset.

  • Example: Finding how house size, location, and other factors relate to house prices.
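As a minimal illustration with made-up numbers (hypothetical sizes and prices, not the California data), the correlation between two variables can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical data: house size (sq ft) and price (in $1000s)
size = np.array([800, 1200, 1500, 2000, 2500])
price = np.array([150, 200, 260, 320, 400])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the pairwise coefficient
r = np.corrcoef(size, price)[0, 1]
print(f"Pearson correlation: {r:.4f}")
```

Because price rises almost linearly with size here, the coefficient comes out close to +1.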

Dataset

Dataset Used: California Housing Dataset from Scikit-Learn

We will load the dataset with scikit-learn's fetch_california_housing helper:

from sklearn.datasets import fetch_california_housing 
data = fetch_california_housing()

Convert the data to a DataFrame

import pandas as pd 
df = pd.DataFrame(data.data, columns=data.feature_names)
df['House_Price'] = data.target

Conduct Correlation Analysis

correlation_matrix = df.corr().round(4)
print(correlation_matrix)

P-Values

P-values indicate the statistical significance of each correlation: the probability of observing a correlation at least this strong if the true correlation were zero.

from scipy.stats import pearsonr
import numpy as np

# P-value Calculation for Correlation Significance
def correlation_p_values(df):
    p_values = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for col1 in df.columns:
        for col2 in df.columns:
            if col1 == col2:
                p_values.loc[col1, col2] = 0.0  # Store as float, not string
            else:
                _, p_value = pearsonr(df[col1], df[col2])
                p_values.loc[col1, col2] = p_value  # Store as float
    return p_values

p_values = correlation_p_values(df)
print("\nP-values for Correlation Significance:\n", p_values)

Interpretation

P-Value | Interpretation
p > 0.05 | No strong evidence against the null hypothesis (not significant)
p ≤ 0.05 | Evidence against the null hypothesis (significant)

Null hypothesis

Scenario | Null Hypothesis (H₀) | Alternative Hypothesis (H₁)
Drug Testing | The new drug has no effect on blood pressure. | The new drug lowers blood pressure.
Housing Market | There is no correlation between income and house prices. | Higher income leads to higher house prices.
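To make the housing-market row concrete, here is a sketch using synthetic income and price data (the numbers are invented for illustration). pearsonr returns both the correlation coefficient and the p-value for testing H₀:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
income = rng.normal(50, 10, 200)             # synthetic incomes ($1000s)
price = 3 * income + rng.normal(0, 15, 200)  # prices loosely driven by income

r, p = pearsonr(income, price)
print(f"r = {r:.3f}, p = {p:.3g}")
```

Because the synthetic prices are built from income plus noise, the p-value is far below 0.05, so we reject H₀ (no correlation) in favor of H₁.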

Visualization

A heatmap makes the correlation structure easy to read at a glance.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".4f", cmap='coolwarm', center=0)
plt.title('Correlation Matrix Heatmap')
plt.show()

Bonferroni Correction

The goal is to keep the probability of making one or more false discoveries across all tests below a predefined threshold (typically 0.05).
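Before looking at the full function, the core arithmetic is simple. With the 9 columns in this DataFrame (8 features plus House_Price) there are 9·8/2 = 36 unique pairwise tests, so the adjusted threshold is:

```python
n_cols = 9                              # 8 features + House_Price
n_tests = n_cols * (n_cols - 1) // 2    # unique off-diagonal pairs
bonferroni_alpha = 0.05 / n_tests       # divide alpha by the number of tests
print(n_tests, bonferroni_alpha)        # 36 tests, threshold ≈ 0.00139
```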

def bonferroni_correction(p_values_df, alpha=0.05):
    # Extract unique p-values (excluding diagonal and duplicates)
    unique_p_values = []
    n_vars = len(p_values_df.columns)
    
    for i in range(n_vars):
        for j in range(i+1, n_vars):  # Only upper triangle, excluding diagonal
            unique_p_values.append(p_values_df.iloc[i, j])
    
    # Convert to numpy array
    p_values_array = np.array(unique_p_values)
    n = len(p_values_array)
    
    # Calculate the Bonferroni-adjusted alpha
    bonferroni_alpha = alpha / n
    
    # Initialize an array to store the results
    result = np.zeros((n, 3), dtype=object)
    
    # Apply the Bonferroni correction
    for i, pval in enumerate(p_values_array):
        result[i, 0] = pval  # Original p-values
        result[i, 1] = bonferroni_alpha  # Bonferroni-adjusted alpha
        
        # Mark significance
        significance = 'Significant' if pval <= bonferroni_alpha else 'Not Significant'
        result[i, 2] = significance
    
    return result

result = bonferroni_correction(p_values)
print("\nBonferroni Correction Results:")
print("P-value | Alpha | Significance")
for row in result:
    print(f"{row[0]:.5f} | {row[1]:.5f} | {row[2]}")
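As a sketch of a more idiomatic alternative, the per-value loop above can be vectorized with NumPy; the p-values below are invented for illustration:

```python
import numpy as np

p_vals = np.array([0.0001, 0.02, 0.5, 0.0005])  # hypothetical p-values
alpha = 0.05
adj_alpha = alpha / len(p_vals)                 # Bonferroni-adjusted threshold

significant = p_vals <= adj_alpha               # boolean mask, no Python loop needed
for p, sig in zip(p_vals, significant):
    print(f"{p:.4f} | {adj_alpha:.4f} | {'Significant' if sig else 'Not Significant'}")
```

The comparison against the adjusted alpha happens in a single array operation, which scales better than iterating when the number of tests is large.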