Code Along
Correlation mining is used to identify relationships between variables in a dataset.
Example: Finding how house size, location, and other factors relate to house prices.
Dataset Used: California Housing Dataset from Scikit-Learn.
We will load the dataset with scikit-learn and convert the data to a DataFrame.
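A minimal loading sketch, assuming the standard `fetch_california_housing` loader from scikit-learn (the dataset is downloaded and cached on first use); the DataFrame `df` built here is what the later correlation code operates on:

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Fetch the California Housing dataset (cached locally by scikit-learn)
housing = fetch_california_housing()

# Convert the feature matrix to a DataFrame and append the target column
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target

print(df.shape)
print(df.head())
```

Alternatively, `fetch_california_housing(as_frame=True)` returns the data as a DataFrame directly via its `.frame` attribute.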
P-values quantify the statistical significance of each correlation: the probability of observing a correlation at least this strong if the true correlation were zero.
```python
from scipy.stats import pearsonr
import numpy as np
import pandas as pd

# P-value calculation for correlation significance
def correlation_p_values(df):
    p_values = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for col1 in df.columns:
        for col2 in df.columns:
            if col1 == col2:
                p_values.loc[col1, col2] = 0.0  # Store as float, not string
            else:
                _, p_value = pearsonr(df[col1], df[col2])
                p_values.loc[col1, col2] = p_value  # Store as float
    return p_values

p_values = correlation_p_values(df)
print("\nP-values for Correlation Significance:\n", p_values)
```

| P-Value | Interpretation |
|---|---|
| p > 0.05 | No strong evidence against the null hypothesis (Not significant) |
| p ≤ 0.05 | Moderate evidence against the null hypothesis (Significant) |
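To make the interpretation table concrete, here is a small sketch using synthetic data (not the housing set): a pair with a real linear relationship yields a tiny p-value, while an independent pair typically yields a large one:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
x = rng.normal(size=500)

# y depends linearly on x -> strong correlation, p-value near zero
y_related = 2 * x + rng.normal(scale=0.5, size=500)
r1, p1 = pearsonr(x, y_related)

# z is independent noise -> correlation near zero, p typically > 0.05
z_unrelated = rng.normal(size=500)
r2, p2 = pearsonr(x, z_unrelated)

print(f"related:   r={r1:.3f}, p={p1:.3g}")
print(f"unrelated: r={r2:.3f}, p={p2:.3g}")
```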
Null hypothesis: the default assumption of no effect or no relationship (H₀); the alternative hypothesis (H₁) is what the test seeks evidence for.
| Scenario | Null Hypothesis (H₀) | Alternative Hypothesis (H₁) |
|---|---|---|
| Drug Testing | The new drug has no effect on blood pressure. | The new drug lowers blood pressure. |
| Housing Market | There is no correlation between income and house prices. | Higher income leads to higher house prices. |
Visualization helps you demonstrate correlation in a straightforward way.
Testing many correlations at once inflates the chance of false positives, so we apply the Bonferroni correction. The goal is to keep the probability of making one or more false discoveries across all tests (the family-wise error rate) below a predefined threshold (typically 0.05), by comparing each p-value against alpha divided by the number of tests.
```python
def bonferroni_correction(p_values_df, alpha=0.05):
    # Extract unique p-values: upper triangle only, excluding the diagonal
    # and duplicates, since the p-value matrix is symmetric
    unique_p_values = []
    n_vars = len(p_values_df.columns)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            unique_p_values.append(p_values_df.iloc[i, j])

    p_values_array = np.array(unique_p_values)
    n = len(p_values_array)

    # Bonferroni-adjusted significance threshold: alpha / number of tests
    bonferroni_alpha = alpha / n

    # Each row holds: original p-value, adjusted alpha, significance label
    result = np.zeros((n, 3), dtype=object)
    for i, pval in enumerate(p_values_array):
        result[i, 0] = pval
        result[i, 1] = bonferroni_alpha
        result[i, 2] = 'Significant' if pval <= bonferroni_alpha else 'Not Significant'
    return result

result = bonferroni_correction(p_values)
print("\nBonferroni Correction Results:")
print("P-value | Alpha | Significance")
for row in result:
    print(f"{row[0]:.5f} | {row[1]:.5f} | {row[2]}")
```