Creating Correlation Matrix using DuckDB
Understanding the relationships between variables is fundamental in data analysis. The correlation matrix, a powerful statistical tool, allows us to examine the interdependencies among multiple variables. In this guide, we’ll explore how to leverage DuckDB’s functionality to compute a correlation matrix, shedding light on the connections between different data attributes.
Understanding Correlation
Correlation measures the strength and direction of the relationship between two variables. It ranges from -1 to 1. A value closer to 1 indicates a strong positive relationship, -1 represents a strong negative relationship, and 0 implies no linear relationship.
The Correlation Query
In DuckDB, the corr(y, x) function calculates the correlation coefficient between two columns. It can be used like this:
Loading editor...
WITH base AS (
SELECT *
FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv'
)
SELECT CORR(horsepower, acceleration) FROM base
The Correlation Matrix Query
This SQL query utilizes common table expressions (CTEs) and COLUMNS expression for calculating correlation coefficients of all variables.
Loading editor...
WITH base AS (
SELECT * EXCLUDE ("model-year")
FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv'
), add_rn AS (
SELECT *, ROW_NUMBER() OVER () AS rn FROM base
), unpivoted AS (
UNPIVOT add_rn
ON * EXCLUDE (rn)
INTO NAME k VALUE v
)
SELECT
k,
ROUND(CORR(COLUMNS(add_rn.* EXCLUDE (rn)), unpivoted.v), 3)
FROM add_rn
JOIN unpivoted USING (rn)
GROUP BY 1Explanation
Let’s break down the provided query to explain its functionality:
- Loading Data (
baseCTE): The query starts by loading data from the auto-mpg dataset available at a specific URL. - Adding Row Numbers (
add_rnCTE): It adds row numbers to the dataset, assigning each row a unique identifier (rn). - Unpivoting Data (
unpivotedCTE): TheUNPIVOToperation transforms the dataset from a wide format to a long format. - Correlation Calculation and Matrix Construction: The final part of the query aims to calculate correlation coefficients between multiple columns (
k) and values (v). TheCORRfunction computes the correlation, andROUNDto limit the decimal places to three. The query joins tables (add_rnandunpivoted) based on the row number (rn). It groups the results by the first column (k) to generate the correlation matrix.
DuckDB Correlation Matrix MACRO
Now to make things simpler, we can create a TABLE MACRO to query the data.
Loading editor...
CREATE OR REPLACE MACRO CORRELATION_MATRIX(base) AS TABLE (
WITH add_rn AS (
SELECT *, ROW_NUMBER() OVER () AS rn FROM base
), unpivoted AS (
UNPIVOT add_rn
ON COLUMNS(* EXCLUDE (rn))
INTO NAME k VALUE v
), get_col_names AS (
SELECT '' AS k, * EXCLUDE (rn) FROM add_rn LIMIT 0
)
SELECT * FROM get_col_names
UNION ALL
SELECT
k,
ROUND(CORR(COLUMNS(add_rn.* EXCLUDE (rn)), unpivoted.v), 3)
FROM add_rn
JOIN unpivoted
USING (rn)
GROUP BY 1
);
WITH base AS (
SELECT * EXCLUDE ("model-year")
FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv'
)
SELECT * FROM CORRELATION_MATRIX(base)Stay tuned for more informative recipes in the Ducklicious DuckDB Cookbook.