SQL - PostgreSQL - Cumulative Distribution¶
Source: sql-snippets/cume_dist.md at main · count/sql-snippets (github.com)
View an interactive snippet here.
Description¶
Cumulative Distribution is a method for analyzing the distribution of a quantity, similar to histograms.
They show, for each value of a quantity, what fraction of rows are smaller or greater.
One method for calculating it is as follows:
SELECT
-- If necessary, use a row_number window function to get the position of this row in the dataset
(ROW_NUMBER() OVER (ORDER BY <quantity>)) / (SELECT COUNT(*) FROM <schema.table>) AS cume_dist,
<quantity>,
FROM <schema.table>
where
quantity
- the column containing the metric of interestschema.table
- the table with your data
Example¶
WITH data AS (
SELECT md5(RANDOM()::TEXT) AS username_hash,
RANDOM() AS random_number,
row_number
FROM GENERATE_SERIES(1, 10) AS row_number
)
SELECT
--we do not need to use a row_number window function as we have a generate_series in our test data set
username_hash,
(row_number::FLOAT / (SELECT COUNT(*) FROM data)) AS frac,
random_number
FROM data;
Output:
username_hash | frac | random_number |
---|---|---|
f9689609b388f96ac0846c69d4b916dc | 0.1 | 0.48311887726926983 |
68551a63b1eb24c750408dc20c6e4510 | 0.2 | 0.7580949364508456 |
86b915387b7842c33c52e71ada5488d2 | 0.3 | 0.9461365697398278 |
2f9b8d13471f5f8896c073edc4d712a8 | 0.4 | 0.2723711331889973 |
27cd443183f886105001f337d987bdaa | 0.5 | 0.2698584751712403 |
370ce4710f9900fa38f99494d50b9275 | 0.6 | 0.011248462980887552 |
98d2b23c493e44a395a330ca2d349a47 | 0.7 | 0.7387236527452572 |
328ac83d33498f81a6b1e70fb8e0dcbb | 0.8 | 0.5815637247802528 |
9426739e865d970e7c932c37714b32f0 | 0.9 | 0.5957734211030683 |
452a9d8f21c9773fbd5d832138d0d4d1 | 1 | 0.9090960753850368 |
Appendix: Links¶
Backlinks:
list from [[SQL - PostgreSQL - Cumulative Distribution]] AND -"Changelog"