SQL - PostgreSQL - Cumulative Distribution¶

Source: sql-snippets/cume_dist.md at main · count/sql-snippets (github.com)

View an interactive snippet here.

Description¶

Cumulative Distribution is a method for analyzing the distribution of a quantity, similar to histograms.

They show, for each value of a quantity, what fraction of rows are smaller or greater.

One method for calculating it is as follows:

SELECT
  -- If necessary, use a row_number window function to get the position of this row in the dataset
  (ROW_NUMBER() OVER (ORDER BY <quantity>)) / (SELECT COUNT(*) FROM <schema.table>) AS cume_dist,
  <quantity>,
FROM <schema.table>

where

quantity - the column containing the metric of interest
schema.table - the table with your data

Example¶

WITH data AS (
    SELECT md5(RANDOM()::TEXT) AS username_hash,
           RANDOM() AS random_number,
           row_number
    FROM GENERATE_SERIES(1, 10) AS row_number
)

SELECT
  --we do not need to use a row_number window function as we have a generate_series in our test data set
   username_hash,
  (row_number::FLOAT / (SELECT COUNT(*) FROM data)) AS frac,
  random_number
FROM data;

Output:

username_hash	frac	random_number
f9689609b388f96ac0846c69d4b916dc	0.1	0.48311887726926983
68551a63b1eb24c750408dc20c6e4510	0.2	0.7580949364508456
86b915387b7842c33c52e71ada5488d2	0.3	0.9461365697398278
2f9b8d13471f5f8896c073edc4d712a8	0.4	0.2723711331889973
27cd443183f886105001f337d987bdaa	0.5	0.2698584751712403
370ce4710f9900fa38f99494d50b9275	0.6	0.011248462980887552
98d2b23c493e44a395a330ca2d349a47	0.7	0.7387236527452572
328ac83d33498f81a6b1e70fb8e0dcbb	0.8	0.5815637247802528
9426739e865d970e7c932c37714b32f0	0.9	0.5957734211030683
452a9d8f21c9773fbd5d832138d0d4d1	1	0.9090960753850368

Appendix: Links¶

Backlinks:

list from [[SQL - PostgreSQL - Cumulative Distribution]] AND -"Changelog"