postgresql generate_series generate test data

Generate data

CREATE SCHEMA IF NOT EXISTS temp;
DROP TABLE IF EXISTS temp.users;
BEGIN;
SET LOCAL statement_timeout = '120s';
-- The underscore separator in 10_000_000 requires PostgreSQL 16+
CREATE TABLE IF NOT EXISTS temp.users
WITH (autovacuum_enabled = FALSE) AS
SELECT seq AS id,
       'fname' || seq AS first_name,
       'lname' || seq AS last_name,
       'user_' || seq || '@' || (
              CASE (RANDOM() * 2)::INT
                WHEN 0 THEN 'gmail'
                WHEN 1 THEN 'hotmail'
                WHEN 2 THEN 'yahoo'
              END) || '.com' AS email,
       CASE (seq % 2)
          WHEN 0 THEN 'Driver'
          ELSE 'Rider'
       END AS type,
       NOW() AS created_at,
       NOW() AS updated_at
FROM GENERATE_SERIES(1, 10_000_000) seq;
COMMIT;
-- Update the visibility map and statistics
VACUUM (ANALYZE) temp.users;
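
A quick sanity check after the load (pg_size_pretty() and pg_total_relation_size() are standard PostgreSQL functions; the row count should match the length of the generated series):

SELECT count(*) AS rows_loaded,
       pg_size_pretty(pg_total_relation_size('temp.users')) AS total_size
FROM temp.users;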

Generate Reproducible Random Data

To generate reproducible random data, you can use setseed:

db=# SELECT setseed(0.4050);
 setseed
─────────

(1 row)

db=# SELECT
    random() AS random_float,
    ceil(random() * 10) AS random_int_0_10,
    '2022-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2022
FROM
    generate_series(1, 2);

    random_float    β”‚ random_int_0_10 β”‚ random_day_in_2022
────────────────────┼─────────────────┼─────────────────────
 0.1924247516794324 β”‚               9 β”‚ 2022-12-17 00:00:00
 0.9720620908236377 β”‚               5 β”‚ 2022-06-13 00:00:00
If you execute the same block again in a new session, even in a different database, it will produce the exact same results:

otherdb=# SELECT setseed(0.4050);
 setseed
─────────

(1 row)

otherdb=# SELECT
    random() AS random_float,
    ceil(random() * 10) AS random_int_0_10,
    '2022-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2022
FROM
    generate_series(1, 2);

    random_float    β”‚ random_int_0_10 β”‚ random_day_in_2022
────────────────────┼─────────────────┼─────────────────────
 0.1924247516794324 β”‚               9 β”‚ 2022-12-17 00:00:00
 0.9720620908236377 β”‚               5 β”‚ 2022-06-13 00:00:00

DISCLAIMER: Never forget that generated test data behaves differently than actual production data. Whether you create the data with this approach or with some data-faking library makes no difference. All real data has patterns you may or may not know about. Your biggest customers, per the Pareto principle, probably account for around 80% of your total storage, and their data distribution differs from anything you would generate by hand: some values are barely used at all, while others form hotspots that are referenced all over the place. This difference in data distribution may lead to indexes being used in production that won't be used with test data, and vice versa.
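
If distribution skew matters for what you are testing, you can at least approximate it instead of drawing everything from a uniform random(). A minimal sketch (the exponent and the 1..1000 customer range are arbitrary assumptions, not values from this page): raising random() to a power concentrates most rows on a few low customer IDs, producing the kind of hotspot a uniform distribution never will.

-- Skewed customer_id: power(random(), 5) pushes most values toward 0,
-- so a handful of "big customers" own most of the generated rows.
SELECT seq AS id,
       1 + floor(power(random(), 5) * 1000)::int AS customer_id,
       now() - (random() * interval '90 days') AS created_at
FROM generate_series(1, 100000) seq;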

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    firstname VARCHAR(100) NOT NULL,
    lastname VARCHAR(100) NOT NULL
);

-- Generate 1 million random users

postgres=# DO $$
DECLARE
    i INT;
    user_email TEXT; 
    firstname TEXT;
    lastname TEXT;
BEGIN
    FOR i IN 1..1000000 LOOP
        -- Generate random first and last names
        firstname := LOWER(SUBSTRING(md5(random()::text), 1, 8));  -- 8-character random string
        lastname := LOWER(SUBSTRING(md5(random()::text), 1, 8));   

        -- Generate random email
        user_email := LOWER(SUBSTRING(md5(random()::text), 1, 8)) || '@example.com';

        -- Insert the generated email, first name, and last name, and skip duplicates
        INSERT INTO users (email, firstname, lastname)
        VALUES (user_email, firstname, lastname)
        ON CONFLICT (email) DO NOTHING;  
    END LOOP;
END $$;


There are also some commercial tools used to generate application-specific test data for PostgreSQL. They are available at http://www.sqlmanager.net/products/postgresql/datagenerator and http://www.datanamic.com/datagenerator/index.html.

  • generate_series() is a built-in PostgreSQL function that makes it easy to create ordered tables of numbers or dates.

  • generate_series() is a PostgreSQL set-returning function, for which the documentation reads: "Generate a series of values, from start to stop with a step size of step."

Example: generate all days of a month (:'start' is a psql variable holding the first day of the month):

select cast(calendar.entry as date) as date
from
generate_series(date :'start'
               ,date :'start' + interval '1 month' - interval '1 day'
               , interval '1 day' ) as calendar(entry)
  • date :'start' + interval '1 month' gives the 1st day of the next month.
  • generate_series() is inclusive, much like the BETWEEN operator, so we exclude the 1st day of the next month with the expression - interval '1 day'.

The cast(calendar.entry as date) expression transforms the generated calendar.entry, which is the result of the generate_series() function call, into the date data type. We need the cast because generate_series() returns a set of timestamp entries and we don't care about the time part here.
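
A minimal way to run this from psql, assuming you pick 2024-06-01 as the month to generate (the \set value below is just an example):

\set start '2024-06-01'

select cast(calendar.entry as date) as date
  from generate_series(date :'start'
                      ,date :'start' + interval '1 month' - interval '1 day'
                      ,interval '1 day') as calendar(entry);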

select id from generate_series(1, 10) x(id); -- x is the table alias, id the column alias

-- Equivalent using a recursive CTE instead of generate_series()
WITH RECURSIVE x(id) AS (
  SELECT 1
  UNION ALL
  SELECT id + 1
    FROM x
   WHERE id + 1 <= 10
)
SELECT * FROM x;
SELECT * FROM generate_series(1,3);
1
2
3
SELECT * from generate_series(0,10,2);
0
2
4
6
8
10
SELECT date(t)                     
FROM generate_series(now(),
  now() + '1 week', '1 day') AS f(t);
SELECT * from generate_series(
	'2021-01-01',
    '2021-01-02', INTERVAL '1 hour'
  );

The reason we got 25 rows (representing 25 hours rather than the 24 you might expect) is that the stop value is inclusive: as long as stepping by the INTERVAL lands exactly on the stop value, that final value is included in the output.
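
If you want exactly 24 hourly rows for the day, one option is to stop just short of the next day (the one-second offset below is an arbitrary choice; stopping at '2021-01-01 23:00:00' works just as well):

SELECT * FROM generate_series(
    '2021-01-01'::timestamp,
    '2021-01-02'::timestamp - INTERVAL '1 second',
    INTERVAL '1 hour'
  );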

SELECT * from generate_series(
	'2021-01-01','2021-01-02', 
    INTERVAL '1 hour 25 minutes'
   );
SELECT 'Hello' as myStr, * FROM generate_series(1,3);
 mystr | generate_series 
-------+-----------------
 Hello |               1
 Hello |               2
 Hello |               3
SELECT random()*100 as CPU, * FROM generate_series(1,5);

Cartesian product

SELECT * from generate_series(1,10) a, generate_series(1,2) b;
SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	'2021-01-01 00:00:00',
    '2021-01-01 11:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;

Create 6 months of data with one-hour intervals, ending now()

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	now() - INTERVAL '6 months',
    now(),
    INTERVAL '1 hour'
   ) as time, 
generate_series(1,4) device_id;

Create 6 months of data with one-hour intervals, ending on a specific timestamp

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	'2021-08-01 00:00:00' - INTERVAL '6 months',
    '2021-08-01 00:00:00',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;

Create 2 months of data with one-hour intervals, beginning on a specific timestamp

SELECT time, device_id, random()*100 as cpu_usage 
FROM generate_series(
	'2020-12-15 00:00:00',
    '2020-12-15 00:00:00' + INTERVAL '2 months',
    INTERVAL '1 hour'
  ) as time, 
generate_series(1,4) device_id;

Creating more realistic numbers

Final value = random() * (max allowed value - min allowed value) + min allowed value

This equation always generates a decimal value between the min and max values. If random() returns 0, the result equals the minimum value; because PostgreSQL's random() returns values in the range 0.0 <= x < 1.0, the result approaches (but never quite reaches) the maximum value as random() approaches 1. Any other value random() returns produces some output between the min and max values.

SELECT
  time,
  device_id,
  round((random()* (100-3) + 3)::NUMERIC, 4) AS cpu,
  floor(random()* (83-28) + 28)::INTEGER AS tempc
FROM 
	generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id;

This example produces a reading every minute for one hour for 10 devices. The cpu value will always fall between 3 and 100 (with four decimals of precision), and the temperature will always be an integer between 28 and 83.

(random()*(2*10^9))::integer
(random()*(9*10^18))::bigint
(random()*100.)::numeric(5,2)
repeat('1',(random()*40)::integer)  -- For a random-length string, up to a maximum length
substr('abcdefghijklmnopqrstuvwxyz',1, (random()*25)::integer) -- For a random-length substring
(ARRAY['one','two','three'])[0.5+random()*3] -- random string from a list of strings
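
For example, a fully cast version of the array trick (floor() plus an explicit ::int keeps the subscript a valid index between 1 and 3):

SELECT key,
       (ARRAY['one','two','three'])[1 + floor(random()*3)::int] AS random_word
FROM generate_series(1,5) AS f(key);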

SELECT key
      ,(random()*100.)::numeric(4,2)
      ,repeat('1',(random()*25)::integer)
FROM generate_series(1,10) AS f(key);

-- use random ordering
SELECT key
      ,(random()*100.)::numeric(4,2)
      ,repeat('1',(random()*25)::integer)
  FROM generate_series(1,10) AS f(key)
 ORDER BY random() * 1.0;
/*
 * Function to create a random numeric value between two numbers
 * 
 * NOTICE: We are using the type of 'numeric' in this function in order
 * to visually return values that look like integers (no decimals) and 
 * floats (with decimals). However, if inserted into a table, the assumption
 * is that the appropriate column type is used. The `numeric` type is often
 * not the correct or most efficient type for storing numbers in a table.
 */
CREATE OR REPLACE FUNCTION random_between(min_val numeric, max_val numeric, round_to int=0) 
   RETURNS numeric AS
$$
 DECLARE
 	value NUMERIC = random()* (min_val - max_val) + max_val;
BEGIN
   IF round_to = 0 THEN 
	 RETURN floor(value);
   ELSE 
   	 RETURN round(value,round_to);
   END IF;
END
$$ language 'plpgsql';

SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c
FROM 
	generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id;

In general, the numeric data type will often perform worse in queries and features like compression because of how numeric values are represented internally. We recommend avoiding numeric types in schema design whenever possible, preferring the float or integer types instead.
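
For instance, when persisting generated readings you can cast the numeric output of random_between() to the column types you actually want to store (the conditions table below is only an illustrative name, not something defined earlier on this page):

CREATE TABLE conditions (
    time          timestamptz      NOT NULL,
    device_id     integer          NOT NULL,
    cpu           double precision NOT NULL,
    temperature_c integer          NOT NULL
);

INSERT INTO conditions (time, device_id, cpu, temperature_c)
SELECT time,
       device_id,
       random_between(3, 100, 4)::double precision,
       random_between(28, 83)::integer
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time,
     generate_series(1, 10) AS device_id;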

Creating more realistic text

WITH symbols(characters) as (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 {}')),
w1 AS (
	SELECT string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '') r_text, 'g1' AS idx
	FROM symbols,
generate_series(1,10) as word(chr_idx) -- word length
	GROUP BY idx)
SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  w1.r_text AS note
FROM w1, generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;
/*
 * Function to create random text, of varying length
 */
CREATE OR REPLACE FUNCTION random_text(min_val INT=0, max_val INT=50) 
   RETURNS text AS
$$
DECLARE 
	word_length NUMERIC  = floor(random() * (max_val-min_val) + min_val)::INTEGER;
	random_word TEXT = '';
BEGIN
	-- Only produce a word when word_length % 5 is greater than 1 (remainder 2, 3 or 4).
	-- This adds some randomness to whether a word is produced at all. Adjust to taste.
	IF (word_length % 5) > 1 THEN
	SELECT a.r_text INTO random_word FROM (
		WITH symbols(characters) AS (VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 '))
		SELECT string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '') AS r_text
		FROM symbols
		JOIN generate_series(1, word_length) AS word(chr_idx) ON true -- chr_idx controls the word length
	) a;
	END IF;
	RETURN random_word;
END
$$ LANGUAGE 'plpgsql';


SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  random_text(2,10) AS note
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;

Creating sample JSON

WITH random_json AS (
SELECT json_object_agg(key, random_between(1,10)) as json_data
    FROM unnest(array['a', 'b']) as u(key))
  SELECT json_data, generate_series(1,5) FROM random_json;

json_data       |generate_series|
----------------+---------------+
{"a": 6, "b": 2}|              1|
{"a": 6, "b": 2}|              2|
{"a": 6, "b": 2}|              3|
{"a": 6, "b": 2}|              4|
{"a": 6, "b": 2}|              5|
CREATE OR REPLACE FUNCTION random_json(keys TEXT[]='{"a","b","c"}',min_val NUMERIC = 0, max_val NUMERIC = 10) 
   RETURNS JSON AS
$$
DECLARE 
	random_val NUMERIC  = floor(random() * (max_val-min_val) + min_val)::INTEGER;
	random_json JSON = NULL;
BEGIN
	-- again, this adds some randomness into the results. Remove or modify if this
	-- isn't useful for your situation
	if(random_val % 5) > 1 then
		SELECT * INTO random_json FROM (
			SELECT json_object_agg(key, random_between(min_val,max_val)) as json_data
	    		FROM unnest(keys) as u(key)
		) json_val;
	END IF;
	RETURN random_json;
END
$$ LANGUAGE 'plpgsql';

SELECT random_json();
SELECT device_id, random_json() FROM generate_series(1,5) device_id;
-- Because we kept the same random output mechanism from the random_text() example, not every row includes JSON.
SELECT
  time,
  device_id,
  random_between(3,100, 4) AS cpu,
  random_between(28,83) AS temperature_c,
  random_text(2,10) AS note,
  random_json(ARRAY['building','rack'],1,20) device_location
FROM generate_series(now() - interval '1 hour', now(), interval '1 minute') AS time, 
	generate_series(1,10,1) AS device_id
ORDER BY 1,2;

How to create data that mimics a desired shape or trend

The row_number() over() window function returns the row number of each row in a result set, and can use the ORDER BY and PARTITION BY keywords to control how rows are numbered.

SELECT ts, row_number() over(order by ts) AS rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts;

row_number() over() requires PostgreSQL (like any other SQL database) to make an additional pass over the query results to assign the values. It's very useful, but it can get expensive as datasets grow.

generate_series() is a Set Returning Function (SRF). Like the results from a table, set data can be JOINed and queried. Additionally, PostgreSQL provides the WITH ORDINALITY clause that can be applied to any SRF to generate an additional, incrementing BIGINT column. The best part? It doesn't require a second pass through the data in order to generate this value!

SELECT ts AS time, rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum);

WITH ORDINALITY is more efficient, but you can accomplish the same results using row_number() over().
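
One way to see the extra work is to compare the two plans (a rough sketch; plan output omitted here). On a typical plan the row_number() version adds a WindowAgg node, plus a Sort for the ORDER BY, on top of the Function Scan, while the WITH ORDINALITY version is a plain Function Scan.

EXPLAIN (ANALYZE)
SELECT ts, row_number() over(order by ts) AS rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') ts;

EXPLAIN (ANALYZE)
SELECT ts AS time, rownum
FROM generate_series('2022-01-01','2022-01-05',INTERVAL '1 day') WITH ORDINALITY AS t(ts,rownum);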

Counters with reset

To build a counter that resets, we can take the incrementing row number and, after some number of rows, reset the count and start over using the modulus operator (%).

This example resets the counter every 10 rows:

WITH counter_rows AS (
	SELECT ts, 
		CASE WHEN rownum % 10 = 0 THEN 10
		     ELSE rownum % 10 END AS row_counter
	FROM generate_series(now() - INTERVAL '5 minutes', now(), INTERVAL '1 second') WITH ORDINALITY AS t(ts, rownum)
)
SELECT ts, row_counter
FROM counter_rows;

Increasing trend over time

SELECT ts, (10 + 10 * random()) * rownum as value FROM generate_series
       ( '2020-01-01'::date
       , '2021-12-31'::date
       , INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);

Simple cycles (sine wave)

-- Subtract 1 from the row number so the wave starts
-- at zero radians and produces a more representative chart
SELECT  ts,
 cos(rownum-1) as value
FROM generate_series('2021-01-01','2021-01-30',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);
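
cos(rownum-1) advances one radian per row, so the period is roughly 6.28 rows. A small variation of my own (not from this page) makes the period explicit by scaling the row number with 2 * pi() / period, here one full cycle every 7 days:

-- One full cosine cycle every 7 rows (here: every 7 days)
SELECT ts,
       cos(2 * pi() * (rownum - 1) / 7) AS value
FROM generate_series('2021-01-01','2021-01-30',INTERVAL '1 day') WITH ORDINALITY AS t(ts, rownum);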