Generating Data Sets Using mkjson
Table of content
As a developer working on a database, I often need access to a data
set to test various features. Initially, I created cr8
insert-fake-data to solve this problem.
cr8 insert-fake-data
reads the schema of a table and then
utilizes the faker library
to generate random records that match the table’s schema and inserts
them.
cr8 insert-fake-data is easy to use and often accomplishes exactly what I need. However, in some cases, it is not as flexible as I need it to be, which led me to create mkjson, another tool to generate random records.
Introduction ¶
mkjson is a
command line tool that generates JSON records. It takes pairs of
<key>=<value>
as arguments. You can use it to
generate static records, like so:
↪ mkjson x=10 y=20
{"x":10,"y":20}
The primary use case, however, is to generate randomized data. For
this, it provides the option to use functions as
<value>
. As an example, to generate random integers
between one and ten you can use the randomInt
function like
so:
↪ mkjson --num 10 x="randomInt(1, 10)"
{"x":3}
{"x":9}
{"x":4}
{"x":6}
{"x":9}
{"x":8}
{"x":5}
{"x":4}
{"x":10}
{"x":1}
Parser combinators inspired the available functions. The idea is to have a few building blocks that can be combined to create more complex constructs.
For example, say you want a number that is either between one and
five or between 80 and 85. For that mkjson
provides the
oneOf
function:
↪ mkjson --num 10 x="oneOf(randomInt(1, 5), randomInt(80, 85))"
{"x":1}
{"x":83}
{"x":2}
{"x":4}
{"x":83}
{"x":5}
{"x":3}
{"x":3}
{"x":83}
{"x":84}
Generating data sets for JOIN statements ¶
One of the use cases I had where cr8 insert-fake-data
fell short was to generate a data set that is suited for an INNER JOIN.
I wanted to have one large table and one small table.
Suppose the small table looks like this:
create table sensors (
id text primary key,
name text,
description text );
And the large table looks like this:
create table metrics (
sensor_id text,value double precision,
timestamp without time zone
ts );
With mkjson
the data for the sensors
table
could be generated like so:
↪ mkjson --num 100 \
"uuid4()" \
id="fromRegex('[a-z]{6,14}')" \
name="oneOf(null(), fromRegex('[a-z]{20,80}'))" | tee sensors.jsonl
description={"name":"igrebgjstln","id":"485141c9-6d4b-42d3-b18d-9ce0f216fb97","description":"vwimhoblbynoalrbzrjblaaynkdqcwudldtrfsfxmjgygmvelujxivqnkyvb"}
{"name":"mpwdwkd","id":"d0c5a93a-018f-4dd9-bc2c-8b32584e2a46","description":null}
{"name":"gdhnavqm","id":"68d62bfe-1e6a-4495-bc58-ca63a7004e75","description":"npswouxyumopmjvcgbubtwxnzlysijpqzkzbnbdgskwkhfyfmlnxbzleuxrtbzt"}
{"name":"gsgqjgcwndtxg","id":"0a12e9b7-504d-462f-afa4-ca2f93ef7fe6","description":"hmoerwsismcttcmpdtrefcbqtxgnurhrqulzpuyysxdpkyljbbxbpuzsxxtkxprzqglvyk"}
{"name":"wagbwkjeyuynoq","id":"e74607e0-22e2-4528-9554-c9ba0776e129","description":"lpospgtwbbeniujpfrppcjhnyjuizkrdnotgehettjmrtvuzinapjqbsxrrbhb"}
...
Now, the data for the larger table should reference records in the
sensors
table. For this to work we can extract the ids from
the sensors.jsonl
file:
cat sensors.jsonl| jq -r .id > sensor_ids.txt
And then lookup the ids when generating the data for the
metrics
table, like so:
↪ mkjson --num 1000000 \
"oneOf(fromFile('sensor_ids.txt'))" \
sensor_id="randomDouble()" \
value="randomDateTime('2019-01-01', '2020-05-01 23:00')" | cr8 insert-json --table metrics --hosts localhost:4200 ts=
Wrap up ¶
This was a short introduction showing how you could use mkjson to generate random data quickly. It may be too simplistic to model more advanced scenarios. But when all you need is a bunch of data with some simple properties, it could be the right tool and save you some time that you’d otherwise spend writing a custom generator.