Generating Data Sets Using mkjson
As a developer working on a database, I often need access to a data set to test various features. Initially, I created cr8 insert-fake-data to solve this problem. cr8 insert-fake-data
reads the schema of a table and then utilizes the faker library to generate random records that match the table’s schema and inserts them.
cr8 insert-fake-data is easy to use and often accomplishes exactly what I need. However, in some cases, it is not as flexible as I need it to be, which led me to create mkjson, another tool to generate random records.
Introduction ¶
mkjson is a command line tool that generates JSON records. It takes pairs of <key>=<value>
as arguments. You can use it to generate static records, like so:
↪ mkjson x=10 y=20
{"x":10,"y":20}
The primary use case, however, is to generate randomized data. For this, it provides the option to use functions as <value>
. As an example, to generate random integers between one and ten you can use the randomInt
function like so:
↪ mkjson --num 10 x="randomInt(1, 10)"
{"x":3}
{"x":9}
{"x":4}
{"x":6}
{"x":9}
{"x":8}
{"x":5}
{"x":4}
{"x":10}
{"x":1}
Parser combinators inspired the available functions. The idea is to have a few building blocks that can be combined to create more complex constructs.
For example, say you want a number that is either between one and five or between 80 and 85. For that mkjson
provides the oneOf
function:
↪ mkjson --num 10 x="oneOf(randomInt(1, 5), randomInt(80, 85))"
{"x":1}
{"x":83}
{"x":2}
{"x":4}
{"x":83}
{"x":5}
{"x":3}
{"x":3}
{"x":83}
{"x":84}
Generating data sets for JOIN statements ¶
One of the use cases I had where cr8 insert-fake-data
fell short was to generate a data set that is suited for an INNER JOIN. I wanted to have one large table and one small table.
Suppose the small table looks like this:
create table sensors (
id text primary key,
name text,
description text
);
And the large table looks like this:
create table metrics (
sensor_id text,
value double precision,
ts timestamp without time zone
);
With mkjson
the data for the sensors
table could be generated like so:
↪ mkjson --num 100 \
id="uuid4()" \
name="fromRegex('[a-z]{6,14}')" \
description="oneOf(null(), fromRegex('[a-z]{20,80}'))" | tee sensors.jsonl
{"name":"igrebgjstln","id":"485141c9-6d4b-42d3-b18d-9ce0f216fb97","description":"vwimhoblbynoalrbzrjblaaynkdqcwudldtrfsfxmjgygmvelujxivqnkyvb"}
{"name":"mpwdwkd","id":"d0c5a93a-018f-4dd9-bc2c-8b32584e2a46","description":null}
{"name":"gdhnavqm","id":"68d62bfe-1e6a-4495-bc58-ca63a7004e75","description":"npswouxyumopmjvcgbubtwxnzlysijpqzkzbnbdgskwkhfyfmlnxbzleuxrtbzt"}
{"name":"gsgqjgcwndtxg","id":"0a12e9b7-504d-462f-afa4-ca2f93ef7fe6","description":"hmoerwsismcttcmpdtrefcbqtxgnurhrqulzpuyysxdpkyljbbxbpuzsxxtkxprzqglvyk"}
{"name":"wagbwkjeyuynoq","id":"e74607e0-22e2-4528-9554-c9ba0776e129","description":"lpospgtwbbeniujpfrppcjhnyjuizkrdnotgehettjmrtvuzinapjqbsxrrbhb"}
...
Now, the data for the larger table should reference records in the sensors
table. For this to work we can extract the ids from the sensors.jsonl
file:
cat sensors.jsonl| jq -r .id > sensor_ids.txt
And then lookup the ids when generating the data for the metrics
table, like so:
↪ mkjson --num 1000000 \
sensor_id="oneOf(fromFile('sensor_ids.txt'))" \
value="randomDouble()" \
ts="randomDateTime('2019-01-01', '2020-05-01 23:00')" | cr8 insert-json --table metrics --hosts localhost:4200
Wrap up ¶
This was a short introduction showing how you could use mkjson to generate random data quickly. It may be too simplistic to model more advanced scenarios. But when all you need is a bunch of data with some simple properties, it could be the right tool and save you some time that you’d otherwise spend writing a custom generator.