Generating Data Sets Using mkjson

Friday, May 1, 2020

As a developer working on a database, I often need access to a data set to test various features. Initially, I created cr8 insert-fake-data to solve this problem. cr8 insert-fake-data reads the schema of a table and then utilizes the faker library to generate random records that match the table’s schema and inserts them.

cr8 insert-fake-data is easy to use and often accomplishes exactly what I need. However, in some cases, it is not as flexible as I need it to be, which led me to create mkjson, another tool to generate random records.

Introduction ¶

mkjson is a command line tool that generates JSON records. It takes pairs of <key>=<value> as arguments. You can use it to generate static records, like so:

↪  mkjson x=10  y=20
{"x":10,"y":20}

The primary use case, however, is to generate randomized data. For this, it provides the option to use functions as <value>. As an example, to generate random integers between one and ten you can use the randomInt function like so:

↪  mkjson --num 10 x="randomInt(1, 10)"
{"x":3}
{"x":9}
{"x":4}
{"x":6}
{"x":9}
{"x":8}
{"x":5}
{"x":4}
{"x":10}
{"x":1}

Parser combinators inspired the available functions. The idea is to have a few building blocks that can be combined to create more complex constructs.

For example, say you want a number that is either between one and five or between 80 and 85. For that mkjson provides the oneOf function:

↪  mkjson --num 10 x="oneOf(randomInt(1, 5), randomInt(80, 85))"
{"x":1}
{"x":83}
{"x":2}
{"x":4}
{"x":83}
{"x":5}
{"x":3}
{"x":3}
{"x":83}
{"x":84}

Generating data sets for JOIN statements ¶

One of the use cases I had where cr8 insert-fake-data fell short was to generate a data set that is suited for an INNER JOIN. I wanted to have one large table and one small table.

Suppose the small table looks like this:

create table sensors (
  id text primary key,
  name text,
  description text
);

And the large table looks like this:

create table metrics (
  sensor_id text,
  value double precision,
  ts timestamp without time zone
);

With mkjson the data for the sensors table could be generated like so:

↪  mkjson --num 100 \
  id="uuid4()" \
  name="fromRegex('[a-z]{6,14}')" \
  description="oneOf(null(), fromRegex('[a-z]{20,80}'))" | tee sensors.jsonl
{"name":"igrebgjstln","id":"485141c9-6d4b-42d3-b18d-9ce0f216fb97","description":"vwimhoblbynoalrbzrjblaaynkdqcwudldtrfsfxmjgygmvelujxivqnkyvb"}
{"name":"mpwdwkd","id":"d0c5a93a-018f-4dd9-bc2c-8b32584e2a46","description":null}
{"name":"gdhnavqm","id":"68d62bfe-1e6a-4495-bc58-ca63a7004e75","description":"npswouxyumopmjvcgbubtwxnzlysijpqzkzbnbdgskwkhfyfmlnxbzleuxrtbzt"}
{"name":"gsgqjgcwndtxg","id":"0a12e9b7-504d-462f-afa4-ca2f93ef7fe6","description":"hmoerwsismcttcmpdtrefcbqtxgnurhrqulzpuyysxdpkyljbbxbpuzsxxtkxprzqglvyk"}
{"name":"wagbwkjeyuynoq","id":"e74607e0-22e2-4528-9554-c9ba0776e129","description":"lpospgtwbbeniujpfrppcjhnyjuizkrdnotgehettjmrtvuzinapjqbsxrrbhb"}
...

Now, the data for the larger table should reference records in the sensors table. For this to work we can extract the ids from the sensors.jsonl file:

cat sensors.jsonl| jq -r .id > sensor_ids.txt

And then lookup the ids when generating the data for the metrics table, like so:

↪  mkjson --num 1000000 \
  sensor_id="oneOf(fromFile('sensor_ids.txt'))" \
  value="randomDouble()" \
  ts="randomDateTime('2019-01-01', '2020-05-01 23:00')" | cr8 insert-json --table metrics --hosts localhost:4200

Wrap up ¶

This was a short introduction showing how you could use mkjson to generate random data quickly. It may be too simplistic to model more advanced scenarios. But when all you need is a bunch of data with some simple properties, it could be the right tool and save you some time that you’d otherwise spend writing a custom generator.