Generating Data Sets Using mkjson

  Friday, May 1, 2020

As a developer working on a database, I often need access to a data set to test various features. Initially, I created cr8 insert-fake-data to solve this problem. cr8 insert-fake-data reads the schema of a table and then utilizes the faker library to generate random records that match the table’s schema and inserts them.

cr8 insert-fake-data is easy to use and often accomplishes exactly what I need. However, in some cases, it is not as flexible as I need it to be, which led me to create mkjson, another tool to generate random records.


mkjson is a command line tool that generates JSON records. It takes pairs of <key>=<value> as arguments. You can use it to generate static records, like so:

↪  mkjson x=10  y=20

The primary use case, however, is to generate randomized data. For this, it provides the option to use functions as <value>. As an example, to generate random integers between one and ten you can use the randomInt function like so:

↪  mkjson --num 10 x="randomInt(1, 10)"

Parser combinators inspired the available functions. The idea is to have a few building blocks that can be combined to create more complex constructs.

For example, say you want a number that is either between one and five or between 80 and 85. For that mkjson provides the oneOf function:

↪  mkjson --num 10 x="oneOf(randomInt(1, 5), randomInt(80, 85))"

Generating data sets for JOIN statements

One of the use cases I had where cr8 insert-fake-data fell short was to generate a data set that is suited for an INNER JOIN. I wanted to have one large table and one small table.

Suppose the small table looks like this:

create table sensors (
  id text primary key,
  name text,
  description text

And the large table looks like this:

create table metrics (
  sensor_id text,
  value double precision,
  ts timestamp without time zone

With mkjson the data for the sensors table could be generated like so:

↪  mkjson --num 100 \
  id="uuid4()" \
  name="fromRegex('[a-z]{6,14}')" \
  description="oneOf(null(), fromRegex('[a-z]{20,80}'))" | tee sensors.jsonl

Now, the data for the larger table should reference records in the sensors table. For this to work we can extract the ids from the sensors.jsonl file:

cat sensors.jsonl| jq -r .id > sensor_ids.txt

And then lookup the ids when generating the data for the metrics table, like so:

↪  mkjson --num 1000000 \
  sensor_id="oneOf(fromFile('sensor_ids.txt'))" \
  value="randomDouble()" \
  ts="randomDateTime('2019-01-01', '2020-05-01 23:00')" | cr8 insert-json --table metrics --hosts localhost:4200

Wrap up

This was a short introduction showing how you could use mkjson to generate random data quickly. It may be too simplistic to model more advanced scenarios. But when all you need is a bunch of data with some simple properties, it could be the right tool and save you some time that you’d otherwise spend writing a custom generator.