Overview

A common problem when starting a new project is getting fixtures in place to facilitate testing of reporting functionality and refining data models. To ease this, I've created a PDI job that creates the dimension tables, and populates a fact table.

My inspiration is from a variety of sources (for example the CDG project from Webdetails).

Here is the job that controls everything. It doesn't do anything special except link the transformations to each other.

Data warehouse generator job

Next up, the dimension data generator transformation.

Dimension data generator transformation

This transformation creates a template JSON document which is parsed and used in other steps, as well as defines the size of the date dimension and the start date. Here is the step by step breakdown.

To start, we generate a dummy row to kick off the transformation.

gendimdata-generate-rows

The following step is where all the dimensions are defined (might need to zoom this image a bit to see the text). We're defining a JSON structure in here that contains all the data we'll need later on for both populating the fact table as well as creating the dimensions.

gen data template

Once the data is serialized, the remaining steps configures the start date and number of days, then passes this information along to the next transformation.

set constants

Next up, we've got the dimension populator transformation.

Populate dimensions

Grab the input rows from previous transformation.

Grab input

Here is what the dimension generator looks like. As you can see, there's one of these steps for each dimension. In the example I've posted, I'm creating a dimension that contains a hierarchy with several levels.

Dimension data generator

Then we save the dimension data to the database

Dim table output

For the date and time dimensions, we do things slightly differently. To generate all the dates in the range requested, we kick off a javascript step that generates empty rows, one for each day requested.

Date row gen

We create a sequence, which is going to be our technical key later on. It represents the number of days since the first day generated.

Date sequence

We next do some date math to enrich our working set a bit. In my date dimension, I'm actually not using all of these, but it's handy to have them around.

Date calc

Finally we output to the dimension table.

Date table output

Similarly for time, we generate one row for each hour of the day.

Time generator

And save it to the dimension table.

Time table output

Finally, we have the transformation to populate the fact table.

FactGen

Grab the input rows from the previous transformation.

Grab input rows

Generate a series of days in the range requested.

Generate day rows

Create a sequence.

Generate day sequence

Do some date math.

Date calc

Metadata cleanups

Select values

Select values other tab

Generate row data for each dimension. So I've chosen to generate data for every hour of every day in the requested date range as well as for one of the dimensions (banner in this case).

Fact generator

Here's the second (start) javascript script on that step.

Fact generator tab 2

Closing thoughts

The above approach obviously requires a bit of copy pasta and customization to get it working for a particular use case.

I would have liked to completely automate the creation of dimensions and data generation based on the json schema, but
that requires to add extra information in the JSON structure (to model the hierarchies and levels). In a similar vein,
it would have been nice to abstract away the data generation to make it plug and pray with no customizations, but I don't
know if the javascript API exposes the necessary PDI hooks to do everything, and I'm too lazy to write this in Java.