Power BI for Data Engineers – Part 2

In part 2 of the series “Power BI for Data Engineers” we will dive into data model design(s) suitable for Power BI. In part 1 we discussed the metadata of the tables in Power BI, but how should these tables set up in the first place? Should you deliver one master table with all data, or should you set up a structure with several smaller tables? And how do these tables relate to each other?

Modelling your data properly is what’s going to make or break your Power BI performance and usability. However, for smaller solutions, the data model is not as of importance compared to larger solutions with over 500 million rows and 100 columns. Therefore, in this blog we focus on the latter scenario.

For clarity, I will be using an example I encountered at a client not too long ago. This client took new call center software into service and they would like to get insights into KPI’s such as:

  • (Average) number of calls per hour
  • (Average) number of calls per department
  • Average waiting time on the phone
  • Average call duration (from entering the queue to closing the call)

The number of calls are expected to be 1 million per month. However, this call center software doesn’t log one row per call, it logs many rows per call, think of the following scenario:

  • The customer calls and there are two people waiting in the queue, so I am also in the queue
  • After going through the queue, the customer is in contact with an employee, but the customer needs to be forwarded to another employee
  • The customer is back in the queue again until the new employee is available
  • The customer is in contact with the new employee and we are able to resolve my question together.

In this use case, the call center software generates six rows, one for each “action” and each step the customer moves forward in the queue. It looks something like this:

ConversationIDEmployeeIDCallerIDDepartmentIDEmployeeActionPositionDatetimeStartDatetimeEnd
f7197135-14e6nullc4abc1d3-50bnullnullWait32024-01-08 13:29:102024-01-08 13:31:54
f7197135-14e6nullc4abc1d3-50bnullnullWait22024-01-08 13:31:542024-01-08 13:34:21
f7197135-14e6nullc4abc1d3-50bnullnullWait12024-01-08 13:34:212024-01-08 13:38:00
f7197135-14e6692c8d98-7bdc4abc1d3-50b9dbbe320-35SarahTalk02024-01-08 13:38:002024-01-08 13:39:10
f7197135-14e6nullc4abc1d3-50bnullnullWait12024-01-08 13:39:102024-01-08 13:41:38
f7197135-14e6ae96b4a6-c36c4abc1d3-50b9dbbe320-35SarahTalk02024-01-08 13:41:382024-01-08 13:49:00

So you can imagine, the number of the rows will grow quickly. Power BI can process quite a bit of data, but with these number of rows you will hit the data model size limit of 1 GB in Power BI Pro or calculations might slow down as the number of rows grow. So how do you create the consumer layer in your data platform in such a way that this data can be analyzed in Power BI in a performant, efficient and user friendly way? The first step is selecting the right data modelling technique, which we will optimize in the second step.

Step 1: decide on your data modeling technique

Decide on your data modeling technique, but actually, for larger data models there is usually one best option. Power BI loves star schemas, so in about 95% of the times, the star schema will be your schema of choice. Star schema modeling is commonly used in relational data warehouses, and is also a great modeling technique for Power BI. I will not go into depth about why that is, but this video from Guy in a Cube explains the high over advantages.

A star schema consists of facts and dimensions, in which facts contain the transactions/events and dimensions the descriptive attributes. Star schemas are not only an optimal schema for Power BI, users of data models that are modelled as star schemas can easily locate descriptive attributes. Information about the Conversation can be found in the Conversation table, information about the Employee can be found in the Employee table. Adding descriptive attributes in the fact table (also known as degenerate dimensions) is also possible, but these attributes might be hard to find for users as these attributes are stored in an uncommon place. On the other hand, adding descriptive values in the fact table could prevent blowing up your semantic model size (more on this in a future blog) and therefore it is always important to find balance between usability and performance.

Looking at the data above, and knowing that Conversations, Employees, and Callers secretly have more attributes then just an ID, I would recommend creating the following tables:

  • Dim Conversation: dimension containing all related attributes to the conversation
  • Dim Employee: dimension contain all related attributes to the employee and their department (knowing that the employee is always part of only one department)
  • Dim Caller: dimension containing all related attributes to the caller
  • Dim Date: dimension contain all related attributes to the start date of the conversation
  • Dim Time: dimension containing all related attributes to the start time of the conversation
  • Fact Conversations: fact table containing the conversations, aggregated to only one row per conversation with the total sum of Wait Time and Talk Time of that specific conversation.

When drawn in a schema it looks like this:

The relationships between the dimensions and fact should always be one-to-many, with of course a few exceptions. The relationships should always between a dimension and a fact, or if chosen for a specific use case, between two dimensions to create a snowflake schema. It is not advised to connect two fact tables with each other, always insert a dimension, which could act as a bridge table. Connecting two fact tables directly can lead to undesired, incomprehensible results when building calculations and reports.

Step 2: optimizing the model

In the model above, I made some implicit choices to optimize the model even further to increase performance and usability. Let’s go over them:

  1. No GUIDs as IDs: As explained in Part 1, GUIDs take up a lot of space in a Power BI semantic model (I shortened the GUIDs in the above table for visual purposes). Therefore, changing GUIDs to proper IDs starting from e.g. 1 and counting upwards will save a lot of space.
  2. No datetime values: Datetime values are super unique, especially if you include seconds. When you look at one year, we have 365 days * 24 hours * 60 minutes * 60 seconds = possibly over 31 million unique values for each datetime column per year. These unique values take up a lot of space in the Power BI semantic model, therefore, for this case it is advised to split the date and time into separate dimensions. As an additional bonus, you will now be able to group the number of conversations per week, day and hour as you can select these attributes from the dimension. The requirements didn’t mention anything about the date en time the conversation ended, so this information is not added to the model.
  3. Aggregations: I aggregated the conversations table to a fact table in which there is only one row per conversation instead of multiple. Looking at the requirements, the users are not interested in the exact time of each action, only the (average) duration. Therefore, I added two additional columns “Wait Time” and “Talk time” which sums up the time these actions took within that conversation.
  4. Not too many decimals: The values, wait time and talk time are in seconds as no further detail is necessary. Therefore values such as 60, 71, and 90 are supplied, instead of 60,335352 / 71,3723732 / 90,37372 to save space in the semantic model.
  5. All required KPIs are now easy to calculate: Calculating the number of calls per department and per hour can now be easily counted in Power BI using a COUNT or COUNTROWS function. Average number of calls per department and per hour can be calculated using the AVERAGE function and also the average wait and talk time can be calculated in a similar fashion. Providing the data in a suitable format helps to keep your calculations in Power BI easy, simple and quick. But, watch out! DAX is very powerful, so definitely don’t precalculate everything in the platform! It can lead to unnecessarily large models and less flexibility in the model. Decide with the Power BI specialist where to allocate certain calculations.

So, this was part 2 of the series! Preparing the table structure, relationships between tables and columns in a suitable format for Power BI will help the Power BI developer enormously in building fast, efficient, usable and performant solutions. But, there is never one perfect way.. As the Microsoft documentation beautifully states: It’s important to understand that optimal model design is part science and part art. Sometimes you can break with good guidance when it makes sense to do so.

Welcome!

Thank you for visiting this (for now empty) blog! This is my very first blog post and I hope many more will come. I have bought this domain powerbiprincess.com a while ago, but never really started doing anything with it, until now! But what can you expect from me and this blog? Well, my beautiful, awesome colleagues nicknamed me “Paulien – the Power BI Princess” a while ago.

(more…)