DataOps Principle #9: Analytics is code (cont., cont.)
I love Regular Expressions… but I’ve been told I may lean on them too much. In fact, in the past I would ask syntax-based regular expressions questions during interviews in order to evaluate potential candidates. Now that I’m older and hopefully wiser, I’ve learned that a question such as “what is the difference between the ‘$’ or ‘^’ character in an expression?” is unproductive and does not determine whether or not a candidate would be suited for data engineering. In fact, I’ve worked with several exceptional colleagues over the years who could not recall specific regular expression syntax in an interview. Yet they knew when regular expressions (I’m calling Regex from now on) were the right tool to use and how to build an expression quickly using well know accelerators.
So, what does Regex have to do with Power BI? Well, from my colleagues and other experts in the data industry, I’ve learned that Regex is an essential tool to identify data drift within data pipelines. However, building Regex in the Power BI space is challenging:
1) No built-in support – If you type “Regular Expressions in Power BI” in your search engine of choice, you’ll find some excellent articles by Shivam Shukla and Philip Seamark. Yet these articles don’t cover Regex support with the Power Query language (M) or DAX. These articles explain how to use the R language, but this requires you to install R on your desktop and requires you to learn yet another language. Don’t get me wrong, R is a great language and a nice option for Power BI, but the two core languages of Power BI don’t support Regex.
2) Not production-ready – So if you’re left using R (or Python for that matter) for running Regex in Power BI, then you’re left with a big problem if you want to run this code in the Power BI Service. You need a personal gateway installed to schedule a refresh; furthermore, as of March 2022 you cannot use an enterprise data gateway (much to the consternation of several folks on the idea forum) or the Power BI service to schedule a refresh. In my experience, if you can’t use an enterprise data gateway or the service, the R/Python option is not viable for long-term success.
A Probable Solution (Work Around)
I’ve been experimenting with this approach for a few months now and this approach allows you to run Regex expressions against Power BI datasets leveraging the pipeline from Part 15. This approach involves the following components:
- PowerShell – Yes this is another language, however, it is not one you need to learn because it's written into the pipeline and uses the built-in Regex libraries to run the expressions.
- Regex Health table – Each row contains the data that defines how to inspect the contents of a column using a Regex. I chose this format because the alternative was storing a separate file in the project folder that would have a tabular schema. Instead of introducing a new file format or schema, I felt having a table within the dataset define the Regex checks would be easier to manage and take up less space in the model. The Regex Health table consists of 3 columns:
- Table - The table name as precisely defined in the model.
- Column - The column name as precisely defined in the model.
- Regex - The expression using the .NET format.
Figure 1 – A hidden table in the model to track Regex
- Power BI API – I've been waiting for the right time to take advantage of the newer feature of executing DAX queries over the Power BI Rest API. With this feature we can pull the contents of the Regex Health table but also pull the contents of the columns.
With these components you can then check the contents of a dataset and detect potential data drift with Regex using a flow as illustrated in Figure 2.
Figure 2 – Flow for detecting data drift with Regex
This flow is encapsulated into a single PowerShell function Confirm-RegexPassesWithPPU and added to a preexisting pipeline as defined in Figure 3. Now the pipeline will run the Regex tests after all the DAX tests (see Part 15) have completed.
Figure 3 – A new continuous deployment pipeline with Regex checks
I’d really like to hear your feedback on this possible solution, so I’ve loaded this pipeline, installation instructions, and a sample Power BI dataset on Github. Try it out and let me know what you think on LinkedIn or Twitter.