The Caveman’s Guide to Data Profiling: Part 1

Normally on our blog, our Director of Technology and all-around rock star, Adam D’Angelo, fills you in on the latest technologies and techniques that we are using at Dev Technology Group to deliver value to our partners and exceed expectations by design. But this year, Adam added a new title to his resume: “Dad.” So while Adam’s spare time is being eaten up by diaper changes and late-night feedings, yours truly will be filling in to show you how you can deliver value and jump start a BI or Data Migration project by data profiling. And not like a rock star…but like a Caveman.

So what do I mean by data profiling? Wikipedia defines it as “the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.” And what do I mean by “like a Caveman?” I mean profiling data using nothing more than SQL, some scripting, and the source (or staging) database’s data dictionary. I’ll be relating this to a hypothetical data migration project, but the techniques involved are applicable to BI projects as well. It should also be noted that there are applications and definitions of data profiling beyond ETL, but all projects need a scope, and this is where I’m setting my limit.
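To make the “SQL, some scripting, and the data dictionary” idea concrete, here is a minimal sketch of what that kind of profiling looks like. It uses Python with an in-memory SQLite database and a hypothetical `customer` staging table (both are stand-ins I made up for illustration); on a real project you would point the same counting queries at your legacy source or staging schema and use its own data dictionary views instead of SQLite’s `PRAGMA table_info`.

```python
import sqlite3

# Hypothetical staging data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT, state TEXT);
    INSERT INTO customer VALUES
        (1, 'Ann', 'VA'), (2, 'Bob', NULL), (3, 'Cal', 'VA'), (4, NULL, 'MD');
""")

def profile_column(conn, table, column):
    """Collect caveman-grade statistics for one column:
    row count, null count, and distinct-value count."""
    total, nulls, distinct = conn.execute(
        f"SELECT COUNT(*), "
        f"       SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"       COUNT(DISTINCT {column}) "
        f"FROM {table}"
    ).fetchone()
    return {"column": column, "rows": total, "nulls": nulls, "distinct": distinct}

# Walk the data dictionary (here, SQLite's PRAGMA table_info) so the
# script profiles every column without hard-coding the column list.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
stats = [profile_column(conn, "customer", col) for col in columns]
for s in stats:
    print(s)
```

Even this tiny loop surfaces the questions that matter on a migration: which columns are sparsely populated, which look like candidate keys (distinct count equals row count), and which have a small enough domain to eyeball for bad values.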

Now I know what many people reading this will ask: Why on earth would you manually profile your data when there are tools out there, including IDEs like TOAD, that will profile it for you automatically?

The first reason is that even if procuring a profiling tool is in the project plan, there are times when the profiling tool will not be in place for your team when the project begins. It’s not even a given that information from an automated analysis (used by the DBMS and, by extension, the DBAs to tune query performance) will be available. And even if it is, it won’t tell you whether you have logical issues with the legacy data (more about that in a minute).

And the clock will be ticking. Even on an Agile project, you will have to deliver artifacts such as logical and physical maps and migration plans, because you WILL be told at some point that you have to prove to a skeptical stakeholder that you migrated ALL of the required data correctly. And the sooner those stakeholders can start cross-checking your work and your assumptions, the better off everyone involved will be. It will also only be a matter of time before whoever is leading the front-end development effort for your project asks to leverage a masked sample of the transformed source data to give their developers a more realistic set of test data. You can’t wait for that shiny new profiling tool to get started.

The second reason is pretty simple. When working on an ETL project, you need to know the data. And getting into the data source and seeing what the entities, attributes and values are yourself is the best way to do that. In part 2, I’ll show you how.


This post is the first in a 3-part series by Scott McGuffin. View the entire series here.