
The Caveman’s Guide to Data Profiling: Part 2

In Part 1, we discussed why you need to get started immediately with profiling your data on an ETL project and why you’ll probably need to jump in with Caveman tools. These tools are simple SQL statements, something to handle the control logic, and the data dictionary where your source data resides (or is staged). So let’s start pseudocoding some examples of what to look for and how to look for it. And more importantly, how to automatically build the queries you’ll need for manual profiling.

Your DBMS will almost certainly have a table or view that gives you the names of the entities, attributes, and data types for every attribute in the database. For simplicity, going forward, I’ll assume the data is relational and we’re calling them tables, columns, and data types respectively. I will also assume that your DBMS has some sort of native procedural language like T-SQL or PL/SQL that you will use for control logic, variables, and exception handling, but there are other languages that can do this for you as well.
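On Oracle, for instance, the dictionary views all_tables and all_tab_columns fill this role; other platforms have equivalents like INFORMATION_SCHEMA.COLUMNS. Here’s a quick sketch of what the raw material looks like (the schema name is a placeholder):

SELECT table_name, column_name, data_type
FROM all_tab_columns
WHERE owner = 'SOURCE_SCHEMA'
ORDER BY table_name, column_id;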

So for our hypothetical dictionary view we’ll leverage SQL to generate more SQL by encoding a SELECT statement in a string and concatenating the values from the dictionary view into the string, then saving the output to a file that you will use later to actually run the profiling SQL. We’ll also assume that you are using a script, but you could create an application if need be.
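At its simplest, the trick is just string concatenation: each row the query returns is itself a runnable SQL statement. A tiny sketch, again assuming Oracle’s all_tables view and a placeholder schema name:

SELECT 'SELECT COUNT(*) FROM ' || table_name || ';' AS generated_sql
FROM all_tables
WHERE owner = 'SOURCE_SCHEMA';

Spool those rows to a file and you have a script that row-counts every source table.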

Let’s start by finding out if you have any really old or really young people working for your client. This will require finding the range of values in the source tables. But first, you’re going to have to write the pseudocode that generates those queries, and it will look something like this.

First, define variables to log (or spool) your results into SQL scripts that can be run later to do the actual profiling:

v_table_name    ARRAY
v_directory     STRING  <drive/folder>  -- This will be a constant
v_spool_sql     STRING
v_spool_log     STRING

 

Second, populate the array with the names of your source tables and add loop logic to your script.

 

START

SELECT table_name
INTO v_table_name
FROM all_tables
WHERE owner = <source_schema>;

FOR all_rows IN v_table_name LOOP

 

Finally, extract metadata to build your final queries:

 

v_spool_sql     STRING  v_directory || '\' || v_table_name || '.sql'
v_spool_log     STRING  v_directory || '\' || v_table_name || '.log'

 

TURN ON OUTPUT

SPOOL v_spool_sql  -- Create your SQL script files

 

 

SELECT 'SELECT ''' || column_name || ''' AS column_name, ' ||
       'MIN(' || column_name || '), ' ||
       'MAX(' || column_name || ') ' ||
       'FROM ' || v_table_name || ';'
FROM table_columns_with_datatypes  -- this is your data dictionary view
WHERE table_name = v_table_name
AND datatype IN ('date', 'number');

 

So the final resulting script would look like this.

 

v_table_name    ARRAY
v_directory     STRING  <drive/folder>  -- This will be a constant
v_spool_sql     STRING
v_spool_log     STRING

 

START

SELECT table_name
INTO v_table_name
FROM all_tables
WHERE owner = <source_schema>;

FOR all_rows IN v_table_name LOOP

 

v_spool_sql     STRING  v_directory || '\' || v_table_name || '.sql'
v_spool_log     STRING  v_directory || '\' || v_table_name || '.log'

 

TURN ON OUTPUT

SPOOL v_spool_sql  -- Create your SQL script files

SELECT 'SPOOL ' || v_spool_log FROM dual;  -- or similar dummy table

 

SELECT 'SELECT ''' || column_name || ''' AS column_name, ' ||
       'MIN(' || column_name || '), ' ||
       'MAX(' || column_name || ') ' ||
       'FROM ' || v_table_name || ';'
FROM table_columns_with_datatypes  -- this is your data dictionary view
WHERE table_name = v_table_name
AND datatype IN ('date', 'number');

 

SPOOL OFF

 

END LOOP

 

END
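If your source happens to be Oracle, here is a minimal concrete sketch of the pseudocode above, run from SQL*Plus. It leans on DBMS_OUTPUT, so it spools one combined generator script rather than one script file per table, but the generated script still writes a separate .log per table when you run it. The c:\profiling path and SOURCE_SCHEMA are placeholders you’d swap for your own.

SET SERVEROUTPUT ON SIZE UNLIMITED
SET TRIMSPOOL ON
SPOOL c:\profiling\profile_all.sql

BEGIN
  FOR t IN (SELECT table_name FROM all_tables
            WHERE owner = 'SOURCE_SCHEMA') LOOP
    -- Each table's generated queries log to their own file
    DBMS_OUTPUT.PUT_LINE('SPOOL c:\profiling\' || LOWER(t.table_name) || '.log');
    FOR c IN (SELECT column_name FROM all_tab_columns
              WHERE owner = 'SOURCE_SCHEMA'
                AND table_name = t.table_name
                AND data_type IN ('DATE', 'NUMBER')) LOOP
      -- Emit one min/max profiling query per date or numeric column
      DBMS_OUTPUT.PUT_LINE(
        'SELECT ''' || c.column_name || ''' AS column_name, ' ||
        'MIN(' || c.column_name || ') AS min_value, ' ||
        'MAX(' || c.column_name || ') AS max_value ' ||
        'FROM ' || t.table_name || ';');
    END LOOP;
    DBMS_OUTPUT.PUT_LINE('SPOOL OFF');
  END LOOP;
END;
/
SPOOL OFF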

And using our example problem of checking for really old or really young employees, a sample of the resulting query on a table called “employee”, with a little formatting thrown in for clarity, might look like this.

 

SELECT 'emp_date_of_birth' AS column_name,
       MIN(emp_date_of_birth) AS min_dob,
       MAX(emp_date_of_birth) AS max_dob
FROM employee;

 

Note that the reason I say “sample” is because the Caveman techniques demonstrated here will generate a LOT of code, so for brevity and clarity we’ll focus on this one sample problem. Let’s see what the output of this generated query might be. Once again, I’ve thrown in a little extra formatting for clarity.

 

column_name          min_dob       max_dob
emp_date_of_birth    08/16/1888    05/10/2062

 

YIKES!  It looks like somewhere along the line, the legacy application was accepting two-digit years and getting the century wrong as it stored the data. Does this seem extreme? Maybe so, but did you know that Oracle once had a platform-specific bug that randomly converted some dates to BC dates? Strange things can and DO happen to data, and the sooner you find out about them, the better.
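And once a range like that turns up, the follow-up is equally Caveman: pull the offending rows so someone can eyeball them. A sketch against our hypothetical employee table, with arbitrary cutoff dates you’d tune to the client’s business:

SELECT *
FROM employee
WHERE emp_date_of_birth < DATE '1920-01-01'  -- arbitrary lower bound
   OR emp_date_of_birth > CURRENT_DATE;      -- nobody should be born in the future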

In Part 3, we’ll wrap this discussion up by looking at some of the other data quality issues to watch for and how to generate queries to check for them.

 

This post is the second in a 3-part series by Scott McGuffin. View the entire series here.