In Part 1, we discussed why you need to get started immediately with profiling your data on an ETL project and why you’ll probably need to jump in with Caveman tools. These tools are simple SQL statements, something to handle the control logic, and the data dictionary of the database where your source data resides (or is staged). So let’s start pseudo coding some examples of what to look for, how to look for it, and more importantly, how to automatically build the queries you’ll need for manual profiling.
Your DBMS will almost certainly have a table or view that gives you the names of the entities, attributes, and data types for every attribute in the database. For simplicity, going forward, I’ll assume the data is relational and we’re calling them tables, columns, and data types, respectively. I will also assume that your DBMS has some sort of native procedural language like T-SQL or PL/SQL that you will use for control logic, variables, and exception handling, but there are other languages that can do this for you as well.
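In Oracle, for example, the all_tab_columns dictionary view exposes exactly this, and most other engines have an ANSI information_schema.columns equivalent. As a quick, hedged illustration (the schema name below is a placeholder for your source schema), a first look at the source metadata might be as simple as:
-- Oracle flavor; most other DBMSs expose information_schema.columns instead
SELECT table_name, column_name, data_type
FROM all_tab_columns
WHERE owner = 'SOURCE_SCHEMA' -- placeholder
ORDER BY table_name, column_id;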
So for our hypothetical dictionary view we’ll leverage SQL to generate more SQL by encoding a SELECT statement in a string and concatenating the values from the dictionary view into the string, then saving the output to a file that you will use later to actually run the profiling SQL. We’ll also assume that you are using a script, but you could create an application if need be.
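To make the pattern concrete before we get to the pseudo code, here is a minimal, hedged sketch: the string literal carries the fixed parts of the statement, and the concatenated dictionary columns supply the moving parts. This one generates a row-count query for every table in a placeholder source schema:
-- Each row returned is itself a runnable profiling query
SELECT 'SELECT COUNT(*) FROM ' || table_name || ';' AS generated_sql
FROM all_tables
WHERE owner = 'SOURCE_SCHEMA'; -- placeholder
Spool those rows to a file and the file becomes the profiling script you run in the next pass.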
Let’s start by finding out if you have any really old or really young people working for your client. This will require finding the range of values in the source tables. But first, you’re going to need some pseudo code to generate the queries that will find those ranges. It will look something like this.
First, define variables to log (or spool) your results into SQL scripts that can be run later to perform the actual profiling.
—
v_table_name ARRAY
v_directory STRING <drive/folder> -- This will be a constant
v_spool_sql STRING
v_spool_log STRING
Second, populate the array with the names of your source tables and add loop logic to your script.
START
SELECT table_name
INTO v_table_name
FROM all_tables
WHERE owner = <source_schema>
FOR all_rows IN v_table_name LOOP
Finally, extract the metadata to build your final queries.
v_spool_sql := v_directory || '\' || v_table_name || '.sql'
v_spool_log := v_directory || '\' || v_table_name || '.log'
TURN ON OUTPUT
SPOOL v_spool_sql -- Create your SQL script files
SELECT 'SELECT ''' || column_name || ''', MIN(' || column_name || '), MAX(' || column_name || ') FROM ' || v_table_name || ';'
FROM table_columns_with_datatypes -- this is your data dictionary view
WHERE table_name = v_table_name
AND datatype IN ('DATE', 'NUMBER');
So the final resulting script would look like this.
—
v_table_name ARRAY
v_directory STRING <drive/folder> -- This will be a constant
v_spool_sql STRING
v_spool_log STRING
START
SELECT table_name
INTO v_table_name
FROM all_tables
WHERE owner = <source_schema>
FOR all_rows IN v_table_name LOOP
v_spool_sql := v_directory || '\' || v_table_name || '.sql'
v_spool_log := v_directory || '\' || v_table_name || '.log'
TURN ON OUTPUT
SPOOL v_spool_sql -- Create your SQL script files
SELECT 'SPOOL ' || v_spool_log FROM dual; -- or similar dummy table
SELECT 'SELECT ''' || column_name || ''', MIN(' || column_name || '), MAX(' || column_name || ') FROM ' || v_table_name || ';'
FROM table_columns_with_datatypes -- this is your data dictionary view
WHERE table_name = v_table_name
AND datatype IN ('DATE', 'NUMBER');
SPOOL OFF
END LOOP
END
—
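For reference, if the source happens to be Oracle, the same idea can be expressed without a loop at all: one SQL*Plus script generates every MIN/MAX query in a single pass from the real all_tab_columns dictionary view. Treat this as a sketch under that assumption (the paths and schema name are placeholders), not a drop-in replacement for the pseudo code above; it also writes one combined script rather than one file per table.
-- SQL*Plus settings so only the generated statements land in the spool file
SET HEADING OFF
SET FEEDBACK OFF
SET PAGESIZE 0
SET LINESIZE 400
SET TRIMSPOOL ON
-- Placeholder path; point it at your working folder
SPOOL c:\profiling\min_max_queries.sql
SELECT 'SELECT ''' || table_name || '.' || column_name || ''', '
    || 'MIN(' || column_name || '), MAX(' || column_name || ') '
    || 'FROM ' || owner || '.' || table_name || ';'
FROM all_tab_columns
WHERE owner = 'SOURCE_SCHEMA' -- placeholder
AND data_type IN ('DATE', 'NUMBER');
SPOOL OFF
-- Now run the generated script and capture the profiling results
SPOOL c:\profiling\min_max_results.log
@c:\profiling\min_max_queries.sql
SPOOL OFF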
And using our example problem of checking for really old or really young employees, a sample of the resulting query on a table called “employee”, with a little formatting thrown in for clarity, might look like this.
SELECT 'emp_date_of_birth' AS column_name,
       MIN(emp_date_of_birth) AS min_dob,
       MAX(emp_date_of_birth) AS max_dob
FROM employee;
Note that the reason I say “sample” is because the Caveman techniques demonstrated here will be generating a LOT of code, so for brevity and clarity, we’ll focus on this sample problem. Let’s see what the output of this generated query might be. Once again, I’ve thrown in a little extra formatting for clarity.
column_name          min_dob       max_dob
emp_date_of_birth    08/16/1888    05/10/2062
YIKES! It looks like somewhere along the line, the legacy application wasn’t handling two-digit years correctly and was mangling the format as it stored the data. Does this seem extreme? Maybe so, but did you know that Oracle once had a platform-specific bug that randomly converted some dates to BC dates? Strange things can and DO happen to data, and the sooner you find out about them, the better.
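Once a suspicious range like this turns up, a quick follow-up query tells you how big the mess is. This is only an illustrative sketch using the example table and column from above, with arbitrary cut-off dates you would adjust for your client:
-- Count employees whose recorded birth date falls outside a plausible range
SELECT COUNT(*) AS suspect_rows
FROM employee
WHERE emp_date_of_birth < DATE '1900-01-01'
OR emp_date_of_birth > DATE '2010-01-01';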
In Part 3, we’ll wrap this discussion up by discussing some of the other data quality issues to look for and how to generate queries to check for them.
This post is the second in a 3-part series by Scott McGuffin. View the entire series here.