The Caveman's Guide to Data Profiling: Part 3

In Part 2, we discussed how to create a script that generates SQL to check for potentially invalid source data before beginning work on an ETL project. In this final installment, we'll discuss some of the other data issues you will want to evaluate and what your final metadata script might look like.
As we said at the beginning of this series, when available, automated tools for profiling are certainly the best as they will offer you a menu of data profiling options to choose from. However, when they are not available, you will need to pick what metadata questions matter the most to your project. Here are the most likely candidates:
Range of values. Beyond the extreme dates-of-birth example we explored in Part 2, you may want to evaluate other questions such as "What are the biggest and smallest accounts?" or "Who has been a customer the longest?" Fortunately, the template we used in Part 2 can be the starting point for all of those questions.
What is the cardinality of the data, in both senses of the word? That is, both the relationships between parent and child data and the distribution of any code values in the data. And while it's beyond the scope of this series, with some more work a histogram of the source data could also be generated.
And finally, do the values in the source data match the values described in the source documentation?
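Before tackling that list, here is a minimal sketch of the first candidate, the range-of-values check, as a Python generator. The `columns` rows are hypothetical stand-ins for a query against your data dictionary view; the table and column names are illustrative only.

```python
# Sketch: generate range-of-values checks (MIN/MAX) for date and number
# columns. The rows below are fabricated stand-ins for a query against
# table_columns_with_datatypes.
columns = [
    {"table": "customer", "column": "date_of_birth", "datatype": "date"},
    {"table": "account",  "column": "balance",       "datatype": "number"},
    {"table": "customer", "column": "last_name",     "datatype": "varchar"},
]

def range_checks(rows):
    """Emit one MIN/MAX probe per date or number column."""
    for row in rows:
        if row["datatype"] in ("date", "number"):
            yield (f"SELECT MIN({row['column']}), MAX({row['column']}) "
                   f"FROM {row['table']};")

for sql in range_checks(columns):
    print(sql)
```

Note that the varchar column is skipped: MIN/MAX on free text rarely tells you anything useful.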
Let's explore that last bullet point first, since there are two ways to do this. The first is simply to combine this task with the one outlined in the second bullet point and manually check for any invalid values later. Remember, you're going to need to document your migration/ETL rules element by element, so "eyeballing" the data is more practical than it might seem. The second way is to exclude the values listed in the documentation in the WHERE clause, so that only the invalid values are displayed. The pseudocode for that will look something like this:
SELECT 'SELECT DISTINCT ' || column_name
    || ' FROM ' || v_table_name
    || ' WHERE ' || column_name || ' NOT IN (<list of values>);'
FROM table_columns_with_datatypes -- this is your data dictionary view
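The same template translates directly into a small Python helper. Everything here is illustrative: the `employee` table, the `job_title` column, and the documented value list are assumptions standing in for your own source documentation.

```python
# Sketch: generate "invalid values" probes from documented value lists.
# valid_values maps (table, column) to the documented domain; the table
# and column names are hypothetical examples.
valid_values = {
    ("employee", "job_title"): ["ACCOUNTANT", "CONSULTANT",
                                "MANAGER", "EXECUTIVE"],
}

def invalid_value_probe(table, column, values):
    """Build a query that returns only values absent from the docs."""
    quoted = ", ".join(f"'{v}'" for v in values)
    return (f"SELECT DISTINCT {column} FROM {table} "
            f"WHERE {column} NOT IN ({quoted});")

for (table, column), values in valid_values.items():
    print(invalid_value_probe(table, column, values))
```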
So let's say your client has told you that they only employ salaried staff and outsource all other functions. But your subject matter expert says that, no, housekeeping and cafeteria staff used to be employed directly. Let's see who's right.
SELECT DISTINCT job_title
FROM product
WHERE job_title NOT IN ('ACCOUNTANT','CONSULTANT','MANAGER','EXECUTIVE');
And you get the following output.
job_title
JANITOR
COOK
CASHIER
SERVER
Whoa! It turns out that at one point they did have hourly workers. You'll need to develop a business rule for handling that legacy data. But there's one problem with this approach: you'll need to modify the query almost on a case-by-case basis. That's only practical when you have reason to think there is a specific problem.
The other method is simply to combine the cardinality check with the valid values check. The technique for that looks like this:
SELECT 'SELECT ' || column_name || ', COUNT(*)'
    || ' FROM ' || v_table_name
    || ' GROUP BY ' || column_name || ';'
FROM table_columns_with_datatypes -- this is your data dictionary view
Which results in:
SELECT job_title, COUNT(*)
FROM product
GROUP BY job_title;
And you get the following output:
job_title count(*)
MANAGER 6
ACCOUNTANT 12
CONSULTANT 14
EXECUTIVE 3
JANITOR 2
COOK 4
SERVER 3
CASHIER 5
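If you want to see the mechanics end to end, the counts above can be reproduced on a throwaway in-memory SQLite table. The table and sample rows below are fabricated to mirror the output shown; nothing here comes from a real source system.

```python
# Sketch: reproduce the cardinality check against a fabricated sample.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (job_title TEXT)")
sample = (["MANAGER"] * 6 + ["ACCOUNTANT"] * 12 + ["CONSULTANT"] * 14 +
          ["EXECUTIVE"] * 3 + ["JANITOR"] * 2 + ["COOK"] * 4 +
          ["SERVER"] * 3 + ["CASHIER"] * 5)
conn.executemany("INSERT INTO product VALUES (?)", [(t,) for t in sample])

# One row per distinct value, with its frequency.
counts = dict(conn.execute(
    "SELECT job_title, COUNT(*) FROM product GROUP BY job_title"))
print(counts)
```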
Note that if you take this approach, you'll have to add additional logic to the WHERE clause to ensure that you are only pulling lookup table values. So let's put this all together into one final script and add that logic.
--
v_table_name ARRAY
v_directory STRING <drive/folder> -- This will be a constant
v_spool_sql STRING
v_spool_log STRING
START
SELECT table_name
INTO v_table_name
FROM all_tables
WHERE owner = <source_schema>
FOR all_rows IN v_table_name LOOP
v_spool_sql := 'spool ' || v_directory || '\' || v_table_name || '.sql'
v_spool_log := 'spool ' || v_directory || '\' || v_table_name || '.log'
TURN ON OUTPUT
SPOOL v_spool_sql -- Create your SQL script files
SELECT 'SPOOL ' || v_spool_log FROM dual; -- or similar dummy table
-- Get our minimum and maximum values
SELECT 'SELECT MIN(' || column_name || '), MAX(' || column_name || ')'
    || ' FROM ' || v_table_name || ';'
FROM table_columns_with_datatypes -- this is your data dictionary view
WHERE datatype IN ('date','number');
-- Get our possible lookup column values
SELECT 'SELECT ' || column_name || ', COUNT(*)'
    || ' FROM ' || v_table_name
    || ' GROUP BY ' || column_name || ';'
FROM table_columns_with_datatypes
WHERE column_name IN
    (SELECT column_name
     FROM constraint_columns
     WHERE constraint_type = 'Foreign Key')
SPOOL OFF
END LOOP
END
--
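Here is what that driver loop might look like as a runnable Python sketch: one generated .sql file per source table, with range checks for date/number columns and cardinality checks only for columns backed by lookup tables. The `dictionary` rows and `fk_columns` set are hypothetical stand-ins for queries against your data dictionary and constraint views.

```python
# Sketch of the driver loop: write one generated .sql file per table.
# The dictionary contents and fk_columns set are illustrative assumptions.
import os
import tempfile

dictionary = {
    "employee": [("hire_date", "date"), ("salary", "number"),
                 ("job_title", "varchar")],
}
fk_columns = {"job_title"}      # columns constrained by lookup tables
out_dir = tempfile.mkdtemp()    # stand-in for v_directory

for table, cols in dictionary.items():
    lines = []
    for column, datatype in cols:
        if datatype in ("date", "number"):
            lines.append(f"SELECT MIN({column}), MAX({column}) "
                         f"FROM {table};")
        if column in fk_columns:
            lines.append(f"SELECT {column}, COUNT(*) FROM {table} "
                         f"GROUP BY {column};")
    path = os.path.join(out_dir, f"{table}.sql")
    with open(path, "w") as f:
        f.write("\n".join(lines))
```

Each generated file can then be run against the source system (the equivalent of the SPOOL step above) and the .log output reviewed with your subject matter experts.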
There will of course be many, MANY more tweaks and updates, both as you explore your data on your own and as you elicit feedback from your team and your client, but this will get you started on the concept. Good luck!