Data Warehouse Transformation Code Smells


Keep on the lookout for these trouble signs in SQL transforms

There is a strange paradox in Data Engineering when it comes to transformation code. While we increasingly hold extract and load ("EL") programming to production software standards, transform code continues to be treated as a second-class citizen. Ironically, transform code often contains the complex business logic that would benefit most from being treated like software.

A code smell is "a surface indication that usually corresponds to a deeper problem in the system." In even simpler terms, code smells are patterns in software that beg us to look a little closer. Code smells in your application are not unlike actual smells in a refrigerator: a pungent odor may signal that something unsavory is present (like that carton of decade-old moo shu pork), or it may be as innocuous as Limburger cheese. Code smells do not guarantee that a problem exists, and often the best possible refactor simply resembles a different smell. The value is that each occurrence prompts you to ask which solution offers the most readable, maintainable transform code.

What follows is a collection of code smells specific to dimensional Data Warehouse transforms. Encountering them should give you pause, and present you with opportunities to leave the codebase better than you found it.

Multi-Column Coalesce Statements

COALESCE(receipt_price,
         COALESCE(label_price,
                  COALESCE(catalog_price, 0))) AS item_price

Translated to English: “If the receipt price isn’t there try the label price, if that isn’t there try the catalog price, and if all else fails, figure the price at 0.”

Why it smells: Fumbling through a handful of columns to grab the first available value indicates that the data is not well understood. Either the author does not know why one column's value deserves preference over another, or the resulting column is a mashup of several states that should remain independent.

Possible refactorings: The nested coalesce above likely represents multiple independent states forced into a single, false condition. Consider replacing it with either an explicit decision tree (usually a CASE statement) or breaking each state out into a distinct fact.
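A minimal sketch of the CASE-statement route (the order_items table name is assumed for illustration) makes the decision tree explicit and keeps each source price available as its own fact:

SELECT
    CASE
        WHEN receipt_price IS NOT NULL THEN receipt_price
        WHEN label_price IS NOT NULL THEN label_price
        WHEN catalog_price IS NOT NULL THEN catalog_price
        ELSE 0
    END AS item_price
    -- each state also preserved as an independent fact
    ,receipt_price
    ,label_price
    ,catalog_price
FROM
    order_items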

Reserved Words as Identifiers

SELECT  
   user_sp_role AS "ROLE"  

Translated to English: “Name the user_sp_role column ROLE; people will know what that means.”

Why it smells: A core tenet of Data Warehouse design is that the interface should err on the side of simplicity. Using reserved words (even reserved words allowed by your particular dialect) will introduce complexity and opportunities for confusion.

Possible refactorings: Stick to verbose identifiers that are easy to use, do not require quotes, and will keep the Data Warehouse accessible to users of all SQL aptitudes. ROLE could more intuitively be named web_application_role and avoid pointless confusion.
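Applied to the example above, the rename needs no quotes at all:

SELECT
   user_sp_role AS web_application_role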

Sentinel NULL Values in Dimensions

SELECT  
    customer_name  
    ,customer_address_id -- NULL means we have no address on file  
    ,customer_phone  
FROM   
   customer  

Translated to English: “If you want all the customers that do not have registered phone numbers just select where the phone number is NULL.”

Why it smells: NULL is a very important value in the Data Warehouse world. If a join goes bad, there will be NULL values. If a group by fails or a window function isn’t sliding as we expect, there are NULL values. When NULL plays double duty as a legitimate data value, debugging becomes nearly impossible. Add to this that BI tools often behave inconsistently when presented with NULL values, and you have a perfect place for bugs to hide.

Possible refactorings: Don’t use NULL values in dimensions; explicitly state each possible condition (e.g. use ELSE in CASE statements) so that any NULL value immediately draws scrutiny. This will not only harden your transform code but contribute to the intuitive nature of the end product data. NULL can mean a lot of things, but 'No Phone Number Available' is crystal clear.
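A minimal sketch of that refactor (the label text and the address-status derivation are illustrative choices):

SELECT
    customer_name
    ,COALESCE(customer_phone, 'No Phone Number Available') AS customer_phone
    ,CASE
        WHEN customer_address_id IS NOT NULL THEN 'Address On File'
        ELSE 'No Address On File'
     END AS customer_address_status
FROM
    customer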

Note: This smell applies specifically to dimension attributes. NULL values are not only correct but important data points for additive facts (such as total_sale_value).

Magic Numbers

SELECT  
    customer_name  
    ,customer_address  
    ,customer_phone  
... 
WHERE  
    customer_unit_id IN (1,3,19)  
AND  
    customer_value_type = 'a'

Translated to English: “We don’t use customer units 8 or 13 anymore, so we ignore them (Ted says 1, 3, and 19 are all that matter). We also only care about the primary website customer value types (Bob says those are designated by ‘a’).”

Why it smells: Good code is self-documenting. This generally means you can read the code and understand what it does without a decoder ring. The example above isn’t challenging due to complex business logic or technical intricacy, but because it is overflowing with tribal knowledge.

Possible refactorings: CTEs are great tools for data mapping:

WITH   
value_types AS (  
    "a" AS primary_website_customer  
...
)  
...  
AND  
    customer_value_type = value_types.primary_website_customer 

When larger refactors are not possible, comments are better than nothing. Look for variables and constants that can be more descriptively named as a cheap way to greatly improve the codebase.

Data Elimination

WITH all_visits AS (  
    SELECT   
        *  
    FROM  
        website_visits  
)  
SELECT  
    *  
FROM   
    all_visits  
WHERE   
    visit_id IS NOT NULL

Translated to English: “Website visits should always have a visit_id, so if they don’t, the record is bad and we should throw it out.”

Why it smells: The foundation of any Data Warehouse is truth. Not just some, but the whole truth, which destructive transforms cannot provide. A Data Warehouse missing records (even “bad” records) has no credibility, and you will quickly find consumers asking for raw source access.

Possible refactorings: Transform logic should be additive, presenting greater value to the end-user. In the example above, a new column valid_record would filter to the same dataset in a BI layer while providing consumers with the confidence of having access to “all the data”.
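Sketched against the example above, the additive version keeps every record and simply flags the suspect ones (the valid_record name is illustrative):

WITH all_visits AS (
    SELECT
        *
    FROM
        website_visits
)
SELECT
    *
    ,visit_id IS NOT NULL AS valid_record
FROM
    all_visits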

Assumptive Business Logic

CASE   
   WHEN user_time_zone IS NULL THEN (last_login_time at time zone 'pst')  
   ELSE (last_login_time at time zone user_time_zone)  
END AS last_login_time

Translated to English: “Most of our web traffic is from the SF Bay Area, so if a web visit is missing a time zone we assume PST.”

Why it smells: The job of the Data Warehouse is to provide users with the ability to make informed decisions, not to make decisions for them. Every time transform logic chooses a path for the data, it inevitably removes options from the consumer in the process.

Possible refactorings: In the example above, the original last_login_time would ideally render last_login_time_without_timezone along with last_login_time_with_timezone; the end-user can then decide to make assumptions about the missing timezones at their own peril.
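A minimal sketch of that shape (the user_logins source is assumed): the naive value passes through untouched, and the zone-aware value is only populated when a timezone is actually known:

SELECT
   last_login_time AS last_login_time_without_timezone
   ,last_login_time AT TIME ZONE user_time_zone AS last_login_time_with_timezone  -- NULL when the timezone is unknown
FROM
   user_logins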

Run Time as an Input

SELECT  
   *  
FROM  
   all_records  
WHERE  
   created_at::DATE >= DATEADD('days',-1, CURRENT_DATE()) 

Translated to English: “The records with a created date greater than yesterday are the new records.”

Why it smells: Any time the same code can be run twice against the same data and return different results, consider it a problem. Good transformation logic is both idempotent and deterministic. Unstable elements such as the current date or time make the code brittle, and can easily land the system in an uncorrectable state if a transform job fails or is run twice.

Possible refactorings: Design transforms in a manner that is self-healing. Using the same example:

  • Only a slight modification is needed if the records are guaranteed to be incrementing (no late-arriving records).

    SELECT  
    *  
    FROM  
    all_records  
    WHERE  
    created_at > (SELECT MAX(created_at) FROM target_table)
  • Greater volatility in source data calls for greater transform complexity (and greater computing cost). Depending on how late records can arrive, the comparison can be restricted to a window using a predicate (see the comment in the example below).

    SELECT  
    all_records.*  
    FROM  
    all_records 
    WHERE  
    MD5(all_records.*::TEXT) NOT IN (SELECT MD5(target_table.*::TEXT) FROM target_table)
    
    /*
    if records are always < 30 days late, you could restrict the lookup, e.g.:
    (SELECT MD5(target_table.*::TEXT)
     FROM target_table
     WHERE target_table.created_at::DATE >= DATEADD('days', -30, CURRENT_DATE()))
    */

Inconsistent Tenses, Prefixes & Suffixes

SELECT
	user_id
	,id
	,identifier
FROM
	users
JOIN
	site
...
JOIN
	dim_visits

Translated to English: Unstructured grammar around identifiers, erratic prefixing of column names, and lack of a vocabulary system.

Why it smells: In a Data Warehouse, the schema is the product interface. Unpredictable vocabulary creates undue friction for the user. Is the table order or orders? Is the column sale_price or order_sale_price? Without a pattern, every such question is overhead that erodes the usability of the Data Warehouse.

Possible refactorings: Select conventions. Document them. Update the transform code to reflect them. The same query with homogeneous language might look like:

SELECT
	user_id
	,site_id
	,visit_id
FROM
	user
JOIN
	site
...
JOIN
	visit

Technology References in Identifiers

CREATE OR REPLACE TABLE POSTGRES_USERS AS ...

Translated to English: Any table, view, schema, database, or column whose name reflects the source system (e.g. postgres_user), the extract-load medium (e.g. DATA_WAREHOUSE.STITCH.USERS), or any other mechanical component of the ELT process (e.g. cron_daily.users).

Why it smells: It can be hard for engineers to get out of our own headspaces. This smell often results from designing a schema “source down” instead of “end-user up”. The Data Warehouse must represent information in a way that reflects business domain objects; for example, a hospital does not think of its consumers as “billing users”, “chart system users”, and “prescription users” - they are all simply “patients”.

This is a particularly hard smell to detect because the business domain often runs very close to the technology domain, and users may have trained themselves to incorrectly align one with the other. If a retailer has distinct eCommerce and physical point-of-sale systems, it is very easy to think that the eCommerce system represents web_users and the POS system represents in_store_users. But this is not the case; the business has only CUSTOMERS who may shop in a store, online, or both.

Possible refactorings: Think of your data product the way a UX designer would design an intent-driven application interface. If you log into your Medium account you are asked for your username and password, not your “dynamo_db username” and password. By the same logic, your Data Warehouse user base is interested in page visits, not Google Analytics page visits or Adobe Analytics page visits.
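One hedged sketch of that refactor (all object names here are hypothetical) presents a single domain object and keeps the source systems as an implementation detail:

CREATE OR REPLACE VIEW page_visit AS
SELECT
   visit_id
   ,visited_at
   ,page_url
FROM
   google_analytics_page_views
UNION ALL
SELECT
   visit_id
   ,visited_at
   ,page_url
FROM
   adobe_analytics_page_views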

Procs/Functions External To The Codebase

SELECT  
   super_amazing_stored_proc(122, 'another_magic_value') AS RPI_VALUE

Translated to English: Functions that are not part of the native SQL dialect for the target Data Warehouse and are not created as part of the codebase.

Why it smells: If we view the transform codebase as the blueprints by which our Data Warehouse is constructed, stored procs (not created as part of the codebase) are “off-the-books” jobs. The codebase no longer contains all the elements of the machine and cannot effectively reproduce the warehouse. This dangerous and brittle state leaves the warehouse open to catastrophic failure if (when) the instance goes down.

Possible refactorings: If you are using a SQL framework like DBT (or any SQL precompilation really), avoid stored procs and functions completely. For those rare instances where a stored procedure or function is the only viable solution (or if you are using stored procs as your transform layer), include the definition of the proc in your code base with either a DROP.. CREATE or CREATE OR REPLACE pattern to ensure that it is recreated from your code with every run. This will minimize the gap between the state of your code and the state of production.
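As a Postgres-flavored sketch (the signature and body here are placeholders, not the real proc), the definition lives in the codebase and is recreated on every run:

CREATE OR REPLACE FUNCTION super_amazing_stored_proc(lookup_id INTEGER, magic_value TEXT)
RETURNS NUMERIC
LANGUAGE SQL
AS $$
    -- placeholder body; the point is that the real logic is versioned alongside the transforms
    SELECT 0::NUMERIC;
$$;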

Quoted Identifiers

SELECT  
   1776 AS "FOUNDING_YEAR" FROM countries."America"."Important Dates"

Translated to English: Identifiers that are written case-sensitively, or that include special characters or reserved words.

Why it smells: SQL is a 4th generation language, and the intent of conventions like case folding (treating identifiers as case-insensitive values) is to more closely resemble human-to-human communication. Quoted identifiers generally swim against the current of this intent, forcing users to consider capitalization and potentially leading to confusing "Leads_Prod" vs "leads_prod" situations (these are 2 distinct tables!).

Possible refactorings: Just don’t quote identifiers, ever. Avoid the confusion and the overhead by using verbose, descriptive names for databases, tables/views, and columns. As a bonus, your code will be portable this way (case folding is not consistent across different platforms so any quoted identifier is instantly non-portable).
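Rewritten without quotes (assuming the objects can be renamed to lowercase, underscore-separated identifiers), the example above becomes:

SELECT
   1776 AS founding_year
FROM
   countries.america.important_dates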

Note: There was a valiant effort in the earlier days of data warehousing to quote everything, making identifiers as pretty and report-ready as possible with column names like "Monthly Report Status". At the time this made a lot of sense, as much of the consumption was directly from Data Warehouse tables into reports and spreadsheet extracts. Today I would argue that BI tools are the best place for this kind of “presentation polish”, and the Data Warehouse benefits more by keeping identifiers clean and verbose.

Timestamps Without Timezones / Timezones Not In UTC

SELECT  
   TIMESTAMP '2020-01-01 13:10:02' AS questionable_tstamp  
   ,TIMESTAMP WITH TIME ZONE '2020-01-01 01:11:21+04' AS another_confusing_tstamp

Translated to English: Any timestamp that is not explicitly cast to a UTC value, especially the use of “local time” as a standard.

Why it smells: Timestamps are the messiest of datatypes. The implementation and handling of timestamps differ greatly from platform to platform, language to language, and especially tool to tool.

Possible refactorings: Explicitly convert all timestamps to UTC for storage. Note that this is not the same as converting and then stripping the timezone (a weird yet painfully common practice that likely stemmed from a belief that timestamps without timezones are “easier”).
Consistent use of UTC will streamline onboarding new datasets, eliminate daylight savings time confusion, and future-proof organizational knowledge past the point of a single timezone. Let the BI tools worry about timestamp presentation (most will do it anyway, and those “helpful” upstream conversions will likely do more harm than good).
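A Postgres-flavored sketch (table, column, and source zone are assumptions): interpreting the naive timestamp in its known source zone yields a timezone-aware value that is stored and compared in UTC, without stripping the zone:

SELECT
   order_placed_at AT TIME ZONE 'America/New_York' AS order_placed_at_utc  -- TIMESTAMPTZ, anchored to UTC
FROM
   store_orders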

Normalization

SELECT  
   u.user_name  
   ,u.user_area_code  
   ,si.site_name  
FROM   
   users u  
INNER JOIN  
   sites s  
ON   
   u.site_id = s.id  
INNER JOIN  
   site_identifiers si  
ON   
   s.id = si.site_id

Translated to English: Schemata that reflect the traditional BCNF (Boyce-Codd normal form) design you would expect to find in transactional databases. In this example, site_identifiers has been normalized out of site to protect referential integrity.

Why it smells: Data warehouses are OLAP structures that fulfill a very different need from transactional databases. Normalization and referential constraints are important parts of how OLTP systems do their job - but these tools are detrimental to the goals of a knowledge store. Data Warehouses do not represent the desired state (e.g. that all page_views have a source_id that exists in the traffic_sources table); they represent the reality (e.g. that a bug associated 1 million page_views with a non-existent source). From a higher vantage point, the presence of heavy normalization is probably a strong indicator that other OLTP conventions have been followed throughout the codebase.

Possible refactorings: Dimensional model design is outside the scope of this writing (for a greater understanding of how dimensional models differ from transactional models I highly recommend the Data Warehouse Toolkit by Ralph Kimball). In general, these normalized values should be “degenerated” to wide, flat dimensional tables like so:

SELECT  
   name  
   ,area_code  
   ,site_name  
FROM   
   users  

Hidden “Squishy” Logic

-- DDL for DATA_WAREHOUSE.SESSIONS  
CREATE TABLE data_warehouse.sessions AS  
WITH   
ordered_sessions AS (  
SELECT   
   *   
FROM   
   sessions   
ORDER BY insert_id  
)  
,session_merger AS (  
SELECT  
   CASE  
     WHEN TIMESTAMP_DIFF(b.session_start, a.session_end, SECOND) < 60  
   AND  
     ABS(a.thumbprint_hash_points - b.thumbprint_hash_points) < 5   
   AND   
     EARTH_DISTANCE(a.location, b.location) < 0.5  
   THEN LEAST(a.session_id, b.session_id)  
   ELSE NULL   
   END AS merged_session_id  
FROM  
   ordered_sessions a  
INNER JOIN  
   ordered_sessions b   
ON   
   a.insert_id + 1 = b.insert_id  
)  
,refined_sessions AS (  
SELECT   
   o.*  
FROM  
   ordered_sessions o  
WHERE  
   o.session_id IN (SELECT merged_session_id FROM session_merger)   
)  
SELECT * FROM refined_sessions

Translated to English: Complex transforms that are masked by seemingly stable identifiers.

Why it smells: “Squishy” logic is business logic whose soundness is a judgment call: in the example above, the code decides that “two sessions occurring less than a minute apart, with extremely close thumbprints, and originating from (nearly) the same location are likely the same user session.” The smell here is not the logic - whether or not this is an accurate way to merge browser sessions is up to the business; the smell is representing “likely the same user session” as the absolute value session.

Possible refactorings: Data Warehouse transform code represents what is known to be true. In this example, we know that each session exists, while we hypothesize that certain sessions are actually the same session. If the hypothesis is supported by the business, it can easily be represented as additional information in the form of a likely_parent_session column. Aggregations on top of this hypothesis can exist in additional materializations, e.g. dim_collapsed_session and fact_collapsed_conversion. Often more than one hypothesis is needed to support the range of business use cases. In that event, each hypothesis can either materialize further downstream in a domain-specific mart or be “branded” and used to enrich dim_session in the Data Warehouse.

/* downstream mart */
SELECT
	amalgamated_session_id 
	,duration
	...
FROM marketing.amalgamated_sessions

/* "branded" in dim_session */
SELECT
	session_id
	,aggressive_merge_session_id
	,conservative_merge_session_id
	,halifax_merge_session_id
...
FROM
	dim_session

No Consumer Documentation

/* what do these mean?!? */
SELECT
	halifax_merge_session_id 
	,aggressive_merge_session_id 
FROM
	dim_session

Translated to English: For a consumer to make use of the Data Warehouse, they need input from the transform authors.

Why it smells: The Data Warehouse is both a business tool and a consumer product. Like any complex tool intended for business use, it must ship with comprehensive documentation. Imagine if the only way to learn to use the VLOOKUP function in Excel was to call a Microsoft Engineer! Without consumer-facing documentation, the product would be impractical to use.

Possible refactorings: There is a multitude of places documentation can live. Nearly all Data Warehouse platforms support SQL comment metadata on objects (see the sketch after this list). If you use a transformation framework like DBT, consumer-facing documentation is baked in with dbt docs. Documentation can also be managed with tools like Sphinx, Read The Docs, or even simple markdown files. The documentation solution must, at a bare minimum:

  • be easy for consumers to access.
  • be maintained as part of the data product.
  • support effective search and navigation.
  • be as complete as possible, without leaning on “inside” references.
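As one hedged example of the comment-metadata option (the description text is illustrative), the mystery columns from the example above can at least explain themselves:

COMMENT ON COLUMN dim_session.halifax_merge_session_id IS
    'Parent session id under the Halifax merge hypothesis; see the warehouse docs for the merge rules.';

COMMENT ON COLUMN dim_session.aggressive_merge_session_id IS
    'Parent session id under the aggressive merge hypothesis (sessions collapsed on short gaps and close thumbprints).';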

Query-Style Identifier Naming

SELECT
	s.order_id
	,w.order_id
...
FROM
	confirmed_ecom_system_orders s
JOIN
	client_side_web_orders w

Translated to English: The use of shorthand model aliases, often one or two letters long.

Why it smells: Abbreviated shorthand is very useful for writing quick ad-hoc queries. But like all good software, transformation code should be self-documenting and use object names that mean something.

Possible refactorings: Naming identifiers is one of the two hard things in software development. Use alias names that are descriptive, unique within the transform, and convey the content of the represented table/CTE/dataset:

SELECT
	ecom_orders.order_id
	,web_orders.order_id
...
FROM
	confirmed_ecom_system_orders ecom_orders
JOIN
	client_side_web_orders web_orders

Pandering Logic In KPIs

SELECT
	is_conversion_marketing
	,is_conversion_business_development
	,is_conversion_finance
FROM 
	web_orders

Translated to English: “Verticals refuse to agree on business logic around a KPI, so we support multiple versions of the truth.”

Why it smells: Organizational Maturity is a critical element in any successful data initiative. If the business is unwilling (or unable) to make sometimes-difficult decisions and move forward with a unified source of truth, this indecision will be reflected in the Data Warehouse codebase.

Possible refactorings: The refactor for this smell is technically simple but practically difficult. The business must evolve and declare a singular definition that all verticals will adopt. In SQL, this is as simple as:

SELECT
	is_conversion
FROM 
	web_orders

In the real world, this can be a political minefield.