Keep on the lookout for these trouble signs in SQL transforms
There is a strange paradigm in Data Engineering when it comes to transformation code. While we increasingly hold extract and load (“EL”) programming to production software standards, transform code continues to be treated as a second-class citizen. Ironically, transform code often contains the complex business logic that could benefit greatly from being treated like software.
A code smell is “a surface indication that usually corresponds to a deeper problem in the system.” In even simpler terms, code smells are patterns in software that beg us to look a little closer. Code smells in your application are not unlike actual smells in a refrigerator: a pungent odor may signal that something unsavory is present (like that carton of decade-old moo shoo pork), or it may be as innocuous as Limburger cheese. Code smells do not guarantee that a problem exists, and often it is the case that the best possible refactor resembles a different smell. The value lies in that each occurrence prompts you to question what solution offers the most readable, maintainable transform code.
What follows is a collection of code smells specific to dimensional Data Warehouse transforms. Encountering them should give you pause, and present you with opportunities to leave the codebase better than you found it.
Multi-Column Coalesce Statements
COALESCE(reciept_price, COALESCE(label_price, COALESCE(catalog_price),0))) AS item_price
Translated to English: “If the receipt price isn’t there try the label price, if that isn’t there try the catalog price, and if all else fails, figure the price at 0.”
Why it smells: Fumbling through a handful of columns to grab the first available value indicates that data is not well understood. Either the code does not know why one column value deserves preference, or the resulting column is a mashup of several states that should be independent.
Possible refactoring: The nested coalesce above likely represents multiple independent states forced into a false condition. Consider replacing with either an explicit decision tree (usually a
CASE statement) or breaking each state into a distinct fact.
Reserved Words as Identifiers
SELECT user_sp_role AS "ROLE"
Translated to English: “Name the user_sp_role column
ROLE , people will know what that means.”
Why it smells: A core tenet of Data Warehouse design is that the interface should err on the side of simplicity. Using reserved words (even reserved words allowed by your particular dialect) will introduce complexity and opportunities for confusion.
Possible refactorings: Stick to verbose identifiers that are easy to use, do not require quotes, and will keep the Data Warehouse accessible to users of all SQL aptitudes.
ROLE could more intuitively be named
web_application_role and avoid pointless confusion.
Sentinal NULL Values in Dimensions
SELECT customer_name ,customer_address_id # null means we have no address on file ,customer_phone FROM customer
Translated to English: “If you want all the customers that do not have registered phone numbers just select where the phone number is
Why it smells:
NULL is a very important value in the Data Warehouse world. If a join goes bad, there will be
NULL values. If a
group by fails or a window function isn’t sliding as we expect, there are
NULL values. When
NULL plays double-duty as a legit data value, debugging becomes nearly impossible. Add to this that BI tools often behave inconsistently when presented with
NULL values, and you have a perfect place for bugs to hide.
Possible refactorings: Don’t use
NULL values in dimensions; explicitly state each possible condition (ie use
CASE statements) so that any
NULL value immediately draws scrutiny. This will not only harden your transform code but contribute to the intuitive nature of the end product data.
NULL can mean a lot of things, but
'No Phone Number Available' is crystal clear.
Note This smell only applies specifically to dimension attributes.
NULL values are not only correct but important data points for additive facts (such as
SELECT customer_name ,customer_address ,customer_phone ... WHERE customer_unit_id IN (1,3,19) AND customer_value_type = "a"
Translated to English: “We don’t use customer units 8 or 13 anymore, so we ignore them (Ted says 1,3, and 19 are all that matter). We also only care about the primary website customer value types (Bob says those are designated by ‘a’). ”
Why it smells: Good code is self-documenting. This generally means you can read the code and understand what it does without a decoder ring. The example above isn’t challenging due to complex business logic or technical intricacy, but because it is overflowing with tribal knowledge.
Possible refactorings: CTEs are great tools for data mapping:
WITH value_types AS ( "a" AS primary_website_customer ... ) ... AND customer_value_type = value_types.primary_website_customer
When larger refactors are not possible, comments are better than nothing. Look for variables and constants that can be more descriptively named as a cheap way to greatly improve the codebase.
WITH all_visits AS ( SELECT * FROM website_visits ), SELECT * FROM website_visits WHERE visit_id IS NOT NULL
Translated to English: “website visits should always have a
visit_id , so if they don’t, the record is bad and we should throw it out.”
Why it smells: The foundation of any Data Warehouse is truth. Not just some, but the whole truth, which destructive transforms cannot provide. A Data Warehouse missing records (even “bad” records) has no credibility, and you will quickly find consumers asking for raw source access.
Possible refactorings: Transform logic should be additive, presenting greater value to the end-user. In the example above, a new column
valid_record would filter to the same dataset in a BI layer while providing consumers with the confidence of having access to “all the data”.
Assumptive Business Logic
CASE WHEN user_time_zone IS NULL THEN (last_login time at time zone 'pst') ELSE (last_login_time at time zone user_time_zone) END AS last_login_time
Translated to English: “Most of our web traffic is from the SF Bay area, so if a web visit is missing a timestamp we update it to PST.”
Why it smells: The job of the Data Warehouse is to provide users with the ability to make informed decisions, not to make decisions for them. Every time transform logic chooses a path for the data, it inevitably removes options from the consumer in the process.
Possible refactorings: In the example above, the original
last_login_time would ideally render
last_login_time_without_timezone along with
last_login_time_with_timezone ; the end-user can then decide to make assumptions about the missing timezones at their own peril.
Run Time as an Input
SELECT * FROM all_records WHERE created_at::DATE >= DATEADD('days',-1, CURRENT_DATE())
Translated to English: “The records with a created date greater than yesterday are the new records.”
Why it smells: Any time the same code can be run twice against the same data and return different results, consider it a problem. Good transformation logic is both idempotent and deterministic. Unstable elements such as the current date or time make the code brittle, and can easily land the system in an uncorrectable state if a transform job fails or is run twice.
Possible refactorings: Design transforms in a manner that is self-healing. Using the same example:
Only a slight modification is needed if the records are guaranteed to be incrementing (no late-arriving records).
SELECT * FROM all_records WHERE created_at > (SELECT MAX(created_at) FROM target_table)
Greater volatility in source data calls for greater transform complexity (and greater computing cost). Depending on how late-arriving records can be, the code may be restricted to a window using a predicate statement.
SELECT all_records.* FROM all_records WHERE MD5(all_records.*::TEXT) NOT IN (SELECT MD5(target_table.*::TEXT) FROM target_table) /* if records are always < 30 days late, you could restrict the lookup ie (SELECT MD5(target_table.*::TEXT) FROM target_table WHERE target_table.created_at::DATE >= DATEADD('days',-30, CURRENT_DATE())) */
Inconsistant Tenses, Prefixes & Suffixes
SELECT user_id ,id ,identifier FROM users JOIN site ... JOIN dim_visits
Translated to English: Unstructured grammar around identifiers, erratic prefixing of column names, and lack of a vocabulary system.
Why it smells: In a Data Warehouse the schema is the product interface. Unpredictable lexis serve as undue friction for the user. Is the table
orders? Is the column
order_sale_price? Without a pattern, this is all overhead to the usability of a Data Warehouse.
Possible refactorings: Select conventions. Document them. Update the transform code to reflect them. The same query with homogenious language might look like:
SELECT user_id ,site_id ,visit_id FROM user JOIN site ... JOIN visit
Technology References in Identifiers
CREATE OR REPLACE TABLE POSTGRES_USERS AS ...
Translated to English: Any table, view, schema, database or column where the name reflects the source system (ie
postgres_user ), the extract-load medium (ie
DATA_WAREHOUSE.STITCH.USERS ) or any other mechanical component of the ELT process (ie
Why it smells: It can be hard for engineers to get out of our own headspaces. This smell often results from designing a schema “source down” instead of “end-user up”. The Data Warehouse must represent information in a way that reflects business domain objects; for example, a hospital does not think of its consumers as “_billing users_” and “_chart system users_” and “_prescription users_”, they are all simply “_patients_”.
This is a particularly hard smell to detect because the business domain often runs very close to the technology domain, and users may have trained themselves to incorrectly align one with the other. If a retailer has distinct eCommerce and physical point-of-sale systems, it is very easy to think that the eCommerce system represents
web_users and the POS system represents
in_store_users . But this is not the case; the business has only
CUSTOMERS who may shop in a store, online, or both.
Possible refactorings: Think of your data product the way a UX designer would design an intent-driven application interface. If you log into your Medium account you are asked for your username and password, not your “dynamo_db” username and password. By the same logic, your Data Warehouse userbase is interested in page visits, not Google Analytics page visits or Adobe Analytics page visits.
Procs/Functions External To The Codebase
SELECT super_amazing_stored_proc(122, 'another_magic_value') AS RPI_VALUE
Translated to English: Functions that are not part of the native SQL dialect for the target Data Warehouse and are not created as part of the codebase.
Why it smells: If we view the transform codebase as the blueprints by which our Data Warehouse is constructed, stored procs (not created as part of the codebase) are “off the books jobs”. The codebase no longer has all the elements of the machine and cannot effectively reproduce the warehouse. This dangerous and brittle state leaves the warehouse open to catastrophic failure if (when) the instance goes down.
Possible refactorings: If you are using a SQL framework like DBT (or any SQL precompilation really), avoid stored procs and functions completely. For those rare instances where a stored procedure or function is the only viable solution (or if you are using stored procs as your transform layer), include the definition of the proc in your code base with either a
DROP.. CREATE or
CREATE OR REPLACE pattern to ensure that it is recreated from your code with every run. This will minimize the gap between the state of your code and the state of production.
SELECT 1776 AS "FOUNDING_YEAR" FROM countries."America"."Important Dates"
Translated to English: Identifiers that are written case-sensitive or including special characters or reserved words.
Why it smells: SQL is a 4th generation language, and the intent of conventions like case folding (treating identifiers as case-insensitive values) is to more closely resemble human-to-human communication. Quoted identifiers generally swim against the current of this intent, forcing users to consider capitalization and potentially leading to confusing
"leads_prod" situations (these are 2 distinct tables!).
Possible refactorings: Just don’t quote identifiers, ever. Avoid the confusion and the overhead by using verbose, descriptive names for databases, tables/views, and columns. As a bonus, your code will be portable this way (case folding is not consistent across different platforms so any quoted identifier is instantly non-portable).
Note: There was a valiant effort in the earlier days of data warehousing to quote everything, making identifiers as pretty and report-ready as possible with column names like
"Monthly Report Status". At the time this made a lot of sense, as much of the consumption was directly from Data Warehouse tables into reports and spreadsheet extracts. Today I would argue that BI tools are the best place for this kind of “presentation polish”, and the Data Warehouse benefits more by keeping identifiers clean and verbose.
Timestamps Without Timezones / Timezones Not In UTC
SELECT TIMESTAMP '2020-01-01 13:10:02' AS questionable_tstamp ,TIMESTAMP WITH TIME ZONE '2020-01-01 01:11:21+04' AS another_confusing_tstamp
Translated to English: Any timestamp that is not explicitly cast to UTC value, especially the use of “local time” as a standard.
Why it smells: Timestamps are the messiest of datatypes. The implementation and handling of timestamps differ greatly from platform to platform, language to language, and especially tool to tool.
Possible refactorings: Explicitly convert all timestamps to UTC for storage. Note that this is not the same as converting and then stripping the timezone (a weird yet painfully common practice that likely stemmed from a belief that timestamps without timezones are “easier”).
Consistent use of UTC will streamline onboarding new datasets, eliminate daylight savings time confusion, and future-proof organizational knowledge past the point of a single timezone. Let the BI tools worry about timestamp presentation (most will do it anyway, and those “helpful” upstream conversions will likely do more harm than good).
SELECT u.user_name ,u.user_area_code ,si.site_name FROM users u INNER JOIN sites s ON u.site_id = s.id INNER JOIN site_identifiers si ON s.id = si.site_id
Translated to English: Schemata that reflect traditional BCNF that you would expect to find in transactional database designs. In this example
site_identifiers have been normalized out of
site to protect referential integrity.
Why it smells: Data warehouses are OLAP structures that fulfill a very different need from transactional databases. Normalization and referential constraints are important parts of how OLTP systems do their job - but these tools are detrimental to the goals of a knowledge store. Data Warehouses do not represent the desired state (ie that all
page_views have a
source_id that exists in the
traffic_sources table), they represent the reality (ie that a bug associated 1 million
page_views to a non-existent source). From a higher vantage point, the presence of heavy normalization is probably a strong indicator that other OLTP conventions have been followed throughout the codebase.
Possible refactorings: Dimensional model design is outside the scope of this writing (for a greater understanding of how dimensional models differ from transactional models I highly recommend the Data Warehouse Toolkit by Ralph Kimball). In general, these normalized values should be “degenerated” to wide, flat dimensional tables like so:
SELECT name ,area_code ,site_name FROM users
Hidden “Squishy” Logic
-- DDL for DATA_WAREHOUSE.SESSIONS WITH ordered_sessions AS ( SELECT * FROM sessions ORDER BY insert_id ) ,session_merger AS ( SELECT CASE WHEN TIMESTAMP_DIFF(a.session_end,b.session_start) < 60 AND ABS(a.thumbprint_hash_points - b.thumbprint_hash_points) < 5 AND EARTH_DISTNACE(a.location,b.location) < 0.5 THEN LEAST(a.session_id,b.session_id) ELSE NULL END AS merged_session_id FROM ordered_sessions a INNER JOIN ordered_sessions b ON a.insert_id +1 = b.insert_id ) ,refined_sessions AS ( SELECT o.* FROM ordered_sessions o WHERE o.session_id IN (SELECT merged_session_id FROM session_merger) ) CREATE TABLE sessions AS SELECT * FROM refined_sessions
Translated to English: Complex transforms that are masked by seemingly stable identifiers.
Why it smells: “Squishy” logic is arbitrarily sound business logic: in the example above, the code decides that “two sessions occurring less than a minute apart, with extremely close thumbprints, and originating from (nearly) the same location are likely the same user session.” This smell here is not the logic - whether or not this is an accurate way to merge browser sessions is up to the business; the smell is representing “likely the same user session” as the absolute value
Possible refactorings: Data Warehouse transform code represents what is known to be true. In this example, we know that each session exists, while we hypothesise that certain sessions are actually the same session. If the hypothesis is supported by the business, it can easily be represented as additional information in the form of a
likely_parent_session column. Aggregations on top of this hypothesis can exist in additional materializations, i.e.
Often more than one hypothesis is needed to support the range of business use cases. In that event, each hypothesis can either materialize further downstream in a domain-specific mart or be “branded” and used to enrich
dim_session in the Data Warehouse.
/* downstream mart */ SELECT amalgamated_session_id ,duration ... FROM marketing.amalgamated_sessions /* "branded" in dim_session */ SELECT session_id ,aggressive_merge_session_id ,conservative_merge_session_id ,halifax_merge_session_id ... FROM dim_session
No Consumer Documentation
/* what do these mean?!? */ SELECT halifax_merge_session_id ,aggressive_merge_session_id FROM dim_session
Translated to English: For a consumer to make use of the Data Warehouse they need input from the transform authors.
Why it smells: The Data Warehouse is both a business tool and a consumer product. Like any complex tool intended for business use, it must ship with comprehensive documentation. Imagine if the only way to learn to use the
VLOOKUP function in Excel was to call a Microsoft Engineer! Without consumer-facing documentation, the product would be impractical to use.
Possible refactorings: There is a multitude of places documentation can live. Nearly all Data Warehouse platforms support SQL
comment meta for objects. If you use a transformation framework like DBT then consumer-facing documentation is baked in with
dbt docs. Documentation can also be managed with tools like Sphinx, Read The Docs, or even simple markdown files. The documentation solution must, at a bare minimum:
* be easy for consumers to access.
* be maintained as part of the data product.
* support effective search and navigation.
* be as complete as possible, and “inside” references
Query-Style Identifier Naming
SELECT s.order_id ,w.order_id ... FROM confirmed_ecom_system_orders s JOIN client_side_web_orders w
Translated to English: The use of shorthand model aliases, often one or two letters long.
Why it smells: Abbreviated shorthand is very useful for writing quick ad-hoc queries. But like all good software, transformation code should be self-documenting and use object names that mean something.
Possible refactorings: Naming identifers is one of the two hard things in software development. Use alias names that are descriptive, unique within the transform, and convey the content of the represented table/CTE/dataset:
SELECT ecom_orders.order_id ,web_orders.order_id ... FROM confirmed_ecom_system_orders ecom_orders JOIN client_side_web_orders web_orders
Pandering Logic In KPIs
SELECT is_conversion_marketing ,is_conversion_business_development ,is_conversion_finance FROM web_orders
Translated to English: “Verticals refuse to agree on business logic around a KPI, so we support multiple versions of the truth.”
Why it smells: Organizational Maturity is a critical element in any successful data initiative. If the business is unwilling (or unable) to make sometimes-difficult decisions and move forward with a unified source of truth, this indecision will be reflected in the Data Warehouse codebase.
Possible refactorings: The refactor for this smell is technically simple but practically difficult. The business must evolve and declare a singular definition that all verticals will adopt. In SQL, this is as simple as:
SELECT is_conversion FROM web_orders
In the real world, this can be a political minefield.