Data Engineering is Fixing Messy Data



Most data is a mess. Then you have to fix it. That's my definition of data engineering: Fixing Messy Data.

Once fixed, the data is both ready for business use and able to be composed with other data for larger and larger business solutions. Such data is sometimes called "tidy data" or "composable data" or what I like to call "lego brick data" since when it’s done right you can click the data sets together like the famous toy bricks.

What follows is a list of the many things that can go wrong with data and usually must be corrected to make the data fit for use. Use the items as a checklist to ensure it’s really ready to go.

None of these are theoretical. They are all real examples from my professional data experience starting in 1991 in Desert Storm and still rolling as of 2025.

95% of my data experience is in the space of data lakes, warehouses, analytics, and data science. Consider the items from that perspective. The transactional layer presents a different set of data challenges. I'll leave those challenges to the application experts since those data issues are really more about application behavior.

There are many TODO items, but the headings are pretty complete and can be used as a punch list even in the current form.


1. Structure

1.1. Primary keys that aren’t unique

💡
Check the keys on each table as soon as you encounter the table. Fix the key if it’s broken. Add the key check to your permanent quality checking process.

The single most important aspect of good data is that all tables are reliably keyed. Keys must be unique across all rows. Frequently they aren't. Most analytical data sets don't turn on key enforcement in the database, so it's up to the processing of the data to keep the keys unique. With painful frequency, this doesn't happen. Fortunately, there is one tight SQL query that checks this. I use it as a question in almost every data interview I conduct, and candidates rarely get it right. Here is the query:

select some_supposed_key, count(*)
  from some_table
 group by some_supposed_key
 having count(*) > 1
;

The result set should be empty; any rows returned are key values that occur more than once and must be fixed.

1.2. Primary keys that don’t match

💡
Once you have keys, make sure they match when the data entity is the same, both throughout the database, and across systems in the enterprise.

We were in the middle of a $4,000,000 consulting engagement. The consulting architect had our major enterprise systems on the whiteboard. The warehouse was in the middle. He explained how the data would flow in from the systems and be joined in the warehouse. One of our best internal tech leads quickly pointed out that the keys across the systems don’t match—​they use different identifiers for the same notions. While the keys are unique within the systems (they are legitimate primary keys in isolation), they aren’t the same across systems and thus can’t be joined without significant (massive?!) effort, if they can be joined at all.

The consultant said we'd get to that later, blowing past the single biggest problem we had, and he never got back to it. With the non-matching key problem never addressed, the $4,000,000 consulting engagement recommended a $75,000,000 multi-year program that was quickly shot down. The consulting firm and all the internal leaders who sponsored the engagement were gone within six months. The keys were never fixed, nor was there even a proposal on how to fix them in the 4-inch-thick binder we paid $4,000,000 for.

The tech lead who pointed out the problem was junior to me at the time and went on to become my AVP some years later. Excellent promotions.
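
If you suspect keys don't match across systems, measure the mismatch before anyone draws arrows on a whiteboard. Below is a minimal sketch, assuming two hypothetical tables, system_a and system_b, that each claim to carry the same customer_id; a large count means the systems cannot be joined as-is.

-- Count identifiers present in system A but absent from system B.
-- Table and column names are hypothetical placeholders.
select count(*) as unmatched_ids
  from system_a a
 where not exists (select 1
                     from system_b b
                    where b.customer_id = a.customer_id)
;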

1.3. Hierarchies that don’t roll up

TODO.

1.4. Intelligent keys tightly coupled to the business process

TODO.

1.5. Pivoting data in the database when a BI tool would have been much easier

TODO.

1.6. Fitting hierarchical structures like XML or JSON into a relational database

💡
Expect your transactional systems to use flexible data structures. Expect your analytics systems to need flattened relational data. Expect to pay a big bill in between. Use automation that augments the humans, but don’t expect to find complete automation.

This is a perfectly legitimate need, but it's harder than most people think. It's legitimate because transactional systems benefit heavily from flexible storage like XML or JSON. You want your operational system to take advantage of that from cost-of-ownership, maintainability, and similar architectural perspectives. But it's also legitimate to say that downstream analytics and data science benefit from relational data, and often very denormalized or "flat and wide" data. Between the two, something has to do the transforming, and it's usually a big program with a price tag that no one likes.

It is often asked whether there is some generalized programmatic way to make the conversion. No. The flexible structure is in zero normal form, while the relational structures typically need to be in second or higher normal form. Going from lower to higher normal form is not possible in a generalized manner because of information theory: you cannot go from a lower information state to a higher information state without intelligent intervention.
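
Once a human has decided what the relational shape should be, the mechanics are not the hard part. Here is a minimal sketch, assuming Postgres-flavored SQL and a hypothetical raw_orders table whose doc column holds a jsonb order document with a nested line_items array:

-- Flatten one JSON document per order into one relational row per line item.
-- Table, column, and JSON attribute names are hypothetical placeholders.
select r.doc ->> 'order_id'             as order_id,
       r.doc ->> 'customer_id'          as customer_id,
       item.value ->> 'sku'             as sku,
       (item.value ->> 'quantity')::int as quantity
  from raw_orders r
 cross join lateral jsonb_array_elements(r.doc -> 'line_items') as item(value)
;

The query is the easy part; the intelligent intervention is deciding which attributes matter, what grain the flattened table should have, and what to do with documents that don't fit the chosen shape.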

1.7. Using the whole row as a key when finding a business key is too hard

TODO.

1.8. Normalized models making Humpty Dumpty impossible to put back together

💡
Right-size your data normalization. More is not always better. Default your thinking to third normal form then work from there. Favor more normalization if you have diverse sources, diverse use cases, or many unknowns; otherwise lean toward less.

TODO.

See also: Using Data Vault anywhere other than the 5% of the time it might be useful

1.9. Machine generated column names that give you no clue what the data is

TODO.

2. Content

2.1. Not null fields with null values

TODO.

2.2. Having major highways running into the ocean in a geospatial system

TODO.

2.3. Numeric fields with formatting in them

TODO.

2.4. Columns with multiple meanings

TODO.

See also: Bucketing different money into the same codes across departments.

2.5. Capturing age instead of start date and then never updating the age

💡
Always capture start date, not age. Age can always be derived.

TODO.
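
As the tip says, derive age at query time from the stored date. A minimal sketch, assuming Postgres-flavored SQL and a hypothetical member table with a start_date column:

-- Derive whole years elapsed since start_date instead of storing a stale age.
-- Table and column names are hypothetical placeholders.
select member_id,
       start_date,
       date_part('year', age(current_date, start_date)) as years_since_start
  from member
;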

2.6. Bad commas and quotes in CSV files making it impossible to parse

💡
Try to get the data changed to a more descriptive format like JSON or XML. If you must use CSV, try to use quoting more aggressively. If you must use troubled CSV data, use an ETL tool with modules made to handle this.

TODO.

2.7. Data with a proprietary encoding when an industry encoding was available

💡
Gather and report the long-term costs of using a proprietary encoding and ensure that the organization is not making a bad decision based only on near-term costs.

TODO: Tell the ICD-9/ICD-10 story.

2.8. Notes in just about any free-form field regardless of the field’s original intent

TODO.

2.9. Having only the aggregated values when you need the finer grain value

TODO.

2.10. Dates encoded as strings making date logic harder

💡
Store dates as dates. If this causes ETL ingestion troubles, fix the trouble some other way, or keep the non-date form only at the far left edge of the ETL process.

Dates can be fussy, mostly because invalid formats get rejected and ETL jobs fail. A common workaround is to use numbers or strings, so that the date 1776-07-04 is stored as the number 17760704 or the string '17760704'. This is convenient for ETL processing, and it will at least still sort correctly as long as the order is year/month/day. But it is very common to need to calculate time windows, which is not at all reasonable to do with numbers or strings. I once worked in a warehouse where all dates were stored as numbers as a "best practice", in pursuit of ETL fault tolerance and a questionable performance gain. Calculating date spans always required type conversions, which was not the end of the world, but it made for clutter and was a bit more error prone.
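
To make the clutter concrete, here is a minimal sketch of what every date calculation looked like, assuming Postgres-flavored SQL and a hypothetical event_history table whose date_num column holds values like 17760704:

-- Every span calculation needs a conversion first; with a real date column
-- the last expression would simply be current_date - event_date.
-- Table and column names are hypothetical placeholders.
select event_id,
       to_date(date_num::text, 'YYYYMMDD')                as event_date,
       current_date - to_date(date_num::text, 'YYYYMMDD') as days_since_event
  from event_history
;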

2.11. Structured data put into free-form text fields in various ways

💡
Ensure that all systems left-align or right-align a given field the same way.

The company had a four-part policy identifier. One of the parts was three characters, but it could start out as just two characters because not enough was known about the policyholder early in the lifecycle to fill in the third character. Some systems would left-justify the field within the three positions when it was in its two-character form; others would right-justify it. People new to the environment were constantly bitten by this issue: the policy identifiers always had the same three positions and the same names for the four parts, so it looked like a perfectly clear join…except it didn't work.
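
Until the source systems agree, the defensive move is to normalize the padding on both sides of the join. A minimal sketch, assuming hypothetical admin_system and claims_system tables that both carry the three-character part as policy_part3:

-- Strip the inconsistent left/right padding before comparing the values.
-- Table and column names are hypothetical placeholders.
select a.policy_part3, c.policy_part3
  from admin_system a
  join claims_system c
    on trim(a.policy_part3) = trim(c.policy_part3)
;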

2.12. Key fields being left-justified in some systems and right-justified in others

TODO.

2.13. The number 1 being used to mean "option 1" or "1 dollar" in the same field

TODO.

3. Data Processing

3.1. Control characters in text files

TODO.

3.2. Sequential numbers with gaps

I worked in a small company that had 12 locations. We were about to add the 13th. My coworker, who was responsible for adding new locations, asked if we should skip 13 because it's bad luck. I said, let me tell you what bad luck is: giving an out-of-sequence number to a sequential numbering system! He used 13.
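
Gaps matter because downstream logic often assumes the numbers are dense. A minimal sketch for detecting them, assuming a hypothetical locations table with an integer location_number column:

-- Report every location number whose immediate predecessor is missing.
-- Table and column names are hypothetical placeholders.
select l.location_number
  from locations l
 where l.location_number > 1
   and not exists (select 1
                     from locations p
                    where p.location_number = l.location_number - 1)
;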

Using min() or max() to conveniently ignore non-unique values when grouping data

TODO.

3.3. EBCDIC data with packed decimals, implied decimals, redefines, and occurs varying

💡
Use an ETL tool to rip the format to something more modern. While such formats are old, they are rigorously defined by a copybook. The ETL tool will know how to translate the copybook into something modern. If you’ve lost the copybook, cry—​then plan to allocate many person hours to translating the data.

These constructions were slick in their day: they saved space when space was at a premium. Now space is cheap (whether RAM or disk), and people's time, which is the other side of that tradeoff, is more expensive than ever. To this day, if you have old mainframe feeds, you have to shred all these ancient constructs into their proper form. It can be ugly and time consuming.

3.4. Unicode representing "anything" while your system handles far less than "anything"

💡
Decide early either to build your whole system around Unicode, or to downgrade your Unicode data to a more basic character set and accept the data loss as the price of easier processing. Whatever you decide, commit to it.

TODO.

3.5. Processing structured data in a programming language instead of SQL

TODO.

3.6. Data that should have loaded but didn’t

TODO.

3.7. Data that should have loaded once but loaded twice

TODO.

3.8. Having different systems use similar wildcards, but with different characters

TODO.

3.9. Vehicle trips that never end or jump to random places

TODO.

4. Business Process

4.1. A frequent value turning out to be the first one in a drop-down list

TODO.

4.2. Not selecting an existing customer from an MDM system and entering them again

TODO.

4.3. ZIP code 12345 frequently occurring in a state where ZIP codes start with "06"

TODO.

4.4. ZIP code 06111 having urban density despite being in a rural area

TODO.

4.5. Dropping the zero off ZIP code

TODO.

5. Portfolio & Architecture

5.1. Using a Push architecture when a Pull architecture was readily doable

TODO.

5.2. Making the consumer do the publisher’s job

TODO. The publisher has the power but the consumer has the incentive. The publisher lets the mess flow because of their lack of incentive. But the consumer has to fix it because they lack the power.

5.3. Surrogate keys with accidental intelligence actually used downstream

💡
Go ahead and put some rudimentary intelligence in your surrogate keys if it seems valuable—​such intelligence can be very useful for basic traceability and troubleshooting. I believe it is worth the risk. Supplement that intelligence with actual fields that carry the same information. Govern data use to ensure downstream areas use the explicit fields not the implied intelligence.

TODO: Describe the situation.

5.4. Using Data Vault anywhere other than the 5% of the time it might be useful

💡
Make sure you're in a Formula 1 race and have Formula 1 staff before building this Formula 1 model. The book on this topic literally has a race car on the cover. It's an apt metaphor.

Data Vault is a brilliant modeling approach. It is highly generalized and scalable. But it comes with a price—​it is conceptually advanced and mechanically sophisticated. The advanced conceptual nature is easy to miss because it seems so simple—​it turns all data sets into one of three constructs that just repeat. But it’s actually really hard to abstract to this level of thinking for those without a strong engineering background. The mechanical sophistication is a side effect of the attempted simplification—​by making everything just one of three constructs, the edge and corner cases take on much of the complexity, resulting in some really hard problems in the last few situations it attempts to address. I believe such an approach is useful if it is justified, but you need a REALLY large, advanced problem to justify such an advanced approach. Most of the time, it’s overkill.

5.5. Data transformations with zero documentation

TODO.

5.6. Bucketing different money into the same codes across departments

TODO.

See also: Columns with multiple meanings.
