Understanding the Data Quality Crisis: From Good to Bad Data
Chapter 1: The Data Quality Dilemma
In the realm of data management, it's wise to treat every dataset as a cluttered storage space rather than a meticulously organized archive until proven otherwise. The key takeaway? Approach your data with healthy skepticism.
When in doubt, assume your data is a chaotic junkyard. But even when a dataset isn't a complete mess, there are two primary ways that suitable data can deteriorate into unusable information:
- Loss of Information During Transformation
- Issues in Information Selection
While there are many more pitfalls that can lead to poor data outcomes, let's focus on these two significant ones for the moment.
Section 1.1: Loss of Information During Transformation
Data quality diminishes whenever there's a glitch in translating real-world situations into electronic records. This issue can arise from various factors, including faulty hardware, broken instruments, or unexpected real-life complications. Consider these questions: Were your sensors properly calibrated? Did your laptop run out of power? Did the personnel responsible for data entry accurately record the information? How reliable is memory as a temporary storage solution? (Quick! How many hours did you sleep last night? Now, can you recall how many hours you slept two Mondays ago?)
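Many of these translation glitches can be caught with simple sanity checks at recording time. Here's a minimal sketch (the sensor readings, plausibility bounds, and function name are all hypothetical, invented for illustration):

```python
# Hypothetical example: flag recorded values that are physically
# implausible for the setting, e.g. room-temperature readings.

def flag_suspect_readings(readings, lower=0.0, upper=45.0):
    """Return (index, value) pairs for readings outside a plausible range.

    A miscalibrated sensor, a dying battery, or a data-entry typo
    often shows up as values no real-world scenario could produce.
    """
    return [(i, r) for i, r in enumerate(readings)
            if not (lower <= r <= upper)]

temps = [21.3, 22.1, -180.0, 23.0, 999.9]  # two implausible entries
print(flag_suspect_readings(temps))  # [(2, -180.0), (4, 999.9)]
```

A check like this won't prove your data is good, but it can tell you quickly when the translation from reality to records has clearly failed.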
Section 1.2: Issues in Information Selection
Another reason why your data collection can spiral into chaos is due to poor decisions regarding what to document and how. Were your crucial attributes recorded correctly? Did you overlook important data because you deemed certain attributes as insignificant? For instance, did you neglect to log the date and time of each observation? Did you opt for a simplistic scale when a more precise measurement was warranted? Did you ask all the necessary questions? These design missteps frequently plague well-intentioned projects.
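One way to see the asymmetry behind these design decisions: you can always coarsen a precise measurement later, but you can never recover precision you chose not to record. A small sketch (the record format and threshold are hypothetical, chosen for illustration):

```python
# Hypothetical example: record the raw value plus a timestamp,
# rather than only a coarse category decided at collection time.
from datetime import datetime, timezone

def record_observation(hours_slept: float) -> dict:
    """Log the raw measurement and when it was observed."""
    return {
        "hours_slept": hours_slept,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def coarsen(hours: float) -> str:
    """Derive a coarse label from the raw value; this direction is easy."""
    return "enough" if hours >= 7 else "not enough"

obs = record_observation(6.5)
print(coarsen(obs["hours_slept"]))  # "not enough"
# Had we stored only "not enough", the 6.5 would be gone forever.
```

The design choice here is to defer coarsening to analysis time, which keeps every downstream question answerable.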
Chapter 2: Who is Responsible for Data Quality?
Is it everyone’s responsibility? This notion often implies it’s no one’s responsibility. It’s concerning if you can’t identify a specific role dedicated to ensuring data quality, and it’s even more troubling that there’s no industry-wide consensus on this matter. We are constructing a data-driven world based on a shaky foundation with makeshift solutions.
Until we have specialists trained in data design, collection, documentation, and curation, we cannot expect that randomly chosen data enthusiasts will possess the necessary skills to guarantee quality data.
Unfortunately, there isn’t a designated job title for ensuring data integrity—a topic I have previously addressed. Until we establish a formal role and educational pathway that encourages students to pursue the complex skills required for this profession, we will continue to encounter junkyards where we anticipated finding treasures. Rather than libraries filled with profound knowledge, we may end up with piles of nonsensical information.
Thanks for reading! Interested in a YouTube course?
In my upcoming article, I will delve into how various incentives and perspectives toward data influence the advice given by different data professionals regarding data collection practices.
If you enjoyed this article and are seeking an engaging applied AI course suitable for both novices and experts, consider checking out the one I developed for your enjoyment:
The first video, "How to Tell Good Data from Bad Data," explores the critical distinctions between high-quality and low-quality datasets.
The second video, "When Big Data Goes Bad - Comedy in Place (E98)," humorously illustrates the challenges of managing large datasets.
P.S. Ever tried clicking the clap button on Medium multiple times to see the result? 🤔
Let’s connect! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Please use this form to reach out.