To CUVCAT Or Not To CUVCAT; That Is The Question
Big data and business intelligence have become vital components of data-driven decision making. The cost of lacking the data insights needed to make informed decisions has never been higher.
Yet, the process of cleansing, transforming, and validating data to fit an organization’s standards and industry best practices can be an extensive and fairly mysterious task.
The overarching concept of “Data Quality” was created to ensure the reliability of an organization’s data curation process according to some set of guidelines. Certainly, the definition of data quality and the threshold of what is “good enough” will vary from organization to organization, but the key to implementing and maintaining a data quality standard is to ensure the organization has a framework to guide it.
At Resultant, our preferred framework for data quality is CUVCAT, an acronym that identifies the necessities of quality data: completeness, uniqueness, validity, consistency, accuracy, and timeliness.
Using partial data risks telling an incomplete story, which can be misleading and even dangerous, leaving an organization with more questions than answers. The same goes for records that are only partially populated. Validating completeness gives a trustworthy indication that all anticipated data is accessible to the organization. Records can be unintentionally rejected, lost, or returned null as they move down the data pipeline. Testing for data completeness quickly surfaces missing data, which in turn allows development errors to be uncovered and corrected in near real time.
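As a rough illustration of a completeness test, the sketch below compares a source extract with the data that arrived at the end of a pipeline. The function name `completeness_report`, the `order_id` key, and the sample rows are all hypothetical and chosen only for this example; they are not taken from any particular tool.

```python
def completeness_report(source_rows, target_rows, key, required_fields):
    """Compare a source extract to its loaded target and report gaps."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}

    # Records present in the source that never arrived in the target.
    missing_keys = sorted(source_keys - target_keys)

    # Target records where a required field came back empty.
    null_fields = [
        (row[key], field)
        for row in target_rows
        for field in required_fields
        if row.get(field) in (None, "")
    ]
    return {"missing_keys": missing_keys, "null_fields": null_fields}


# Hypothetical sample data: one order is lost and one amount is nulled.
source = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 75.5},
    {"order_id": 3, "amount": 40.0},
]
target = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},
]

report = completeness_report(source, target, "order_id", ["amount"])
print(report)
# {'missing_keys': [3], 'null_fields': [(2, 'amount')]}
```

Running a check like this after each load is one way to catch dropped or nulled records close to the moment they occur.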
Just as data sets can have too few records, they can also have too many. Data duplication is usually the culprit, and ensuring that every record is unique avoids skewed reporting and analysis. For example, a financial data file that contains 100 records, one for each sales transaction, will give one result; if the file also contains duplicates of some of those records, sales and revenue will be inflated, giving the organization inaccurate information. Ensuring that all records are unique prevents data from being misrepresented and erroneous decisions from being made based on that data.
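The effect of duplicates on revenue can be shown with a small sketch. It assumes that a `transaction_id` field identifies a sale; that field, the helper `find_duplicates`, and the figures are hypothetical.

```python
from collections import Counter


def find_duplicates(rows, key_fields):
    """Return each key that appears more than once, with its count."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sales file in which one transaction was loaded twice.
sales = [
    {"transaction_id": "T-001", "amount": 100.0},
    {"transaction_id": "T-002", "amount": 250.0},
    {"transaction_id": "T-002", "amount": 250.0},
]

duplicates = find_duplicates(sales, ["transaction_id"])

# Revenue as a report would see it, duplicates included.
reported_revenue = sum(row["amount"] for row in sales)

# Revenue after keeping one row per transaction_id.
unique_rows = {row["transaction_id"]: row for row in sales}.values()
true_revenue = sum(row["amount"] for row in unique_rows)

print(duplicates)        # {('T-002',): 2}
print(reported_revenue)  # 600.0
print(true_revenue)      # 350.0
```

Which fields make a record unique is a business decision; a composite key can be passed as several names in `key_fields`.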
To govern the way data is presented, an organization must establish a set of requirements, or guidelines. To comply, data should be structured in an acceptable data type and format, and data values should fall within set constraints. A data item that does not conform to the established guidelines is invalid and introduces errors into the data set. When an organization fully understands the business requirements and guidelines for each object in a data set, validity can be better monitored and controlled.
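One way to picture validity rules is as a table of per-field checks covering type, format, and allowed values. The rules below are invented for illustration (the field names, the simplified email pattern, and the allowed statuses are all assumptions); a real organization would derive them from its own guidelines.

```python
import re
from datetime import date

# Hypothetical guidelines: one check per field.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately simple

RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and EMAIL_PATTERN.fullmatch(v) is not None,
    "signup_date": lambda v: isinstance(v, date) and v <= date.today(),
    "status": lambda v: v in {"active", "inactive"},
}


def invalid_fields(row, rules):
    """Return the names of fields that break their rule."""
    return [field for field, check in rules.items() if not check(row.get(field))]


good = {
    "customer_id": 42,
    "email": "ana@example.com",
    "signup_date": date(2023, 5, 1),
    "status": "active",
}
bad = {
    "customer_id": -7,            # outside the allowed range
    "email": "not-an-email",      # wrong format
    "signup_date": "2023-13-45",  # wrong data type
    "status": "pending",          # not an allowed value
}

print(invalid_fields(good, RULES))  # []
print(invalid_fields(bad, RULES))
# ['customer_id', 'email', 'signup_date', 'status']
```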
Data may be referenced in different ways according to the needs of various business users. It is therefore imperative that the data presented to users is consistent and true, with neither its structure nor its content changed. Inconsistencies throughout a data set can compromise data integrity, which decreases the data’s value to an organization. Objects representing the same information in different areas of a data source should have identical data types, naming conventions, and data values to ensure cross-database and cross-functional data integrity.
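A consistency check can be sketched as a field-by-field comparison of the same object held in two places. The two systems here, `crm` and `billing`, and their contents are hypothetical; the sketch flags both data type and value differences.

```python
def consistency_mismatches(system_a, system_b, fields):
    """List fields whose type or value differs between two systems."""
    mismatches = []
    # Only compare objects that exist in both systems.
    for key in sorted(system_a.keys() & system_b.keys()):
        for field in fields:
            a = system_a[key].get(field)
            b = system_b[key].get(field)
            if type(a) is not type(b):
                mismatches.append((key, field, "type", a, b))
            elif a != b:
                mismatches.append((key, field, "value", a, b))
    return mismatches


# Hypothetical customer records stored in two systems.
crm = {
    "C1": {"postal_code": "46204", "state": "IN"},
    "C2": {"postal_code": "60601", "state": "IL"},
}
billing = {
    "C1": {"postal_code": 46204, "state": "IN"},          # stored as a number
    "C2": {"postal_code": "60601", "state": "Illinois"},  # different convention
}

mismatches = consistency_mismatches(crm, billing, ["postal_code", "state"])
for item in mismatches:
    print(item)
# ('C1', 'postal_code', 'type', '46204', 46204)
# ('C2', 'state', 'value', 'IL', 'Illinois')
```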
Accuracy is what most people think of when they hear the term data quality. Accuracy is measured by the extent to which a data item is correctly represented in the context of a data set. Put simply, it means checking whether the data is right. Inaccurate data can directly lead to erroneous decision making. It’s important to note that data accuracy is distinctly different from data validity. Valid data has been verified against constraints and previously defined standards. Accurate data is data whose stored values are in fact the correct values for an object. Data can therefore be valid but not accurate.
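The difference between valid and accurate can be made concrete with a small sketch. It assumes a trusted reference exists to compare against, here a hypothetical physical stock count; the SKU names, quantities, and the 0 to 10,000 range are invented for the example.

```python
def is_valid_quantity(value):
    """Validity: the value has the right type and sits in the allowed range."""
    return isinstance(value, int) and 0 <= value <= 10_000


def inaccurate_records(stored, reference):
    """Accuracy: the stored value matches a trusted reference value."""
    return {
        key: (stored[key], reference[key])
        for key in stored
        if key in reference and stored[key] != reference[key]
    }


# Hypothetical inventory data. SKU-2 was keyed in as 15 instead of 51.
warehouse_db = {"SKU-1": 40, "SKU-2": 15}
physical_count = {"SKU-1": 40, "SKU-2": 51}

all_valid = all(is_valid_quantity(v) for v in warehouse_db.values())
errors = inaccurate_records(warehouse_db, physical_count)

print(all_valid)  # True
print(errors)     # {'SKU-2': (15, 51)}
```

Every stored quantity passes the validity rule, yet one of them is wrong, which is why accuracy generally needs an independent source of truth rather than constraints alone.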
Timeliness is simply providing the right people with the right data at the right time. A data set’s value relies not only on the correctness of the data but also on how quickly that data can be ingested and made available for use. As business operations continue to evolve, data is refreshed and changed frequently. Having timely access to data as it changes helps the business make appropriate, well-informed decisions.
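A timeliness test often comes down to comparing the last successful load time of a data set with an agreed freshness window. The sketch below assumes such load timestamps are recorded somewhere; the table names, timestamps, and 24-hour window are hypothetical, and the current time is fixed so the example is reproducible.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at, max_age, now=None):
    """Return True when the last load is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > max_age


# Fixed "current" time so the example always gives the same answer.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

# Hypothetical record of when each table was last refreshed.
last_loads = {
    "daily_sales": datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc),
    "inventory": datetime(2024, 1, 12, 6, 0, tzinfo=timezone.utc),
}

freshness_window = timedelta(hours=24)
stale_tables = [
    name
    for name, loaded_at in last_loads.items()
    if is_stale(loaded_at, freshness_window, now=now)
]

print(stale_tables)  # ['inventory']
```

In practice `now` would be left unset so the check uses the real clock, and the window would differ per data set according to how the business uses it.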
In summary, data quality can be validated in many ways, but best practice is to leverage the CUVCAT framework. Unless all six CUVCAT concepts are supported, data should not be considered completely trustworthy.
At Resultant, we have developed Validatar, an automated data quality assurance tool to deal with the challenges related to providing quality data. Validatar consists of a test repository to create and store test cases, an execution engine to run those tests, and a results repository that stores the test results for easy retrieval.
We want our clients and customers to have greater trust in their data quality validation process and to make better business decisions without having to resort to manual data testing.