Cast your mind back to childhood, when you played the Telephone Game: each child whispered a phrase to a neighbor, and by the end of the line the original message had morphed into an entirely new, hilarious phrase. In business, however, relying on poor-quality data from an originating source system can have disastrous consequences, especially for regulatory reporting, fraud detection and preventative maintenance schedules, to name just a few critical data-driven processes. Imagine the magnitude of the problem when you multiply the number of source systems, the data volume and the constant streams of data.
Once bad data creeps into your business processes, it leads to skewed insights, breeds widespread mistrust of downstream data use cases and gives rise to poorly governed workarounds.
How to check data accuracy for high-quality data
With the streams of big data your business is spewing out, how would you spot-check source and destination values that are in perpetual motion? You could write scripts to compare values at a point in time, but this requires highly skilled database practitioners with deep expertise in data organisation, data access and extraction. Rock-star developers aren't exactly motivated to carry out this task, nor should they be, given that their expertise can be better utilised elsewhere.
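To make the idea of a point-in-time spot-check concrete, here is a minimal sketch in Python. It assumes both sides have already been snapshotted into in-memory dictionaries keyed by primary key. The function name `spot_check` and the sample data are hypothetical, and a real script would have to query the actual source and destination systems instead.

```python
import random


def spot_check(source, destination, sample_size, seed=0):
    """Compare a random sample of source rows against the destination.

    `source` and `destination` are assumed to be dicts mapping a primary
    key to a tuple of column values, captured at the same point in time.
    Returns a list of (key, source_row, destination_row) mismatches.
    """
    rng = random.Random(seed)  # fixed seed so a check can be repeated
    keys = sorted(source)
    sampled = rng.sample(keys, min(sample_size, len(keys)))
    mismatches = []
    for key in sampled:
        dest_row = destination.get(key)  # None if the row never arrived
        if dest_row != source[key]:
            mismatches.append((key, source[key], dest_row))
    return mismatches


# Hypothetical snapshots: row 2 was corrupted on its way to the destination.
source = {1: ("alice", 100), 2: ("bob", 250), 3: ("carol", 75)}
destination = {1: ("alice", 100), 2: ("bob", 205), 3: ("carol", 75)}

mismatches = spot_check(source, destination, sample_size=3)
for key, src_row, dest_row in mismatches:
    print(f"key {key}: source={src_row} destination={dest_row}")
```

Even this toy version hints at the difficulty: on live systems the two snapshots must be taken at a consistent point in time, which is where the specialist expertise comes in.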
How do you make data verification part of your culture?
Doing spot-checks is one thing, but how do you scale data verification across multiple replication pipelines and isolate data flaws in large data sets? Given that some business processes are critical, especially around regulatory reporting, how would you integrate data validation into your operational processes? It helps to start thinking about implementing an alerting system to flag anomalies, and to consider adding auditable reports to the mix. And if your critical business processes depend on a small subset of a much larger dataset, consider parameterising your validation filters and metrics, e.g. the number of records replicated, a checksum over selectable columns, or data completeness (i.e. data values replicated instead of nulls).
- Reconcile data movement between source and destination, and do it often
- Make sure reconciliation is an auditable process and part of everyday operational processes
- Make sure the business is alerted when there is a data discrepancy, and remediate immediately
- Use data verification together with data lineage to help determine the root cause of data flaws
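The reconciliation metrics mentioned earlier (record count, a checksum over selectable columns, and completeness) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production design: rows are assumed to be plain dictionaries already extracted from each side, the function names `column_checksum`, `completeness` and `reconcile` are hypothetical, and the "alert" is just a printed line standing in for a real alerting system.

```python
import hashlib


def column_checksum(rows, columns):
    """Order-independent SHA-256 checksum over the selected columns."""
    digests = sorted(
        hashlib.sha256(
            repr(tuple(row.get(c) for c in columns)).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()


def completeness(rows, columns):
    """Fraction of selected values that are populated (not None)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for c in columns if row.get(c) is not None)
    return filled / total


def reconcile(source_rows, dest_rows, columns):
    """Return (report, discrepancies) for the chosen columns.

    The report pairs each metric's source value with its destination
    value, so it can be stored as an audit record.
    """
    report = {
        "row_count": (len(source_rows), len(dest_rows)),
        "checksum": (
            column_checksum(source_rows, columns),
            column_checksum(dest_rows, columns),
        ),
        "completeness": (
            completeness(source_rows, columns),
            completeness(dest_rows, columns),
        ),
    }
    discrepancies = [name for name, (src, dst) in report.items() if src != dst]
    return report, discrepancies


# Hypothetical data: one amount was replicated as a null.
source_rows = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
dest_rows = [{"id": 2, "amount": 250}, {"id": 1, "amount": None}]

report, discrepancies = reconcile(source_rows, dest_rows, ["id", "amount"])
for name in discrepancies:
    # Placeholder for a real alert (email, pager, ticket, ...).
    print(f"ALERT: {name} differs between source and destination")
```

Passing the column list as a parameter is what lets the same check focus on only the fields a critical process depends on. Running it on a schedule and persisting each report is one way to make reconciliation both frequent and auditable.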