The perils of dirty data

19/02/2008 12:46:54

Sometimes it's a problem of starting out with bad data, through user error or even deliberate sabotage. Sometimes the data starts out good but gets lost, truncated, or altered when it moves from one system or database to another. Your data may go stale, or it may become collateral damage in a turf war inside your organisation -- everyone clinging to their own little piece of the data store, nobody willing to share.

The task of keeping data clean certainly isn't helped by the overwhelming volume of data companies generate each day.

Data projects can go bad in many ways. Here are five of the most common: what went wrong, what happened as a result, and what you can do to avoid having the same thing happen to you. The names of the companies involved have been obscured to protect the guilty. Don't let your own project become someone else's horror story.

1. The "Dear Idiot" letter

Be careful where you get your data -- it may come back to haunt you. This tale of terror comes from the customer call centre of a large financial services institution. As in nearly all help desks, service reps take calls and enter customer information into a shared database.

This particular database had a salutation field that was editable. Instead of being constrained to Mr, Ms, Dr, etc., the field could accept 20 or 30 characters of whatever the rep typed. As service reps listened to the complaints of angry customers, some of them began adding their own, not entirely kind, notes to each record, like, "what an idiot this customer is".
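A constrained salutation field of the kind described above might look something like the following. This is only a minimal sketch: the whitelist and the function names clean_salutation and greeting are hypothetical, not taken from the institution's system.

```python
# Hypothetical whitelist; a real system would define its own list.
ALLOWED_SALUTATIONS = {"Mr", "Mrs", "Ms", "Miss", "Dr", "Prof"}


def clean_salutation(raw: str) -> str:
    """Return a whitelisted salutation, or an empty string if the value is not recognised."""
    candidate = raw.strip().rstrip(".").title()
    return candidate if candidate in ALLOWED_SALUTATIONS else ""


def greeting(salutation: str, full_name: str) -> str:
    """Build a letter greeting, dropping any salutation that fails validation."""
    parts = [clean_salutation(salutation), full_name.strip()]
    return "Dear " + " ".join(part for part in parts if part)
```

With a check like this at the point of use, greeting("Idiot Customer", "John Smith") falls back to "Dear John Smith", whatever a rep typed into the field years earlier.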

This went on for years. No one noticed because no other system in the organisation pulled data from that salutation field. Then, one day, the marketing department decided to launch a direct mail campaign to promote a new product. They came up with a brilliant idea. Instead of purchasing a list, why not use the service desk database?

So the letters went out: "Dear Idiot Customer John Smith".

Strangely, no customers signed up for the new service. It wasn't until the organisation began examining its outgoing mail that it figured out why. The moral of this story?

"We don't own our data any more," says Arvind Parthasarathi, vice president of product management and data quality for data integration specialists Informatica.

"The world is so interconnected that it's likely someone will pick up your information and use it in a way you never anticipated. Because you're pulling data from everywhere, you need to make sure you have the right level of data quality management before you use it for anything new."

What constitutes the "right level" will vary depending on how you use the data. "In the direct mail industry, getting 70 to 80 per cent of your data correct is probably good enough," he adds.

"In the pharmaceutical industry, you want to be at 99 per cent or better. But no company really wants, needs, or will pay for perfect data; it's just too expensive. The issue always is, how will it be used and at what point is it good enough?"

2. Dead men cast no votes

Data cleansing can be a matter of life and death -- literally. PR specialist Nancy Kirk was volunteering in the US congressional elections of 2006, calling registered voters to get them to the polls, when she noticed something odd: Three out of 10 voters she dialled were deceased and thus ineligible to vote (except in certain precincts in Chicago).

The problem of having data that is literally dead is not uncommon in the commercial world, and it has real consequences for the living.

Jim Keyster, president of The Keane Organization's investor retention and communication services division, has spent the past year rolling out an investor data quality program for Keane's clients, which include major insurance companies, mutual funds and Fortune 500 firms.

On average, Keyster says, 8 to 15 per cent of clients' data records contain anomalies such as mistyped Social Security numbers or outdated addresses. But about one in five of those anomalies is a shareholder who's been dead for more than 10 years. In one case, a client had an "active" account for a shareholder who last drew breath more than 72 years ago.

"This isn't client negligence, it's just a naturally occurring problem," Keyster says. Private companies go public, change names, get acquired, or spun off, and their shareholder data follows along, often for decades.

But the consequences can be greater than just money wasted on unnecessary mail. The biggest concerns are fraud and identity theft. Some stranger could be cashing the late shareholder's dividend checks, the rightful heirs could be denied their inheritance, or confidential company info could leak out.

The solution? Software such as Keane's Score application can identify data anomalies across different systems and flag them for review. But all companies must exercise due diligence, have good internal controls, and scrutinise their data on a regular basis, says Keyster.
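The article does not describe how Keane's software works, so the following is not its method. It is a rough sketch of the general idea of flagging suspicious records for human review. The record layout, the 110-year age cut-off and the 10-year silence limit are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # format check only, not proof of validity
MAX_PLAUSIBLE_AGE = 110  # assumed cut-off for "probably deceased"; tune to your data


@dataclass
class ShareholderRecord:
    """Hypothetical record layout for illustration."""
    name: str
    ssn: str
    birth_date: Optional[date]
    last_contact: Optional[date]


def find_anomalies(record: ShareholderRecord, today: date,
                   max_silence_years: int = 10) -> List[str]:
    """Return the reasons, if any, that this record deserves a manual review."""
    reasons = []
    if not SSN_PATTERN.match(record.ssn):
        reasons.append("malformed SSN")
    if record.birth_date is not None:
        age = (today - record.birth_date).days // 365  # approximate age in years
        if age > MAX_PLAUSIBLE_AGE:
            reasons.append("implausible age")
    if record.last_contact is None:
        reasons.append("no contact on file")
    elif (today - record.last_contact).days > max_silence_years * 365:
        reasons.append("no recent contact")
    return reasons
```

The function only flags records. Deciding whether a shareholder has actually died, and what to do about the account, stays with a person.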

"Virtually every business has this problem to some degree," he says. "From a risk management point of view, the best practice is to make sure you're keeping it in check. Understanding how this natural phenomenon impacts you is a good first step."

3. Duped by duplicates

User error is bad. User ingenuity can be worse. Take the case of the major insurance carrier that kept most of its customer data within a mainframe application from the 1970s. Data entry operators were instructed to first search the database for existing records before entering new ones, but the search function was so slow and inaccurate that most operators gave up and entered the records from scratch.

The result? Individual companies ended up in the database 700 or 800 times, making the system even slower and less accurate.

Unfortunately, the application was so deeply embedded in the company's other systems that management was reluctant to spend the money to rip and replace. Finally, the carrier's IT department made the business case that the company's inability to locate existing customers would ultimately cost it $750,000 a day in new premiums.

At that point, the company used SSA-Name3 by Identity Systems to clean the data, ultimately weeding out 36,000 duplicate records.

Dupes are one of those problems that keep IT managers up at night. The larger your database, the worse the problem usually is, says Ramesh Menon, a director at Identity Systems, which provides identity searching and matching software for organisations such as AT&T, FedEx and the Internal Revenue Service.

Unfortunately, nobody knows how big their problem is, he says. "If anybody tells you 'I have exactly 2.7 percent duplicates in my customer database', they are wrong."

There's no magic bullet, either. Menon says the solution lies in using data matching technology to isolate "the golden record", a singular view of information across multiple data repositories. Even then, the hardest part may be getting all the vested parties in an organisation to agree on what data they're willing to share, as well as what constitutes a match.
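Commercial matching engines are far more sophisticated than anything that fits in a few lines, and nothing here reflects how SSA-Name3 works. As a crude illustration of the idea, though, duplicate candidates can be grouped by reducing each company name to a comparison key. The suffix list and the functions match_key and group_duplicates are assumptions made for this sketch.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "co", "company", "ltd", "limited", "llc"}


def match_key(company_name: str) -> str:
    """Reduce a company name to a crude comparison key."""
    # Lower-case, turn punctuation into spaces, then drop legal suffixes.
    words = re.sub(r"[^a-z0-9 ]", " ", company_name.lower()).split()
    return " ".join(word for word in words if word not in LEGAL_SUFFIXES)


def group_duplicates(names: List[str]) -> Dict[str, List[str]]:
    """Group names by match key and keep only keys that occur more than once."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Even this toy version bakes in a definition of "match" (ignore case, punctuation and legal suffixes) that another department might reject.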

"Two different sections of the same organisation may have completely different definitions of what a match or duplicate contact is," he says. "These kinds of integrations fall apart because people can't agree about who owns the data or what information can be exchanged with others."

4. When data decays

Remember text-based adventure games such as Zork? Apparently, somebody somewhere is still making these things. Worse, they're using data that's equally ancient.

MailChimp co-founder Ben Chestnut tells the story of an old-school games developer that used MailChimp's e-marketing service to contact 10,000 previous customers, alerting them that he'd finally finished version two.

Most of the addresses were at least 10 years old -- some of them Hotmail accounts discarded so long ago that Microsoft was using the addresses as spam traps. Within a day, all MailChimp e-mail was blacklisted by Hotmail's spam filter.

Fortunately for MailChimp, the developer had kept pristine records, down to the IP address each customer had used to download his games. That's what saved them, says Chestnut. "We fired off a quick note to Hotmail's abuse desk -- proved they were legitimate customers, just old. The next day we got delisted. That's pretty rare."

All data ages quickly, but contact data ages faster than most.
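One modest defence, sketched below under stated assumptions, is to set aside any contact that has not been verified recently before a mailing goes out. The Contact layout, the split_by_freshness function and the two-year default are hypothetical. The right cut-off depends on the list and the channel.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Contact:
    """Hypothetical contact record for illustration."""
    email: str
    last_verified: date  # last time the address was confirmed to reach the customer


def split_by_freshness(contacts: List[Contact], today: date,
                       max_age_years: int = 2) -> Tuple[List[Contact], List[Contact]]:
    """Separate contacts into (fresh, stale) by how recently they were verified."""
    cutoff_days = max_age_years * 365  # assumed threshold, ignoring leap days
    fresh, stale = [], []
    for contact in contacts:
        age_days = (today - contact.last_verified).days
        (fresh if age_days <= cutoff_days else stale).append(contact)
    return fresh, stale
```

Stale addresses can then be re-verified or dropped rather than mailed blind.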

"You have to make the assumption that data decays like a radioactive sample," says Informatica's Parthasarathi. "You have to go into every system and periodically update it."

Jigsaw.com, an online contacts database geared toward sales professionals, takes a Wiki-style approach to data cleansing. Its 335,000 members get points for uploading their own contacts to Jigsaw and for correcting other members' entries. Every record must be complete, and if Jigsaw users enter information that's incorrect or old, they lose points. Members spend their points to buy information on people they want to reach.

Jigsaw CEO Jim Fowler says an Atlanta-based technology company recently asked his firm to compare its contacts databases to Jigsaw's and weed out the bad data.

"They had 40,000 records," he says. "Only 65 percent of them were current and 100 percent were incomplete. We're finding that most of our corporate customers have sets of data so cruddy no one can match to them. Corporations spend millions on CRM, and it's amazing how bad that data is."

The real value is not the data itself, but the ability to keep up with how quickly it changes.

"The power of Jigsaw is complete data and self-cleansing," says Fowler. "If our self-correcting mechanisms don't work, we're just another crappy data company."

5. The war on error

The difference between good data and bad can be as small as a single dot. Penny Quirk, principal consulting manager at Robbins-Gioia's Records Optimization Solutions, says she once consulted on a major data integration project where everything seemed to go fine. Six months later someone opened a data table and found rows of symbols but no data.

"It was a character coding error," says Quirk. "They used ellipses in some fields, and wherever someone had entered two dots instead of three it triggered the whole line of data to go corrupt."

The company had to painstakingly re-create the database from a backup, searching for the ellipses, then replacing them with the actual data.

More often than not, the problem is more than mere data entry errors or garbage in/garbage out. Most organisations fail to adequately plan when moving data between different operating systems or upgrading from older versions of SQL, says Quirk. They'll do it too quickly, using whatever resources are available now with the hope of cleaning it up later. (A bad idea, she adds.)

Worse, their test environments and production environments may not match, or they may test using a small subset of data, only to have big problems arise later with the data they didn't test.

"Organisations making dramatic changes in technology without putting forth the necessary time and effort to manage the data reconciliation, integration and conversions can become victims of bad data," Quirk says. "As data is moved from one source to another, the number of chances for it to become bad is astronomical."

Quirk's advice? Don't expect IT departments to validate your data set. Get the power users who work with the data to help plan and test the integration. Before you decide to consolidate, look at all your data fields and identify the applications that may be pulling data from them. When possible, test with all your data, not just a subset, because even the tiniest errors can send you and your data into a world of pain.
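Testing with all the data can be as simple as fingerprinting every row before and after the move and comparing the results. The sketch below is one possible approach, not something Quirk describes. The row layout (a key plus a tuple of text fields) and the function names are assumptions.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, Tuple[str, ...]]  # (primary key, field values)


def row_fingerprint(fields: Iterable[str]) -> str:
    """Hash the UTF-8 bytes of every field so any altered character changes the digest."""
    digest = hashlib.sha256()
    for value in fields:
        encoded = value.encode("utf-8")
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def compare_tables(source: Iterable[Row], target: Iterable[Row]) -> Dict[str, List[str]]:
    """Report keys that are missing, unexpected, or changed after a migration."""
    src = {key: row_fingerprint(fields) for key, fields in source}
    dst = {key: row_fingerprint(fields) for key, fields in target}
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "unexpected": sorted(dst.keys() - src.keys()),
        "changed": sorted(key for key in src.keys() & dst.keys() if src[key] != dst[key]),
    }
```

A row whose two dots turned into junk characters would show up under "changed" on the day of the migration rather than six months later.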

One final horror story illustrates just how big a small error can become.

Peter Teuten, president and CTO of Keane Business Risk Management Solutions, tells of a client that created a SQL server database to determine whether corrupt CAD files were circulating in its network. If the number of corrupted packets exceeded a certain threshold, the company would know to implement data mining and cleaning tools.

The problem? They accidentally inverted the rule set for the database; the more corrupt packets it found, the better their network appeared in the reports.

"The network was eventually infiltrated by a worm, which corrupted their engineering CAD files," says Teuten. "They had to rebuild most of them from scratch, which cost them millions of dollars. All from a very simple data extraction error -- two numbers were reversed."

If that doesn't scare you into approaching your next data integration project with caution, nothing will.
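An inverted rule like the one in Teuten's story is the sort of mistake a few lines of testing can catch before the reports are trusted. The sketch below is hypothetical throughout. The article gives no detail of the client's rule set, so needs_cleanup and check_rule_direction are illustrative names only.

```python
def needs_cleanup(corrupt_count: int, threshold: int) -> bool:
    """Return True when corruption exceeds the threshold and cleaning should start."""
    return corrupt_count > threshold


def check_rule_direction(rule, threshold: int = 100) -> None:
    """Fail loudly if the rule treats more corruption as better news."""
    assert not rule(0, threshold), "a clean network must not trigger cleanup"
    assert rule(threshold + 1, threshold), "exceeding the threshold must trigger cleanup"
    # Once the alarm has fired, more corruption must never silence it.
    fired = False
    for count in range(0, threshold * 2):
        result = rule(count, threshold)
        assert result or not fired, f"alarm switched off at {count}"
        fired = fired or result
```

Calling check_rule_direction(needs_cleanup) passes silently, while a rule written with the comparison reversed fails on the first assertion.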

Pharma industry touts cure for data security ills

By Matt Hines

Medical research often leads to unexpected breakthroughs in other peripheral areas.

Based on the success being enjoyed by a project developed among a handful of leading pharmacy industry players, some experts say that you can add enhanced data security to the long list of advancements attributable to the health care industry.

Founded in 2001, RxHub was the brainchild of three of America's largest pharmacy benefits managers (PBMs) -- companies responsible for handling the unseen legwork necessary to allow pharmacies to dole out prescriptions to eligible consumers, and for customers to employ their health care benefits to cover related expenses.

Those companies -- AdvancePCS (since acquired by CVS-Caremark), Express Scripts and Medco Health Solutions -- were looking for a way to better facilitate the massive volumes of data transfer needed to match customers with their medical, insurance and payment records to cut costs and eliminate potential mistakes.

People involved with the project claim that in creating RxHub, a joint venture that serves as an electronic clearinghouse responsible for gathering the medical and benefits data needed to serve customers, the pharmacy companies also pioneered an information-sharing model that other businesses may want to emulate to relieve their own data security headaches.

At its core, RxHub claims to be a universal communication framework that links health care providers, insurance companies, the PBMs and local pharmacies for the purpose of sharing electronic records and prescription data.

One of the most attractive side benefits of the venture, backers claim, is that in addition to streamlining that process, the effort has helped the partners pilot a new way of accessing and correlating sensitive information that protects the interests of the many businesses and customers they serve.

Rather than forcing any of the firms in the prescription drug food chain to create additional databases for the purpose of providing records to their various external partners, RxHub serves in a data transport role that mines information from all the sources in real time to process transactions, without ever aggregating the information itself.

Built around master data management software provided by vendor Initiate Systems, the RxHub infrastructure allows the involved parties to perform all the records validation work necessary for pharmacies to verify prescription and payment information in a matter of seconds, without demanding that anyone in the ecosystem create or retain any additional records.

In doing so, the system allows the companies to live up to the demands of regulations, including the Health Insurance Portability and Accountability Act (HIPAA), while protecting themselves from potential data leakage incidents, RxHub executives said.

"The idea was to create a hub for all of this sensitive data without ever creating a master database where the information itself would be stored; we're more like FedEx, we look at the routing information, handle the package, and get it there reliably," said JP Little, chief executive of RxHub.

"We've been very careful about how we architected these systems and our business itself from the get-go in terms of not wanting to retain any data," he said. "This work was being done before we existed, but there was no hub; serving that role in the middle, the responsibility for security and controls remains with the various stakeholders, and the risk is lowered across the entire process."

RxHub effectively specialises in real-time data mining, built on Initiate's Enterprise Master Person Index (EMPI) technology.

When a patient requests a prescription at a pharmacy using the tools -- which currently cover an estimated 135 million US consumers -- the transaction is pushed through RxHub, which in turn verifies the involved person's information wherever it is stored by the various PBMs and health care providers involved.

Rather than creating a central record of all of that data, or storing any related information itself, the company merely delivers the relevant results regarding eligibility and payment details to the pharmacy, which can dole out the involved medicines and patient benefits.
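The pass-through pattern described above can be sketched in a few lines. This is a toy model, not RxHub's or Initiate's implementation: the PassThroughHub class, its methods and the idea of modelling each source system as a lookup function are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# A source system is modelled as a function: member ID -> eligibility details, or None.
SourceLookup = Callable[[str], Optional[Dict[str, str]]]


class PassThroughHub:
    """Routes a lookup to registered sources and returns the answer without storing it."""

    def __init__(self) -> None:
        self._sources: Dict[str, SourceLookup] = {}  # routing table only, no patient data

    def register(self, source_name: str, lookup: SourceLookup) -> None:
        """Add a source system the hub may query."""
        self._sources[source_name] = lookup

    def check_eligibility(self, member_id: str) -> List[Dict[str, str]]:
        """Ask every source about the member and hand back whatever they return."""
        results = []
        for source_name, lookup in self._sources.items():
            record = lookup(member_id)
            if record is not None:
                # Build a new dict so the caller never holds the source's own object.
                results.append({"source": source_name, **record})
        return results  # nothing is cached on the hub
```

The only state the hub keeps is the list of places to ask, so compromising it yields routing information rather than patient records.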

Because the system eliminates the need to create additional databases of patient information throughout all the businesses involved, both the companies themselves and consumers covered by the system are less likely to fall victim to leakage events or attacks, proponents of the system say.

"A hacker would need to break into one of the PBMs or some other element of the overall prescribing system to get this data, but we don't actually have it," Little said. "The idea is to eliminate the need to create additional databases; we access the data needed to carry out the transaction from all these different sources, but we never retain it; that's the beauty of this type of an industry hub model."

Officials with Initiate say that the distributed architecture approach being utilised by the pharmacy sector is catching on throughout a number of other areas within the health care industry, including some government projects.

Demand has been driven primarily by concerns around HIPAA and other regulations in the health care sector, company officials said, and similar mandates in other markets, such as financial services and even law enforcement, are driving interest from other types of organisations.

"With this type of data mining, companies are able to tie together separate systems in a real-time environment without putting themselves at risk of a data leakage event or an attack on the information," said Scott Schumacher, chief scientist for Initiate.

"Companies can create centralised registries for these types of customer records that gather the data from source systems without ever disturbing it, or retaining anything that could be used to carry out fraud," he said. "We believe that along with a number of other benefits to business, this approach also creates a more dynamic security model for data sharing, and we're hearing from many different types of organisations today who have an interest in doing something similar."

Schumacher said that Initiate typically competes for deals with IT industry giants IBM and Oracle, which have built similar systems for fostering secure information federation among business partners.

And some industry watchers agree that the strategy adopted by RxHub in the pharmacy business could appeal to other types of firms, in particular retailers, as they struggle with issues of data retention, protection and regulatory compliance.

"This type of approach is already being used in the public sector, intelligence community and in the financial services industry," said Ray Wang, analyst at Forrester Research.

"Whether it is tracking down suspects or identifying risk among credit bureaus, in effect, the idea is to find a quick match and link without the saving of data to ensure that no information is inappropriately saved, or can be leaked."

