The three pillars of data

17/06/2007 21:08:00

Data is your most precious asset. Regulations governing retention, security and retrieval now carry severe penalties for mishandling. And brave new architectures - in which siloed applications give way to service-oriented ones that span the enterprise - demand consistent, constantly available data independent of the software that people originally used to create it.

Here, we examine the three pillars of data: security, quality and availability. The intent is to foster best practices that ensure your data receives the attention it deserves. No one can achieve zero defects, but the advice here could bring you a step closer.

Secure your enterprise data

By Paul F. Roberts

Regulations and a fear of banner headlines put the focus on data, not network, security

For DuPont, Gary Min may have seemed a model employee. A research chemist at DuPont's research laboratory in Ohio, Min was a naturalised US citizen with a doctorate from the University of Pennsylvania who had worked for DuPont for 10 years, even earning a business degree from Ohio State University with help from his employer.

During that time, he had moved up the ranks within the company, taking on various responsibilities on research and development projects within its Electronic Technologies business unit. He specialised in the company's Kapton line of high-performance films, which are used, among other places, in NASA's Mars Rover.

But Min's veneer of respectability began to crack on December 12, 2005, when he told his employer he would be leaving his job. According to a civil complaint filed by DuPont against Min, a company search the next day revealed that Min had recently been an avid user of the company's electronic document library, accessing almost 23,000 documents between May and December 2005, including more than 7300 records in the two weeks prior to his giving notice.

Alarmingly, Min had strayed from his area of specialisation, rummaging through sensitive documents related to Declar, a DuPont polymer that competed directly with PEEK, a product made by Min's future employer, Victrex.

With Min indicating he would relocate to a Shanghai office of Victrex, DuPont appealed to both law enforcement and the civil courts that it was worried its former researcher was absconding with a treasure trove of trade secrets for Victrex and perhaps other Chinese companies.

DuPont is not alone. The broad outlines of the Min case -- his Chinese nationality, his links to companies operating in that country, and the broad scope of his attempted intellectual-property heist from DuPont -- are in keeping with what the FBI says is an epidemic of state-sponsored economic espionage. By one estimate, there are as many as 3000 front companies in the United States whose sole purpose is to steal secrets and acquire technology for China's booming economy.

Welcome to the brave new world of enterprise security, circa 2007. It's a world where the troubles of yesteryear -- loud and stupid Internet worms and viruses such as MSBlaster, Sobig, or SQL Slammer -- seem trivial. In their place are rogue insiders with legitimate credentials, armed with Trojans and rootkits controlled from afar that may lurk for years without detection, bleeding companies of sensitive information.

It's a world in which premeditated plunder of specific data, rather than the mere breaching of the perimeter, is the point of network intrusions. And that means companies, more than ever, must monitor and secure data to prevent it from falling into the wrong hands.

Higher value, freer flow

"This is a problem of the evolving value of data," says Marv Goldschmitt, vice president of business development at Tizor, a data auditing and protection firm. "Data has taken on a value beyond what it originally had, and individuals don't know how to deal with that," he says. Moreover, the migration of almost all intellectual property and critical data to purely digital form, as well as the interconnectedness of corporate networks with each other and the Internet, stand in the way of discovering when data has been pilfered or that anything has gone awry, Goldschmitt says.

Security experts are painfully aware that clamping down on insider threats and data leaks is an order of magnitude more difficult than stopping malware. And while recognition of the data-security problem is spreading fast within enterprises, very few have taken steps to lock down their sensitive data and intellectual property.

"In our experience, most firms are far from addressing it," says Phil Neray, vice president of marketing at Guardium, a database threat and security monitoring firm. "These companies have hundreds of systems installed around the world but very few installed to protect intellectual property."

"The risk level is still very high," says Steve Roop, vice president of products and marketing at Vontu, one of a slew of smaller DLP (data-leak prevention) firms.

According to data accumulated from Vontu risk assessments on customer networks, approximately 2 per cent of all sensitive or confidential files are exposed to theft by unauthorised personnel, and around one of every 400 e-mails that leave a company exposes sensitive data -- either sent to an unauthorised recipient or sent to an authorised recipient in an insecure form that can be sniffed or otherwise stolen.

Companies usually overlook that exposed data because their security posture is still focused on network perimeters, not on what might be going on behind the firewall or even over secure connections with business partners and suppliers, says Paul Stamp, an analyst at Forrester. "The perimeter around data is shrinking. Between joint ventures and collaborative [business to business] stuff and remote users, the perimeter has become highly porous."

Exposure via business partners and third-party contractors is a top concern at Communications Data Services (CDS), a subscription service bureau that's part of Hearst, says Paul McCarthy, director of information services. In its databases, CDS maintains files (including credit card numbers) for 155 million active subscribers to publications such as Better Homes and Gardens, US News and World Report, Vogue and Readers' Digest.

Much of that sensitive data comes to CDS through channels that can be difficult to police, such as agents and third-party contractors, as well as over the phone and via the Web, McCarthy says.

Regulatory imperatives

Securing critical data that may be used in a variety of contexts is a daunting prospect for any enterprise. But the harsh reality of regulations such as Sarbanes-Oxley and the PCI (Payment Card Industry) data security standard is helping set priorities for enterprises that might otherwise remain in denial.

In particular, Sarbanes-Oxley's requirement that companies audit the access of privileged users to sensitive data -- and PCI's requirement to track user identity information whenever credit card data is touched -- are pushing companies to home in on where sensitive data resides and how it is being used, Goldschmitt says.

At CDS, PCI and Sarbanes-Oxley prompted the company to take a close look at all of its processes for handling subscriber data, McCarthy says. In addition to doing its own SAS (Statement on Auditing Standards) 70 audits of internal security controls, CDS is regularly audited by third parties.

Increasingly, audits are forcing enterprises such as CDS to push security measures closer to where data resides, whether on laptops, in databases, or in shared directories, Stamp says. It's a simple prescription but one that's difficult to implement because most companies start out with a hazy understanding of what their sensitive data is, let alone where it resides on their networks.

"Companies wake up and realise: 'We don't know anything!'" Goldschmitt says. "We've had companies come to us and say: 'We have 20,000 data servers and absolutely no idea which of them have sensitive data on them'."

Zeroing in and locking down

When the panic subsides, the hard work of discovery begins. Fortunately, enterprises have more data security tools at their disposal today than ever before.

Most companies in the DLP space, including Vontu and Tizor, can audit network activity to find sensitive data such as credit card numbers, magnetic-stripe data, or intellectual property on database and file servers, and monitor user access to that data. Firms such as PointSec -- now part of CheckPoint -- and startup Provilla can perform similar audits at the desktop level, monitoring file copying to portable storage devices, as well as e-mail and Internet-based file transfers.
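As a rough illustration of the pattern matching such audits rely on, the sketch below scans text for candidate card numbers and keeps only those that pass the Luhn checksum used by payment cards. It does not describe how Vontu, Tizor or any other vendor actually works; the function names and the sample text are invented for the example.

```python
import re

# Candidate: 13 to 16 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings in the text that look like valid card numbers."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits


sample = "Card: 4111-1111-1111-1111, ref 1234 5678 9012 3456"
print(find_card_numbers(sample))
```

The checksum step matters because a bare digit pattern would flag order numbers and phone lists as well; filtering on Luhn cuts many of those false alarms, though not all.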

Once that key data has been identified, DLP firms offer various strategies for securing it -- from tagging key intellectual property with signatures that raise alarms whenever they pass outside the company's control to blocking USB ports to prevent data transfer to portable devices. None of those approaches is sufficient to protect data without larger organisational changes, experts say.

"There are really cultural changes that need to occur," Guardium's Neray says. "You've got to focus on insiders and trust -- trust and verify."

Companies need to define security policies that cover critical data and educate employees about acceptable behaviour. "If you've got an SAP application, your company might access the database 22,000 times a day as part of your normal business processes. But if someone's using Microsoft Excel and bogus credentials to access SAP, that's a violation of policy," Neray says.
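Neray's example can be expressed as a simple allow-list check. The sketch below is an assumption-laden illustration, not Guardium's product logic: the account name, client program names and the `AccessEvent` record are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy: which client programs may use which database account.
ALLOWED_CLIENTS = {
    "sap_service": {"sap_app_server"},
}


@dataclass(frozen=True)
class AccessEvent:
    account: str     # database account used for the connection
    client: str      # program that opened the connection
    source_ip: str   # where the connection came from


def violations(events):
    """Return the events whose client program is not approved for the account."""
    return [
        e for e in events
        if e.client not in ALLOWED_CLIENTS.get(e.account, set())
    ]


events = [
    AccessEvent("sap_service", "sap_app_server", "10.0.0.5"),
    AccessEvent("sap_service", "excel", "10.0.3.17"),
]
for e in violations(events):
    print("policy violation:", e.account, "used from", e.client, "at", e.source_ip)
```

The thousands of routine accesses by the application server pass silently; only the spreadsheet connection using the same credentials is reported.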

He adds that traditional perimeter defences and identity- and access-management products also play a vital role in data security. In particular, companies should use their identity-management platforms and strict policies to link specific IP addresses to specific users, rather than allowing shared credentials to muddy the waters should a forensic examination need to take place.

"The problem is you've got applications like SAP and Oracle eBusiness Suite, which have privileged credentials to access the database, and those are widely available in the IT environment. Developers are using them, [database administrators], and the help desk," he says.

Enterprises also need to build practical, bottom-up policies that actually get enforced, rather than imposing unrealistic, top-down security policies that just get ignored, Stamp says. "Once you have a handle [on] where your data is and where it's going, you can start shoring up your infrastructure from the ground up."

Building barriers

Some of those measures can be straightforward. Companies seeking to protect data on laptops and other mobile devices have been a boon to top-tier data encryption vendors such as RSA and PGP.

Even at PKWare, makers of PKZip, simple encryption features that work across diverse platforms have helped drive sales. Data security now accounts for half of the company's business, compared with just 20 per cent three years ago, says Todd McLees, vice president of marketing.

As CDS has discovered, start with the obvious and build from there. The company used a layered approach to get a handle on external security -- with standard security measures such as firewalls, VPNs, and SSL encryption -- then added configuration control technology from Tripwire.

More recently, McCarthy says, CDS has deployed outbound filtering technology from Palisade Systems that can do packet-level inspection and spot data such as credit card numbers that might be traversing the company's network or leaving the company over FTP or HTTP.

CDS has gone further than tackling sensitive data as it flows among authorised employees inside the company. It has also assessed the behaviour of hundreds of companies that contract with the magazines CDS works with, many of which pay far less attention to data security -- and may send spreadsheets or CDs with sensitive subscriber data to the company.

Nonetheless, the threat of a Gary Min-style rogue insider looms large. The goal, McCarthy says, is to put up enough barriers that it becomes almost impossible for a lone insider to do significant damage.

"You want to reduce it to the point where nobody can act alone and do something," McCarthy says, "where you need a conspiracy of persons to make it happen."

+++

Improve the quality of enterprise data

By Peter Wayner

Ensuring data quality is always harder than it seems, but new tools are making the toughest task in ICT a bit easier

When I was a young programmer at an investment bank, my desk was next to the department of "data integrity", a small group with the thankless job of making sure that the databases held accurate records of stock transactions. The bank's computers could process millions of transactions in seconds, but a mistyped key or a missing value could jam the entire assembly line for data.

When things were running smoothly, we would amuse ourselves with philosophical discussions about just what it meant for data to have integrity. At the time, the bank didn't want insight or truth in their databases -- they just wanted the books to balance and the system to hum along. It was almost as if data integrity were an afterthought.

That view has changed. Data integrity -- or data quality, as the current parlance goes -- has become a hot topic in many IT departments. The CEO who used to be impressed by the Web site with forms for customers to fill out is now wondering why the data is such a mess. The marketing group wants real leads backed by real data, not a bit dump filled with inconsistency and inaccuracies.

A number of software vendors are tackling the problem by offering tools and packages that treat data as more than a pile of bits: They are building sophisticated, logical frameworks for information and tossing around philosophical words such as "ontology" to describe their models for numbers and strings in the database fields. After all, the problems of data quality exist because bits can never be perfect reflections of the underlying information.

Scrubbing data clean

These systems often have a sophisticated gloss but are typically practical tools designed to help an IT shop remove the most glaring and expensive problems. So while the problems may be framed in elevated terms, the solutions generally take the form of plain old if-then-else statements.

The systems scrub, or cleanse, the data by applying rules that remove all possibilities for false duplication. They might replace all instances of "Bob" with "Robert", for example, or recognise that all old telephone numbers from a particular city must now come with a current area code.
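Rules of this kind really can be that plain. The sketch below applies the two examples just mentioned to a customer record; the nickname table, the city-to-area-code table and the field names are illustrative assumptions, and commercial products ship far larger rule sets.

```python
# Illustrative rules only.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
AREA_CODES = {"Springfield": "217"}   # hypothetical city-to-area-code table


def cleanse(record: dict) -> dict:
    """Return a cleaned copy of a customer record; the input is not modified."""
    cleaned = dict(record)

    # Rule 1: replace known nicknames with the formal first name.
    first = cleaned.get("first_name", "").strip()
    cleaned["first_name"] = NICKNAMES.get(first.lower(), first)

    # Rule 2: an old seven-digit local number gets the city's area code.
    digits = "".join(ch for ch in cleaned.get("phone", "") if ch.isdigit())
    area = AREA_CODES.get(cleaned.get("city"))
    if len(digits) == 7 and area:
        digits = area + digits
    cleaned["phone"] = digits
    return cleaned


print(cleanse({"first_name": "Bob", "city": "Springfield", "phone": "555-0142"}))
```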

One of the oldest and most common applications for data quality software is address "cleansing", the process whereby a company takes a mailing list and ensures that all of the addresses are current, valid, and as complete as possible. Pitney Bowes Group 1 Software helped the US Postal Service develop the technology for parsing and correcting addresses -- and now Pitney Bowes is selling it for more general applications.

The technology aggregates rules for understanding addresses into a modular application that can recognise errors, correct them, and add the most complete ZIP code. It can distinguish between the two identical abbreviations in "St. Paul's St." and understand that "Saint Pauls Street" is the same road.
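To see why position matters when expanding "St.", consider the toy normaliser below. It is a deliberately naive sketch, not Group 1's parser: it assumes that a trailing "St" is a street type and that any other "St" prefixes a saint's name.

```python
import re


def normalise_street(name: str) -> str:
    """Expand 'St' according to its position and drop punctuation."""
    tokens = re.sub(r"[.']", "", name).lower().split()
    out = []
    for i, tok in enumerate(tokens):
        if tok == "st":
            # The final token is the street type; elsewhere it prefixes a name.
            tok = "street" if i == len(tokens) - 1 else "saint"
        out.append(tok)
    return " ".join(out)


print(normalise_street("St. Paul's St."))
print(normalise_street("Saint Pauls Street"))
```

Both calls produce the same string, so the two spellings can be matched as one road. Real address data needs many more rules than this single positional test.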

After early success with cleaning up addresses, Group 1 is now working to open up its tools so that they can help other parts of the enterprise. Navin Sharma, director of product management, explains that one big opportunity is in straightening out customer records, consolidating them when necessary.

Group 1's latest offering helps the sales force straighten out mistakes: When a new customer record arrives, Sharma explains, "We standardise it, we validate it and complete it. Is this customer already in the master data hub? Do I already have information? If so, I want to synchronise all of my systems with the latest information; otherwise, I want to add him as a new customer."
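The flow Sharma describes (standardise, validate, look up, then synchronise or add) might be sketched as follows. The in-memory dictionary standing in for the master data hub, the choice of e-mail address as the matching key and the function names are all assumptions made for illustration.

```python
def standardise(record: dict) -> dict:
    """Trim whitespace and normalise case so equivalent records compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def upsert_customer(hub: dict, record: dict) -> str:
    """Merge the record into the hub, keyed by e-mail; report what happened."""
    clean = standardise(record)
    if "@" not in clean["email"]:
        raise ValueError("invalid email: " + clean["email"])
    key = clean["email"]
    if key in hub:
        hub[key].update(clean)   # existing customer: refresh with latest data
        return "updated"
    hub[key] = clean             # no match: add as a new customer
    return "added"


hub = {}
print(upsert_customer(hub, {"name": "  jane  doe", "email": " Jane@Example.com "}))
print(upsert_customer(hub, {"name": "Jane Q Doe", "email": "jane@example.com"}))
```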

Such cleansing processes can be complicated. Jeff Jonas, chief scientist at IBM's Entity Analytics Solutions, says, "There are some risks if one over cleans the data -- especially if trying to decide which incorrect values can be discarded -- because you may end up dropping useful data."

At IBM, they avoid throwing out any data by venturing a best guess, not a permanent decision, about which values are "clean". Jonas explains: "Sometimes one learns something later that requires one to rethink an earlier decision; eg, maybe the bad data turns out to be an essential data point like a person's new nickname."
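One way to picture the "best guess, not a permanent decision" idea is a field that remembers every value it has seen and recomputes its preferred value on demand. The sketch below uses a simple most-frequent-value heuristic, which is an assumption for illustration and not a description of IBM's software.

```python
from collections import Counter


class Attribute:
    """Keeps every observed value; the 'clean' value is only a current best guess."""

    def __init__(self):
        self.observations = Counter()

    def observe(self, value: str) -> None:
        """Record one more sighting of a value; nothing is ever discarded."""
        self.observations[value] += 1

    def best_guess(self):
        """Most frequently seen value so far, or None if nothing was observed."""
        if not self.observations:
            return None
        return self.observations.most_common(1)[0][0]


name = Attribute()
for seen in ["Robert", "Robert", "Robby"]:
    name.observe(seen)
print(name.best_guess())
```

Because "Robby" is retained rather than deleted as noise, the guess can change later if that value keeps turning up.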

Business makes the call

Getting the input to make decisions about what is correct, or clean, is getting easier, because many of the new products have simple user interfaces that enable everyone in the enterprise to pitch in, a process that takes the weight off the shoulders of the IT department. Karen Hsu, principal product manager for data quality at Informatica, says her company is working to open up its tools to the people at all levels of the corporation.

"What we've heard from the customers is, 'I'm constantly asked to look into why a customer name isn't correct and that isn't my expertise'," Hsu says. "So we've let the business take on the responsibility. Those types of rules are things that the business can create and monitor on an ongoing basis. If there was a missing part, they would be notified by a dashboard rather than waiting for IT to do it."

Informatica's latest offering, like many in the space, includes a visual programming language that can create rules and workflows for cleansing data. Such languages make it easier for nonprogrammers to add rules and tweak the existing ones to cope with changing business conditions.

IBM has its own data quality solutions, WebSphere Product Center and Customer Center, which are designed to help customers create a single, correct version of the truth so that data can be used in a variety of applications without inconsistencies.

The structure and role for such tools is changing rapidly. The original tools were designed to work in the background to remove inaccuracies by parsing information, applying rules, and matching disparate sources. New versions from many vendors work within a service-oriented architecture providing answers immediately, a process that allows developers to eliminate ambiguities or inaccuracies before they occur.

Compliance alert

The vendors are also building dashboards that flag problems and let managers drill down into the data set to examine them. One of the biggest new applications for such tools is regulatory compliance. Software to ensure data quality can reduce workloads and prevent companies from inadvertently ignoring the law.

Kathleen Hondru, vice president of marketing at Innovative Systems, says her company is helping clients in banks and insurance companies scrutinise their client lists and look for matches against government watch lists. The company's matching engine can screen against all of the possible variations on a name and associate all of the potential "aliases" with the original record.
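A much-simplified version of alias-aware screening can be built with the standard library's `difflib`. This is a sketch under stated assumptions, not Innovative Systems' matching engine: the watch list, the 0.85 threshold and the token-sorting normalisation are all invented for the example.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case, drop punctuation, and sort tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def screen(client: str, watch_list: dict, threshold: float = 0.85):
    """Return watch-list entries that have any alias similar to the client name."""
    target = normalise(client)
    hits = []
    for entry, aliases in watch_list.items():
        for alias in [entry, *aliases]:
            score = SequenceMatcher(None, target, normalise(alias)).ratio()
            if score >= threshold:
                hits.append(entry)   # report the original record, not the alias
                break
    return hits


watch_list = {"John A. Smith": ["Jon Smith", "Smith, John"]}
print(screen("Smith John", watch_list))
print(screen("Mary Jones", watch_list))
```

Each hit is reported under the original record's name, which mirrors the idea of associating every alias with one master entry.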

This application is a good example of how a number of tool vendors offer systems that do more sophisticated matching operations than can be easily accomplished with traditional relational databases. The tools preprocess the information and ensure that the matching is faster, simpler, and more consistent.

These applications of different kinds of computer science research show that the domain is just beginning to enter the mainstream of the IT world. In the past, IT managers talked about generating reports, but now they ask whether data cleansing can help them produce more accurate ones. The compliance officers who once asked for simple tracking and alarm bells are now wondering whether better tools can provide more comprehensive oversight.

The future of quality

Better tools for a variety of data quality applications are in the works. Theresa DeRycke is a so-called data therapist for CRMfusion, a company that specialises in data quality solutions for on-demand CRM, including its DemandTools offering for Salesforce. "Once the data is cleaned up, then you have to think about maintaining it," she says. "I think the next hot topic is execution of the data -- territory management. Now that we have all the data in, cleaned, and a way to keep it clean, how do we divvy it up?"

One company, Silver Creek Systems, is taking automation of data matching to the next level with semantic technology. Its DataLens solution separates such complex data as product information into content groups, standardises it, and creates taxonomies in a manner that minimises human intervention.

It's important to note, however, that humans can never be taken out of the equation. Contradictory or incomplete data strewn around the enterprise in various databases and formats is the ugliest problem in IT. Reconciling and normalising all that data is hard, tedious work. There's no silver bullet, but new solutions are going a long way toward enabling enterprises to create a single version of the truth without driving IT insane.

+++

Improve availability of enterprise data

By Doug Dineley

Standard configurations and standard procedures keep unplanned downtime at bay

Ask an expert about data availability and how to ensure it and the conversation quickly turns to the subject of human error. Not that IT mistakes are the leading cause of unplanned downtime; the research firm Gartner identifies software failures as the chief culprit, and "operator error" as the second most common cause, ahead of hardware outages; building or site disasters; and metro disasters, such as storms or floods, in that order. But of all of these major causes, human error is the one that IT can really do something about.

IT folks close to the action generally agree with Gartner's ranking, although some suggest that Gartner may even have underestimated the role of mistakes. Software failures often result from configuration errors, and sometimes they arise as the result of improper testing: an incompatibility isn't discovered because an application was tested on a different system configuration than the one in production, for example, or performance testing didn't give the app the workout it would get in real life.

Even many hardware failures can be laid at the feet of IT malpractice. If systems aren't cooled properly, if they're improperly racked, or if the procedure for starting them up and shutting them down isn't followed correctly, equipment life is shortened and premature failures can result. Even for dumb hardware, it pays to read the manual.

But whether it's software testing practices, hardware maintenance procedures, or the plain old boneheaded mistake lurking in the dark, the question is what to do about it.

Goofproofing

If you've recently suffered from a blunder-induced outage, you might be tempted to ask, Why me? Mauricio Daher, a principal consultant with the storage services provider GlassHouse Technologies, can tell you: not enough red tape.

In Daher's line of work, which is helping large IT organisations prepare for disaster and recover from outages, he's seen his fair share of glitches attributable to human error.

"Out of those," he says, "it is mostly, 'Gee, somebody reconfigured a LUN [logical unit number] that was actually a production LUN but they thought it was something else'. These are simple things that I see happening again and again because of the nature of my business."

You might think human error is an equal-opportunity affliction, but these sorts of slips just don't happen in better-run enterprises, Daher points out. "By the time you get to a point where you can input those commands, you've been through so many bits of red tape that it's impossible to make a mistake," he says. "That type of mistake really doesn't happen in a mature organisation, because there are so many safeguards."

Daher and GlassHouse use the CMM (Capability Maturity Model) to evaluate datacentres. Essentially, CMM is a model for process improvement that measures maturity level on a five-point scale. When Daher assesses an IT organisation, he is looking for standard operating procedures, whether they have SLAs in place, how they measure against those SLAs, and whether there is accountability at various points in the personnel chart.

Training, documentation and standardisation are the essential ingredients of process success. Falling short on the CMM scale typically has more to do with a lack of discipline than a shortage of skills.

"At one end [of the CMM scale], you might have some superstars who do a really good job of managing [the datacentre], and they're indispensable, but unfortunately they haven't documented fully, and if one of those guys gets hit by the proverbial bus, you're in trouble," Daher says.

"And the other extreme is a fully documented environment where everything is automated, and if something's not automated, there is a manual procedure in place that runs like clockwork."Which of those descriptions hits closest to home?

Choosing a well-known standard such as ITIL (Information Technology Infrastructure Library) is helpful in that new hires already versed in it will get up to speed in your environment faster, although Daher notes that many successful datacentres had similarly rigorous practices in place years before ITIL became fashionable.

The key is that your internal standards be rigorous, well documented, and drilled into everyone in the organisation. And those standards should extend all the way down to simple tasks such as configuring a switch and even to the naming conventions used for your zone sets.

That last recommendation came out of Daher's work with a large oil company, in which the two administrators who managed the storage fabric used different naming conventions, and even these were inconsistent. This worked just fine on a day-to-day basis, but it would be a showstopper if one of those admins -- or worse, someone else in the IT organisation -- had to recover from an outage on his own.

"A lack of consistency in the documentation of such a simple thing seems minor, but it can really kill you and prolong your pain when you're trying to do really complex things at 2 in the morning." It all comes down to accountability, Daher says, adding: "If their boss had really been accountable for hard results, that sort of thing just wouldn't happen."

Ironing out the process

For Tom Ferris, manager of servers and storage for an international financial institution that prefers to remain nameless, the success of his company's high-availability initiative depends as much on implementing standardisation and controls as it does on traditional disaster-recovery planning.

He says most of the problems his group experiences are due to inadequate testing, misconfiguration, or other mistakes, and they are revamping their processes to address them.

"A lot of the emphasis of the high-availability program is on putting the technology in place for redundancy and fail-over capabilities and that type of thing, but in my mind that doesn't really get you high availability," he says. "Most of the outages that we've experienced, and if you look at what the analysts say, most of the outages in general, are not caused by the technology; they're caused by people making changes."

The high-availability program dovetails with a utility computing initiative also going on at the company, giving Ferris and his group an opportunity to change the processes for application provisioning and administration in a way that serves both. The goal is to move away from dedicated servers for each application to a shared infrastructure model, in which the application owners will purchase a set of services -- compute, storage, availability, and so on -- from the IT group.

Each of the IT services will be available in gold, silver, and standard service levels. Before deploying an application, the owners will need to determine how much computing resource it needs, how much storage it needs, and the level of availability it requires, all of which will determine whether the app is deployed on a stand-alone machine, into a cluster with local fail-over, or into a cluster that supports both local fail-over and fail-over to a business continuity site 50km away.

While each service level maps to a specific standard configuration, the administrative model will be consistent across all three tiers. The consolidated infrastructure dramatically lowers hardware costs, especially for high-availability configurations and, as Ferris notes, especially if you are faced with different groups having their own separate test and dev, staging, and production servers.

"Especially when you get into high availability," he says, "[having all of your apps running on their own servers] becomes very unwieldy. If you can take all of your Oracle databases and combine them on, let's say, a three-node cluster, like we're doing, you can house a lot of databases there. You don't have to have 15 separate database servers, and based on the requirements of the application you can configure the database for the type of fail-over you need pretty easily, because you've already got your cluster built."

One key element is standardising on configurations for production servers and ensuring that the servers in test and development match it. A central group responsible for release management will usher any new code or changes into production, making sure they are bundled up from test and development, put into staging, run through a checklist of tests, and finally promoted into production.

"In the staging and production environments, the application developers and application owners won't have administrative access any more," Ferris explains. "They might not even have administrative access in test and development." If they do, Ferris says, the environment would be closely managed to ensure that the configurations in testing match those of production servers.

The IT group uses BladeLogic to manage those configurations and control releases, and to run compliance reports to check for variance from standard configurations. The controls help prevent mistakes from impacting production servers, and the standard system images help speed up provisioning -- a benefit that extends to disaster recovery.
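The idea behind such a compliance report is easy to show in miniature. The sketch below compares each server's settings with a baseline and lists the differences; it is not BladeLogic's interface, and the setting names and version strings are made up for the example.

```python
def drift_report(baseline: dict, servers: dict) -> dict:
    """Map each server name to the settings that differ from the baseline.

    Each difference is recorded as (expected value, actual value).
    Servers that match the baseline are left out of the report.
    """
    report = {}
    for name, config in servers.items():
        diffs = {
            key: (expected, config.get(key))
            for key, expected in baseline.items()
            if config.get(key) != expected
        }
        if diffs:
            report[name] = diffs
    return report


baseline = {"os": "RHEL 4", "oracle": "10.2"}
servers = {
    "prod1": {"os": "RHEL 4", "oracle": "10.2"},
    "test1": {"os": "RHEL 4", "oracle": "10.1"},
}
print(drift_report(baseline, servers))
```

A report like this is what catches a test machine running a different database version from production before that mismatch hides an incompatibility.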

"We've packaged the configuration of [our] Veritas cluster server, the baseline OS, and the Oracle database into a reusable configuration that makes it easy to rebuild the environment from scratch," Ferris says. "You can set variables for IP addresses, so it's easy to re-create a multitier application in a new environment."

Investing in availability

In addition to providing important safeguards and making complex infrastructure easier to manage, the combination of standard configurations, standard procedures, automated provisioning tools, and a consolidated infrastructure helps to drive down the cost of high availability. Other technologies are playing a role here, too, notably clustered storage and server virtualisation.

But while many of the associated costs are coming down, keeping datacentres running will always require significant investment in the people who maintain them, not to mention the time and effort poured into improving the processes by which the whole infrastructure is managed. Training, standards, and careful management of changes will only increase in importance as applications continue to become more complex and more interdependent.

You might find a good lesson in the famous case of the missing NetWare server that ran for four years after being sealed behind a wall by construction workers: The best thing you can do for a system is to leave it alone. Of course, that's not possible for most business applications, especially in these days of rapid change. But if you can't build a wall, you can at least start laying down some red tape.

