DATA MINING AND THE GOVERNMENT: IS THERE A UNIQUE CHALLENGE?
by Patricia L. Carbone


As data mining products and companies become more prolific and more popular with the computer press, it is inevitable that the federal government will become interested in whether this new technology applies to its data and situations. Most of the press is now focused on commercial applications, such as Wal-Mart using data mining to look at the success of its retail stores. Others focus on the commercial use of customer segmentation to determine what products a particular type of customer may prefer to purchase.

Many governmental organizations feel they have unique problems and needs that are not addressed by the commercial marketplace. They are unsure whether data mining will ever bring the payoffs seen by the commercial world. Many governmental decision makers are very similar to Dollar, Mark, and Otto, the senior executives that Dr. Bhandari introduced in his opening article dated March 31st, who simply do not understand the full potential and use of data mining. And there is still the lingering memory of the unfulfilled promise of expert systems from the 1970s. The government invested a large amount of money in expert systems solutions. The promise was that including an expert system in an application would eventually cure the common cold. Sadly, we found that the hype of expert systems was truly just hype. Unfortunately, a stigma was attached to anything labeled "artificial intelligence" from that point on.

So, let's look at some of the various communities within the federal government and see what issues they face with data mining. Let's start with the parts of the government that have more "traditional" applications for data mining, such as the Internal Revenue Service. These organizations have applications very similar to the commercial banking and financial communities. A pressing concern for these organizations is the quantity of data being collected and the privacy of the data. The potential for use of data mining techniques for these organizations should be great, as witnessed by the success of data mining in the financial world.

Next consider organizations such as NASA and the National Institutes of Health. These organizations are collecting scientific data for analysis to improve our understanding of the universe and of the human body. There are two overriding concerns for these organizations. One is the quantity and rate of data being collected, particularly at NASA. The Earth Observing System will be collecting terabytes of data per day. Imagine the speed at which data mining and other analysis algorithms must work, so as not to overwhelm the scientific community. The other concern is the large quantity of data that is in textual format. Text is an area that has been addressed by the information retrieval community. We are all familiar with Internet search engines like Alta Vista or Yahoo that enable us to retrieve documents that pertain to a particular keyword. But what about summaries of those documents? What if I do a search on "data mining" and discover that over half of the articles returned have to do with coal mining? The information retrieval community is expanding into data mining through the development of text summarization techniques, or the ability to scan through a document and provide a short summary of the article. Obviously, we would all like this technology to be available immediately so that we can more easily determine the document we really want to retrieve from our internet search engine.

Everyone is extremely interested in the intelligence community, particularly with the popularity of the Tom Clancy novels and James Bond movies. The intelligence community would include those organizations within the Department of Defense, the FBI, and other intelligence agencies. There is a certain fascination and supposition that the intelligence community is at the forefront of technology so that it can protect the security of the United States. The community obviously has access to large amounts of data through various collection mechanisms such as satellites. But consider the form that the data will take: structured data, such as that stored in a DBMS; imagery data, such as that collected by a satellite; textual data, particularly all those documents that are available on the Internet; video data, containing news programs from networks like CNN; geospatial data; and audio data. The data can be quite "dirty" or noisy, since the collection mechanisms can vary in age and type, or the sources of the data can contradict one another. "Dirty" data can also include data that is not in English -- all those documents, news stories, and audio broadcasts that are in all the languages of the world. And above all, the intelligence community has to protect the security of the information it has collected, so security and privacy issues are the primary concern.

Finally, let's look at the Department of Defense (Army, Navy, Air Force and Marine Corps). Here we get into data that is collected in real time to support real-time decisions that can mean the difference between life and death. For example, look back at the USS Stark incident. If data mining had been prevalent then, perhaps an identified pattern of "aggressive behavior" would have more readily indicated that the incoming aircraft was a threat and that it should have been shot down. And compare that situation to the USS Vincennes incident, where the identified pattern hopefully would not have been one of "aggressive behavior," but one of "commercial airline traffic." Since the data being monitored is real-time data, often from aircraft moving at 400 miles per hour, the commanders of planes and vessels must make real-time decisions with life-or-death consequences. Similar situations occur with Federal Aviation Administration data, such as the data monitored by air traffic controllers, where split-second decision making is necessary.

Of course, there are the normal problems faced by many organizations. Much of the data is collected in legacy systems, using technology from the 1970s and the early 1980s. There are huge problems with differences among the "owners" of the data. As a simple example, imagine the Army description of a unit (being a brigade) versus the Navy description of a unit (being a ship) versus the Air Force description of a unit (being a wing command). Who is responsible for defining or refining the definition of a unit? It is a difficult task to persuade a department to change its structure in order to facilitate data mining across multiple departments. The government does not have the luxury enjoyed by a commercial company to simply allocate x millions of dollars to overhaul a number of departments, build a data warehouse that combines the various data, and facilitate the use of data mining to improve its particular services. And remember that the federal government has many millions of reluctant taxpayers like us who do not want to see our hard-earned money wasted.

So is the government unique or not? Can data mining be inserted into the various domains to save dollars, security, and lives? These questions are still not fully answered, but we in the data mining field are quite optimistic. Dr. Bhandari discussed the issues of security and privacy in a previous article. Certainly, most agencies will do all they can to protect the security of all data and identified information. The issue of data mining in textual data is very definitely being addressed more and more by the commercial community due to interest by average Internet users in having more efficient search engines. Data mining for real-time analysis and decision support will also continue to be explored. At some point, however, a choice may have to be made between performing the analysis itself in real time and providing tools that handle data collected in real time based on off-line data mining.

Above all, those of us in the data mining community would like to avoid the bitter experiences of the expert systems community and avoid promising instant, generic solutions to truly complicated, albeit non-unique, problems. As Dr. Bhandari suggested in the earlier article, we must make an effort to educate the senior executives within the governmental organizations about the business intelligence technologies.

---

Dr. Inderpal Bhandari, DS* executive editor at large, secured this guest editorial. Dr. Bhandari may be contacted at inderpal@virtualgold.com. More information may be obtained at http://www.virtualgold.com

