
TRACKING THE USE OF DATA AND INFORMATION: A MATTER OF NATIONAL SECURITY
By Ed Colet


Substantial efforts are devoted to technology for collecting, storing and analyzing data. Data is an asset, and converting it into useful knowledge provides competitive advantages. Unfortunately, the theft of data and information is a surreptitious way to gain advantage, and apparently a difficult matter to prosecute. In this column, I address ways to ensure that data as a valuable asset can be protected, and that any misuse is readily apparent.

To summarize the current situation, it appears that China has been able to develop modern nuclear weapons systems based on the theft of secrets from US nuclear research labs. The thefts appear to have occurred in the early 1980s and were first suspected in 1992. A suspect was identified in 1995, and only in 1999 did the US obtain its most convincing evidence: the transfer of highly classified nuclear design codes onto an unclassified computer. But from this point, it is not known what happened to the data. Prosecution for espionage is therefore hindered, because the law requires evidence that materials have actually been passed on to a foreign power (in this case, China). It is this evidence that is lacking. As John L. Martin, former chief of the Justice Department's Internal Security Section, is quoted in the NY Times: "in a bank robbery...money is missing; in a homicide, there is a body; in a kidnapping, someone disappears. But espionage is a phantom act: usually nothing is missing, nothing is left behind and the victims may not know they were a target." Without physical evidence of some form, prosecution is difficult. The implication is that there should be ways to give data and its use some type of physical form that could stand as evidence.

Because espionage cases represent "phantom crimes", suspicions are raised by indirect and circumstantial evidence. The convicted spy, Aldrich Ames, was suspected due to spending patterns that appeared beyond his means, including his driving a Jaguar. In the current case, it was a surprising nuclear test by China in 1992 that triggered the US suspicions of theft. The former Los Alamos scientist, Wen Ho Lee, was suspected based on his access to related and classified information, his travels to China, his meetings with scientists, even the way he was hugged by a Chinese official in what now appears to be an overly congratulatory manner, and only recently based on evidence of an inappropriate data transfer. Yet almost two decades after the theft, he has been neither charged nor prosecuted.

It would be nice to think that data mining can readily discover spies in our midst. We would simply mine spending habits, travel patterns, personal interactions, national weapons tests or other global events and be able to identify spies with certainty. This capability will most likely remain data mining fiction, though.

The more practical alternative to the data mining fiction of mining everything is to ensure that sensitive information leaves data trails that can more easily be followed and mined. This would be less personally intrusive, more useful for legal and evidentiary purposes, and easier to implement. To what extent can we do this? At least three conditions need to be in place. These prerequisites are (1) the capability to log activity; (2) the ability to digitally mark or tag information; (3) the ability to track digitally marked information across networks. Portions of all three of these conditions are already well developed.

Logging capability -- To a large extent, we already have robust logging capabilities in place. Log files are important for error tracking, replication and validation, system performance efficiencies, and even fraud detection. At the level of an individual application, we can readily create individual application log files. It's a little more difficult to create an individual log file that tracks activities between separate applications (e.g. a SAS data set converted to an SPSS data set via the DBMS/COPY application). Within an Intranet it is possible to create log files that track file transfers and data movements - and these are routinely used for network performance analysis and back-up procedures. But movement across networks such as the Internet is more difficult. If information is downloaded from a website onto a zip drive, renamed, and then uploaded onto another website, it's difficult to trace this path and know that the information is the same because there's no way to readily identify the information at the source and at the destination as being the same.
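To make the identification problem concrete, here is a minimal sketch in Python of a log that records a content fingerprint (a SHA-256 hash) alongside each file name. Everything in it is an assumption for illustration: the function names, the log fields, the host names and the file names are hypothetical, not part of any existing logging product. It shows that a renamed copy can still be matched to its source when the log stores a fingerprint of the contents rather than only the name.

```python
import hashlib
import time


def fingerprint(data):
    """Return a SHA-256 hex digest that identifies the content, not the name."""
    return hashlib.sha256(data).hexdigest()


def log_event(log, action, name, data, host):
    """Append one entry to an in-memory log (hypothetical log format)."""
    entry = {
        "time": time.time(),
        "host": host,
        "action": action,
        "name": name,
        "sha256": fingerprint(data),
    }
    log.append(entry)
    return entry


def trace(log, digest):
    """Return every log entry whose content fingerprint matches the digest."""
    return [entry for entry in log if entry["sha256"] == digest]


# Illustrative data only: the same bytes are downloaded, renamed and uploaded.
log = []
secret = b"example sensitive contents"
log_event(log, "download", "codes.dat", secret, "lab-server")
log_event(log, "upload", "vacation.zip", secret, "public-site")
log_event(log, "download", "memo.txt", b"unrelated contents", "lab-server")

for entry in trace(log, fingerprint(secret)):
    print(entry["action"], entry["name"], entry["host"])
```

A plain hash only matches byte-for-byte copies. Changing a single byte of the file produces a different fingerprint, so this approach alone cannot follow information that has been converted or edited along the way.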

Digital marking -- A digital watermark is a subtly embedded piece of information in a file. It's equivalent to the watermark on paper currency. But a digital watermark can do more than its paper equivalent. It can be implemented to remain despite any transformations of the file, and this would ensure that there is a continuous way to link information pieces together. Digital watermarks are already used to prevent the misuse of copyrighted information. For example, some image files have digital watermarks that become apparent only when the file is printed, rendering the image unacceptable and therefore preventing the unauthorized publication of the image. By the same token, if a highly sensitive data object (a single data file, or collection of related files) were digitally watermarked to ensure that it's used only by authorized users running authorized applications installed on authorized machines that are located in authorized intranets, then the use of highly sensitive information could be tightly controlled.
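As a rough illustration of what "subtly embedded" can mean, the sketch below hides a short identifier in the least significant bits of a list of integer samples (for example, pixel values). This is a deliberately simple technique that I am substituting for illustration, and the function names and the mark "LAB-0042" are hypothetical. Least-significant-bit marking does not survive transformations such as rescaling or format conversion, so it is not the robust kind of watermark described above. It only shows that a mark can travel inside the data while changing each value by at most one.

```python
def embed(samples, mark):
    """Hide the bytes of `mark` in the lowest bit of the first samples."""
    bits = [(byte >> shift) & 1 for byte in mark for shift in range(7, -1, -1)]
    if len(bits) > len(samples):
        raise ValueError("not enough samples to carry the mark")
    marked = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the next bit of the mark.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract(samples, n_bytes):
    """Read `n_bytes` of mark back out of the lowest bits."""
    out = bytearray()
    for b in range(n_bytes):
        value = 0
        for i in range(8):
            value = (value << 1) | (samples[b * 8 + i] & 1)
        out.append(value)
    return bytes(out)


# Illustrative data only: 64 sample values carrying an 8-byte identifier.
original = list(range(100, 164))
marked = embed(original, b"LAB-0042")
recovered = extract(marked, 8)
print(recovered)
```

Because only the lowest bit of each value is touched, the marked data stays close to the original, which is the sense in which the mark is subtle.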

Tracking by digital watermarks -- Defining accepted uses of digitally watermarked information and limiting their use to only specified activity may adversely affect productivity (e.g., a researcher finding that it's not possible to run a new analysis to test a model on the data). If so, it should also be possible to allow less restrictive use of information while keeping it readily traceable. In the context of daily work, a lot of information is routinely downloaded, transformed and uploaded across systems. Being able to track and trace these paths efficiently may be useful. In order to do this, digital watermarks would have to be detectable by automated network crawlers that are perhaps run at routine intervals. The resulting paths would then be placed into log files. These log files containing the paths of how information is used can then be analyzed, and standard data mining approaches can detect surprising and possibly inappropriate uses of information.
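The analysis step might look something like the following sketch. It assumes a crawler has already produced sightings of the form (timestamp, mark identifier, host), which is a format I am inventing for illustration, along with hypothetical host names and a hypothetical list of authorized hosts. It rebuilds the path each marked object took and flags any hop that lands outside the authorized set. A real system would likely use richer statistical methods to decide what counts as surprising.

```python
from collections import defaultdict


def build_paths(sightings):
    """Order sightings by time and return {mark_id: [host, host, ...]}."""
    paths = defaultdict(list)
    for _timestamp, mark, host in sorted(sightings):
        # Record a host only when the object actually moved.
        if not paths[mark] or paths[mark][-1] != host:
            paths[mark].append(host)
    return dict(paths)


def flag_hops(paths, authorized):
    """Return (mark_id, source, destination) for hops to unauthorized hosts."""
    flagged = []
    for mark, hosts in paths.items():
        for src, dst in zip(hosts, hosts[1:]):
            if dst not in authorized:
                flagged.append((mark, src, dst))
    return flagged


# Illustrative data only: sightings arrive out of order from the crawler.
sightings = [
    (3, "M1", "unclassified-pc"),
    (1, "M1", "lab-server"),
    (2, "M1", "lab-workstation"),
    (1, "M2", "lab-server"),
    (2, "M2", "lab-server"),
]
authorized = {"lab-server", "lab-workstation"}

paths = build_paths(sightings)
flagged = flag_hops(paths, authorized)
print(paths)
print(flagged)
```

The flagged hop gives an investigator a specific, time-ordered record of where a marked object went, which is closer to the physical form of evidence discussed earlier than a pattern of circumstantial observations.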

There is already substantial effort underway in each of these proposed technologies. Implemented together as a solution, they can ensure that the competitive asset provided by data and its use remains protected. In some cases, it may be a matter of national security.

---

Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.

For more information, see http://www.virtualgold.com

