DATA MINING FOR ENVIRONMENTAL MANAGEMENT
by Amit Roy
It is evident from the articles that have appeared in previous issues of DS*STAR that data mining has begun to have an impact on several diverse areas of management in the commercial world -- ranging from industrial management to strategy in professional sports. This suggests that data mining is a general-purpose tool that could be profitably applied to non-commercial human endeavors as well. This article explores the potential applications of data mining to environmental management.
The last two decades have seen a surge in the general awareness of environmental issues, and in the United States this has been accompanied by legislation targeted at maintaining and improving the environment. The implications of environmental legislation are generally widespread, both physically and economically, and environmental monitoring drives much of the cost of regulatory compliance. Consequently, there are now a large number of environmental databases, many of which are in the public domain.
Typically, these databases are used to generate reports mandated by law. It is the thesis of this article that mining environmental databases could lead to the generation of paradigm-shifting hypotheses. This article shows that there is a tremendous amount of data pertaining to the environment, that complicated modeling and analysis are applied to such data to understand relationships and dependencies, and that the results influence government programs and legislation. Data mining may therefore uncover previously unknown patterns with significant implications.
One of the most far-reaching pieces of environmental legislation is the Clean Air Act, which specifies National Ambient Air Quality Standards (NAAQS) for six "criteria pollutants" (ozone, particulate matter, nitrogen dioxide, carbon monoxide, sulfur dioxide, and lead). Large sections of the most densely populated areas of the United States are in non-compliance with the Clean Air Act, and annually approximately 100 million people are potentially exposed to concentrations of criteria pollutants above the NAAQS. However, since the economic impact of regulation to reduce emissions of air pollutants and their precursors is likely to be wide-ranging, it is imperative that every alternative control strategy be examined.
It is in this context that data mining could provide valuable insights. At present, air pollution control strategies are devised using photochemical air quality simulation models (PAQSMs), which in turn are evaluated using air quality data collected by a national network of monitoring stations.
This air quality data is maintained in the Aerometric Information Retrieval System (AIRS), a national database of air quality data that is administered by the U.S. EPA. The monitored data include the criteria pollutants as well as a number of other chemical species known to be associated with the criteria pollutants.
The analysis of this data is challenging because of the spatio-temporal nature of the data, and the nonlinear relationships between many of the air pollutants. Emission control is a particularly contentious issue because studies with PAQSMs suggest that the air quality in one region is affected by atmospheric releases in other states, thousands of miles away.
To make matters worse, there is no clear consensus among scientists on the appropriate metric with which to measure the severity of air pollution -- one issue being debated pertains to the averaging time of the monitored concentrations. Long averaging times may not adequately protect the public from peaks in air pollutant concentrations, while short averaging times may not provide adequate protection against relatively low but persistent pollution.
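The averaging-time trade-off can be illustrated with a small sketch. The Python below is not drawn from AIRS or from any regulatory method; the hourly values, the site names, and the helper functions `rolling_means` and `peak_average` are hypothetical, chosen only to show how the same two monitoring records can rank differently depending on the averaging window.

```python
def rolling_means(values, window):
    """Return the mean of every full run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    means = [total / window]
    # Slide the window one sample at a time, updating the running total.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


def peak_average(values, window):
    """Highest rolling mean for the given averaging time (in samples)."""
    return max(rolling_means(values, window))


# Hypothetical hourly concentrations in arbitrary units (not real data).
# Site A: generally clean air with one sharp two-hour spike.
site_a = [20.0] * 24
site_a[14] = 200.0
site_a[15] = 200.0
# Site B: no spike, but persistently elevated all day.
site_b = [60.0] * 24

for hours in (1, 8, 24):
    print(hours, peak_average(site_a, hours), peak_average(site_b, hours))
```

With these made-up numbers, site A has the higher peak under a 1-hour average (200 versus 60) but the lower one under a 24-hour average (35 versus 60), so the choice of averaging time alone determines which site appears more polluted.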
There are numerous other environmental databases to which data mining could be applied.
The National Health and Nutrition Examination Survey (NHANES) is designed to gather information on the health and nutritional status of the population of the United States. There have been three NHANES surveys thus far, the most recent of which was conducted over the period 1988-95 and involved interviews and physical examinations of approximately 34,000 individuals.
In the future, NHANES will be conducted on an annual basis so that health and nutritional trends can be discerned in a more timely manner.
The National Status and Trends (NS&T) Program, administered by the National Oceanic and Atmospheric Administration (NOAA), monitors spatial and temporal trends of chemical contamination in coastal sediments on a national scale. The Toxics Release Inventory (TRI) is a national database containing the magnitude and location of environmental releases of toxic chemicals. The TRI documents major reported environmental releases of approximately 650 chemical contaminants.
The Environmental Monitoring and Assessment Program (EMAP) is a research program administered by the U.S. EPA to assess trends in ecological resources.
Beyond applying data mining to each of these databases individually, there is the possibility of mining across them. It is hoped that it will not be long before data mining technology is brought to bear on some of the many challenging problems alluded to in this article.
---
For more information, see http://www.virtualgold.com.