GOLDEN MEANS: A ROSE BY ANY OTHER NAME...
by Inderpal Bhandari, executive editor at large
Last week, I suggested that there was a link between data mining, data warehousing and digital libraries. Digital libraries are repositories of multimedia information. In this article, I would like to continue exploring that relationship.
First, some background on digital libraries. One of two mechanisms is used to retrieve information in digital libraries, depending on whether the contents of the library are annotated or not. Annotations are textual data. There is an annotation for every entry (an image, video or audio blob) in the digital library.
If the contents of the library are annotated, queries are run against the annotations rather than against the multimedia itself. Content whose annotations match the conditions specified in the query is retrieved.
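To make the annotation mechanism concrete, here is a minimal sketch, assuming each library entry carries a free-text annotation and that a query is simply a list of words that must all appear in it. The names Entry, LIBRARY and query_annotations, and the sample entries, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    blob_id: str     # hypothetical identifier for the stored blob
    kind: str        # "image", "video" or "audio"
    annotation: str  # textual annotation attached to the blob


# A toy library; the contents are invented for this example.
LIBRARY = [
    Entry("img-001", "image", "red rose in a garden"),
    Entry("img-002", "image", "yellow tulip in a vase"),
    Entry("vid-001", "video", "news broadcast about a rose festival"),
]


def query_annotations(library, terms):
    """Return entries whose annotation contains every query term."""
    wanted = [t.lower() for t in terms]
    hits = []
    for entry in library:
        words = entry.annotation.lower().split()
        if all(t in words for t in wanted):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    for entry in query_annotations(LIBRARY, ["rose"]):
        print(entry.blob_id, entry.kind)
```

Note that the query never looks at the image or video itself; it only sees the text that someone took the trouble to attach.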
Annotation is a manual process and, therefore, expensive. Sometimes, it is possible to pick up annotated data for free. For example, the closed caption text of a TV broadcast can be used to annotate a video recording of the broadcast. But usually there is a considerable expense associated with the annotation of multimedia data.
In order to avoid that expense, one can run the query against the multimedia content itself. Technically, this is quite a challenge. With textual data, it is straightforward to specify conditions of interest; that is not the case when querying audio, image or video data directly.
Querying by content is more akin to searching a space defined by a family of parameters that represent different aspects of the multimedia content. When the query is executed, its conditions are used to determine an acceptable range for every parameter in the space. A search is then done for multimedia blobs whose parameter values fall within those ranges, and the acceptable blobs are retrieved. The degree to which a blob satisfies the criteria of the query is usually reflected in the order of retrieval.
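The search just described can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular product: the parameter names ("redness", "ovalness") are hypothetical, parameter values are assumed to be plain numbers, and ranking a blob by how close its values sit to the middle of each acceptable range is only one arbitrary way of expressing the degree of match.

```python
def search_by_content(blobs, ranges):
    """Return ids of blobs whose parameters all fall inside the given ranges.

    blobs:  dict mapping blob id -> dict of parameter name -> numeric value
    ranges: dict mapping parameter name -> (low, high) acceptable range
    Results are ordered best match first (assumed scoring: closeness to
    the midpoint of each range, averaged over all parameters).
    """
    if not ranges:
        raise ValueError("the query must constrain at least one parameter")

    hits = []
    for blob_id, params in blobs.items():
        total = 0.0
        for name, (low, high) in ranges.items():
            value = params.get(name)
            if value is None or not (low <= value <= high):
                break  # one parameter out of range rejects the blob
            mid = (low + high) / 2
            half = (high - low) / 2
            total += (1.0 - abs(value - mid) / half) if half > 0 else 1.0
        else:
            # Reached only when no parameter was rejected.
            hits.append((total / len(ranges), blob_id))

    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [blob_id for _, blob_id in hits]


# Toy data; the values are invented for this example.
BLOBS = {
    "img1": {"redness": 0.9, "ovalness": 0.8},
    "img2": {"redness": 0.7, "ovalness": 0.65},
    "img3": {"redness": 0.2, "ovalness": 0.9},
}
QUERY = {"redness": (0.6, 1.0), "ovalness": (0.6, 1.0)}

if __name__ == "__main__":
    print(search_by_content(BLOBS, QUERY))
```

Even in this toy form, the difficulty is visible: the person writing the query has to think in terms of numeric ranges rather than in terms of what he or she actually wants to find.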
For example, in the case of images, several parameters will jointly define aspects such as color, texture and shape. Queries are expressed in terms of those aspects, e.g., "Find oval-shaped figures of the same size that are red and share a common point" may be an attempt to retrieve all images of roses in that library. Contrary to the popular saying, roses that can be referred to in this strange way will not smell as sweet as the real thing.
What about the possibility of mining the multimedia data in digital libraries? Given that simple query-by-content can be a thorny exercise, as evidenced by the example above, mining-by-content will be that much more difficult. But what if we were willing to swallow the cost of annotating the multimedia content in a digital library? Would that simplify the exercise of mining such data?
Yes, but only to a limited extent. Let's return to the rose example. Suppose the images of roses are suitably annotated, e.g., they carry labels that say "rose". Clearly, it will now be easy to find and retrieve such images. However, data mining must go beyond such retrieval. In its strict interpretation, data mining refers to a process based on analysis of data that leads one to some startling, counter-intuitive discovery of knowledge. How could one make such discoveries from the multimedia data?
By finding interesting roses. Which then raises the question, what is an interesting rose? In order to answer that question, one must devise a measure of interestingness for roses over and beyond the information that is available in the digital library. In other words, a model of analysis for rose data must be created.
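As a purely illustrative sketch of what such a measure might look like, suppose that, in addition to the "rose" label, each image came with a numeric measurement such as a petal count. That measurement is an assumption made for this example; it is exactly the kind of information the digital library does not hold. One simple (and by no means the only) measure of interestingness is how far a rose deviates from the average rose. The names ROSES and rank_by_interestingness are hypothetical.

```python
import statistics


def rank_by_interestingness(measurements):
    """Rank items by how unusual their measurement is.

    measurements: dict mapping item id -> numeric value (assumed here to be
    a petal count supplied from outside the digital library).
    Interestingness is taken to be the absolute z-score, i.e. the distance
    from the mean in units of standard deviation.
    """
    values = list(measurements.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # All items are identical, so none is more interesting than another.
        return [(item, 0.0) for item in measurements]
    scored = [(item, abs(v - mean) / spread) for item, v in measurements.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Invented petal counts for five annotated rose images.
ROSES = {"rose1": 30, "rose2": 32, "rose3": 31, "rose4": 29, "rose5": 60}

if __name__ == "__main__":
    for item, score in rank_by_interestingness(ROSES):
        print(item, round(score, 2))
```

The point of the sketch is less the arithmetic than where the data comes from: the measure needs inputs and a notion of "normal" that have to be designed in advance of the analysis.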
This suggests that the digital library is more analogous to an operational data repository than it is to a data warehouse. Unlike the data warehouse, it is not designed with analysis in mind. It follows that a separate construct, distinct from the digital library and analogous to the data warehouse, may be required before multimedia data can be mined effectively, a point not generally appreciated.
Interested in learning more about digital libraries, data warehouses and data mining? Contact us at http://www.virtualgold.com