A LOST ART: CATEGORIZING YOUR DATA
By Zak Pines
Before data can be mined, it must be collected or acquired. But you know that already. What comes next is another vital step, one that may not be given the attention it deserves: the data must be prepared for analysis.
An essential part of data preparation is attribute categorization, an important and often overlooked piece of the data mining puzzle. If attributes are improperly categorized, the data mining analyst may miss out on significant results. Categorizing attributes allows the data miner to find patterns across groups of values rather than individual numbers. The goal of any data miner is to find meaningful results, and properly categorizing attributes helps to do so.
A standard categorization groups numeric values, whether they are dollar values in stock data or the measured speeds of baseball pitches. In the most extreme case, if every value of a given attribute in a data set is different, and the data is not categorized, then it would be impossible for the program to find any interesting results. There cannot be patterns for a given attribute if that attribute does not have any common values.
Categorizing can be especially effective with data mining, because one of the uses of data mining is to help predict future patterns. The predictions will be somewhat general (an analyst could never predict, say, an exact dollar return on a stock), and categorizing will serve to help the analyst identify general trends in the data.
A traditional method of categorizing data is to subtract the lowest number in the data from the highest, divide this range by the number of categories (10 or 20 is standard), and then build categories of that fixed width. So, if data values ranged from 20 to 220, and the analyst wanted 10 categories, each category would span 20 units: 20-40, 40-60, and so on. But this is not always the best categorization technique.
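The arithmetic behind this traditional method can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up for the example, and the rule that a value sitting exactly on a shared boundary (such as 40) goes to the upper category is an assumption, since the method as described does not say.

```python
def equal_width_edges(values, num_categories):
    """Return the category boundaries for fixed-width categories."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / num_categories
    return [lowest + i * width for i in range(num_categories + 1)]


def assign_category(value, edges):
    """Return the index (0-based) of the category that contains value."""
    last = len(edges) - 2  # index of the final category
    for i in range(last + 1):
        # Assumption: a value on a shared boundary goes to the upper category.
        if value < edges[i + 1]:
            return i
    # The highest value in the data stays in the final category.
    return last


# Hypothetical sample data ranging from 20 to 220, as in the example above.
values = [20, 35, 40, 118, 219, 220]
edges = equal_width_edges(values, 10)
categories = [assign_category(v, edges) for v in values]

print(edges)
print(categories)
```

With these sample values the boundaries fall at 20, 40, 60 and so on up to 220, matching the categories described above.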
There are five factors that an analyst may consider when categorizing data. Each of them affects the data mining results in a different way.
The first factor is the number of categories. Too many categories can make the data overly specific and by doing so the data miner may miss out on general patterns. Likewise, too few categories can make the data overly general and the data miner may miss out on specific patterns. A balance must be found here. Depending on the type of data and the domain, the number of categories can vary from as few as two to as many as 20.
A second factor is the distribution of the data set into the categories. One method of data categorization distributes the data evenly among the categories. For example, if there are four categories, each category should contain 25 percent of the data. If a category holds significantly less than 25 percent of the data, there is less chance that a data mining program will identify patterns in that category as interesting, because the category has so little impact when compared to the entire data set.
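An even distribution can be sketched by ranking the values and cutting the ranking into equal-sized pieces. The Python below is one possible approach under simple assumptions, not a prescribed method: the function name and the sample figures are invented for the illustration, and identical values may end up split between two neighboring categories.

```python
def equal_frequency_categories(values, num_categories):
    """Assign each value a category so that categories hold (nearly) equal counts."""
    # Positions of the values, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    categories = [0] * len(values)
    for rank, index in enumerate(order):
        # The lowest-ranked share of the data goes to category 0, the next
        # share to category 1, and so on. Note: tied values can be split.
        categories[index] = rank * num_categories // len(values)
    return categories


# Hypothetical sample data: eight values placed into four categories.
incomes = [12, 95, 33, 48, 27, 150, 61, 19]
categories = equal_frequency_categories(incomes, 4)
counts = [categories.count(c) for c in range(4)]

print(categories)
print(counts)
```

Here each of the four categories receives two of the eight values, or 25 percent of the data. The price is that the categories no longer have a consistent range.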
A third factor is the range of the categories. If a range of a category is too big, it may contain data that should not logically be grouped together. For example, if I am doing a study on family income, I would not want to group together families with incomes between $25,000 and $1,000,000, because families within this grouping have very little in common due to the large range of the category.
A fourth factor is the consistency of the range of the categories. Keeping the range of the categories consistent helps keep the analysis of the attribute well organized. An example of this would be grouping yearly salaries in categories $0-20,000, $20,000-40,000, $40,000-60,000, etc.
The fifth and final factor to consider when categorizing data is logical breaks between categories. An example of this would be tax brackets. If a break between tax brackets occurs at $70,000 and an analyst is looking for trends that could be affected by taxes, it would not make sense to categorize salaries $40,000-60,000 and $60,000-80,000. It would make sense to have the breaks between categories occur at the same dollar values where there are breaks between the tax brackets. No matter what other factors are considered when categorizing data, the user should always have logical breaks between categories, because illogical breaks will make the results difficult to interpret.
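Logical breaks amount to supplying the break points by hand instead of computing them from the data. A minimal Python sketch follows. Only the $70,000 break comes from the example above; the other break points, the labels, and the function name are hypothetical, and a salary exactly on a break is assumed to belong to the higher category.

```python
from bisect import bisect_right

# Hypothetical break points; only the 70,000 break is taken from the text.
breaks = [30_000, 70_000, 150_000]
labels = ["under 30k", "30k-70k", "70k-150k", "150k and over"]


def categorize(salary):
    """Return the label of the category that contains salary."""
    # bisect_right counts how many break points are at or below the salary,
    # so a salary exactly on a break falls into the higher category.
    return labels[bisect_right(breaks, salary)]


for salary in (65_000, 70_000, 75_000):
    print(salary, categorize(salary))
```

Salaries of $65,000 and $75,000 land in different categories here, whereas a $60,000-80,000 category would have grouped them together across the break.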
It is nearly impossible to satisfy all five of the above conditions when categorizing data, so the data miner must decide which factor he or she considers most important. The solution will vary depending on the analysis or the domain, but understanding the tradeoffs between the different factors will help the data miner make the best decision. Usually, the miner should first choose the number of categories, and then decide which matters more, distributing the data evenly into categories or keeping the range of the categories consistent, while maintaining logical breaks between the categories.
These categorization steps will prove to be worthwhile in the data mining process. A data set with properly categorized attributes will return the most meaningful results to the user. And, in data mining, that's the bottom line.
Zak Pines is a member of the Research and Development team of Virtual Gold, Inc., a company specializing in data mining technology and services. He is one of the lead developers of Advanced Scout, a data mining program used extensively by coaches of the National Basketball Association to devise new strategies based on the automatic identification of hidden patterns in game data and video. He is also involved in developing end-to-end data mining solutions for other sports and in various other industry domains.
Prior to joining Virtual Gold, he worked on the Advanced Scout project at the IBM T.J. Watson Research Center (Hawthorne, NY), from 1995 to 1997.
He is scheduled to graduate from Yale University in 2000 with a B.A. in economics. He is a sports editor and columnist for the Yale Daily News (http://www.yaledailynews.com), and he is the director of sports programming and a sports broadcaster for WYBC-1340 AM, New Haven, CT (http://www.wybc.com).
For more information, see http://www.virtualgold.com.