DATA MINING AND TEENAGE WEB SURFING
By Ed Colet
It's always interesting to discover surprising patterns - whether through data mining or traditional research studies. In this week's column I report on some surprising patterns in the web surfing habits of teenagers, and suggest that data mining can help both to understand these usage patterns and to establish responsible guidelines and controls.
In the past year, it has become widely recognized that the activities and habits of teenagers deserve attention. A better understanding of teenagers may help prevent tragic events like the one at Columbine HS in Colorado. But the interest in teenagers is also driven by the fact that retailers and advertisers have realized that today's teenagers have significant purchasing power. Catering to this demographic group may require understanding them and developing unique strategies to target them. Witness the fact that TV networks are even designing new shows around teen issues and frantically seeking new actors and actresses to fill these roles (NY Times magazine, 9/5/1999).
As for what teenagers are doing online, some surprising results were reported in "Business Wire", 9/7/99. A telephone survey jointly conducted by Websense, Inc. and Yankelovich Partners over the last weeks of July and the first weeks of August found some unexpected facts regarding teenagers' activity online. Although the data "don't point to a crisis", there was a disturbing "reality gap" between what parents think their children are doing online and what they're actually doing online. The study reported that teens spend more than one hour online per day, and over 10 hours per week. According to the study, "58% of teens have accessed an objectionable website, 39% have seen sites featuring offensive music, 25% have seen sites featuring sexual content, 20% have seen sites featuring violence." All of this despite the fact that 95% and 86% of teens said they have rules about Internet usage at school and at home, respectively. Clearly, rules are meant to be broken?
Perhaps reassuringly, the study also reports that about 90% of teens and parents favor the use of filtering software to restrict access to web sites - yet only 17% and 14% of teens said that filtering software at school and at home, respectively, had prevented them from viewing an objectionable web site. Apparently, filtering software does not work as well as one would hope.
Relying solely on filtering software to prevent access to offensive web sites will have limited success because of the way filters are currently designed and implemented. Essentially, a filter is a list of offensive sites/URLs that is stored by the browser. Page requests are handled much as they would be by a proxy server: a request to connect to a site that appears on the list is not passed on to the server for retrieval, and only requests for sites that are not on the list are passed on and their pages retrieved. Therein lies the problem: newly created sites are not likely to be on the list simply because they are new and have yet to be accessed. They can only be added to the list at a future point in time - once it's been discovered that the site is being frequently accessed. This discovery can be made by mining web logs for usage patterns. As such, data mining that identifies and blocks offensive sites only after a pattern of access has been discovered is reactive, much like closing the barn door after the horse has left.
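To make the limitation concrete, here is a minimal sketch of a list-based filter together with the log-mining step that feeds it. It is an illustration under stated assumptions, not a description of any particular filtering product: the host names, the BLOCKED_HOSTS list, the function names, and the min_hits threshold are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical blocklist; a real filter would ship a far larger list.
BLOCKED_HOSTS = {"offensive.example", "violent.example"}


def host_of(url):
    """Return the lowercase host name of a full URL ('' if none is found)."""
    return (urlparse(url).hostname or "").lower()


def is_allowed(url, blocked=BLOCKED_HOSTS):
    """Pass a request on only if its host is not on the list."""
    return host_of(url) not in blocked


def frequent_unlisted_hosts(log_urls, blocked=BLOCKED_HOSTS, min_hits=3):
    """Mine an access log for hosts that are visited often but not yet listed.

    The result is only a list of candidates: frequent access alone does not
    make a site offensive, so someone still has to judge each one.
    """
    counts = Counter(host_of(u) for u in log_urls)
    return [h for h, n in counts.most_common()
            if n >= min_hits and h not in blocked]
```

Note that a brand-new site passes is_allowed by default, and frequent_unlisted_hosts can only surface it after it has already been visited min_hits times, which is exactly the barn-door problem.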
An alternative method would be to develop proactive filters - and this too can be facilitated through data mining. Using techniques such as word-spotting, keyword searching, and text classification, it's possible to develop web crawlers that automatically search the 'Net for new sites and classify those that are likely to be offensive. These newly discovered offensive sites can then be added to the list of sites to be filtered. Thus, rather than waiting for these sites to appear in web logs, they are filtered before they can ever be accessed. In truth, this is probably easier said than done, because automated decisions (filter: Yes/No) made on the basis of purely automated text searching are bound to lead to some misclassifications, thereby blocking access to some truly educational and informational sites. It is also difficult to filter non-text content such as images, audio, and video. Perhaps a robust solution would be to have a human in the loop to review the list of newly discovered candidate sites, and then manually classify them and decide whether they should be filtered or not. Since it appears that students are actually in favor of filters, an easy way to implement this would be to have the human in the loop be a student?
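The proactive approach with a human in the loop can be sketched roughly as follows. This is a deliberately crude keyword scorer, assuming the crawler has already fetched each page's text; the keyword weights, the review_at threshold, and the function names are hypothetical, and a real system would likely use a trained text classifier instead.

```python
import re

# Hypothetical keyword weights, chosen only for illustration.
KEYWORD_WEIGHTS = {"violence": 2.0, "explicit": 3.0, "gore": 3.0}


def score_text(text, weights=KEYWORD_WEIGHTS):
    """Return the average keyword weight per word (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def triage(pages, review_at=0.1):
    """Split crawled pages into a human review queue and a pass list.

    `pages` maps URL -> page text. Nothing is blocked automatically:
    flagged sites are only queued for a person to classify.
    """
    review, passed = [], []
    for url, text in pages.items():
        if score_text(text) >= review_at:
            review.append(url)
        else:
            passed.append(url)
    return review, passed
```

With these weights, a page of text such as "a history lesson on violence in the civil war era for students" scores above the threshold and lands in the review queue alongside genuinely offensive pages. Sending flagged sites to a reviewer instead of blocking them outright is what keeps such educational pages from being filtered by mistake.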
So, there are some surprising patterns in teens' web surfing habits, some of which are potentially disturbing. But since adults and teens alike are in favor of reasonable controls and guidelines for web access, it seems entirely possible to use data mining technologies to improve the performance and role of filters and to develop reasonable controls and guidelines. Data mining can also shed light on the critically important issue of understanding when a web usage pattern is something to be truly worried about - what types of patterns are symptomatic of a teenager in crisis? And at this point, human intervention is critical.
Ed Colet is the Acting Director of Research at Virtual Gold Inc., responsible for developing analytical methods for data mining and for investigating human factors and usability issues of business intelligence systems. At present, he is in the final stage of completing a doctoral dissertation in the Cognition and Perception program at New York University's Department of Psychology. Ed has also worked for IBM Research at the T.J. Watson Research Center. At IBM, Ed was a member of the group that developed Advanced Scout, the data mining application for NBA teams. His research interests focus on statistical methods and human factors.
For more information, see http://www.virtualgold.com.