AxIS LogMiner - a software tool for data preprocessing for Intersites Web Usage Mining
With more than 3.3 billion documents online and 20 million new Web pages published each day, the World Wide Web should soon become the main source of information. With so much on offer, it is becoming increasingly difficult for Web site publishers to attract and keep users on their Web sites. However, designing popular Web sites is not possible without understanding the needs of their users. Therefore, analyzing users' behavior (recorded in the Web log files) is an important task when designing Web sites.
Web Usage Mining (WUM) is the application of Data Mining (DM) procedures
to the analysis of user access to a Web site. Like any KDD process,
WUM contains three main steps: preprocessing, knowledge extraction and results
analysis. In this paper we focus on the first step, data preprocessing in WUM,
which is a tedious and complex process (according to one poll, two thirds
of DM analysts consider that the time spent on data cleaning and preparation
represents more than 60% of the total analysis time).
The objectives of the analyst are to determine the exact list of users who
accessed the Web site and to reconstruct the sequence of actions performed
on the Web site by each user (called a ``user session''). To do this, the analyst
may use only the log files and, possibly, the site map. In the first step
several tasks have to be completed: data preparation, data cleaning,
data transformation and data reduction. Irrelevant or noisy data (e.g.
requests from Web robots) must also be excluded from the processed log file.
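The robot-exclusion part of the cleaning task can be sketched as follows. This is a minimal illustration, not LogMiner's actual implementation: it assumes the log is in the Combined Log Format, and the names `parse_line`, `clean` and `ROBOT_HINTS`, as well as the two heuristics used (a request for `/robots.txt`, or a robot-like user agent string), are assumptions introduced here.

```python
import re

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings that mark a user agent as a robot.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def clean(lines):
    """Drop malformed lines and every request issued by a presumed robot."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    # Heuristic 1: any host that requested /robots.txt is treated as a robot.
    robot_hosts = {e["host"] for e in entries if e["url"] == "/robots.txt"}

    def is_robot(entry):
        # Heuristic 2: the user agent string contains a robot-like substring.
        agent = entry["agent"].lower()
        return entry["host"] in robot_hosts or any(h in agent for h in ROBOT_HINTS)

    return [e for e in entries if not is_robot(e)]
```

Both heuristics are deliberately simple; robots that hide their identity and never request `/robots.txt` would pass through this filter.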
When a WUM process is jointly applied to the Web logs from several Web sites,
generally belonging to the same organization, we call this process
Intersites WUM.
Today, a large organization may have several Web servers for its Web
sites. For example, INRIA (the French National Institute for Research
in Computer Science and Control) has one main Web server for http://www.inria.fr/,
one server for each of its six research units in France (as of December 2003;
see http://www.inria.fr/inria/unites.en.html for the complete list),
and other servers, e.g. for the
Web sites' search engines or the intranet. A user navigates through all these
servers transparently, as the pages from different Web servers are
strongly interlinked. Visitors are unlikely even to notice
that the Web server has changed (to see this, they must look at the address
displayed in the address bar). However, for the WUM analyst this change is
very important. Since users are looking for a specific piece of information,
their complete visit consists of all the log entries found in the
various log files (because each Web server has its own log file).
The analyst has to reassemble the path followed by the user through the different
servers. Our solution is to join all these log files and to reconstruct this
visit.
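Joining the log files and rebuilding cross-server visits can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method as implemented in LogMiner: a user is identified by the pair (IP address, user agent), the server clocks are assumed to be synchronized, and a visit is assumed to end after 30 minutes of inactivity. The function names and the `VISIT_TIMEOUT` value are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold that separates two visits of the same user.
VISIT_TIMEOUT = timedelta(minutes=30)


def merge_logs(logs_by_server):
    """Join per-server logs into one list sorted by time.

    logs_by_server maps a server name to a list of entries, each a dict
    with the keys "ip", "agent", "time" (datetime) and "url".
    """
    merged = [
        {**entry, "server": server}
        for server, entries in logs_by_server.items()
        for entry in entries
    ]
    merged.sort(key=lambda entry: entry["time"])
    return merged


def build_visits(merged):
    """Group time-sorted requests by user and split them into visits."""
    by_user = {}
    for entry in merged:
        by_user.setdefault((entry["ip"], entry["agent"]), []).append(entry)

    visits = []
    for user, requests in by_user.items():
        current = [requests[0]]
        for previous, following in zip(requests, requests[1:]):
            # A long pause closes the current visit and opens a new one.
            if following["time"] - previous["time"] > VISIT_TIMEOUT:
                visits.append((user, current))
                current = []
            current.append(following)
        visits.append((user, current))
    return visits
```

Because the requests are merged before they are grouped, a visit that moves from one server to another stays in one piece instead of being split per log file.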
The structured data that we extract from the log files is transferred to
a relational database. In the last step of our method we enrich the structured
data by means of data generalization and summarization operations. This
allows the DM analyst to focus only on the information of interest.
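One possible form of generalization and summarization is sketched below, purely as an illustration. It assumes that a requested URL is generalized to its host plus its first directory, and that summarization is a request count per generalized label; the text does not specify which operations are actually used, and the names `generalize_url` and `summarize` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def generalize_url(url):
    """Map a URL to a coarser label: host plus first directory (assumed rule)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # A page directly under the root has no directory; it maps to "host/".
    section = segments[0] if len(segments) > 1 else ""
    return f"{parts.netloc}/{section}"


def summarize(urls):
    """Count the requests falling under each generalized label."""
    return Counter(generalize_url(url) for url in urls)


# Hypothetical requested URLs, for illustration only.
requested = [
    "http://example.org/research/teams.html",
    "http://example.org/research/projects.html",
    "http://example.org/index.html",
]
summary = summarize(requested)
```

With such labels the analyst can work at the level of site sections rather than individual pages.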
In accordance with the work of the W3C on Web Characterization Terminology, we reformulate the definitions of the main WUM terms used in this document and propose new definitions for the visit, episode and Web server log file.
Within our project AxIS, we developed LogMiner, a software tool for analyzing users’ access to a Web site. This analysis is based on the HTTP access log files of the Web server hosting the site to be analyzed.
The analysis of users’ access to a Web site consists of the following:
Starting from the raw HTTP log files of the Web server and using LogMiner, the analyst obtains a preprocessed log file in a specified format which contains rich information about users’ requests (user, IP address, user agent, date, time, duration, HTTP request status, requested URL, referrer URL). User requests are also grouped into sessions, visits and episodes. Each line of this file identifies a user request and contains information about the session, visit and episode to which the request belongs.
In order to reduce data redundancy, to obtain statistics about sessions, navigation and episodes, and to make these data easier for third parties to use, the WUM analyst can use LogMiner to export the data from the preprocessed log file to a relational database.
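The following sketch illustrates how moving preprocessed requests into relational tables reduces redundancy and makes statistics easy to compute. It is not LogMiner's actual schema or export code: the two-table layout, the column names and the functions `export` and `requests_per_visit` are assumptions made for this example, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical schema: each (ip, agent) pair is stored once in "users"
# instead of being repeated on every request line.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    ip TEXT NOT NULL,
    agent TEXT NOT NULL,
    UNIQUE (ip, agent)
);
CREATE TABLE requests (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    visit_no INTEGER NOT NULL,
    time TEXT NOT NULL,
    status INTEGER NOT NULL,
    url TEXT NOT NULL
);
"""


def export(conn, rows):
    """Load rows of (ip, agent, visit_no, time, status, url) into the database."""
    conn.executescript(SCHEMA)
    for ip, agent, visit_no, time, status, url in rows:
        conn.execute(
            "INSERT OR IGNORE INTO users (ip, agent) VALUES (?, ?)", (ip, agent)
        )
        (user_id,) = conn.execute(
            "SELECT id FROM users WHERE ip = ? AND agent = ?", (ip, agent)
        ).fetchone()
        conn.execute(
            "INSERT INTO requests (user_id, visit_no, time, status, url) "
            "VALUES (?, ?, ?, ?, ?)",
            (user_id, visit_no, time, status, url),
        )
    conn.commit()


def requests_per_visit(conn):
    """Return (user_id, visit_no, request count) for every visit."""
    return conn.execute(
        "SELECT user_id, visit_no, COUNT(*) FROM requests "
        "GROUP BY user_id, visit_no ORDER BY user_id, visit_no"
    ).fetchall()
```

Once the data are in this form, a third party can query them with plain SQL without needing to know the preprocessed log file format.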
To store the log file in the relational database, the WUM analyst can either use the LogMiner graphical user interface or run LogMiner in command-line mode.