AxIS LogMiner - a software tool for data preprocessing for Intersites Web Usage Mining


Background and Motivations | Definitions | AxIS LogMiner application

Background and Motivations

With more than 3.3 billion documents online and 20 million new Web pages published each day, the World Wide Web should soon become the main source of information. With so much on offer, it is becoming increasingly difficult for Web site publishers to attract and keep users on their Web sites. However, designing popular Web sites is not possible without understanding the needs of their users. Therefore, analyzing users' behavior (as recorded in the Web log files) is an important task when designing Web sites.

Web Usage Mining (WUM) consists of applying Data Mining (DM) procedures to analyze user access to a Web site. Like any KDD process, WUM comprises three main steps: preprocessing, knowledge extraction and results analysis. In this paper we focus on the first step, data preprocessing in WUM, which is a tedious and complex process (according to one poll, two thirds of DM analysts consider that the time spent on data cleaning and preparation represents more than 60% of the total analysis time).
The objectives of the analyst are to determine the exact list of users who accessed the Web site and to reconstruct the sequence of actions performed on the Web site by each user (called a ``user session''). To do this, the analyst may use only the log files and, possibly, the site map. In this first step several tasks have to be completed, such as data preparation, data cleaning, data transformation and data reduction. Irrelevant or noisy data (e.g. requests from Web robots) must also be excluded from the processed log file.
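As an illustration of the cleaning task described above, here is a minimal sketch that parses log lines and drops robot requests. It is not LogMiner's implementation: it assumes the logs are in the Apache Combined Log Format, and `LOG_PATTERN`, `ROBOT_HINTS`, `parse_line`, `is_robot` and `clean` are hypothetical names introduced only for this example. The robot test is a crude heuristic (a request for `/robots.txt` or a telltale word in the user agent).

```python
import re
from datetime import datetime

# Assumption: Apache Combined Log Format, i.e.
# host ident user [time] "request" status size "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical heuristic: substrings that often appear in robot user agents.
ROBOT_HINTS = ("bot", "crawler", "spider")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry


def is_robot(entry):
    """Flag requests for /robots.txt or with a robot-like user agent."""
    agent = (entry.get("agent") or "").lower()
    return entry["url"] == "/robots.txt" or any(h in agent for h in ROBOT_HINTS)


def clean(lines):
    """Parse all lines, dropping unparsable entries and robot requests."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and not is_robot(e)]
```

A real preprocessing step would need a more careful robot list and would typically also discard requests for embedded resources such as images; both are left out of this sketch.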

When a WUM process is applied jointly to the Web logs of several Web sites, generally belonging to the same organization, we call it Intersites WUM. Today, a large organization may have several Web servers for its Web sites. For example, INRIA (the French National Institute for Research in Computer Science and Control) has one main Web server for http://www.inria.fr/, one server for each of its six research units in France (as of December 2003; see http://www.inria.fr/inria/unites.en.html for the complete list), and other servers, e.g. for the Web sites' search engines or the intranet. A user navigates through all these servers transparently, because the pages on the different Web servers are strongly interlinked. Visitors may well not even notice that the Web server has changed (to see this, they must look at the address displayed in the address bar). For the WUM analyst, however, this change is very important. Since users are looking for a specific piece of information, their complete visit consists of all the log entries found in the various log files (there is only one log file per Web server). The analyst has to reassemble the path followed by the user across the different servers. Our solution is to join all these log files and reconstruct this visit.
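The idea of joining the log files can be sketched as a chronological merge followed by grouping per user. This is only an illustration under stated assumptions, not the method used by LogMiner: each server's log is assumed to be already sorted by time, each entry is assumed to be a dict with `time`, `host` and `agent` keys, and a user is approximated by the (IP address, user agent) pair. The names `merge_logs` and `group_by_user` are hypothetical.

```python
import heapq
from collections import defaultdict


def merge_logs(*logs):
    """Merge per-server entry lists (each sorted by time) into one ordered stream."""
    return heapq.merge(*logs, key=lambda entry: entry["time"])


def group_by_user(entries):
    """Group entries by (host, agent); this pair stands in for a user (assumption)."""
    paths = defaultdict(list)
    for entry in entries:
        paths[(entry["host"], entry["agent"])].append(entry)
    return dict(paths)
```

Because the merged stream is ordered by time, each user's list preserves the order in which that user moved from one server to another.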
The structured data that we extract from the log files is transferred to a relational database. In the last step of our method, we enrich the structured data by means of data generalization and summarization operations. This allows the DM analyst to focus only on the information of interest.

Definitions

In accordance with the W3C's work on Web Characterization Terminology, we restate the definitions of the main WUM terms used in this document and propose new definitions for the visit, the episode and the Web server log file.

Resource
according to the W3C's URI (Uniform Resource Identifier) specification, a resource R can be ``anything that has identity'' (W3C). Examples include an HTML file, an image and, more recently, a Web Service.
Web Resource
a resource accessible through any version of the HTTP protocol (e.g. HTTP/1.1, HTTP-NG)
Web Server
the server that provides access to Web resources
Web Page
the set of data consisting of one or several Web resources that can be identified by a URI. If the Web page is made up of n resources, then the first n-1 are embedded and the n-th URI is the identifier of the Web page.
Page View
it occurs at a specific moment in time, when a Web page is displayed in a Web browser.
Web Browser or Web Client
a client (software) capable of sending Web requests, handling the responses and displaying the requested URIs
User
a person using the Web browser
Web Request
a request made by a Web client for a Web resource. It can be explicit (initiated by the user), or implicit (initiated by the Web client). Another differentiation is: embedded Web request (a request made following a link) or user-input Web request (a request manually initiated by the user, e.g. by typing the address in the address bar, selecting the address from the bookmarks, history, etc.).
User Session
a delimited set of a user's Web requests (embedded or user-input, also called clicks), across one or more Web servers.
Visit
a subset of consecutive page views from a user session that occur closely enough together (as measured by a time threshold or a semantic distance between pages).
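The time-threshold variant of the visit definition can be sketched as follows. This is a minimal illustration, not LogMiner's code: page views are reduced to their timestamps, the function name `split_into_visits` is hypothetical, and the 30-minute default is an assumption chosen for the example, since the text above does not fix a value.

```python
from datetime import datetime, timedelta


def split_into_visits(page_views, threshold=timedelta(minutes=30)):
    """Split one session's page-view timestamps into visits.

    A new visit starts whenever the gap since the previous page view
    exceeds `threshold` (the 30-minute default is an assumption).
    """
    visits = []
    for timestamp in sorted(page_views):
        if visits and timestamp - visits[-1][-1] <= threshold:
            visits[-1].append(timestamp)   # close enough: same visit
        else:
            visits.append([timestamp])     # gap too large: new visit
    return visits


session = [
    datetime(2003, 12, 10, 10, 0),
    datetime(2003, 12, 10, 10, 10),
    datetime(2003, 12, 10, 11, 0),
]
visits = split_into_visits(session)
```

Here the first two page views are 10 minutes apart and fall into one visit, while the third comes 50 minutes later and opens a second one. A semantic distance between pages would require a different criterion and is not covered by this sketch.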

AxIS LogMiner application

Within our AxIS project, we developed LogMiner - a software tool for analyzing users’ access to a Web site. The analysis is based on the HTTP access log files of the Web server hosting the site under study.

The analysis of users’ access to the Web site consists of the following:

Starting from the raw HTTP log files of the Web server and using LogMiner, the analyst obtains a preprocessed log file in a specified format, which contains rich information about users’ requests (user, IP address, user agent, date, time, duration, HTTP request status, requested URL, referrer URL). User requests are also grouped into sessions, visits and episodes. Each line of this file identifies one user request and records the session, visit and episode to which the request belongs.


LogMiner Work Flow

To reduce data redundancy, to obtain statistics about sessions, navigation and episodes, and to make the data easier for third parties to use, the WUM analyst can use LogMiner to export the data from the preprocessed log file to a relational database.


LogMiner Database Structure
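To give a feel for what such an export might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The actual LogMiner database structure is not reproduced in this text, so the two tables below (`sessions` and `requests`), their columns and the `export` function are purely hypothetical. The input is assumed to be a dict mapping a (host, agent) pair to that user's list of request dicts.

```python
import sqlite3

# Hypothetical schema: NOT the real LogMiner database structure.
SCHEMA = """
CREATE TABLE sessions (
    id    INTEGER PRIMARY KEY,
    host  TEXT,
    agent TEXT
);
CREATE TABLE requests (
    id         INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES sessions(id),
    time       TEXT,
    url        TEXT,
    status     INTEGER
);
"""


def export(conn, sessions):
    """Store each session once and link its requests to it by session_id."""
    conn.executescript(SCHEMA)
    for (host, agent), requests in sessions.items():
        cursor = conn.execute(
            "INSERT INTO sessions (host, agent) VALUES (?, ?)", (host, agent)
        )
        conn.executemany(
            "INSERT INTO requests (session_id, time, url, status) VALUES (?, ?, ?, ?)",
            [(cursor.lastrowid, r["time"], r["url"], r["status"]) for r in requests],
        )
    conn.commit()
```

Storing the host and user agent once per session, instead of on every request line, is one way a relational layout reduces redundancy. Statistics then become simple queries, for example `SELECT session_id, COUNT(*) FROM requests GROUP BY session_id`.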

To store the log file in a relational database, the WUM analyst can either use the LogMiner graphical user interface or run LogMiner in command-line mode.


LogMiner GUI

