Pretreatment of web log files
Abstract
Pretreatment is often the most laborious and time-consuming stage of web data analysis, mainly because the raw data lack structure and contain a large amount of noise. Pretreating web log files means cleaning and organizing the data they contain to prepare them for later analysis. Because web log files are usually plain text, one objective of the pretreatment step is to transfer the data into an environment that is easier to use (e.g. a database).
In this paper we first present the different formats of web log files, then the pretreatment methods we used: cleaning out web robot queries, removing queries relating to scripts (.js, .css, .swf), and identifying users, sessions and visits.
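The cleaning and identification steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the log is in the Apache Combined Log Format, identifies a user by the (host, user agent) pair, detects robots with a simple keyword heuristic on the user agent plus requests for /robots.txt, and starts a new session after 30 minutes of inactivity. All names (`parse_line`, `is_noise`, `sessionize`, `preprocess`), the keyword list, and the timeout value are hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed input: Combined Log Format
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

IGNORED_EXTENSIONS = (".js", ".css", ".swf")  # extensions named in the abstract
ROBOT_HINTS = ("bot", "crawler", "spider")    # assumed keyword heuristic
SESSION_TIMEOUT = timedelta(minutes=30)       # assumed inactivity threshold


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["agent"] = record["agent"] or ""  # agent is absent in the plain Common Log Format
    return record


def is_noise(record):
    """True for requests to ignored resource types and for likely robots."""
    path = record["url"].split("?", 1)[0].lower()
    if path.endswith(IGNORED_EXTENSIONS) or path == "/robots.txt":
        return True
    agent = record["agent"].lower()
    return any(hint in agent for hint in ROBOT_HINTS)


def sessionize(records):
    """Group records per user, splitting on gaps longer than SESSION_TIMEOUT."""
    sessions = {}
    for record in sorted(records, key=lambda r: r["time"]):
        user = (record["host"], record["agent"])
        user_sessions = sessions.setdefault(user, [])
        last = user_sessions[-1][-1] if user_sessions else None
        if last is not None and record["time"] - last["time"] <= SESSION_TIMEOUT:
            user_sessions[-1].append(record)
        else:
            user_sessions.append([record])
    return sessions


def preprocess(lines):
    """Parse, clean, and sessionize an iterable of raw log lines."""
    parsed = (parse_line(line) for line in lines)
    clean = [r for r in parsed if r is not None and not is_noise(r)]
    return sessionize(clean)
```

Calling `preprocess(open("access.log"))` on a hypothetical log file would return a dictionary that maps each (host, user agent) pair to a list of sessions, each session being a time-ordered list of request records. Loading those records into a database, as the abstract suggests, would be a separate step.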
Copyright (c) 2015 Journal of Information Sciences and Computing Technologies
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
ISSN: 2394-9066