Machine learning based web page classifier
Date of Issue: 2016-05-24
School of Electrical and Electronic Engineering
In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is accessed for a wide range of purposes and continues to grow rapidly. In 2015, WorldWideWebSize estimated that there were around 50 billion web pages on the Internet. Web directories, such as DMOZ (directory.mozilla.org) and Hotfrog, classify web pages into a set of categories to assist Internet users and search engines such as Google. Search engines are known to use web directories to find and rank web pages for certain keywords. The largest web directory, DMOZ, is human-edited and lists around 4 million web pages. Most web directories hire web experts to classify web pages into categories, an approach that is not effective given the rate at which the Internet is growing. Hence, to improve effectiveness and automate web categorization, methods from machine learning and data mining have been researched to categorize web pages automatically. In this project, the features used by the classifier are all related to the HTML structure of the web pages. The most common HTML tags, metadata, and images are extracted from the HTML document. The classifiers used are a Neural Network for Pattern Recognition and a Support Vector Machine. Four classes of web pages are chosen for this project: Online Store, Internet Forum, News Article, and Blog Article. The web pages are collected manually through Google Search. The final application of this project is intended to classify a web page given its URL as input.
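As an illustration of the HTML-structure features described in the abstract, the following is a minimal sketch of how tag counts might be turned into a feature vector. It is not the project's actual implementation: the tag list `TAGS` and the names `TagCounter` and `extract_features` are assumptions made for this example, and the abstract does not state which tags or which encoding were used.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag list: the abstract only says "most common HTML tags".
TAGS = ("a", "div", "img", "p", "meta", "form", "li")


class TagCounter(HTMLParser):
    """Counts how many times each HTML start tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing tags also land here.
        self.counts[tag] += 1


def extract_features(html):
    """Return one count per entry in TAGS, in the same order."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[tag] for tag in TAGS]


sample = (
    '<html><head><meta charset="utf-8"></head><body><div><p>Hi</p>'
    '<a href="#">x</a><img src="a.png"><img src="b.png"/></div></body></html>'
)
features = extract_features(sample)
print(dict(zip(TAGS, features)))
```

A vector of this kind, computed for each collected page, could then be passed to a classifier such as a neural network or a support vector machine. Fetching the HTML from a URL is left out of the sketch.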
Final Year Project (FYP)
Nanyang Technological University