Paper Title
Unsupervised Approach For Semi-Structured Data Record Extraction From Multiple Pages Using Tag Tree Similarities

Abstract
In this paper we present a novel unsupervised approach for extracting data records from multiple similar web pages using tag tree similarities. Extracting data records from multiple web pages consists of the following steps. We first identify the related web pages from the web source. Next, we construct the DOM tree for each related web page using an HTML parser. We then compare two or more web pages to eliminate unwanted regions such as the header, menu bar, navigation bar, and advertisements, and to find the region containing the data records, referred to as the data region. Finally, we traverse the subtrees of the data region to extract individual data records and store them in the required form, such as XML. The main contribution of this paper is a fully unsupervised algorithm for extracting both structured and semi-structured data records from multiple related web pages. The proposed system can precisely extract valuable data records from many commercial web sources. Hence, it can serve as a tool for integrating information from various commercial websites. This integrated information can then be used to provide value-added services such as comparative shopping, market intelligence, meta-querying, and search.

Keywords - Data Record Detection, Information Extraction, Semi-Structured Data, Wrapper Generation.
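
The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm of this paper: it assumes the related pages are already identified, and it replaces the tag tree similarity measure with a plain equality check on subtrees. All names (`Node`, `build_tree`, `find_data_region`, `extract_records`, `to_xml`, `make_page`) and the two sample pages are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    """One element of a simplified tag tree (hypothetical helper)."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""

class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML using the standard-library parser."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current               # close the nearest open element with this tag
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def build_tree(html):
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root

def content(node):
    """Serialize tags and text of a subtree so two subtrees can be compared."""
    inner = "".join(content(child) for child in node.children)
    return f"<{node.tag}>{node.text}{inner}</{node.tag}>"

def find_data_region(a, b):
    """Descend while exactly one child pair differs; identical subtrees are
    treated as template regions (header, menu, footer). Assumption: exact
    equality stands in for a real similarity measure."""
    if content(a) == content(b):
        return None
    if [c.tag for c in a.children] == [c.tag for c in b.children]:
        differing = [(x, y) for x, y in zip(a.children, b.children)
                     if content(x) != content(y)]
        if len(differing) == 1:
            return find_data_region(*differing[0])
    return a, b

def leaf_texts(node):
    texts = [node.text] if node.text else []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts

def extract_records(region):
    """Treat each child subtree of the data region as one data record."""
    return [leaf_texts(child) for child in region.children]

def to_xml(records):
    root = ET.Element("records")
    for fields in records:
        record = ET.SubElement(root, "record")
        for value in fields:
            ET.SubElement(record, "field").text = value
    return ET.tostring(root, encoding="unicode")

def make_page(items):
    """Hypothetical sample page: shared header and footer, varying item list."""
    rows = "".join(f"<li><b>{name}</b><i>{price}</i></li>" for name, price in items)
    return ("<html><body><div><a>Home</a><a>About</a></div>"
            f"<ul>{rows}</ul><p>Footer</p></body></html>")

PAGE_A = make_page([("Pen", "$1"), ("Ink", "$2")])
PAGE_B = make_page([("Cup", "$3"), ("Mug", "$4"), ("Jar", "$5")])

region_a, region_b = find_data_region(build_tree(PAGE_A), build_tree(PAGE_B))
print(to_xml(extract_records(region_a)))
```

In this sketch the navigation block and the footer are identical on both sample pages and are therefore skipped, while the item list differs and is reported as the data region. Real pages would need a tolerance-based tree similarity instead of exact matching, since template regions rarely match character for character.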