Paper Title : Modified N-Gram based Model for Identifying and Filtering Near-Duplicate Documents Detection
Author : Farheen Naaz, Farheen Siddiqui
Article Citation : Farheen Naaz, Farheen Siddiqui (2017), "Modified N-Gram based Model for Identifying and Filtering Near-Duplicate Documents Detection", International Journal of Advance Computational Engineering and Networking (IJACEN), pp. 55-59, Volume-5, Issue-10
Abstract : Over the last three decades the World Wide Web (WWW) has expanded exponentially, and a great deal of it consists of duplicate or near-duplicate content. Documents are served on the web in different formats such as PDF, HTML, Excel and plain text. Our proposed solution is built on a publicly available dataset of files that are tagged as duplicates. The work in this paper addresses duplicate and near-duplicate document detection using an n-gram based, low-dimensional representation (LSI-SVD) approach, implemented in C#.NET.
Keywords - Duplicate document, N-gram, SVD (Singular Value Decomposition), LSI (Latent Semantic Indexing), Cosine similarity.
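As a rough illustration of the n-gram and cosine-similarity ideas named in the abstract and keywords, the following is a minimal sketch and not the authors' implementation (which the abstract says was written in C#.NET). It assumes word-level trigrams and raw count vectors, and it omits the LSI-SVD dimensionality-reduction step entirely. The function names and the 0.8 threshold are hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n=3):
    """Count the word-level n-grams of a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens, giving an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no n-grams in at least one document
    return dot / (norm_a * norm_b)


def is_near_duplicate(doc_a, doc_b, n=3, threshold=0.8):
    """Flag two documents as near-duplicates when similarity reaches the threshold.

    The threshold value is an assumption for illustration, not taken from the paper.
    """
    return cosine_similarity(word_ngrams(doc_a, n), word_ngrams(doc_b, n)) >= threshold
```

Two documents that differ only in a trailing word share most of their trigrams and score close to 1, while documents with no trigram in common score 0. A fuller pipeline in the spirit of the abstract would presumably project the n-gram vectors into a lower-dimensional space with SVD before comparing them.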
Type : Research paper
Published : Volume-5,Issue-10
DOIONLINE NO - IJACEN-IRAJ-DOIONLINE-9663
Copyright: © Institute of Research and Journals
Published on 2017-12-30