Near Duplicate Document Detection Using Document-Level Features and Supervised Learning

Raveena.S, Nandini.V

Special Issue Article Open Access

Abstract

This paper addresses the problem of near-duplicate documents and proposes a new method for detecting them in a large document collection. The method has three steps: feature selection, similarity measurement, and discriminant derivation. Feature selection pre-processes the input document, calculates the weight of each term, and selects the most heavily weighted terms as the features of that document. The similarity measure computes the degree of similarity between two documents. Discriminant derivation uses an SVM classifier to learn a discriminant function from the document set by supervised learning. The resulting discriminant function decides, from the similarity degree, whether a document is a near duplicate. Document-level feature selection gives better, more efficient results than sentence-level feature selection.
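
The three steps can be illustrated with a minimal sketch. The abstract does not name the term-weighting scheme, the similarity measure, or the SVM training procedure, so everything specific here is an assumption made for illustration: TF-IDF weighting, a two-component similarity vector (cosine and Jaccard), a linear SVM fitted by subgradient descent on the hinge loss, and the toy corpus, stopword list, and function names. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for the pre-processing step.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}

def tokenize(text):
    """Pre-processing: lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def select_features(tokens, doc_freq, n_docs, k=5):
    """Step 1: weight each term (TF-IDF, an assumption) and keep the k heaviest."""
    tf = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in tf.items()}
    # Ties are broken alphabetically so the selection is deterministic.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)

def similarity(fa, fb):
    """Step 2: similarity degrees of two feature maps, as [cosine, Jaccard]."""
    dot = sum(w * fb[t] for t, w in fa.items() if t in fb)
    norm = (math.sqrt(sum(w * w for w in fa.values()))
            * math.sqrt(sum(w * w for w in fb.values())))
    cosine = dot / norm if norm else 0.0
    union = set(fa) | set(fb)
    jaccard = len(set(fa) & set(fb)) / len(union) if union else 0.0
    return [cosine, jaccard]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Step 3: linear SVM via subgradient descent on the L2-regularised hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # regularisation shrink
            if margin < 1.0:                          # sample violates the margin
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def is_near_duplicate(fa, fb, w, b):
    """Discriminant function: True when w . s + b > 0 for similarity vector s."""
    s = similarity(fa, fb)
    return sum(wj * sj for wj, sj in zip(w, s)) + b > 0.0

# Hypothetical toy corpus; documents 0/1 and 2/3 are near duplicates.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",
    "stock markets fell sharply on monday trading",
    "stock markets fell sharply in monday trading session",
    "a recipe for baking sourdough bread at home",
]
tokens = [tokenize(d) for d in docs]
doc_freq = Counter(t for toks in tokens for t in set(toks))
features = [select_features(toks, doc_freq, len(docs)) for toks in tokens]

# Labelled pairs for supervised learning: +1 near duplicate, -1 not.
labelled_pairs = [(0, 1, 1), (2, 3, 1), (0, 2, -1), (1, 3, -1), (0, 4, -1)]
X = [similarity(features[i], features[j]) for i, j, _ in labelled_pairs]
y = [label for _, _, label in labelled_pairs]
w, b = train_linear_svm(X, y)
```

Calling `is_near_duplicate(features[0], features[1], w, b)` then applies the learned discriminant function to a pair of documents. A real collection would need a proper stopword list, many more labelled pairs, and probably an established SVM library in place of the hand-written trainer.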

Raveena.S, Nandini.V
