Correlated concept based dynamic document clustering algorithms for newsgroups and scientific literature
© Jayabharathy and Kanmani; licensee Springer. 2014
Received: 11 April 2013
Accepted: 5 September 2013
Published: 19 February 2014
The growing number of documents in corpora such as newsgroups, government organizations, the Internet and digital libraries has made categorizing and retrieving them increasingly complex. Incorporating semantic features improves the accuracy of clustering-based document retrieval and makes it possible to organize and retrieve documents more efficiently from large corpora. Although semantics-based clustering enhances cluster quality, the scalability of such systems remains a challenge. In this paper, three dynamic document clustering algorithms are proposed: Term frequency based MAximum Resemblance Document Clustering (TMARDC), Correlated Concept based MAximum Resemblance Document Clustering (CCMARDC) and Correlated Concept based Fast Incremental Clustering Algorithm (CCFICA). TMARDC is based on term frequency, whereas CCMARDC and CCFICA are based on a correlated terms (terms and their related terms) concept extraction algorithm. The proposed algorithms were compared with existing static and dynamic document clustering algorithms through experiments on datasets drawn from 20Newsgroups and scientific literature. F-measure and Purity were used as metrics for evaluating the performance of the algorithms. The experimental results demonstrate that the proposed algorithms exhibit better performance than the four existing document clustering algorithms.
Keywords: Static and dynamic document clustering; MAximum resemblance data labeling (MARDL) technique; Term frequency; Inverse document frequency (TFIDF); Concepts; Semantic similarity
The tremendous growth in the volume of text documents available from sources such as the Internet, digital libraries, news sources and company-wide intranets has led to increased interest in methods that help users effectively navigate, summarize and organize information, with the ultimate goal of helping them find what they are looking for. In this context, fast and high-quality document clustering algorithms play an important role: they have been shown to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters, and to greatly improve retrieval performance through cluster-driven dimensionality reduction, term-weighting (Tang et al. 2005) or query expansion (Sammut and Webb 2010). Because today’s search engines essentially perform string matching, the documents retrieved may not be relevant to the user’s query. A good document clustering approach can therefore help organize a document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, and can help overcome the inherent deficiencies of traditional information retrieval methods.
Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, it was investigated for improving precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbors of a document (Van Rijsbergen 1989; Kowalski and Maybury 2002; Buckley and Lewit 1985). Clustering was then used for browsing a collection of documents or for organizing the results returned by a search engine in response to a user’s query (Cutting et al. 1992; Zamir et al. 1997). Document clustering has also been used to automatically generate hierarchical clusters of documents (Steinbach et al. 2000). For example, a web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information.
Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is done by enterprise search engines such as Northern Light and Vivisimo (Andrews and Fox 2007). However, scalability becomes a major issue as the number of documents increases day by day, which makes it necessary to cluster documents dynamically without disturbing the clusters already formed. Clustering documents dynamically drastically reduces the time and effort required, because a dynamic algorithm processes each new document and assigns it directly to a meaningful cluster instead of re-clustering the entire corpus. Although some methods exist for clustering documents in a dynamic environment, based on terms (Wang et al. 2011) or on synonyms and hypernyms (Nadig et al. 2008), they are not well suited to documents that are technically related. To overcome these limitations, this paper proposes a model for dynamic document clustering based on term frequency and on correlated terms (terms and their related terms) as concepts, applied to scientific literature and Newsgroups data sets. Three new algorithms are proposed, namely Term frequency based MAximum Resemblance Document Clustering (TMARDC), Correlated Concept based MAximum Resemblance Document Clustering (CCMARDC) and Correlated Concept based Fast Incremental Clustering Algorithm (CCFICA). Their performance is compared on the same datasets with that of four existing algorithms, namely Semantic Similarity Histogram based Incremental Document Clustering (SHC), Concept-Based Mining Model (CBM), Incremental Algorithm for Clustering Search Results (ICA) and Enhanced Similarity Histogram Clustering using Intra Centroid Vector Similarity (ESHC-IntraCVS), and the results are presented.
The remainder of this paper is organized as follows. Section “Related work” reviews related work on static and dynamic document clustering. Section “Overview of existing document clustering considered for comparative analysis” outlines the general model for dynamic document clustering and briefly states the need for considering correlated terms. Section 4 describes the new clustering algorithms, namely TMARDC, CCMARDC and CCFICA, in detail. Section 5 discusses the experimental setup and data set descriptions, followed by an analysis of the results. Finally, salient conclusions are presented in section “Experimental results”.
We have conducted systematic and structured reviews to identify the issues in existing dynamic document clustering algorithms. To overcome the issues in the existing work, three algorithms, namely Term frequency based MAximum Resemblance Document Clustering (TMARDC), Correlated Concept based MAximum Resemblance Document Clustering (CCMARDC) and Correlated Concept based Fast Incremental Clustering Algorithm (CCFICA), have been proposed. To demonstrate the potential of the proposed algorithms, experiments were conducted on two datasets. The proposed algorithms show better results than the existing algorithms.
Most existing document clustering methods are based on the Vector Space Model (VSM), a widely used data representation for text classification and clustering (Aas and Eikvil 1999). In VSM, a document is represented as a feature vector of the terms in the document, and each feature vector contains the weights of those terms. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used as a weight to evaluate how important a word is to a document in a collection or corpus (Salton and Buckley 1998). The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. The similarity between documents is measured by one of several similarity measures that are based on such feature vectors (Huang 2008). Common ones include the cosine measure and the Jaccard measure.
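The weighting and similarity measures described above can be illustrated with a minimal sketch. It assumes one common TF-IDF variant (raw term frequency multiplied by log(N/df)); the function names and the toy documents are illustrative and are not taken from the paper.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.
    Assumed variant: weight = raw term frequency * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine measure between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard(a, b):
    """Jaccard measure between the term sets of two documents."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy corpus (hypothetical): two related documents and one unrelated document.
docs = [
    ["document", "clustering", "algorithm"],
    ["document", "clustering", "semantic"],
    ["nuclear", "physics", "energy"],
]
vecs = tf_idf_vectors(docs)
related = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
```

In this toy corpus the first two documents share two terms and therefore obtain a positive cosine score, while the third document shares no terms with the first and scores zero.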
Table 1 Survey of recent document clustering algorithms
Algorithm name with author(s) | Technique, representation and similarity measure | Data set used
Threshold Resilient Online Algorithm, Chou and Chen (2008) | IPLSI (Incremental Probabilistic Latent Semantic Indexing); Latent Semantic Variables; a latent variable is introduced between documents and terms, cosine function | NIST TDT Corpora
Efficient Phrase Based Indexing, Hammouda and Kamel (2004) | Uses DIG (Document Index Graph) for Web Clustering; Document Index Graph (Phrase Based Representation); Phrase Based Similarity measure | USENET News Groups
Component-Based Clustering Algorithms, Boris et al. (2012) | IR (Initial Representative), MD (Measure Distance), UR (Update Representatives), EC (Evaluate Clusters), SC (Stop Criterion); Object-Based Software Representation; CITY, CORREL, COSINE, EUCLID | 10 UCI Datasets
Temporal Queries and Version Management, Zaniolo and Wang (2008) | V-Document (XML Document) | W3C, World Fact Book
Density-Based Methods for Hierarchical Clustering, Chehreghani and Abolhassani (2008) | 3 phases: Insertion Phase, Extraction Phase, Combination Phase; Relative distance between objects | DMOZ, NEWS, REUTERS
XML Schema Matching Algorithm, Alsayed et al. (2009) | NPS (Number Prufer Sequences), LPS (Label Prufer Sequences); Prufer Sequences, Schema Trees; the distance between two nodes in the schema tree | -
Novel Web User Clustering Method, Ling et al. (2009) | A 3-phase COWES Algorithm; A Web Session Subtree; DoC (Degree of Change), FoC (Frequency of Change) and SoC (Significance of Change) | Internet Traffic Archive
Multi-label Document Clustering Algorithm, Chen et al. (2010) | FMDC (Fuzzy Based Multi-label Document Clustering): Fuzzy Association Rule + Existing Ontology; Terms and Hypernyms Representation of documents; Membership Functions and Document Term Matrix | Classic, Re0, R8, and WebKB
Incremental Construction of Multilingual Topic Maps, Ellouze et al. (2012) | CITOM (Construction Incremental Topic Map); Topic Map Model Representation; Topic Map Pruning Process | -
Feature Extraction Algorithm, Yan et al. (2011) | TOFA (Trace-Oriented Feature Analysis); Bag Of Words Model (BOW); Latent Semantic Indexing (LSI) | 20NG, RVCI, ODP
Correlation Similarity Measure Space, Zhang et al. (2011) | CPI (Correlation Preserving Index); Terms and related terms | -
Contextual Document Cluster, Rooney et al. (2006) | CDC (Contextual Document Cluster); Term Document Representation; Adjacent Document Similarity | -
Framework of Wikipedia-Based Clustering, Hu et al. (2009) | Exact-match and Relatedness-match; Concept feature vector and Category feature vector; Complete Linkage as cluster distance measure | 20-newsgroup, TDT2, LA Times
Critical analysis of the recent document clustering algorithms presented in Table 1 reveals that documents are represented: (i) based on phrases or pair-wise concepts, where the similarity relationship between sentences is identified (Hammouda and Kamel 2004; Lam and Hwuang 2009); (ii) using tree representations, where the similarity between two objects or nodes is identified and used for clustering (Zaniolo and Wang 2008; Chehreghani and Abolhassani 2008; Alsayed et al. 2009); (iii) using a component-based clustering algorithm that models documents with an object-based software representation and clusters them with cosine and Euclidean measures (Boris et al. 2012); (iv) by identifying semantic relations and representing documents based on terms and related terms (Zhang et al. 2011); and (v) as concept and feature vectors (Hu et al. 2009). Most of the above works are based on web page information representation, tracking and retrieval.
Prathima and Supreethi (2011) presented a survey of concept based clustering algorithms, and concluded that most of the clustering techniques use TF-IDF method. This method has the following issues:
It fails to differentiate the degree of semantic importance of each term;
It assigns weights without distinguishing between semantically important and unimportant words within the document; and
It does not consider synonymy, polysemy, etc.
Based on the critical analysis of published literature, it is inferred that more than 60% of clustering techniques are based on term frequencies. About 30% of clustering techniques and annotation tools use synonyms and hypernyms for predicting concepts, and these synonyms and hypernyms are extracted by means of the WordNet lexical database (Miller 1995). Since scientific literature and many tracks of news documents consist largely of domain-specific technical terms, clustering based on synonyms and hypernyms may not always yield better results. To enhance cluster quality for such document sets, the present study focuses on clustering documents based on terms and their technically related terms. For this purpose, a domain-specific dictionary has been developed by the authors to extract the related terms as concepts.
Overview of existing document clustering considered for comparative analysis
The four existing algorithms chosen for comparative analysis with the proposed algorithms are briefly described below.
Semantic similarity histogram based incremental document clustering (SHC) algorithm
Gad and Kamel (2010) proposed an incremental clustering algorithm based on Phrase-Semantic Similarity Histogram (PSSM). This algorithm integrates text semantics into the incremental clustering process. The clusters are represented using a semantic histogram, which measures the distribution of semantic similarities within each cluster. The PSSM, which is based on single-word analysis and phrase analysis, assigns and adjusts the weight of a term (word/phrase) based on its relationships with semantically similar terms that occur together in the document. As soon as a new document is incrementally added to a cluster, the semantic histogram ratio is calculated. The insertion order problem is addressed by letting bad documents, i.e. those that reduce cluster cohesiveness, leave the cluster and be reassigned to a more appropriate one.
Enhanced similarity histogram clustering using intra centroid vector similarity (ESHC-intra CVS) algorithm
Gavin and Yue (2009) proposed an enhanced incremental clustering approach that helps to organize the information available on the Internet in an incremental fashion. The algorithm works on the idea that a cluster containing a large number of documents similar to the document currently being clustered will have a centroid vector with a high similarity to that document. Therefore, the cluster whose centroid vector is most similar to the document’s vector representation is the one most likely to contain the maximum number of documents similar to the current document. Adding the new document to this cluster (when possible) will probably give the greatest benefit to that cluster and to the entire dataset. This approach uses the same pair-wise document similarity representation and distribution approach, and additionally uses information about the cluster to determine the best cluster in which to place the new document.
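The centroid-ranking idea described above can be sketched as follows. This is only an illustration of selecting the cluster whose centroid is most similar to a new document, assuming dense term-weight vectors and the cosine measure; it does not reproduce the histogram check that decides whether the insertion is actually allowed, and all names and data are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar_cluster(clusters, doc_vec):
    """Return (cluster id, similarity) for the centroid closest to doc_vec."""
    scored = {cid: cosine(centroid(vs), doc_vec) for cid, vs in clusters.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Hypothetical clusters of three-dimensional term-weight vectors.
clusters = {
    "networks": [[3.0, 1.0, 0.0], [2.0, 2.0, 0.0]],
    "physics": [[0.0, 1.0, 4.0], [0.0, 0.0, 3.0]],
}
best_id, best_sim = most_similar_cluster(clusters, [1.0, 1.0, 0.0])
```

Comparing the new document with one centroid per cluster, rather than with every document, is what keeps the candidate selection cheap.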
Concept-based mining model (CBM)
Shehata et al. (2010) proposed a concept-based mining model for enhancing text clustering. The model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis and a concept-based similarity measure. By combining the factors affecting the weights of concepts at the sentence, document and corpus levels, a concept-based similarity measure capable of accurately calculating pair-wise document similarity is formulated. This allows concept matching and concept-based similarity calculations among documents to be performed accurately. The quality of text clustering achieved by this model significantly surpasses that of traditional single-term-based approaches such as (i) Hierarchical Agglomerative Clustering (HAC), (ii) Single-Pass Clustering and (iii) k-Nearest Neighbor (k-NN).
An incremental algorithm for clustering search results (ICA)
Liu et al. (2008) proposed an incremental clustering algorithm based on Cluster Average Similarity Area (CASA), which was used to score the degree of coherency of a cluster. The cohesiveness quality information of a cluster was computed based on its CASA. The above algorithm works by processing data objects one at a time, incrementally assigning data objects to their respective clusters while they progress.
A model for dynamic document clustering
Preprocessing involves: tokenization, removing stopwords and stemming.
Tokenization (Christopher et al., http://nlp.stanford.edu/software/tokenizer.shtml) is the process of splitting sentences into separate tokens. For example, “this is a paper about document clustering” is split as: this | is | a | paper | about | document | clustering. Stop words are frequently occurring words that have little or no discriminating power, such as “a”, “about” and “all”, or other domain-dependent words; they are often removed. Stemming is the process of removing the affixes of words to produce the root word, known as the stem (Frakes and Fox 2003). For example, connected, connecting and connection would all be transformed into “connect”. The most widely used stemming algorithms are those proposed by Porter (1998) and Lovins (1968), and the S-removal stemmer (Harman 1991).
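The three preprocessing steps can be sketched as below. The stop-word list and the suffix-stripping rule are deliberately tiny stand-ins chosen for illustration (they are assumptions, not the Stanford tokenizer or the Porter stemmer used in the study), so the sketch only shows how the stages chain together.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "about", "all", "is", "this", "the", "of", "and"}

# Suffixes tried in this order; a crude stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ion", "ed", "s")

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

raw_tokens = tokenize("this is a paper about document clustering")
tokens = preprocess("This is a paper about document clustering")
stems = [naive_stem(w) for w in ("connected", "connecting", "connection")]
```

A production pipeline would replace `naive_stem` with an established stemmer, since simple suffix stripping mishandles many English words.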
Static document clustering
The preprocessed documents are clustered using the Bisecting K-means clustering algorithm in order to group similar documents. Cluster analysis, or clustering, is the assignment of a set of observations to subsets (called clusters) so that observations in the same cluster are similar in some sense. The Bisecting K-means method splits a large cluster into two sub-clusters and repeats this step until K clusters with high similarity are formed (Steinbach et al. 2000).
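A minimal sketch of the bisecting procedure is given below. It assumes Euclidean distance on small numeric vectors, always splits the largest cluster, and seeds each 2-means split deterministically; published variants typically try several random bisections and keep the best one, which is omitted here. All names and data are illustrative.

```python
def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Centroid of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def two_means(points, iters=20):
    """Split points into two groups with basic 2-means.
    Seeds (an assumption): the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        right = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not left or not right:
            break
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):  # assignments have converged
            break
        c0, c1 = new_c0, new_c1
    return left, right

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=len)
        left, right = two_means(target)
        if not left or not right:  # cannot split further (e.g. identical points)
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters

# Hypothetical 2-D points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
clusters = bisecting_kmeans(points, 3)
```

For documents, the points would be term-weight vectors and the distance would usually be replaced by a cosine-based measure.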
Dynamic document clustering
Dynamic document clustering is the process of inserting newly arrived documents into the appropriate existing clusters such that the resulting clusters have high intra-cluster similarity and low inter-cluster similarity. The new documents are first preprocessed and then clustered using the dynamic technique. The issues to be addressed are:
Effectiveness: How accurately the newly arrived documents are inserted to the existing clusters.
Insertion Order Issue: Pattern of arrival of new documents should not affect the correctness of the clusters.
The new documents are assigned to the existing cluster, one by one in recursive steps. The new documents are assigned to a cluster dynamically at run time without the need for re-clustering. As a result the existing clusters are updated and the final clusters are obtained. In this study, the three newly proposed algorithms TMARDC, CCMARDC and CCFICA are experimented for clustering the documents dynamically. The details of these three algorithms are discussed in the next section.
Proposed algorithms for dynamic document clustering
This section describes the proposed Term frequency based MAximum Resemblance Document Clustering (TMARDC) algorithm, Correlated Concept based MAximum Resemblance Document Clustering (CCMARDC) algorithm, and Correlated Concept based Fast Incremental Clustering Algorithm (CCFICA) for dynamic clustering.
Term frequency based maximum resemblance document clustering (TMARDC)
This algorithm adopts the core concept of MARDL, i.e. the Maximum Resemblance technique (Chen et al. 2008), and is purely based on a bag-of-words representation. It starts with the set of clusters obtained from bisecting K-means clustering. Initially, a sample set is constructed for each cluster: one third of the documents in each cluster are chosen randomly as samples. The samples chosen should be unique, with no duplicate documents in a sample set. Each new document is first preprocessed, which includes stop word removal and stemming. The preprocessed document is then compared with the samples based on Sentence Importance Computation (SIC) and Cluster set Importance Computation (CIC), and the influence of the new document on each cluster, termed the Frequency Value (FV), is calculated. Because the number of documents in each sample set may vary, the CIC is normalized to obtain the FV.
The dynamic algorithm then assigns the new document to the cluster with the highest FV, provided that this FV is not below the threshold value. The threshold decides whether a document is assigned to an existing cluster or forms a new one: if every cluster yields an FV below the threshold, the new document forms a separate cluster. The threshold value, termed Tmax, was determined through a series of experiments on worst, average and best case inputs. Because a newly arrived document whose FV falls below Tmax forms a separate cluster, no document is left unclustered, even if it does not match any of the existing clusters.
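The assignment rule can be sketched as follows. The SIC and CIC formulas are not reproduced here, so the `frequency_value` function below is a placeholder assumption (shared-term counts summed over the samples and normalized by the sample size); only the sampling of one third of each cluster and the threshold logic follow the description above. All names and data are hypothetical.

```python
import random

def draw_samples(cluster_docs, rng):
    """Pick one third of a cluster's documents (at least one) without repetition."""
    size = max(1, len(cluster_docs) // 3)
    return rng.sample(cluster_docs, size)

def frequency_value(new_doc, samples):
    """Placeholder FV (assumption): shared-term count summed over the samples,
    normalized by the number of samples."""
    new_terms = set(new_doc)
    total = sum(len(new_terms & set(doc)) for doc in samples)
    return total / len(samples)

def assign(new_doc, clusters, t_max, rng):
    """Insert new_doc into the cluster with the highest FV,
    or open a new cluster when every FV is below t_max."""
    scores = {cid: frequency_value(new_doc, draw_samples(docs, rng))
              for cid, docs in clusters.items()}
    best = max(scores, key=scores.get)
    if scores[best] < t_max:
        best = f"cluster_{len(clusters)}"  # hypothetical naming scheme
        clusters[best] = []
    clusters[best].append(new_doc)
    return best

rng = random.Random(0)
clusters = {
    "sports": [["bat", "ball", "wicket"], ["ball", "run", "over"], ["bat", "run", "batsman"]],
    "market": [["share", "market", "money"], ["share", "shareholder", "money"],
               ["market", "money", "trade"]],
}
first = assign(["ball", "bat", "run"], clusters, t_max=1.0, rng=rng)
second = assign(["nuclear", "physics"], clusters, t_max=1.0, rng=rng)
```

In the second call no existing cluster reaches the threshold, so the document opens a cluster of its own, which mirrors the guarantee that no document is left unclustered.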
Correlated concept based maximum resemblance document clustering (CCMARDC)
Incorporating semantic features improves the quality of document clustering and the accuracy of information extraction techniques. This study adopts the concept extraction algorithm introduced by Jayabharathy et al. (2011), which is itself a modification of the semantic-based model proposed by Shehata (2009). The model proposed by Shehata (2009) aims to cluster documents by meaning. The semantic-based similarity measure is used for the two algorithms CCMARDC and CCFICA proposed in this study. To extract concepts, a domain-specific dictionary consisting of scientific terms and terms related to newsgroup tracks was created, unlike the work of Shehata (2009), in which the WordNet lexical database (Miller 1995) was used for synonym/hypernym extraction. A domain-specific dictionary for scientific literature and Newsgroups is used for concept extraction because it eliminates the need for word sense disambiguation (WSD) (Banerjee and Pedersen 2002; http://en.wikipedia.org/wiki/Word_sense_disambiguation), which is outside the scope of the present study.
Why correlated terms?
Many existing clustering algorithms use synonyms and hypernyms for vector representation. In this study, the authors consider correlated term vectors (crtv) as concepts in order to improve the efficiency of clustering documents both statically and dynamically. The idea of treating terms and related terms as concepts based on semantic similarity has previously been used for extracting topics from clustered documents (Jayabharathy et al. 2011). The proposed CCMARDC technique takes this idea of using crtv as concepts for static clustering and applies it to clustering documents dynamically. Using terms, or synonyms and hypernyms, for information extraction leads to the following issues:
Case 1: Words have multiple meanings, hence diversifies the information extraction.
E.g. Bat: represents a cricket bat or a kind of flying mammal.
Case 2: Considering terms or synonyms of the terms limits the search space of the domain.
E.g. wireless: first sense medium of communication.
For example, the synonym of the term “wireless” extracted from WordNet is “first sense medium of communication”, whereas the proposed approach extracts related terms such as “wireless”, “communication”, “protocol” and “mobile communication” as a concept, which gives better accuracy and improves the efficiency of information extraction. As another example, a sports article contains terms such as ball, bat, wicket, run, batsman and over. Taking synonyms/hypernyms as concepts will not give good performance here, since the meanings of these terms are not literally the same. If the technically related terms, i.e. the crtv, are considered instead, all of the above terms are grouped together as a single concept that refers to the sport of cricket. Similarly, the synonym for the term “farmer” extracted from WordNet is “a person who operates a farm”, whereas the proposed model extracts the concept as “farmer”, “crops”, “fertilizer”, “land” and “farm”. Clustering documents using this extraction procedure is expected to improve the quality of the resulting clusters over that of clusters generated by existing works.
Concept extraction algorithm: description
Treating extracted synonyms/hypernyms as concepts degrades the quality of the results for the scientific literature and newsgroup datasets, because these documents consist largely of scientific or technical terms. Concept extraction is based on our previous work (Jayabharathy et al. 2011), where correlated concepts are simply terms and their related terms. For concept extraction, a domain-specific dictionary is used in which the terms related to each domain are kept along with their meanings. For example, given terms A and B: if term A appears in the definition of term B or vice versa, A and B are combined into a single concept; otherwise A and B are added to the concept list as separate concepts. For instance, taking share market as the term in news documents, the related terms are share, shareholder, money and market. The documents containing these words are grouped together under share market, which forms the cluster.
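The merging rule described above (A and B form one concept when either appears in the other's dictionary definition) can be sketched with a small union-find structure. The dictionary entries below are invented for illustration, terms are assumed to be single words, and merging transitively across linked terms is an assumption of this sketch rather than a documented detail of the authors' algorithm.

```python
def extract_concepts(terms, dictionary):
    """Group terms into concepts using definition overlap.
    dictionary maps a term to its (hypothetical) textual definition."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    def related(a, b):
        # True when a occurs in b's definition or b occurs in a's definition.
        return (a in dictionary.get(b, "").lower().split()
                or b in dictionary.get(a, "").lower().split())

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if related(a, b):
                parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())

# Hypothetical domain dictionary entries.
dictionary = {
    "shareholder": "a person who owns a share in a company",
    "share": "a unit of ownership traded on a market",
    "market": "a place where a share or goods are exchanged for money",
    "money": "a medium of exchange",
    "wicket": "three stumps defended by a batsman",
    "batsman": "a player who bats",
}
terms = ["share", "shareholder", "money", "market", "wicket", "batsman"]
concepts = extract_concepts(terms, dictionary)
```

With these entries the share market vocabulary collapses into one concept and the cricket vocabulary into another, while a term with no link to any other would remain a concept on its own.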
The framework of the proposed correlated concept based maximum resemblance document clustering (CCMARDC)
The number of matching concepts, (mc), in each document (d);
The total number of the labeled verb-argument structures (v), in each sentence (st);
The ctf_i of each concept c_i in st for each document d, where i = 1, 2, …, mc; and
The cf_i of each concept c_i in each document d, where i = 1, 2, …, mc.
Where cn is the total number of concepts which have a conceptual term frequency value in document d.
Correlated concept based fast incremental clustering algorithm (CCFICA)
Xiaoke et al. (2009) proposed the Fast Incremental Clustering Algorithm (FICA), an incremental data clustering algorithm evaluated on the mushroom data set. The main objective of this algorithm is to cluster categorical data into K clusters using an incremental method. It uses a dissimilarity measure to find the distance between a new object and the existing clusters. The core idea of FICA is adopted in the CCFICA proposed here: the algorithm is modified to cluster documents in dynamic document corpora based on semantic similarity. For every cluster, the top correlated concepts from each document are extracted and maintained as a concept pool. Instead of computing the dissimilarity between the document clusters and the new document, the semantic similarity between the new document and the concept pool is computed, which reduces the computation overhead.
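The concept-pool idea can be sketched as follows. The semantic similarity measure itself is not reproduced here, so the sketch substitutes a simple assumed score (the fraction of the new document's top concepts found in a cluster's pool); the choice of three top concepts per document and all names and data are likewise illustrative.

```python
from collections import Counter

def top_concepts(doc_concepts, n=3):
    """The n most frequent concepts of one document."""
    return [c for c, _ in Counter(doc_concepts).most_common(n)]

def build_pool(cluster_docs, n=3):
    """Union of the top concepts of every document in a cluster."""
    pool = set()
    for doc in cluster_docs:
        pool.update(top_concepts(doc, n))
    return pool

def pool_similarity(doc_concepts, pool, n=3):
    """Assumed measure: share of the document's top concepts present in the pool."""
    top = top_concepts(doc_concepts, n)
    return sum(c in pool for c in top) / len(top) if top else 0.0

def assign_to_pool(doc_concepts, pools):
    """Pick the most similar pool and update it with the new document's top concepts."""
    scores = {cid: pool_similarity(doc_concepts, pool) for cid, pool in pools.items()}
    best = max(scores, key=scores.get)
    pools[best].update(top_concepts(doc_concepts))
    return best, scores[best]

# Hypothetical clusters, each document given as a list of extracted concepts.
clusters = {
    "cricket": [["bat", "ball", "bat", "wicket"], ["run", "over", "ball"]],
    "market": [["share", "money", "share"], ["market", "trade", "money"]],
}
pools = {cid: build_pool(docs) for cid, docs in clusters.items()}
best, score = assign_to_pool(["ball", "wicket", "ball", "umpire"], pools)
```

Because the new document is compared with one compact pool per cluster rather than with every clustered document, the number of comparisons grows with the number of clusters instead of the corpus size.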
The data set used for the experimental analysis contains 500 abstract articles collected from the Science Direct digital library. The articles are classified according to the Science Direct classification system into four major categories: computer networks and communications, nuclear and high energy physics, economics and econometrics, and civil and structural engineering. In addition, 20 Newsgroups is considered as another data set for the result analysis; it consists of more than 1000 news articles related to the Sports, Political and Share market tracks.
Initially, text documents which have been collected from various sources were accumulated in a database. Then, pre-processing was carried out by considering the various stages like: tagging by means of Stanford POS tagger tool, stop word removal and stemming, based on Porter Stemmer algorithm and morphological capabilities of WordNet. The above preprocessing is common for both existing and proposed algorithms considered in this study. Then the documents are represented as VSM. These documents are clustered using Bisecting K-means algorithm which generates K number of clusters.
The existing algorithms were implemented using the preprocessing outlined in this work and the datasets chosen for the study. The algorithms were implemented in this environment as originally proposed by their authors. For CBM, however, the entire model as originally proposed was not considered; only the CBA algorithm and the clustering based on concept semantic similarity were implemented. For uniformity, only the ICA clustering algorithm as originally proposed was used in this study, even though the original ICA algorithm starts with query retrieval and then proceeds to clustering. The results of the proposed and existing algorithms were measured while varying the number of documents. The algorithms were implemented in the JDK 1.7 environment using the NetBeans IDE.
Results and discussion
Techniques adopted in existing and proposed algorithms
SHC Gad and Kamel (2010)
Term weight (word/phrase relationship)
Reuters-21578 and 20-Newsgroups
ESHC-IntraCVS Gavin and Yue (2009)
UW-CAN dataset, 314 web pages from University of Waterloo
Verb argument structure
Concept similarity Measure
ACM abstract articles, Reuters, Brown corpus, Usenet newsgroups
ICA Liu et al. (2008)
MARDL, sentence similarity
ACM abstract articles, 20Newsgroup
ACM abstract articles, 20Newsgroup
ACM abstract articles, 20Newsgroup
i) Based on F-measure and Purity analysis for Scientific Literature;
ii) Based on F-measure and Purity analysis for Newsgroup and
iii) Based on pair-wise performance analysis (one to one comparison) for both datasets
F-measure and purity analysis for scientific literature dataset
The proposed algorithms perform better than the existing algorithms because they consider the semantic relations between documents. In CBA, the comparison is based solely on the semantic structure (subject verb argument) of each sentence. Although it extracts the most prominent terms in sentences, it fails to capture the technical correlation of terms between sentences and documents. A further reason is that CBA is a static clustering technique, which applies the clustering process to the entire document set, including the new document(s); clustering the entire document set is time consuming. In addition, extracting the semantic structure (subject verb argument) from the entire document set leads to information loss, as only the top sentences are extracted. Because the proposed CCMARDC captures the correlated concepts through the concept extraction algorithm, and is devised as a dynamic algorithm, the problem of information loss is overcome. Hence, the proposed CCMARDC algorithm gives better results than the existing CBA algorithm.
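For reference, the two evaluation metrics can be computed as in the sketch below. It assumes the commonly used definitions (purity as the fraction of documents that belong to the majority class of their cluster, and the class-weighted F-measure that takes, for each class, the best F score over all clusters); the exact formulation used in the experiments is not reproduced here, and the labels are hypothetical.

```python
from collections import Counter

def purity(clusters, labels):
    """clusters: list of lists of document ids; labels: {document id: true class}."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(labels[d] for d in c).values()) for c in clusters if c)
    return majority / total

def f_measure(clusters, labels):
    """Class-size-weighted average of the best per-class F score over clusters."""
    total = sum(len(c) for c in clusters)
    class_sizes = Counter(labels[d] for c in clusters for d in c)
    score = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = sum(1 for d in c if labels[d] == cls)
            if n_ij == 0:
                continue
            precision, recall = n_ij / len(c), n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_i / total) * best
    return score

# Hypothetical ground truth: three network papers and three physics papers.
labels = {1: "net", 2: "net", 3: "net", 4: "phys", 5: "phys", 6: "phys"}
mixed = [[1, 2, 4], [3, 5, 6]]     # one misplaced document per cluster
perfect = [[1, 2, 3], [4, 5, 6]]
```

Both metrics lie between 0 and 1, and both reach 1 only when every cluster contains documents of a single class and every class is gathered in a single cluster.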
F-measure and purity analysis for newsgroup dataset
Pair-wise performance analysis (one to one comparison) for both datasets
i) Clustering based on term frequency (TMARDC, ICA, SHC, ESHC)
ii) Clustering based on concepts (CCMARDC, CCFICA, CBA)
The quality of clustering can be judged properly only when algorithms of the same category are evaluated and analyzed together. Accordingly, a comparative analysis was made between the pairs CBA & CCMARDC and CBA & CCFICA, as CBA treats synonyms and hypernyms as concepts. Then the performance of the term frequency based algorithms was compared pair-wise (i.e. TMARDC & ICA, TMARDC & SHC, TMARDC & ESHC). In Figure 7, for simplicity, the above pairs are denoted C1, C2, T1, T2 and T3, where
C1 = CBA & CCMARDC, C2 = CBA & CCFICA,
Result outcome improvement classes of proposed algorithms
Newsgroup dataset: CCMARDC & CBA; CCFICA & CBA; TMARDC & ICA; TMARDC & ESHC; TMARDC & SHC
Scientific literature dataset: CCMARDC & CBA; CCFICA & CBA; TMARDC & ICA; TMARDC & SHC; TMARDC & ESHC
The term based algorithms were also run on the same document collections, and the results obtained are summarized in Table 3. The clustering quality of TMARDC is appreciably better than that of the existing term based SHC, ESHC and ICA algorithms. This is because TMARDC identifies the prominence of each sentence of the newly arrived document with respect to the sample documents using SIC, and the relevancy of the new document to each sample set using CIC and NCIC, which improves quality. Computing the similarity between the samples and the new document(s) helps to choose a prominent cluster for inserting the newly arrived document, rather than re-clustering the entire set. Whereas most incremental clustering algorithms apply a similarity measure between the entire cluster and the new document, the proposed algorithms compute the similarity only between the top concepts or terms of the samples and those of the new document(s). The computation overhead is thus reduced to a great extent, as these parameters are computed against the new document and the sample set only, not the entire cluster. Choosing documents around the cluster centroid instead of random samples may further improve the quality.
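The sample-based assignment idea described above can be illustrated with a minimal sketch. It is not the TMARDC procedure itself: plain cosine similarity over raw term counts is used here as a stand-in for the SIC, CIC and NCIC measures, and the function names, the `sample_size` parameter and the toy clusters are all hypothetical. The sketch only shows that the new document is compared against a few samples per cluster and inserted into the cluster with maximum resemblance, without re-clustering.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_incrementally(clusters, new_doc, sample_size=2, seed=0):
    """Insert new_doc into the cluster whose samples resemble it most.

    clusters: dict mapping cluster name -> list of documents (token lists).
    new_doc:  list of tokens for the newly arrived document.
    Only sample_size randomly chosen documents per cluster are compared
    (an assumption standing in for the sampling step described in the text).
    """
    rng = random.Random(seed)
    new_vec = Counter(new_doc)
    best_name, best_score = None, -1.0
    for name, docs in clusters.items():
        if not docs:
            continue  # nothing to sample from an empty cluster
        samples = rng.sample(docs, min(sample_size, len(docs)))
        score = max(cosine(new_vec, Counter(s)) for s in samples)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None:
        raise ValueError("all clusters are empty")
    clusters[best_name].append(new_doc)  # existing clusters are not rebuilt
    return best_name


# Toy example (hypothetical data).
toy = {
    "sports": [["ball", "goal", "team"], ["team", "match", "goal"]],
    "space": [["orbit", "rocket", "moon"], ["moon", "launch", "rocket"]],
}
chosen = assign_incrementally(toy, ["rocket", "orbit", "launch"])
```

The cost of each insertion depends on the number of clusters and the sample size rather than on the total number of documents, which is the source of the reduced overhead noted above.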
The emphasis of the present work is dynamic document clustering based on term frequency and correlated concept based algorithms, using a semantic-based similarity measure. The core idea of the data mining algorithms MARDL and FICA is adopted for the proposed algorithms TMARDC, CCMARDC and CCFICA. Documents are generally represented as TF-IDF vectors, whereas in this study they are represented by means of a correlated term vector (crtv). This representation helps to capture the technical correlation between documents. The proposed algorithms are compared with the existing term frequency and synonym/hypernym based incremental document clustering algorithms on the scientific literature and newsgroup datasets. From the comparative analysis it can be concluded that the crtv representation for dynamic document clustering leads to promising results, especially for scientific literature. The results on the Newsgroup dataset are sometimes less promising, because that dataset relies relatively more on English literary terms than on technical terms. In future, it is proposed to extend concept extraction based on significant phrases in documents, and also to incorporate semantic relations like hyponymy, holonymy, and meronymy.
Jayabharathy received her B.Tech (CSE) from Pondicherry Engineering College, Puducherry, India and M.Tech (CSE) from Pondicherry University, Puducherry, India. She is currently working as Assistant Professor in the Department of Computer Science & Engineering at Pondicherry Engineering College. She has published nearly 15 research papers. She is currently pursuing her Ph.D in Document Mining. Her areas of interest include Data Mining and Distributed Computing.
Dr. S. Kanmani received her B.E (CSE) and M.E (CSE) from Bharathiar University, Coimbatore, India and Ph.D from Anna University, Chennai, India. She is working as Professor in the Department of Information Technology at Pondicherry Engineering College. She has published nearly 63 research papers. She is currently a supervisor guiding 8 Ph.D scholars. She is an expert in Software Testing. Her areas of interest include Software Engineering, Genetic Algorithms and Data Mining.
We appreciate the insightful comments from the three anonymous reviewers. Their comments were very helpful for us to improve the paper. We also express our thanks to Pondicherry Engineering College for their support in performing this research.
- Aas K, Eikvil L: Text Categorisation: A Survey. Technical Report 941. Oslo Norway: Norwegian Computing Center; 1999. iteseer.ist.psu.edu/aas99text.htmlGoogle Scholar
- Alsayed A, Eike S, Saake G: Improving XML schema matching performance using prüfer sequences. Data Knowledge Engineering 2009, 68: 728–747. 10.1016/j.datak.2009.01.001View ArticleGoogle Scholar
- Andrews NO, Fox EA: Recent developments in document clustering. Technicalreport: Published by Citeseer; 2007:1–25.Google Scholar
- Baghel R, Dhir R: A frequent concept based document clustering algorithm. International Journal of computer Applications 2010,4(5):0975–8887.View ArticleGoogle Scholar
- Banerjee S, Pedersen T: Adapted lesk algorithm for word sense disambiguation using WordNet. In Computational linguistics and intelligent text processing. London: Springer; 2002:136–145.View ArticleGoogle Scholar
- Bharathi G, Vengatesan D: Improving information retrieval using document clusters and semantic synonym extraction. Journal of Theoretical and Applied Information Technology 2012,36(2):167–173.Google Scholar
- Boris D, Milan V, Milos J, Kathrin K: An architecture for component-based design of representative-based clustering algorithms. Data Knowledge Engineering 2012, 75: 78–98.View ArticleGoogle Scholar
- Buckley C, Lewit AF: Optimizations of inverted vector searches, SIGIR’85. 1985, 97–110.Google Scholar
- Chehreghani MH, Abolhassani H: Improving density-based methods for hierarchical clustering of Web pages. Data & Knowledge Engineering 2008, 67: 30–50. 10.1016/j.datak.2008.06.006View ArticleGoogle Scholar
- Chen HL, Chuang KT, Chen MS: On data labeling for clustering categorical data. IEEE Transactions On Knowledge And Data Engineering 2008,20(11):1458–1472.View ArticleGoogle Scholar
- Chen CL, Tseng FSC, Liang T: An integration of WordNet and fuzzy association rule mining for multi-label document clustering. Science Direct Data & Knowledge Engineering 2010, 69: 1208–1226. 10.1016/j.datak.2010.08.003View ArticleGoogle Scholar
- Chou TC, Chen MC: Using incremental PLSI for threshold-resilient online event analysis. IEEE Transactions on Knowledge And Data Engineering 2008,20(3):289–299.View ArticleGoogle Scholar
- Cutting DR, Karger DR, Pedersen JO, Tukey JW: Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR ‘92. 1992, 318–329.Google Scholar
- Danushka B, Yutaka M, Ishizuka M: A Web search engine-based approach to measure semantic similarity between words. IEEE Transactions on Knowledge And Data Engineering 2011,23(7):977–990.View ArticleGoogle Scholar
- Ellouze N, Lammari N, Métaism E: CITOM: an incremental construction of multilingual topic maps. Data & Knowledge Engineering 2012, 74: 46–62.View ArticleGoogle Scholar
- Frakes WB, Fox CJ: Strength and Similarity of Affix Removal Stemming Algorithms. ACMSIGIR Forum; 2003:26–30.Google Scholar
- Gad WK, Kamel MS: Incremental clustering algorithm based on phrase- semantic similarity histogram. Proceedings of the Ninth International Conference on Machine Learning and Cybernetics 2010,11(14):2088–2093.Google Scholar
- Gavin S, Yue X: Enhancing an incremental clustering algorithm for Web page collections. IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies 2009, 81–84.Google Scholar
- Hammouda KM, Kamel MS: Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge And Data Engineering 2004,16(10):1279–1296. 10.1109/TKDE.2004.58View ArticleGoogle Scholar
- Harman D: How effective is suffixing. Journal of the American Society for Information Science 1991,42(1):7–15. 10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-PView ArticleGoogle Scholar
- Hu X, Zhang X, Lu C, Park EK, Zhou X: Exploiting Wikipedia as External Knowledge for Document Clustering. France: Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’09); 2009:389–396.Google Scholar
- Huang A Proceedings of the New Zealand Computer Science Research Student Conference (NZCRSC’08). Similarity Measures for Text Document Clustering 2008, 49–56.Google Scholar
- Jayabharathy J, Kanmani S, AyeshaaParveen A: Document clustering and topic discovery based on semantic similarity in scientific literature. 2nd International Conference on Data Storage and Data Engineering (DSDE 2011), 2 2011, 425–429.Google Scholar
- Kaiser F, Schwarz H, Jakob M Proceedings of Third International Conference on Digital Society , IEEE. Using Wikipedia-Based Conceptual Contexts to Calculate Document Similarity 2009, 322–327.Google Scholar
- Kowalski G, Maybury MT: Information Retrieval Systems – Theory and Implementation. II edition. Kluwer Academic Publishers; 2002. ebook ISBN: 0–306–47031–4Google Scholar
- Kumar N, Srinathan K Proceedings of the Advanced computing Conference , IEEE. A New Approach for Clustering Variable Length Documents 2009, 982–987.Google Scholar
- Lam W, Hwuang R: An active learning framework for semi-supervised document clustering with language modeling. Data & Knowledge Engineering 2009, 68: 49–67. 10.1016/j.datak.2008.08.008View ArticleGoogle Scholar
- Li F, Zhu Q: Document clustering in research literature based on NMF and testor theory. Journal of Software 2011,6(1):78–82.Google Scholar
- Li Y, Chung SM, Holt JD: Text document clustering based on frequent word meaning sequences. Journal on Data & Knowledge Engineering 2008,64(1):381–404. 10.1016/j.datak.2007.08.001View ArticleGoogle Scholar
- Ling C, Bhomwick SS, Wolfgang J: COWES: Web user clustering based on evolutionary Web sessions. Data & Knowledge Engineering 2009, 68: 867–885. 10.1016/j.datak.2009.05.002View ArticleGoogle Scholar
- Liu Y, Ouyang Y, Sheng H, Xiong Z: An Incremental Algorithm for Clustering Search Results, IEEE International Conference on Signal Image Technology and Internet Based Systems. 2008, 112–117.Google Scholar
- Lovins JB: Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 1968, 11: 22–31.Google Scholar
- Luo C, Li Y, Chung SM: Text document clustering based on neighbors. Journal of Data Knowledge and Engineering 2009,68(11):1271–1288. 10.1016/j.datak.2009.06.007View ArticleGoogle Scholar
- Miller GA: WordNet: a lexical database for English, communication. ACM 1995,38(11):39–41. 10.1145/219717.219748View ArticleGoogle Scholar
- Nadig R, Ramanand J, Bhattacharyya P: Automatic evaluation of WordNet synonyms and hypernyms. India: Proceedings of ICON-2008, 6th International Conference on Natural Language Processing; 2008.Google Scholar
- Ni X, Quan X, Wenyin L: Short text clustering by finding core terms. Journal of Knowledge and Information Systems,Springer Link 2010,27(3):345–365.View ArticleGoogle Scholar
- Pessiot JF, Kim YM, Amini MR, Gallinari P: Improving document clustering in a learned concept space. Journal of Information Processing and Management, Elsevier 2010, 26: 182–192.Google Scholar
- Porter MF: An algorithm for suffix stripping program. (1998,14(3):130–137.Google Scholar
- Prathima Y, Supreethi KP: A survey paper on concept based text clustering. International Journal of Research in IT & Management 2011,1(3):45–60.Google Scholar
- Rooney N, Patterson D, Galushka M, Dobrynin V: A scaleable document clustering approach for large document corpora. Information Processing and Management 2006, 42: 1163–1175. 10.1016/j.ipm.2005.10.003View ArticleGoogle Scholar
- Salton G, Buckley C: Term-weighting approaches in automatic text retrieval. Information Processing & Management 1998,24(5):513–523.View ArticleGoogle Scholar
- Sammut C, Webb G: Encyclopedia of machine learning: Springer reference. I edition. 2010. ISBN 978–0-387–34558–1View ArticleGoogle Scholar
- Shehata S: AWordNet-based Semantic Model for Enhancing Text Clustering. IEEE International Conference on Data Mining Workshops. 2009, 477–482. 6 Dec. 2009Google Scholar
- Shehata S: An efficient concept-based mining model for enhancing text clustering. Journal of Knowledge and Data Engineering 2010,22(10):1360–1371.View ArticleGoogle Scholar
- Shehata S, Fakhri K, Mohamed S S: An efficient concept-based mining model for enhancing text clustering. IEEE Transactions On Knowledge And Data Engineering 2010,22(10):1360–137.View ArticleGoogle Scholar
- Steinbach M, Karypis G, Kumar V: A Comparison of Document Clustering Techniques. International Conference on Data Mining: Knowledge Discovery and Data Mining (KDD) Workshop on Text Mining; 2000:1–2.Google Scholar
- Tang B, Shepherd M, Milios E, Heywood MI Proceedings of Canadian Conference on AI. Comparing and Combining Dimension Reduction Techniques for Efficient TextClustering 2005, 292–296.Google Scholar
- Van Rijsbergen CJ: Information Retrieval. Second edition. London: Buttersworth; 1989.Google Scholar
- Wang X, Tang J, Liu H: Document clustering via matrix representation. 11th IEEE International Conference on Data Mining ICDM 2011 2011, 804–813.View ArticleGoogle Scholar
- Xiaoke S, Yang L, Renxia W, Yuming Q Proceedings of the 2009 International Symposium on Information processing (ISIP’09 ). In A Fast Incremental Clustering algorithm. Academy Publisher; 2009:17–178.Google Scholar
- Yan J, Liu N, Yan S, Yang Q, Fan WP, Wei W, Chen Z: Trace-oriented feature analysis for large-scale text data dimension reduction. IEEE Transactions on Knowledge and Data Engineering 2011,23(7):1103–1117.View ArticleGoogle Scholar
- Zamir O, Etzioni O, Madani O, Karp RM Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97 ). Fast and Intuitive Clustering of Web Documents 1997, 287–290.Google Scholar
- Zaniolo C, Wang F: Temporal queries and version management in XML-based document archives. Dataand Knowledge Engineering 2008,65(04–324):2008.Google Scholar
- Zhang T, Member YY, Tang BF, Xiang Y: Document clustering in correlation similarity measure space. IEEE Transactions on Knowledge And Data Engineering 2011,24(6):1002–1013.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.