Article Title
Improving the Accuracy and Efficiency of the k-means Clustering Algorithm
Table of Contents
Introduction
The k-means Clustering Algorithm
Related Work
The Modified Method
Experimental Results
Conclusion
Excerpt from the Article
The Modified Method
In the improved clustering method, both phases of the original k-means algorithm are modified in order to improve accuracy and efficiency. In the first phase, the initial centroids are determined systematically so as to produce clusters with better accuracy. The second phase uses a clustering method that begins by forming initial clusters based on the relative distance of each data point from the initial centroids.
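The excerpt does not spell out how the initial centroids are chosen, so the two-phase structure can only be illustrated with a minimal sketch. The seeding heuristic below (`spread_seeds`, greedily picking points that are far apart) and the distance-based initial assignment are illustrative assumptions, not the paper's exact method:

```python
import math

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spread_seeds(data, k):
    """Phase 1 (illustrative): choose k initial centroids systematically
    rather than at random -- here, by greedily picking the point farthest
    from the seeds already chosen."""
    seeds = [data[0]]
    while len(seeds) < k:
        farthest = max(data, key=lambda p: min(euclidean(p, s) for s in seeds))
        seeds.append(farthest)
    return seeds

def initial_clusters(data, seeds):
    """Phase 2 (illustrative): form initial clusters by assigning each
    data point to the seed it is closest to."""
    clusters = {i: [] for i in range(len(seeds))}
    for p in data:
        nearest = min(range(len(seeds)), key=lambda i: euclidean(p, seeds[i]))
        clusters[nearest].append(p)
    return clusters
```

On a toy data set such as `[(0, 0), (0, 1), (10, 10), (10, 11)]`, `spread_seeds` with `k = 2` picks one point from each well-separated group, and `initial_clusters` then recovers the two groups as initial clusters.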
Keywords: Data Analysis, Clustering, k-means Algorithm, Enhanced k-means Algorithm
Abstract—Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm is computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.

Index Terms—Data Analysis, Clustering, k-means Algorithm, Enhanced k-means Algorithm

I. INTRODUCTION

Advances in scientific data collection methods have resulted in the large scale accumulation of promising data pertaining to diverse fields of science and technology. Owing to the development of novel techniques for generating and collecting data, the rate of growth of scientific databases has become tremendous. Hence it is practically impossible to extract useful information from them by using conventional database analysis techniques. Effective mining methods are absolutely essential to unearth implicit information from huge databases. Cluster analysis [6] is one of the major data analysis methods which is widely used for many practical applications in emerging areas like Bioinformatics [1, 3]. Clustering is the process of partitioning a given set of objects into disjoint clusters. This is done in such a way that objects in the same cluster are similar while objects belonging to different clusters differ considerably, with respect to their attributes. The k-means algorithm [6, 7, 8, 10, 11] is effective in producing clusters for many practical applications.
But the computational complexity of the original k-means algorithm is very high, especially for large data sets. Moreover, the algorithm produces different clusterings depending on the random choice of initial centroids. Several attempts have been made by researchers to improve the performance of the k-means clustering algorithm. This paper deals with a method for improving the accuracy and efficiency of the k-means algorithm.

(Manuscript received December 9, 2008. K. A. Abdul Nazeer is with the National Institute of Technology Calicut, Kozhikode, India - 673 601 (phone: 91-495-2286818; fax: 91-495-2287250; e-mail: nazeer@nitc.ac.in). M. P. Sebastian is with the National Institute of Technology Calicut, Kozhikode, India - 673 601 (e-mail: sebasmp@nitc.ac.in).)

II. THE K-MEANS CLUSTERING ALGORITHM

This section describes the original k-means clustering algorithm. The idea is to classify a given set of data into k disjoint clusters, where the value of k is fixed in advance. The algorithm consists of two separate phases: the first phase is to define k centroids, one for each cluster. The next phase is to take each point belonging to the given data set and associate it with the nearest centroid. Euclidean distance is generally used to determine the distance between data points and the centroids.

When all the points have been assigned to some cluster, the first step is completed and an early grouping is done. At this point we need to recalculate the new centroids, as the inclusion of new points may lead to a change in the cluster centroids. Once we find k new centroids, a new binding is created between the same data points and the nearest new centroid, generating a loop. As a result of this loop, the k centroids may change their positions step by step. Eventually, a situation is reached where the centroids do not move anymore. This is the convergence criterion for the clustering.
Pseudocode for the k-means clustering algorithm is listed as Algorithm 1 [7].

Algorithm 1: The k-means clustering algorithm
Input:  D = {d1, d2, ..., dn}  // set of n data items
        k                      // number of desired clusters
Output: A set of k clusters.
Steps:
1. Arbitrarily choose k data items from D as initial centroids;
2. Repeat
       Assign each item di to the cluster which has the closest centroid;
       Calculate the new mean for each cluster;
   Until the convergence criterion is met.
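Algorithm 1 translates directly into a minimal runnable sketch in Python (function and variable names are ours; the empty-cluster fallback is an implementation choice the pseudocode leaves open):

```python
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means following Algorithm 1: arbitrary initial centroids,
    then alternate assignment and mean-update until centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)           # Step 1: arbitrary initial centroids
    for _ in range(max_iter):                 # Step 2: Repeat ...
        # Assign each item to the cluster whose centroid is closest.
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Calculate the new mean for each cluster (keep old centroid if empty).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:        # ... Until centroids stop moving
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the loop converges in a few iterations to the group means regardless of which points are drawn as initial centroids, although on harder data the random initialization can change the result, which is exactly the sensitivity the paper sets out to fix.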