An Improvement on Initial Centre Points of K-Means Clustering Algorithm

Akinwale Adio, Alabi Orobosade and Onafowora Temitope

Department of Computer Science, Federal University of Agriculture, Abeokuta, Nigeria

DOI: https://doi.org/10.51244/IJRSI.2025.12040155

Received: 21 April 2025; Accepted: 25 April 2025; Published: 24 May 2025

ABSTRACT

A large volume of new data is generated every moment of the day across domains such as social media, bank transactions, and online websites. One of the vital means of dealing with this data is to classify it into a set of clusters. Many clustering algorithms have been proposed and improved to cope with large-scale, high-dimensional data; however, they have not achieved sufficiently good results and do not generalize well. This paper improves the k-means algorithm, whose results are sensitive to the initial clustering centres and whose clustering results differ under different similarity measurement methods. The incorporation of in-built polygon sizes with their respective fuzzy linguistic hedge satisfactions for the classification of data points made clusters distinct from each other. To verify the effectiveness of the proposed methods, two data sets were used as experimental data sets, and the method produced a good clustering solution with low processing times.

Keywords: k-means, inertia method, inter-cluster, fuzzy linguistic, polygon

INTRODUCTION

Clustering is a type of unsupervised learning in which data points are grouped into different sets based on their degree of similarity, measured for example by Euclidean distance, correlation-based distance, Pearson correlation distance, or Spearman correlation distance. There are various types of clustering: hierarchical clustering (agglomerative clustering with a bottom-up approach and divisive clustering with a top-down approach, for example Clustering Using Representatives and Balanced Iterative Reducing and Clustering) and partitioning clustering (for example k-means clustering, Clustering Large Applications based upon Randomized Search, fuzzy c-means clustering, and mean shift clustering). Other types are density-based clustering, such as Density-Based Spatial Clustering of Applications with Noise and Ordering Points To Identify the Clustering Structure, and grid-based clustering, such as Statistical Information Grid and Clustering In Quest.

Clustering has been applied in various sectors, such as market segmentation, where customers are grouped in terms of behaviour, attributes, or similar products; image segmentation or compression in terms of similar regions or colour separation; and document clustering based on topics or contents. Other areas include data summarization (such as image processing and vector quantization), detection of similar trends in dynamic data, social network analysis of sequences generated from images, videos, or audio, and biological data analysis. Clustering can also be used for anomaly detection (fraud or intrusion) and for recommender systems suggesting products or services based on past purchases or behaviour. One of the properties of a cluster is that the points within it should be similar to each other. This paper aims to minimize the distance between the points within a cluster, focusing on the k-means clustering method.

K-means is fast, robust, and easy to understand. It works by iteratively assigning data points to the nearest cluster centroid and updating the centroids until they stabilize. It is easy to implement, can work with a large number of variables, and can quickly change cluster assignments when the centroids are recomputed. The algorithm requires a prior specification of the number of cluster centres. The random initialization step makes the k-means algorithm non-deterministic, meaning that the cluster assignment can vary if the same algorithm is run twice on the same dataset. The algorithm might also create clusters that do not match the actual data distribution, with some clusters being too small or too large compared with others, so that the result does not accurately reflect the true structure of the data. A further weakness of k-means clustering is that the number of clusters is not known from simply running the model; a range of values must be tested to decide on the best value of k. The k value in the k-means algorithm is a crucial parameter that determines the number of clusters to be formed: a small value can result in under-clustered data, while a large value can cause over-clustering. In addition, the processing of abnormal data and the choice of similarity calculation method affect the clustering division.

The k-means++ algorithm, an improved k-means, ensures a smarter initialization of the centroids and improves the quality of the clustering. The improved algorithm minimizes the chance of picking centres that are close to one another while avoiding the systematic pitfall of the farthest-first traversal method. The inertia method ensures that the first property, that all the data points in a cluster should be similar to each other, is satisfied. Calculating intra-cluster and inter-cluster distances ensures the second property, that data points from different clusters should be as different as possible, so that the clusters are more meaningful. The time complexity of the k-means algorithm is O(NTK), where N is the total number of data points, K is the number of partitions, and T is the number of iterations of the relatively expensive clustering process. This paper improves the initial centre points of the k-means clustering algorithm so that the clustering result has low fluctuation.
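As a concrete illustration of testing a range of k values against the inertia criterion mentioned above, the short sketch below (our own addition, assuming scikit-learn and NumPy are available and that X holds the data points) prints the inertia for each candidate k so the elbow can be inspected:

```python
# Minimal sketch of the elbow/inertia check described above.
# Assumes scikit-learn is available and X is an (n_samples, n_features) array.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # placeholder data; replace with a real dataset

inertias = {}
for k in range(2, 11):                 # test a range of candidate k values
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X)
    inertias[k] = model.inertia_       # sum of squared distances to the closest centroid

for k, inertia in inertias.items():
    print(f"k={k:2d}  inertia={inertia:.2f}")
```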

LITERATURE REVIEW

Many clustering techniques have been developed in different application fields, but they have sometimes failed to achieve good results because datasets contain different data types. Although researchers have proposed variants of the k-means algorithm to overcome the impediments of centroid assignment, choice of the number of clusters, and the handling of various data types, these variants are domain-specific and do not generalize well.

[1] defined the concepts of neighbourhood coupling and separation of objects by using the upper and lower approximations of an object neighbourhood and a neighbourhood model. They proposed an initial clustering centre selection algorithm, which showed significant improvement. [2] proposed an improvement of the k-means++ algorithm to improve the way the initial centroids were selected: the first centroid was selected at random, and each remaining centroid was picked with probability proportional to its shortest distance from all the existing centroids. [3] proposed a clustering algorithm based on maximum and minimum distance and weighted similarity calculation (MW-k-means). In MW-k-means, the initial cluster centres are selected according to the maximum-minimum distance idea, and weighted similarity is then used to assign each data point to the corresponding cluster. This method prevents the clustering seeds from being close to each other in the initial value selection of the k-means algorithm. MW-k-means integrates the advantages of weighted k-means (w-k-means) and maximum k-means (m-k-means), with faster convergence, more accurate clustering results, and higher stability, and can be used for practical clustering.

[4] described the k-means algorithm for separating data and objects into different clusters, noting that objects with similar properties are put in one cluster and objects with different properties are put in separate clusters. This allows large datasets to be scaled easily with the k-means clustering technique, which works in an unsupervised manner, gives a strong guarantee of convergence, and generalizes the clusters to various shapes and structures. [5] compared many k-means algorithms in a series of experiments; the results showed that the method was reasonable for the selection of the centre point and that the similarity calculation produced a better clustering effect. [6] proposed initial clusters based on the relative distance of each data point from the initial centroids; these clusters were subsequently fine-tuned by using a heuristic approach.

[7] worked on variants for solving the initialization problem of the k-means algorithm. [8] proposed a simple and efficient clustering algorithm to solve the problem of the cluster centre points not being well determined; the modified algorithm was built on a kd-tree data structure for the data points and could avoid getting trapped in a local optimum. [9] presented a modified k-means algorithm based on self-paced learning theory, in which self-paced learning was used to select a competing training subset that helps the k-means algorithm build an initial cluster model. The generalization ability of the algorithm was then improved by successively adding training subsets found by self-paced learning until the model reached optimal performance, and its performance was demonstrated on several real datasets.

[10] proposed a co-clustering algorithm for dealing with sparse, high-dimensional data that outperformed the basic k-means. [11] incorporated cuckoo search into the k-means algorithm; the cuckoo search algorithm was modified by changing the search step-size factor, which helped it reach a near-optimal solution robustly. [12] proposed an initial estimation method for computing the covariance matrix for the k-means algorithm using the Mahalanobis distance. It involved finding a group of points composed of high-density neighbours that represent the centroids of the selected clusters; these provided an approximate estimate of the covariance matrix, which was then updated successively using the proposed method. [13] analysed the stability of the k-means clustering algorithm in a more practical scenario in which the number of clusters was chosen by stability-based model selection.

[14] reported a k-means algorithm that generated the initial centroids from pair-wise computed distances, grouping all closer points based on a threshold value and the minimum number of points required to build a cluster. [15] worked on a k-means algorithm that sorts all the dataset records into a table and divides the sorted table into as many sub-tables as the number of desired initial centroids. [16] presented a k-means algorithm that detects centroids that will no longer change while the algorithm is still running. [17] proposed a novel k-means clustering algorithm that reformulates the classical k-means objective function as a trace maximization problem and then replaces it with a new formulation. The proposed algorithm does not need to calculate the cluster centres in each iteration and requires fewer additional intermediate variables during the optimization process. In addition, the authors proposed an efficient iteratively re-weighted algorithm to solve the resulting optimization problem and provided the corresponding convergence analysis. The proposed algorithm keeps the same computational complexity as Lloyd's algorithm, O(ndk), but shows a faster convergence rate in experiments. Extensive experimental results on real-world benchmark datasets showed the effectiveness and efficiency of the proposed algorithm.

[18] presented a process that utilizes the farthest first and canopy algorithms to reduce the number of iterations in the k-means clustering. This improvement focused on enhancing the centroid initialization method for each cluster. The research study utilized ten datasets from the UCI machine learning repository to test the effectiveness of improving the k-means clustering algorithm. It compared the number of iterations and the sum of squares error as performance metrics. The results showed that the farthest first and canopy algorithms significantly reduced the number of iterations in the k-means clustering when compared to random centroid initialization. These research findings confirmed the hypothesis that the method of centroid initialization for each cluster impacted the iteration process of k-means clustering.

Due to their popularity and ease of use, k-means clustering methods have been used in conjunction with deep learning for tasks such as image segmentation and handwriting recognition. A more recent work combined a fully connected deep convolutional neural network with k-means to perform pixel matching between a segmented image and a convolved image. [19] proposed a k-means clustering algorithm based on artificial fish swarm optimization, which can obtain the globally optimal division. This paper aims to present an improvement of the initial values of the k-means algorithm that yields good-quality clustering results with low processing time, given that the optimal choice of k in k-means is otherwise subjective and depends on interpreting the underlying structure of the dataset.

TYPES OF DISTANCE MEASURES

Several studies have suggested a wide range of distance metrics for data clustering. This section briefly describes seven widely used distance measures.

Euclidean Distance Measure

The Euclidean distance between the points x and y in a two-dimensional plane is given by:

d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}

In the case of an extension to n dimensions, where the points x and y are of the form x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n), the Euclidean distance is given as follows:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Manhattan distance

The Manhattan distance, also called taxicab distance or city block distance, is another popular distance metric. The Manhattan distance between the points x and y is given by:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

Though the Manhattan distance does not give the shortest distance between any two given points, it is often preferred in applications where the feature points are located in a high-dimensional space.

Minkowski distance

The Minkowski distance is named after the German mathematician Hermann Minkowski and is defined in a normed vector space by:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

Note that for p = 1 the Minkowski distance takes the same form as the Manhattan distance, and for p = 2 it is equivalent to the Euclidean distance.

Squared Euclidean distance

The squared Euclidean distance between the points p and q in a two-dimensional plane is given by:

d(p, q) = (p_1 - q_1)^2 + (p_2 - q_2)^2

In the case of an extension to n dimensions for the points p and q, the squared Euclidean distance is given as follows:

d(p, q) = \sum_{i=1}^{n} (p_i - q_i)^2

Eisen cosine correlation distance

The Eisen cosine correlation distance is given as:

d(x, y) = 1 - \frac{\left| \sum_{i=1}^{n} x_i y_i \right|}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}

Mahalanobis distance

The Mahalanobis distance is the distance between two points in multivariate space. For uncorrelated variables, the Euclidean distance equals the Mahalanobis distance. The Mahalanobis distance quantifies the relative distance from the centroid, where the centroid is the point in multivariate space at which the means of all variables intersect. The Mahalanobis distance between two objects is formally defined as:

d(x_A, x_B) = \sqrt{(x_A - x_B)^T C^{-1} (x_A - x_B)}

where x_A and x_B are a pair of objects, C is the sample covariance matrix, and T denotes the transpose operation, which flips a matrix over its diagonal. A different version of the formula uses distances from each observation to the central mean, and the equation is given as follows:

d(x) = \sqrt{(x - \bar{x})^T C^{-1} (x - \bar{x})}

where x is an object vector and \bar{x} is the arithmetic mean vector.

Chebyshev distance

The Chebyshev distance, also known as the maximum value distance, examines the absolute magnitude of the differences between the coordinates of a pair of objects and is given as follows:

d(x, y) = \max_{i} |x_i - y_i|
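For illustration, the distance measures described in this section can be computed as in the sketch below; this is our own aid to the reader, assuming NumPy and SciPy are available (scipy.spatial.distance provides most of these metrics directly), and is not part of the original paper.

```python
# Illustrative computation of the distance measures described in this section.
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 6.0])
y = np.array([1.0, 7.0, 3.0])

print("Euclidean:        ", distance.euclidean(x, y))
print("Manhattan:        ", distance.cityblock(x, y))
print("Minkowski (p=3):  ", distance.minkowski(x, y, p=3))
print("Squared Euclidean:", distance.sqeuclidean(x, y))
print("Chebyshev:        ", distance.chebyshev(x, y))

# Eisen cosine correlation distance: 1 - |x.y| / (||x|| * ||y||)
eisen = 1.0 - abs(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))
print("Eisen cosine:     ", eisen)

# Mahalanobis distance needs the inverse covariance matrix of a sample.
sample = np.vstack([x, y, [3.0, 5.0, 4.0], [6.0, 2.0, 8.0]])
VI = np.linalg.inv(np.cov(sample.T))
print("Mahalanobis:      ", distance.mahalanobis(x, y, VI))
```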

Properties of distance measure

i. Non-negativity: distances are positive.

ii. Identity: the distance from a point to itself is zero.

iii. Symmetry: if x and y are two points in a metric space, then the distance between x and y should be equal to the distance between y and x.

iv. Triangle inequality: given three points x, y, and z, the distance metric should satisfy d(x, z) <= d(x, y) + d(y, z).

K-means and k-means++ algorithms

  1. randomly select 'c' cluster centres
  2. calculate the distance between each data point and every cluster centre
  3. assign each data point to the cluster centre whose distance from it is the minimum over all the cluster centres
  4. recalculate the new cluster centres as the means of the data points assigned to them, and recompute the distance between each data point and the newly obtained cluster centres
  5. if no data point was reassigned, then stop; otherwise, repeat from step 3

The k-means algorithm aims at minimizing an objective function known as the squared error function, given by

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2

where C_j is the set of data points assigned to cluster j and c_j is the centre of cluster j.
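For concreteness, a minimal NumPy sketch of this assignment-and-update loop is given below. It is our own illustration rather than code from the paper, and it assumes the data points are rows of an array X.

```python
# Minimal k-means loop following the steps above (illustrative only).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k cluster centres from the data points.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Steps 2-3: assign each point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop if no point changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centre as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    inertia = ((X - centres[labels]) ** 2).sum()   # squared error objective J
    return centres, labels, inertia

X = np.random.default_rng(1).normal(size=(200, 2))
centres, labels, J = kmeans(X, k=3)
print(centres, J)
```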

METHODOLOGY

Selection of k in k-means or k-means++

Steps 1 and 2 of k-means or k-means++ are about choosing the number of clusters (k). These two steps are modified using the sizes of polygons as centroids and fuzzy linguistic hedges for clustering satisfactions. Table 1 illustrates the procedure for the sizes of the polygon and fuzzy linguistic hedges.

Table 1: Sizes of the polygon and fuzzy linguistic hedges

Polygon size       | Vertices | Centroids/clusters | Fuzzy linguistic hedges
Straight line (2)  | 2        | 2                  | high, low
Triangle (3)       | 3        | 3                  | high, average, low
Quadrilateral (4)  | 4        | 4                  | excellent, good, fair, poor
Pentagon (5)       | 5        | 5                  | very high, high, average, low, very low
...                | ...      | ...                | ...
Decagon (10)       | 10       | 10                 | extremely good, very very good, very good, good, more or less good, average, more or less low, low, very low, extremely low

  1. Read the variables and visualize the points on the x and y axes.
  2. The first cluster centre, 'a', is chosen uniformly as the first data point on the x and y axes.
  3. Using this first point, the intra-cluster distance is computed for grouping all points together, while monitoring the distance d(x) of each data point from the centre chosen uniformly.
  4. The second cluster centre, 'b', is chosen by computing the squared distance (d(x))^2 and taking the point farthest from the first centre 'a'. The first and second centres form a straight line with two vertices, i.e. two centroids. The inter-cluster distance is computed for the first and second clusters by applying fuzzy linguistic hedges.
  5. The third cluster centre, 'c', is chosen by computing the squared distance (d(x))^2 and taking the point farthest from the second centre 'b'. The first, second, and third centres form a triangle with three vertices, i.e. three centroids. The inter-cluster distance is computed for the first, second, and third clusters by applying fuzzy linguistic hedges.
  6. The fourth cluster centre, 'd', is chosen by computing the squared distance (d(x))^2 and taking the point farthest from the third centre 'c'. The first, second, third, and fourth centres form a quadrilateral with four vertices, i.e. four centroids. The inter-cluster distance is computed for the first, second, third, and fourth clusters by applying fuzzy linguistic hedges.
  7. The fifth cluster centre, 'e', is chosen by computing the squared distance (d(x))^2 and taking the point farthest from the fourth centre 'd'. The first, second, third, fourth, and fifth centres form a pentagon with five vertices, i.e. five centroids. The inter-cluster distance is computed for the first to fifth clusters by applying fuzzy linguistic hedges. This process continues for the other polygon sizes (hexagon, heptagon, octagon, nonagon, decagon, hendecagon, etc.) with their respective fuzzy linguistic hedges, as illustrated in the sketch following this list.
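A minimal sketch of this farthest-point selection of the polygon vertices, as we read steps 1 to 7 (our own illustration, assuming two-dimensional data in an array X; not the authors' code), is:

```python
# Illustrative sketch of the farthest-point centre selection described above
# (our reading of steps 1-7, not the authors' exact code). X is an (n, 2) array.
import numpy as np

def polygon_centres(X, polygon_size):
    """Pick `polygon_size` initial centres: the first data point, then
    repeatedly the point with the largest squared distance from the
    most recently chosen centre."""
    centres = [X[0]]                                   # step 2: first point as centre 'a'
    for _ in range(polygon_size - 1):
        d2 = np.sum((X - centres[-1]) ** 2, axis=1)    # squared distance (d(x))^2
        centres.append(X[np.argmax(d2)])               # farthest point becomes the next vertex
    return np.array(centres)

X = np.random.default_rng(2).uniform(0, 10, size=(500, 2))
print(polygon_centres(X, polygon_size=5))              # five centroids -> pentagon vertices
```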

Fuzzy linguistic hedges

Let the centroid be the universe of discourse, C = [0, 10]. Then the fuzzy sets 'high' and 'low' may subjectively be defined through membership functions \mu_{high}(c) and \mu_{low}(c) over C. These membership functions translate into high and low centroids of certain data points.

These two fuzzy linguistic hedges, high and low, can be extended to three (low, average, and high) centroids of other data points. The same holds for four fuzzy linguistic hedges (poor, fair, good, and excellent) as centroids of data points after visualization. For instance, fuzzy linguistic hedges for five centroids can be constructed as matching techniques. Let F_i and F_j be two different fuzzy points represented by fuzzy sets on the universe of discourse C, and let VH (very high), H (high), A (average), L (low), and VL (very low) be standard fuzzy sets on the universe of discourse C.

For example, a company with a variable income could have its data points grouped into five sets: very high income, high income, average income, low income, and very low income. This applies likewise to polygon sizes six (hexagon), seven (heptagon), eight (octagon), nine (nonagon), and so on. The satisfaction levels of the fuzzy linguistic hedges for a polygon of size ten (decagon) are illustrated in Table 2.

Table 2: Satisfaction level of fuzzy linguistic hedges for decagon

Fuzzy linguistic hedge    | Degree of clustering
extremely good income     | 90% - 100% (0.90 - 1.00)
very good income          | 80% - 89%  (0.80 - 0.89)
good income               | 70% - 79%  (0.70 - 0.79)
more or less good income  | 60% - 69%  (0.60 - 0.69)
average income            | 50% - 59%  (0.50 - 0.59)
more or less low income   | 40% - 49%  (0.40 - 0.49)
low income                | 30% - 39%  (0.30 - 0.39)
very low income           | 20% - 29%  (0.20 - 0.29)
very very low income      | 10% - 19%  (0.10 - 0.19)
extremely low income      | 0% - 9%    (0.00 - 0.09)

This translates into ten groups of data points for a variable in visualization mode and can be applied to other polygon sizes: eleven (hendecagon), and so on.
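As an aid to reading Table 2, the sketch below (our own illustration, not the authors' code) bins a normalized satisfaction degree in [0, 1] into the ten decagon hedges:

```python
# Map a normalized satisfaction degree in [0, 1] to the decagon hedges of Table 2.
DECAGON_HEDGES = [
    "extremely low", "very very low", "very low", "low", "more or less low",
    "average", "more or less good", "good", "very good", "extremely good",
]

def decagon_hedge(degree: float) -> str:
    """Return the fuzzy linguistic hedge whose band contains `degree`."""
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must lie in [0, 1]")
    index = min(int(degree * 10), 9)   # 0.90-1.00 all map to the top band
    return DECAGON_HEDGES[index]

print(decagon_hedge(0.93))   # extremely good
print(decagon_hedge(0.42))   # more or less low
```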

After all data points have been assigned to clusters based on the fuzzy linguistic hedges using degrees of satisfaction, recalculate the centroids of the clusters by taking weighted means, variation, co-variation, and standard deviation of all data points assigned to each cluster.

Repeat the assignment and centroid-update steps of the k-means or k-means++ algorithm above until convergence occurs, that is, when the centroids no longer change significantly. Once convergence is reached, the algorithm produces the final cluster centroids. The polygon sizes with their fuzzy linguistic hedges can be computed as built-in functions and called upon when the k-means clustering algorithm is applied. The time complexity of the k-means algorithm, O(NTK), where N is the total number of data points, K is the number of partitions, and T is the number of iterations of the clustering process, is thereby reduced substantially, since the selection of k becomes an O(1) call to the pre-computed built-in functions.

COLLECTION OF DATASETS

The datasets used for this research work are a health analytics dataset downloaded from https://www.kaggle.com/jinalgada/starter-health-analytics-b8146c6b-2 and a rain in Australia dataset fetched from https://www.kaggle.com/jsphyg/weather-dataset-rattle-package. The health analytics dataset has 284 rows and 644 columns of data, and the rain in Australia dataset has 145461 rows and 23 columns of data.
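For reference, once the two Kaggle archives have been downloaded locally they can be loaded as sketched below; the CSV file names are our assumptions and may differ from the actual downloads.

```python
# Loading the two experimental datasets (file names are assumed; adjust to the downloads).
import pandas as pd

health = pd.read_csv("health_analytics.csv")     # hypothetical file name for the health analytics data
rain = pd.read_csv("weatherAUS.csv")             # file name commonly used in the rain-in-Australia package

print(health.shape)   # expected (284, 644) according to the paper
print(rain.shape)     # expected (145461, 23) according to the paper
```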

Centering

Centering, as a data preprocessing method, involves moving the coordinate system of the dataset to a new reference point, usually the origin of the coordinate system in a k-dimensional space. Proper centering removes the postulated offsets without changing the structural model of the data, and once it is performed, interval-scale data behave as though they were ratio-scale data. The weighted median centre, which allows a weight field (such as the number of trips associated with each feature) to be specified, is employed; this makes it possible to compute the Euclidean distance to all weighted features in the dataset. The median centre returns only a single point, although there may be more than one location (solution) that would minimize the distance to all features.
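The paper does not give the exact formula used for the weighted median centre; as a rough illustration only, such a centre can be approximated with a weighted Weiszfeld-style iteration, as sketched below for two-dimensional points X with weights w (our assumption of the computation, not the authors' method).

```python
# Rough sketch: weighted median centre via a Weiszfeld-style iteration.
# This is our assumption of how such a centre can be computed, not the paper's exact method.
import numpy as np

def weighted_median_centre(X, w, n_iter=200, eps=1e-9):
    centre = np.average(X, axis=0, weights=w)            # start from the weighted mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - centre, axis=1)
        d = np.maximum(d, eps)                            # avoid division by zero
        coef = w / d
        new_centre = (coef[:, None] * X).sum(axis=0) / coef.sum()
        if np.linalg.norm(new_centre - centre) < eps:
            break
        centre = new_centre
    return centre

X = np.random.default_rng(3).uniform(0, 10, size=(100, 2))
w = np.random.default_rng(4).uniform(1, 5, size=100)     # e.g. number of trips per feature
print(weighted_median_centre(X, w))
```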

Principal Component Analysis

Principal Component Analysis (PCA) as employed here comprises five stages:

  1. Subtract the mean of each variable.

Before the PCA, standardization is performed to transform all the variables to the same scale:

z = \frac{x - \mu}{\sigma}

where \mu and \sigma are the mean and standard deviation of each variable.

  2. Calculate the covariance matrix.

The covariance matrix is computed to remove redundant information and to capture how all possible pairs of variables are correlated. For example, a four-dimensional data set of variables x, y, z, and w yields a 4 x 4 covariance matrix.

  3. Compute the eigenvalues and eigenvectors.

Eigenvectors and eigenvalues are the linear algebra concepts used to determine the principal components of the data. Every eigenvector has an associated eigenvalue, so they come in pairs. The eigenvalues are simply the coefficients attached to the eigenvectors and give the amount of variance carried by each principal component.

  4. Sort the eigenvectors by their corresponding eigenvalues in descending order and select a subset from the rearranged eigenvalue matrix.

  5. Recast the data along the principal components.

The final data set is obtained by multiplying the transpose of the original data set by the transpose of the feature vector:

FinalData = FeatureVector^T \times OriginalData^T

Computing the eigenvectors and ordering them by their eigenvalues in descending order allows the principal components to be found in order of significance.
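A compact NumPy sketch of these five stages, run on synthetic data (our own illustration, assuming the observations are rows of X), is:

```python
# PCA in five stages with NumPy (illustrative).
import numpy as np

X = np.random.default_rng(5).normal(size=(100, 4))      # 100 observations, 4 variables

# 1. Subtract the mean (and standardize to unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized variables.
C = np.cov(Z, rowvar=False)

# 3. Eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)                     # eigh: C is symmetric

# 4. Sort by eigenvalue in descending order and keep the top components.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]                                       # feature vector (loadings)

# 5. Recast the data along the principal components.
scores = Z @ W                                           # (100, k) projected data
print("explained variance ratio:", eigvals[:k] / eigvals.sum())
```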

Outlier removal

The interquartile range (IQR), a measure of statistical dispersion representing the middle 50% of a dataset, is used. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR method identifies outliers by setting boundaries based on Q1 and Q3: Lower Bound = Q1 - (1.5 * IQR) and Upper Bound = Q3 + (1.5 * IQR). Any data point outside these bounds is considered an outlier.
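A short sketch of this rule (our own, applying NumPy to a one-dimensional sample) is:

```python
# IQR-based outlier removal for a one-dimensional sample (illustrative).
import numpy as np

values = np.array([3.1, 2.9, 3.4, 3.0, 2.8, 11.7, 3.2, 2.7, -4.5, 3.3])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

mask = (values >= lower) & (values <= upper)
print("outliers:", values[~mask])      # points outside [lower, upper]
print("cleaned :", values[mask])
```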

Applications of weighted Euclidean distances

Standardized Euclidean distance

The standardized Euclidean distance between two J-dimensional vectors x and y can be written as

d(x, y) = \sqrt{\sum_{j=1}^{J} \frac{(x_j - y_j)^2}{s_j^2}}

where s_j is the standard deviation of the j-th variable. Standardization produces a compensatory effect on the weight of each variable; in this work, weights based on the means, variance, covariance, and standard deviation of the variables are attached.
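As an illustration (ours, not the paper's), SciPy's seuclidean computes this distance directly given the per-variable variances, and an equivalent weighted form divides each squared difference by a chosen weight:

```python
# Standardized / weighted Euclidean distance (illustrative).
import numpy as np
from scipy.spatial import distance

data = np.random.default_rng(6).normal(loc=[0, 5, 50], scale=[1, 2, 10], size=(200, 3))
u, v = data[0], data[1]

variances = data.var(axis=0, ddof=1)                   # per-variable variance
print("standardized Euclidean:", distance.seuclidean(u, v, variances))

# Equivalent weighted form: scale each squared difference by 1/variance.
weights = 1.0 / variances
print("weighted Euclidean:    ", np.sqrt(np.sum(weights * (u - v) ** 2)))
```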

EXPERIMENT AND RESULTS

To verify the effectiveness of the sizes of polygons with their corresponding fuzzy linguistic hedge satisfactions, the two data sets described in the Collection of Datasets section were used as experimental data sets alongside the following methods:

  1. Euclidean distance with regular selection of the value of k
  2. Euclidean distance with random selection of the value of k
  3. Minkowski distance measure
  4. Weighted Euclidean distance using standard deviation
  5. Weighted Euclidean distance using means
  6. Weighted Euclidean distance using variance
  7. Weighted Euclidean distance using co-variance

Euclidean distance works well as long as the dimensions are equally weighted and independent of each other. This limitation can be overcome by scaling the variables, for example with the standard deviation or covariance, so that each variable varies within a particular range between 0 and 1.

The experiments were performed many times following the modification of the initialization of the k value in k-means using the sizes of polygons with their fuzzy linguistic hedges. The functions for the sizes of polygons with their corresponding fuzzy linguistic hedges were computed and invoked to visualize the fit of the data points. The experimental results of runs one, two, and three on the health analytics dataset, for each method, are shown in Table 3, together with the average clustering processing time for each method. The same process was carried out with the Australia rain dataset, as illustrated in Table 4. The fastest clustering time for each method on the health analytics dataset is shown in Table 5. Information retrieval times using both the health analytics and Australia rain datasets with various search terms were also measured; the results for the two datasets and their methods are shown in Tables 6 and 7, and Table 8 ranks the methods by the retrieval time of the search item on the Australia rain dataset. As Tables 5 and 8 show, weighted Euclidean distance using standard deviation gave the fastest clustering time, followed by weighted Euclidean distance using covariance, weighted Euclidean distance using variance, the Minkowski distance measure, weighted Euclidean distance using means, Euclidean distance with random selection of the value of k, and Euclidean distance with regular selection of the value of k. For retrieval, the Minkowski distance measure gave the fastest time, followed by Euclidean distance with regular selection of the value of k, weighted Euclidean distance using covariance, Euclidean distance with random selection of the value of k, weighted Euclidean distance using variance, weighted Euclidean distance using standard deviation, and weighted Euclidean distance using means.
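The paper does not list its benchmarking code; a minimal timing harness in the same spirit is sketched below, with a hypothetical cluster_with(metric, X) placeholder standing in for running k-means under each distance configuration (our own sketch, not the authors' harness).

```python
# Hypothetical timing harness in the spirit of the experiments (not the authors' code).
import time
import numpy as np

def cluster_with(metric, X):
    """Placeholder for running k-means under one of the distance configurations."""
    # ... clustering under `metric` would go here ...
    _ = np.linalg.norm(X, axis=1)          # stand-in work so the loop runs

methods = [
    "euclidean-regular-k", "euclidean-random-k", "minkowski",
    "weighted-std", "weighted-mean", "weighted-variance", "weighted-covariance",
]
X = np.random.default_rng(7).normal(size=(10000, 10))

for metric in methods:
    runs = []
    for _ in range(4):                      # four timed runs per method, as in Tables 3-4
        start = time.perf_counter()
        cluster_with(metric, X)
        runs.append((time.perf_counter() - start) * 1000)
    print(f"{metric:25s} average {np.mean(runs):7.2f} ms")
```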

The results revealed that data clustered with the Minkowski distance in place of the Euclidean distance returned the searched item in the least time.

The results in each of the scenarios also reveal that clustering performed with the weighted Euclidean distance using standard deviation consistently took the least time, making it the fastest method of getting the experimental data sets clustered with the k-means clustering algorithm.

Table 3: Clustering time (ms) with the "Health Analytics" dataset

Run | Euclidean (regular selection) | Euclidean (random selection) | Minkowski | Weighted (standard deviation) | Weighted (means) | Weighted (variance) | Weighted (co-variance)
1st run | 676 | 513 | 509 | 503 | 508 | 509 | 504
| 503 | 505 | 503 | 503 | 503 | 503 | 504
| 502 | 501 | 502 | 502 | 503 | 501 | 502
| 504 | 501 | 502 | 502 | 502 | 502 | 501
Average | 546.25 | 505 | 504 | 502.5 | 504 | 503.75 | 502.75
2nd run | 553 | 504 | 504 | 502 | 507 | 504 | 502
| 501 | 502 | 502 | 501 | 501 | 501 | 505
| 501 | 502 | 503 | 502 | 501 | 502 | 502
| 501 | 501 | 502 | 501 | 502 | 502 | 502
Average | 514 | 502.25 | 502.75 | 501.5 | 502.75 | 502.25 | 502.75
3rd run | 516 | 505 | 504 | 501 | 506 | 505 | 501
| 504 | 502 | 501 | 502 | 502 | 502 | 502
| 502 | 502 | 502 | 502 | 501 | 501 | 502
| 503 | 501 | 501 | 501 | 501 | 501 | 501
Average | 506.25 | 502.5 | 502 | 501.5 | 502.5 | 502.25 | 501.25
Total average | 544.58 | 506 | 507.6 | 503.4 | 506 | 505.9 | 505.5

Table 4: Clustering time (ms) with the "Rain in Australia" dataset

Run | Euclidean (regular selection) | Euclidean (random selection) | Minkowski | Weighted (standard deviation) | Weighted (means) | Weighted (variance) | Weighted (co-variance)
1st run | 900 | 538 | 510 | 503 | 509 | 510 | 521
| 503 | 503 | 507 | 503 | 506 | 503 | 505
| 503 | 503 | 508 | 503 | 503 | 503 | 503
| 511 | 504 | 505 | 503 | 503 | 503 | 503
Average | 604.25 | 512 | 507.5 | 503 | 505.25 | 504.75 | 508.75
2nd run | 566 | 509 | 510 | 505 | 514 | 513 | 504
| 505 | 506 | 503 | 503 | 505 | 503 | 504
| 503 | 503 | 505 | 505 | 503 | 505 | 503
| 504 | 505 | 503 | 503 | 504 | 505 | 504
Average | 519.5 | 505.75 | 505.25 | 504 | 506.5 | 506.5 | 503.75
3rd run | 531 | 509 | 508 | 504 | 514 | 515 | 503
| 503 | 506 | 503 | 503 | 503 | 504 | 505
| 503 | 503 | 506 | 503 | 503 | 504 | 504
| 503 | 503 | 504 | 503 | 506 | 503 | 504
Average | 510 | 505.25 | 505.25 | 503.25 | 506.5 | 506.5 | 504
Total average | 522.17 | 502.9 | 503.2 | 501.8 | 503.1 | 502.75 | 502.3

Table 5: Ranking of the methods by clustering time on the health analytics dataset

Method | Euclidean distance (regular) | Minkowski | Random selection | Standard deviation | Means | Variance | Co-variance
Grand average (ms) | 553.4 | 504.5 | 505.4 | 502.6 | 504.6 | 504.3 | 503.9
Rank | 7th | 4th | 6th | 1st | 5th | 3rd | 2nd

Table 6: Information retrieval time (ms) with the "Health Analytics" dataset using different search terms

Search term | Euclidean (regular selection) | Euclidean (random selection) | Minkowski | Weighted (standard deviation) | Weighted (means) | Weighted (variance) | Weighted (co-variance)
18501 | 500 | 502 | 500 | 501 | 502 | 502 | 502
| 500 | 504 | 500 | 501 | 501 | 501 | 503
| 500 | 502 | 501 | 502 | 502 | 502 | 502
Average | 500.75 | 502.25 | 501 | 501.75 | 501.75 | 501.75 | 502.25
5769 | 500 | 502 | 501 | 502 | 501 | 502 | 501
| 501 | 501 | 500 | 501 | 501 | 504 | 502
| 501 | 502 | 500 | 501 | 502 | 503 | 502
| 501 | 502 | 500 | 502 | 502 | 503 | 502
Average | 500.75 | 501.75 | 500.25 | 501.5 | 501.5 | 503 | 501.75
Total average | 500.75 | 502 | 500.625 | 501.625 | 501.625 | 502.375 | 502

Table 7: Information Retrieval time with the “Rain in Australia” Dataset using different search terms

Euclidean Distance(ms) Minkowski (ms) Weighted Euclidean Distance (ms)
Search term Euclidean Distance Random Selection Minkowski Standard Deviation Means Variance Co-variance
504 502 503 502 502 502
500 502 500 502 502 502
501 502 501 533 533 502
19.8 500 502 500 501 501 502
AVERAGE 501.5 502 501 509.5 509.5 502
Euclidean Distance(ms) Minkowski (ms) Weighted Euclidean Distance (ms)
Search term Euclidean Distance Random Selection Minkowski Standard Deviation Means Variance Co-variance
500 502 501 501 502 502 502
501 502 500 502 502 501 503
501 504 500 501 502 502 502
21.5 500 502 501 502 502 502 502
AVERAGE 500.5 502.5 500.5 501.5 502 501.75 502.25
Total Average 501 502.25 500.75 505.5 505.75 501.875 501.125

Table 8: Ranking of the methods by retrieval time on the Australia rain dataset

Method | Euclidean distance (regular) | Random selection | Minkowski | Standard deviation | Means | Variance | Co-variance
Grand average (ms) | 500.875 | 502.125 | 500.688 | 503.563 | 503.688 | 505.125 | 502.063
Rank | 2nd | 4th | 1st | 6th | 7th | 5th | 3rd

The stand-alone classic k-means algorithm (SaCKmeans) used in the work of Lu [20] was selected as the benchmark against the weighted Euclidean distance using standard deviation (SDKmeans), since the latter produced the best clustering result among all the methods. SDKmeans took a shorter time to cluster the dataset in each of the three runs, as indicated in Table 9. Hence, it can be concluded that SDKmeans outperformed SaCKmeans in terms of the time taken to cluster the two experimental data sets.

Table 9: Comparison of SDKmeans with SaCKmeans (time taken, ms)

Run | SDKmeans | SaCKmeans
1 | 529.2 | 524.9
| 504.1 | 522.1
| 504.4 | 521
| 504.5 | 530
Average | 510.6 | 524.5
2 | 504.1 | 522.1
| 503 | 521.6
| 503.3 | 522.2
| 502.3 | 521.6
Average | 503.2 | 521.9
3 | 503.9 | 521.8
| 503.6 | 522
| 504 | 521.9
| 502.8 | 522.4
Average | 503.6 | 522
Total average | 505.8 | 522.8

CONCLUSION

The choice of the first point along the x and y axes does not permit the value of the first centroid to vary between runs of the experiments. The values of k in k-means or k-means++ were strictly tied to the sizes of the polygons with their corresponding fuzzy linguistic hedges during the experimental runs. Every time the experiment was conducted, the number of clusters generated was equal to the size of the polygon, alongside the satisfaction degrees of the fuzzy linguistic hedges. The combination of polygon sizes and fuzzy linguistic hedges handled millions of data points in a matter of seconds and allowed the k-means algorithm to converge rapidly.

 REFERENCES

  1. Johannes Schneider and Michail Vlachos, Fast parameterless density-based clustering via random projections, Association for Computing Machinery, New York, ACM Press, pp. 861-866, 2013.
  2. Ioan-Daniel Borlea, Radu-Emil Precup and Alexandra-Bianca Borlea, Improvement of k-means cluster quality by post-processing resulted clusters, Procedia Computer Science, 199, pp. 63-70, 2022.
  3. YanPing Zhao and XiaoLai Zhou, K-means clustering algorithm and its improvement research, Journal of Physics: Conference Series, vol. 1873, 2nd International Workshop on Electronic Communication and Artificial Intelligence (IWECAI 2021), 12-14 March 2021, Nanjing, China.
  4. Manish Suyal and Sanjay Sharma, A review of analysis of k-means clustering machine learning algorithm based on unsupervised learning, Journal of Artificial Intelligence and Systems, 6, pp. 85-95, 2024.
  5. Mohiuddin Ahmed, Raihan Seraj and Syed Mohammed Shamsul Islam, The k-means algorithm: A comprehensive survey and performance evaluation, Electronics, 9(1295), pp. 1-12, 2020, doi:10.3390/electronics9081295.
  6. Fahim A. M., Salem A. M., Torkey F. A., and Ramadan M. A., An efficient enhanced k-means clustering algorithm, Journal of Zhejiang University, China, 10(7), pp. 1626-1633, 2006.
  7. Rai P. and Singh S., A survey of clustering techniques, International Journal of Computer Applications, 7, pp. 1-5, 2010.
  8. Kanungo T., Mount D. M., Netanyahu N. S., Piatko C. D., and Silverman R., An efficient k-means clustering algorithm: analysis and implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, pp. 870-892, 2002.
  9. Yu H., Wen G., Gan J., Zheng W., and Lei C., Self-paced learning for k-means clustering algorithm, Pattern Recognition, 2018.
  10. Hussain S. F. and Harris M. A., K-means based co-clustering algorithm for sparse, high dimensional data, Expert Systems with Applications, 118, pp. 20-34, 2019.
  11. Ye S., Huang X., Teng Y., and Li Y., K-means clustering algorithm based on improved cuckoo search algorithm and its application, In Proceedings of the 2018 IEEE 3rd International Conference on Big Data Analysis, Shanghai, China, 9-12 March 2018, pp. 422-426.
  12. Melnykov I. and Melnykov V., On k-means algorithm with the use of Mahalanobis distances, Statistics and Probability Letters, 84, pp. 88-95, 2014.
  13. Bubeck S., Meila M., and von Luxburg U., How the initialization affects the stability of the k-means algorithm, ESAIM: Probability and Statistics, 16, pp. 436-452, 2012.
  14. Nasser S., Alkhaldi R., and Vert G., A modified fuzzy k-means clustering using expectation maximization, In Proceedings of the 2006 IEEE International Conference on Fuzzy Systems, Vancouver, BC, Canada, pp. 231-235, 2006.
  15. Nazeer K. A. A., Kumar S. D. M., and Sebastian M. P., Enhancing the k-means clustering algorithm by using an O(n log n) heuristic method for finding better initial centroids, In Proceedings of the 2011 2nd International Conference on Emerging Applications of Information Technology, Washington, DC, USA, pp. 261-264, 2011.
  16. Kaukoranta T., Franti P., and Nevalainen O., A fast exact GLA based on code vector activity detection, IEEE Transactions on Image Processing.
  17. Ziheng Li, Rong Wang and Xuelong Li, An effective and efficient algorithm for k-means clustering with new formulation, IEEE Transactions on Knowledge and Data Engineering, vol. 35, issue 4, 2023.
  18. Narongsak Chayangkoon and Chatchai Kasemtaweechok, Improving the performance of k-means clustering to reduce work by addressing the gap in finding, 2023 27th International Computer Science and Engineering Conference (ICSEC), 2023.
  19. Das S., Abraham A., and Konar A., Automatic clustering using an improved differential evolution algorithm, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 38(1), pp. 218-237, 2008.
  20. Lu W., Improved k-means clustering algorithm for big data mining under Hadoop parallel framework, Journal of Grid Computing, 18, pp. 239-250, 2020.
