Unsupervised learning clustering 1D array




I am faced with the following array:


y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]



What I would like to do is extract the cluster with the highest scores. That would be


best_cluster = [200,297,275,243]



I have checked quite a few questions on Stack Overflow on this topic and most of them recommend using kmeans, although a few others mention that kmeans might be overkill for clustering 1D arrays.
However, kmeans is a supervised learning algorithm, which means I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters, like so: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy, and how could I implement it for my problem?





K-means is inherently an unsupervised learning algorithm. Your data are not supplied with classes; the k-means algorithm is left to cluster the data on its own. This article might provide you with some insight into determining the number of clusters: pythonprogramminglanguage.com/kmeans-elbow-method
– rahlf23
Jul 23 at 21:48
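For reference, a rough sketch of the elbow method mentioned in that comment (my own illustration, not code from the linked article): fit KMeans for several values of k and look for the point where the inertia stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans

y = np.reshape([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243], (-1, 1))

# Inertia (within-cluster sum of squares) for k = 1..5; the "elbow" in these
# values suggests a reasonable number of clusters.
for k in range(1, 6):
    km = KMeans(n_clusters=k, random_state=0).fit(y)
    print(k, km.inertia_)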





Possible duplicate of How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
– MoxieBall
Jul 23 at 21:52





@MoxieBall, it's not the same. What you have there is supervised; there are 3 clusters set up
– dre_84w934
Jul 23 at 22:03




4 Answers



Try MeanShift. From the sklearn user guide on MeanShift:





The algorithm automatically sets the number of clusters, ...



Modified demo code:


import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))

# #############################################################################
# Compute clustering with MeanShift

# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)

ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
print(labels)



Output:


number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
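The labels alone don't give the cluster the question asks for. Assuming "highest scores" means the cluster whose center is largest, a minimal follow-up to the demo above could look like this (it reuses the snippet's X, labels and cluster_centers):

# Pick the cluster with the largest center and collect its members.
best_label = int(np.argmax(cluster_centers))      # rows of cluster_centers_ correspond to labels
best_cluster = X[labels == best_label].ravel()
print(best_cluster)                               # expected: [200 297 275 243]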



Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.



BTW, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.



See also:



Overview of clustering methods



Choosing the right estimator





yes, sorry, I'll have to change that, Kmeans is an unsupervised learning algo ;)
– dre_84w934
Jul 24 at 6:41





So then, would you say MeanShift is computationally more efficient on large data than kmeans?
– dre_84w934
Jul 24 at 6:43





So I tried it out and it works fine; however, one problem I am facing now is that it does not distinguish between negative and positive values. For example, from the array [-100,-20,-50, 55, 30, 50] it will see -100 as the best option, which is not correct; it is in fact the lowest. Does this come from the fitting?
– dre_84w934
Jul 24 at 13:13





@dre_84w934 It is actually the other way around: KMeans with minibatch is scalable, whereas 10K is the recommended upper limit for MeanShift. Clustering algorithms only tell you what the clusters are. You will have to figure out the 'highest' one afterwards.
– Eason983
Jul 25 at 4:51





Right, I've noticed. However, when there are negative values, if the number after the minus sign is bigger than the positive ones, it takes the negative cluster as the higher one.
– dre_84w934
Jul 25 at 5:35



HDBSCAN is the best clustering algorithm and you should always use it.



Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.




For min_cluster_size I suggest using 3, since a cluster of 2 is lame, and for metric the default euclidean works great, so you don't even need to mention it.





Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.



To put it all together, and assuming by "cluster with the highest scores" you mean the cluster that includes the max value, we get:


from hdbscan import HDBSCAN
import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))            # distance metrics expect vectors, so reshape to a column

clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)

# exemplars_ holds the representative points of each cluster; pick the cluster
# that contains the maximum value.
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)



The output is [297 200 275 243]. Original order is not preserved. C'est la vie.


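If the original order matters, one possible workaround (my own sketch, not part of the answer above) is to index y by the label of the point with the maximum value rather than going through exemplars_:

# Select all members of the cluster that contains the max value; boolean
# indexing keeps them in their original order. Assumes the max point was not
# labelled as noise (-1) by HDBSCAN.
best_label = cluster_labels[y.argmax()]
best_cluster = y[cluster_labels == best_label].ravel()
print(best_cluster)          # expected: [200 297 275 243]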



Just compute the differences of consecutive elements, i.e. look at x[i]-x[i-1].





Choose the largest differences as split points. Or define a threshold on when to split. E.g. 20.



This is O(n), much faster than all the others mentioned. Also very understandable and predictable.



On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
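A minimal sketch of this idea (my own illustration, not code from this answer; it sorts the data first and splits wherever the gap between consecutive values exceeds a threshold):

import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
x = np.sort(y)                        # the method relies on ordered data

gaps = np.diff(x)                     # x[i] - x[i-1] for consecutive elements
threshold = 50                        # chosen for this data; the top cluster has
                                      # internal gaps up to 43, so e.g. 20 would split it
split_points = np.where(gaps > threshold)[0] + 1
clusters = np.split(x, split_points)

best_cluster = clusters[-1]           # sorted ascending, so the last chunk is the highest
print(best_cluster)                   # [200 243 275 297]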



@anony-mousse And why is clustering overkill here? Is it because the sample size is small, or because you think clustering should only be applied to huge datasets? How does ordered or unordered data make a difference? My point is, your insights seem like shots in the dark. Can you back up your answer with either a mathematical or coding example showing that the O(n) approach is faster than all the others?






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

sx2jcv40,lAG220R7GdA2SXSbr
18aep6

Popular posts from this blog

Makefile test if variable is not empty

Will Oldham

Visual Studio Code: How to configure includePath for better IntelliSense results