Unsupervised learning clustering 1D array




I am faced with the following array:


y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]



What I would like to do is extract the cluster with the highest scores. That would be


best_cluster = [200,297,275,243]



I have checked quite a few questions on Stack Overflow on this topic and most of them recommend using kmeans, although a few others mention that kmeans might be overkill for clustering 1D arrays.
However, kmeans is a supervised learning algorithm, which means I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters, like so: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy, and how could I implement it for my problem?





K-means is inherently an unsupervised learning algorithm. Your data are not supplied with classes; the k-means algorithm is left to cluster the data on its own. This article might provide you with some insight into determining the number of clusters: pythonprogramminglanguage.com/kmeans-elbow-method
– rahlf23
Jul 23 at 21:48
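For reference, a rough sketch of the elbow method mentioned in that comment (my own illustration, not code from the linked article): fit KMeans for several values of k and look for the point where the inertia stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans

y = np.reshape([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243], (-1, 1))

# Inertia (within-cluster sum of squares) for k = 1..5; the "elbow" in these
# values suggests a reasonable number of clusters.
for k in range(1, 6):
    km = KMeans(n_clusters=k, random_state=0).fit(y)
    print(k, km.inertia_)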





Possible duplicate of How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
– MoxieBall
Jul 23 at 21:52





@MoxieBall, it's not the same. What you have there is supervised; there are 3 clusters set up
– dre_84w934
Jul 23 at 22:03




4 Answers



Try MeanShift. From the sklearn user guide on MeanShift:





The algorithm automatically sets the number of clusters, ...



Modified demo code:


import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))

# #############################################################################
# Compute clustering with MeanShift

# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)

ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
print(labels)



Output:


number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
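The labels alone don't give the cluster the question asks for. Assuming "highest scores" means the cluster whose center is largest, a minimal follow-up to the demo above could look like this (it reuses the snippet's X, labels and cluster_centers):

# Pick the cluster with the largest center and collect its members.
best_label = int(np.argmax(cluster_centers))      # rows of cluster_centers_ correspond to labels
best_cluster = X[labels == best_label].ravel()
print(best_cluster)                               # expected: [200 297 275 243]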



Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.



BTW, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.



See also:



Overview of clustering methods



Choosing the right estimator





yes, sorry, I'll have to change that, Kmeans is an unsupervised learning algo ;)
– dre_84w934
Jul 24 at 6:41





So then, would you say MeanShift is computationally more efficient on large data than kmeans?
– dre_84w934
Jul 24 at 6:43





So I tried it out and it works fine; however, one problem I am facing now is that it does not distinguish between negative and positive values. For example, from the array [-100,-20,-50, 55, 30, 50] it will see -100 as the best option, which is not correct; it is in fact the lowest. Does this come from the fitting?
– dre_84w934
Jul 24 at 13:13





@dre_84w934 It is actually the other way around: KMeans with minibatch is scalable, whereas 10K is the recommended upper limit for MeanShift. Clustering algorithms only tell you what the clusters are. You will have to figure out the 'highest' one afterwards.
– Eason983
Jul 25 at 4:51





Right, I've noticed. However, when there are negative values, if the number after the minus sign is bigger than the positive ones, it takes the negative cluster as the higher one.
– dre_84w934
Jul 25 at 5:35



HDBSCAN is the best clustering algorithm and you should always use it.



Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.




For min_cluster_size I suggest using 3, since a cluster of 2 is lame, and for metric the default euclidean works great, so you don't even need to mention it.





Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.



To put it all together, and assuming by "cluster with the highest scores" you mean the cluster that includes the max value, we get:


from hdbscan import HDBSCAN
import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))            # distance metrics expect vectors, so reshape to a column

clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)

# exemplars_ holds the representative points of each cluster; pick the cluster
# that contains the maximum value.
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)



The output is [297 200 275 243]. Original order is not preserved. C'est la vie.


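If the original order matters, one possible workaround (my own sketch, not part of the answer above) is to index y by the label of the point with the maximum value rather than going through exemplars_:

# Select all members of the cluster that contains the max value; boolean
# indexing keeps them in their original order. Assumes the max point was not
# labelled as noise (-1) by HDBSCAN.
best_label = cluster_labels[y.argmax()]
best_cluster = y[cluster_labels == best_label].ravel()
print(best_cluster)          # expected: [200 297 275 243]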



Just compute the differences of consecutive elements, i.e. look at x[i]-x[i-1].





Choose the largest differences as split points. Or define a threshold on when to split. E.g. 20.



This is O(n), much faster than all the others mentioned. Also very understandable and predictable.



On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
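A minimal sketch of this idea (my own illustration, not code from this answer; it sorts the data first and splits wherever the gap between consecutive values exceeds a threshold):

import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
x = np.sort(y)                        # the method relies on ordered data

gaps = np.diff(x)                     # x[i] - x[i-1] for consecutive elements
threshold = 50                        # chosen for this data; the top cluster has
                                      # internal gaps up to 43, so e.g. 20 would split it
split_points = np.where(gaps > threshold)[0] + 1
clusters = np.split(x, split_points)

best_cluster = clusters[-1]           # sorted ascending, so the last chunk is the highest
print(best_cluster)                   # [200 243 275 297]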



@anony-mousse And why is clustering overkill here? Is it because the sample size is small, or because you think clustering should only be applied to huge datasets? How does ordered or unordered data make a difference? My point is, your insights seem like shots in the dark. Can you back up your answer with either a mathematical or coding example showing that the O(n) approach is faster than all the others?






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

sx2jcv40,lAG220R7GdA2SXSbr
18aep6

Popular posts from this blog

Makefile test if variable is not empty

Will Oldham

Visual Studio Code: How to configure includePath for better IntelliSense results