Find the cluster number corresponding to cluster labels

Find the cluster number corresponding to cluster labels

Problem Description:

Based on the result of a clustering I have the following data:

    timestamp  consumption  cluster_number
0           0    35.666667               1
1           1    29.352222               1
2           2    24.430000               1
3           3    21.756667               1
4           4    20.345556               1
5           5    19.763333               1
6           6    19.874444               1
7           7    22.078889               1
8           8    28.608889               1
9           9    33.827778               2
10         10    36.414444               2
11         11    38.340000               2
12         12    43.305556               2
13         13    43.034444               2
14         14    39.076667               2
15         15    36.378889               2
16         16    36.171111               2
17         17    40.381111               2
18         18    48.692222               0
19         19    52.330000               0
20         20    50.154444               0
21         21    46.491111               0
22         22    44.014444               0
23         23    40.628889               0

With this clustering, the maximum value (and values close to the maximum value) of the column consumption is in cluster_number 0, the minimum value (and values close to the minimum value) of the column consumption is in cluster_number 1 and the rest in cluster_number 2. However, I cannot know beforehand which ‘consumption’ values correspond to which cluster_number, so I need to find a way to first connect the cluster_number with high, low and middle class and then come up with a list of the column timestamp for each cluster.

Specifically, I want to come up with three lists:

  1. high = [18, 19, 20, 21, 22, 23]
  2. low = [0, 1, 2, 3, 4, 5, 6, 7, 8]
  3. middle = [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]

Any idea of how can I achieve this?

Solution – 1

You can use a groupby.min + rank to identify the order of the clusters, then groupby.agg on the index and rename using the cluster order:

order = ['low', 'middle', 'high']

g = df.reset_index().groupby('cluster_number')
mapper = (g['consumption'].min() # min for the demo, you can use any function mean/sum/…
          .rank(method='dense')
          .map(dict(enumerate(order, start=1)))
         )
out = g['index'].agg(list).rename(mapper)

Output:

cluster_number
high                 [18, 19, 20, 21, 22, 23]
low               [0, 1, 2, 3, 4, 5, 6, 7, 8]
middle    [9, 10, 11, 12, 13, 14, 15, 16, 17]
Name: index, dtype: object
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject