Split sorted array into list with sublists




I have a sorted array of float32 values, and I want to split it into a list of lists, each containing only equal values, like this:


>>> split_sorted(array)  # array is [1., 1., 1., 2., 2., 3.]
[[1., 1., 1.], [2., 2.], [3.]]



My current approach is this function:


def split_sorted(array):
    split = [[array[0]]]

    s_index = 0
    a_index = 1
    while a_index < len(array):
        # extend the current sublist while the value repeats
        while a_index < len(array) and array[a_index] == split[s_index][0]:
            split[s_index].append(array[a_index])
            a_index += 1
        else:
            # new value reached: start the next sublist
            if a_index < len(array):
                s_index += 1
                split.append([array[a_index]])
                a_index += 1
    return split
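For reference, this is what the function returns for the example input (shown here with a plain Python list; a sorted NumPy float32 array behaves the same):


>>> split_sorted([1., 1., 1., 2., 2., 3.])
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]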



My question now is: is there a more Pythonic way to do this, maybe even with NumPy? And is this the most performant way?



Thanks a lot!





What's the typical length of the input array and number of unique elements?
– Divakar
19 hours ago




4 Answers



Approach #1



With a as the array, we can use np.split -



np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)



Sample run -


In [16]: a
Out[16]: array([1., 1., 1., 2., 2., 3.])

In [17]: np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
Out[17]: [array([1., 1., 1.]), array([2., 2.]), array([3.])]



Approach #2



Another, more performant way would be to get the splitting indices, then slice the array and zip the consecutive index pairs -




idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])
out = [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
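As a quick sanity check on the sample array from above (prompt numbers are illustrative):


In [20]: idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])

In [21]: idx
Out[21]: array([0, 3, 5, 6])

In [22]: [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
Out[22]: [array([1., 1., 1.]), array([2., 2.]), array([3.])]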



Approach #3



If you have to get a list of lists as output, we could re-create them with list replication -


mask = np.r_[True, a[:-1] != a[1:], True]
c = np.diff(np.flatnonzero(mask))
out = [[i]*j for i,j in zip(a[mask[:-1]],c)]
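Checked on the same sample array (prompt numbers are illustrative), this yields plain Python lists rather than array slices:


In [25]: mask = np.r_[True, a[:-1] != a[1:], True]
...: c = np.diff(np.flatnonzero(mask))
...: [[i]*j for i,j in zip(a[mask[:-1]],c)]
Out[25]: [[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]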



Timings for vectorized approaches on 1000000 elements with 10000 unique elements -




In [145]: np.random.seed(0)
...: a = np.sort(np.random.randint(1,10000,(1000000)))

In [146]: x = a

# Approach #1 from this post
In [147]: %timeit np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
100 loops, best of 3: 10.5 ms per loop

# Approach #2 from this post
In [148]: %%timeit
...: idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])
...: out = [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
100 loops, best of 3: 5.18 ms per loop

# Approach #3 from this post
In [197]: %%timeit
...: mask = np.r_[True, a[:-1] != a[1:], True]
...: c = np.diff(np.flatnonzero(mask))
...: out = [[i]*j for i,j in zip(a[mask[:-1]],c)]
100 loops, best of 3: 11.1 ms per loop

# @RafaelC's soln
In [149]: %%timeit
...: v,c = np.unique(x, return_counts=True)
...: out = [[a]*b for (a,b) in zip(v,c)]
10 loops, best of 3: 25.6 ms per loop





Well, this is something I can work with :) But I'll wait for more answers before I mark this question as solved.
– Biskit1943
23 hours ago



You can use numpy.unique and zip




v,c = np.unique(x, return_counts=True)
[[a]*b for (a,b) in zip(v,c)]



Outputs


[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]



Timings for a 6,000,000-element array


%timeit v,c = np.unique(x, return_counts=True); [[a]*b for (a,b) in zip(v,c)]
18.2 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.split(x,np.flatnonzero(x[:-1] != x[1:])+1)
424 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [list(group) for value, group in itertools.groupby(x)]
180 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





Share the setup for x for the timings?
– Divakar
19 hours ago







@Divakar Had simply x = np.array([1., 1., 1., 2., 2., 3.]*100000). Timing with np.array([np.random.randint(50) for _ in range(1000000)]) yields similar results, but when testing with the sorted np.array(sorted([np.random.randint(50) for _ in range(1000000)])), your solutions come out faster :)
– RafaelC
17 hours ago









Well, the first two setups are wrong, as the OP mentioned the inputs are sorted and my approaches specifically leverage that for performance. Also, whenever you post a speedup comparison, please list the setup used. Without the setup, nobody can reproduce it.
– Divakar
19 mins ago



The function itertools.groupby has this exact behavior.




>>> from itertools import groupby
>>> [list(group) for value, group in groupby(array)]
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]


>>> from itertools import groupby
>>> a = [1., 1., 1., 2., 2., 3.]
>>> for k, g in groupby(a):
...     print(k, list(g))
...
1.0 [1.0, 1.0, 1.0]
2.0 [2.0, 2.0]
3.0 [3.0]



You can also collect the groups into a list of lists, if you like:


>>> result = []
>>> for k, g in groupby(a):
...     result.append(list(g))
...
>>> result
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

3,2cq75RXufbLx7 sjX uH
SN4vKVJL5IuQHu,ZySf Ki6Xcf4gWir9MuzR1vMq9YQiLug9EX

Popular posts from this blog

Makefile test if variable is not empty

Visual Studio Code: How to configure includePath for better IntelliSense results

Will Oldham