Split sorted array into list with sublists

Multi tool use


Split sorted array into list with sublists
I have a sorted array of float32 Values, I want to split this array into a list of lists containing only the same Values like this:
>>> split_sorted(array) # [1., 1., 1., 2., 2., 3.]
>>> [[1., 1., 1.], [2., 2.], [3.]]
My current approach is this Function
def split_sorted(array):
split = [[array[0]]]
s_index = 0
a_index = 1
while a_index < len(array):
while a_index < len(array) and array[a_index] == split[s_index][0]:
split[s_index].append(array[a_index])
a_index += 1
else:
if a_index < len(array):
s_index += 1
a_index += 1
split.append([array[a_index]])
My Question now is, is there a more Pythonic way to do this? maybe even with numpy? And is this the most performant way?
Thanks a lot!
4 Answers
4
Approach #1
With a
as the array, we can use np.split
-
a
np.split
np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
Sample run -
In [16]: a
Out[16]: array([1., 1., 1., 2., 2., 3.])
In [17]: np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
Out[17]: [array([1., 1., 1.]), array([2., 2.]), array([3.])]
Approach #2
Another more performant way would be to get the splitting indices and then slicing the array and zipping
-
zipping
idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])
out = [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
Approach #3
If you have to get a list of sublists as output, we could re-create with list duplication -
mask = np.r_[True, a[:-1] != a[1:], True]
c = np.diff(np.flatnonzero(mask))
out = [[i]*j for i,j in zip(a[mask[:-1]],c)]
Timings for vectorized approaches on 1000000
elements with 10000
unique elements -
1000000
10000
In [145]: np.random.seed(0)
...: a = np.sort(np.random.randint(1,10000,(1000000)))
In [146]: x = a
# Approach #1 from this post
In [147]: %timeit np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
100 loops, best of 3: 10.5 ms per loop
# Approach #2 from this post
In [148]: %%timeit
...: idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])
...: out = [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
100 loops, best of 3: 5.18 ms per loop
# Approach #3 from this post
In [197]: %%timeit
...: mask = np.r_[True, a[:-1] != a[1:], True]
...: c = np.diff(np.flatnonzero(mask))
...: out = [[i]*j for i,j in zip(a[mask[:-1]],c)]
100 loops, best of 3: 11.1 ms per loop
# @RafaelC's soln
In [149]: %%timeit
...: v,c = np.unique(x, return_counts=True)
...: out = [[a]*b for (a,b) in zip(v,c)]
10 loops, best of 3: 25.6 ms per loop
Well, this is something I can work with :) But I'll wait for more answers before I set this Question as solved
– Biskit1943
23 hours ago
You can use numpy.unique
and zip
numpy.unique
zip
v,c = np.unique(x, return_counts=True)
[[a]*b for (a,b) in zip(v,c)]
Outputs
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]
Timings for a 6,000,000 sized array
%timeit v,c = np.unique(x, return_counts=True); [[a]*b for (a,b) in zip(v,c)]
18.2 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.split(x,np.flatnonzero(x[:-1] != x[1:])+1)
424 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [list(group) for value, group in itertools.groupby(x)]
180 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Share the setup for
x
for the timings?– Divakar
19 hours ago
x
@Divakar Had simply
x = np.array([1., 1., 1., 2., 2., 3.]*100000)
. Timing with np.array([np.random.randint(50) for _ in range(1000000)])
yields similar results, but testing with sorted np.array(sorted([np.random.randint(50) for _ in range(1000000)]))
your solutions rises to be faster :)– RafaelC
17 hours ago
x = np.array([1., 1., 1., 2., 2., 3.]*100000)
np.array([np.random.randint(50) for _ in range(1000000)])
np.array(sorted([np.random.randint(50) for _ in range(1000000)]))
Well the first two setups are wrong as OP mentioned the inputs are sorted ones and my approaches are specifically leveraging it for performance. Also, whenever you are posting a speedup comparison, please list the setup used. Without the setup, nobody can reproduce it.
– Divakar
19 mins ago
The function itertools.groupby
has this exact behavior.
itertools.groupby
>>> from itertools import groupby
>>> [list(group) for value, group in groupby(array)]
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]
>>> from itertools import groupby
>>> a = [1., 1., 1., 2., 2., 3.]
>>> for k, g in groupby(a) :
... print k, list(g)
...
1.0 [1.0, 1.0, 1.0]
2.0 [2.0, 2.0]
3.0 [3.0]
You may join the lists, if you like:
>>> result =
>>> for k, g in groupby(a) :
... result.append( list(g) )
...
>>> result
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
What's the typical length of the input array and number of unique elements?
– Divakar
19 hours ago