Split sorted array into list with sublists




I have a sorted array of float32 values, and I want to split it into a list of lists, each containing only equal values, like this:


>>> split_sorted(array)  # array is [1., 1., 1., 2., 2., 3.]
[[1., 1., 1.], [2., 2.], [3.]]



My current approach is this function:


def split_sorted(array):
    split = [[array[0]]]

    s_index = 0
    a_index = 1
    while a_index < len(array):
        # extend the current sublist while the value repeats
        while a_index < len(array) and array[a_index] == split[s_index][0]:
            split[s_index].append(array[a_index])
            a_index += 1
        else:
            # new value reached: start the next sublist
            if a_index < len(array):
                s_index += 1
                split.append([array[a_index]])
                a_index += 1
    return split
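For reference, this is what the function returns for the example input (shown here with a plain Python list; a sorted NumPy float32 array behaves the same):


>>> split_sorted([1., 1., 1., 2., 2., 3.])
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]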



My question now is: is there a more Pythonic way to do this, maybe even with NumPy? And is this the most performant way?



Thanks a lot!





What's the typical length of the input array and number of unique elements?
– Divakar
19 hours ago




4 Answers



Approach #1



With a as the array, we can use np.split -



np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)



Sample run -


In [16]: a
Out[16]: array([1., 1., 1., 2., 2., 3.])

In [17]: np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
Out[17]: [array([1., 1., 1.]), array([2., 2.]), array([3.])]



Approach #2



Another, more performant way would be to get the splitting indices, then slice the array and zip the consecutive index pairs -




idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])
out = [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
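As a quick sanity check on the sample array from above (prompt numbers are illustrative):


In [20]: idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])

In [21]: idx
Out[21]: array([0, 3, 5, 6])

In [22]: [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
Out[22]: [array([1., 1., 1.]), array([2., 2.]), array([3.])]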



Approach #3



If you have to get a list of lists as output, we could re-create them with list replication -


mask = np.r_[True, a[:-1] != a[1:], True]
c = np.diff(np.flatnonzero(mask))
out = [[i]*j for i,j in zip(a[mask[:-1]],c)]
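Checked on the same sample array (prompt numbers are illustrative), this yields plain Python lists rather than array slices:


In [25]: mask = np.r_[True, a[:-1] != a[1:], True]
...: c = np.diff(np.flatnonzero(mask))
...: [[i]*j for i,j in zip(a[mask[:-1]],c)]
Out[25]: [[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]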



Timings for vectorized approaches on 1000000 elements with 10000 unique elements -




In [145]: np.random.seed(0)
...: a = np.sort(np.random.randint(1,10000,(1000000)))

In [146]: x = a

# Approach #1 from this post
In [147]: %timeit np.split(a,np.flatnonzero(a[:-1] != a[1:])+1)
100 loops, best of 3: 10.5 ms per loop

# Approach #2 from this post
In [148]: %%timeit
...: idx = np.flatnonzero(np.r_[True, a[:-1] != a[1:], True])
...: out = [a[i:j] for i,j in zip(idx[:-1],idx[1:])]
100 loops, best of 3: 5.18 ms per loop

# Approach #3 from this post
In [197]: %%timeit
...: mask = np.r_[True, a[:-1] != a[1:], True]
...: c = np.diff(np.flatnonzero(mask))
...: out = [[i]*j for i,j in zip(a[mask[:-1]],c)]
100 loops, best of 3: 11.1 ms per loop

# @RafaelC's soln
In [149]: %%timeit
...: v,c = np.unique(x, return_counts=True)
...: out = [[a]*b for (a,b) in zip(v,c)]
10 loops, best of 3: 25.6 ms per loop





Well, this is something I can work with :) But I'll wait for more answers before I mark this question as solved.
– Biskit1943
23 hours ago



You can use numpy.unique and zip




v,c = np.unique(x, return_counts=True)
[[a]*b for (a,b) in zip(v,c)]



Outputs


[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]



Timings for a 6,000,000-element array


%timeit v,c = np.unique(x, return_counts=True); [[a]*b for (a,b) in zip(v,c)]
18.2 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.split(x,np.flatnonzero(x[:-1] != x[1:])+1)
424 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [list(group) for value, group in itertools.groupby(x)]
180 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





Share the setup for x for the timings?
– Divakar
19 hours ago







@Divakar Had simply x = np.array([1., 1., 1., 2., 2., 3.]*100000). Timing with np.array([np.random.randint(50) for _ in range(1000000)]) yields similar results, but when testing with the sorted np.array(sorted([np.random.randint(50) for _ in range(1000000)])), your solutions come out faster :)
– RafaelC
17 hours ago









Well, the first two setups are wrong, as the OP mentioned the inputs are sorted and my approaches specifically leverage that for performance. Also, whenever you post a speedup comparison, please list the setup used. Without the setup, nobody can reproduce it.
– Divakar
19 mins ago



The function itertools.groupby has this exact behavior.




>>> from itertools import groupby
>>> [list(group) for value, group in groupby(array)]
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]


>>> from itertools import groupby
>>> a = [1., 1., 1., 2., 2., 3.]
>>> for k, g in groupby(a):
...     print(k, list(g))
...
1.0 [1.0, 1.0, 1.0]
2.0 [2.0, 2.0]
3.0 [3.0]



You can also collect the groups into a list of lists, if you like:


>>> result = []
>>> for k, g in groupby(a):
...     result.append(list(g))
...
>>> result
[[1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

3,2cq75RXufbLx7 sjX uH
SN4vKVJL5IuQHu,ZySf Ki6Xcf4gWir9MuzR1vMq9YQiLug9EX

Popular posts from this blog

Makefile test if variable is not empty

Visual Studio Code: How to configure includePath for better IntelliSense results

Will Oldham