Cython: size attribute of memoryviews

Multi tool use


Cython: size attribute of memoryviews
I'm using a lot of 3D memoryviews in Cython, e.g.
cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')
I often want to loop over all elements of a
. I can do this using a triple loop like
a
for i in range(a.shape[0]):
for j in range(a.shape[1]):
for k in range(a.shape[2]):
a[i, j, k] = ...
If I do not care about the indices i
, j
and k
, it is more efficient to do a flat loop, like
i
j
k
cython.declare(a_ptr='double*')
a_ptr = cython.address(a[0, 0, 0])
for i in range(size):
a_ptr[i] = ...
Here I need to know the number of elements (size
) in the array. This is given by the product of the elements in the shape
attribute, i.e. size = a.shape[0]*a.shape[1]*a.shape[2]
, or more generally size = np.prod(np.asarray(a).shape)
. I find both of these ugly to write, and the (albeit small) computational overhead bothers me. The nice way to do it is to use the builtin size
attribute of memoryviews, size = a.size
. However, for reasons I cannot fathom, this leads to unoptimized C code, as evident from the annotations html file generated by Cython. Specifically, the C code generated by size = a.shape[0]*a.shape[1]*a.shape[2]
is simply
size
shape
size = a.shape[0]*a.shape[1]*a.shape[2]
size = np.prod(np.asarray(a).shape)
size
size = a.size
size = a.shape[0]*a.shape[1]*a.shape[2]
__pyx_v_size = (((__pyx_v_a.shape[0]) * (__pyx_v_a.shape[1])) * (__pyx_v_a.shape[2]));
where the C code generated from size = a.size
is
size = a.size
__pyx_t_10 = __pyx_memoryview_fromslice(__pyx_v_a, 3, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_10);
__pyx_t_14 = __Pyx_PyObject_GetAttrStr(__pyx_t_10, __pyx_n_s_size); if (unlikely(!__pyx_t_14)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_14);
__Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
__pyx_t_7 = __Pyx_PyIndex_AsSsize_t(__pyx_t_14); if (unlikely((__pyx_t_7 == (Py_ssize_t)-1) && PyErr_Occurred())) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_14); __pyx_t_14 = 0;
__pyx_v_size = __pyx_t_7;
To generate the above code, I have enabled all possible optimizations through compiler directives, meaning that the unwieldy C code generated by a.size
cannot be optimized away. It looks to me as though the size
"attribute" is not really a pre-computed attribute, but actually carries out a computation upon lookup. Furthermore, this computation is quite a bit more involved than simply taking the product over the shape
attribute. I cannot find any hint of an explanation in the docs.
a.size
size
shape
What is the explanation of this behavior, and do I have a better choice than writing out a.shape[0]*a.shape[1]*a.shape[2]
, if I really care about this micro optimization?
a.shape[0]*a.shape[1]*a.shape[2]
6 downvotes? are there even 6 people knowing cython out there?
– Jean-François Fabre
Apr 19 at 11:30
Downvotes seem a bit excessive to me - while the example isn't quite complete, it isn't a debugging problem anyway, and we have enough information to know what OP is trying to do.
– DavidW
Apr 19 at 13:19
Sometimes people downvote simply based on the wording. It doesn't mean that they understand the problem, or even the coding language.
– hpaulj
Apr 19 at 16:18
For what it's worth the downvotes seem to have been removed and I wonder if it was related to this (the meta question refers to a different question, but it's a similar time and the same number of votes)
– DavidW
Apr 20 at 13:44
2 Answers
2
Already by looking at the produced C-code, you can already see that size
is a property and not a simple C-member. Here is the original Cython-code for memory-views:
size
@cname('__pyx_memoryview')
cdef class memoryview(object):
...
cdef object _size
...
@property
def size(self):
if self._size is None:
result = 1
for length in self.view.shape[:self.view.ndim]:
result *= length
self._size = result
return self._size
It is easy to see, that the product is calculated only once and then cached. Clearly it doesn't play a big role for 3 dimensional arrays, but for a higher number of dimensions caching could become pretty important (as we will see, there are at most 8 dimensions, so it is not that clearly cut, whether this caching is really worth it).
One can understand the decision to lazily calculate the size
- after all, size
is not always needed/used and one doesn't want to pay for it. Clearly, there is a price to pay for this laziness if you use the size
a lot - that is the trade off cython makes.
size
size
size
I would not dwell too long on the overhead of calling a.size
- it is nothing compared to the overhead of calling a cython-function from python.
a.size
For example, the measurements of @danny measure only this python-call overhead and not the actual performance of the different approaches. To show this, I throw a third function into the mix:
%%cython
...
def both():
a.size+a.shape[0]*a.shape[1]*a.shape[2]
which does double amount of the work, but
>>> %timeit mv_size
22.5 ns ± 0.0864 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit mv_product
20.7 ns ± 0.087 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>>%timeit both
21 ns ± 0.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
is just as fast. On the other hand:
%%cython
...
def nothing():
pass
isn't faster:
%timeit nothing
24.3 ns ± 0.854 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In a nutshell: I would use a.size
because of the readability, assuming that optimizing that would not speed up my application, unless profiling proves something different.
a.size
The whole story: the variable a
is of type __Pyx_memviewslice
and not of type __pyx_memoryview
as one could think. The struct __Pyx_memviewslice
has the following definition:
a
__Pyx_memviewslice
__pyx_memoryview
__Pyx_memviewslice
struct __pyx_memoryview_obj;
typedef struct {
struct __pyx_memoryview_obj *memview;
char *data;
Py_ssize_t shape[8];
Py_ssize_t strides[8];
Py_ssize_t suboffsets[8];
} __Pyx_memviewslice;
that means, shape
can be accessed very efficiently by the Cython-code, as it is a simple C-array (btw. I ask my self, what happens if there are more than 8 dimensions? - the answer is: you cannot have more than 8 dimensions).
shape
The member memview
is where the memory is hold and __pyx_memoryview_obj
is the C-Extension which is produce from the cython-code we saw above and looks as follows:
memview
__pyx_memoryview_obj
/* "View.MemoryView":328
*
* @cname('__pyx_memoryview')
* cdef class memoryview(object): # <<<<<<<<<<<<<<
*
* cdef object obj
*/
struct __pyx_memoryview_obj {
PyObject_HEAD
struct __pyx_vtabstruct_memoryview *__pyx_vtab;
PyObject *obj;
PyObject *_size;
PyObject *_array_interface;
PyThread_type_lock lock;
__pyx_atomic_int acquisition_count[2];
__pyx_atomic_int *acquisition_count_aligned_p;
Py_buffer view;
int flags;
int dtype_is_object;
__Pyx_TypeInfo *typeinfo;
};
So, Pyx_memviewslice
is not really a Python object -it is kind of convenience wrapper, which caches important data, like shape
and stride
so this information can be accessed fast and cheap.
Pyx_memviewslice
shape
stride
What happens when we call a.size
? First, __pyx_memoryview_fromslice
is called which does some additional reference counting and some further stuff and returns the member memview
from the __Pyx_memviewslice
-object.
a.size
__pyx_memoryview_fromslice
memview
__Pyx_memviewslice
Then the property size
is called on this returned memoryview, which accesses the cached value in _size
as have been shown in the Cython code above.
size
_size
It looks as if the python-programmers introduced a shortcut for such important information as shape
, strides
and suboffsets
, but not for the size
which is probably not so important - this is the reason for cleaner C-code in the case of shape
.
shape
strides
suboffsets
size
shape
Thank you!
size
being a pure Python property fits my observations. From your link, it would seem as though shape
is also such a property, and that it returns a tuple
. Though in practice, a.shape
returns a Py_ssize_t
pointer.– jmd_dk
Apr 19 at 16:43
size
shape
tuple
a.shape
Py_ssize_t
The generated C code for a.size
looks fine.
a.size
It has to interface with Python because memory views are python extension types. size
on the memory view is a python attribute and gets converted to ssize_t
. That is all the C code does. The conversion can be avoided by typing the size
variable as Py_ssize_t
rather than ssize_t
.
size
ssize_t
size
Py_ssize_t
ssize_t
So there is not anything in the C code that looks unoptimised - it's just looking up an attribute on a python object, size on a memory view in this case.
Here are results of micro-benchmark for the two methods.
Setup:
cimport numpy as np
import numpy as np
cimport cython
cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')
def mv_size():
return a.size
def mv_product():
return a.shape[0]*a.shape[1]*a.shape[2]
Results:
%timeit mv_size
10000000 loops, best of 3: 23.4 ns per loop
%timeit mv_product
10000000 loops, best of 3: 23.4 ns per loop
Performance is pretty much identical.
The product method is purely C code which matters if it needs to be executed in parallel, but otherwise there is no performance benefit over memory view size
.
size
I will test the performance difference later myself. If there truly are no difference, there is of course no problem. Why
a.shape[0]
produces much cleaner C code than a.size
remains a mystery to me though.– jmd_dk
Apr 19 at 13:55
a.shape[0]
a.size
Also, I do not get why it matters if the code is to be executed in parallel. And it is indeed...
– jmd_dk
Apr 19 at 14:01
The
a.size
C code is pretty clean - as clean as any C code that Cython generates :) There is more of it because of the Python interaction and error checking that Cython does.– danny
Apr 19 at 14:37
a.size
For 'parallelisation', I meant outside of python, ie with Cython's
prange
and openmp interaction. That requires that the parallel code is C only, no python interaction. That means memory views cannot be used so whether size is a product of memory view share or an attribute does not matter - memory views cannot be used in prange
. If using python threading then there are no restrictions.– danny
Apr 19 at 14:39
prange
prange
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Why all the down votes?
– jmd_dk
Apr 19 at 11:27