Python, Cython, and then a 14000x speedup

Cython shows up in a lot of Python projects because of the performance benefit of translating Python to C. So let's have a look at some actual performance numbers.

Again, a very simple Fibonacci benchmark, written in Python:

```python
def fibonacci_cy(n):
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b
```

Environment:

CPU: Intel Xeon Platinum 8180 @ 2.5 GHz (bound to a single logical core)

OS: RHEL 7.5 (patched for Meltdown and Spectre)

test/python/fib.pyx (defines the benchmark functions)

test/python/main.py (benchmark driver)
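
The repo's build script and driver are not reproduced on this page, so here is a minimal sketch of how fib.pyx could be compiled and timed; the setup.py contents, the driver structure, the loop count, and the output format are assumptions, not the repo's actual code:

```python
# setup.py (assumed): build fib.pyx into an importable extension module with
#   python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fib.pyx"))
```

```python
# main.py-style driver (assumed): pick an implementation by name on the command
# line, time one call, and print the elapsed seconds.
import sys
import time

import fib  # the module built from fib.pyx


def fibonacci_py(n):
    # pure-Python baseline, same algorithm as fib.pyx
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b


FUNCS = {
    "python": ("Python", fibonacci_py),
    "cython": ("Cython", fib.fibonacci_cy),
    "cython_typing": ("Cython", fib.fibonacci_cy_styping),
}

if __name__ == "__main__":
    label, func = FUNCS[sys.argv[1]]
    n = 100_000  # loop count is an assumption; the wiki does not state it
    start = time.time()
    func(n)
    print("%s: %.5f" % (label, time.time() - start))
```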

Test result:

| Implementation | Time (s) | Speedup |
| --- | --- | --- |
| Python | 8.02360 | 1.000x |
| Cython | 7.90431 | 1.015x |

Profile:

```
root@dr1:/perf_tuning_results/test/python# perf stat -e $PERF_OPTS numactl --physcpubind=2 --localalloc python main.py cython

 Performance counter stats for 'numactl --physcpubind=2 --localalloc python main.py cython':

     7,182,336,770      cycles
    29,333,087,481      instructions              #    4.08  insn per cycle
         2,099,512      cache-references
            38,691      cache-misses              #    1.843 % of all cache refs

       2.561079277 seconds time elapsed

root@dr1:/perf_tuning_results/test/python# perf stat -e $PERF_OPTS numactl --physcpubind=2 --localalloc python main.py python
Python: 2.54462

 Performance counter stats for 'numactl --physcpubind=2 --localalloc python main.py python':

     7,193,481,405      cycles
    29,438,580,777      instructions              #    4.09  insn per cycle
         2,299,854      cache-references
            36,819      cache-misses              #    1.601 % of all cache refs

       2.569538297 seconds time elapsed
```

Wait a minute. Why so many instructions? Both runs retire essentially the same ~29 billion instructions at ~4 insn per cycle (29,333,087,481 / 7,182,336,770 ≈ 4.08), so compiling with Cython alone barely changes what the CPU is doing. What is actually running on my CPU?

(perf report screenshots: malloc and eval hotspots)

The x_add function takes most of the CPU cycles, but the root cause is object creation (PyObject_Malloc) and bytecode evaluation (_PyEval_EvalFrameDefault).
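
To see what the interpreter is actually dispatching, the standard dis module is enough; this quick sketch is not part of the repo:

```python
import dis


def fib_loop(n):
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b


# Every iteration is dispatched opcode by opcode inside _PyEval_EvalFrameDefault:
# FOR_ITER, LOAD_FAST/STORE_FAST, and an add opcode (BINARY_ADD, or BINARY_OP on
# Python 3.11+). Each addition produces a new int object, which is where
# PyObject_Malloc and x_add show up once the values outgrow the small-int cache.
dis.dis(fib_loop)
```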

Let's see what's in the heap:

(heap snapshot screenshot: top object types)
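
The heap screenshot is not reproduced here, and the tool behind it isn't named; a rough equivalent with the standard library's tracemalloc looks like this (the loop count and output are illustrative assumptions):

```python
import tracemalloc


def fib(n):
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b


tracemalloc.start()
result = fib(200_000)  # loop count chosen only for illustration
current, peak = tracemalloc.get_traced_memory()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print("current: %d bytes, peak: %d bytes" % (current, peak))
for stat in snapshot.statistics("lineno")[:5]:
    # the largest surviving allocation is the arbitrary-precision int
    # created on the `a, b = b, a + b` line
    print(stat)
```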

Let's try replacing the Python objects with C variables (static typing in Cython):

```cython
def fibonacci_cy_styping(int n):
    # a and b are plain C ints, so the loop creates no Python objects.
    # (Note: a C int wraps around for large n, unlike Python's arbitrary-precision ints.)
    cdef int _
    cdef int a = 0, b = 1
    for _ in range(1, n):
        a, b = b, a + b
    return b
```
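
As a side note, the same static typing can also be expressed in Cython's "pure Python" mode, which keeps the file importable as ordinary Python and only turns the variables into C ints when compiled; a minimal sketch (the file name and function name are hypothetical, not from the repo):

```python
# fib_pure.py (hypothetical): pure-Python-mode version of the typed loop
import cython


@cython.locals(a=cython.int, b=cython.int, i=cython.int)
def fibonacci_typed(n: cython.int):
    # Compiled by Cython, a and b become C ints and the loop creates no Python
    # objects; run as plain Python, the decorator and annotations are no-ops.
    a, b = 0, 1
    for i in range(1, n):
        a, b = b, a + b
    return b
```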

Test again:

| Implementation | Time (s) | Speedup |
| --- | --- | --- |
| Python | 8.02360 | 1.000x |
| Cython | 7.90431 | 1.015x |
| Static | 0.00057 | 14133.495x |

Profile of the statically typed version: only ~86 million instructions and ~73 million cycles, down from ~29 billion.

```
root@dr1:/perf_tuning_results/test/python# perf stat -e $PERF_OPTS numactl --physcpubind=2 --localalloc python main.py cython_typing
Cython: 0.00039

 Performance counter stats for 'numactl --physcpubind=2 --localalloc python main.py cython_typing':

        73,403,574      cycles
        86,205,462      instructions              #    1.17  insn per cycle
         1,468,203      cache-references
            15,874      cache-misses              #    1.081 % of all cache refs

       0.026663483 seconds time elapsed
```

Conclusion:

I never thought object operations were this slow. Does that mean we should get rid of objects in Python wherever performance matters?