Python, Cython, and then a 14000x speedup
Cython is used in many Python projects because of the performance benefit of translating Python to C, so let's take a look at the performance numbers.
Again, a very simple Fibonacci benchmark written in Python:
```cython
def fibonacci_cy(n):
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b
```
Environment:
- CPU: Intel Xeon Platinum 8180 @ 2.5GHz (pinned to a single logical core)
- OS: RHEL 7.5 (Meltdown & Spectre variant 3 patched)
- test/python/fib.pyx (defines the benchmark functions)
- test/python/main.py (benchmark driver)
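The wiki doesn't reproduce main.py, so here is a minimal sketch of how the harness could look. The module and function names come from the files above, but the pyximport build step, the bench helper, and the iteration counts are my assumptions, not the original code:

```python
# Hypothetical main.py sketch: compiles fib.pyx on import and times one variant.
import sys
import time

import pyximport
pyximport.install()  # build fib.pyx transparently on first import

import fib  # from test/python/fib.pyx


def fibonacci_py(n):
    # Pure-Python baseline, written the same way as fibonacci_cy.
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b


def bench(fn, label, n=30, repeat=1000000):
    start = time.time()
    for _ in range(repeat):
        fn(n)
    print("%s: %.5f" % (label, time.time() - start))


if __name__ == "__main__":
    variant = sys.argv[1]
    if variant == "cython":
        bench(fib.fibonacci_cy, "Cython")
    elif variant == "cython_typing":
        bench(fib.fibonacci_cy_styping, "Cython")
    else:
        bench(fibonacci_py, "Python")
```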
Test result:

| Variant | Time (seconds) | Speedup |
|---------|----------------|---------|
| Python  | 8.02360        | 1.000x  |
| Cython  | 7.90431        | 1.015x  |
Profile:
```
root@dr1:/perf_tuning_results/test/python# perf stat -e $PERF_OPTS numactl --physcpubind=2 --localalloc python main.py cython

 Performance counter stats for 'numactl --physcpubind=2 --localalloc python main.py cython':

     7,182,336,770      cycles
    29,333,087,481      instructions       #    4.08  insn per cycle
         2,099,512      cache-references
            38,691      cache-misses       #    1.843 % of all cache refs

       2.561079277 seconds time elapsed

root@dr1:/perf_tuning_results/test/python# perf stat -e $PERF_OPTS numactl --physcpubind=2 --localalloc python main.py python
Python: 2.54462

 Performance counter stats for 'numactl --physcpubind=2 --localalloc python main.py python':

     7,193,481,405      cycles
    29,438,580,777      instructions       #    4.09  insn per cycle
         2,299,854      cache-references
            36,819      cache-misses       #    1.601 % of all cache refs

       2.569538297 seconds time elapsed
```
Wait a minute. Why so many instructions? What is my CPU actually doing?
The x_add function (CPython's arbitrary-precision integer addition in Objects/longobject.c) takes most of the CPU cycles, but the root cause is object creation (PyObject_Malloc) and interpreter frame evaluation (_PyEval_EvalFrameDefault).
Let's see what's in the heap:
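The page doesn't show the profiling commands that produced this picture; one plausible way to find such hotspots (a sketch using standard perf subcommands, not taken from the original) is to sample call stacks during the run:

```
perf record -g -- numactl --physcpubind=2 --localalloc python main.py cython
perf report    # x_add, PyObject_Malloc and _PyEval_EvalFrameDefault dominate
```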
Let's try replacing the Python objects with C variables:
```cython
def fibonacci_cy_styping(int n):
    cdef int _
    cdef int a = 0, b = 1
    for _ in range(1, n):
        a, b = b, a + b
    return b
```
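One caveat worth adding (my note, not part of the original): a C int is typically 32 bits, so this version silently overflows once the Fibonacci numbers exceed 2^31 - 1 (around n = 47), whereas the Python version computes exact arbitrary-precision results. Part of the enormous speedup therefore comes from doing fixed-width machine arithmetic instead of heap-allocated bignum arithmetic:

```python
# Exact values, computed with Python's arbitrary-precision ints:
def fibonacci_py(n):
    a, b = 0, 1
    for _ in range(1, n):
        a, b = b, a + b
    return b

print(fibonacci_py(47))   # 2971215073: already larger than a signed 32-bit int
print(fibonacci_py(100))  # 354224848179261915075: does not even fit in 64 bits
```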
Test again:

| Variant       | Time (seconds) | Speedup    |
|---------------|----------------|------------|
| Python        | 8.02360        | 1.000x     |
| Cython        | 7.90431        | 1.015x     |
| Static typing | 0.00057        | 14133.495x |
The cython and python runs are identical to the profile above; here is the statically typed run:

```
root@dr1:/perf_tuning_results/test/python# perf stat -e $PERF_OPTS numactl --physcpubind=2 --localalloc python main.py cython_typing
Cython: 0.00039

 Performance counter stats for 'numactl --physcpubind=2 --localalloc python main.py cython_typing':

        73,403,574      cycles
        86,205,462      instructions       #    1.17  insn per cycle
         1,468,203      cache-references
            15,874      cache-misses       #    1.081 % of all cache refs

       0.026663483 seconds time elapsed
```
Conclusion:
I never imagined that object operations were this slow. Note that the interpreted runs actually show a higher IPC (4.08) than the typed run (1.17): the CPU pipeline was well utilized, it was just executing roughly 340x more instructions, almost all of them interpreter and object-management overhead. Does that mean we should get rid of objects in Python?
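To make the cost concrete (my own illustration, not part of the original benchmark): in CPython every integer is a heap-allocated PyObject, and arithmetic on large ints returns a freshly allocated object each time, which is exactly the PyObject_Malloc traffic the profile shows:

```python
import sys

a = 10**20
b = a + 0               # big-int arithmetic allocates a fresh object in CPython
print(a == b, a is b)   # True False: equal value, distinct heap objects

# Each object also carries header overhead beyond the raw digits:
print(sys.getsizeof(a))  # e.g. 36 bytes on a 64-bit CPython build
```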