Mele A2000 (Allwinner A10) - ssvb/tinymembench GitHub Wiki

Processor       : ARMv7 Processor rev 2 (v7l)
BogoMIPS        : 1001.88
Features        : swp half thumb fastmult vfp edsp neon vfpv3 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x3
CPU part        : 0xc08
CPU revision    : 2

Hardware        : sun4i
Revision        : 0000
Serial          : 0000000000000000

The 2x16-bit DDR3 memory is clocked at 480 MHz (up from the default 360 MHz).
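The theoretical peak of this memory interface can be worked out from the line above, which gives a reference point for the results that follow. This is a minimal sketch under two assumptions: "2x16-bit" means two 16-bit DDR3 chips forming one 32-bit bus, and DDR moves one bus width of data on each clock edge. `ddr_peak_mb_per_s` is a hypothetical helper, and real memory controllers do not reach this figure in practice because of refresh, bank conflicts and command overhead.

```python
def ddr_peak_mb_per_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Theoretical peak DDR bandwidth in MB/s (1 MB = 1000000 bytes, as in tinymembench).

    Assumption: DDR performs two transfers per clock (rising and falling edge),
    and each transfer moves bus_width_bits / 8 bytes.
    """
    transfers_per_s = 2 * clock_mhz * 1_000_000
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_s * bytes_per_transfer / 1_000_000


BUS_WIDTH_BITS = 2 * 16  # assumed: two 16-bit chips side by side -> 32-bit bus

for clock_mhz in (360, 480):
    peak = ddr_peak_mb_per_s(clock_mhz, BUS_WIDTH_BITS)
    print(f"{clock_mhz} MHz: {peak:.0f} MB/s theoretical peak")

# Compare with the best write-only result of the first run (NEON fill, 1466.8 MB/s)
print(f"NEON fill reaches {1466.8 / ddr_peak_mb_per_s(480, BUS_WIDTH_BITS):.1%} of peak")
```

At 480 MHz this gives 3840 MB/s (2880 MB/s at the default 360 MHz), so the NEON fill result in the first run corresponds to roughly 38% of the theoretical peak.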

tinymembench v0.2 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================
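Note 3 describes the 2-pass copy only in words. The sketch below shows the data flow it refers to: each chunk is first copied from the source into a small temporary buffer and only then from that buffer to the destination. It is an illustration written for this page, not tinymembench's own code (which is written in C and ARM assembly). The 4096-byte temporary buffer is an assumed example size, and Python gives no control over cache residency, so this demonstrates the structure of the technique rather than its performance.

```python
def two_pass_copy(dst: bytearray, src: bytes, tmp_size: int = 4096) -> None:
    """Copy src into dst chunk by chunk through a small temporary buffer.

    Pass 1 moves a chunk from the source into tmp, pass 2 moves it from tmp to
    the destination. In a native implementation tmp would be sized to stay in
    the L1 data cache; 4096 bytes is only an assumed example value.
    """
    if len(dst) < len(src):
        raise ValueError("destination buffer is smaller than the source")
    tmp = memoryview(bytearray(tmp_size))
    src_view = memoryview(src)
    dst_view = memoryview(dst)
    for offset in range(0, len(src), tmp_size):
        n = min(tmp_size, len(src) - offset)
        tmp[:n] = src_view[offset:offset + n]    # pass 1: source -> temporary buffer
        dst_view[offset:offset + n] = tmp[:n]    # pass 2: temporary buffer -> destination
```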

 C copy backwards                                     :    322.9 MB/s (1.3%)
 C copy                                               :    320.9 MB/s (1.2%)
 C copy prefetched (32 bytes step)                    :    730.2 MB/s
 C copy prefetched (64 bytes step)                    :    730.4 MB/s
 C 2-pass copy                                        :    291.5 MB/s (1.1%)
 C 2-pass copy prefetched (32 bytes step)             :    560.8 MB/s (1.6%)
 C 2-pass copy prefetched (64 bytes step)             :    560.9 MB/s (1.7%)
 C fill                                               :   1461.8 MB/s (0.4%)
 ---
 standard memcpy                                      :    570.6 MB/s
 standard memset                                      :   1460.8 MB/s
 ---
 NEON read                                            :   1326.3 MB/s
 NEON read prefetched (32 bytes step)                 :   1055.6 MB/s
 NEON read prefetched (64 bytes step)                 :   1059.0 MB/s (0.2%)
 NEON copy                                            :    862.3 MB/s
 NEON copy prefetched (32 bytes step)                 :    917.8 MB/s
 NEON copy prefetched (64 bytes step)                 :    920.6 MB/s (0.1%)
 NEON unrolled copy                                   :    862.6 MB/s
 NEON unrolled copy prefetched (32 bytes step)        :    807.9 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :    808.4 MB/s (0.1%)
 NEON copy backwards                                  :    865.2 MB/s
 NEON copy backwards prefetched (32 bytes step)       :    918.1 MB/s
 NEON copy backwards prefetched (64 bytes step)       :    918.9 MB/s (0.1%)
 NEON 2-pass copy                                     :    621.0 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    622.2 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :    639.1 MB/s
 NEON unrolled 2-pass copy                            :    637.9 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    590.8 MB/s
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    600.2 MB/s
 NEON fill                                            :   1466.8 MB/s
 NEON fill backwards                                  :   1466.8 MB/s
 ARM fill (STRD)                                      :   1462.0 MB/s
 ARM fill (STM with 8 registers)                      :   1463.2 MB/s
 ARM fill (STM with 4 registers)                      :   1462.2 MB/s
 ARM copy prefetched (incr pld)                       :    774.0 MB/s
 ARM copy prefetched (wrap pld)                       :    775.7 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    567.4 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :    543.2 MB/s
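Per Note 2, a copy result counts each byte once, even though every byte is both read from the source and written to the destination. Doubling a copy figure therefore gives the combined read+write traffic, which makes it comparable with the write-only fill results. A minimal sketch using numbers from the run above; it assumes each copied byte crosses the memory bus exactly once in each direction and ignores effects such as write-allocate line fills. `read_write_traffic` is a hypothetical helper.

```python
# Selected copy results from the run above, in MB/s (1 MB = 1000000 bytes)
copy_results = {
    "C copy prefetched (64 bytes step)": 730.4,
    "ARM copy prefetched (wrap pld)": 775.7,
    "NEON copy prefetched (64 bytes step)": 920.6,
}
NEON_FILL_MB_PER_S = 1466.8  # write-only traffic, from the same run


def read_write_traffic(copy_mb_per_s: float) -> float:
    """Combined read + write traffic: every copied byte is read once and written once."""
    return 2 * copy_mb_per_s


for name, rate in copy_results.items():
    traffic = read_write_traffic(rate)
    print(f"{name:40}: {traffic:7.1f} MB/s read+write "
          f"({traffic / NEON_FILL_MB_PER_S:.0%} of NEON fill)")
```

For example, the prefetched C copy at 730.4 MB/s amounts to 1460.8 MB/s of combined traffic, essentially the same as the roughly 1462 MB/s fill results.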

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with total 3 requests to SDRAM for almost every      ==
== memory access (though 64MiB is not large enough to experience this   ==
== effect to its fullest).                                              ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 1: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================
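The latency test relies on random accesses that the hardware can neither predict nor prefetch. A common way to build such a test, sketched here as a generic example and not taken from tinymembench's source, is to arrange the buffer as a single random cycle of indices (Sattolo's algorithm), so that every load depends on the result of the previous one. The dual variant walks two independent dependency chains in the same loop, which gives the memory system a chance to service two misses at once. The helper names are hypothetical, and in Python the interpreter overhead far exceeds DRAM latency, so this shows the access pattern only; a real measurement would be written in C over a raw buffer.

```python
import random


def make_random_cycle(n: int, rng: random.Random) -> list:
    """Sattolo's algorithm: returns next_index such that following
    i -> next_index[i] visits all n slots once before returning to the start."""
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i; excluding j == i is what yields a single cycle
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def single_chase(next_index: list, start: int, steps: int) -> int:
    """One dependency chain: each index comes from the previous load."""
    i = start
    for _ in range(steps):
        i = next_index[i]
    return i


def dual_chase(next_index: list, start_a: int, start_b: int, steps: int) -> tuple:
    """Two independent chains interleaved; neither load waits for the other."""
    a, b = start_a, start_b
    for _ in range(steps):
        a = next_index[a]
        b = next_index[b]
    return a, b
```

Walking a cycle of n elements for n steps returns to the starting index, which is a convenient sanity check that the permutation really is a single cycle.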

block size : read access time (single random read / dual random read)
         2 :    0.0 ns  /     0.0 ns 
         4 :    0.0 ns  /     0.0 ns 
         8 :    0.0 ns  /     0.0 ns 
        16 :    0.0 ns  /     0.0 ns 
        32 :    0.0 ns  /     0.0 ns 
        64 :    0.0 ns  /     0.0 ns 
       128 :    0.0 ns  /     0.0 ns 
       256 :    0.0 ns  /     0.0 ns 
       512 :    0.0 ns  /     0.0 ns 
      1024 :    0.0 ns  /     0.0 ns 
      2048 :    0.0 ns  /     0.0 ns 
      4096 :    0.0 ns  /     0.0 ns 
      8192 :    0.0 ns  /     0.0 ns 
     16384 :    0.0 ns  /     0.0 ns 
     32768 :    0.0 ns  /     0.0 ns 
     65536 :    5.0 ns  /     9.7 ns 
    131072 :    7.5 ns  /    14.7 ns 
    262144 :   36.5 ns  /    70.6 ns 
    524288 :  106.1 ns  /   213.7 ns 
   1048576 :  144.4 ns  /   292.4 ns 
   2097152 :  164.1 ns  /   332.8 ns 
   4194304 :  174.8 ns  /   354.5 ns 
   8388608 :  181.8 ns  /   369.0 ns 
  16777216 :  188.8 ns  /   385.4 ns 
  33554432 :  200.1 ns  /   418.0 ns 
  67108864 :  221.8 ns  /   471.5 ns
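The note on dual random reads says that, if the memory subsystem cannot handle multiple outstanding requests, a dual random read costs as much as two single reads. Dividing the two columns makes this easy to check. A minimal sketch using a few rows from the table above; `dual_to_single_ratio` is a hypothetical helper.

```python
# (block size in bytes, single random read ns, dual random read ns), from the table above
rows = [
    (65536, 5.0, 9.7),
    (262144, 36.5, 70.6),
    (1048576, 144.4, 292.4),
    (67108864, 221.8, 471.5),
]


def dual_to_single_ratio(single_ns: float, dual_ns: float) -> float:
    """About 2.0 if the two reads are serviced one after another,
    noticeably below 2.0 if they overlap."""
    return dual_ns / single_ns


for size, single_ns, dual_ns in rows:
    print(f"{size:>9} bytes: dual/single = {dual_to_single_ratio(single_ns, dual_ns):.2f}")
```

All the ratios come out between roughly 1.9 and 2.1, which is consistent with the note's description of a memory subsystem that does not overlap two independent misses. On its own this does not identify which part of the system (core, caches or DRAM controller) is the limit.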

Driving a 1920x1080-32@60Hz HDMI monitor
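The display controller refreshes the screen by continuously reading the framebuffer from the same DRAM that the benchmark uses, so an attached monitor takes away part of the available bandwidth. A rough estimate, assuming that "-32" means 32 bits per pixel, that there is a single framebuffer layer, and that no pixels are fetched during blanking intervals (`scanout_mb_per_s` is a hypothetical helper):

```python
def scanout_mb_per_s(width: int, height: int, bits_per_pixel: int, refresh_hz: float) -> float:
    """Framebuffer read bandwidth in MB/s (1 MB = 1000000 bytes) for one display."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * refresh_hz / 1_000_000


one = scanout_mb_per_s(1920, 1080, 32, 60)
print(f"one 1920x1080-32@60Hz monitor:  {one:.1f} MB/s")      # 497.7 MB/s
print(f"two 1920x1080-32@60Hz monitors: {2 * one:.1f} MB/s")  # 995.3 MB/s
```

For comparison, C fill drops from 1461.8 MB/s in the first run to 555.6 MB/s in the results below, a larger drop than this raw scanout figure alone would suggest.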

tinymembench v0.2 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    317.4 MB/s
 C copy                                               :    316.4 MB/s
 C copy prefetched (32 bytes step)                    :    525.9 MB/s
 C copy prefetched (64 bytes step)                    :    526.1 MB/s
 C 2-pass copy                                        :    267.2 MB/s (1.2%)
 C 2-pass copy prefetched (32 bytes step)             :    516.8 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    517.0 MB/s
 C fill                                               :    555.6 MB/s
 ---
 standard memcpy                                      :    518.0 MB/s (2.1%)
 standard memset                                      :    555.8 MB/s
 ---
 NEON read                                            :   1224.1 MB/s
 NEON read prefetched (32 bytes step)                 :    985.5 MB/s
 NEON read prefetched (64 bytes step)                 :    985.5 MB/s (3.0%)
 NEON copy                                            :    530.4 MB/s
 NEON copy prefetched (32 bytes step)                 :    532.7 MB/s (1.6%)
 NEON copy prefetched (64 bytes step)                 :    533.2 MB/s (1.6%)
 NEON unrolled copy                                   :    530.8 MB/s (0.1%)
 NEON unrolled copy prefetched (32 bytes step)        :    528.8 MB/s (1.6%)
 NEON unrolled copy prefetched (64 bytes step)        :    528.7 MB/s
 NEON copy backwards                                  :    603.2 MB/s
 NEON copy backwards prefetched (32 bytes step)       :    625.0 MB/s
 NEON copy backwards prefetched (64 bytes step)       :    625.1 MB/s
 NEON 2-pass copy                                     :    520.2 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    520.3 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :    521.0 MB/s
 NEON unrolled 2-pass copy                            :    521.2 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    519.0 MB/s
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    519.0 MB/s
 NEON fill                                            :    556.5 MB/s
 NEON fill backwards                                  :   1040.1 MB/s
 ARM fill (STRD)                                      :    556.5 MB/s
 ARM fill (STM with 8 registers)                      :    556.5 MB/s
 ARM fill (STM with 4 registers)                      :    556.5 MB/s
 ARM copy prefetched (incr pld)                       :    528.0 MB/s
 ARM copy prefetched (wrap pld)                       :    526.2 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    516.2 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :    490.5 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with total 3 requests to SDRAM for almost every      ==
== memory access (though 64MiB is not large enough to experience this   ==
== effect to its fullest).                                              ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 1: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : read access time (single random read / dual random read)
         2 :    0.0 ns  /     0.0 ns 
         4 :    0.0 ns  /     0.0 ns 
         8 :    0.0 ns  /     0.0 ns 
        16 :    0.0 ns  /     0.0 ns 
        32 :    0.0 ns  /     0.0 ns 
        64 :    0.0 ns  /     0.0 ns 
       128 :    0.0 ns  /     0.0 ns 
       256 :    0.0 ns  /     0.0 ns 
       512 :    0.0 ns  /     0.0 ns 
      1024 :    0.0 ns  /     0.0 ns 
      2048 :    0.0 ns  /     0.0 ns 
      4096 :    0.0 ns  /     0.0 ns 
      8192 :    0.0 ns  /     0.0 ns 
     16384 :    0.0 ns  /     0.0 ns 
     32768 :    0.0 ns  /     0.0 ns 
     65536 :    5.0 ns  /     9.7 ns 
    131072 :    7.5 ns  /    14.8 ns 
    262144 :   36.9 ns  /    72.1 ns 
    524288 :  108.5 ns  /   218.3 ns 
   1048576 :  147.4 ns  /   298.8 ns 
   2097152 :  167.8 ns  /   340.0 ns 
   4194304 :  178.6 ns  /   362.2 ns 
   8388608 :  185.6 ns  /   376.3 ns 
  16777216 :  191.8 ns  /   388.9 ns 
  33554432 :  201.7 ns  /   411.9 ns 
  67108864 :  221.1 ns  /   463.4 ns

Driving two 1920x1080-32@60Hz monitors (HDMI and VGA) at the same time

tinymembench v0.2.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    209.6 MB/s
 C copy                                               :    217.8 MB/s
 C copy prefetched (32 bytes step)                    :    296.2 MB/s (0.5%)
 C copy prefetched (64 bytes step)                    :    296.1 MB/s
 C 2-pass copy                                        :    167.3 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    247.4 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    247.3 MB/s
 C fill                                               :    519.0 MB/s
 ---
 standard memcpy                                      :    315.7 MB/s
 standard memset                                      :    519.0 MB/s
 ---
 NEON read                                            :    814.3 MB/s (1.2%)
 NEON read prefetched (32 bytes step)                 :    589.4 MB/s
 NEON read prefetched (64 bytes step)                 :    590.4 MB/s
 NEON copy                                            :    351.6 MB/s
 NEON copy prefetched (32 bytes step)                 :    351.2 MB/s
 NEON copy prefetched (64 bytes step)                 :    349.7 MB/s
 NEON unrolled copy                                   :    358.1 MB/s
 NEON unrolled copy prefetched (32 bytes step)        :    313.4 MB/s
 NEON unrolled copy prefetched (64 bytes step)        :    348.1 MB/s
 NEON copy backwards                                  :    340.4 MB/s (0.5%)
 NEON copy backwards prefetched (32 bytes step)       :    341.3 MB/s
 NEON copy backwards prefetched (64 bytes step)       :    341.3 MB/s
 NEON 2-pass copy                                     :    304.6 MB/s
 NEON 2-pass copy prefetched (32 bytes step)          :    301.0 MB/s
 NEON 2-pass copy prefetched (64 bytes step)          :    303.8 MB/s
 NEON unrolled 2-pass copy                            :    305.0 MB/s
 NEON unrolled 2-pass copy prefetched (32 bytes step) :    268.4 MB/s (0.4%)
 NEON unrolled 2-pass copy prefetched (64 bytes step) :    296.1 MB/s
 NEON fill                                            :    519.1 MB/s
 NEON fill backwards                                  :    495.5 MB/s
 ARM fill (STRD)                                      :    519.1 MB/s
 ARM fill (STM with 8 registers)                      :    519.0 MB/s
 ARM fill (STM with 4 registers)                      :    519.1 MB/s
 ARM copy prefetched (incr pld)                       :    347.6 MB/s (0.5%)
 ARM copy prefetched (wrap pld)                       :    317.4 MB/s
 ARM 2-pass copy prefetched (incr pld)                :    281.2 MB/s
 ARM 2-pass copy prefetched (wrap pld)                :    268.8 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with total 3 requests to SDRAM for almost every      ==
== memory access (though 64MiB is not large enough to experience this   ==
== effect to its fullest).                                              ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : read access time (single random read / dual random read)
         2 :    0.0 ns  /     0.0 ns 
         4 :    0.0 ns  /     0.0 ns 
         8 :    0.0 ns  /     0.0 ns 
        16 :    0.0 ns  /     0.0 ns 
        32 :    0.0 ns  /     0.0 ns 
        64 :    0.0 ns  /     0.0 ns 
       128 :    0.0 ns  /     0.0 ns 
       256 :    0.0 ns  /     0.0 ns 
       512 :    0.0 ns  /     0.0 ns 
      1024 :    0.0 ns  /     0.0 ns 
      2048 :    0.0 ns  /     0.0 ns 
      4096 :    0.0 ns  /     0.0 ns 
      8192 :    0.0 ns  /     0.0 ns 
     16384 :    0.0 ns  /     0.0 ns 
     32768 :    0.0 ns  /     0.0 ns 
     65536 :    5.0 ns  /     9.7 ns 
    131072 :    7.5 ns  /    14.8 ns 
    262144 :   37.5 ns  /    74.4 ns 
    524288 :  118.6 ns  /   241.0 ns 
   1048576 :  164.5 ns  /   336.4 ns 
   2097152 :  188.2 ns  /   385.8 ns 
   4194304 :  201.1 ns  /   412.5 ns 
   8388608 :  209.4 ns  /   429.6 ns 
  16777216 :  217.6 ns  /   445.9 ns 
  33554432 :  229.5 ns  /   469.9 ns 
  67108864 :  254.8 ns  /   522.1 ns
