Memory Tips - yszheda/wiki GitHub Wiki

Flame Graph

brk() syscall

brk() syscall sets the program break: the end of the heap segment (aka the process data segment). brk() isn't called by the application directly, but rather by the user-level allocator that provides the malloc() / free() interface. Such allocators typically don't give memory back to the OS, keeping freed memory as a cache for future allocations. And so brk() is typically used for growth only, not shrinking.

What brk() tracing can tell us is the code paths that lead to heap expansion. This could be either:

  • A memory growth code path
  • A memory leak code path
  • An innocent application code path that happened to spill over the current heap size
  • An asynchronous allocator code path that grew the application in response to diminishing free space

mmap() syscall

The mmap() syscall may be explicitly used by the application for loading data files or creating working segments, especially during initialization and application start. In this context, we're interested in creeping application growth, which may occur via mmap() if the allocator uses it instead of brk(). glibc does this for larger allocations, which can be returned to the system using munmap().

Unlike brk(), mmap() calls don't necessarily mean growth, as the mappings may be freed shortly afterwards using munmap(). And so tracing mmap() may show many new mappings, but most or all of them are neither growth nor leaks. If your system runs frequent short-lived processes (e.g., during a software build), the mmap() calls made as part of process initialization can flood the trace.

Profile

What every programmer should know about memory

3.3.3 Write Behavior

Write-combining is a limited caching optimization more often used for RAM on devices such as graphics cards. Since the transfer costs to the device are much higher than local RAM accesses, it is even more important to avoid doing too many transfers. Transferring an entire cache line just because a word in the line has been written is wasteful if the next operation modifies the next word. One can easily imagine that this is a common occurrence: the memory for horizontally neighboring pixels on a screen is in most cases adjacent, too. As the name suggests, write-combining combines multiple write accesses before the cache line is written out. In ideal cases the entire cache line is modified word by word and, only after the last word is written, the line is transferred to the device. This can speed up access to RAM on devices significantly.

4 Virtual Memory

4.1 Simplest Address Translation

4.2 Multi-Level Page Tables

page tree walking

It is therefore good for performance and scalability if the memory needed by the page table trees is as small as possible. The ideal case for this is to place the used memory close together in the virtual address space; the actual physical addresses used do not matter.

4.3 Optimizing Page Table Access

Prefetching code or data through software or hardware could implicitly prefetch entries for the TLB if the address is on another page. This cannot be allowed for hardware prefetching, because the hardware could initiate page table walks that are invalid. Programmers therefore cannot rely on hardware prefetching to prefetch TLB entries; it has to be done explicitly using prefetch instructions. // TODO

4.3.1 Caveats Of Using A TLB

The TLB is a processor-core global resource: all threads and processes executing on the core share the same TLB, so an entry for one address space must not be used for another. There are two ways to handle this whenever the page table tree changes (e.g., on a context switch):

  • The TLB is flushed.
  • The tags of the TLB entries are extended to additionally and uniquely identify the page table tree they refer to.

Flushing the TLB is effective but expensive.

One possibility for optimizing the flushes is to invalidate individual TLB entries instead of the whole TLB.

A much better solution is to extend the tag used for the TLB access. If, in addition to the part of the virtual address, a unique identifier for each page table tree (i.e., a process’s address space) is added, the TLB does not have to be completely flushed at all. The only issue with this scheme is that the number of bits available for the TLB tag is severely limited, while the number of address spaces is not. This means some identifier reuse is necessary.

4.3.2 Influencing TLB Performance
  1. Use large page sizes
  2. Reduce the number of TLB entries needed by moving data which is used at the same time to fewer pages.

4.4 Impact Of Virtualization

Instead, the VMM creates its own page table tree for each guest domain and hands out memory using these data structures. The good thing is that access to the administrative information of the page table tree can be controlled.

Intel has Extended Page Tables (EPTs) and AMD calls its version Nested Page Tables (NPTs). Basically, both technologies have the page tables of the guest OSes produce “host virtual addresses” from the “guest virtual address”. The host virtual addresses must then be further translated, using the per-domain EPT/NPT trees, into actual physical addresses. This allows memory handling at almost the speed of the no-virtualization case, since most VMM entries for memory handling are removed. It also reduces the memory use of the VMM, since now only one page table tree per domain (as opposed to per process) has to be maintained.

Overall, programmers must be aware that, with virtualization used, the cost of cache misses (instruction, data, or TLB) is even higher than without virtualization. Any optimization which reduces this work will pay off even more in virtualized environments.

5 NUMA Support

5.1 NUMA Hardware

5.2 OS Support for NUMA

SRAM DRAM

Page

Working Set