VirtualDub Plugin SDK 1.2
CPU dependent optimization
Depending on the CPU in the user's machine, it may be possible to use certain instruction set extensions to accelerate execution of the filter, such as MMX or SSE2. Good use of these extensions can result in significant speedups, as much as 2-4x. However, optional extensions must be checked for before they are used.
The video filter API exports several entry points that allow the filter to query for optional CPU features. Although it is possible for the filter to query the CPU directly, using the host callbacks has the benefit that the filter tracks any CPU feature override UI that is in the host. The isFPUEnabled() call returns true if FPU (x87) optimizations should be used; isMMXEnabled() returns true if MMX should be used. For more advanced features, getCPUFlags() also reports support for integer SSE, SSE, SSE2, 3DNow!, and 3DNow! Professional.
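As a rough illustration of dispatching on these callbacks, the sketch below picks a code path from the host's answers. HostCpuCallbacks is a hypothetical stand-in for the host-provided function table, and kFlagSSE2_Assumed is a placeholder bit value; real code should use the table and flag constants declared in the SDK headers.

```cpp
#include <cassert>

// Hypothetical stand-in for the three host callbacks named above. In a real
// filter these come from the function table the host passes to the plugin.
struct HostCpuCallbacks {
    bool (*isFPUEnabled)();
    bool (*isMMXEnabled)();
    long (*getCPUFlags)();
};

// Placeholder bit value for illustration only (an assumption, not taken from
// the SDK); use the flag constants from the SDK headers in real code.
const long kFlagSSE2_Assumed = 0x20;

enum class RowKernel { Scalar, MMX, SSE2 };

// Picks the most capable code path that the host currently allows.
inline RowKernel chooseRowKernel(const HostCpuCallbacks& host) {
    assert(host.isMMXEnabled != nullptr && host.getCPUFlags != nullptr);

    if (host.getCPUFlags() & kFlagSSE2_Assumed)
        return RowKernel::SSE2;
    if (host.isMMXEnabled())
        return RowKernel::MMX;
    return RowKernel::Scalar;   // bare instruction set only
}
```

Because the decision goes through the host rather than a direct CPUID query, a user who disables an extension in the host's override UI also disables it in the filter.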
Keep in mind when writing code that the video filter API offers no guarantees with regard to any CPU extensions — only the bare instruction set is supported. In particular, on x86, neither MMX nor P6 conditional moves (CMOVcc/FCMOVcc) should be used before checking feature flags. On x64, SSE2 is standard, but 3DNow! is not.
Note: The host can change the value of flags on the fly in response to a change in the user preferences. A filter does not have to support all dynamic changes; it can cache the state of the flags when required.
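One way to cache the flags is to take a snapshot when a processing run begins and consult only that snapshot afterwards. The sketch below assumes this approach; FilterState, GetCpuFlagsFn, snapshotCpuFlags, and useExtension are hypothetical names, not part of the SDK.

```cpp
#include <cassert>

// Hypothetical per-instance filter state.
struct FilterState {
    long cpuFlags = 0;   // snapshot of the host's flags for the current run
};

// Stand-in for the host's getCPUFlags() callback.
typedef long (*GetCpuFlagsFn)();

// Call once when a processing run begins, for example from the filter's
// start callback (an assumption about where the snapshot is taken).
inline void snapshotCpuFlags(FilterState& state, GetCpuFlagsFn getCPUFlags) {
    assert(getCPUFlags != nullptr);
    state.cpuFlags = getCPUFlags();
}

// Per-frame code reads only the snapshot, so a mid-run change in the host
// preferences cannot switch code paths halfway through.
inline bool useExtension(const FilterState& state, long flagBit) {
    return (state.cpuFlags & flagBit) != 0;
}
```

The trade-off is that a preference change takes effect only at the next snapshot, which the note above explicitly permits.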
Note: If you are using compiler options to generate code that uses instruction set extensions, such as /QxW on Intel C/C++ or /arch:SSE2 with Microsoft Visual C++, you must ensure that such code is not executed until support for the extensions is verified. This can be done either by instructing the compiler to check for the extensions (ex: /QaxW in Intel C/C++), or selectively compiling the startup code for the filter with CPU-specific optimizations disabled. A sledgehammer-like method would be to compile only the module initialization routine in this manner and have it add filter entries to the host only if the host reports that the necessary CPU extensions are available.
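The sledgehammer approach can be sketched as follows. FakeHost is a hypothetical stand-in that only records registrations, the filter names are invented, and kFlagSSE2_Assumed is a placeholder bit value; a real module would use the SDK's module initialization entry point and host function table instead.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the host: reports CPU flags and records which
// filters the module registered.
struct FakeHost {
    long cpuFlags;
    std::vector<std::string> registered;

    long getCPUFlags() const { return cpuFlags; }
    void addFilter(const std::string& name) { registered.push_back(name); }
};

// Placeholder bit value for illustration only; use the SDK's constant.
const long kFlagSSE2_Assumed = 0x20;

// This routine is the only code that runs unconditionally, so it must live in
// a translation unit compiled WITHOUT options such as /arch:SSE2.
inline int initModule(FakeHost& host) {
    host.addFilter("example blur (baseline build)");
    assert(!host.registered.empty());

    // Filters whose code was compiled with SSE2 enabled are exposed only when
    // the host reports that SSE2 may be used.
    if (host.getCPUFlags() & kFlagSSE2_Assumed)
        host.addFilter("example blur (SSE2 build)");

    return static_cast<int>(host.registered.size());
}
```

On a machine without the extension the SSE2-compiled filter is never listed, so none of its code can be reached.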
VirtualDub specific: VirtualDub requires 80486 instructions to be supported, but does not guarantee support for instructions introduced in later CPUs.
When working with vector instruction sets it is frequently advantageous to have aligned data, or data that is placed at addresses in memory that are a multiple of some alignment size. For MMX instructions, this is 4 or 8 bytes, and for SSE and above, it is 16 bytes. By default, it is only guaranteed that filter image buffers are aligned by natural alignment, or the usual alignment for the pixel type. For 32-bit RGB frame buffers this is 4 bytes, and for most others, it is only byte alignment. This complicates vectorization as vector instruction sets often have awkward and slow handling of unaligned data and require fixup routines to handle odd pixel counts.
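To make the cost of unaligned buffers concrete, the sketch below splits a scanline into an unaligned head, a body of whole aligned vectors, and a leftover tail. The head and tail are the parts that need fixup code. RowSplit, isAligned, and splitRow are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) % alignment) == 0;
}

struct RowSplit {
    std::size_t head;   // bytes before the first aligned address
    std::size_t body;   // bytes covered by whole aligned vectors
    std::size_t tail;   // leftover bytes after the last whole vector
};

// Splits `bytes` bytes starting at `row` for a power-of-two vector size.
inline RowSplit splitRow(const void* row, std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(row);
    std::size_t head = static_cast<std::size_t>((alignment - (addr % alignment)) % alignment);
    if (head > bytes)
        head = bytes;

    const std::size_t rest = bytes - head;
    RowSplit s;
    s.head = head;
    s.tail = rest % alignment;
    s.body = rest - s.tail;
    return s;
}
```

With only natural alignment guaranteed, every row may need all three parts handled separately.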
Starting with the V14 API, it is possible to request 16 byte alignment of all scanlines by having paramProc return the FILTERPARAM_ALIGN_SCANLINES flag. This flag modifies the allocation of frame buffers so that scanlines are always aligned to a 16 byte boundary and are a multiple of 16 bytes long. This simplifies vectorization since routines can read vectors directly from memory and not have to worry about unaligned loads or crashing due to reading beyond the end of a scanline.
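With aligned, padded scanlines a row routine can consist of a single loop over whole 16 byte blocks. The sketch below shows the shape of such a loop in portable C++; invertRowAligned is a hypothetical example, and the memcpy calls stand in for the aligned vector load and store that an SSE2 version would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Inverts one scanline in whole 16-byte blocks. Assumes the filter requested
// aligned scanlines, so both pointers are 16-byte aligned and paddedBytes is
// a multiple of 16. There is no head or tail fixup loop.
inline void invertRowAligned(std::uint8_t* dst, const std::uint8_t* src,
                             std::size_t paddedBytes) {
    assert(paddedBytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);

    for (std::size_t x = 0; x < paddedBytes; x += 16) {
        std::uint64_t block[2];
        std::memcpy(block, src + x, 16);   // stands in for an aligned vector load
        block[0] = ~block[0];
        block[1] = ~block[1];
        std::memcpy(dst + x, block, 16);   // stands in for an aligned vector store
    }
}
```

Without the flag, the same routine would need the alignment checks to become runtime branches with separate fixup paths.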
The FILTERPARAM_ALIGN_SCANLINES flag has another effect: it pads out scanlines in the output buffer so that the filter can write multiples of 16 bytes. This further reduces the complexity of the filter, as it often makes it unnecessary to have fixup code for images that are an odd number of pixels in width. Padding applies to all planes, so the quarter-size chroma planes in a 4:1:0 YCbCr buffer will also be padded to a multiple of 16 bytes.
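The padding arithmetic can be illustrated with a short sketch. paddedRowBytes and subsampledWidth are hypothetical helpers, and rounding the chroma width up is an assumption; the host decides the actual plane sizes and may use a larger pitch, so real code should always use the pitch the host reports.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a row length in bytes up to the next multiple of 16.
inline std::size_t paddedRowBytes(std::size_t rowBytes) {
    return (rowBytes + 15) & ~static_cast<std::size_t>(15);
}

// Width of a plane subsampled horizontally by `factor`, rounding up
// (an illustrative assumption about the rounding rule).
inline std::size_t subsampledWidth(std::size_t lumaWidth, std::size_t factor) {
    assert(factor != 0);
    return (lumaWidth + factor - 1) / factor;
}
```

For a 100 pixel wide 4:1:0 frame with 8-bit samples, the luma row would span 112 writable bytes and each 25 byte chroma row would span 32.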
Note that there are a couple of gotchas to aligned scanline support. The first is that, although the padding at the end of output scanlines is ignored and can be written with any value, the extra padding on source scanlines is not guaranteed to have any particular value. This means that filters still need to be carefully written so that those undefined values don't affect the output, which might otherwise show up as noise on the right side of the image. The second and more minor gotcha is that in some cases using this flag can impose a small performance penalty as the host may have to realign buffers that aren't aligned appropriately from the source.
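The first gotcha matters most for filters that read neighboring pixels. The sketch below is a hypothetical 3-tap horizontal blur that writes the full padded width but clamps every read to the valid pixels, so undefined source padding cannot leak into the output. blurRow3 and its parameters are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 3-tap horizontal box blur over one row of 8-bit samples.
// validWidth  : number of real pixels in the row
// paddedWidth : writable row length including the alignment padding
inline void blurRow3(std::uint8_t* dst, const std::uint8_t* src,
                     std::size_t validWidth, std::size_t paddedWidth) {
    assert(validWidth > 0 && validWidth <= paddedWidth);

    for (std::size_t x = 0; x < paddedWidth; ++x) {
        // Clamp every tap into [0, validWidth - 1] so padding is never read.
        const std::size_t c = (x < validWidth) ? x : validWidth - 1;
        const std::size_t l = (c > 0) ? c - 1 : 0;
        const std::size_t r = (c + 1 < validWidth) ? c + 1 : validWidth - 1;
        dst[x] = static_cast<std::uint8_t>((src[l] + src[c] + src[r] + 1) / 3);
    }
}
```

A vectorized version would typically achieve the same effect by handling the last block specially or by overwriting the source padding with a copy of the final valid pixel before filtering.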
Copyright (C) 2007-2012 Avery Lee.