On EGFX architecture and Dynamic Monitors - neutrinolabs/xrdp GitHub Wiki


NOTE

This document is a work-in-progress. Some of the information here may be incomplete or incorrect.


A note on authorship

Nearly all the code referenced here was written by @jsorg71. @Nexarian wrote this wiki to help communicate the architecture and get it reviewed so it can be checked in. @Nexarian also added the pieces that enable EGFX to work with Microsoft clients other than MSTSC, implemented dynamic resizing, and made miscellaneous fixes so the branched EGFX code works rebased off the devel branches of xrdp, xorgxrdp, and librfxcodec. But the core algorithms, and the code that implements them, are @jsorg71's.

Introduction

The entire point of this work is to dramatically accelerate XRDP remoting sessions. The original version was useful for simple IT administrative tasks and perhaps light development work, but beyond that the latency and the experience degraded significantly. When this project is finished, we should be able to have it all: fast sessions, dynamic resizing, hardware acceleration, and easy installation. About the only thing you won't be able to do with it is 4K gaming!

Right now the target audience is those who have some familiarity with Remote Desktop technologies. Some terms are not clearly defined, and some references are not directly called out. If this document needs more investment, please mention it on the XRDP developer Gitter channel.

Background and history

From the research I can find, it seems there was a debate in the mid-to-late 2000s about how to develop good video compression technologies. The first popular experiment was M-JPEG: essentially the JPEG still-image codec applied frame-by-frame to video. XRDP itself has a JPEG compression mode that was much faster than the protocol's original raster-based encoding. However, it never fully caught on.

The main revelation is that JPEG, while highly compressed, still sends a full frame every time. What if, instead, we could leverage high-quality compression AND avoid sending redundant data? Experiments show that the main limiting factor of remote desktop speed is bandwidth. Even if the server and client have to do a lot of work compressing and decompressing respectively, that is still a net win in overall speed and smoothness of the experience. This holds even with high latency (EGFX H264 with XRDP seems to do well up to 100 ms of latency) provided you have good bandwidth, so reducing bandwidth is critical.

Enter the MPEG family of compression technologies, which DVDs use and which remote desktop and streaming services later adopted. The version most Remote Desktop systems currently use is H264. H264 sends a keyframe, then a "Group of Pictures" (GOP): a series of frames that encode only the diffs from the frames before them. Video streaming services like Netflix send a fresh keyframe periodically to make sure everything stays in sync, but currently XRDP doesn't do this; tuning when to send a keyframe refresh is a subject of debate. The details of how these encoders work are out of scope for this document; building them has taken years of R&D.

In 2011 Intel released Quick Sync, their GPU H264 encoder, and in 2013 Nvidia followed suit with NVENC, the hardware H264 encoder in their GeForce GPUs. This tells us the industry had doubled down on the technology, and shortly afterward Microsoft published how to add H264 to the RDP protocol via the GFX extension (more on this can be found in the references at the end of this article). AMD also has an encoder, but it's not nearly as good as the other two.

There are successors to H264, such as H265. Google has VP8 and VP9, royalty-free codecs roughly comparable to H264 and H265 respectively. H265 compression is EXTREMELY computationally intensive and requires even more tuning than H264, and is therefore not useful for Remote Desktop at this time. The next big shift looks to be AV1 (descended from VP9), which is more efficient than H265, though not yet cheap enough to encode in real time for streaming. Of the above, only H264 has widespread hardware support, meaning it's going to be the gold standard for many years to come.

Architecture

There are several codecs currently in use in XRDP's prototypes for GFX integration (a rough code sketch of the preference order follows below):

  • Microsoft Progressive RFX
    • An adaptation of Microsoft's RemoteFX (RFX) protocol that incorporates frame diffing into its JPEG2000-like wavelet codec to mimic H264. It is not as fast as H264. It is merged into the devel branch as https://github.com/neutrinolabs/librfxcodec.
  • H264: x264, CPU accelerated
    • When a supported hardware encoder is not available, this relies on the common Linux x264 library, which encodes on the CPU in system RAM.
  • H264: Nvidia NVENC accelerated
    • If you own a relatively recent Nvidia GPU and can enable the standard Nvidia drivers in your session, this is blazing fast. It's also one of the preferred encoders for streamers who use software like OBS.
  • H264: Intel libva accelerated
    • Anecdotally the fastest experience available with XRDP right now, but also the hardest to install and configure. You'll need an Intel iGPU or, presumably, the up-and-coming Intel Arc GPUs for this. Note that on enterprise servers, most Xeon CPUs do not have the necessary encoder (also known as QuickSync).

Note that while AMD appears to be working on a new encoder for their upcoming RDNA 3 architecture, current versions of AMD's H264 encoders are considered inferior and are not broadly adopted in the industry.
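
As a rough illustration of the preference order implied above, here is a hedged sketch; the identifiers and probing flags are invented for illustration, not xrdp's actual code:

    /* Hypothetical sketch of encoder selection; not xrdp's actual code. */
    enum gfx_encoder
    {
        GFX_ENC_PROGRESSIVE_RFX, /* librfxcodec; works everywhere */
        GFX_ENC_H264_X264,       /* CPU fallback */
        GFX_ENC_H264_NVENC,      /* Nvidia GPUs */
        GFX_ENC_H264_VAAPI       /* Intel iGPU / Arc via libva */
    };

    static enum gfx_encoder
    pick_encoder(int client_h264, int have_nvenc, int have_vaapi)
    {
        if (!client_h264)
        {
            return GFX_ENC_PROGRESSIVE_RFX; /* e.g. the MS Store client */
        }
        if (have_vaapi)
        {
            return GFX_ENC_H264_VAAPI;      /* anecdotally the fastest */
        }
        if (have_nvenc)
        {
            return GFX_ENC_H264_NVENC;
        }
        return GFX_ENC_H264_X264;
    }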

Dynamic session resizing (aka resize-on-the-fly) as of 6-25-2022

See the diagram of how this currently works here.

The initial implementation of dynamic XRDP session resizing was relatively straightforward, at least protocol-wise. However, when combined with the GFX pipeline, things became much more complex. In non-GFX use-cases, the entire session can be resized in a single-threaded function and completed nearly instantly with blocking calls, including with RemoteFX.

This isn't possible given the way Microsoft has combined the two features in their clients. Microsoft gives us little to go on for how to orchestrate the resizing, and much of this information is derived from trial and error. What we do have is this quote:

"Restart the graphics pipeline using the surface management commands (specified in MS-RDPEGFX(https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/da5c75f9-cd99-450c-98c4-014a496942b0) section 1.3) if the Remote Desktop Protocol: Graphics Pipeline Extension is being used to remote session graphics." Reference

Essentially: the GFX Extended Virtual Channel has to be shut down, and then everything has to be reincarnated using the new monitor descriptions. Ugh! This makes the resizing workflow MUCH more complex. Instead of a simple function that resizes everything, we have to rewrite it so that we close the channel, resize the session, then re-establish the virtual channel with the new size. To do this, we need a state machine. To figure out the proper sequence in that state machine, we have to experiment, as it's unclear how the Microsoft clients will respond.
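
To make this concrete, here is a minimal sketch of what such a state machine might look like; the state names are illustrative, not xrdp's actual identifiers:

    /* Hypothetical sketch of the resize state machine. Each state
     * completes asynchronously before the next one is entered. */
    enum resize_state
    {
        RSZ_IDLE = 0,
        RSZ_DELETE_ENCODER,  /* stop feeding frames, free the encoder */
        RSZ_CLOSE_EGFX,      /* delete surfaces, close the GFX channel */
        RSZ_RESIZE_SESSION,  /* resize xorgxrdp / the X screen */
        RSZ_REOPEN_EGFX,     /* capability exchange, reset graphics */
        RSZ_CREATE_SURFACES, /* create + map surface 0 at the new size */
        RSZ_CREATE_ENCODER,  /* re-create the encoder at the new size */
        RSZ_INVALIDATE       /* request a full (key)frame, resume */
    };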

Anecdotally, the only easy-to-access client that can handle both H264/EGFX and resize-on-the-fly is the Mac OS one. Windows is a fragmented mess: MSTSC appears to support resize-on-the-fly only in limited circumstances (though it does support H264/EGFX), and the Microsoft Store client supports resize-on-the-fly but not H264. There is another Windows Desktop Client (MSRDC) that I haven't been able to test, because it only works on Azure systems with Active Directory. So nearly all of my investigation and testing is done with FreeRDP and the Microsoft Mac OS client.

The first revelation is that the note from Ogon is correct. You MUST delete and re-create the EGFX connection, and you must MAKE SURE the connection is shut down before you create a new one. There is currently a bug where, if the EGFX connection is not dropped properly, the session is dead: a zombie connection remains on the Mac OS side, and if the client then attempts to create a new connection, it can go into an infinite loop and leak nearly a gigabyte of memory. Not fun.

The second is that H264 is a delicate protocol. The Microsoft Mac OS client must receive the H264 keyframe before any incremental frames. If it doesn't receive the keyframe first after the EGFX connection is re-created, the session is again dead and can't be salvaged; you have to reconnect. Thus, the resizing workflow takes great pains to make sure that frame 0 from the xorgxrdp backend is the first frame encoded, that it is encoded and labeled as a keyframe, and that it's the first frame sent from the xrdp server to the client. This doesn't always work, and is continually being optimized. Currently, the "invalidate" signal is used to trigger this.
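
A minimal sketch of that keyframe gating, with hypothetical names:

    /* Hypothetical sketch: after the EGFX channel is re-created, drop any
     * stale frames until the keyframe arrives from the backend. */
    struct h264_stream
    {
        int awaiting_keyframe; /* set whenever the channel is re-created */
    };

    /* hypothetical transport helper */
    int transmit_to_client(const unsigned char *data, int bytes);

    static int
    send_frame(struct h264_stream *s, int is_keyframe,
               const unsigned char *data, int bytes)
    {
        if (s->awaiting_keyframe)
        {
            if (!is_keyframe)
            {
                return 0; /* drop: the client can't decode a delta yet */
            }
            s->awaiting_keyframe = 0;
        }
        return transmit_to_client(data, bytes);
    }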

The third is that monitor descriptions matter. The XRDP server has to be very precise about which monitor descriptions and surface ids it sends to the Microsoft client. Currently, everything is done with surface 0 and no client-side composition happens. With Wayland, that could change.
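
In code terms, the sequence looks roughly like this; the egfx_* helpers are hypothetical stand-ins for the MS-RDPEGFX reset graphics, create surface, and map surface to output commands:

    /* Hypothetical helpers sketching the surface sequence after a resize:
     * reset graphics to the new layout, then one surface (id 0) mapped to
     * the whole output, with no client-side composition. */
    void egfx_reset_graphics(int width, int height, int monitor_count);
    void egfx_create_surface(int surface_id, int width, int height);
    void egfx_map_surface_to_output(int surface_id, int x, int y);

    static void
    egfx_apply_new_layout(int width, int height, int monitor_count)
    {
        egfx_reset_graphics(width, height, monitor_count);
        egfx_create_surface(0, width, height); /* everything is surface 0 */
        egfx_map_surface_to_output(0, 0, 0);
    }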

The XRDP architecture above then poses a problem. While the EGFX connection is established, we also have an asynchronous encoder running. Worse, the encoder's settings are resolution-specific. What do we do with that encoder when the resizing state machine begins?

Special Topic: Deleting and re-creating the encoder in the EGFX workflow for resizing

The current implementation deletes the encoder, waits for xrdp and its modules to resize, and then restarts the encoder. This only applies when resizing runs in a state machine, which we established above is required for EGFX compatibility. For non-EGFX use-cases, resizing can be done in one single-threaded shot, including deleting and re-creating the encoder. However, forking the resizing code so that it is simpler in one mode and more complex in another is an opportunity to introduce more bugs. Instead, we should create one flow that supports all use-cases, which is good engineering.

To show why we need to delete the encoder and re-create it in this way, and why it's the best option, let us use proof by contradiction:

Let's assume that we delete and re-create the encoder in one single-threaded step. X is single-threaded, so this should work, right? To make this easier, let's also say we're resizing from a full screen to a smaller screen.

We have several choices of where to place this in the state machine:

  1. Before we resize xrdp/xorgxrdp
  2. Somewhere in the middle of resizing xrdp/xorgxrdp
  3. After resizing xrdp/xorgxrdp

Try (1): The encoder will now receive, at least temporarily, frames that do not match the size it was created for: it has been re-created for the smaller screen, but xorgxrdp is still sending larger frames while the EGFX connection to the client waits to be reconstituted. This is wrong, and could cause the encoder to send garbled data to the client in the best case, or crash in the worst case.

Try (2): Say, between when xorgxrdp is resized and when xrdp is resized. Now the encoder may match what xorgxrdp is sending, but xrdp itself hasn't been resized, so if the encoder tries to push a frame of the new size into an xrdp network buffer that hasn't been resized yet, we could have issues.

Try (3): xrdp/xorgxrdp are now consistent, but the encoder has been running all the while, so it has received both full-screen and smaller-screen frames. The issue isn't that the re-creation shouldn't happen here; it's that we left the encoder exposed to shifting memory earlier. To double-check, I tried this ordering in my mainline_merge branch with EGFX and my torture test: it's not stable.

So there is no place in the state machine where we can delete and re-create the encoder in ONE STEP.
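
To put the failure mode in concrete terms, here is a hedged sketch with hypothetical names: a size check like this could detect the mismatched frames described above, but dropping them only papers over the race; the clean fix is to keep the encoder deleted until everything has settled.

    /* Hypothetical illustration of the race: an encoder re-created in one
     * step can still be handed frames of the old size. */
    struct encoder
    {
        int width;
        int height;
    };

    int do_encode(struct encoder *enc, const void *pixels); /* hypothetical */

    static int
    encode_frame(struct encoder *enc, int frame_w, int frame_h,
                 const void *pixels)
    {
        if (frame_w != enc->width || frame_h != enc->height)
        {
            /* a frame from before (or during) the resize: encoding it
             * means garbled output at best, a crash at worst */
            return 1;
        }
        return do_encode(enc, pixels);
    }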

Furthermore, H264 and EGFX with NVENC and Intel QuickSync (hardware acceleration) make this more complicated.

In the current master of xrdp, xorgxrdp captures the software video driver's memory into a shared memory buffer; xrdp wraps it into a PDU and sends it over the wire with the RDP protocol. On my Ubuntu system, when I query for this with glxinfo | grep vendor, the video driver vendor is "SGI." This amounts to a RAM-to-RAM memory copy from xorgxrdp to xrdp, which is relatively cheap, not an order-of-magnitude change in performance.
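
For illustration, here is a minimal sketch of that shared-memory handoff, with an invented header layout (xrdp's real segment format differs):

    /* Hypothetical sketch: xorgxrdp writes a frame into a shared segment,
     * xrdp maps the same segment and wraps the pixels in a PDU. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct shm_frame_header
    {
        int width;
        int height;
        int frame_id;
        /* pixel data follows */
    };

    static void *
    map_frame_segment(const char *name, size_t bytes)
    {
        int fd = shm_open(name, O_RDONLY, 0);
        void *p;

        if (fd < 0)
        {
            return NULL;
        }
        p = mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);
        close(fd); /* the mapping stays valid after the fd is closed */
        return (p == MAP_FAILED) ? NULL : p;
    }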

Ah, but this doesn't work with hardware-accelerated video memory, at least not if you want performance. All of the encoding needs to happen in a separate video memory buffer, everything from RGB-to-YUV conversion through H264 encoding. Jay wrote a version using NVENC where the frame was copied from video memory to RAM, back to video memory for NVENC encoding, and then back to RAM for sending over the network, and it was slower than CPU-accelerated x264!

So, that's OK, just leave the data in video memory and encode it there using OpenGL/DRI3 and NVENC/QuickSync. Not so fast. Where do we do this encoding? XRDP needs to be compatible with multiple backends and must stay agnostic to the details of the drivers it accesses. OK, so move it to xorgxrdp? Also not ideal. xorgxrdp is a driver whose job is to capture video from the X server and hand it to XRDP. It should not have encoding functionality; determining the proper way to encode is XRDP's job. We need another option.

Enter the xorgxrdp_helper (I would like to rename this to the acceleration assistant, but this is alpha software so we can change that later). This is a process that man-in-the-middles the Unix sockets that XRDP and xorgxrdp use to communicate, and does the encoding in a separate thread within that process. When xorgxrdp_helper is in use, the encoder that lives inside XRDP becomes a dummy: all it does is check that the shared memory buffer is already encoded, then pass that data on to be wrapped up as a PDU in the EGFX virtual channel.
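
A hedged sketch of that dummy path, with invented names and buffer layout:

    /* Hypothetical sketch of the "dummy" encoder inside xrdp when
     * xorgxrdp_helper is active: the shared buffer already contains an
     * H264 bitstream, so xrdp only forwards it. */
    struct shm_encoded_frame
    {
        int already_encoded;   /* set by xorgxrdp_helper when it is done */
        int bytes;
        unsigned char data[4]; /* H264 bitstream follows */
    };

    /* hypothetical: wraps the bitstream in an EGFX wire-to-surface PDU */
    int egfx_send_wire_to_surface(const unsigned char *data, int bytes);

    static int
    dummy_encode(struct shm_encoded_frame *frame)
    {
        if (!frame->already_encoded)
        {
            return 1; /* the helper isn't ready; nothing to send yet */
        }
        return egfx_send_wire_to_surface(frame->data, frame->bytes);
    }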

Now we have to synchronize the encoder that lives in xorgxrdp_helper with the encoder that lives inside XRDP. The one inside XRDP should not do ANYTHING until XRDP is signaled that xorgxrdp_helper's encoder is alive again (that signal is known as "create pixmap"), and that in turn should not happen until the EGFX virtual channel has been re-created.

This is why re-creating the encoder is the last significant step of the workflow: everything else has to be ready before the invalidate signal is sent so the encoder can handle it.

Above, I neglected to mention a 4th option: we could "disable/pause" the encoder. That way we wouldn't have to deal with the risk of a null-pointer dereference in the window between when the encoder is destroyed and when it is re-created. I do not prefer this option, for the following reasons:

  1. XRDP is designed to handle a NULL encoder; some video modes, such as fastpath, don't need one. It is a supported situation (see the sketch after this list).
  2. Pausing adds complexity to the code. We would have to check for both a NULL encoder and a paused encoder, and certain parts of the code, such as frame acknowledgement, would have to handle them differently (frame acknowledgement would have to be disabled in both cases). It adds another mode that increases complexity unnecessarily, given (1).
  3. Deleting and re-creating has been extensively tested with multiple RDP clients (FreeRDP, the Microsoft Remote Desktop App Store client, MSTSC, and the Microsoft Mac OS client). It works, and the problems with a NULL encoder were small (I think only a single condition was added to address the case).
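
Here is the kind of check that makes reason (1) cheap in practice, a minimal sketch with hypothetical names, not xrdp's actual identifiers:

    /* Sketch of the pattern that makes a NULL encoder a supported state. */
    struct frame;
    struct encoder;

    struct session
    {
        struct encoder *encoder; /* NULL while resizing, or for fastpath */
        struct frame *frame;
    };

    void send_raw_update(struct session *sess);                    /* hypothetical */
    void queue_for_encoding(struct encoder *enc, struct frame *f); /* hypothetical */

    static void
    on_frame_ready(struct session *sess)
    {
        if (sess->encoder == NULL)
        {
            /* no encoder: send unencoded, skip frame acknowledgement */
            send_raw_update(sess);
            return;
        }
        queue_for_encoding(sess->encoder, sess->frame);
    }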

Summary

  1. We have to re-create the encoder on resize because, for at least some encoders, the screen size is fixed when the encoder is created.
  2. If we didn't have to use a state machine for the resizing workflow, re-creating the encoder in one step would be preferred. I checked that into devel here.
  3. However, due to the impending integration of the EGFX work, we need a state machine. That makes life harder, because resizing now happens in multiple asynchronous steps.
  4. During that async workflow, there is no easy way to prevent invalid data from being sent to the encoder. We could add more flag/state management to XRDP, but that is an additional opportunity for bugs.
  5. Thus, the solution is to shut down and delete the encoder during resize, and bring it back only after everything has settled. A NULL encoder is a supported state, and always has been, as not all session types use it.
  6. Microsoft clients are super sensitive to the "wrong" data being sent to them, so we have to be extremely careful about how the state machine is constructed.
  7. With EGFX and hardware acceleration integrated, this involves juggling the states of 3 different processes. It gets a little crazy.

State Machine/Sequence Diagram

I've attempted to dust off my UML skills and create a sequence diagram of the workflow: see the Resizing Sequence Diagram.

Special Topic: Nvidia Driver Integration with the Xorg server system as of 6-25-2022

Historically, Nvidia has been a trouble spot for Linux integration because their drivers aren't open source. When attempting to load the Nvidia drivers into xrdp's virtual sessions, the X server would crash. @jsorg71 found a way around this. In his words:

Xorg has built-in RandR support, and there is something called an 'output'.
In the output, the driver can store its own private pointer for its
own specific implementation.
Normally when the NVidia driver detects a monitor, they create the randr
output and set up their private pointers. All cool: when something
gets the private pointer, it's valid and they dereference it and do
whatever.
When xorgxrdp adds an output, I don't create this private. I don't
even know what it is. It's some internal structure they use. Anyway,
then later something dereferences it and they don't check for nil
because it should never be nil normally. I can not fix this behavior;
NVidia would have to.

My workaround:
Start up Xorg with the RandR extension disabled. In fact, never use
any of the built-in RandR functions in Xorg. Then after screen init
and the X session is up, add an extension called RANDR. There is a
function for that. It's the exact same name. In this state, the
NVidia driver thinks RandR is disabled, but X client apps see the
extension and work as normal. I just have to complete this
alternative RANDR extension. We don't need most of the randr
functionality, just the output list and those notifications.

That alternative extension is called "lrandr" (Local RandR) and is present in Nexarian's mainline merge branch. RandR is short for "Resize and Rotate," and is documented better elsewhere. The lrandr extension allows Gnome 3 to load. To my knowledge, this is the first time this type of integration has ever been possible, and it is a major win for the XRDP ecosystem. Before this, Nvidia only worked with desktop environments such as MATE and Cinnamon, which are not nearly as optimized for 3D acceleration. Anecdotally, Gnome 3 is twice as fast as MATE in side-by-side 3D rendering.
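
For the curious, here is a minimal sketch of the trick using the X server's real AddExtension API; the lrandr_* names and dispatch bodies are placeholders, not xorgxrdp's actual code, and it only builds inside an Xorg driver module:

    /* Hypothetical sketch of registering a replacement RANDR extension
     * from a driver module, the way lrandr works conceptually. */
    #include <X11/X.h>
    #include <X11/extensions/randr.h> /* RANDR_NAME, RRNumberEvents, ... */
    #include "extnsionst.h"           /* AddExtension, ExtensionEntry */
    #include "dixstruct.h"            /* ClientPtr */

    static int
    lrandr_dispatch(ClientPtr client)
    {
        /* handle only what clients actually need: the output list and
         * change notifications */
        return BadRequest; /* placeholder */
    }

    static int
    lrandr_dispatch_swapped(ClientPtr client)
    {
        return BadRequest; /* placeholder */
    }

    static void
    lrandr_close_down(ExtensionEntry *ext)
    {
    }

    int
    lrandr_init(void)
    {
        /* Xorg was started with the built-in RandR disabled, so the NVidia
         * driver never touches it; we register the same wire name after
         * screen init, and X clients see a normal RANDR extension. */
        ExtensionEntry *ext;

        ext = AddExtension(RANDR_NAME, RRNumberEvents, RRNumberErrors,
                           lrandr_dispatch, lrandr_dispatch_swapped,
                           lrandr_close_down, StandardMinorOpcode);
        return ext != NULL;
    }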

The other way around this is a system known as VirtualGL, which is what most remote desktop systems currently use, from NoMachine to Chrome Remote Desktop to X2Go. This version is superior: there is no emulation; the driver is DIRECTLY loaded into the X session.

Special Topic: Intel libva Integration as of 6-25-2022

Intel wrote libva so people could use QuickSync, and then didn't document it well. However, there is another library that wraps it, called libyami. To use it, here are instructions from jsorg71:

yami_inf is a wrapper around libyami, which is a wrapper around libva. Not saying it's going to stay this way; maybe we can later use libyami or libva directly. I was trying to follow the LoadLibrary / GetProcAddress pattern that the NVidia encoder uses.
The easiest way to get yami_inf is to clone
git clone https://github.com/jsorg71/builders.git
cd into builders/yami/omatic/ and run
./buildyami.sh --prefix=/opt/yami --disable-x11
It will put everything in /opt/yami.
If you run buildyami.sh as a regular user, you might have to create /opt/yami as root and give yourself rights.
It will build and patch libva, libyami, etc.
It would be great if you can get it working; so far it's just me.

Note that the Intel xorgxrdp_helper uses EGL and dmabufs to pass video to the encoder, whereas the Nvidia xorgxrdp_helper uses GLX, and NVENC accepts an OpenGL texture. In both cases, there are zero copies to system memory.
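
As a sketch of the EGL side of that zero-copy handoff, using the real EGL_MESA_image_dma_buf_export extension (this is not the helper's actual code, and error handling is trimmed):

    /* Export a captured frame's EGLImage as a dmabuf fd the encoder can
     * import, so pixels never round-trip through system memory. */
    #include <EGL/egl.h>
    #include <EGL/eglext.h>

    static int
    export_image_as_dmabuf(EGLDisplay dpy, EGLImageKHR image,
                           int *fd, EGLint *stride, EGLint *offset)
    {
        PFNEGLEXPORTDMABUFIMAGEMESAPROC export_dmabuf;

        export_dmabuf = (PFNEGLEXPORTDMABUFIMAGEMESAPROC)
            eglGetProcAddress("eglExportDMABUFImageMESA");
        if (export_dmabuf == NULL)
        {
            return 1; /* EGL_MESA_image_dma_buf_export not available */
        }
        /* single-plane image: one fd / stride / offset triple */
        if (!export_dmabuf(dpy, image, fd, stride, offset))
        {
            return 1;
        }
        return 0; /* fd can now be imported by the libva encoder */
    }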

Future work to merge these technologies into devel

  • Logging needs to be updated to the latest paradigm used by XRDP.
  • Intel's libva integration needs to be streamlined; right now it is very complex to install and set up.
  • Better names need to be selected for xorgxrdp_helper and lrandr.
  • Dynamic resizing needs to be tested with the xorgxrdp_helper and Intel integration.
  • YUV444 streaming needs to be enabled for all encoders as an option for higher-bandwidth, lower-latency connections.

Appendices

RDP client testing notes

Microsoft Clients

  • MSTSC: This was the gold standard of RDP clients for many years, but Microsoft is shifting away from it. MSTSC supports nearly everything except resize-on-the-fly, with one exception: if you have MSTSC maximized on a multi-monitor system, then restore it, then re-maximize it on a different monitor with a different resolution, the DISPLAYCONTROL_MONITOR_LAYOUT_PDU is sent. Finding application logs for it is easy if you know where to look.
  • Microsoft Remote Desktop Windows App Store Client: This is one of the worst Microsoft RDP clients. It's notoriously unstable, doesn't always create connections correctly even to Windows servers, leaks memory like a sieve, and has to be restarted a lot. It does, however, send the DISPLAYCONTROL_MONITOR_LAYOUT_PDU whenever a session is resized, though not with H264, which it doesn't support. We can still test EGFX resizing compatibility with it, however, by using Progressive RemoteFX. Finding application logs for it is also a huge pain.
  • Microsoft Remote Desktop Mac OS App Store Client: This is the best Microsoft RDP experience I've seen. It supports everything, but it has some quirks too. If the EGFX connection is not re-established correctly, you'll get a black screen and have to reconnect. If the keyframe is not sent properly, you'll also need to reconnect, as it'll be stuck on a black screen. Finding application logs for it is easy: just use the Console app in Mac OS!

Non-Microsoft Clients

  • FreeRDP generally handles everything written above with ease (EGFX + resizing). While it is the most flexible client, that means it generally can't be relied upon as a reference implementation: it tolerates what other Microsoft clients would enforce as protocol errors. It works well on Linux with ffmpeg (OpenH264 only works well some of the time).
    • However, FreeRDP on Mac OS is poorly supported. It relies on XQuartz, which runs a small X server inside of Mac OS and does not integrate with Mac OS spaces. The FreeRDP maintainers don't treat the Mac OS variant as a high priority. This is OK, however, as the Microsoft client there is very good.
    • FreeRDP also doesn't have good builds on Windows. There are ways to get a Windows build, but they are slapdash. YMMV.

References

Neutrinolabs (XRDP) references:

Microsoft references:

FreeRDP references:

Misc