XMPP over WebSockets

Introduction

To date, applications using the Extensible Messaging and Presence Protocol (XMPP) on the Web have made use of Bidirectional-streams Over Synchronous HTTP (BOSH), an XMPP binding to HTTP. BOSH is based on the HTTP "long polling" technique, and it suffers from high transport overhead compared to XMPP's native binding to TCP. In addition, there are a number of other known issues with long polling that have an impact on BOSH-based systems. The WebSocket protocol [RFC6455] exists to solve these kinds of problems: it is a bidirectional protocol that provides a simple message-based framing layer, allowing for more robust and efficient communication in web applications.
The WebSocket protocol enables two-way communication between a client and a server, effectively emulating TCP at the application layer and, therefore, overcoming many of the problems with existing long-polling techniques for bidirectional HTTP.

Handshake

The XMPP subprotocol is used to transport XMPP over a WebSocket connection. If a client receives a handshake response that does not include 'xmpp' in the 'Sec-WebSocket-Protocol' header, then an XMPP subprotocol WebSocket connection was not established and the client MUST close the WebSocket connection.

The following is an example of a WebSocket handshake, followed by opening an XMPP stream:

   C:  GET /xmpp-websocket HTTP/1.1
       Host: example.com
       Upgrade: websocket
       Connection: Upgrade
       Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
       Origin: http://example.com
       ...
       Sec-WebSocket-Protocol: xmpp
       Sec-WebSocket-Version: 13

   S:  HTTP/1.1 101 Switching Protocols
       Upgrade: websocket
       Connection: Upgrade
       ...
       Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
       Sec-WebSocket-Protocol: xmpp

WebSocket connection established

   C:  <open xmlns="urn:ietf:params:xml:ns:xmpp-framing"
             to="example.com"
             version="1.0" />

   S:  <open xmlns="urn:ietf:params:xml:ns:xmpp-framing"
             from="example.com"
             id="++TR84Sm6A3hnt3Q065SnAbbk3Y="
             xml:lang="en"
             version="1.0" />
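
The two checks a client can make on the handshake response above can be sketched in Python. This is a minimal sketch, not a full WebSocket client: the function names are hypothetical, the response headers are assumed to be already parsed into a dictionary, and the Sec-WebSocket-Accept derivation (SHA-1 of the key concatenated with a fixed GUID, then Base64) comes from RFC 6455 rather than from this page.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 appends to the client key before hashing.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value a server should send for this key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def xmpp_subprotocol_agreed(response_headers: dict) -> bool:
    """True if the response lists 'xmpp' in its Sec-WebSocket-Protocol header."""
    for name, value in response_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "sec-websocket-protocol":
            return "xmpp" in [token.strip() for token in value.split(",")]
    return False


# Values taken from the example handshake above.
print(expected_accept("dGhlIHNhbXBsZSBub25jZQ=="))
print(xmpp_subprotocol_agreed({"Sec-WebSocket-Protocol": "xmpp"}))
```

If `xmpp_subprotocol_agreed` returns False, the client closes the WebSocket connection instead of sending an XMPP stream header.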

XMPP Framing

The framing method for the binding of XMPP to WebSocket differs from the framing method for the TCP binding. In particular, the WebSocket binding adopts the message framing provided by WebSocket to delineate the stream open and close headers, stanzas, and other top-level stream elements.

Framed XML Stream

The start of a framed XML stream is marked by the use of an opening "stream header", which is an <open/> element with the appropriate attributes and namespace declarations. The attributes of the <open/> element are the same as those of the <stream/> element defined for the 'http://etherx.jabber.org/streams' namespace [RFC6120] and with the same semantics and restrictions. The end of a framed XML stream is denoted by the closing "stream header", which is a <close/> element with its associated attributes and namespace declarations.

Stream Frames

The individual frames of a framed XML stream have a one-to-one correspondence with WebSocket messages and MUST be parsable as standalone XML documents, complete with all relevant namespace and language declarations.

Example of a WebSocket message that contains an independently parsable XML document:

   <message xmlns="jabber:client" xml:lang="en">
     <body>Every WebSocket message is parsable by itself.</body>
   </message>
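
One way to test the "standalone XML document" requirement is to hand each WebSocket message to an XML parser on its own. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name is hypothetical, and a production parser would likely need hardening against hostile XML, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def is_standalone_frame(frame: str) -> bool:
    """True if the text parses as exactly one complete XML document."""
    try:
        ET.fromstring(frame)
    except ET.ParseError:
        return False
    return True


good = (
    '<message xmlns="jabber:client" xml:lang="en">'
    "<body>Every WebSocket message is parsable by itself.</body>"
    "</message>"
)
# A TCP-style stream header is left open, so it is not a complete document.
unclosed = '<stream:stream xmlns:stream="http://etherx.jabber.org/streams" version="1.0">'
# The 'stream' prefix is not declared inside the frame itself.
unbound_prefix = "<stream:error/>"

for frame in (good, unclosed, unbound_prefix):
    print(is_standalone_frame(frame))
```

The second and third inputs show why the TCP binding's streaming XML cannot be copied into WebSocket messages unchanged: each frame has to close its own elements and declare its own namespaces.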

Stream Initiation

The first message sent after the WebSocket opening handshake MUST be from the initiating entity and MUST be an <open/> element qualified by the 'urn:ietf:params:xml:ns:xmpp-framing' namespace. An example of a successful stream initiation exchange:

   C:  <open xmlns="urn:ietf:params:xml:ns:xmpp-framing"
             to="example.com"
             version="1.0" />

   S:  <open xmlns="urn:ietf:params:xml:ns:xmpp-framing"
             from="example.com"
             id="++TR84Sm6A3hnt3Q065SnAbbk3Y="
             xml:lang="en"
             version="1.0" />

Clients MUST NOT multiplex XMPP streams over the same WebSocket.
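
The stream initiation exchange can be sketched as two small helpers: one that builds the client's opening frame and one that checks the server's reply. Both function names are hypothetical, and the sketch only inspects the attributes shown in the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

FRAMING_NS = "urn:ietf:params:xml:ns:xmpp-framing"


def build_open(to_domain: str, version: str = "1.0") -> str:
    """Build the first frame a client sends after the WebSocket handshake."""
    # quoteattr() escapes the value and wraps it in quotes.
    return (
        f"<open xmlns={quoteattr(FRAMING_NS)} "
        f"to={quoteattr(to_domain)} version={quoteattr(version)} />"
    )


def parse_open_reply(frame: str) -> dict:
    """Check that a frame is an <open/> in the framing namespace and read its attributes."""
    root = ET.fromstring(frame)
    if root.tag != f"{{{FRAMING_NS}}}open":
        raise ValueError(f"expected <open/> in {FRAMING_NS}, got {root.tag}")
    return {"from": root.get("from"), "id": root.get("id"), "version": root.get("version")}


server_reply = (
    '<open xmlns="urn:ietf:params:xml:ns:xmpp-framing" from="example.com" '
    'id="++TR84Sm6A3hnt3Q065SnAbbk3Y=" xml:lang="en" version="1.0" />'
)
print(build_open("example.com"))
print(parse_open_reply(server_reply)["id"])
```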

Stream Errors

Stream-level errors in XMPP are fatal. Should such an error occur, the server MUST send the stream error as a complete element in a message to the client. If the error occurs during the opening of a stream, the server MUST send the initial open element response, followed by the stream-level error in a second WebSocket message frame. The server MUST then close the connection.
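
As an illustration of the ordering rule for errors during stream opening, the following sketch lists the two WebSocket messages a server would send before closing the connection. The function name and the stream id are hypothetical. The error element's namespace ('http://etherx.jabber.org/streams') and the condition namespace ('urn:ietf:params:xml:ns:xmpp-streams') are taken from RFC 6120, not from this page, and `host-unknown` is one of the conditions defined there.

```python
from xml.sax.saxutils import quoteattr

FRAMING_NS = "urn:ietf:params:xml:ns:xmpp-framing"
STREAMS_NS = "http://etherx.jabber.org/streams"      # stream error element (RFC 6120)
CONDITION_NS = "urn:ietf:params:xml:ns:xmpp-streams"  # error conditions (RFC 6120)


def frames_for_failed_open(domain: str, stream_id: str, condition: str) -> list:
    """Return the frames to send, in order, when opening the stream fails."""
    # Accept only simple names such as 'host-unknown', since the condition
    # is used as an element name below.
    if not condition.replace("-", "").isalpha():
        raise ValueError(f"not a plausible error condition name: {condition!r}")
    open_reply = (
        f"<open xmlns={quoteattr(FRAMING_NS)} from={quoteattr(domain)} "
        f'id={quoteattr(stream_id)} version="1.0" />'
    )
    # Each frame declares its own namespaces so it parses as a standalone document.
    error = f'<error xmlns="{STREAMS_NS}"><{condition} xmlns="{CONDITION_NS}" /></error>'
    return [open_reply, error]


for frame in frames_for_failed_open("example.com", "abc", "host-unknown"):
    print(frame)
```

After sending these frames the server closes the connection, following the procedure described in the next section.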

Closing the Connection

The closing process for the XMPP subprotocol mirrors that of the XMPP TCP binding, except that a <close/> element is used instead of the ending </stream:stream> tag. An example of ending an XMPP-over-WebSocket session by first closing the XMPP stream layer and then the WebSocket connection layer:

   Client                         (XMPP WSS)                      Server
   |  |                                                             |  |
   |  | <close xmlns="urn:ietf:params:xml:ns:xmpp-framing" />       |  |
   |  |------------------------------------------------------------>|  |
   |  |       <close xmlns="urn:ietf:params:xml:ns:xmpp-framing" /> |  |
   |  |<------------------------------------------------------------|  |
   |  |                                                             |  |
   |  |                      (XMPP Stream Closed)                   |  |
   |  +-------------------------------------------------------------+  |
   |                                                                   |
   | WS CLOSE FRAME                                                    |
   |------------------------------------------------------------------>|
   |                                                    WS CLOSE FRAME |
   |<------------------------------------------------------------------|
   |                                                                   |
   |                         (Connection Closed)                       |
   +-------------------------------------------------------------------+

If the WebSocket connection is closed or broken without the XMPP stream having been closed first, then the XMPP stream is considered implicitly closed and the XMPP session ended. However, if the use of stream management resumption was negotiated, the server SHOULD consider the XMPP session still alive for a period of time based on server policy.
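
The rule above reduces to a small decision function. This is only a sketch of the logic as described here: the names are hypothetical, and how long a resumable session is kept alive is left to server policy.

```python
from enum import Enum


class SessionOutcome(Enum):
    ENDED = "ended"                        # stream was closed with <close/> first
    ENDED_IMPLICITLY = "ended-implicitly"  # WebSocket dropped, no resumption
    RESUMABLE = "resumable"                # WebSocket dropped, resumption negotiated


def outcome_after_websocket_close(stream_closed_first: bool,
                                  resumption_negotiated: bool) -> SessionOutcome:
    """Decide what happens to the XMPP session once the WebSocket is gone."""
    if stream_closed_first:
        return SessionOutcome.ENDED
    if resumption_negotiated:
        # The server SHOULD keep the session alive for a policy-defined period.
        return SessionOutcome.RESUMABLE
    return SessionOutcome.ENDED_IMPLICITLY


print(outcome_after_websocket_close(stream_closed_first=False, resumption_negotiated=True))
```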

Discovering the WebSocket Connection Method

The procedure used for connecting to an XMPP server in the TCP binding is to discover the TCP/IP address and port of the server using Domain Name System service (DNS SRV) records. However, web browsers and other WebSocket-capable software applications typically cannot obtain such information from the DNS, so the client needs an alternative way to discover information about the server's connection methods. The alternative lookup process uses Web-host Metadata [RFC6415] and Web Linking [RFC5988]. Conceptually, the host-meta lookup process used for the WebSocket binding is analogous to the DNS SRV lookup process used for the TCP binding. The process is as follows.

1. Send a request over secure HTTP to the path "/.well-known/host-meta" at an HTTP origin [RFC6454] that matches the XMPP service domain (e.g., a URL of "https://im.example.org/.well-known/host-meta" if the XMPP service domain is "im.example.org").
2. Retrieve a host-meta document specifying a link relation type of "urn:xmpp:alt-connections:websocket", such as:

<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>
 <Link rel="urn:xmpp:alt-connections:websocket"
 href="wss://im.example.org:443/ws" />
</XRD>

Servers may expose discovery information using host-meta documents, and clients may use such information to determine the WebSocket endpoint for a server.
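
The lookup steps can be sketched with Python's standard library. The function names are hypothetical, and the sketch deliberately stops short of the HTTPS request itself: it only builds the URL to fetch and extracts WebSocket endpoints from a host-meta document that has already been retrieved.

```python
import xml.etree.ElementTree as ET

XRD_NS = "http://docs.oasis-open.org/ns/xri/xrd-1.0"
WEBSOCKET_REL = "urn:xmpp:alt-connections:websocket"


def host_meta_url(service_domain: str) -> str:
    """Build the host-meta URL for an XMPP service domain (HTTPS only)."""
    return f"https://{service_domain}/.well-known/host-meta"


def websocket_endpoints(host_meta_xml: str) -> list:
    """Return the href of every Link whose rel is the WebSocket relation type."""
    root = ET.fromstring(host_meta_xml)
    endpoints = []
    for link in root.iter(f"{{{XRD_NS}}}Link"):
        href = link.get("href")
        if link.get("rel") == WEBSOCKET_REL and href:
            endpoints.append(href)
    return endpoints


example_xrd = """<XRD xmlns='http://docs.oasis-open.org/ns/xri/xrd-1.0'>
 <Link rel="urn:xmpp:alt-connections:websocket"
 href="wss://im.example.org:443/ws" />
</XRD>"""

print(host_meta_url("im.example.org"))
print(websocket_endpoints(example_xrd))
```

Links with other relation types are skipped, so a host-meta document that advertises no WebSocket endpoint yields an empty list and the client has to fall back to another connection method.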

Security Considerations

The WebSocket binding for XMPP differs in several respects from the TCP binding.

The method for discovering a connection endpoint uses Web-host Metadata files retrieved via HTTPS from a URL at the XMPP service domain. From a security standpoint, this is functionally equivalent to resolution via DNS SRV records (and still relies on the DNS for resolution of the XMPP source domain).
The method for authenticating a connection endpoint uses TLS as in the TCP binding. The delegation from the XMPP service domain to the connection endpoint address (if any) is accomplished via the discovery method described previously. Thus, the connection endpoint is still authenticated, and the delegation is secure as long as the Web-host Metadata file is retrieved via HTTPS.
The framing method sends one top-level XML element per WebSocket message, instead of using streaming XML as in the TCP binding. However, the framing method has no impact on the security properties of an XMPP session (e.g., end-to-end encryption of XML stanzas can be accomplished just as easily with WebSocket framing as with streaming XML).
In all other respects, the WebSocket binding does not differ from the TCP binding and thus does not modify the security properties of the protocol.

References

https://tools.ietf.org/html/rfc7395#section-3.9