HTML Sanitizing - adobe/aem-core-email-components GitHub Wiki

HTML Sanitizing

Overview

This document describes how the HTML sanitizer utility works and what behavior to expect from it.

Technical details

The HTML sanitizer class is HtmlSanitizer. It is a simple utility class that sanitizes an HTML page by removing script tags and certain HTML attributes (see the section "XSS prevention").

It exposes two public methods:

  1. sanitizeHtml: takes raw HTML as input and returns sanitized HTML
  2. sanitizeDocument: takes a Document (previously parsed with Jsoup) as input, and returns a sanitized Document

Both methods perform the "full" HTML sanitization: all HTML "script" tags and the HTML attributes described in the section "XSS prevention" are removed from the resulting page.
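
The relationship between the two entry points can be sketched as follows. This is an illustration only, not the actual HtmlSanitizer source: to stay dependency-free it swaps Jsoup for the JDK's built-in XML DOM parser, so it only accepts well-formed XHTML, and the class name SanitizerEntryPointsSketch is hypothetical. It covers script removal only; attribute removal is described under "XSS prevention".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: JDK XML DOM stands in for Jsoup, so input must be well-formed XHTML.
class SanitizerEntryPointsSketch {

    // String entry point: parse, delegate to the Document entry point, serialize back.
    static String sanitizeHtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            sanitizeDocument(doc);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not sanitize input", e);
        }
    }

    // Document entry point: removes every <script> element in place and returns the document.
    static Document sanitizeDocument(Document doc) {
        NodeList scripts = doc.getElementsByTagName("script");
        // The NodeList is live, so iterate backwards while removing nodes.
        for (int i = scripts.getLength() - 1; i >= 0; i--) {
            Node script = scripts.item(i);
            script.getParentNode().removeChild(script);
        }
        return doc;
    }
}
```

Having the string method delegate to the document method keeps a single sanitization path, which is one plausible reason for exposing both; the real class may be organized differently.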

The sanitizer is invoked from StylesInlinerServiceImpl.

XSS prevention

To make HTML pages more secure, the utility removes all HTML attributes that can contain malicious JavaScript code. Following the Cross-site scripting (XSS) cheat sheet, the project contains two resource files:

  1. xss_tags.txt: list of HTML tags to be checked
  2. xss_events.txt: list of attributes that can reference JavaScript code and can therefore be used to perform XSS attacks

HtmlSanitizer cycles through all XSS tags and checks whether the matching HTML elements in the page contain any of the XSS attributes; if they do, it removes those attributes.
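
The loop described above can be sketched like this. It is a minimal illustration under stated assumptions, not the project's code: the JDK's XML DOM is used in place of Jsoup (so input must be well-formed XHTML and attribute names are matched case-sensitively), the class name XssAttributeStripSketch is hypothetical, and the two lists are short made-up excerpts standing in for the contents of xss_tags.txt and xss_events.txt.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the tag/attribute loop; JDK XML DOM stands in for Jsoup.
class XssAttributeStripSketch {

    // Assumed excerpts only; the real lists come from xss_tags.txt and xss_events.txt.
    static final List<String> XSS_TAGS = List.of("a", "body", "div", "img");
    static final List<String> XSS_EVENTS = List.of("onclick", "onerror", "onload");

    // Helper for the example: parses well-formed XHTML into a DOM document.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    // For every listed tag, visit each matching element and drop every listed attribute.
    static Document stripXssAttributes(Document doc) {
        for (String tag : XSS_TAGS) {
            NodeList elements = doc.getElementsByTagName(tag);
            for (int i = 0; i < elements.getLength(); i++) {
                Element element = (Element) elements.item(i);
                for (String event : XSS_EVENTS) {
                    // removeAttribute has no effect when the attribute is absent.
                    element.removeAttribute(event);
                }
            }
        }
        return doc;
    }
}
```

For example, parsing `<body onload="x()"><a href="#" onclick="evil()">hi</a></body>` and passing the result to stripXssAttributes leaves the `href` attribute in place while `onload` and `onclick` are removed.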

At the time of writing, xss_tags.txt and xss_events.txt were last updated on Wed, 02 Feb 2022 12:38:10.

Tests and coverage

The class HtmlSanitizerTest is the JUnit test class. It defines the tests that cover the features required of this utility. The code coverage (as reported by the IntelliJ IDEA Coverage window) is:

  1. Class: 100%
  2. Method: 100%
  3. Lines: 84%