Tutorial Safe UGC - chiba233/yumeDSL GitHub Wiki

Tutorial: Safe UGC Chat



The Problem

You have a chat or comment system. Users type messages. You want to allow simple formatting -- bold, italic, links -- but prevent abuse:

  • No block or raw tags (could break layout or inject raw content)
  • Unknown tags should show as plain text, not throw errors
  • Malformed markup should never crash the parser
  • URLs must be sanitized (no javascript: or data: schemes)
  • Parse errors should be surfaced to moderation tools

This tutorial walks through building a complete, production-ready UGC chat message pipeline with yume-dsl-rich-text. Every step includes working code and attack-scenario tests so you can verify the safety properties yourself.


Step 1: Define the Whitelist

The first line of defense is restricting which tags and which tag forms the parser accepts. A chat system typically needs only inline formatting -- no multi-line code blocks, no raw content injection, no block-level layout containers.

Create the parser

import {
    createParser,
    createSimpleInlineHandlers,
    createPipeHandlers,
    type PipeArgs,
    type TokenDraft,
    type TextToken,
    type ParseError,
} from "yume-dsl-rich-text";

const chatParser = createParser({
    handlers: {
        // Simple inline formatting -- bold, italic, underline, strike, code
        ...createSimpleInlineHandlers(["bold", "italic", "underline", "strike", "code"]),

        // Link handler with URL sanitization (see Step 4)
        ...createPipeHandlers({
            link: {
                inline(args, ctx) {
                    const rawUrl = args.text(0);
                    const url = sanitizeUrl(rawUrl);
                    return {
                        type: "link",
                        url,
                        value: args.materializedTailTokens(1),
                    };
                },
            },
        }),
    },

    // KEY SECURITY MEASURE: only allow inline forms
    allowForms: ["inline"],
});

Why allowForms: ["inline"] matters

The allowForms option is the single most important security setting for UGC. It restricts the parser globally -- not per tag, but across the entire parse operation. Here is what it does and does not allow:

Allowed (inline form):

| Input | Result |
| --- | --- |
| `$$bold(hello)$$` | Parsed as bold token |
| `$$italic(world)$$` | Parsed as italic token |
| `$$link(https://example.com \| click)$$` | Parsed as link token |
| `$$bold($$italic(nested)$$)$$` | Parsed as bold containing italic |

Blocked (raw and block forms):

| Input | Result |
| --- | --- |
| `$$code(js)%\nalert('xss')\n%end$$` | Entire markup becomes literal text |
| `$$info(title)*\n<script>...</script>\n*end$$` | Entire markup becomes literal text |
| `$$unknown()*\nmalicious content\n*end$$` | Entire markup becomes literal text |

When allowForms does not include "raw" or "block", the parser treats those forms as if the handler does not support them. The raw $$code(js)%\nalert('xss')\n%end$$ syntax is not parsed as a tag at all -- it flows through as literal text characters in the output. No error is thrown, no special handling is needed. The user just sees the raw markup as text.

This applies to all tags, including unregistered ones. Even if someone invents $$exploit()*\n...\n*end$$, the block form is disabled globally, so the parser never enters block-parsing mode.

Comparison: with vs without allowForms

Without allowForms (UNSAFE for UGC):

const unsafeParser = createParser({
    handlers: {
        ...createSimpleInlineHandlers(["bold"]),
        ...createSimpleRawHandlers(["code"]),
    },
    // No allowForms -- all forms enabled by default
});

unsafeParser.parse("$$code(js)%\nalert(document.cookie)\n%end$$");
// Result: [{ type: "code", arg: "js", value: "alert(document.cookie)", id: "rt-0" }]
// The raw content is captured verbatim -- if your renderer does not escape it,
// this is an XSS vector.

With allowForms: ["inline"] (SAFE):

chatParser.parse("$$code(js)%\nalert(document.cookie)\n%end$$");
// Result: [{ type: "text", value: "$$code(js)%\nalert(document.cookie)\n%end$$", id: "rt-0" }]
// The entire input is plain text. No tag was recognized.

Step 2: Test Graceful Degradation

The parser is designed to never throw on malformed input. Every syntax error, every unknown tag, every nesting abuse degrades to plain text. This is critical for UGC: you cannot predict what users will type, and a crash means denial of service.

Here are the key degradation scenarios, with exact output:

2.1 Unregistered tag -- content becomes plain text

chatParser.parse("$$unknown(hello)$$");
// Result:
// [{ type: "text", value: "hello", id: "rt-0" }]

The tag unknown is not in the handlers map. The parser recognizes the syntax but has no handler for it, so the content "hello" is unwrapped as plain text. The $$unknown( and )$$ delimiters are stripped, and only the inner content survives.

2.2 Unclosed tag -- entire string becomes plain text

const errors: ParseError[] = [];
chatParser.parse("$$bold(unclosed", {
    onError: (e) => errors.push(e),
});
// Result:
// [{ type: "text", value: "$$bold(unclosed", id: "rt-0" }]
//
// errors[0]:
// {
//   code: "INLINE_NOT_CLOSED",
//   message: "(L1:C1) Inline tag not closed:  >>>$$bold(<<< unclosed",
//   line: 1,
//   column: 1,
//   snippet: " >>>$$bold(<<< unclosed"
// }

The opening $$bold( is never closed with )$$. The parser reports INLINE_NOT_CLOSED and recovers by treating the entire string as literal text. No crash, no partial token.

2.3 Raw form blocked by allowForms -- literal text

chatParser.parse("$$code(js)%\nalert(1)\n%end$$");
// Result:
// [{ type: "text", value: "$$code(js)%\nalert(1)\n%end$$", id: "rt-0" }]

Even though the syntax is perfectly valid raw-form DSL, the allowForms: ["inline"] setting means the raw form is globally disabled. The parser does not even attempt to parse it as a raw tag.

2.4 Deep nesting -- DEPTH_LIMIT error

const deepInput = "$$bold(".repeat(100) + "hello" + ")$$".repeat(100);
const errors: ParseError[] = [];
chatParser.parse(deepInput, {
    onError: (e) => errors.push(e),
});
// At depth 50 (default depthLimit), the parser stops recursing.
// The offending tag degrades to literal text.
// errors will contain at least one entry with code: "DEPTH_LIMIT"

The default depthLimit is 50. For a chat system, you might want to lower it:

const chatParser = createParser({
    handlers: { /* ... */ },
    allowForms: ["inline"],
    depthLimit: 10,  // Chat messages rarely need more than a few levels
});

2.5 Normal usage -- works as expected

chatParser.parse("$$bold(hello $$italic(world)$$)$$");
// Result:
// [
//   {
//     type: "bold",
//     value: [
//       { type: "text", value: "hello ", id: "rt-0" },
//       {
//         type: "italic",
//         value: [{ type: "text", value: "world", id: "rt-1" }],
//         id: "rt-2",
//       },
//     ],
//     id: "rt-3",
//   },
// ]

Nested inline tags parse correctly. The bold token contains both a text node and an italic child token.

2.6 Mixed valid and invalid -- partial recovery

chatParser.parse("$$bold(hello)$$ $$unknown(oops)$$ $$italic(world)$$");
// Result:
// [
//   { type: "bold", value: [{ type: "text", value: "hello", ... }], ... },
//   { type: "text", value: " ", ... },
//   { type: "text", value: "oops", ... },
//   { type: "text", value: " ", ... },
//   { type: "italic", value: [{ type: "text", value: "world", ... }], ... },
// ]

The valid bold and italic tags parse normally. The unregistered unknown tag degrades to plain text "oops". The surrounding content is unaffected.


Step 3: Add Error Reporting

The parser's onError callback is your window into malformed input. For a chat system, error data is valuable for moderation -- a message full of parse errors is likely spam or an exploit attempt.

Collecting errors

function parseMessage(input: string) {
    const errors: ParseError[] = [];
    const tokens = chatParser.parse(input, {
        onError: (e) => errors.push(e),
    });
    return { tokens, errors };
}

The ParseError interface

Every error contains:

interface ParseError {
    code: ErrorCode;
    message: string;
    line: number;
    column: number;
    snippet: string;
}

| Field | Description |
| --- | --- |
| code | Machine-readable error type (ErrorCode union) |
| message | Human-readable, with a (L{line}:C{column}) prefix and >>>...<<< snippet markers |
| line | 1-indexed line number where the error starts |
| column | 1-indexed column number where the error starts |
| snippet | Context around the error, with >>> <<< markers showing the problematic span |
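Given these fields, a compact one-line rendering for logs might look like the following sketch. Both formatError and the ParseErrorLike shape are illustrative names defined locally, so the snippet runs without the library:

```typescript
// Local stand-in for the ParseError fields documented above.
interface ParseErrorLike {
    code: string;
    line: number;
    column: number;
    snippet: string;
}

// One-line rendering for log output; the exact layout is a free choice.
function formatError(e: ParseErrorLike): string {
    return `(L${e.line}:C${e.column}) ${e.code}: ${e.snippet.trim()}`;
}

formatError({
    code: "INLINE_NOT_CLOSED",
    line: 1,
    column: 1,
    snippet: " >>>$$bold(<<< unclosed",
});
// "(L1:C1) INLINE_NOT_CLOSED: >>>$$bold(<<< unclosed"
```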

Error codes relevant to UGC chat

With allowForms: ["inline"], you will primarily encounter these codes:

| Code | Meaning | Example trigger |
| --- | --- | --- |
| INLINE_NOT_CLOSED | Inline tag opened but never closed | `$$bold(unclosed` |
| SHORTHAND_NOT_CLOSED | Implicit inline shorthand opened but never closed (since 1.3) | `bold(unclosed` with implicitInlineShorthand enabled |
| UNEXPECTED_CLOSE | Stray close marker with no matching open | A lone `)$$` in the middle of text |
| DEPTH_LIMIT | Nesting exceeded depthLimit | `$$a($$b($$c($$d(...` beyond the limit |

You will not see BLOCK_NOT_CLOSED, RAW_NOT_CLOSED, or their malformed variants, because block and raw forms are globally disabled by allowForms: ["inline"].
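With only these codes possible, a moderation layer can weight them differently. Here is a minimal sketch; the severity values and the mapping are an illustrative policy choice, not part of the library:

```typescript
// Illustrative policy: DEPTH_LIMIT is a stronger abuse signal than the
// "forgot to close a tag" family, which is usually just a typo.
type Severity = "info" | "suspicious";

function severityFor(code: string): Severity {
    switch (code) {
        case "DEPTH_LIMIT":
            // Practically only reachable via deliberate deep nesting.
            return "suspicious";
        case "INLINE_NOT_CLOSED":
        case "SHORTHAND_NOT_CLOSED":
        case "UNEXPECTED_CLOSE":
        default:
            // Malformed but harmless markup.
            return "info";
    }
}
```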

Using errors for moderation

interface ModerationResult {
    tokens: TextToken[];
    flagged: boolean;
    reason?: string;
}

function moderateMessage(input: string): ModerationResult {
    const { tokens, errors } = parseMessage(input);

    // Flag messages with excessive parse errors
    if (errors.length > 5) {
        return {
            tokens,
            flagged: true,
            reason: `Excessive parse errors (${errors.length}): possible markup abuse`,
        };
    }

    // Flag messages with depth limit hits -- likely nesting attack
    const depthErrors = errors.filter((e) => e.code === "DEPTH_LIMIT");
    if (depthErrors.length > 0) {
        return {
            tokens,
            flagged: true,
            reason: `Depth limit exceeded ${depthErrors.length} time(s): possible nesting attack`,
        };
    }

    return { tokens, flagged: false };
}

Logging errors for monitoring

function logParseErrors(userId: string, messageId: string, errors: ParseError[]) {
    for (const err of errors) {
        console.warn(
            `[UGC Parse Error] user=${userId} msg=${messageId} ` +
            `code=${err.code} L${err.line}:C${err.column} ${err.snippet}`
        );
    }
}

This gives you a stream of structured data you can feed into your monitoring system. A sudden spike in DEPTH_LIMIT errors from a single user is a strong signal of abuse.
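The "spike from a single user" signal can be sketched as a small in-memory counter. ErrorSpikeDetector and its threshold are illustrative; a real deployment would use a sliding time window and shared storage rather than a per-process Map:

```typescript
// Minimal per-user parse-error counter (illustrative sketch).
class ErrorSpikeDetector {
    private counts = new Map<string, number>();

    constructor(private threshold: number) {}

    // Record errors for a user; returns true once the user crosses the threshold.
    record(userId: string, errorCount: number): boolean {
        const next = (this.counts.get(userId) ?? 0) + errorCount;
        this.counts.set(userId, next);
        return next >= this.threshold;
    }

    // Call on window rollover (e.g., every minute) to start a fresh count.
    reset(): void {
        this.counts.clear();
    }
}

const detector = new ErrorSpikeDetector(10);
detector.record("user-1", 4); // false -- 4 errors so far
detector.record("user-1", 7); // true  -- 11 errors total, flag for review
```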


Step 4: URL Sanitization in the Link Handler

The parser does not validate URLs -- that is the handler's responsibility. For UGC, you must sanitize URLs to prevent javascript:, data:, and other dangerous schemes.

The sanitizeUrl function

function sanitizeUrl(raw: string): string | undefined {
    if (!raw) return undefined;

    const trimmed = raw.trim();
    if (!trimmed) return undefined;

    // Decode and normalize to catch obfuscation attempts
    let decoded: string;
    try {
        decoded = decodeURIComponent(trimmed);
    } catch {
        // Malformed percent encoding -- reject
        return undefined;
    }

    // Strip whitespace and control characters that browsers might ignore
    const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();

    // Only allow http and https schemes
    if (normalized.startsWith("http://") || normalized.startsWith("https://")) {
        return trimmed;  // Return the original (not decoded) URL
    }

    // Allow protocol-relative URLs (resolve to page's protocol)
    if (normalized.startsWith("//")) {
        return trimmed;
    }

    // Allow relative paths (no scheme)
    if (!normalized.includes(":")) {
        return trimmed;
    }

    // Reject everything else (javascript:, data:, vbscript:, etc.)
    return undefined;
}

The full link handler

const handlers = createPipeHandlers({
    link: {
        inline(args, ctx) {
            const rawUrl = args.text(0);
            const url = sanitizeUrl(rawUrl);
            const displayTokens = args.materializedTailTokens(1);

            // If URL is rejected, output the display text as plain content
            // (or fall back to the raw URL text if no display text was provided)
            if (url === undefined) {
                return {
                    type: "text",
                    value: displayTokens.length > 0 ? displayTokens : rawUrl,
                };
            }

            return {
                type: "link",
                url,
                value: displayTokens.length > 0 ? displayTokens : [{ type: "text", value: url, id: "" }],
            };
        },
    },
});

Attack examples

| Input | rawUrl | sanitizeUrl result | Output |
| --- | --- | --- | --- |
| `$$link(https://safe.com \| click)$$` | `"https://safe.com"` | `"https://safe.com"` | Link token with url: "https://safe.com" |
| `$$link(javascript:alert(1) \| click)$$` | `"javascript:alert(1)"` | undefined | Plain text "click" (link dropped) |
| `$$link(data:text/html,<script>... \| click)$$` | `"data:text/html,..."` | undefined | Plain text "click" (link dropped) |
| `$$link(JAVASCRIPT:alert(1) \| click)$$` | `"JAVASCRIPT:alert(1)"` | undefined | Plain text "click" (case-insensitive check) |
| `$$link(java\x00script:alert(1) \| click)$$` | `"java\x00script:..."` | undefined | Plain text "click" (control char stripped during normalization) |
| `$$link(\| no url)$$` | `""` | undefined | Plain text "no url" (empty URL rejected) |
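Because sanitizeUrl has no library dependencies, the rows above can be verified directly. Reproducing the function from earlier in this step as a standalone snippet:

```typescript
// Same sanitizeUrl as above, reproduced so this snippet is self-contained.
function sanitizeUrl(raw: string): string | undefined {
    if (!raw) return undefined;
    const trimmed = raw.trim();
    if (!trimmed) return undefined;

    let decoded: string;
    try {
        decoded = decodeURIComponent(trimmed);
    } catch {
        return undefined; // malformed percent encoding
    }

    // Strip whitespace/control characters, lowercase for scheme comparison.
    const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();

    if (normalized.startsWith("http://") || normalized.startsWith("https://")) return trimmed;
    if (normalized.startsWith("//")) return trimmed;   // protocol-relative
    if (!normalized.includes(":")) return trimmed;     // relative path
    return undefined;                                  // any other scheme
}

// Each call mirrors a row of the table above.
console.log(sanitizeUrl("https://safe.com"));          // "https://safe.com"
console.log(sanitizeUrl("javascript:alert(1)"));       // undefined
console.log(sanitizeUrl("JAVASCRIPT:alert(1)"));       // undefined
console.log(sanitizeUrl("java\u0000script:alert(1)")); // undefined
console.log(sanitizeUrl(""));                          // undefined
```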

Why sanitize in the handler, not the renderer?

You might think: "I will just sanitize when I render the HTML." That works, but sanitizing at parse time has advantages:

  1. Defense in depth -- even if a renderer has a bug, the dangerous URL never makes it into the token tree.
  2. Consistent behavior -- every consumer of the token tree (web renderer, mobile renderer, notification text, API response) gets safe URLs without each implementing sanitization.
  3. Moderation visibility -- you can log when URLs are rejected, giving your moderation tools more signal.

Step 5: Content Length and Rate Limiting (Outside the Parser)

The parser handles syntax safety. Content policy -- length limits, spam detection, rate limiting, profanity filtering -- is your application's responsibility. The parser provides one useful tool for content policy: stripRichText (or dsl.strip()).

Why you need strip for length checks

If a user sends:

$$bold($$italic($$underline(hello)$$)$$)$$

The raw input is 42 characters, but the visible text is only 5 characters: hello. Enforcing a 500-character limit on the raw input therefore fails in both directions: a user can send 500 characters of markup wrapping only 10 characters of content, while a user with heavy but legitimate formatting hits the limit prematurely.
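The gap is easy to see concretely. The stripped value is hardcoded below as an assumption of what dsl.strip() returns, so the snippet runs without the library:

```typescript
// The example message from above: three nested tags around "hello".
const raw = "$$bold($$italic($$underline(hello)$$)$$)$$";

// What dsl.strip() would return for it (hardcoded assumption).
const visible = "hello";

console.log(raw.length);     // 42 characters of markup plus content
console.log(visible.length); // 5 characters of actual content
```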

Use dsl.strip() to get the plain-text length for policy checks:

function checkContentLength(input: string, maxLength: number): { ok: boolean; plainLength: number } {
    const plainText = chatParser.strip(input);
    return {
        ok: plainText.length <= maxLength,
        plainLength: plainText.length,
    };
}

Combining strip and parse efficiently

If you need both tokens (for rendering) and plain text (for length checks), avoid parsing twice. Call parse once, then use extractText on the result:

import { extractText } from "yume-dsl-rich-text";

function processMessage(input: string) {
    const errors: ParseError[] = [];
    const tokens = chatParser.parse(input, {
        onError: (e) => errors.push(e),
    });
    const plainText = extractText(tokens);
    const plainLength = plainText.length;

    return { tokens, plainText, plainLength, errors };
}

Content policy pipeline

interface PolicyResult {
    allowed: boolean;
    reason?: string;
}

function checkContentPolicy(input: string): PolicyResult {
    // 1. Raw input length (prevent extremely large payloads)
    if (input.length > 10_000) {
        return { allowed: false, reason: "Message too long (raw)" };
    }

    // 2. Plain text length (actual content)
    const plainText = chatParser.strip(input);
    if (plainText.length > 2_000) {
        return { allowed: false, reason: "Message too long (content)" };
    }
    if (plainText.length === 0) {
        return { allowed: false, reason: "Message is empty" };
    }

    // 3. Markup-to-content ratio (detect markup spam)
    const ratio = input.length / plainText.length;
    if (ratio > 10) {
        return { allowed: false, reason: "Excessive markup" };
    }

    return { allowed: true };
}
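Check 3 above can be factored into a standalone, parser-independent helper by passing in the already-stripped plain text. markupRatioOk is an illustrative name, not a library function:

```typescript
// Standalone markup-to-content ratio check: flag when markup dwarfs content.
function markupRatioOk(raw: string, plain: string, maxRatio = 10): boolean {
    if (plain.length === 0) return false; // empty content always fails
    return raw.length / plain.length <= maxRatio;
}

// Light formatting passes: 42 raw chars around 5 visible chars is 8.4:1.
markupRatioOk("$$bold($$italic($$underline(hello)$$)$$)$$", "hello"); // true

// Markup padding fails: 202 raw chars around 2 visible chars is 101:1.
markupRatioOk("$$bold(".repeat(20) + "xy" + ")$$".repeat(20), "xy"); // false
```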

Step 6: Put It All Together

Here is the complete chat message processing pipeline, combining every concept from the previous steps:

import {
    createParser,
    createSimpleInlineHandlers,
    createPipeHandlers,
    extractText,
    type PipeArgs,
    type TokenDraft,
    type TextToken,
    type ParseError,
} from "yume-dsl-rich-text";

// --- URL sanitization ---

function sanitizeUrl(raw: string): string | undefined {
    if (!raw) return undefined;
    const trimmed = raw.trim();
    if (!trimmed) return undefined;

    let decoded: string;
    try {
        decoded = decodeURIComponent(trimmed);
    } catch {
        return undefined;
    }

    const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();

    if (normalized.startsWith("http://") || normalized.startsWith("https://")) {
        return trimmed;
    }
    if (normalized.startsWith("//")) {
        return trimmed;
    }
    if (!normalized.includes(":")) {
        return trimmed;
    }
    return undefined;
}

// --- Parser setup ---

const chatParser = createParser({
    handlers: {
        ...createSimpleInlineHandlers(["bold", "italic", "underline", "strike", "code"]),
        ...createPipeHandlers({
            link: {
                inline(args, ctx) {
                    const rawUrl = args.text(0);
                    const url = sanitizeUrl(rawUrl);
                    const display = args.materializedTailTokens(1);

                    if (url === undefined) {
                        return {
                            type: "text",
                            value: display.length > 0 ? display : rawUrl,
                        };
                    }
                    return {
                        type: "link",
                        url,
                        value: display.length > 0 ? display : [{ type: "text", value: url, id: "" }],
                    };
                },
            },
        }),
    },
    allowForms: ["inline"],
    depthLimit: 10,
});

// --- Message processing pipeline ---

interface ProcessedMessage {
    tokens: TextToken[];
    plainText: string;
    errors: ParseError[];
    flagged: boolean;
    flagReason?: string;
}

function processMessage(input: string): ProcessedMessage {
    // Step 1: Raw input length guard
    if (input.length > 10_000) {
        return {
            tokens: [{ type: "text", value: "[Message too long]", id: "err-0" }],
            plainText: "",
            errors: [],
            flagged: true,
            flagReason: "Raw input exceeds 10,000 characters",
        };
    }

    // Step 2: Parse with error collection
    const errors: ParseError[] = [];
    const tokens = chatParser.parse(input, {
        onError: (e) => errors.push(e),
    });

    // Step 3: Extract plain text (single pass -- no re-parse)
    const plainText = extractText(tokens);

    // Step 4: Content policy checks
    if (plainText.length === 0) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: "Empty message after parsing",
        };
    }

    if (plainText.length > 2_000) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: "Content exceeds 2,000 characters",
        };
    }

    // Step 5: Moderation signals from parse errors
    const depthErrors = errors.filter((e) => e.code === "DEPTH_LIMIT");
    if (depthErrors.length > 0) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: `Depth limit hit ${depthErrors.length} time(s)`,
        };
    }

    if (errors.length > 5) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: `Excessive parse errors: ${errors.length}`,
        };
    }

    // Step 6: Markup-to-content ratio
    const ratio = input.length / Math.max(plainText.length, 1);
    if (ratio > 10) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: `Markup ratio ${ratio.toFixed(1)}:1 exceeds threshold`,
        };
    }

    return { tokens, plainText, errors, flagged: false };
}

Usage

// Normal message
const result1 = processMessage("$$bold(Hello)$$ $$italic(world)$$!");
// result1.flagged === false
// result1.tokens contains bold + italic tokens

// Attack: javascript URL
const result2 = processMessage("$$link(javascript:alert(1) | click me)$$");
// result2.flagged === false (not a parse error -- the URL was sanitized away)
// The handler dropped the link and returned plain text "click me"

// Attack: raw form injection
const result3 = processMessage("$$code(js)%\nalert(1)\n%end$$");
// result3.flagged === false
// result3.tokens is plain text (raw form blocked)

// Attack: nesting bomb
const result4 = processMessage("$$bold(".repeat(100) + "x" + ")$$".repeat(100));
// result4.flagged === true
// result4.flagReason contains "Depth limit hit"

Security Checklist

A summary of every safety layer covered in this tutorial:

| Status | Measure | Responsible layer | What it prevents |
| --- | --- | --- | --- |
| Done | allowForms: ["inline"] | Parser config | Block/raw form injection |
| Done | URL sanitization in link handler | Tag handler | javascript:, data:, and other dangerous URL schemes |
| Done | Parser never throws | Parser core | Denial of service via malformed input |
| Done | onError callback | Parser config | Monitoring and moderation signal |
| Done | depthLimit (lowered to 10) | Parser config | Nesting bomb attacks |
| Done | stripRichText / extractText | Application code | Accurate content-length checks that ignore markup |
| Done | Markup ratio check | Application code | Markup spam / padding attacks |
| Done | Raw input length cap | Application code | Memory exhaustion from oversized payloads |

What is NOT the parser's job

These concerns must be handled by your application or rendering layer:

| Concern | Responsible layer | Why |
| --- | --- | --- |
| HTML escaping | Your renderer (e.g., Vue, React) | The parser produces tokens, not HTML. XSS via HTML injection is a rendering-layer concern. |
| Rate limiting | Your API layer | The parser is stateless -- it does not know about request frequency. |
| Spam detection | Your moderation system | Content-level policy (profanity, links to malicious domains, etc.) requires domain knowledge the parser does not have. |
| Image/media validation | Your media pipeline | If you add an img tag, URL validation is necessary but not sufficient -- you also need to verify the resource itself is safe. |
| Session/auth checks | Your API layer | The parser does not know who is sending messages. |