Tutorial Safe UGC - chiba233/yumeDSL GitHub Wiki

Tutorial: Safe UGC Chat



The Problem

You have a chat or comment system. Users type messages. You want to allow simple formatting -- bold, italic, links -- but prevent abuse:

  • No block or raw tags (could break layout or inject raw content)
  • Unknown tags should show as plain text, not throw errors
  • Malformed markup should never crash the parser
  • URLs must be sanitized (no javascript: or data: schemes)
  • Parse errors should be surfaced to moderation tools

This tutorial walks through building a complete, production-ready UGC chat message pipeline with yume-dsl-rich-text. Every step includes working code and attack-scenario tests so you can verify the safety properties yourself.


Step 1: Define the Whitelist

The first line of defense is restricting which tags and which tag forms the parser accepts. A chat system typically needs only inline formatting -- no multi-line code blocks, no raw content injection, no block-level layout containers.

Create the parser

import {
    createParser,
    createSimpleInlineHandlers,
    createPipeHandlers,
    type PipeArgs,
    type TokenDraft,
    type TextToken,
    type ParseError,
} from "yume-dsl-rich-text";

const chatParser = createParser({
    handlers: {
        // Simple inline formatting -- bold, italic, underline, strike, code
        ...createSimpleInlineHandlers(["bold", "italic", "underline", "strike", "code"]),

        // Link handler with URL sanitization (see Step 4)
        ...createPipeHandlers({
            link: {
                inline(args, ctx) {
                    const rawUrl = args.text(0);
                    const url = sanitizeUrl(rawUrl);
                    return {
                        type: "link",
                        url,
                        value: args.materializedTailTokens(1),
                    };
                },
            },
        }),
    },

    // KEY SECURITY MEASURE: only allow inline forms
    allowForms: ["inline"],
});

Why allowForms: ["inline"] matters

The allowForms option is the single most important security setting for UGC. It restricts the parser globally -- not per tag, but across the entire parse operation. Here is what it does and does not allow:

Allowed (inline form):

| Input | Result |
| --- | --- |
| `$$bold(hello)$$` | Parsed as bold token |
| `$$italic(world)$$` | Parsed as italic token |
| `$$link(https://example.com \| click)$$` | Parsed as link token |
| `$$bold($$italic(nested)$$)$$` | Parsed as bold containing italic |

Blocked (raw and block forms):

| Input | Result |
| --- | --- |
| `$$code(js)%\nalert('xss')\n%end$$` | Entire markup becomes literal text |
| `$$info(title)*\n<script>...</script>\n*end$$` | Entire markup becomes literal text |
| `$$unknown()*\nmalicious content\n*end$$` | Entire markup becomes literal text |

When allowForms does not include "raw" or "block", the parser treats those forms as if the handler does not support them. The raw $$code(js)%\nalert('xss')\n%end$$ syntax is not parsed as a tag at all -- it flows through as literal text characters in the output. No error is thrown, no special handling is needed. The user just sees the raw markup as text.

This applies to all tags, including unregistered ones. Even if someone invents $$exploit()*\n...\n*end$$, the block form is disabled globally, so the parser never enters block-parsing mode.

Comparison: with vs without allowForms

Without allowForms (UNSAFE for UGC):

const unsafeParser = createParser({
    handlers: {
        ...createSimpleInlineHandlers(["bold"]),
        ...createSimpleRawHandlers(["code"]),
    },
    // No allowForms -- all forms enabled by default
});

unsafeParser.parse("$$code(js)%\nalert(document.cookie)\n%end$$");
// Result: [{ type: "code", arg: "js", value: "alert(document.cookie)", id: "rt-0" }]
// The raw content is captured verbatim -- if your renderer does not escape it,
// this is an XSS vector.

With allowForms: ["inline"] (SAFE):

chatParser.parse("$$code(js)%\nalert(document.cookie)\n%end$$");
// Result: [{ type: "text", value: "$$code(js)%\nalert(document.cookie)\n%end$$", id: "rt-0" }]
// The entire input is plain text. No tag was recognized.

Step 2: Test Graceful Degradation

The parser is designed to never throw on malformed input. Every syntax error, every unknown tag, every nesting abuse degrades to plain text. This is critical for UGC: you cannot predict what users will type, and a crash means denial of service.

Here are the key degradation scenarios, with exact output:

2.1 Unregistered tag -- content becomes plain text

chatParser.parse("$$unknown(hello)$$");
// Result:
// [{ type: "text", value: "hello", id: "rt-0" }]

The tag unknown is not in the handlers map. The parser recognizes the syntax but has no handler for it, so the content "hello" is unwrapped as plain text. The $$unknown( and )$$ delimiters are stripped, and only the inner content survives.

2.2 Unclosed tag -- entire string becomes plain text

const errors: ParseError[] = [];
chatParser.parse("$$bold(unclosed", {
    onError: (e) => errors.push(e),
});
// Result:
// [{ type: "text", value: "$$bold(unclosed", id: "rt-0" }]
//
// errors[0]:
// {
//   code: "INLINE_NOT_CLOSED",
//   message: "(L1:C1) Inline tag not closed:  >>>$$bold(<<< unclosed",
//   line: 1,
//   column: 1,
//   snippet: " >>>$$bold(<<< unclosed"
// }

The opening $$bold( is never closed with )$$. The parser reports INLINE_NOT_CLOSED and recovers by treating the entire string as literal text. No crash, no partial token.

2.3 Raw form blocked by allowForms -- literal text

chatParser.parse("$$code(js)%\nalert(1)\n%end$$");
// Result:
// [{ type: "text", value: "$$code(js)%\nalert(1)\n%end$$", id: "rt-0" }]

Even though the syntax is perfectly valid raw-form DSL, the allowForms: ["inline"] setting means the raw form is globally disabled. The parser does not even attempt to parse it as a raw tag.

2.4 Deep nesting -- DEPTH_LIMIT error

const deepInput = "$$bold(".repeat(100) + "hello" + ")$$".repeat(100);
const errors: ParseError[] = [];
chatParser.parse(deepInput, {
    onError: (e) => errors.push(e),
});
// At depth 50 (default depthLimit), the parser stops recursing.
// The offending tag degrades to literal text.
// errors will contain at least one entry with code: "DEPTH_LIMIT"

The default depthLimit is 50. For a chat system, you might want to lower it:

const chatParser = createParser({
    handlers: { /* ... */ },
    allowForms: ["inline"],
    depthLimit: 10,  // Chat messages rarely need more than a few levels
});

2.5 Normal usage -- works as expected

chatParser.parse("$$bold(hello $$italic(world)$$)$$");
// Result:
// [
//   {
//     type: "bold",
//     value: [
//       { type: "text", value: "hello ", id: "rt-0" },
//       {
//         type: "italic",
//         value: [{ type: "text", value: "world", id: "rt-1" }],
//         id: "rt-2",
//       },
//     ],
//     id: "rt-3",
//   },
// ]

Nested inline tags parse correctly. The bold token contains both a text node and an italic child token.

2.6 Mixed valid and invalid -- partial recovery

chatParser.parse("$$bold(hello)$$ $$unknown(oops)$$ $$italic(world)$$");
// Result:
// [
//   { type: "bold", value: [{ type: "text", value: "hello", ... }], ... },
//   { type: "text", value: " ", ... },
//   { type: "text", value: "oops", ... },
//   { type: "text", value: " ", ... },
//   { type: "italic", value: [{ type: "text", value: "world", ... }], ... },
// ]

The valid bold and italic tags parse normally. The unregistered unknown tag degrades to plain text "oops". The surrounding content is unaffected.


Step 3: Add Error Reporting

The parser's onError callback is your window into malformed input. For a chat system, error data is valuable for moderation -- a message full of parse errors is likely spam or an exploit attempt.

Collecting errors

function parseMessage(input: string) {
    const errors: ParseError[] = [];
    const tokens = chatParser.parse(input, {
        onError: (e) => errors.push(e),
    });
    return { tokens, errors };
}

The ParseError interface

Every error contains:

interface ParseError {
    code: ErrorCode;
    message: string;
    line: number;
    column: number;
    snippet: string;
}

| Field | Description |
| --- | --- |
| code | Machine-readable error type (ErrorCode union) |
| message | Human-readable, with a (L{line}:C{column}) prefix and >>>...<<< snippet markers |
| line | 1-indexed line number where the error starts |
| column | 1-indexed column number where the error starts |
| snippet | Context around the error, with >>> <<< markers showing the problematic span |
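Given these fields, a compact one-line rendering for logs might look like the following sketch. Both formatError and the ParseErrorLike shape are illustrative names defined locally, so the snippet runs without the library:

```typescript
// Local stand-in for the ParseError fields documented above.
interface ParseErrorLike {
    code: string;
    line: number;
    column: number;
    snippet: string;
}

// One-line rendering for log output; the exact layout is a free choice.
function formatError(e: ParseErrorLike): string {
    return `(L${e.line}:C${e.column}) ${e.code}: ${e.snippet.trim()}`;
}

formatError({
    code: "INLINE_NOT_CLOSED",
    line: 1,
    column: 1,
    snippet: " >>>$$bold(<<< unclosed",
});
// "(L1:C1) INLINE_NOT_CLOSED: >>>$$bold(<<< unclosed"
```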

Error codes relevant to UGC chat

With allowForms: ["inline"], you will primarily encounter these codes:

| Code | Meaning | Example trigger |
| --- | --- | --- |
| INLINE_NOT_CLOSED | Inline tag opened but never closed | `$$bold(unclosed` |
| SHORTHAND_NOT_CLOSED | Implicit inline shorthand opened but never closed (since 1.3) | `bold(unclosed` with implicitInlineShorthand enabled |
| UNEXPECTED_CLOSE | Stray close marker with no matching open | A lone `)$$` in the middle of text |
| DEPTH_LIMIT | Nesting exceeded depthLimit | `$$a($$b($$c($$d(...` beyond the limit |

You will not see BLOCK_NOT_CLOSED, RAW_NOT_CLOSED, or their malformed variants, because block and raw forms are globally disabled by allowForms: ["inline"].
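With only these codes possible, a moderation layer can weight them differently. Here is a minimal sketch; the severity values and the mapping are an illustrative policy choice, not part of the library:

```typescript
// Illustrative policy: DEPTH_LIMIT is a stronger abuse signal than the
// "forgot to close a tag" family, which is usually just a typo.
type Severity = "info" | "suspicious";

function severityFor(code: string): Severity {
    switch (code) {
        case "DEPTH_LIMIT":
            // Practically only reachable via deliberate deep nesting.
            return "suspicious";
        case "INLINE_NOT_CLOSED":
        case "SHORTHAND_NOT_CLOSED":
        case "UNEXPECTED_CLOSE":
        default:
            // Malformed but harmless markup.
            return "info";
    }
}
```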

Using errors for moderation

interface ModerationResult {
    tokens: TextToken[];
    flagged: boolean;
    reason?: string;
}

function moderateMessage(input: string): ModerationResult {
    const { tokens, errors } = parseMessage(input);

    // Flag messages with excessive parse errors
    if (errors.length > 5) {
        return {
            tokens,
            flagged: true,
            reason: `Excessive parse errors (${errors.length}): possible markup abuse`,
        };
    }

    // Flag messages with depth limit hits -- likely nesting attack
    const depthErrors = errors.filter((e) => e.code === "DEPTH_LIMIT");
    if (depthErrors.length > 0) {
        return {
            tokens,
            flagged: true,
            reason: `Depth limit exceeded ${depthErrors.length} time(s): possible nesting attack`,
        };
    }

    return { tokens, flagged: false };
}

Logging errors for monitoring

function logParseErrors(userId: string, messageId: string, errors: ParseError[]) {
    for (const err of errors) {
        console.warn(
            `[UGC Parse Error] user=${userId} msg=${messageId} ` +
            `code=${err.code} L${err.line}:C${err.column} ${err.snippet}`
        );
    }
}

This gives you a stream of structured data you can feed into your monitoring system. A sudden spike in DEPTH_LIMIT errors from a single user is a strong signal of abuse.
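The "spike from a single user" signal can be sketched as a small in-memory counter. ErrorSpikeDetector and its threshold are illustrative; a real deployment would use a sliding time window and shared storage rather than a per-process Map:

```typescript
// Minimal per-user parse-error counter (illustrative sketch).
class ErrorSpikeDetector {
    private counts = new Map<string, number>();

    constructor(private threshold: number) {}

    // Record errors for a user; returns true once the user crosses the threshold.
    record(userId: string, errorCount: number): boolean {
        const next = (this.counts.get(userId) ?? 0) + errorCount;
        this.counts.set(userId, next);
        return next >= this.threshold;
    }

    // Call on window rollover (e.g., every minute) to start a fresh count.
    reset(): void {
        this.counts.clear();
    }
}

const detector = new ErrorSpikeDetector(10);
detector.record("user-1", 4); // false -- 4 errors so far
detector.record("user-1", 7); // true  -- 11 errors total, flag for review
```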


Step 4: URL Sanitization in the Link Handler

The parser does not validate URLs -- that is the handler's responsibility. For UGC, you must sanitize URLs to prevent javascript:, data:, and other dangerous schemes.

The sanitizeUrl function

function sanitizeUrl(raw: string): string | undefined {
    if (!raw) return undefined;

    const trimmed = raw.trim();
    if (!trimmed) return undefined;

    // Decode and normalize to catch obfuscation attempts
    let decoded: string;
    try {
        decoded = decodeURIComponent(trimmed);
    } catch {
        // Malformed percent encoding -- reject
        return undefined;
    }

    // Strip whitespace and control characters that browsers might ignore
    const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();

    // Only allow http and https schemes
    if (normalized.startsWith("http://") || normalized.startsWith("https://")) {
        return trimmed;  // Return the original (not decoded) URL
    }

    // Allow protocol-relative URLs (resolve to page's protocol)
    if (normalized.startsWith("//")) {
        return trimmed;
    }

    // Allow relative paths (no scheme)
    if (!normalized.includes(":")) {
        return trimmed;
    }

    // Reject everything else (javascript:, data:, vbscript:, etc.)
    return undefined;
}

The full link handler

const handlers = createPipeHandlers({
    link: {
        inline(args, ctx) {
            const rawUrl = args.text(0);
            const url = sanitizeUrl(rawUrl);
            const displayTokens = args.materializedTailTokens(1);

            // If URL is rejected, output the display text as plain content
            // (or fall back to the raw URL text if no display text was provided)
            if (url === undefined) {
                return {
                    type: "text",
                    value: displayTokens.length > 0 ? displayTokens : rawUrl,
                };
            }

            return {
                type: "link",
                url,
                value: displayTokens.length > 0 ? displayTokens : [{ type: "text", value: url, id: "" }],
            };
        },
    },
});

Attack examples

| Input | rawUrl | sanitizeUrl result | Output |
| --- | --- | --- | --- |
| `$$link(https://safe.com \| click)$$` | `"https://safe.com"` | `"https://safe.com"` | Link token with url: "https://safe.com" |
| `$$link(javascript:alert(1) \| click)$$` | `"javascript:alert(1)"` | undefined | Plain text "click" (link dropped) |
| `$$link(data:text/html,<script>... \| click)$$` | `"data:text/html,..."` | undefined | Plain text "click" (link dropped) |
| `$$link(JAVASCRIPT:alert(1) \| click)$$` | `"JAVASCRIPT:alert(1)"` | undefined | Plain text "click" (case-insensitive check) |
| `$$link(java\x00script:alert(1) \| click)$$` | `"java\x00script:..."` | undefined | Plain text "click" (control char stripped during normalization) |
| `$$link(\| no url)$$` | `""` | undefined | Plain text "no url" (empty URL rejected) |
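Because sanitizeUrl has no library dependencies, the rows above can be verified directly. Reproducing the function from earlier in this step as a standalone snippet:

```typescript
// Same sanitizeUrl as above, reproduced so this snippet is self-contained.
function sanitizeUrl(raw: string): string | undefined {
    if (!raw) return undefined;
    const trimmed = raw.trim();
    if (!trimmed) return undefined;

    let decoded: string;
    try {
        decoded = decodeURIComponent(trimmed);
    } catch {
        return undefined; // malformed percent encoding
    }

    // Strip whitespace/control characters, lowercase for scheme comparison.
    const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();

    if (normalized.startsWith("http://") || normalized.startsWith("https://")) return trimmed;
    if (normalized.startsWith("//")) return trimmed;   // protocol-relative
    if (!normalized.includes(":")) return trimmed;     // relative path
    return undefined;                                  // any other scheme
}

// Each call mirrors a row of the table above.
console.log(sanitizeUrl("https://safe.com"));          // "https://safe.com"
console.log(sanitizeUrl("javascript:alert(1)"));       // undefined
console.log(sanitizeUrl("JAVASCRIPT:alert(1)"));       // undefined
console.log(sanitizeUrl("java\u0000script:alert(1)")); // undefined
console.log(sanitizeUrl(""));                          // undefined
```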

Why sanitize in the handler, not the renderer?

You might think: "I will just sanitize when I render the HTML." That works, but sanitizing at parse time has advantages:

  1. Defense in depth -- even if a renderer has a bug, the dangerous URL never makes it into the token tree.
  2. Consistent behavior -- every consumer of the token tree (web renderer, mobile renderer, notification text, API response) gets safe URLs without each implementing sanitization.
  3. Moderation visibility -- you can log when URLs are rejected, giving your moderation tools more signal.

Step 5: Content Length and Rate Limiting (Outside the Parser)

The parser handles syntax safety. Content policy -- length limits, spam detection, rate limiting, profanity filtering -- is your application's responsibility. The parser provides one useful tool for content policy: stripRichText (or dsl.strip()).

Why you need strip for length checks

If a user sends:

$$bold($$italic($$underline(hello)$$)$$)$$

The raw input is 42 characters, but the visible text is only 5 characters: hello. Enforcing a 500-character limit on the raw input therefore fails in both directions: a user can send 500 characters of markup wrapping only 10 characters of content, while a user with heavy but legitimate formatting hits the limit prematurely.
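The gap is easy to see concretely. The stripped value is hardcoded below as an assumption of what dsl.strip() returns, so the snippet runs without the library:

```typescript
// The example message from above: three nested tags around "hello".
const raw = "$$bold($$italic($$underline(hello)$$)$$)$$";

// What dsl.strip() would return for it (hardcoded assumption).
const visible = "hello";

console.log(raw.length);     // 42 characters of markup plus content
console.log(visible.length); // 5 characters of actual content
```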

Use dsl.strip() to get the plain-text length for policy checks:

function checkContentLength(input: string, maxLength: number): { ok: boolean; plainLength: number } {
    const plainText = chatParser.strip(input);
    return {
        ok: plainText.length <= maxLength,
        plainLength: plainText.length,
    };
}

Combining strip and parse efficiently

If you need both tokens (for rendering) and plain text (for length checks), avoid parsing twice. Call parse once, then use extractText on the result:

import { extractText } from "yume-dsl-rich-text";

function processMessage(input: string) {
    const errors: ParseError[] = [];
    const tokens = chatParser.parse(input, {
        onError: (e) => errors.push(e),
    });
    const plainText = extractText(tokens);
    const plainLength = plainText.length;

    return { tokens, plainText, plainLength, errors };
}

Content policy pipeline

interface PolicyResult {
    allowed: boolean;
    reason?: string;
}

function checkContentPolicy(input: string): PolicyResult {
    // 1. Raw input length (prevent extremely large payloads)
    if (input.length > 10_000) {
        return { allowed: false, reason: "Message too long (raw)" };
    }

    // 2. Plain text length (actual content)
    const plainText = chatParser.strip(input);
    if (plainText.length > 2_000) {
        return { allowed: false, reason: "Message too long (content)" };
    }
    if (plainText.length === 0) {
        return { allowed: false, reason: "Message is empty" };
    }

    // 3. Markup-to-content ratio (detect markup spam)
    const ratio = input.length / plainText.length;
    if (ratio > 10) {
        return { allowed: false, reason: "Excessive markup" };
    }

    return { allowed: true };
}
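Check 3 above can be factored into a standalone, parser-independent helper by passing in the already-stripped plain text. markupRatioOk is an illustrative name, not a library function:

```typescript
// Standalone markup-to-content ratio check: flag when markup dwarfs content.
function markupRatioOk(raw: string, plain: string, maxRatio = 10): boolean {
    if (plain.length === 0) return false; // empty content always fails
    return raw.length / plain.length <= maxRatio;
}

// Light formatting passes: 42 raw chars around 5 visible chars is 8.4:1.
markupRatioOk("$$bold($$italic($$underline(hello)$$)$$)$$", "hello"); // true

// Markup padding fails: 202 raw chars around 2 visible chars is 101:1.
markupRatioOk("$$bold(".repeat(20) + "xy" + ")$$".repeat(20), "xy"); // false
```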

Step 6: Put It All Together

Here is the complete chat message processing pipeline, combining every concept from the previous steps:

import {
    createParser,
    createSimpleInlineHandlers,
    createPipeHandlers,
    extractText,
    type PipeArgs,
    type TokenDraft,
    type TextToken,
    type ParseError,
} from "yume-dsl-rich-text";

// --- URL sanitization ---

function sanitizeUrl(raw: string): string | undefined {
    if (!raw) return undefined;
    const trimmed = raw.trim();
    if (!trimmed) return undefined;

    let decoded: string;
    try {
        decoded = decodeURIComponent(trimmed);
    } catch {
        return undefined;
    }

    const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();

    if (normalized.startsWith("http://") || normalized.startsWith("https://")) {
        return trimmed;
    }
    if (normalized.startsWith("//")) {
        return trimmed;
    }
    if (!normalized.includes(":")) {
        return trimmed;
    }
    return undefined;
}

// --- Parser setup ---

const chatParser = createParser({
    handlers: {
        ...createSimpleInlineHandlers(["bold", "italic", "underline", "strike", "code"]),
        ...createPipeHandlers({
            link: {
                inline(args, ctx) {
                    const rawUrl = args.text(0);
                    const url = sanitizeUrl(rawUrl);
                    const display = args.materializedTailTokens(1);

                    if (url === undefined) {
                        return {
                            type: "text",
                            value: display.length > 0 ? display : rawUrl,
                        };
                    }
                    return {
                        type: "link",
                        url,
                        value: display.length > 0 ? display : [{ type: "text", value: url, id: "" }],
                    };
                },
            },
        }),
    },
    allowForms: ["inline"],
    depthLimit: 10,
});

// --- Message processing pipeline ---

interface ProcessedMessage {
    tokens: TextToken[];
    plainText: string;
    errors: ParseError[];
    flagged: boolean;
    flagReason?: string;
}

function processMessage(input: string): ProcessedMessage {
    // Step 1: Raw input length guard
    if (input.length > 10_000) {
        return {
            tokens: [{ type: "text", value: "[Message too long]", id: "err-0" }],
            plainText: "",
            errors: [],
            flagged: true,
            flagReason: "Raw input exceeds 10,000 characters",
        };
    }

    // Step 2: Parse with error collection
    const errors: ParseError[] = [];
    const tokens = chatParser.parse(input, {
        onError: (e) => errors.push(e),
    });

    // Step 3: Extract plain text (single pass -- no re-parse)
    const plainText = extractText(tokens);

    // Step 4: Content policy checks
    if (plainText.length === 0) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: "Empty message after parsing",
        };
    }

    if (plainText.length > 2_000) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: "Content exceeds 2,000 characters",
        };
    }

    // Step 5: Moderation signals from parse errors
    const depthErrors = errors.filter((e) => e.code === "DEPTH_LIMIT");
    if (depthErrors.length > 0) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: `Depth limit hit ${depthErrors.length} time(s)`,
        };
    }

    if (errors.length > 5) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: `Excessive parse errors: ${errors.length}`,
        };
    }

    // Step 6: Markup-to-content ratio
    const ratio = input.length / Math.max(plainText.length, 1);
    if (ratio > 10) {
        return {
            tokens,
            plainText,
            errors,
            flagged: true,
            flagReason: `Markup ratio ${ratio.toFixed(1)}:1 exceeds threshold`,
        };
    }

    return { tokens, plainText, errors, flagged: false };
}

Usage

// Normal message
const result1 = processMessage("$$bold(Hello)$$ $$italic(world)$$!");
// result1.flagged === false
// result1.tokens contains bold + italic tokens

// Attack: javascript URL
const result2 = processMessage("$$link(javascript:alert(1) | click me)$$");
// result2.flagged === false (not a parse error -- the URL was sanitized away)
// The handler dropped the link and returned plain text "click me"

// Attack: raw form injection
const result3 = processMessage("$$code(js)%\nalert(1)\n%end$$");
// result3.flagged === false
// result3.tokens is plain text (raw form blocked)

// Attack: nesting bomb
const result4 = processMessage("$$bold(".repeat(100) + "x" + ")$$".repeat(100));
// result4.flagged === true
// result4.flagReason contains "Depth limit hit"

Security Checklist

A summary of every safety layer covered in this tutorial:

| Status | Measure | Responsible layer | What it prevents |
| --- | --- | --- | --- |
| Done | allowForms: ["inline"] | Parser config | Block/raw form injection |
| Done | URL sanitization in link handler | Tag handler | javascript:, data:, and other dangerous URL schemes |
| Done | Parser never throws | Parser core | Denial of service via malformed input |
| Done | onError callback | Parser config | Monitoring and moderation signal |
| Done | depthLimit (lowered to 10) | Parser config | Nesting bomb attacks |
| Done | stripRichText / extractText | Application code | Accurate content-length checks that ignore markup |
| Done | Markup ratio check | Application code | Markup spam / padding attacks |
| Done | Raw input length cap | Application code | Memory exhaustion from oversized payloads |

What is NOT the parser's job

These concerns must be handled by your application or rendering layer:

| Concern | Responsible layer | Why |
| --- | --- | --- |
| HTML escaping | Your renderer (e.g., Vue, React) | The parser produces tokens, not HTML. XSS via HTML injection is a rendering-layer concern. |
| Rate limiting | Your API layer | The parser is stateless -- it does not know about request frequency. |
| Spam detection | Your moderation system | Content-level policy (profanity, links to malicious domains, etc.) requires domain knowledge the parser does not have. |
| Image/media validation | Your media pipeline | If you add an img tag, URL validation is necessary but not sufficient -- you also need to verify the resource itself is safe. |
| Session/auth checks | Your API layer | The parser does not know who is sending messages. |