en Tutorial Safe UGC - chiba233/yumeDSL GitHub Wiki
You have a chat or comment system. Users type messages. You want to allow simple formatting -- bold, italic, links -- but prevent abuse:
- No block or raw tags (could break layout or inject raw content)
- Unknown tags should show as plain text, not throw errors
- Malformed markup should never crash the parser
- URLs must be sanitized (no javascript: or data: schemes)
- Error reporting for moderation tools
This tutorial walks through building a complete, production-ready UGC chat message pipeline with yume-dsl-rich-text. Every step includes working code and attack-scenario tests so you can verify the safety properties yourself.
The first line of defense is restricting which tags and which tag forms the parser accepts. A chat system typically needs only inline formatting -- no multi-line code blocks, no raw content injection, no block-level layout containers.
import {
createParser,
createSimpleInlineHandlers,
createPipeHandlers,
type PipeArgs,
type TokenDraft,
type TextToken,
type ParseError,
} from "yume-dsl-rich-text";
const chatParser = createParser({
handlers: {
// Simple inline formatting -- bold, italic, underline, strike, code
...createSimpleInlineHandlers(["bold", "italic", "underline", "strike", "code"]),
// Link handler with URL sanitization (see Step 4)
...createPipeHandlers({
link: {
inline(args, ctx) {
const rawUrl = args.text(0);
const url = sanitizeUrl(rawUrl);
return {
type: "link",
url,
value: args.materializedTailTokens(1),
};
},
},
}),
},
// KEY SECURITY MEASURE: only allow inline forms
allowForms: ["inline"],
});
The allowForms option is the single most important security setting for UGC. It restricts the parser globally -- not per tag, but across the entire parse operation. Here is what it does and does not allow:
Allowed (inline form):
| Input | Result |
|---|---|
| $$bold(hello)$$ | Parsed as bold token |
| $$italic(world)$$ | Parsed as italic token |
| $$link(https://example.com \| click)$$ | Parsed as link token |
| $$bold($$italic(nested)$$)$$ | Parsed as bold containing italic |
Blocked (raw and block forms):
| Input | Result |
|---|---|
| $$code(js)%\nalert('xss')\n%end$$ | Entire markup becomes literal text |
| $$info(title)*\n<script>...</script>\n*end$$ | Entire markup becomes literal text |
| $$unknown()*\nmalicious content\n*end$$ | Entire markup becomes literal text |
When allowForms does not include "raw" or "block", the parser treats those forms as if the handler does not support them. The raw $$code(js)%\nalert('xss')\n%end$$ syntax is not parsed as a tag at all -- it flows through as literal text characters in the output. No error is thrown, no special handling is needed. The user just sees the raw markup as text.
This applies to all tags, including unregistered ones. Even if someone invents $$exploit()*\n...\n*end$$, the block form is disabled globally, so the parser never enters block-parsing mode.
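As an extra safeguard on top of allowForms, an application could walk the parsed token tree and confirm that only expected token types appear. The sketch below is not part of yume-dsl-rich-text: the TokenNode shape is assumed from the outputs shown in this tutorial, and findUnexpectedTypes and CHAT_TYPES are hypothetical names.

```typescript
// Token shape assumed from the example outputs in this tutorial (not imported from the library).
interface TokenNode {
  type: string;
  value: string | TokenNode[];
  id?: string;
}

// Hypothetical helper: returns every token type in the tree that is not allow-listed.
function findUnexpectedTypes(tokens: TokenNode[], allowed: ReadonlySet<string>): string[] {
  const unexpected = new Set<string>();
  const stack: TokenNode[] = [...tokens];
  while (stack.length > 0) {
    const token = stack.pop()!;
    if (!allowed.has(token.type)) unexpected.add(token.type);
    // Descend into child tokens; string values are leaves.
    if (Array.isArray(token.value)) stack.push(...token.value);
  }
  return Array.from(unexpected).sort();
}

// Types a chat renderer is prepared to handle (assumed list for this sketch).
const CHAT_TYPES = new Set(["text", "bold", "italic", "underline", "strike", "code", "link"]);

const sample: TokenNode[] = [
  { type: "bold", value: [{ type: "text", value: "hello" }] },
  { type: "info", value: [{ type: "text", value: "should not be here" }] },
];

console.log(findUnexpectedTypes(sample, CHAT_TYPES)); // ["info"]
```

A non-empty result would suggest a configuration mistake (for example, a handler registered by accident) rather than a user attack, so it is probably best treated as an alert for developers.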
Without allowForms (UNSAFE for UGC):
import { createSimpleRawHandlers } from "yume-dsl-rich-text";
const unsafeParser = createParser({
handlers: {
...createSimpleInlineHandlers(["bold"]),
...createSimpleRawHandlers(["code"]),
},
// No allowForms -- all forms enabled by default
});
unsafeParser.parse("$$code(js)%\nalert(document.cookie)\n%end$$");
// Result: [{ type: "code", arg: "js", value: "alert(document.cookie)", id: "rt-0" }]
// The raw content is captured verbatim -- if your renderer does not escape it,
// this is an XSS vector.
With allowForms: ["inline"] (SAFE):
chatParser.parse("$$code(js)%\nalert(document.cookie)\n%end$$");
// Result: [{ type: "text", value: "$$code(js)%\nalert(document.cookie)\n%end$$", id: "rt-0" }]
// The entire input is plain text. No tag was recognized.
The parser is designed to never throw on malformed input. Every syntax error, every unknown tag, every nesting abuse degrades to plain text. This is critical for UGC: you cannot predict what users will type, and a crash means denial of service.
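The parser is described as non-throwing, but this page does not say what happens if your own handler code throws. A cautious application might still wrap the parse call so that any unexpected exception degrades to a single text token. This is a minimal sketch under that assumption: makeSafeParse, ParseFn, and the fallback id are hypothetical, and the parse function is passed in rather than imported.

```typescript
// Token shape assumed from the example outputs in this tutorial.
interface TokenNode {
  type: string;
  value: string | TokenNode[];
  id?: string;
}

type ParseFn = (input: string) => TokenNode[];

// Hypothetical wrapper: if parse throws for any reason, show the raw input as plain text.
function makeSafeParse(parse: ParseFn): ParseFn {
  return (input) => {
    try {
      return parse(input);
    } catch {
      return [{ type: "text", value: input, id: "fallback-0" }];
    }
  };
}

// Stand-in parser that simulates a buggy handler (for demonstration only).
const flakyParse: ParseFn = (input) => {
  if (input.includes("boom")) throw new Error("handler bug");
  return [{ type: "text", value: input }];
};

const safeParse = makeSafeParse(flakyParse);
console.log(safeParse("boom").length); // 1 (a single fallback text token)
```

With the parser from this tutorial, the argument would be something like (input) => chatParser.parse(input).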
Here are the key degradation scenarios, with exact output:
chatParser.parse("$$unknown(hello)$$");
// Result:
// [{ type: "text", value: "hello", id: "rt-0" }]
The tag unknown is not in the handlers map. The parser recognizes the syntax but has no handler for it, so the content "hello" is unwrapped as plain text. The $$unknown( and )$$ delimiters are stripped, and only the inner content survives.
const errors: ParseError[] = [];
chatParser.parse("$$bold(unclosed", {
onError: (e) => errors.push(e),
});
// Result:
// [{ type: "text", value: "$$bold(unclosed", id: "rt-0" }]
//
// errors[0]:
// {
// code: "INLINE_NOT_CLOSED",
// message: "(L1:C1) Inline tag not closed: >>>$$bold(<<< unclosed",
// line: 1,
// column: 1,
// snippet: " >>>$$bold(<<< unclosed"
// }
The opening $$bold( is never closed with )$$. The parser reports INLINE_NOT_CLOSED and recovers by treating the entire string as literal text. No crash, no partial token.
chatParser.parse("$$code(js)%\nalert(1)\n%end$$");
// Result:
// [{ type: "text", value: "$$code(js)%\nalert(1)\n%end$$", id: "rt-0" }]
Even though the syntax is perfectly valid raw-form DSL, the allowForms: ["inline"] setting means the raw form is globally disabled. The parser does not even attempt to parse it as a raw tag.
const deepInput = "$$bold(".repeat(100) + "hello" + ")$$".repeat(100);
const errors: ParseError[] = [];
chatParser.parse(deepInput, {
onError: (e) => errors.push(e),
});
// At depth 50 (default depthLimit), the parser stops recursing.
// The offending tag degrades to literal text.
// errors will contain at least one entry with code: "DEPTH_LIMIT"
The default depthLimit is 50. For a chat system, you might want to lower it:
const chatParser = createParser({
handlers: { /* ... */ },
allowForms: ["inline"],
depthLimit: 10, // Chat messages rarely need more than a few levels
});
chatParser.parse("$$bold(hello $$italic(world)$$)$$");
// Result:
// [
// {
// type: "bold",
// value: [
// { type: "text", value: "hello ", id: "rt-0" },
// {
// type: "italic",
// value: [{ type: "text", value: "world", id: "rt-1" }],
// id: "rt-2",
// },
// ],
// id: "rt-3",
// },
// ]
Nested inline tags parse correctly. The bold token contains both a text node and an italic child token.
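To see how nesting depth shows up in the token tree, one could measure it directly. The following sketch uses a hand-written tree mirroring the result above; TokenNode and maxNestingDepth are hypothetical names, and the depth convention (text-only input has depth 0) is a choice made for this example, not necessarily how the library counts depthLimit.

```typescript
// Token shape assumed from the example outputs in this tutorial.
interface TokenNode {
  type: string;
  value: string | TokenNode[];
  id?: string;
}

// Hypothetical helper: the number of nested tag levels in a token tree.
// A tree made only of text tokens has depth 0.
function maxNestingDepth(tokens: TokenNode[]): number {
  let deepest = 0;
  for (const token of tokens) {
    if (Array.isArray(token.value)) {
      deepest = Math.max(deepest, 1 + maxNestingDepth(token.value));
    }
  }
  return deepest;
}

// Hand-written copy of the nested bold/italic result.
const nested: TokenNode[] = [
  {
    type: "bold",
    value: [
      { type: "text", value: "hello " },
      { type: "italic", value: [{ type: "text", value: "world" }] },
    ],
  },
];

console.log(maxNestingDepth(nested)); // 2
```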
chatParser.parse("$$bold(hello)$$ $$unknown(oops)$$ $$italic(world)$$");
// Result:
// [
// { type: "bold", value: [{ type: "text", value: "hello", ... }], ... },
// { type: "text", value: " ", ... },
// { type: "text", value: "oops", ... },
// { type: "text", value: " ", ... },
// { type: "italic", value: [{ type: "text", value: "world", ... }], ... },
// ]
The valid bold and italic tags parse normally. The unregistered unknown tag degrades to plain text "oops". The surrounding content is unaffected.
The parser's onError callback is your window into malformed input. For a chat system, error data is valuable for moderation -- a message full of parse errors is likely spam or an exploit attempt.
function parseMessage(input: string) {
const errors: ParseError[] = [];
const tokens = chatParser.parse(input, {
onError: (e) => errors.push(e),
});
return { tokens, errors };
}
Every error contains:
interface ParseError {
code: ErrorCode;
message: string;
line: number;
column: number;
snippet: string;
}
| Field | Description |
|---|---|
| code | Machine-readable error type (ErrorCode union) |
| message | Human-readable with (L{line}:C{column}) prefix and >>>...<<< snippet markers |
| line | 1-indexed line number where the error starts |
| column | 1-indexed column number where the error starts |
| snippet | Context around the error with >>> <<< markers showing the problematic span |
With allowForms: ["inline"], you will primarily encounter these codes:
| Code | Meaning | Example trigger |
|---|---|---|
| INLINE_NOT_CLOSED | Inline tag opened but never closed | $$bold(unclosed |
| SHORTHAND_NOT_CLOSED | Implicit inline shorthand opened but never closed (since 1.3) | bold(unclosed with implicitInlineShorthand enabled |
| UNEXPECTED_CLOSE | Stray close marker with no matching open | lone )$$ in the middle of text |
| DEPTH_LIMIT | Nesting exceeded depthLimit | $$a($$b($$c($$d(... beyond the limit |
You will not see BLOCK_NOT_CLOSED, RAW_NOT_CLOSED, or their malformed variants, because block and raw forms are globally disabled by allowForms: ["inline"].
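For moderation dashboards it can help to reduce a list of errors to a count per code. This is a small sketch, assuming only the code, line, and column fields described above; ParseErrorLike and countErrorsByCode are hypothetical names, and the sample errors are hand-written rather than produced by the parser.

```typescript
// Minimal structural stand-in for the library's ParseError (only the fields used here).
interface ParseErrorLike {
  code: string;
  line: number;
  column: number;
}

// Hypothetical helper: tally errors by their machine-readable code.
function countErrorsByCode(errors: ParseErrorLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const error of errors) {
    counts[error.code] = (counts[error.code] ?? 0) + 1;
  }
  return counts;
}

// Hand-written sample data for illustration.
const sampleErrors: ParseErrorLike[] = [
  { code: "INLINE_NOT_CLOSED", line: 1, column: 1 },
  { code: "UNEXPECTED_CLOSE", line: 1, column: 20 },
  { code: "INLINE_NOT_CLOSED", line: 2, column: 5 },
];

console.log(countErrorsByCode(sampleErrors));
// { INLINE_NOT_CLOSED: 2, UNEXPECTED_CLOSE: 1 }
```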
interface ModerationResult {
tokens: TextToken[];
flagged: boolean;
reason?: string;
}
function moderateMessage(input: string): ModerationResult {
const { tokens, errors } = parseMessage(input);
// Flag messages with excessive parse errors
if (errors.length > 5) {
return {
tokens,
flagged: true,
reason: `Excessive parse errors (${errors.length}): possible markup abuse`,
};
}
// Flag messages with depth limit hits -- likely nesting attack
const depthErrors = errors.filter((e) => e.code === "DEPTH_LIMIT");
if (depthErrors.length > 0) {
return {
tokens,
flagged: true,
reason: `Depth limit exceeded ${depthErrors.length} time(s): possible nesting attack`,
};
}
return { tokens, flagged: false };
}
function logParseErrors(userId: string, messageId: string, errors: ParseError[]) {
for (const err of errors) {
console.warn(
`[UGC Parse Error] user=${userId} msg=${messageId} ` +
`code=${err.code} L${err.line}:C${err.column} ${err.snippet}`
);
}
}
This gives you a stream of structured data you can feed into your monitoring system. A sudden spike in DEPTH_LIMIT errors from a single user is a strong signal of abuse.
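One way to turn that signal into a decision is a per-user sliding window. The sketch below is an illustration only: createSpikeDetector, the 60-second window, and the threshold of 3 are hypothetical choices, and it keeps state in memory without evicting idle users, which a real service would need to handle.

```typescript
// Returns true when the user has reached the threshold within the window.
type SpikeDetector = (userId: string, nowMs: number) => boolean;

// Hypothetical helper: call the returned function once per DEPTH_LIMIT error.
function createSpikeDetector(windowMs: number, threshold: number): SpikeDetector {
  const hitsByUser = new Map<string, number[]>();
  return (userId, nowMs) => {
    const cutoff = nowMs - windowMs;
    // Keep only timestamps that are still inside the window, then add this hit.
    const recent = (hitsByUser.get(userId) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    hitsByUser.set(userId, recent);
    return recent.length >= threshold;
  };
}

// Timestamps are passed in explicitly so the behavior is easy to test.
const isSpiking = createSpikeDetector(60_000, 3);
console.log(isSpiking("user-1", 0));     // false
console.log(isSpiking("user-1", 1_000)); // false
console.log(isSpiking("user-1", 2_000)); // true
```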
The parser does not validate URLs -- that is the handler's responsibility. For UGC, you must sanitize URLs to prevent javascript:, data:, and other dangerous schemes.
function sanitizeUrl(raw: string): string | undefined {
if (!raw) return undefined;
const trimmed = raw.trim();
if (!trimmed) return undefined;
// Decode and normalize to catch obfuscation attempts
let decoded: string;
try {
decoded = decodeURIComponent(trimmed);
} catch {
// Malformed percent encoding -- reject
return undefined;
}
// Strip whitespace and control characters that browsers might ignore
const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();
// Only allow http and https schemes
if (normalized.startsWith("http://") || normalized.startsWith("https://")) {
return trimmed; // Return the original (not decoded) URL
}
// Allow protocol-relative URLs (resolve to page's protocol)
if (normalized.startsWith("//")) {
return trimmed;
}
// Allow relative paths (no scheme)
if (!normalized.includes(":")) {
return trimmed;
}
// Reject everything else (javascript:, data:, vbscript:, etc.)
return undefined;
}
const handlers = createPipeHandlers({
link: {
inline(args, ctx) {
const rawUrl = args.text(0);
const url = sanitizeUrl(rawUrl);
const displayTokens = args.materializedTailTokens(1);
// If URL is rejected, output the display text as plain content
// (or fall back to the raw URL text if no display text was provided)
if (url === undefined) {
return {
type: "text",
value: displayTokens.length > 0 ? displayTokens : rawUrl,
};
}
return {
type: "link",
url,
value: displayTokens.length > 0 ? displayTokens : [{ type: "text", value: url, id: "" }],
};
},
},
});
| Input | rawUrl | sanitizeUrl result | Output |
|---|---|---|---|
| $$link(https://safe.com \| click)$$ | "https://safe.com" | "https://safe.com" | Link token with url: "https://safe.com" |
| $$link(javascript:alert(1) \| click)$$ | "javascript:alert(1)" | undefined | Plain text "click" (link dropped) |
| $$link(data:text/html,<script>... \| click)$$ | "data:text/html,..." | undefined | Plain text "click" (link dropped) |
| $$link(JAVASCRIPT:alert(1) \| click)$$ | "JAVASCRIPT:alert(1)" | undefined | Plain text "click" (case-insensitive check) |
| $$link(java\x00script:alert(1) \| click)$$ | "java\x00script:..." | undefined | Plain text "click" (control char stripped during normalization) |
| $$link(\| no url)$$ | "" | undefined | Plain text "no url" (empty URL rejected) |
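A different technique from the string checks in sanitizeUrl is to let the platform's WHATWG URL parser decide the scheme and then compare it against an allow-list. The sketch below shows that alternative; it is not what the tutorial's handler does. hasAllowedProtocol and the placeholder base https://chat.example/ are assumptions made for illustration, and relative inputs are accepted because they resolve against that https base.

```typescript
// Schemes accepted in chat links (assumed policy for this sketch).
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

// Placeholder base used only to resolve relative and protocol-relative URLs.
const PLACEHOLDER_BASE = "https://chat.example/";

// Hypothetical helper: true if the URL parses and its scheme is allow-listed.
function hasAllowedProtocol(raw: string): boolean {
  const trimmed = raw.trim();
  if (trimmed === "") return false;
  try {
    // The URL parser lowercases the scheme and drops tabs/newlines before parsing.
    return ALLOWED_PROTOCOLS.has(new URL(trimmed, PLACEHOLDER_BASE).protocol);
  } catch {
    return false; // Unparseable input is rejected.
  }
}

console.log(hasAllowedProtocol("https://safe.com"));    // true
console.log(hasAllowedProtocol("JAVASCRIPT:alert(1)")); // false
console.log(hasAllowedProtocol("/rooms/42"));           // true
```

The two approaches can be combined: run the string checks first, then use the parsed protocol as a second opinion before emitting a link token.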
You might think: "I will just sanitize when I render the HTML." That works, but sanitizing at parse time has advantages:
- Defense in depth -- even if a renderer has a bug, the dangerous URL never makes it into the token tree.
- Consistent behavior -- every consumer of the token tree (web renderer, mobile renderer, notification text, API response) gets safe URLs without each implementing sanitization.
- Moderation visibility -- you can log when URLs are rejected, giving your moderation tools more signal.
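The moderation-visibility point can be sketched as a wrapper around any sanitizer that records each rejection. Everything here is hypothetical (Sanitizer, UrlRejection, withRejectionLog, and the 200-character truncation); the toy sanitizer stands in for the sanitizeUrl function above so the example runs on its own.

```typescript
type Sanitizer = (raw: string) => string | undefined;

interface UrlRejection {
  rawUrl: string; // Truncated so hostile payloads do not bloat the logs.
  at: number;     // Timestamp in milliseconds.
}

// Hypothetical wrapper: behaves like the given sanitizer but reports every rejection.
function withRejectionLog(sanitize: Sanitizer, onReject: (rejection: UrlRejection) => void): Sanitizer {
  return (raw) => {
    const result = sanitize(raw);
    if (result === undefined) {
      onReject({ rawUrl: raw.slice(0, 200), at: Date.now() });
    }
    return result;
  };
}

// Toy sanitizer for demonstration; in the tutorial this would be sanitizeUrl.
const toySanitizer: Sanitizer = (raw) => (raw.startsWith("https://") ? raw : undefined);

const rejections: UrlRejection[] = [];
const loggedSanitize = withRejectionLog(toySanitizer, (r) => { rejections.push(r); });

loggedSanitize("https://safe.com");
loggedSanitize("javascript:alert(1)");
console.log(rejections.length); // 1
```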
The parser handles syntax safety. Content policy -- length limits, spam detection, rate limiting, profanity filtering -- is your application's responsibility. The parser provides one useful tool for content policy: stripRichText (or the strip() method on a parser instance, written dsl.strip()).
If a user sends:
$$bold($$italic($$underline(hello)$$)$$)$$
The raw input is 42 characters. The actual visible text is 5 characters: hello. If you enforce a 500-character limit on the raw input alone, a user could send 500 characters of markup that carry only 10 characters of content. Conversely, users who format heavily but legitimately might hit the limit prematurely.
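To make the gap concrete, the sketch below flattens a hand-written token tree for that input and compares the two lengths. It does not call the library: TokenNode and plainTextOf are hypothetical names, and the tree is what one would expect the parse result to look like based on the nested example earlier.

```typescript
// Token shape assumed from the example outputs in this tutorial.
interface TokenNode {
  type: string;
  value: string | TokenNode[];
  id?: string;
}

// Hypothetical helper: concatenates the text leaves of a token tree.
function plainTextOf(tokens: TokenNode[]): string {
  return tokens
    .map((token) => (typeof token.value === "string" ? token.value : plainTextOf(token.value)))
    .join("");
}

const raw = "$$bold($$italic($$underline(hello)$$)$$)$$";

// Hand-written tree for the input above (assumed parse result).
const tree: TokenNode[] = [
  {
    type: "bold",
    value: [
      {
        type: "italic",
        value: [{ type: "underline", value: [{ type: "text", value: "hello" }] }],
      },
    ],
  },
];

console.log(raw.length, plainTextOf(tree).length); // 42 5
```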
Use the parser's strip() method (chatParser.strip() in the code below) to get the plain-text length for policy checks:
function checkContentLength(input: string, maxLength: number): { ok: boolean; plainLength: number } {
const plainText = chatParser.strip(input);
return {
ok: plainText.length <= maxLength,
plainLength: plainText.length,
};
}
If you need both tokens (for rendering) and plain text (for length checks), avoid parsing twice. Call parse once, then use extractText on the result:
import { extractText } from "yume-dsl-rich-text";
function processMessage(input: string) {
const errors: ParseError[] = [];
const tokens = chatParser.parse(input, {
onError: (e) => errors.push(e),
});
const plainText = extractText(tokens);
const plainLength = plainText.length;
return { tokens, plainText, plainLength, errors };
}
interface PolicyResult {
allowed: boolean;
reason?: string;
}
function checkContentPolicy(input: string): PolicyResult {
// 1. Raw input length (prevent extremely large payloads)
if (input.length > 10_000) {
return { allowed: false, reason: "Message too long (raw)" };
}
// 2. Plain text length (actual content)
const plainText = chatParser.strip(input);
if (plainText.length > 2_000) {
return { allowed: false, reason: "Message too long (content)" };
}
if (plainText.length === 0) {
return { allowed: false, reason: "Message is empty" };
}
// 3. Markup-to-content ratio (detect markup spam)
const ratio = input.length / plainText.length;
if (ratio > 10) {
return { allowed: false, reason: "Excessive markup" };
}
return { allowed: true };
}
Here is the complete chat message processing pipeline, combining every concept from the previous steps:
import {
createParser,
createSimpleInlineHandlers,
createPipeHandlers,
extractText,
type PipeArgs,
type TokenDraft,
type TextToken,
type ParseError,
} from "yume-dsl-rich-text";
// --- URL sanitization ---
function sanitizeUrl(raw: string): string | undefined {
if (!raw) return undefined;
const trimmed = raw.trim();
if (!trimmed) return undefined;
let decoded: string;
try {
decoded = decodeURIComponent(trimmed);
} catch {
return undefined;
}
const normalized = decoded.replace(/[\s\x00-\x1f]/g, "").toLowerCase();
if (normalized.startsWith("http://") || normalized.startsWith("https://")) {
return trimmed;
}
if (normalized.startsWith("//")) {
return trimmed;
}
if (!normalized.includes(":")) {
return trimmed;
}
return undefined;
}
// --- Parser setup ---
const chatParser = createParser({
handlers: {
...createSimpleInlineHandlers(["bold", "italic", "underline", "strike", "code"]),
...createPipeHandlers({
link: {
inline(args, ctx) {
const rawUrl = args.text(0);
const url = sanitizeUrl(rawUrl);
const display = args.materializedTailTokens(1);
if (url === undefined) {
return {
type: "text",
value: display.length > 0 ? display : rawUrl,
};
}
return {
type: "link",
url,
value: display.length > 0 ? display : [{ type: "text", value: url, id: "" }],
};
},
},
}),
},
allowForms: ["inline"],
depthLimit: 10,
});
// --- Message processing pipeline ---
interface ProcessedMessage {
tokens: TextToken[];
plainText: string;
errors: ParseError[];
flagged: boolean;
flagReason?: string;
}
function processMessage(input: string): ProcessedMessage {
// Step 1: Raw input length guard
if (input.length > 10_000) {
return {
tokens: [{ type: "text", value: "[Message too long]", id: "err-0" }],
plainText: "",
errors: [],
flagged: true,
flagReason: "Raw input exceeds 10,000 characters",
};
}
// Step 2: Parse with error collection
const errors: ParseError[] = [];
const tokens = chatParser.parse(input, {
onError: (e) => errors.push(e),
});
// Step 3: Extract plain text (single pass -- no re-parse)
const plainText = extractText(tokens);
// Step 4: Content policy checks
if (plainText.length === 0) {
return {
tokens,
plainText,
errors,
flagged: true,
flagReason: "Empty message after parsing",
};
}
if (plainText.length > 2_000) {
return {
tokens,
plainText,
errors,
flagged: true,
flagReason: "Content exceeds 2,000 characters",
};
}
// Step 5: Moderation signals from parse errors
const depthErrors = errors.filter((e) => e.code === "DEPTH_LIMIT");
if (depthErrors.length > 0) {
return {
tokens,
plainText,
errors,
flagged: true,
flagReason: `Depth limit hit ${depthErrors.length} time(s)`,
};
}
if (errors.length > 5) {
return {
tokens,
plainText,
errors,
flagged: true,
flagReason: `Excessive parse errors: ${errors.length}`,
};
}
// Step 6: Markup-to-content ratio
const ratio = input.length / Math.max(plainText.length, 1);
if (ratio > 10) {
return {
tokens,
plainText,
errors,
flagged: true,
flagReason: `Markup ratio ${ratio.toFixed(1)}:1 exceeds threshold`,
};
}
return { tokens, plainText, errors, flagged: false };
}
// Normal message
const result1 = processMessage("$$bold(Hello)$$ $$italic(world)$$!");
// result1.flagged === false
// result1.tokens contains bold + italic tokens
// Attack: javascript URL
const result2 = processMessage("$$link(javascript:alert(1) | click me)$$");
// result2.flagged === false (not an error -- URL was sanitized)
// The handler returns a text token instead of a link, so the message renders as plain text "click me"
// Attack: raw form injection
const result3 = processMessage("$$code(js)%\nalert(1)\n%end$$");
// result3.flagged === false
// result3.tokens is plain text (raw form blocked)
// Attack: nesting bomb
const result4 = processMessage("$$bold(".repeat(100) + "x" + ")$$".repeat(100));
// result4.flagged === true
// result4.flagReason contains "Depth limit hit"
A summary of every safety layer covered in this tutorial:
| Status | Measure | Responsible layer | What it prevents |
|---|---|---|---|
| Done | allowForms: ["inline"] | Parser config | Block/raw form injection |
| Done | URL sanitization in link handler | Tag handler | javascript:, data:, and other dangerous URL schemes |
| Done | Parser never throws | Parser core | Denial of service via malformed input |
| Done | onError callback | Parser config | Monitoring and moderation signal |
| Done | depthLimit (lowered to 10) | Parser config | Nesting bomb attacks |
| Done | stripRichText / extractText | Application code | Accurate content-length checks ignoring markup |
| Done | Markup ratio check | Application code | Markup spam / padding attacks |
| Done | Raw input length cap | Application code | Memory exhaustion from oversized payloads |
These concerns must be handled by your application or rendering layer:
| Concern | Responsible layer | Why |
|---|---|---|
| HTML escaping | Your renderer (e.g., Vue, React) | The parser produces tokens, not HTML. XSS via HTML injection is a rendering-layer concern. |
| Rate limiting | Your API layer | The parser is stateless -- it does not know about request frequency. |
| Spam detection | Your moderation system | Content-level policy (profanity, links to malicious domains, etc.) requires domain knowledge the parser does not have. |
| Image/media validation | Your media pipeline | If you add an img tag, URL validation is necessary but not sufficient -- you need to verify the resource is safe. |
| Session/auth checks | Your API layer | The parser does not know who is sending messages. |
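Since HTML escaping is left to the renderer, here is a minimal sketch of what a string-based renderer might look like if you are not using a framework that escapes for you. The token shape is assumed from the outputs in this tutorial, and RenderToken, escapeHtml, renderHtml, the tag mapping, and the rel attribute values are all hypothetical choices rather than anything the library provides.

```typescript
// Token shape assumed from the example outputs in this tutorial, plus the link url field.
interface RenderToken {
  type: string;
  value: string | RenderToken[];
  url?: string;
}

// Assumed mapping from token types to HTML elements.
const HTML_TAG_BY_TYPE = new Map<string, string>([
  ["bold", "strong"],
  ["italic", "em"],
  ["underline", "u"],
  ["strike", "s"],
  ["code", "code"],
]);

// Escapes the characters that are significant in HTML text and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Renders a token tree to an HTML string; every text leaf and URL is escaped.
function renderHtml(tokens: RenderToken[]): string {
  return tokens
    .map((token) => {
      const inner = typeof token.value === "string" ? escapeHtml(token.value) : renderHtml(token.value);
      if (token.type === "link" && typeof token.url === "string") {
        return `<a href="${escapeHtml(token.url)}" rel="noopener noreferrer nofollow">${inner}</a>`;
      }
      // Unknown types (including "text") render as their escaped content only.
      const tag = HTML_TAG_BY_TYPE.get(token.type);
      return tag === undefined ? inner : `<${tag}>${inner}</${tag}>`;
    })
    .join("");
}

const hostile: RenderToken[] = [
  { type: "bold", value: [{ type: "text", value: "<script>alert(1)</script>" }] },
];
console.log(renderHtml(hostile));
// <strong>&lt;script&gt;alert(1)&lt;/script&gt;</strong>
```

Frameworks such as Vue and React escape interpolated text by default, so a hand-rolled renderer like this is mainly relevant for server-side string output such as emails or notifications.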