Article: How to consume RSS safely - simplepie/simplepie-ng GitHub Wiki
June 12, 2003
tags: html, rss, security
(Original source: http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely)
First of all, I apologize to those of you who subscribe to my RSS feed and use web-based or browser-based news aggregators. If you checked your news page in the last 12 hours, you no doubt saw my little prank: an entire screen full of platypuses. (Please, let’s not turn this into a discussion of proper pluralization. Try to stay with me.) They’re gone from my feed now, although depending on your software you may need to delete the post in question from your local news page as well.
Now that the contrition is out of the way, let’s face facts: if this prank affected you, your software is dangerously broken. It accepts arbitrary HTML from potentially 100s of sources and blindly republishes it all on a single page on your own web server (or desktop web server). This is fundamentally dangerous.
Now, the current situation is not entirely your software’s fault. RSS, by design, is difficult to consume safely. The RSS specification allows for `description` elements to contain arbitrary entity-encoded HTML. While this is great for RSS publishers (who can just throw stuff together and make an RSS feed), it makes writing a safe and effective RSS consumer application exceedingly difficult. And now that RSS is moving into the mainstream, the design decisions that got it there are becoming more and more of a problem.
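To make "entity-encoded HTML" concrete, here is a minimal Python sketch of what a consumer receives. The feed text and the `item_descriptions` helper are made up for illustration; a real aggregator would also want an XML parser hardened against hostile input, which the standard library parser used here is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-item feed, made up for illustration.
FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item>
<title>Example item</title>
<description>&lt;p style="position:fixed"&gt;Hello&lt;/p&gt;</description>
</item>
</channel></rss>"""

def item_descriptions(feed_xml):
    """Return the description text of every item, as the XML parser decodes it."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("description", default="") for item in root.iter("item")]

descriptions = item_descriptions(FEED)
# The XML parser undoes one level of entity encoding, so this prints live markup:
# <p style="position:fixed">Hello</p>
print(descriptions[0])
```

The string that comes back is ordinary HTML, `style` attribute and all. Writing it straight into a page is exactly the republishing step described above.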
HTML is nasty. Arbitrary HTML can carry nasty payloads: scripts, ActiveX objects, remote image web bugs, and arbitrary CSS styles that (as you saw with my platypus prank) can take over the entire screen. Browsers protect against the worst of these payloads by having different rules for different zones. For example, pages in the general Internet are marked untrusted and may not have privileges to run ActiveX objects, but pages on your own machine or within your own intranet can. Unfortunately, the practice of republishing remote HTML locally eliminates even this minimal safeguard.
Still, dealing with arbitrary HTML is not impossible. Web-based mail systems like Hotmail and Yahoo allow users to send and receive HTML mail, and they take great pains to display it safely. It’s a lot of work, and there have been several high-profile failures over the years, but they’re coping.
Let me be clear: by design, RSS forces every single consumer application to cope with this problem.
So, to anyone who wants to write a safe RSS aggregator (or who has already written an unsafe one), I offer this advice:
- Strip `script` tags. This almost goes without saying. Want to see the prank I didn’t pull? More seriously, `script` tags can be used by unscrupulous publishers to insert pop-up ads onto your news page. Think it won’t happen? Some larger commercial publishers are already inserting text ads and banner ads into their feeds.
- Strip `embed` tags.
- Strip `object` tags.
- Strip `frameset` tags.
- Strip `frame` tags.
- Strip `iframe` tags.
- Strip `meta` tags, which can be used to hijack a page and redirect it to a remote URL.
- Strip `link` tags, which can be used to import additional style definitions.
- Strip `style` tags, for the same reason.
- Strip `style` attributes from every single remaining tag. My platypus prank was based entirely on a single rogue `style` attribute.
Alternatively, you can simply strip all but a known subset of tags. Many comment systems work this way. You’ll still need to strip `style` attributes though, even from the known good tags.
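The "known subset" approach might be sketched like this in Python, using the standard library's `html.parser`. The tag and attribute allowlists, the class name and the `sanitize` helper are all assumptions chosen for illustration, not a vetted policy. Note two limits: the sketch does not inspect the `href` value itself, and the standard library parser does not necessarily see markup the way a given browser does.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative allowlists (assumptions, not a recommended policy).
ALLOWED_TAGS = {"a", "b", "blockquote", "br", "code", "em", "i",
                "li", "ol", "p", "pre", "strong", "ul"}
ALLOWED_ATTRS = {"a": {"href", "title"}}  # href values are NOT vetted here
DROP_CONTENT = {"script", "style"}        # drop these tags and their text
VOID_TAGS = {"br"}                        # tags that never get a closing tag

class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = ""
        for name, value in attrs:
            # style, class, id and every on* handler fall through here
            if name in allowed:
                kept += ' %s="%s"' % (name, escape(value or "", quote=True))
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

dirty = ('<p style="position:fixed" onclick="x()">Hi <b>there</b>'
         '<script>alert(1)</script></p><div>x</div>')
print(sanitize(dirty))  # <p>Hi <b>there</b></p>x
```

Because attributes are kept only when they appear on the allowlist, `style` attributes disappear from the known good tags without needing a separate rule.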
You forgot two important ones:
- If you strip style attributes, you want to strip event handlers too. Otherwise: `… onLoad="location.href='http://www.playboy.com'" …`
- Plus, there are the layout-breaking tags, like a closing DIV or closing TABLE.
.
Emmanuel: the issue has been raised many times in many forums. See, for instance:
- http://www.intertwingly.net/blog/940.html
- http://feeds.archive.org/validator/docs/warning/ContainsScript.html
- http://webservices.xml.com/pub/a/ws/2002/11/19/rssfeedquality.html?page=2
- http://www.securiteam.com/unixfocus/6L00H205PY.html
- http://project.antville.org/stories/200348/
- http://www.peerfear.org/rss/permalink/1028943207.shtml
- http://diveintomark.org/archives/2002/10/10/more_on_evolvable_formats.html
- http://philringnalda.com/blog/2002/04/thinking_about_rss.php
- http://groups.yahoo.com/group/radio-userland/message/9965
- http://radio.weblogs.com/0100887/categories/rss/2002/05/23.html#a265
- http://vyom.org/cat_internet/rss_security_vulnerabilities.php
A Google search for “rss strip html tags” will turn up dozens more.
.
And you should use regular expressions to remove them:
- `/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/i`
- `/<\/?!?(param|link|meta|doctype|div|font)[^>]*>/i`
- `/(class|style|id)="[^"]*"/i`
.
One more tag to be wary of: `<body>`. When IE encounters a `<body onload>` inside the main `<body>` section, it will execute that script as if it was on the outer `<body>`.
.
I’d be surprised if this problem can be solved properly using regular expressions - for example, the example regexps pasted in above would miss tags that don’t have a closing tag, as well as unquoted attributes. I know from experience (http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker) that there are a huge number of HTML “tricks” for causing problems, especially if your browser is IE (which is renowned for accepting pretty much any garbage markup).
To be truly safe, you need to use a proper HTML parser to pre-process the markup. Even worse, the parser can’t just be a standard HTML parser - it will need to closely match the parser of the eventual consuming browser (generally IE) as otherwise it could miss stuff that IE will still process.
It’s a very nasty problem.
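A small Python sketch of the two failure modes mentioned above. The patterns are my reading of the regexps in the earlier comment (quantifiers and quoting are an assumption), and `regex_clean` and the sample strings are made up for illustration.

```python
import re

# Assumed reading of the patterns from the earlier comment.
PAIRED = re.compile(
    r"<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*</\1>",
    re.IGNORECASE,
)
ATTRS = re.compile(r'(class|style|id)="[^"]*"', re.IGNORECASE)

def regex_clean(html_text):
    """Apply the paired-tag pattern, then the quoted-attribute pattern."""
    return ATTRS.sub("", PAIRED.sub("", html_text))

samples = {
    "closed_script": "<script>alert(1)</script>ok",
    "unclosed_iframe": '<iframe src="http://example.com/">',
    "unquoted_style": "<p style=position:fixed>hi</p>",
    "two_scripts": "<script>a</script>KEEP<script>b</script>",
}

results = {name: regex_clean(text) for name, text in samples.items()}
for name, cleaned in results.items():
    print(name, "->", repr(cleaned))
```

The closed `script` element is removed as intended. The unclosed `iframe` and the unquoted `style` attribute pass through untouched, and in the last sample the greedy match also swallows the harmless text between the two scripts.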
.
There are a lot of feeds out there that blindly copy out the bad HTML that has been entered by someone else in a comments box. Either that or they include all the formatting used on the blog itself. To avoid the item being too long they then chop after N characters and add “…”. The end result of this is `<description>` containing not malicious but annoying tags like `<font` and `<table`, and because this isn’t cleaned up before chopping these are often unmatched. I think all this is much more of a problem than the rare occasions where someone deliberately tries an exploit, dangerous though that might be.

So I can `strip_tags()` selectively, and use some simple regex to get rid of the worst of the tag attributes. But I’ve still got to build an HTML tidy to catch the unmatched tags.

It’s enough to make me want to exclude everything except `<a href`, `<img` and I’m not too sure about those either.
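For the unmatched-tag part, a rough Python sketch of a tag balancer built on the standard library's `html.parser`. It closes whatever is still open at the end of a chopped fragment and drops closing tags that were never opened. The `TagBalancer` class, the `balance` helper and the list of void tags are assumptions for illustration; this is nowhere near a full HTML tidy and does no security filtering at all.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # illustrative; these never take a closing tag

class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of elements opened but not yet closed

    def handle_starttag(self, tag, attrs):
        rendered = ""
        for name, value in attrs:
            rendered += ' %s="%s"' % (name, escape(value or "", quote=True))
        self.out.append("<%s%s>" % (tag, rendered))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag, e.g. a lone </div> or </table>
        # Close everything opened after the matching start tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def balance(fragment):
    parser = TagBalancer()
    parser.feed(fragment)
    parser.close()
    while parser.open_tags:  # close whatever the chop left open
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)

print(balance("<table><tr><td><b>Hello"))
# <table><tr><td><b>Hello</b></td></tr></table>
```

A fragment chopped in the middle of a tag (say, ending in `<font col`) is a separate case that this sketch does not try to handle.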
.
Also, be sure to restrict the URLs of images, links, etc. For Mozilla, you must disallow links to `javascript:` and `data:` URLs. For IE and NS4, I think there are a few synonyms for `javascript:` you also have to disallow.
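One way to sidestep the synonym hunt is to allow only a short list of known schemes rather than block the bad ones. A minimal Python sketch follows; the scheme list, the `is_safe_url` name and the decision to accept scheme-less (relative) URLs are assumptions. It expects the attribute value after HTML entities have already been decoded.

```python
import re

# Illustrative allowlist (an assumption, not a complete policy).
SAFE_SCHEMES = {"http", "https", "ftp", "mailto"}

_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

def is_safe_url(url):
    """Return True if the URL has an allowed scheme or no scheme at all."""
    # Drop spaces and control characters, which some browsers ignore
    # when working out the scheme (e.g. "java\tscript:").
    cleaned = "".join(ch for ch in url if ord(ch) > 0x20)
    match = _SCHEME.match(cleaned)
    if match is None:
        return True  # no scheme: treated as a relative URL
    return match.group(1).lower() in SAFE_SCHEMES

for candidate in ["http://example.com/", "JavaScript:alert(1)",
                  " java\tscript:alert(1)", "data:text/html,<b>x</b>"]:
    print(repr(candidate), is_safe_url(candidate))
```

With an allowlist, an unfamiliar scheme is rejected by default instead of slipping past a list of known bad ones.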