Parsing HTML - garyhurtz/html2json.py GitHub Wiki
In addition to letting you build up a document, Element provides a parse(html) method that parses an HTML document into a tree of Elements, which can then be rendered to JSON:
with open(u'document.html', u'r') as infile:
    html = infile.read()
dut = Element.parse(html)
Assuming the HTML document contains:
<h1>title</h1>
<p>content</p>
<figure>
<img src="cover.jpg">
</figure>
You can then render it to JSON:
dut.render() --> {
    u'tag': u'div',
    u'child': [
        {
            u'tag': u'h1',
            u'text': u'title',
        },
        {
            u'tag': u'p',
            u'text': u'content',
        },
        {
            u'tag': u'figure',
            u'child': [
                {
                    u'tag': u'img',
                    u'attr': {
                        u'src': u'cover.jpg'
                    }
                }
            ]
        }
    ]
}
Astute readers will notice that there is an extra <div> tag in the output. What's up with that?
If the incoming HTML contains a single root element, the document will be parsed and rendered directly to JSON. If the incoming HTML contains multiple elements at the root level, as in the preceding example, a root element will be instantiated and the HTML will be parsed into children of that element. By default the root element is a <div>, although this can be overridden by passing the desired tag to the parse method:
dut = Element.parse(html, parent=u'article')
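The root-wrapping rule described above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's actual implementation: TreeBuilder, VOID_TAGS, and parse_to_dict are hypothetical names, and the sketch produces plain dicts rather than Elements. It returns a lone root element directly and wraps multiple top-level elements in a synthetic parent.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Hypothetical stand-in for the library's parser: builds nested dicts."""

    def __init__(self):
        super().__init__()
        self.roots = []  # elements found at the top level of the document
        self.stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag}
        if attrs:
            node["attr"] = dict(attrs)
        if self.stack:
            # Nested element: attach it to the innermost open element.
            self.stack[-1].setdefault("child", []).append(node)
        else:
            self.roots.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.stack[-1]["text"] = text


def parse_to_dict(html, parent="div"):
    """Return the single root as-is, or wrap multiple roots in `parent`."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    if len(builder.roots) == 1:
        return builder.roots[0]
    return {"tag": parent, "child": builder.roots}


single = parse_to_dict("<p>content</p>")
wrapped = parse_to_dict("<h1>title</h1>\n<p>content</p>", parent="article")
print(single["tag"], wrapped["tag"])
```

Here the single-paragraph input comes back as the p element itself, while the two-element input is wrapped in the requested article root.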
If you want JSON for only the elements in the original file, without the added root, you can recover them through the root element's child attribute:
original_elements = dut.get(u'child')
which will return a list of parsed elements.
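As an illustration of working with that list, here is a minimal sketch that serializes each recovered element separately. It assumes the rendered output is a plain dict shaped like the one shown earlier; the rendered variable is a hand-written stand-in, not a value produced by the library.

```python
import json

# Hand-written stand-in for the rendered output shown earlier (an assumption).
rendered = {
    "tag": "div",
    "child": [
        {"tag": "h1", "text": "title"},
        {"tag": "p", "text": "content"},
        {"tag": "figure", "child": [{"tag": "img", "attr": {"src": "cover.jpg"}}]},
    ],
}

# Drop the synthetic root and keep only the original top-level elements.
original_elements = rendered.get("child", [])

# Serialize each element on its own line.
lines = [json.dumps(element, sort_keys=True) for element in original_elements]
for line in lines:
    print(line)
```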