Parsing HTML - garyhurtz/html2json.py GitHub Wiki

In addition to letting you build up a document, Element provides a parse(html) method that parses an HTML document into a tree of Elements, which can then be rendered to JSON:

with open(u'document.html', u'r') as infile:
    html = infile.read()

dut = Element.parse(html)

Assuming the HTML document contains:

<h1>title</h1>

<p>content</p>

<figure>
    <img src="cover.jpg">
</figure>

You can then render it to JSON:

dut.render() --> {
        u'tag': u'div',
        u'child': [
            {
                u'tag': u'h1',
                u'text': u'title',
            },
            {
                u'tag': u'p',
                u'text': u'content',
            },
            {
                u'tag': u'figure',
                u'child': [
                    {
                        u'tag': u'img',
                        u'attr': {
                            u'src': u'cover.jpg'
                        }
                    }
                ]
            }
        ]
    }
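The structure shown above is written as a Python dict. If a JSON string is what you need, one option is to pass that structure to the standard json module. This is a minimal sketch that assumes render() returns plain dicts, lists, and strings, as the output above suggests; the rendered variable is a hand-copied stand-in for dut.render(), not output captured from the library.

```python
import json

# Hand-copied stand-in for dut.render(); assumed to be plain dicts, lists, and strings.
rendered = {
    u'tag': u'div',
    u'child': [
        {u'tag': u'h1', u'text': u'title'},
        {u'tag': u'p', u'text': u'content'},
        {
            u'tag': u'figure',
            u'child': [
                {u'tag': u'img', u'attr': {u'src': u'cover.jpg'}},
            ],
        },
    ],
}

# Serialize to a JSON string; sort_keys keeps the key order stable between runs.
json_text = json.dumps(rendered, indent=2, sort_keys=True)

# Reading the string back yields an equal structure.
restored = json.loads(json_text)
```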

Astute readers will notice that there is an extra <div> tag in the output. What's up with that?

If the incoming HTML contains a single root element, the document is parsed and rendered directly to JSON. If the incoming HTML contains multiple elements at the root level, as in the preceding example, a root element is instantiated and the HTML is parsed into children of that element. By default, the root element is a <div>, although this can be overridden by passing the desired tag to the parse method as the parent argument:

dut = Element.parse(html, parent=u'article')
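To make the wrapping rule concrete, here is a rough, self-contained sketch built on Python 3's standard html.parser module. It is not how html2json is implemented; SketchParser, parse_sketch, and VOID_TAGS are hypothetical names invented for this illustration, and the sketch only mimics the single-root versus multiple-root behavior described above.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (illustrative subset of HTML void elements).
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}


class SketchParser(HTMLParser):
    """Hypothetical stand-in parser; not html2json's implementation."""

    def __init__(self):
        super().__init__()
        self.roots = []  # nodes found at the top level of the input
        self.stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        node = {'tag': tag}
        if attrs:
            node['attr'] = dict(attrs)
        if self.stack:
            # Nested element: attach it to the innermost open element.
            self.stack[-1].setdefault('child', []).append(node)
        else:
            self.roots.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]['tag'] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.stack[-1]['text'] = text


def parse_sketch(html, parent='div'):
    """Return the lone root node, or wrap several roots in a `parent` node."""
    parser = SketchParser()
    parser.feed(html)
    parser.close()
    if len(parser.roots) == 1:
        return parser.roots[0]
    return {'tag': parent, 'child': parser.roots}


single = parse_sketch('<p>content</p>')
wrapped = parse_sketch('<h1>title</h1><p>content</p>', parent='article')
```

In this sketch, single is the <p> node itself, while wrapped is an article node whose child list holds the two root-level nodes.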

If you really want the JSON for just the elements in the file, without the added root, you can easily recover it by accessing the root element's child attribute:

original_elements = dut.get(u'child')

which will return a list of parsed elements.
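Once you have that list, each entry can be walked like any nested structure. The sketch below uses a plain dict copied from the rendered output shown earlier as a stand-in for the parsed tree; iter_tags is a hypothetical helper written for this example, not part of html2json.

```python
# Stand-in for the rendered tree; copied by hand from the example output.
rendered = {
    u'tag': u'div',
    u'child': [
        {u'tag': u'h1', u'text': u'title'},
        {u'tag': u'p', u'text': u'content'},
        {
            u'tag': u'figure',
            u'child': [
                {u'tag': u'img', u'attr': {u'src': u'cover.jpg'}},
            ],
        },
    ],
}


def iter_tags(node):
    """Yield the tag of `node` and of every descendant, depth first."""
    yield node[u'tag']
    for child in node.get(u'child', []):
        for tag in iter_tags(child):
            yield tag


# Drop the wrapper and list the tags found under each original root element.
original_elements = rendered[u'child']
tags = [list(iter_tags(element)) for element in original_elements]
```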
