HTML - mostafa-karimi/Web-Scraping GitHub Wiki

Source: Automated Data Collection with R

  • HTML: HyperText Markup Language

  • HTML itself is not a programming language. HTML is a markup language that describes content and defines its presentation.

  • Although each revision of HTML has established new features and restructured old ones, the basic grammar of HTML documents has not changed much over the years.

1. Browser presentation and source code

  • An HTML file is basically nothing but plain text.

  • What makes HTML so powerful is its marked-up structure.

  • The markup definitions rely on predefined character sequences — the tags — that enclose parts of the text.

  • What you see in your browser is therefore not the HTML document itself but an interpretation of it.

  • To identify which parts of the source code correspond to which elements in the browser window and vice versa, we can use an element inspector, which is implemented in most browsers.

2. Syntax rules

  • Plain text is turned into an HTML document by tags that can be interpreted by a browser.

  • The combination of start tag, content, and end tag is called an element.

  • Attributes enable the specification of options for how the content of a tag should be handled. Which attributes are permitted depends on the specific tag. Attributes are always placed within the start tag right after the tag name.

  • The <html> element is the root element. It splits into two branches, <head> and <body>; <head> in turn contains a further branch, <title>.

  • Comments are marked by <!-- at the beginning and --> at the end.

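  • Characters that have a special meaning in HTML, such as < and >, are written as entities (for example &lt; and &gt;). There is an extensive list of entities, all starting with an ampersand (&) and ending with a semicolon (;).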
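
The syntax rules above can be made concrete with a small sketch. It is written in Python with the standard-library html.parser module rather than in R, purely for illustration; the sample document and the SyntaxLogger class are hypothetical and not taken from the book. The sketch records start tags with their attributes, end tags, comments, and text in which entities have been decoded.

```python
from html.parser import HTMLParser

# Hypothetical sample document: root <html>, branches <head> and <body>,
# one comment, one attribute, and three entities.
SAMPLE = """<!DOCTYPE html>
<html>
  <head><title>First HTML</title></head>
  <body>
    <!-- a comment the browser does not display -->
    <p class="intro">Fish &amp; chips cost &lt; 10 &euro;</p>
  </body>
</html>"""


class SyntaxLogger(HTMLParser):
    """Collects the syntactic pieces of an HTML document as they are parsed."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default: entities are decoded
        self.start_tags = []  # (tag name, attribute dict) per start tag
        self.end_tags = []    # tag name per end tag
        self.comments = []    # content between <!-- and -->
        self.text = []        # non-blank text content

    def handle_starttag(self, tag, attrs):
        # Attributes arrive as (name, value) pairs taken from the start tag.
        self.start_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_data(self, data):
        # Skip the whitespace that only serves to indent the source.
        if data.strip():
            self.text.append(data.strip())


logger = SyntaxLogger()
logger.feed(SAMPLE)
logger.close()

print(logger.start_tags)
print(logger.comments)
print(logger.text)
```

Note that the entities &amp;, &lt;, and &euro; reach the program as the characters they stand for, while the comment is reported separately from the displayed text.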

3. Tags and attributes

  • The anchor tag <a> is what turns HTML from just a markup language into a hypertext markup language by enabling HTML documents to link to other documents.

  • The <meta> tag is an empty tag written in the head element of an HTML document.

  • In general, two attributes are specified in a meta element. The first attribute can be either name or http-equiv; the second is always content.

  • The <link> tag is used to link to and include information and external files.

  • The <link> element is empty and used within the <head> element.

  • Tags like <b>, <i>, <strong> are layout tags that refer to bold, italics, and strong emphasis.

  • The <p> tag labels its content as being a paragraph and ensures that line breaks are inserted before and after its content.

  • In order to define different levels of headlines — level 1 to level 6 — HTML provides a series of tags <h1>, <h2>,…down to <h6>.

  • Several tags exist to list content. Which one is used depends on whether the list is an ordered list <ol>, an unordered list <ul>, or a description list <dl>. The first two use nested <li> elements to define list items, while the last needs two further elements: <dt> for the keyword and <dd> for its description.

  • While <div> and <span> themselves do not change the appearance of the content they enclose, these tags are used to group parts of the document — the former is used to define groups across lines, tags, and paragraphs, while the latter is used for in-line grouping.

  • CSS: Cascading Style Sheets

  • Style definitions are commonly stored in separate *.css files and are later included via <link> tags in the header.

  • The purpose of CSS is to separate content from layout to improve the document’s accessibility.

  • Forms are introduced by the <form> tag and supported by other tags like <fieldset>, <input>, <textarea>, <select>, and <option> and their respective attributes.

  • Query strings appear after the path part of the URL and start with ?. Only a fragment identifier (starting with #) can follow them.

  • The <script> element is a container for scripts that enable HTML to include functionality from other programming languages. This other language will frequently be JavaScript.

  • JavaScript allows the browser to change the content and structure of the document after it has been loaded from the server, enabling user interaction and event handling.

  • JavaScript can appear broadly in three forms: explicitly as code inside a <script> element, implicitly by referring to an external JavaScript file within a <script> element, and implicitly as an event handler attribute in an HTML element.

  • To begin a table we make use of <table>. We start new rows with <tr>. Within <tr>, we can either use <td> for defining cells or <th> for header cells.
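
As an illustration of how some of these tags are used when collecting data, the following sketch pulls the link target out of an <a> element, splits off its query string, and gathers the cells of a table row by row. It uses Python's standard library (html.parser and urllib.parse) instead of R; the sample markup, the example.com URL, and the LinkAndTableParser class are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

# Hypothetical fragment: one link with a query string and one small table.
# Inside an attribute value the ampersand is written as the entity &amp;.
SAMPLE = """
<a href="http://www.example.com/search?q=html&amp;page=2">Search</a>
<table>
  <tr><th>Tag</th><th>Purpose</th></tr>
  <tr><td>a</td><td>hyperlink</td></tr>
  <tr><td>p</td><td>paragraph</td></tr>
</table>
"""


class LinkAndTableParser(HTMLParser):
    """Collects href values of <a> tags and the cell text of table rows."""

    def __init__(self):
        super().__init__()
        self.links = []    # one href per <a> tag that has one
        self.rows = []     # one list of cell strings per <tr>
        self._cell = None  # text pieces of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        # Assumes every cell sits inside a <tr>, as in the sample above.
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


parser = LinkAndTableParser()
parser.feed(SAMPLE)
parser.close()

url = urlsplit(parser.links[0])
query = parse_qs(url.query)  # maps each parameter name to a list of values

print(url.query)
print(query)
print(parser.rows)
```

The header row and the data rows end up in the same list here; a real extraction would probably treat <th> cells as column names.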

4. Parsing

  • HTML is parsed both by the browser, to display HTML content nicely, and by parsers in R, to construct useful representations of HTML documents in our programming environment.

  • Reading functions differ from parsing functions in that the former do not care to understand the formal grammar that underlies HTML but merely recognize the sequence of symbols included in the HTML file.

  • readLines() maps every line of the input file to a separate value in a character vector.

  • To achieve a useful representation of HTML files, we need to employ a program that understands the special meaning of the markup structures and reconstructs the implied hierarchy of an HTML file within some R-specific data structure. This representation is also referred to as the Document Object Model (DOM).

  • The DOM is a queryable data object that we can build from any HTML file and is useful for further processing of document parts.

  • Parsers belong to a general class of domain-specific programs that traverse symbol sequences and reconstruct the semantic structure of the document within a data object of the programming environment.

  • Package XML (xml2)

  • DOM-style parsers first parse the entire target document and create the DOM in a tree-like data structure of the C language. In this data structure every element that occurs in the HTML is represented as its own entity, or as an individual node. All nodes taken together are referred to as the node set.
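
The difference between reading and parsing can be sketched as follows. This is a Python analogy, not the R workflow from the book: splitlines() stands in for readLines(), and the standard-library xml.etree.ElementTree module stands in for a DOM-style parser. ElementTree is an XML parser, so the sketch only works because the hypothetical sample below is well-formed; real-world HTML usually needs a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
SOURCE = """<html>
  <head><title>First HTML</title></head>
  <body><p>I am <b>bold</b> text.</p></body>
</html>"""

# Reading: one string per line, with no understanding of the markup.
lines = SOURCE.splitlines()

# Parsing: the whole document is turned into a tree of element nodes.
root = ET.fromstring(SOURCE)

# The tree can be queried: direct children of the root ...
children = [child.tag for child in root]

# ... a specific element reached by its path ...
title = root.find("head/title").text

# ... or every element node in document order. ElementTree keeps text as a
# property of elements rather than as separate nodes, so only tags appear.
node_tags = [node.tag for node in root.iter()]

print(len(lines), "lines read")
print(root.tag, children)
print(title)
print(node_tags)
```

The list of lines knows nothing about which tag encloses which, whereas the parsed tree exposes the hierarchy directly.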