Webrexp advanced - Twinside/Webrexp GitHub Wiki
You can apply webrexp to local files as if they were URLs: instead of typing a URI in the string, just type a filename:
webrexp '"localFile.html" >> title [.]'
You can even navigate across local files using the regular syntax.
When you search for tags, you might want only the first few while prototyping, or only some of them may really interest you. For this situation, you can use the indexing operator:
webrexp '"index.rss" >> item title #{0} [.]'
You can separate several indices with commas. It doesn't matter if an index is higher than the number of nodes available.
webrexp '"index.rss" >> item title #{1,2,3,4,12,18} [.]'
To save a little typing, you can use ranges, and mix them with plain indices if you want.
webrexp '"index.rss" >> item title #{1-4,12,18} [.]'
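The selection rule described above can be sketched in Python. This is only an illustration of the behaviour, not webrexp's implementation: the helper names `parse_index_spec` and `select_by_index` are hypothetical, and the sketch assumes that indices are 0-based (as the `#{0}` example suggests) and that ranges include both ends.

```python
def parse_index_spec(spec: str) -> set:
    """Parse a spec such as "1-4,12,18" into a set of indices.

    Assumption: a range "a-b" includes both a and b.
    """
    wanted = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            wanted.update(range(int(low), int(high) + 1))
        else:
            wanted.add(int(part))
    return wanted


def select_by_index(nodes: list, spec: str) -> list:
    """Keep the nodes whose position appears in the spec (assumed 0-based).

    Indices beyond the end of the list simply match nothing.
    """
    wanted = parse_index_spec(spec)
    return [node for position, node in enumerate(nodes) if position in wanted]


titles = ["t0", "t1", "t2", "t3", "t4", "t5"]
print(select_by_index(titles, "0"))          # ['t0']
print(select_by_index(titles, "1-4,12,18"))  # ['t1', 't2', 't3', 't4']
```

The indices 12 and 18 match nothing here because the list only has six entries, which mirrors the remark that out-of-range indices are harmless.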
Imagine the following document named local1.html
<html>
<head> <title>local1</title> </head>
<body>
<h1>Local 1 Title</h1>
<a href="local2.html">next</a>
</body>
</html>
And the document local2.html
<html>
<head> <title>local2</title> </head>
<body>
<h1>Local 2 title</h1>
<a href="local1.html">prev</a>
</body>
</html>
As you can see, they share a similar structure, each referencing the other through a simple link. Let's try a simple, naïve crawl to display all the titles of the webpages.
webrexp '"local1.html" >> (h1 [.]; a >>)*'
You should see an endless display of "Local 1 Title" and "Local 2 title". Nothing here stops the engine from crawling them ad infinitum.
To avoid this situation, which can appear unexpectedly in some complex webpages, we can use the cycle breaker operator : !
webrexp '"local1.html" >> (! h1 [.]; a >>)*'
In this example, "local1.html" is displayed once, as is "local2.html". Here is what happens when checking for uniqueness:
- When the ! operator is reached, it checks every node or string.
- If the current elements are strings, each one is checked against the recorded ones; if it is found, it is discarded. Each unrecorded string is then kept in memory.
- If the elements are nodes, it checks whether the source document of each node has already been visited. If it has, the node is discarded; otherwise the source document is recorded.
After these checks, the rest of the expression is evaluated. So you should place a cycle breaker inside a * repetition.
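The effect of the cycle breaker on the two-page example can be sketched in Python. This is a simplified model rather than webrexp's own code: `PAGES` is a hypothetical in-memory stand-in for local1.html and local2.html, `crawl` is an illustrative name, and the `seen` set plays the role of the memory kept by the ! operator.

```python
# Hypothetical stand-in for the two files: each page maps to
# (the text of its h1, the page its link points to).
PAGES = {
    "local1.html": ("Local 1 Title", "local2.html"),
    "local2.html": ("Local 2 title", "local1.html"),
}


def crawl(start: str, max_steps: int = 10) -> list:
    """Follow links from `start`, collecting each h1 once per document.

    A document that was already recorded is discarded, which ends the loop.
    `max_steps` is only a safety bound for this sketch.
    """
    seen = set()
    titles = []
    current = start
    for _ in range(max_steps):
        if current in seen:
            break  # already visited: the cycle breaker discards it
        seen.add(current)
        title, next_page = PAGES[current]
        titles.append(title)
        current = next_page
    return titles


print(crawl("local1.html"))  # ['Local 1 Title', 'Local 2 title']
```

Without the `seen` check, the loop would alternate between the two pages until `max_steps` ran out, which is the endless output described earlier.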
Sometimes you will be confronted with unstructured data and will want to exploit it as if it were structured. We're going to use an example, the file local3.html
<html>
<head> <title>Local 3 Title</title> </head>
<body>
<h1>Local 3 title</h1>
<div class="text">
You know, <b>local1.html</b> is a pretty good webpage.
</div>
</body>
</html>
We want to exploit the link written as plain text inside the bold tags. Right now, all we can do is dump its content as a string:
webrexp '"local3.html" >> b [.]'
To follow the link, we will introduce the inject operator : $
webrexp '"local3.html" >> b [$ .] >> h1 [.]'
When the inject operator is found, it replaces the currently analysed node or string with the current result of the action. It can be any complex action.
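As a rough illustration of the idea, the Python sketch below mimics the example: the text of the bold tag replaces the node, and the resulting string is then treated as a document to load. It is not how webrexp is implemented; `FILES`, `load` and `follow_bold_reference` are hypothetical names, and the files are kept in memory as well-formed markup so that the standard library parser can read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory copies of the two files used in this section.
FILES = {
    "local3.html": (
        "<html><head><title>Local 3 Title</title></head><body>"
        "<h1>Local 3 title</h1>"
        '<div class="text">You know, <b>local1.html</b> '
        "is a pretty good webpage.</div>"
        "</body></html>"
    ),
    "local1.html": (
        "<html><head><title>local1</title></head><body>"
        "<h1>Local 1 Title</h1>"
        '<a href="local2.html">next</a>'
        "</body></html>"
    ),
}


def load(name: str) -> ET.Element:
    """Parse one of the in-memory files and return its root element."""
    return ET.fromstring(FILES[name])


def follow_bold_reference(start: str) -> list:
    """Use the text of each <b> tag as the name of the next document."""
    titles = []
    for bold in load(start).iter("b"):
        injected = bold.text      # the action's result replaces the node
        target = load(injected)   # the string is then followed like a link
        titles.extend(h1.text for h1 in target.iter("h1"))
    return titles


print(follow_bold_reference("local3.html"))  # ['Local 1 Title']
```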
To extract the sentence "You know, local1.html is a pretty good webpage." from this webpage :
<html>
<head> <title>Local 3 Title</title> </head>
<body>
<h1>Local 3 title</h1>
<div class="text">
You know, <b>local1.html</b> is a pretty good webpage.
</div>
</body>
</html>
We can use the # operator. Its meaning is equivalent to . but it gathers the text of all the children tags and concatenates it. The whole webrexp is :
webrexp '"local3.html" >> div [ # ]'
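The difference between taking a node's own text and gathering the text of all its children can be illustrated with Python's standard library. This is an analogy only, assuming that . corresponds roughly to the node's direct text and # to the concatenation of every nested piece of text; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The div from local3.html, kept on one line to avoid stray whitespace.
div = ET.fromstring(
    '<div class="text">You know, <b>local1.html</b> is a pretty good webpage.</div>'
)

# Text directly inside the div, before its first child tag.
own_text = div.text

# Text of the div and of all nested tags, concatenated in document order.
all_text = "".join(div.itertext())

print(repr(own_text))  # 'You know, '
print(all_text)        # You know, local1.html is a pretty good webpage.
```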