Webrexp tutorial - Twinside/Webrexp GitHub Wiki
Tutorial
A webrexp describe a path within a graph (by extension tree) followed by the engine. At each moment, the engine manipulate a list of graph node (thinks tags if you prefer) or a list of string.
The documentation is still incomplete as some other functionalities hasn't been tested yet.
Walking in a web page
A webrexp basically perform searches in the dom, a bit like jQuery selectors. For example if you find the following element in your webrexp :
div a img
It means to find from the current nodes (it can be many), a div
element, then
from all the found div element, find all a
, and finally within all the a
elements
found, find all the images. This way you can walk around easily. Moreover you can
refine easily some element
div.some-class
Will find some div element with the attribute class
equal to "some-class". You
can also find by "name"
div#someId
which will find div element with the attribute id
equal to someId. And you can
even combine elements :
div.some-class#someId
Accessing a web page
The first thing to do is to access a web page, the first idea is to use a string, with an URI in it, and follow it
"http://somewebsiteIWant.com" >>
The >>
operator is the "Dereference" operator, it will try to follow any link
or graph element. If you give it a string, it will try to download a string and
parse it. If you give it a tag, it will try to find an href
attribute and follow
it.
Some pattern is to find some links and follow them
div.nav-next a >>
For example, the previous request will search a div element of class nav-next
, all
the links within (hopefully just one :)) and follow the link. Now the analysed node
Filtering content
You might want to further refine your currently selected node, and checking some attributes of it, for this you can use action.
div img [alt != ""]
the actions are between []
, everything that's between []
can exprimate some
comparison. You can directly query attributes and basic expression syntax is
provided. You can chain comparison with ;
div img [alt = "some alt text"; width /= ""]
Dumping content
Dumping content is a simple action called .
img [.]
As all actions, it's put into []
, you can still filter before dumping :
img [alt /= ""; .]
If the selected nodes have an src
attribute, it's what is dumped, otherwise
some text approximation of the current node is found and returned. If an action
return an element which is not a boolean, then it's displayed on screen.
Document dumping
If you want to dump the whole content of the page, you can use the ->
operator
instead of >>
, this way, the document won't be parsed and put directly on
the hard drive.
"http://www.google.com" ->
This expression will dump the HTML content of the first page of google.
Branching
You might want to find/dump several elements in some webpage, placed at different place in the document, for that you can use branches
"http://somesite.com" >> (head title [.]; img [.])
Here two elements are dumped, first the title of the document, then all the
images of the document. When branches are encountered, the current state is kept
and is reset after all ;
. So here head
will be searched in the html
node,
and img
to. Each branch is independent, only the last one is kept for further
processing.
If one of the branch fail, the following branches are not executed.
Repeating
You might want to repeat some path into your documents often, if you want to dump data spread across many web pages for instance. Let's write an expression which dump all title of the pages linked to the current one recursively.
"http://somesite.com" >> (head title [.]; a >>)*
We use a branch to dump the title of the page, then we find
all links of the current page and follow them. After that the
*
repeat the expression while there is no error, or if there
is no node left. The stared expression is always valid, even
if no full execution occur.
More
You can see some tips to learn more about webrexp without learning about more features, ore go to advanced part.