Create a provider for Fishnet - cr0wbar/fishnet GitHub Wiki
Introduction
Creating a provider for Fishnet is certainly not rocket science or quantum physics, but you should at least have heard something about the following things:
- JSON (JavaScript Object Notation);
- HTML (just the basics, I ain't no Guru either);
- Query strings (again, just what they are and how they work is more than enough);
- XPath syntax (most important).
Starting from a template
Let's see what a sample provider looks like.
{
  "name":"Foo",
  "baseUrl": "http://foo.org",
  "pattern":"/search?category=[category]&search=[text]&page=[page]",
  "headers":{},
  "pageRules":
  {
    "start":1,
    "step":1,
    "maxItems":15
  },
  "categories":
  {
    "All":"0",
    "Images":"563"
  },
  "ops":
  {
    "titles":
    {
      "xpath":"//*[@class=\"mbuto\"]/id[2]/a",
      "container":"text"
    },
    "urls":
    {
      "xpath":"//*[@class=\"mbuto\"]/id[2]/a",
      "container":"href",
      "crawler":
      {
        "xpath":"//*[@id=\"lapo\"]/a[5]"
      }
    },
    "magnets":
    {
      "xpath":"//*[@class=\"magnetlink\"]/id[1]/a",
      "container":"href"
    }
  }
}
Starting from the top we have:
- name: the name of the provider, displayed in the user interface when selecting the provider (Mandatory);
- baseUrl: what should be added at the beginning of any href attribute value in order to obtain a complete URL (Mandatory);
- pattern: see below (Mandatory);
- headers: custom request headers (Optional);
- pageRules: see below (Mandatory);
- categories: the categories the search can be restricted to. See below (Mandatory);
- ops: the operations performed to extract the data. See below (Mandatory).
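The field list above can be turned into a quick pre-flight check before handing a provider to Fishnet. The following is only a minimal sketch: `MANDATORY_KEYS` and `missing_keys` are hypothetical names made up for this example, not part of Fishnet's engine, and the check only looks at top-level keys.

```python
import json

# Top-level keys the list above marks as Mandatory ("headers" is Optional).
# This tuple is an illustration, not something exported by Fishnet.
MANDATORY_KEYS = ("name", "baseUrl", "pattern", "pageRules", "categories", "ops")


def missing_keys(provider: dict) -> list:
    """Return the mandatory top-level keys that are absent from a provider."""
    return [key for key in MANDATORY_KEYS if key not in provider]


# A provider with "ops" left out on purpose.
sample = json.loads("""{
  "name": "Foo",
  "baseUrl": "http://foo.org",
  "pattern": "/search?category=[category]&search=[text]&page=[page]",
  "pageRules": {"start": 1, "step": 1, "maxItems": 15},
  "categories": {"All": "0", "Images": "563"}
}""")

print(missing_keys(sample))  # ['ops']
```

Loading the file with `json.loads` first is also a cheap way to catch syntax slips such as a missing quote or comma.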
Categories
The categories dictionary maps the value seen in the user interface to the value used in the query (see Understanding the pattern field). Let us assume that we want to search in the Images section of a torrent search site: usually the HTML code of the search page contains the mapping between the value as it appears in the combobox on the web page and the value sent in the query. Fishnet does the same but on a per-provider basis, so that it can display a human-readable value in the UI ("Images" in the example above) and do a search by category using the numeric code (563 in the example above).
In other words: the key in the dictionary becomes the value displayed in the UI and the corresponding value becomes the value in the query.
Understanding the pattern field
From the example above we have
"pattern":"/search?category=[category]&search=[text]&page=[page]"
Fishnet's engine replaces the fields in square brackets as follows:
- [category] is replaced by the value in the categories dictionary. If the Search All mode is used, the All category is used by default;
- [text] is the string in the text input field in the main pane (encoded properly);
- [page] is the page to retrieve, which is determined by how many pages we want to retrieve (this is configured in the settings pane) and by the pageRules dictionary (see dedicated section).
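The substitution described above can be sketched in a few lines of Python. This is not Fishnet's actual code: `build_url` is a hypothetical helper, using `quote_plus` for the "encoded properly" part is an assumption, and so is prepending baseUrl to the filled-in pattern.

```python
from urllib.parse import quote_plus


def build_url(provider: dict, category_label: str, text: str, page: int) -> str:
    """Fill the [category], [text] and [page] placeholders of a pattern."""
    # The UI label (the dictionary key) is translated to the value used in the query.
    category_value = str(provider["categories"][category_label])
    path = (
        provider["pattern"]
        .replace("[category]", category_value)
        .replace("[text]", quote_plus(text))  # assumed encoding scheme
        .replace("[page]", str(page))
    )
    # Assumption: the search path is appended to baseUrl.
    return provider["baseUrl"] + path


provider = {
    "baseUrl": "http://foo.org",
    "pattern": "/search?category=[category]&search=[text]&page=[page]",
    "categories": {"All": "0", "Images": "563"},
}

print(build_url(provider, "Images", "Search this", 1))
# http://foo.org/search?category=563&search=Search+this&page=1
```

Note how "Images" never reaches the site: only its mapped value 563 ends up in the query string.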
Rules for page retrieval
The pageRules dictionary should be filled as follows:
- start: the numerical value that replaces [page] in the query at the very first iteration of the engine (usually it's 1);
- step: how much the value of [page] is incremented at each iteration (usually it's 1);
- maxItems: the maximum number of items displayed per page by the torrent search site. It lets the engine tell whether all the available results have been fetched from the site, so that the search can end early.
All fields are mandatory. The total number of pages to retrieve can be defined in the Settings pane from the UI.
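To make the roles of start and step concrete, here is a small sketch of the values [page] would take over a few iterations. Both functions are hypothetical; in particular, `looks_like_last_page` shows only one plausible way maxItems could be used (a page with fewer items than the maximum is taken to be the last one), which is an assumption about the engine, not a description of it.

```python
def page_values(page_rules: dict, pages: int) -> list:
    """Values substituted for [page] over the requested number of pages."""
    start, step = page_rules["start"], page_rules["step"]
    return [start + i * step for i in range(pages)]


def looks_like_last_page(page_rules: dict, items_on_page: int) -> bool:
    """Assumed heuristic: a page shorter than maxItems has nothing after it."""
    return items_on_page < page_rules["maxItems"]


# The rules from the sample provider: pages are numbered 1, 2, 3, ...
print(page_values({"start": 1, "step": 1, "maxItems": 15}, 3))   # [1, 2, 3]

# A made-up site that paginates by result offset instead of page number.
print(page_values({"start": 0, "step": 15, "maxItems": 15}, 3))  # [0, 15, 30]

print(looks_like_last_page({"start": 1, "step": 1, "maxItems": 15}, 7))  # True
```

The second call shows why step exists at all: some sites expect an offset in the query rather than a page number.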
The 'ops' dictionary
This is where the magic happens. As in the example, there can be a dictionary for each of the following keys:
- titles (MANDATORY): the title of the torrent;
- sizes (OPTIONAL): the size of the torrent;
- categories (OPTIONAL): the category to which the torrent belongs;
- seeders (OPTIONAL): the number of seeders for the torrent;
- leechers (OPTIONAL): the number of leechers for the torrent;
- urls (OPTIONAL if magnets is defined, MANDATORY otherwise): the URL to the .torrent file;
- magnets (OPTIONAL if urls is defined, MANDATORY otherwise): the torrent's magnet link.
As pointed out above, at least one of urls and magnets must be defined, otherwise the engine's sanity check will fail.
For each of the previous keys, if present, we have to define a dictionary containing:
- xpath (MANDATORY): all the elements matching the given XPath are extracted;
- container (MANDATORY): this key can have three possible values for now: text (all the text from the XPath position will be extracted), an attribute name, 'href' or 'class' for example (the value of the attribute will be extracted), and 'raw', which is used for example when a function like text() is used;
- crawler (urls and magnets keys only) (OPTIONAL): the value of this key should be a dictionary. Some sites don't show the URLs or the magnet links directly on the results page. In these cases we can store the URL extracted by the XPath, which points to the torrent's web page. When it's time to copy the URL or the magnet link to the clipboard, if the crawler key is present, the web page is opened and the URL/magnet link is finally extracted using the second XPath (the one in the crawler dictionary), following the same procedure described previously (though only the first element returned by the XPath query will be returned, for obvious reasons).
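The difference between the text container and an attribute container is easier to see on a tiny page. The sketch below uses the standard library's ElementTree, which only understands a subset of XPath and needs well-formed markup, so it stands in for whatever parser Fishnet really uses. The `extract` function and the sample markup are made up for this example, and the 'raw' container is left out because ElementTree has no text() function.

```python
import xml.etree.ElementTree as ET


def extract(root, op: dict) -> list:
    """Apply one ops entry (xpath + container) to a parsed document."""
    results = []
    for element in root.findall(op["xpath"]):
        if op["container"] == "text":
            # All the text under the matched element, nested tags included.
            results.append("".join(element.itertext()).strip())
        else:
            # Any other container is treated as an attribute name.
            results.append(element.get(op["container"]))
    return results


# Made-up results page; real sites are rarely this tidy.
page = ET.fromstring(
    "<html><body>"
    '<div class="result"><a href="/t/1">Ubuntu <b>ISO</b></a></div>'
    '<div class="result"><a href="/t/2">Debian ISO</a></div>'
    "</body></html>"
)

# ElementTree wants relative paths (".//..."), unlike the "//..." in the sample provider.
ops = {
    "titles": {"xpath": ".//div[@class='result']/a", "container": "text"},
    "urls": {"xpath": ".//div[@class='result']/a", "container": "href"},
}

print(extract(page, ops["titles"]))  # ['Ubuntu ISO', 'Debian ISO']
print(extract(page, ops["urls"]))    # ['/t/1', '/t/2']
```

The same XPath is used for both keys, as in the sample provider: only the container decides whether the link text or the href value is kept.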
Sanity check
The engine also does a sanity check, which verifies that the number of elements fetched for each key in ops is the same and that at least the titles key and one of the urls and magnets keys are present; otherwise the sanity check fails.
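The two conditions of the sanity check can be written down as follows. This is a sketch of the rule as described here, not the engine's own implementation; `sanity_check` and the shape of its `results` argument (one list of extracted values per ops key) are assumptions.

```python
def sanity_check(results: dict) -> bool:
    """results maps each ops key to the list of values extracted for it."""
    # titles is always required.
    if "titles" not in results:
        return False
    # At least one of urls and magnets must be present.
    if "urls" not in results and "magnets" not in results:
        return False
    # Every key must have produced the same number of elements.
    lengths = {len(values) for values in results.values()}
    return len(lengths) == 1


good = {"titles": ["a", "b"], "magnets": ["magnet:?1", "magnet:?2"]}
uneven = {"titles": ["a", "b"], "urls": ["/t/1"]}
no_links = {"titles": ["a", "b"], "seeders": ["3", "9"]}

print(sanity_check(good))      # True
print(sanity_check(uneven))    # False
print(sanity_check(no_links))  # False
```

A mismatch in lengths usually means one of the XPaths matches extra elements (a header row, an ad) or misses some rows.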
Test your provider using test.py
In the repository there is a simple script (i.e. test.py) which can be run as follows
test.py provider.json "Search this" 1
where the integer at the end is the number of pages to retrieve. The script directly attaches to Engine.py, which contains Fishnet's engine. It is very useful for debugging and developing new providers (at least that's what I think).