Create a provider for Fishnet - cr0wbar/fishnet GitHub Wiki

Introduction

Creating a provider for Fishnet is certainly not rocket science or quantum physics, but you should at least have heard something about the following things:

JSON (JavaScript Object Notation);
HTML (just the basics, I ain't no Guru either);
Query strings (again, just what they are and how they work is more than enough);
XPath syntax (most important).

Starting from a template

Let's see what a sample provider looks like.

{
    "name":"Foo",
    "baseUrl": "http://foo.org",
    "pattern":"/search?category=[category]&search=[text]&page=[page]",
    "headers":{},
    "pageRules":
    {
	"start":1,
	"step":1,
	"maxItems":15
    },
    "categories":
    {
	"All":"0",
	"Images":"563"
    },
    "ops":
    {
	"titles":
	{
	    "xpath":"//*[@class=\"mbuto\"]/id[2]/a",
	    "container":"text"
	},
	"urls":
	{
	    "xpath":"//*[@class=\"mbuto\"]/id[2]/a",
	    "container":"href",
	    "crawler":
	    {
		"xpath":"//*[@id=\"lapo\"]/a[5]"
	    }
	},
	"magnets":
	{
	    "xpath":"//*[@class=\"magnetlink\"]/id[1]/a",
	    "container":"href"
	}
    }
}

Starting from the top we have

name : the name of the provider (it will be displayed in the user interface when selecting the provider) (Mandatory);
baseUrl : the prefix added to the beginning of any href attribute value in order to obtain a complete URL (Mandatory);
pattern : see below (Mandatory);
headers : custom request headers (Optional);
pageRules: see below (Mandatory);
categories : the categories the search can be restricted to. See below (Mandatory);
ops : the operations performed to extract the data. See below (Mandatory).
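A quick way to catch a forgotten field before loading a provider is to check the mandatory keys yourself. This is only a minimal sketch: `MANDATORY_KEYS` and `missing_keys` are names I made up for illustration, not part of Fishnet's engine.

```python
import json

# Top-level keys marked as Mandatory in the list above ("headers" is optional).
MANDATORY_KEYS = ("name", "baseUrl", "pattern", "pageRules", "categories", "ops")


def missing_keys(provider):
    """Return the mandatory top-level keys that the provider dict lacks."""
    return [key for key in MANDATORY_KEYS if key not in provider]


# A deliberately incomplete provider, just to show the check at work.
incomplete = json.loads('{"name": "Foo", "baseUrl": "http://foo.org"}')
print(missing_keys(incomplete))
```

Since `json.loads` also rejects malformed JSON (a missing quote or comma, for example), the same couple of lines double as a syntax check.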

Understanding the `pattern` field

From the example above we have

"pattern":"/search?category=[category]&search=[text]&page=[page]"

Fishnet's engine replaces the fields in square brackets as follows:

[category] is replaced by the value of the selected category in the categories dictionary. If the Search All mode is used, the All entry is used by default;
[text] is the string in the text input field in the main pane (encoded properly);
[page] is the page to retrieve, which is determined by how many pages we want to retrieve (this is configured in the settings pane), and the pageRules dictionary (see dedicated section).
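The substitution can be sketched in a few lines of Python. This is an illustration, not the engine's actual code: `build_query` is a hypothetical helper, and I'm assuming that "encoded properly" means the usual query-string encoding provided by `urllib.parse.quote_plus`.

```python
from urllib.parse import quote_plus


def build_query(pattern, category, text, page):
    """Fill the [category], [text] and [page] placeholders of a pattern."""
    return (
        pattern
        .replace("[category]", category)
        .replace("[text]", quote_plus(text))  # assumed encoding of the search text
        .replace("[page]", str(page))
    )


pattern = "/search?category=[category]&search=[text]&page=[page]"
categories = {"All": "0", "Images": "563"}

# Search "Search this" in the Images category, first page.
query = build_query(pattern, categories["Images"], "Search this", 1)
print("http://foo.org" + query)
# prints http://foo.org/search?category=563&search=Search+this&page=1
```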

Rules for page retrieval

The pageRules dictionary should be filled as follows:

start : is the numerical value that is going to replace [page] in the query at the very first iteration of the engine (usually it's 1).
step : sets how much the value of [page] is incremented at each iteration (usually it's 1).
maxItems : is the maximum number of items displayed per page by the torrent search site. It lets the engine work out whether all the available results have already been fetched, so that the search can end early.

All fields are mandatory. The total number of pages to retrieve can be defined in the Settings pane from the UI.
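To make the arithmetic concrete, here is a small sketch of how start and step turn into the successive values of [page]. Both function names are hypothetical, and the early-stop test (a page with fewer than maxItems results is the last one) is my assumption about how maxItems is used, not something taken from Engine.py.

```python
def page_values(start, step, pages):
    """Values substituted for [page], one per page to retrieve."""
    return [start + i * step for i in range(pages)]


def is_last_page(items_on_page, max_items):
    """Assumed stop rule: a page that is not full means no more results."""
    return items_on_page < max_items


# A site that numbers its pages 1, 2, 3, ...
print(page_values(start=1, step=1, pages=3))   # [1, 2, 3]

# A site that paginates by result offset (0, 15, 30, ...) with 15 items per page.
print(page_values(start=0, step=15, pages=3))  # [0, 15, 30]
```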

The 'ops' dictionary

This is where the magic happens. There can be, as in the example, a dictionary for the following keys

titles (MANDATORY): the title of the torrent;
sizes (OPTIONAL): the size of the torrent;
categories (OPTIONAL): the category to which the torrent belongs;
seeders (OPTIONAL): the number of seeders for the torrent;
leechers (OPTIONAL): the number of leechers for the torrent;
urls (OPTIONAL if magnets is defined, MANDATORY otherwise): the URL to the .torrent file;
magnets (OPTIONAL if urls is defined, MANDATORY otherwise): the torrent's magnet link.

As pointed out above, at least one of urls and magnets must be defined, otherwise the engine's sanity check will fail.

For each of the previous keys, if present, we have to define a dictionary containing:

xpath (MANDATORY): all the elements matching the given XPath are extracted;
container (MANDATORY): for now this key accepts three kinds of value: text (all the text at the XPath position is extracted), an attribute name such as 'href' or 'class' (the value of that attribute is extracted), and 'raw', which is used for example when the XPath ends with a function like text();
crawler (urls and magnets keys only) (OPTIONAL): the value of this key should be a dictionary. Some sites don't show the URLs or the magnet links directly on the results page. In these cases the first XPath can extract the URL of the torrent's own web page instead. When it's time to copy the URL or the magnet link to the clipboard, if the crawler key is present, that web page is opened and the URL/magnet link is extracted using the second XPath (the one in the crawler dictionary), following the same procedure described previously. Only the first element matched by the second XPath is returned, since a torrent page yields a single link.
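The xpath/container pair can be tried out on a small snippet before pointing it at a real site. The sketch below uses Python's standard `xml.etree.ElementTree`, which understands only a subset of XPath and needs well-formed markup, so take it as an approximation: the `extract` helper and the sample page are made up, the 'raw' container is left out, and nothing here is taken from Engine.py.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a results page (hypothetical markup).
PAGE = """
<html><body>
  <div class="row"><a href="/t/1">First <b>item</b></a></div>
  <div class="row"><a href="/t/2">Second item</a></div>
</body></html>
""".strip()


def extract(root, xpath, container):
    """Apply one ops entry: 'text' collects all text, anything else is an attribute name."""
    values = []
    for element in root.findall(xpath):
        if container == "text":
            values.append("".join(element.itertext()).strip())
        else:
            values.append(element.get(container))
    return values


root = ET.fromstring(PAGE)
titles = extract(root, ".//*[@class='row']/a", "text")
urls = extract(root, ".//*[@class='row']/a", "href")
print(titles)  # ['First item', 'Second item']
print(urls)    # ['/t/1', '/t/2']
```

The extracted href values are relative, which is exactly the case where baseUrl has to be prepended to get a complete URL.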

Sanity check

The engine also does a sanity check, which verifies that the number of elements fetched for each key in ops is the same and that at least titles and one of urls and magnets are present. If any of these conditions is not met, the sanity check fails.
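In Python terms, the check described above could look roughly like this. It's a sketch under assumptions: `sanity_check` is a hypothetical name, and I'm assuming the extracted data is held as a dictionary mapping each ops key to the list of values fetched for it.

```python
def sanity_check(results):
    """Return True if the extracted data satisfies the rules described above."""
    # titles must always be present.
    if "titles" not in results:
        return False
    # At least one of urls and magnets must be present.
    if "urls" not in results and "magnets" not in results:
        return False
    # Every key must have yielded the same number of elements.
    lengths = {len(values) for values in results.values()}
    return len(lengths) == 1


good = {"titles": ["a", "b"], "magnets": ["magnet:?1", "magnet:?2"]}
bad = {"titles": ["a", "b"], "urls": ["/t/1"]}  # one title has no matching URL
print(sanity_check(good), sanity_check(bad))  # True False
```

A length mismatch like the one in `bad` usually means that one of the XPath expressions matches extra (or too few) elements on the results page.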

Test your provider using `test.py`

In the repository there is a simple script, test.py, which can be run as follows:

test.py provider.json "Search this" 1

where the integer at the end is the number of pages to retrieve. The script hooks directly into Engine.py, which contains Fishnet's engine. It is very useful for debugging and developing new providers (at least that's what I think).