Tutorial [EN]
First of all, create the configuration file according to this specification.
For example:
```yaml
dockerOS:
  name: ubuntu
  version: 14.04
crawlSystem:
  name: nutch
  version: 1.9
seeds:
  - https://eina.unizar.es/
  - http://www.unizar.es/
rounds: 2
extraction: round
infoCrawled: text
queueMode: byHost
# I don't need any optional configuration
```
This file is available here: [conf_tutorial.yml](https://github.com/Shathe/101CrawlersWeb/blob/master/conf_tutorial.yml). In order to use it, it must be at the same folder level as Butler, as in that project.
Let's launch the application by executing this [script](https://github.com/Shathe/101CrawlersWeb/blob/master/Shell.sh), which opens the shell:
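If you have not opened it before, a minimal sketch of getting the shell up would look like the following (this assumes Docker is already installed and that you clone the example project linked above; the chmod step may not be needed depending on how the repository was checked out):

```bash
# Minimal sketch (assumption): clone the example project and open the Butler shell.
git clone https://github.com/Shathe/101CrawlersWeb.git
cd 101CrawlersWeb
chmod +x Shell.sh   # make the script executable if it is not already
./Shell.sh          # opens the interactive Butler shell
```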
For this tutorial we are going to play the role of User 1 and create our first crawl.
1. Execute the configuration:
config --file conf_tutorial.yml --idProject 5
Expected output:
configurated successfully
In this step, the system validates our file. If there is any error in the file, we will be warned about what the error is and where it is. If everything is OK, the system creates the files required for the crawl.
2. Build the image:
build --idProject 5 --imageName nueva
Expected output:
Image built successfully
In this step, the system creates the Docker image (with the operating system specified by the user in the configuration file) as well as the crawling system.
3. Start the container:
start --containerName container --idProject 5 --imageName nueva
Expected output:
Container started
At this point, the system has created the Docker container, which is now running. However, the crawl is not running yet, only the operating system, so the next step is to run the crawl.
This command also works for restarting the container after it has been paused or stopped.
4. Run the crawl:
run --containerName container --idProject 5 --imageName nueva
Expected output:
Crawler started
Now the crawl is running and it will keep going for the number of rounds specified in the file, which in our case is 2. The only thing left to do is to get the crawled information.
We can extract this information whether or not the crawl has finished, although knowing whether it has finished is still useful.
5. How can we know whether the crawl has finished?
We can execute the info command; let's see:
info --containerName container --idProject 5 --imageName nueva
Expected output:
The crawler is running
or check the status of the container:
status --containerName container --idProject 5 --imageName nueva
Expected output:
Running
We see the crawler is now running, so let's wait a few minutes [...10 minutes later...]
Now we are going to execute another command to see if it has finished:
finished --containerName container --idProject 5 --imageName nueva
Possible outputs:
The crawler hasn't finised yet
Or
Yes, the crawler has finished
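If you prefer to poll for completion from outside the interactive shell, a loop like the one below could be used. This is only a sketch under the assumption that the Butler shell accepts commands piped through standard input, which is not something this tutorial guarantees; check Shell.sh before relying on it. The finished command and its output strings are the ones shown above.

```bash
#!/bin/bash
# Assumption: the Butler shell reads commands from stdin (not guaranteed by the project).
while true; do
    result=$(echo "finished --containerName container --idProject 5 --imageName nueva" | ./Shell.sh)
    if echo "$result" | grep -q "has finished"; then
        echo "Crawl finished"
        break
    fi
    sleep 60   # wait one minute before asking again
done
```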
If you want some information about the running crawler, you can see the most important details like this:
runningStatus --containerName container --idProject 5 --imageName nueva
Expected output:
Fetched links: 2, unfetched links: 170, rounds: 1/2
This means that your crawler has completed 1 out of 2 rounds and has fetched the content of 2 links; the next round will check whether any of the 170 unfetched links have valuable information.
6. Extracting the information:
First of all, we have to know that before searching we have to index the information. As we have set "extraction: round", the information is indexed in each round, so the index is kept up to date. Nevertheless, we can also do it ourselves:
index --containerName container --idProject 5 --imageName nueva
Expected output:
Indexed correctly
And now we can search for the information we want, like this:
search --query 'unizar universidad' --containerName container --idProject 5 --imageName nueva --top 5
Expected output:
58 total matching documents
http://paper.li/CatedrasUnizar/1361728396
https://twitter.com/unizar
https://twitter.com/EINAunizar
http://paper.li/OTRI_Unizar/1374046234
http://www.unizar.es/
Results shown
Results are ordered from highest to lowest importance.
We can even specify the number of results we want to see in the console. If we don't specify a maximum number of results, the system will return all of them.
The system will also create a file with the 58 total matching documents.
Useful commands:
- If we ever need to stop the crawl:
stopCrawl --containerName container --idProject 5 --imageName nueva
Expected output:
Crawl stopped correctly
This is useful if we want to stop consuming resources once we have extracted the information we need.
- If we ever need to stop the container:
stopContainer --containerName container --idProject 5 --imageName nueva
Expected output:
Container stopped correctly
Useful for not consuming resources anymore.
- If we want to pause the container because we don't want the crawl to keep going:
pauseContainer --containerName container --idProject 5 --imageName nueva
Expected output:
Container paused correctly
- If we want to delete the container:
deleteContainer --containerName container --idProject 5 --imageName nueva
Expected output:
Container deleted correctly
- If we want to delete the docker image:
deleteImage --idProject 5 --imageName nueva
Expected output:
Image deleted correctly
These commands are normally used once the crawl has finished and we have extracted the information.
If there's any error when we execute these commands, the system will warn us.
If you want to exit, just execute:
bye
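For quick reference, this is the whole session used in this tutorial, from configuration to cleanup. Every command is typed at the Butler shell prompt, with the same project id, image name and container name chosen above:

```
config --file conf_tutorial.yml --idProject 5
build --idProject 5 --imageName nueva
start --containerName container --idProject 5 --imageName nueva
run --containerName container --idProject 5 --imageName nueva
finished --containerName container --idProject 5 --imageName nueva
index --containerName container --idProject 5 --imageName nueva
search --query 'unizar universidad' --containerName container --idProject 5 --imageName nueva --top 5
stopCrawl --containerName container --idProject 5 --imageName nueva
stopContainer --containerName container --idProject 5 --imageName nueva
deleteContainer --containerName container --idProject 5 --imageName nueva
deleteImage --idProject 5 --imageName nueva
bye
```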