Tutorial [EN]
First of all, create the configuration file according to this specification.
For example:
```yaml
dockerOS:
  name: ubuntu
  version: 14.04
crawlSystem:
  name: nutch
  version: 1.9
seeds:
  - https://eina.unizar.es/
  - http://www.unizar.es/
rounds: 2
extraction: round
infoCrawled: text
queueMode: byHost
# I don't need any optional configuration
```
This file is available here: [conf_tutorial.yml](https://github.com/Shathe/101CrawlersWeb/blob/master/conf_tutorial.yml). In order to use it, it must be at the same folder level as Butler, as in that project.
Let's launch the application by executing this [script](https://github.com/Shathe/101CrawlersWeb/blob/master/Shell.sh), which opens the shell:
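If you have not opened it before, a minimal sketch of getting the shell up would look like the following (this assumes Docker is already installed and that you clone the example project linked above; the chmod step may not be needed depending on how the repository was checked out):

```bash
# Minimal sketch (assumption): clone the example project and open the Butler shell.
git clone https://github.com/Shathe/101CrawlersWeb.git
cd 101CrawlersWeb
chmod +x Shell.sh   # make the script executable if it is not already
./Shell.sh          # opens the interactive Butler shell
```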
For this tutorial we are going to play the role of User 1 and create our first crawl.
1. Execute the configuration:
config --file conf_tutorial.yml --idProject 5
Expected output:
configurated successfully
In this step, the system validates our file. If there is any error in the file, we will be warned about what the error is and where it is. If everything is OK, the system creates the files required for the crawl.
2. Build the image:
build --idProject 5 --imageName nueva
Expected output:
Image built successfully
In this step, the system creates the Docker image (with the operating system specified by the user in the configuration file) as well as the crawling system.
3. Start the container:
start --containerName container --idProject 5 --imageName nueva
Expected output:
Container started
At this point, the system has created the Docker container, which is now running. However, the crawl is not running yet, only the operating system, so the next step is to run the crawl.
This command also works for restarting the container after it has been paused or stopped.
4. Run the crawl:
run --containerName container --idProject 5 --imageName nueva
Expected output:
Crawler started
Now the crawl is running and it will keep going for the number of rounds specified in the file, which in our case is 2. The only thing left to do is to get the crawled information.
We can extract this information whether or not the crawl has finished, although knowing whether it has finished is still useful.
5. How can we know whether the crawl has finished?
We can execute the info command; let's see:
info --containerName container --idProject 5 --imageName nueva
Expected output:
The crawler is running
or check the status of the container:
status --containerName container --idProject 5 --imageName nueva
Expected output:
Running
We see the crawler is now running, so let's wait a few minutes [...10 minutes later...]
Now we are going to execute another command to see if it has finished:
finished --containerName container --idProject 5 --imageName nueva
Possible outputs:
The crawler hasn't finised yet
Or
Yes, the crawler has finished
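If you prefer to poll for completion from outside the interactive shell, a loop like the one below could be used. This is only a sketch under the assumption that the Butler shell accepts commands piped through standard input, which is not something this tutorial guarantees; check Shell.sh before relying on it. The finished command and its output strings are the ones shown above.

```bash
#!/bin/bash
# Assumption: the Butler shell reads commands from stdin (not guaranteed by the project).
while true; do
    result=$(echo "finished --containerName container --idProject 5 --imageName nueva" | ./Shell.sh)
    if echo "$result" | grep -q "has finished"; then
        echo "Crawl finished"
        break
    fi
    sleep 60   # wait one minute before asking again
done
```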
If you want some information about the running crawler, you can see the most important details like this:
runningStatus --containerName container --idProject 5 --imageName nueva
Expected output:
Fetched links: 2, unfetched links: 170, rounds: 1/2
This means that your crawler has completed 1 out of 2 rounds and has fetched the content of 2 links; the next round will check whether any of the 170 unfetched links have valuable information.
6. Extracting the information:
First of all, we have to know that before searching we have to index the information. As we have set "extraction: round", the information is indexed in each round, so the index is kept up to date. Nevertheless, we can also do it ourselves:
index --containerName container --idProject 5 --imageName nueva
Expected output:
Indexed correctly
And now we can search for the information we want, like this:
search --query 'unizar universidad' --containerName container --idProject 5 --imageName nueva --top 5
Expected output:
58 total matching documents
http://paper.li/CatedrasUnizar/1361728396
https://twitter.com/unizar
https://twitter.com/EINAunizar
http://paper.li/OTRI_Unizar/1374046234
http://www.unizar.es/
Results shown
Results are ordered from highest to lowest importance.
We can even specify the number of results we want to see in the console. If we don't specify a maximum number of results, the system will return all of them.
The system will also create a file with the 58 total matching documents.
Useful commands:
- If we ever need to stop the crawl:
stopCrawl --containerName container --idProject 5 --imageName nueva
Expected output:
Crawl stopped correctly
This is useful if we want to stop consuming resources once we have extracted the information we need.
- If we ever need to stop the container:
stopContainer --containerName container --idProject 5 --imageName nueva
Expected output:
Container stopped correctly
Useful for not consuming resources anymore.
- If we want to pause the container because we don't want the crawl to keep going:
pauseContainer --containerName container --idProject 5 --imageName nueva
Expected output:
Container paused correctly
- If we want to delete the container:
deleteContainer --containerName container --idProject 5 --imageName nueva
Expected output:
Container deleted correctly
- If we want to delete the docker image:
deleteImage --idProject 5 --imageName nueva
Expected output:
Image deleted correctly
These commands are normally used once the crawl has finished and we have extracted the information.
If there's any error when we execute these commands, the system will warn us.
If you want to exit, just execute:
bye
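For quick reference, this is the whole session used in this tutorial, from configuration to cleanup. Every command is typed at the Butler shell prompt, with the same project id, image name and container name chosen above:

```
config --file conf_tutorial.yml --idProject 5
build --idProject 5 --imageName nueva
start --containerName container --idProject 5 --imageName nueva
run --containerName container --idProject 5 --imageName nueva
finished --containerName container --idProject 5 --imageName nueva
index --containerName container --idProject 5 --imageName nueva
search --query 'unizar universidad' --containerName container --idProject 5 --imageName nueva --top 5
stopCrawl --containerName container --idProject 5 --imageName nueva
stopContainer --containerName container --idProject 5 --imageName nueva
deleteContainer --containerName container --idProject 5 --imageName nueva
deleteImage --idProject 5 --imageName nueva
bye
```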