Documentation - RattleyCooper/selenext GitHub Wiki
This is where you can learn how to use the selenext micro-framework to scrape the web good, and do other things good too. It's important to read through the entire documentation to get a full idea of what the framework is capable of so far. Some of the cooler features are near the end of the documentation, like the Page object and the Requests WebReader.
Starting out
After installing the library, you'll need to generate a new project...
Generating a new project
Open up a Python console (or IDLE), do an import, and call a function:
>>> from selenext import make_project
>>> make_project('~/some/folder/Some Awesome Project')
This will create the project folder. Inside the project folder there will be a SiteAutomations folder for your Controllers, a Jobs folder for running custom Jobs, and some project files. In the examples below, we import from selenext's example SiteAutomations folder. When developing, you would write your Controllers in the SiteAutomations folder contained in your project.
The Project Files
.env
The .env file is used to hold static variables. It's used for things like setting up your database connection, or holding API keys:
# DB_TYPE values: sql, mysql, postgresql, berkeley
DB_TYPE=sql
DB=default.db
DB_HOST=localhost
DB_PORT=3306
DB_USERNAME=None
DB_PASSWORD=None
You can also store Python lists in the file, along with dicts:
# List
CUSTOMERS[]:
BOBLHEAD
WOOGLE
CUSTOMERS[END]
# Dict
PRICES{}:
BOOK=15.95
ORANGE=.75
PRICES{END}
Import selenext's env function to access these values:
>>> from selenext.Environment import env
>>> float(env('PRICES')['BOOK'])
15.95
Or, if you want to transform the data as you get it, give it a callback function:
>>> env('DB_PORT', func=int)
3306
This isn't meant to be used as a complex datastore. Its main purpose is setting up configurations for applications in a way that makes the data easily accessible.
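To make the file format above concrete, here is a minimal sketch of how such a file could be read into plain Python values. This is not selenext's actual parser. The names parse_env and get_value are hypothetical, and the handling of comments and blank lines is an assumption.

```python
def parse_env(text):
    """Parse the .env format shown above into a dict (illustrative only)."""
    values = {}
    container = None  # the list or dict currently being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # assumption: blank lines and comments are skipped
        if line.endswith('[END]') or line.endswith('{END}'):
            container = None
        elif line.endswith('[]:'):
            container = values[line[:-3]] = []
        elif line.endswith('{}:'):
            container = values[line[:-3]] = {}
        elif isinstance(container, list):
            container.append(line)
        else:
            key, _, value = line.partition('=')
            target = values if container is None else container
            target[key] = value
    return values


def get_value(values, key, func=None):
    """Look up a key and optionally transform it, similar to env(key, func=...)."""
    value = values[key]
    return func(value) if func else value


sample = """
DB_PORT=3306
CUSTOMERS[]:
BOBLHEAD
WOOGLE
CUSTOMERS[END]
PRICES{}:
BOOK=15.95
ORANGE=.75
PRICES{END}
"""
config = parse_env(sample)
```

Every value comes back as a string, which is why the examples above wrap the result in float or pass func=int.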
models.py
models.py is where you put your peewee database models if your automation requires database access.
migrations.py
Finally, running the migrations.py file will drop all the tables in your database (based on what is defined in models.py), then recreate them. It's used in the database design side of things when you're developing. All this file does is import your models, then pass them to selenext's migrate function, which takes care of the rest. If you need to write database seeders, you could add them to migrations.py. I usually create an additional Seed module in my project's Jobs module. I place all my table seeder jobs into the Seed module, then run each seeder as a Job using run_job('Seed.UserTable').
Single Threaded Automations
Here is an example of a single-threaded automation. If you want to learn about building multi-threaded automations, head to the ThreadedCommandFactory section. For single-threaded automations, we need to import all the appropriate pieces.
from time import sleep
# Database models used to interact with databases if needed.
import models
# The environment variable loader. These variables can be set in the .env file.
# This is important if we want to create configurable web automations / scrapers.
from selenext.Environment import env, env_driver
# Pull in controllers from selenext's example SiteAutomations.
from selenext.SiteAutomations.Examples import GoogleExample, BingExample
Controllers are kept in the SiteAutomations folder. In all of these examples, we are importing selenext's example controllers, which are located at selenext/SiteAutomations/Examples, so you can play with those, or write your own Controllers in the SiteAutomations folder in your project's folder.
# The quitting context helps to `close()` and `quit()` the WebDriver instance if
# something goes wrong.
from selenext.Helpers.Contexts import quitting
from selenium.webdriver.support.wait import WebDriverWait
To get a hold of our WebDriver instance, we need to use the env_driver and env functions. env('BROWSER') will return the name of the browser set in the .env file, and env_driver takes the name of the browser and returns the appropriate WebDriver class. The quitting function is used to open the WebDriver instance the same way you would open a file using with. When you add all of this together, you get:
# This could be written as:
#
# browser = env("BROWSER")
# web_driver = env_driver(browser)
# with quitting(web_driver()) as driver:
#     pass
with quitting(env_driver(env("BROWSER"))()) as driver:
    # Do stuff.
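If it helps to see what a context manager like quitting does, here is a minimal sketch built on contextlib. It is not selenext's implementation. quitting_sketch and FakeDriver are hypothetical names, and the only behavior assumed is the one described above: the driver is closed and quit even when the body raises.

```python
from contextlib import contextmanager


@contextmanager
def quitting_sketch(driver):
    """Yield the driver, then always close and quit it (illustrative)."""
    try:
        yield driver
    finally:
        driver.close()
        driver.quit()


class FakeDriver(object):
    """Hypothetical stand-in for a WebDriver that records the calls it gets."""

    def __init__(self):
        self.calls = []

    def close(self):
        self.calls.append('close')

    def quit(self):
        self.calls.append('quit')


fake = FakeDriver()
try:
    with quitting_sketch(fake) as driver:
        raise RuntimeError('something went wrong')
except RuntimeError:
    pass  # the error still propagates, but the driver was cleaned up first
```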
Now that we have a valid WebDriver instance, we can instantiate our Controllers and do some work.
    # Get an instance of `WebDriverWait`.
    wait = WebDriverWait(driver, 30)
    # Pass the web driver to the site automation along with anything
    # else it might need to do its job. This could include an
    # instance of `WebDriverWait`, and even the collection of
    # models imported above.
    google_search = GoogleExample.GoogleSearch(driver, wait, models)
    bing_search = BingExample.BingSearch(driver, wait, models)
    # Do stuff with your controllers.
    google_search.do_search('google wiki')
    sleep(5)
    bing_search.do_search('bing wiki')
    sleep(5)
Controllers
There are 2 types of controllers. The first type is really just a class that controls a WebDriver instance. The other is a class that inherits from IndependentController and controls a WebDriver instance. The only difference is that instances of IndependentController attach their WebDriver after they're instantiated, using the attach_driver method. This facilitates the use of the ThreadedCommandFactory and CommandFactory objects. A basic controller might look something like:
class Google(object):
    def __init__(self, driver, wait):
        self.driver = driver
        self.wait = wait

    def do_search(self, search_term):
        self.driver.get('https://google.com')
        # Type search
        search_input = self.driver.find_element_by_name('q')
        search_input.send_keys(search_term)
        # Click search button.
        search_button = self.driver.find_element_by_name('btnG')
        search_button.click()
        self.wait.until(lambda the_driver: the_driver.find_element_by_id('resultStats').is_displayed())
        return self
Or, if you wanted to create Command objects with the ThreadedCommandFactory, it might look like this:
# Assumption: IndependentController is importable from the same module as has_kwargs.
from selenext.Helpers.Controllers import IndependentController, has_kwargs


# Inherit from IndependentController to automatically get access to the `attach_driver` method.
class ThreadedGoogleSearch(IndependentController):
    def __init__(self, models):
        self.models = models

    # Using the @has_kwargs decorator allows keyword arguments to be
    # passed to the method. When you assemble a command pack for the
    # CommandManager, just include an instance of the Kwargs object.
    @has_kwargs
    def do_search(self, search_term, some_kwarg='some value'):
        print(some_kwarg)
        self.driver.get('https://google.com')
        # Type search
        search_input = self.driver.find_element_by_name('q')
        search_input.send_keys(search_term)
        # Click search button.
        search_button = self.driver.find_element_by_name('btnG')
        search_button.click()
        self.wait.until(lambda the_driver: the_driver.find_element_by_id('resultStats').is_displayed())
        return self
ThreadedCommandFactory, CommandFactory, and Command
ThreadedCommandFactory and CommandFactory are used to create Command objects, which are used to execute Controller methods. This facilitates the use of separate WebDrivers for each Controller (each controller gets its own browser). Both CommandFactory objects inherit from BaseCommandFactory, which sets up the dict-like functionality and the base methods that make up the factories. In order to use one of these factories, you must pass a dict of Controllers to the factory.
# Grab the models that the Controllers need. They aren't used here, it's just an example.
import models
# Grab the Example Controllers.
from selenext.SiteAutomations.Examples import GoogleExample, BingExample
# And lastly the CommandFactory
from selenext.Helpers.Commands import ThreadedCommandFactory
# Here we set up the dict of controllers.
controllers = {
    'google': GoogleExample.ThreadedGoogleSearch(models),
    'bing': BingExample.ThreadedBingSearch(models)
}
# Get the CommandFactory instance by passing it the Controllers.
cmd_factory = ThreadedCommandFactory(controllers, logging=False)
Once we have a CommandFactory, we can create the Command instance. The Command instance is used to execute the various commands (methods) your controllers have. This is done by creating a dict of tuples. Use the same keys you used in the dict of Controllers. Pass a function that takes a Controller as its first argument, and this new dict, to cmd_factory.create_command.
# Setting up the Command pack.
search_command = {
    'google': ('google wiki',),  # note how single arguments still need to be passed as a tuple
    'bing': ('bing wiki',)
}
# Here we pass an anonymous function as the first argument,
# and search_command as the second.
cmd = cmd_factory.create_command(
    lambda controller, *args: controller.do_search(*args),
    search_command
)
# Start the command!
cmd.start()
This will execute the do_search method on each controller, each in its own thread, meaning the whole command only takes about as long as the slowest method takes to finish executing.
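The threading idea can be sketched with nothing but the standard library. This is not how selenext builds its Command objects internally. run_in_threads and SlowSearch are hypothetical names, and the sleeps stand in for real browser work.

```python
import threading
import time


def run_in_threads(func, controllers, command_pack):
    """Call func(controller, *args) for each key, one thread per controller."""
    threads = [
        threading.Thread(target=func, args=(controllers[key],) + tuple(args))
        for key, args in command_pack.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every controller to finish


class SlowSearch(object):
    """Hypothetical controller whose search just sleeps for a while."""

    def __init__(self, delay):
        self.delay = delay
        self.searched = None

    def do_search(self, term):
        time.sleep(self.delay)
        self.searched = term


controllers = {'google': SlowSearch(0.3), 'bing': SlowSearch(0.3)}
pack = {'google': ('google wiki',), 'bing': ('bing wiki',)}

started = time.time()
run_in_threads(lambda c, *args: c.do_search(*args), controllers, pack)
elapsed = time.time() - started  # roughly one delay, not the sum of both
```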
Jobs
Jobs allow you to run pieces of code by calling run_job('JobName'). Jobs are kept inside the Jobs folder of your project, as individual Python files. So long as these files contain a start_job function, when the Job's filename (without the extension) is passed to run_job, it will execute the start_job function. What is actually happening is that run_job imports the name you pass to it, then extracts the start_job function and runs it for you. This is super useful for doing common database operations. Maybe you need to seed a database with some fake data, or maybe you need to clear some data out of the database every Sunday at midnight. If you're using PyCharm, it's super simple to open a Python console in your project folder, then run your jobs.
from selenext.Project.Jobs import run_job
run_job('ExampleJob')
Here is what a custom job might look like. Let's assume this file is called UserSeeder.py:
import models
from faker import Faker


def start_job():
    print('starting user table seeder!')
    fake = Faker()
    # Create 10 fake users in the User table (doesn't actually exist)
    for i in range(0, 10):
        user = {
            'name': fake.name(),
            'address': fake.address()
        }
        models.User.create(**user)
    print('finished seeding users table!')
To run this job, we call the run_job function with the Python file name:
run_job('UserSeeder')
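The import-then-call behavior described in this section can be sketched with importlib. This is only an illustration, not selenext's code. run_job_sketch is a hypothetical name, the package name 'Jobs' is an assumption based on the folder name, and the in-memory modules exist only so the example runs without any files on disk.

```python
import importlib
import sys
import types


def run_job_sketch(job_name, package='Jobs'):
    """Import <package>.<job_name> and call its start_job function."""
    module = importlib.import_module(package + '.' + job_name)
    return module.start_job()


# Register hypothetical in-memory modules so the example runs without files.
jobs_package = types.ModuleType('Jobs')
jobs_package.__path__ = []  # marks the module as a package
example_job = types.ModuleType('Jobs.ExampleJob')
example_job.start_job = lambda: 'example job ran'
sys.modules['Jobs'] = jobs_package
sys.modules['Jobs.ExampleJob'] = example_job

result = run_job_sketch('ExampleJob')
```

Because the name is joined with dots, a nested name such as 'Seed.UserTable' would resolve to a module inside a Seed package in the same way.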
Requests WebReader
WebReader allows you to use the find_element/find_elements methods to retrieve Requests.WebElement instances (this is a helper object within selenext, not Requests). These are different from a selenium.WebElement in the sense that interaction with elements is not supported, but you can traverse the DOM and read information from the elements in the same way you would with a selenium.WebElement instance. That means you can save resources by using the requests helpers, instead of firing up a selenium WebDriver, if all you are doing is scraping a non-interactive web page:
from selenext.Helpers.Requests import WebReader
driver = WebReader()
driver.get('http://docs.python-requests.org/en/master/')
section = driver.find_element_by_class_name('section')
print(section.text)
print(section.get_attribute('id'))
print(section.find_element_by_tag_name('p').text)
print([e.text for e in section.find_elements_by_xpath('//*[@class="section"]')])
human_click, human_fill and randomly_waits
These are all part of the selenext.Helpers.Controllers module. randomly_waits is a decorator that will make a method or function wait a random amount of time between 0.99 seconds and 3.01 seconds. You can use it on controller methods to enable more human-like behavior. human_click and human_fill are wrappers around selenium.WebElement's click and send_keys methods that do just this. Here is human_fill defined:
from selenext.Helpers.Controllers import randomly_waits


@randomly_waits
def human_fill(element, text):
    return element.send_keys(text)
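For a rough idea of what a decorator like randomly_waits could look like, here is a sketch. It is not the library's source. randomly_waits_sketch and human_fill_sketch are hypothetical names, and placing the wait before the wrapped call (rather than after it) is an assumption.

```python
import functools
import random
import time


def randomly_waits_sketch(func):
    """Sleep for a random 0.99 to 3.01 seconds, then call func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        time.sleep(random.uniform(0.99, 3.01))
        return func(*args, **kwargs)
    return wrapper


@randomly_waits_sketch
def human_fill_sketch(element, text):
    # Same shape as human_fill above: delegate to the element's send_keys.
    return element.send_keys(text)
```

functools.wraps keeps the wrapped function's name and docstring intact, which makes decorated controller methods easier to debug.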
Page object
A Page object is used to access parts of a webpage in an object-oriented way. You define the root url of the webpage in question, and how the driver should find the elements you would like to interact with. The Page object will fetch the elements as you need them.
Each Page requires a valid dict with the page's elements and states. In the examples, I use JSON, which would then be loaded into a dict with json.loads. Let's assume that JSON is in a file called free_search.json:
{
"root": "https://portland.craigslist.org/search/zip",
"elements": {
"search_input": {
"lookup_method": "id",
"selector": "query"
},
"search_button": {
"lookup_method": "xpath",
"selector": "//button[@class='searchbtn changed_input clickme']"
},
"item_titles": {
"multiple": true,
"bind": "str",
"lookup_method": "xpath",
"selector": "//li[@class='result-row']"
},
"range_from": {
"bind": ["decimal", "Decimal"],
"lookup_method": "class_name",
"selector": "rangeFrom"
},
"range_to": {
"bind": "int",
"lookup_method": "class_name",
"selector": "rangeTo"
}
}
}
You can see that each element is named and has a lookup_method and a selector. One of them even includes the multiple attribute, which allows multiple elements to be selected. If you want to grab a specific element out of the multiple elements in the list, you can set an index. Basically, any of the find_element_by_ methods can be called by setting the lookup_method (id, name, class_name, etc.) and the selector. Now that we have our page defined, let's interact with it! First we need to pull in some of the various pieces.
from time import sleep
from json import loads
from selenext.Environment import env, env_driver
from selenext.Helpers import load_page
from selenext.Helpers.Contexts import quitting
from selenext.Helpers.Controllers import human_fill, human_click
# Set up the WebDriver instance. I'm using Chrome in my .env file.
with quitting(env_driver(env('BROWSER'))()) as driver:
    # Instantiate the `Page` object and give it the path to the JSON file.
    free_search_page = load_page('free_search.json', driver)
Next, we check to see if we are on the right page. If the Page instance has its __bool__ method invoked, it will check to make sure that each one of the elements that have been defined for the page can be found by the driver.
    if not free_search_page:
        # An instance of `Page` also wraps a `WebDriver`'s `get` method.
        free_search_page.get(free_search_page.root)
Now we check to see if the search input exists. Since accessing an element directly from the Page instance will result in a selenium web element being returned, and an error being thrown if it is not found, the PageElement instances have an exists method that can be used to check if the elements exist on the current page or not. These PageElement instances can be checked directly through the view attribute on the Page instance.
    if free_search_page.view.search_input.exists():
        human_fill(free_search_page.search_input, 'lumber')
        # or free_search_page.search_input.send_keys('lumber')
        # if you didn't read about `human_click` or `human_fill`.
        # Wait until the button appears.
        while not free_search_page.view.search_button.exists():
            sleep(1)
        # Click the search button.
        human_click(free_search_page.search_button)
        # Look at the titles.
        for name in free_search_page.item_titles:
            print(name.text)
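To illustrate how the lookup_method, selector, multiple and index fields from free_search.json could map onto the find_element_by_ family, here is a small sketch. It is not how Page is implemented in selenext. find_from_definition and FakeDriver are hypothetical, and the fake driver only imitates the method names so the example runs without a browser.

```python
def find_from_definition(driver, definition):
    """Resolve an element definition to a find_element(s)_by_* call."""
    plural = 's' if definition.get('multiple') else ''
    method_name = 'find_element{}_by_{}'.format(plural, definition['lookup_method'])
    found = getattr(driver, method_name)(definition['selector'])
    if 'index' in definition:
        found = found[definition['index']]  # pick one element out of many
    return found


class FakeDriver(object):
    """Hypothetical driver exposing two Selenium-style lookup methods."""

    def find_element_by_id(self, selector):
        return 'element#' + selector

    def find_elements_by_xpath(self, selector):
        return ['row-0', 'row-1', 'row-2']


driver = FakeDriver()
search_input = find_from_definition(
    driver, {'lookup_method': 'id', 'selector': 'query'})
second_row = find_from_definition(
    driver, {'multiple': True, 'index': 1, 'lookup_method': 'xpath',
             'selector': "//li[@class='result-row']"})
```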
If an element exists within a frame, you can define that as well. The free_search.json example does not have any elements inside of a frame, but I will give an example of how this can be defined below.
{
"root": "https://somewebsitewithframes.com",
"elements": {
"some_input": {
"selector": "input",
"lookup_method": "id",
"frame": {
"selector": "results",
"lookup_method": "id",
"frame": {
"selector": "content",
"lookup_method": "name"
}
}
}
}
}
The Page defined above has an input element that exists in a frame. Not only that, the frame the element exists in is sitting in its own frame. So in order for the WebDriver to find the element, it must first find the content frame, then find the results frame in order to locate the input element.
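The outermost-first ordering described above can be worked out from the nested definition with a short loop. This is a sketch of the idea only. frame_chain is a hypothetical helper, not part of selenext.

```python
def frame_chain(definition):
    """Return frame definitions ordered from outermost to innermost."""
    chain = []
    frame = definition.get('frame')
    while frame is not None:
        chain.append(frame)
        frame = frame.get('frame')
    chain.reverse()  # the most deeply nested "frame" key is the outermost frame
    return chain


some_input = {
    'selector': 'input',
    'lookup_method': 'id',
    'frame': {
        'selector': 'results',
        'lookup_method': 'id',
        'frame': {'selector': 'content', 'lookup_method': 'name'},
    },
}
order = [frame['selector'] for frame in frame_chain(some_input)]
```

A driver would then switch into each frame in that order before looking up the element itself.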
You can even select elements using parent elements:
{
"elements": {
"search_input": {
"selector": "query",
"lookup_method": "id",
"parent": {
"selector": "div",
"lookup_method": "tag_name",
"parent": {
"selector": "body",
"lookup_method": "tag_name"
}
}
}
}
}
This is the equivalent of the following selenium script:
body = driver.find_element_by_tag_name('body')
div = body.find_element_by_tag_name('div')
query = div.find_element_by_id('query')
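The same chain of lookups can be derived from the nested parent keys with a little recursion. This sketch is illustrative and not taken from selenext. find_with_parents and FakeNode are hypothetical, and FakeNode just records the path of lookups so the example runs without a browser.

```python
def find_with_parents(root, definition):
    """Resolve parents first, then look the element up inside its parent."""
    scope = root
    if 'parent' in definition:
        scope = find_with_parents(root, definition['parent'])
    method = getattr(scope, 'find_element_by_' + definition['lookup_method'])
    return method(definition['selector'])


class FakeNode(object):
    """Hypothetical element that records the lookups used to reach it."""

    def __init__(self, path=()):
        self.path = list(path)

    def find_element_by_tag_name(self, name):
        return FakeNode(self.path + [name])

    def find_element_by_id(self, element_id):
        return FakeNode(self.path + ['#' + element_id])


search_input = {
    'selector': 'query',
    'lookup_method': 'id',
    'parent': {
        'selector': 'div',
        'lookup_method': 'tag_name',
        'parent': {'selector': 'body', 'lookup_method': 'tag_name'},
    },
}
found = find_with_parents(FakeNode(), search_input)
```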
PageStateContainer and PageState objects
PageStateContainer holds PageState instances and makes them accessible as attributes. The PageStateContainer is automatically added to Page objects as the state attribute if you have defined page states in the dict used to instantiate the Page object.
PageState objects can be used to check if the web page is in a pre-defined state. You can define this in the same dict used for the Page object. Here is a JSON example for Reddit. Everything works the same except now we add a dict of states.
{
"login_url": "https://reddit.com",
"elements": {
"login_link": {
"selector": "Log in or sign up",
"lookup_method": "link_text"
},
"username_input": {
"selector": "user_login",
"lookup_method": "id"
},
"password_input": {
"selector": "passwd_login",
"lookup_method": "id"
},
"remember_me": {
"selector": "rem_login",
"lookup_method": "id"
},
"login_button": {
"selector": "button",
"lookup_method": "tag_name",
"parent": {
"selector": "login-form",
"lookup_method": "id"
}
},
"logout_link": {
"selector": "logout",
"lookup_method": "link_text"
},
"redditor_time": {
"selector": "age",
"lookup_method": "class_name"
},
"moderator_list": {
"selector": "side-mod-list",
"lookup_method": "id"
},
"user_profile_link": {
"selector": "user",
"lookup_method": "class_name"
}
},
"states": {
"login_form_displayed": {
"exists": [
"username_input",
"password_input"
],
"displayed": [
"username_input",
"password_input"
]
},
"login_link_displayed": {
"exists": [
"login_link"
],
"displayed": [
"login_link"
]
},
"login_finished": {
"exists": [
"logout_link"
],
"displayed": [
"logout_link"
],
"absent": [
"username_input"
]
},
"on_profile": {
"exists": [
"redditor_time",
"moderator_list"
]
},
"logged_out": {
"exists": [
"login_link"
],
"displayed": [
"login_link"
]
}
}
}
Each state can contain any combination of the following named lists: "exists", "absent", "displayed", "not_displayed", "enabled" and "disabled". You can use the names of the elements you defined in these lists to define the state of a web page.
“exists” means the element appears in the DOM, “absent” means the element does not appear in the DOM. “displayed” means the element is in the DOM and actually displayed on the screen, and “not_displayed” means the element is in the DOM but no longer displayed on the screen. "enabled" means the element is enabled in the DOM and "disabled" is the opposite.
Using the JSON example above, we can write an automation like this:
if not page.state.login_link_displayed():
    page.get(page.login_url)
    page.state.login_link_displayed.wait()
page.login_link.click()
page.state.login_form_displayed.wait()
if page.state.login_form_displayed():
    page.username_input.send_keys('some_user')
    page.password_input.send_keys('some_password')
    page.login_button.click()
    page.state.login_finished.wait(timeout=10)
It's important to note that when checking if an element is displayed or enabled, the element must exist in the DOM or an error will be thrown.
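To show how the six named lists could be evaluated, here is a small sketch that checks a state against a simplified picture of the DOM. It is not selenext's PageState code. check_state is a hypothetical function, the DOM is modeled as a plain dict of flags, and raising LookupError for a missing element is only meant to mirror the note above.

```python
def check_state(state, elements):
    """Return True when every condition listed in the state holds.

    `elements` maps element names to dicts with 'displayed' and 'enabled'
    flags; a name missing from the mapping is treated as absent from the DOM.
    """
    def lookup(name):
        if name not in elements:
            raise LookupError(name + ' is not in the DOM')
        return elements[name]

    checks = {
        'exists': lambda name: name in elements,
        'absent': lambda name: name not in elements,
        'displayed': lambda name: lookup(name)['displayed'],
        'not_displayed': lambda name: not lookup(name)['displayed'],
        'enabled': lambda name: lookup(name)['enabled'],
        'disabled': lambda name: not lookup(name)['enabled'],
    }
    return all(
        checks[condition](name)
        for condition, names in state.items()
        for name in names
    )


login_finished = {
    'exists': ['logout_link'],
    'displayed': ['logout_link'],
    'absent': ['username_input'],
}
logged_in_dom = {'logout_link': {'displayed': True, 'enabled': True}}

finished = check_state(login_finished, logged_in_dom)
```

Listing an element under "exists" before "displayed", as the Reddit states do, lets the check fail cleanly instead of raising when the element is missing.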