Documentation - RattleyCooper/selenext GitHub Wiki

This is where you can learn how to use the selenext micro-framework to scrape the web good, and do other things good too. It's important to read through the entire documentation to get a full idea as to what the framework is capable of so far. Some of the cooler features are near the end of the documentation, like the Page object and the Requests WebReader.

Starting out

After installing the library, you'll need to generate a new project...

Generating a new project

Open up a Python console (or IDLE), import make_project, and call it:

>>> from selenext import make_project
>>> make_project('~/some/folder/Some Awesome Project')

This will create the project folder. Inside the project folder there will be a SiteAutomations folder for your Controllers, a Jobs folder for running custom Jobs, and some project files. In the examples below, we import from selenext's example SiteAutomations folder. When developing, you would write your Controllers in the SiteAutomations folder contained in your project.

The Project Files

`.env`

The .env file is used to hold static variables. It's used for things like setting up your database connection, or holding API keys:

# DB_TYPE values: sql, mysql, postgresql, berkeley
DB_TYPE=sql
DB=default.db
DB_HOST=localhost
DB_PORT=3306
DB_USERNAME=None
DB_PASSWORD=None

You can also store python lists in the file, along with dicts:

# List
CUSTOMERS[]:
BOBLHEAD
WOOGLE
CUSTOMERS[END]

# Dict
PRICES{}:
BOOK=15.95
ORANGE=.75
PRICES{END}

Import selenext's env function to access these values:

>>> from selenext.Environment import env

>>> float(env('PRICES')['BOOK'])
15.95

Or, if you want to transform the data as you get it, give it a callback function:

>>> env('DB_PORT', func=int)
3306

This isn't meant to be used as a complex datastore. Its main purpose is to hold application configuration in a way that makes the data easily accessible.
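To make the file format above concrete, here is a minimal sketch of how such a file could be parsed. It is not selenext's implementation: parse_env_text and env_sketch are hypothetical names, and the sketch assumes every value is kept as a string until a callback such as int or float converts it.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines plus the NAME[]: and NAME{}: blocks shown above."""
    values = {}
    block_name = None   # name of the list/dict block being read, if any
    block_end = None    # marker line that closes the current block
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        if block_name is not None:
            if line == block_end:
                block_name = block_end = None
            elif isinstance(values[block_name], list):
                values[block_name].append(line)
            else:
                key, _, value = line.partition('=')
                values[block_name][key] = value
        elif line.endswith('[]:'):
            block_name = line[:-3]
            block_end = block_name + '[END]'
            values[block_name] = []
        elif line.endswith('{}:'):
            block_name = line[:-3]
            block_end = block_name + '{END}'
            values[block_name] = {}
        else:
            key, _, value = line.partition('=')
            values[key] = value
    return values


def env_sketch(values, key, func=None):
    """Look up a key and optionally transform it with a callback (assumed behavior)."""
    value = values[key]
    return func(value) if func is not None else value


sample = """
DB_PORT=3306
CUSTOMERS[]:
BOBLHEAD
WOOGLE
CUSTOMERS[END]
PRICES{}:
BOOK=15.95
ORANGE=.75
PRICES{END}
"""
config = parse_env_text(sample)
print(env_sketch(config, 'DB_PORT', func=int))
print(config['CUSTOMERS'])
print(float(config['PRICES']['BOOK']))
```

Keeping everything as strings and converting at lookup time is what makes a callback like func=int useful.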

`models.py`

models.py is where you put your peewee database models if your automation requires database access.

`migrations.py`

Finally, running the migrations.py file will drop all the tables in your database (based on what is defined in models.py), then recreate them. It's used on the database design side of things while you're developing. All this file does is import your models and pass them to selenext's migrate function, which takes care of the rest. If you need to write database seeders, you could add them to migrations.py. I usually create an additional Seed module in my project's Jobs module, place all my table seeder jobs in it, then run each seeder as a Job using run_job('Seed.UserTable').

Single Threaded Automations

Here is an example of a single-threaded automation. If you want to learn about building multi-threaded automations, head to the ThreadedCommandFactory section. For single-threaded automations, we need to import all the appropriate pieces.

from time import sleep

# Database models used to interact with databases if needed.
import models

# The environment variable loader.  These variables can be set in the .env file.  
# This is important if we want to create configurable web automations / scrapers.
from selenext.Environment import env, env_driver

# Pull in controllers from selenext's example SiteAutomations.
from selenext.SiteAutomations.Examples import GoogleExample, BingExample

Controllers are kept in the SiteAutomations folder. In all of these examples, we are importing selenext's example controllers which are located at selenext/SiteAutomations/Examples, so you can play with those, or write your own Controllers in the SiteAutomations folder in your projects folder.

# The quitting context helps to `close()` and `quit()` the WebDriver instance if 
# something goes wrong.
from selenext.Helpers.Contexts import quitting
from selenium.webdriver.support.wait import WebDriverWait

To get a hold of our WebDriver instance, we need to use the env_driver and env functions. env('BROWSER') will return the name of the browser set in the .env file and env_driver takes the name of the browser, and returns the appropriate WebDriver instance. The quitting function is used to open the WebDriver instance the same way you would open a file using with. When you add all of this together, you get:

# This could be written as:
#
# browser = env("BROWSER")
# web_driver = env_driver(browser)
# with quitting(web_driver()) as driver:
#     pass
with quitting(env_driver(env("BROWSER"))()) as driver:
    # Do stuff.

Now that we have a valid WebDriver instance, we can instantiate our Controllers and do some work.

    # Get an instance of `WebDriverWait`.
    wait = WebDriverWait(driver, 30)

    # Pass the web driver to the site automation along with anything
    # else it might need to do its job. This could include an
    # instance of `WebDriverWait`, and even the collection of
    # Models.
    # `models` is the module imported at the top of the script.
    google_search = GoogleExample.GoogleSearch(driver, wait, models)
    bing_search = BingExample.BingSearch(driver, wait, models)

    # Do stuff with your controllers.
    google_search.do_search('google wiki')
    sleep(5)
    bing_search.do_search('bing wiki')
    sleep(5)
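The idea behind the quitting context can be sketched with contextlib. This is an illustrative stand-in, not selenext's code: quitting_sketch and FakeDriver are hypothetical names, and the sketch assumes the context calls close() and then quit() on exit, whether or not an error occurred.

```python
from contextlib import contextmanager


@contextmanager
def quitting_sketch(driver):
    """Yield the driver, then always close and quit it."""
    try:
        yield driver
    finally:
        try:
            driver.close()
        finally:
            # Runs even if close() raised, so the browser process still ends.
            driver.quit()


class FakeDriver(object):
    """Stand-in for a WebDriver that records which calls were made."""
    def __init__(self):
        self.calls = []

    def close(self):
        self.calls.append('close')

    def quit(self):
        self.calls.append('quit')


fake = FakeDriver()
try:
    with quitting_sketch(fake) as driver:
        raise RuntimeError('something went wrong mid-automation')
except RuntimeError:
    pass
print(fake.calls)
```

The exception still propagates out of the with block; the context only guarantees that the browser is cleaned up first.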

Controllers

There are two types of controllers. The first is really just a class that controls a WebDriver instance. The second is a class that inherits from IndependentController and also controls a WebDriver instance. The only difference is that instances of IndependentController attach their WebDriver instance after they're instantiated, using the attach_driver method. This facilitates the use of the ThreadedCommandFactory and CommandFactory objects. A basic controller might look something like:

class Google(object):
    def __init__(self, driver, wait):
        self.driver = driver
        self.wait = wait

    def do_search(self, search_term):
        self.driver.get('https://google.com')

        # Type search
        search_input = self.driver.find_element_by_name('q')
        search_input.send_keys(search_term)

        # Click search button.
        search_button = self.driver.find_element_by_name('btnG')
        search_button.click()
        self.wait.until(lambda the_driver: the_driver.find_element_by_id('resultStats').is_displayed())
        return self

Or, if you wanted to create Command objects with the ThreadedCommandFactory objects, it might look like this:

from selenext.Helpers.Controllers import has_kwargs

# Inherit from IndependentController (which must be imported as well) to
# automatically get access to the `attach_driver` method.
class ThreadedGoogleSearch(IndependentController):
    def __init__(self, models):
        self.models = models

    # Using the @has_kwargs decorator allows keyword arguments to be
    # passed to the method.  When you assemble a command pack for the
    # CommandManager, just include an instance of the Kwargs object.
    @has_kwargs
    def do_search(self, search_term, some_kwarg='some value'):
        print(some_kwarg)
        self.driver.get('https://google.com')

        # Type search
        search_input = self.driver.find_element_by_name('q')
        search_input.send_keys(search_term)

        # Click search button.
        search_button = self.driver.find_element_by_name('btnG')
        search_button.click()
        self.wait.until(lambda the_driver: the_driver.find_element_by_id('resultStats').is_displayed())
        return self
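To illustrate the difference between the two controller styles, here is a rough sketch of a base class whose driver is attached after construction. IndependentControllerSketch and SearchController are hypothetical names, and returning self from attach_driver is an assumption; selenext's real IndependentController may differ.

```python
class IndependentControllerSketch(object):
    """Hypothetical base class: the driver is attached after construction."""
    driver = None

    def attach_driver(self, driver):
        # Store the driver on the instance so controller methods can use it.
        self.driver = driver
        return self


class SearchController(IndependentControllerSketch):
    def __init__(self, models):
        # No driver is needed at construction time.
        self.models = models

    def has_driver(self):
        return self.driver is not None


controller = SearchController(models=None)
print(controller.has_driver())
controller.attach_driver('driver-placeholder')
print(controller.has_driver())
```

Because the constructor never needs a driver, a factory can build all the controllers first and hand each one its own browser later.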

`ThreadedCommandFactory`, `CommandFactory`, and `Command`

ThreadedCommandFactory and CommandFactory are used to create Command objects, which are used to execute Controller methods. This facilitates the use of separate WebDrivers for each Controller (each controller gets its own browser). Both factories inherit from BaseCommandFactory, which sets up the dict-like functionality and the base methods that make up the factories. In order to use one of these factories, you must pass a dict of Controllers to the factory.

# Grab the models that the Controllers need. They aren't actually used here; they're passed only as an example.
import models
# Grab the Example Controllers.
from selenext.SiteAutomations.Examples import GoogleExample, BingExample
# And lastly the CommandFactory
from selenext.Helpers.Commands import ThreadedCommandFactory

# Here we set up the dict of controllers.
controllers = {
    'google': GoogleExample.ThreadedGoogleSearch(models),
    'bing': BingExample.ThreadedBingSearch(models)
}

# Get the CommandFactory instance by passing it the Controllers.
cmd_factory = ThreadedCommandFactory(controllers, logging=False)

Once we have a CommandFactory, we can create the Command instance. The Command instance is used to execute the various commands (methods) your controllers have. This is done by creating a dict of tuples that uses the same keys as the dict of Controllers; each tuple holds the arguments for that controller. Then pass cmd_factory.create_command a function that takes a Controller as its first argument, followed by this new dict.

# Setting up the Command pack.
search_command = {
    'google': ('google wiki',),  # note how single arguments still need to be passed as a tuple
    'bing': ('bing wiki',)
}
# Here we pass an anonymous function as the first argument,
# and search_command as the second.
cmd = cmd_factory.create_command(
    lambda controller, *args: controller.do_search(*args), 
    search_command
)
# Start the command!
cmd.start()

This will execute the do_search method on each controller, each in its own thread, so the whole command takes only about as long as the slowest method call.
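The one-thread-per-controller pattern described above can be sketched with the standard threading module. This is not how selenext's Command is implemented; run_in_threads and EchoController are hypothetical names used only to show the shape of the idea.

```python
import threading


def run_in_threads(func, controllers, command_pack):
    """Call func(controller, *args) for each key, one thread per key."""
    results = {}

    def worker(key):
        results[key] = func(controllers[key], *command_pack[key])

    threads = [threading.Thread(target=worker, args=(key,))
               for key in command_pack]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # Total time is roughly that of the slowest call.
    return results


class EchoController(object):
    """Placeholder controller; a real one would drive a browser."""
    def __init__(self, name):
        self.name = name

    def do_search(self, term):
        return '%s searched for %s' % (self.name, term)


controllers = {'google': EchoController('google'),
               'bing': EchoController('bing')}
pack = {'google': ('google wiki',), 'bing': ('bing wiki',)}
results = run_in_threads(
    lambda controller, *args: controller.do_search(*args), controllers, pack)
print(sorted(results.items()))
```

The matching keys are what tie each argument tuple to its controller, which is why the command pack must reuse the keys from the controllers dict.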

Jobs

Jobs allow you to run pieces of code by calling run_job('JobName'). Jobs are kept inside the Jobs folder of your project as individual python files. As long as a file contains a start_job function, passing its filename (without the extension) to run_job will execute that function. Under the hood, run_job imports the name you pass to it, extracts the start_job function, and runs it for you. This is super useful for common database operations: maybe you need to seed a database with some fake data, or clear some data out of the database every Sunday at midnight. If you're using PyCharm, it's super simple to open a python console in your project folder, then run your jobs.

from selenext.Project.Jobs import run_job
run_job('ExampleJob')

Here is what a custom job might look like. Let's assume this file is called UserSeeder.py:

import models
from faker import Faker


def start_job():
    print('starting user table seeder!')
    fake = Faker()

    # Create 10 fake users in the User table(doesn't actually exist)
    for i in range(0, 10):
        user = {
            'name': fake.name(),
            'address': fake.address()
        }
        models.User.create(**user)

    print('finished seeding users table!')

To run this job, we run the run_job function with the python file name:

run_job('UserSeeder')
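The import-then-call behavior of run_job can be approximated with importlib. The following is a sketch under assumptions, not the library's code: run_job_sketch is a hypothetical name, and it assumes jobs live in an importable package called Jobs.

```python
import importlib


def run_job_sketch(job_name, package='Jobs'):
    """Import <package>.<job_name> and call its start_job function."""
    module = importlib.import_module('%s.%s' % (package, job_name))
    start_job = getattr(module, 'start_job', None)
    if not callable(start_job):
        raise AttributeError('%s has no start_job function' % module.__name__)
    return start_job()


# Usage, assuming a Jobs/UserSeeder.py file is importable from the project root:
# run_job_sketch('UserSeeder')
```

Because the job name is just a dotted import path, nested names such as 'Seed.UserTable' would resolve the same way under this approach.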

Requests WebReader

WebReader allows you to use the find_element/find_elements methods to retrieve Requests.WebElement instances (this is a helper object within selenext, not part of the Requests library). These differ from a selenium.WebElement in that interaction with elements is not supported, but you can traverse the DOM and read information from the elements in the same way you would with a selenium.WebElement instance. That means you can save resources by using the requests helpers instead of firing up a selenium WebDriver if all you are doing is scraping a non-interactive web page:

from selenext.Helpers.Requests import WebReader

driver = WebReader()

driver.get('http://docs.python-requests.org/en/master/')

section = driver.find_element_by_class_name('section')
print(section.text)
print(section.get_attribute('id'))
print(section.find_element_by_tag_name('p').text)
print([e.text for e in section.find_elements_by_xpath('//*[@class="section"]')])

`human_click`, `human_fill` and `randomly_waits`

These are all part of the selenext.Helpers.Controllers module.

randomly_waits is a decorator that makes a method or function wait a random amount of time between 0.99 seconds and 3.01 seconds. You can use it on controller methods to enable more human-like behavior. human_click and human_fill are wrappers around selenium.WebElement's click and send_keys methods that do just this. Here is how human_fill is defined:

from selenext.Helpers.Controllers import randomly_waits

@randomly_waits
def human_fill(element, text):
    return element.send_keys(text)
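A decorator with this behavior could be written roughly as follows. This is a sketch rather than selenext's source: randomly_waits_sketch and human_fill_sketch are hypothetical names, and pausing before (rather than after) the wrapped call is an assumption.

```python
import functools
import random
import time


def randomly_waits_sketch(func):
    """Pause for a random 0.99 to 3.01 seconds, then call the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        time.sleep(random.uniform(0.99, 3.01))
        return func(*args, **kwargs)
    return wrapper


@randomly_waits_sketch
def human_fill_sketch(element, text):
    # Same shape as human_fill above: delay, then type into the element.
    return element.send_keys(text)
```

functools.wraps keeps the wrapped function's name and docstring intact, which helps when decorated controller methods show up in logs or tracebacks.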

Using the `Page` object

A Page object is used to access parts of a webpage in an object oriented way. You define the root url of the webpage in question, and how the driver should find the elements you would like to interact with. The Page object will fetch the elements as you need them.

Each Page requires a valid dict describing the page's elements and states.

In the examples, I use JSON which would then be loaded into a dict with json.loads. Let's assume that JSON is in a file called free_search.json:

{
  "root": "https://portland.craigslist.org/search/zip",
  "elements": {
    "search_input": {
      "lookup_method": "id",
      "selector": "query"
    },
    "search_button": {
      "lookup_method": "xpath",
      "selector": "//button[@class='searchbtn changed_input clickme']"
    },
    "item_titles": {
      "multiple": true,
      "bind": "str",
      "lookup_method": "xpath",
      "selector": "//li[@class='result-row']"
    },
    "range_from": {
      "bind": ["decimal", "Decimal"],
      "lookup_method": "class_name",
      "selector": "rangeFrom"
    },
    "range_to": {
      "bind": "int",
      "lookup_method": "class_name",
      "selector": "rangeTo"
    }
  }
}

You can see that each element is named and has a lookup_method and a selector. One of them even includes the multiple attribute, which allows multiple elements to be selected. If you want to grab a specific element out of the multiple elements in the list, you can set an index. Basically, any of the find_element_by_ methods can be called by setting the lookup_method (id, name, class_name, etc.) and the selector. Now that we have our page defined, let's interact with it! First we need to pull in some of the various pieces.

from time import sleep
from json import loads
from selenext.Environment import env, env_driver
from selenext.Helpers import load_page
from selenext.Helpers.Contexts import quitting
from selenext.Helpers.Controllers import human_fill, human_click


# Set up the WebDriver instance.  I'm using Chrome in my .env file.
with quitting(env_driver(env('BROWSER'))()) as driver:

    # Instantiate the `Page` object and give it the path to the JSON file.
    free_search_page = load_page('free_search.json', driver)

Next, we check to see if we are on the right page. If the Page instance has its __bool__ method invoked, it will check to make sure that each one of the elements that have been defined for the page can be found by the driver.

    if not free_search_page:
        # An instance of `Page` also wraps the `WebDriver`'s `get` method.
        free_search_page.get(free_search_page.root)

Now we check to see if the search input exists. Since accessing an element directly from the Page instance will result in a selenium web element being returned, and an error being thrown if it is not found, the PageElement instances have an exists method that can be used to check if the elements exist on the current page or not. These PageElement instances can be checked directly through the view attribute on the Page instance.

    if free_search_page.view.search_input.exists():
        human_fill(free_search_page.search_input, 'lumber')
        # or free_search_page.search_input.send_keys('lumber') 
        # if you didn't read about `human_click` or `human_fill`.

    # Wait until the button appears.
    while not free_search_page.view.search_button.exists():
        sleep(1)

    # Click the search button.
    human_click(free_search_page.search_button)

    # Look at the titles.
    for name in free_search_page.item_titles:
        print(name.text)
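The way a lookup_method and selector pair maps onto a find_element_by_ call can be sketched with getattr. This illustrates the idea only and is not the Page object's actual code: find_from_definition and RecordingDriver are hypothetical names, and the stand-in driver reports which finder was called instead of searching a real page.

```python
def find_from_definition(driver, definition):
    """Call the find_element(s)_by_<lookup_method> method named in the dict."""
    if definition.get('multiple'):
        prefix = 'find_elements_by_'
    else:
        prefix = 'find_element_by_'
    finder = getattr(driver, prefix + definition['lookup_method'])
    return finder(definition['selector'])


class RecordingDriver(object):
    """Stand-in driver: reports which finder was called instead of searching."""
    def find_element_by_id(self, selector):
        return ('find_element_by_id', selector)

    def find_elements_by_xpath(self, selector):
        return [('find_elements_by_xpath', selector)]


driver = RecordingDriver()
single = find_from_definition(
    driver, {'lookup_method': 'id', 'selector': 'query'})
many = find_from_definition(driver, {
    'multiple': True,
    'lookup_method': 'xpath',
    'selector': "//li[@class='result-row']",
})
print(single)
print(many)
```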

If an element exists within a frame, you can define that as well. The free_search.json example does not have any elements inside of a frame, but I will give an example of how this can be defined below.

{
  "root": "https://somewebsitewithframes.com",
  "elements": {
    "some_input": {
      "selector": "input",
      "lookup_method": "id",
      "frame": {
        "selector": "results",
        "lookup_method": "id",
        "frame": {
          "selector": "content",
          "lookup_method": "name"
        }
      }
    }
  }
}

The Page defined above has an input element that exists in a frame. Not only that, the frame the element exists in is sitting in its own frame. So in order for the WebDriver to find the element, it must first find the content frame, then find the results frame in order to locate the input element.
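The ordering described above (outermost frame first) can be worked out from the nested dict with a short helper. This is a sketch of the traversal only; frame_path is a hypothetical name and is not part of selenext.

```python
def frame_path(definition):
    """Return frame definitions ordered from outermost to innermost."""
    path = []
    frame = definition.get('frame')
    while frame is not None:
        # Each nested "frame" key describes the frame that contains this one.
        path.append({'selector': frame['selector'],
                     'lookup_method': frame['lookup_method']})
        frame = frame.get('frame')
    path.reverse()
    return path


some_input = {
    'selector': 'input',
    'lookup_method': 'id',
    'frame': {
        'selector': 'results',
        'lookup_method': 'id',
        'frame': {'selector': 'content', 'lookup_method': 'name'},
    },
}
print([frame['selector'] for frame in frame_path(some_input)])
```

A driver would switch into each frame in this order before looking up the element itself.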

You can even select elements using parent elements:

{
  "elements": {
    "search_input": {
      "selector": "query",
      "lookup_method": "id",
      "parent": {
        "selector": "div",
        "lookup_method": "tag_name",
        "parent": {
          "selector": "body",
          "lookup_method": "tag_name"
        }
      }
    }
  }
}

This is the equivalent of the following selenium script:

body = driver.find_element_by_tag_name('body')
div = body.find_element_by_tag_name('div')
query = div.find_element_by_id('query')
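The same chain can be resolved generically by recursing through the parent keys. The sketch below uses a stand-in node class so it runs without a browser; find_with_parents and FakeNode are hypothetical names, not selenext APIs.

```python
def find_with_parents(root, definition):
    """Resolve the outermost parent first, then search inside it."""
    parent_definition = definition.get('parent')
    if parent_definition:
        scope = find_with_parents(root, parent_definition)
    else:
        scope = root
    finder = getattr(scope, 'find_element_by_' + definition['lookup_method'])
    return finder(definition['selector'])


class FakeNode(object):
    """Stand-in for a driver or element; remembers the path used to reach it."""
    def __init__(self, path=()):
        self.path = tuple(path)

    def find_element_by_tag_name(self, selector):
        return FakeNode(self.path + (('tag_name', selector),))

    def find_element_by_id(self, selector):
        return FakeNode(self.path + (('id', selector),))


search_input = {
    'selector': 'query',
    'lookup_method': 'id',
    'parent': {
        'selector': 'div',
        'lookup_method': 'tag_name',
        'parent': {'selector': 'body', 'lookup_method': 'tag_name'},
    },
}
found = find_with_parents(FakeNode(), search_input)
print(found.path)
```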

`PageStateContainer` and `PageState` objects

PageStateContainer holds PageState instances and makes them accessible as attributes. The PageStateContainer is automatically added to Page objects as the state attribute if you have defined page states in the dict used to instantiate the Page object.

PageState objects can be used to check if the web page is in a pre-defined state. You can define this in the same dict used for the Page object. Here is a JSON example for Reddit. Everything works the same, except now we add a dict of states.

{
  "login_url": "https://reddit.com",
  "elements": {
    "login_link": {
      "selector": "Log in or sign up",
      "lookup_method": "link_text"
    },
    "username_input": {
      "selector": "user_login",
      "lookup_method": "id"
    },
    "password_input": {
      "selector": "passwd_login",
      "lookup_method": "id"
    },
    "remember_me": {
      "selector": "rem_login",
      "lookup_method": "id"
    },
    "login_button": {
      "selector": "button",
      "lookup_method": "tag_name",
      "parent": {
        "selector": "login-form",
        "lookup_method": "id"
      }
    },
    "logout_link": {
      "selector": "logout",
      "lookup_method": "link_text"
    },
    "redditor_time": {
      "selector": "age",
      "lookup_method": "class_name"
    },
    "moderator_list": {
      "selector": "side-mod-list",
      "lookup_method": "id"
    },
    "user_profile_link": {
      "selector": "user",
      "lookup_method": "class_name"
    }
  },
  "states": {
    "login_form_displayed": {
      "exists": [
        "username_input",
        "password_input"
      ],
      "displayed": [
        "username_input",
        "password_input"
      ]
    },
    "login_link_displayed": {
      "exists": [
        "login_link"
      ],
      "displayed": [
        "login_link"
      ]
    },
    "login_finished": {
      "exists": [
        "logout_link"
      ],
      "displayed": [
        "logout_link"
      ],
      "absent": [
        "username_input"
      ]
    },
    "on_profile": {
      "exists": [
        "redditor_time",
        "moderator_list"
      ]
    },
    "logged_out": {
      "exists": [
        "login_link"
      ],
      "displayed": [
        "login_link"
      ]
    }
  }
}

Each state can contain any combination of the following named lists: "exists", "absent", "displayed", "not_displayed", "enabled" and "disabled". You use the names of the elements you defined as the entries in these lists to describe the state of a web page.

"exists" means the element appears in the DOM, and "absent" means the element does not appear in the DOM. "displayed" means the element is in the DOM and actually displayed on the screen, and "not_displayed" means the element is in the DOM but not displayed on the screen. "enabled" means the element is enabled in the DOM, and "disabled" is the opposite.

Using the JSON example above, we can write an automation like this:

if not page.state.login_link_displayed():
    page.get(page.login_url)
    page.state.login_link_displayed.wait()

page.login_link.click()
page.state.login_form_displayed.wait()
if page.state.login_form_displayed():
    page.username_input.send_keys('some_user')
    page.password_input.send_keys('some_password')
    page.login_button.click()
    page.state.login_finished.wait(timeout=10)

It's important to note that when checking if an element is displayed or enabled, the element must exist in the DOM or an error will be thrown.
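The semantics of the six lists can be sketched as a plain function over a mapping of located elements. This is not how PageState is implemented: state_matches and FakeElement are hypothetical names, and unlike the behavior described above, this sketch tests for presence first so a missing element yields False instead of raising.

```python
class FakeElement(object):
    """Stand-in for a located element."""
    def __init__(self, displayed=True, enabled=True):
        self.displayed = displayed
        self.enabled = enabled


def state_matches(state, found):
    """Check a state dict against `found`, a name -> element mapping.

    Names missing from `found` are treated as absent from the DOM.
    """
    def present(name):
        return name in found

    checks = {
        'exists': present,
        'absent': lambda name: not present(name),
        # The presence test runs first, so a missing element yields False
        # here instead of raising.
        'displayed': lambda name: present(name) and found[name].displayed,
        'not_displayed': lambda name: present(name) and not found[name].displayed,
        'enabled': lambda name: present(name) and found[name].enabled,
        'disabled': lambda name: present(name) and not found[name].enabled,
    }
    return all(checks[kind](name)
               for kind, names in state.items()
               for name in names)


login_finished = {
    'exists': ['logout_link'],
    'displayed': ['logout_link'],
    'absent': ['username_input'],
}
print(state_matches(login_finished, {'logout_link': FakeElement()}))
print(state_matches(login_finished, {'username_input': FakeElement()}))
```

A state is satisfied only when every listed element passes every check, so adding an element to "exists" alongside "displayed" costs nothing and documents the intent.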
