old_php - gskluzacek/previews GitHub Wiki

Here is some information on old code bases that I've previously written.

Get Cover Date

Functionality

This program takes an input file (generated as an export from Comic Collectorz) containing a list of comic book titles and issue numbers (along with some other info) and loads it into a SQLite DB.
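The load step might look roughly like the following in Python. This is a minimal sketch, not the original PHP: the CSV layout, the column names (`title`, `issue`), the table name `collection`, and the helper `load_collection` are all assumptions, since the actual export format isn't shown here.

```python
import csv
import io
import sqlite3

# Hypothetical export contents; the real Comic Collectorz export layout may differ.
SAMPLE_EXPORT = """title,issue
Example Title,1
Example Title,2
Other Title,5
"""


def load_collection(conn, export_file):
    """Load (title, issue) rows from a CSV export into an assumed `collection` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS collection (title TEXT NOT NULL, issue TEXT NOT NULL)"
    )
    reader = csv.DictReader(export_file)
    rows = [(row["title"].strip(), row["issue"].strip()) for row in reader]
    # The connection as a context manager commits on success and rolls back on error.
    with conn:
        conn.executemany("INSERT INTO collection (title, issue) VALUES (?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_collection(conn, io.StringIO(SAMPLE_EXPORT))
print(loaded)
```

The issue number is stored as text here on the assumption that some issue numbers are not purely numeric. For a real export, `io.StringIO(SAMPLE_EXPORT)` would be replaced by an opened file.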

For each title, the program uses curl to get the title index from a web site and parses the returned HTML to extract various data, including issue number and cover date. This data is also loaded into the DB.

Finally, the two sets of data are joined together to produce a report with the original titles and issue numbers along with each issue's cover date.
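The join behind the report could be sketched as follows. The table names (`collection`, `site_issues`), the column names, and the sample rows are made up for illustration; the original schema isn't shown here. A `LEFT JOIN` is one reasonable choice, because it keeps an owned issue in the report even when no cover date was found for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and sample rows, standing in for the two loaded data sets.
conn.executescript("""
    CREATE TABLE collection (title TEXT, issue TEXT);
    CREATE TABLE site_issues (title TEXT, issue TEXT, cover_date TEXT);
    INSERT INTO collection VALUES
        ('Example Title', '1'), ('Example Title', '2'), ('Other Title', '5');
    INSERT INTO site_issues VALUES
        ('Example Title', '1', 'January 1990'),
        ('Example Title', '2', 'February 1990');
""")

# LEFT JOIN keeps every row from the collection; cover_date is NULL when unmatched.
report = conn.execute("""
    SELECT c.title, c.issue, s.cover_date
    FROM collection AS c
    LEFT JOIN site_issues AS s
      ON s.title = c.title AND s.issue = c.issue
    ORDER BY c.title, c.issue
""").fetchall()

for title, issue, cover_date in report:
    print(f"{title} #{issue}: {cover_date or 'cover date not found'}")
```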

The interesting part...

The interesting part of this is the code that fetches & parses the title index from the web site. The PHP code used curl_exec() to fetch the web page and regular expressions to parse out the data.
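For comparison, regex-style parsing of this kind looks roughly like the sketch below, written in Python rather than PHP. The HTML fragment and the pattern are invented for illustration; they are not the site's real markup or the original expressions.

```python
import re

# Made-up fragment; the real site's markup is not reproduced here.
SAMPLE_HTML = """
<tr><td><a href="issue.php?id=1">1</a></td><td>January 1990</td></tr>
<tr><td><a href="issue.php?id=2">2</a></td><td>February 1990</td></tr>
"""

# One row: the first cell holds a link whose text is the issue number,
# and the second cell holds the cover date.
ROW_RE = re.compile(
    r"<tr>\s*<td>\s*<a[^>]*>(?P<issue>[^<]+)</a>\s*</td>\s*<td>(?P<cover_date>[^<]*)</td>",
    re.IGNORECASE,
)

issues = [
    (m.group("issue").strip(), m.group("cover_date").strip())
    for m in ROW_RE.finditer(SAMPLE_HTML)
]
print(issues)
```

A pattern like this is tied to the exact tag layout: an extra attribute on a `<td>` or a nested tag inside a cell would stop it from matching, which is the main weakness of parsing HTML with regex alone.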

To rewrite this code in Python, I want to use the Requests module to fetch the web page. Instead of relying exclusively on regex to parse the data, I'd like to use XPath, since HTML is tree-structured markup much like XML. So I want to use the lxml module, whose HTML parser also copes with pages that aren't well-formed XML.

Some prototyping in Python

import requests
from lxml import html
from lxml import etree

# getting the web page is a snap with Requests
url = "http://www.index.com/title.php?id=1234"
page = requests.get(url, timeout=30)
# fail early on an HTTP error instead of parsing an error page
page.raise_for_status()

# now we parse the html into a tree
tree = html.fromstring(page.content)

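# and get the <tr> & <tbody> elements that contain the data we are interested in: every child
# of the table holding the 'Issue' link, minus the first child (assumed to be the header row)
trs = tree.xpath("//a[@class='page_link'][text()='Issue']/ancestor::table[1]/child::*")[1:]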

# let's collect the first <td> of every row into a single list, keeping everything in
# document order. The table's children are a mix of bare <tr> tags and <tbody> wrappers.
tds_collected = []
for child in trs:
  # a bare <tr> is a row itself; a <tbody> contributes each of its child rows
  rows = [child] if child.tag == 'tr' else list(child)
  for row in rows:
    # skip rows with no children so row[0] cannot raise an IndexError
    if len(row):
      # the first child of the <tr> tag is the <td> we are after
      tds_collected.append(row[0])

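# now print out the collected <td> elements, numbered from 1
for ndx, td1 in enumerate(tds_collected, 1):
  # encoding="unicode" returns a str; the default returns bytes, which print as b'...'
  print("%s: %s" % (ndx, etree.tostring(td1, encoding="unicode")))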