Data

This page summarizes where data is coming from for this project.

Station popularity

For DC and Minneapolis, data come from trip history files, available here and here.

For San Francisco, data comes from scraping the realtime feed here.

Transit data

The transit stops for this project come from the GTFS feeds for the relevant transit agencies. Once retrieved, the feeds are processed by transitdata/extractStopsByMode.py to create a CSV file for each mode. We differentiate rail and bus not because we believe one to be inherently more attractive than the other, but because the distinction is a rough proxy for ridership: rail stops, we assume, tend to have more ridership than bus stops and are thus more of a draw for bikeshare use. If we had Automated Passenger Count (APC) data for all of the regions in the model, we could weight by ridership directly when calculating accessibility measures, but such data is not available in Washington.

First, download all of the GTFS data to one directory, giving the files meaningful names (i.e. not google_transit(1...n).zip). The GTFS files we're using can be found here. Then run the script extractStopsByMode.py in that directory to create files bus.csv and rail.csv with all stops for each mode. These are in WGS 84 and can be fed directly to OTP Batch Analyst.
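As a rough illustration of what the extraction does (this is a sketch, not the actual extractStopsByMode.py), classifying stops amounts to joining stops to routes through stop_times and trips and splitting on the GTFS route_type. The sketch below assumes the feeds are the *.zip files in the working directory and treats route_type 3 as bus and 0–2 as rail; the real script may differ.

```python
# Hypothetical sketch of extractStopsByMode.py's approach; not the actual script.
# Assumes every *.zip in the current directory is a GTFS feed.
import csv, glob, io, zipfile

# GTFS route_type codes: 3 = bus; 0 (tram), 1 (subway/metro), 2 (rail) treated as rail here.
RAIL_TYPES = {'0', '1', '2'}
BUS_TYPES = {'3'}

def read(feed, name):
    """Read one GTFS table from an open zip file as a list of dict rows."""
    with feed.open(name) as f:
        return list(csv.DictReader(io.TextIOWrapper(f, 'utf-8-sig')))

bus_stops, rail_stops = {}, {}

for path in glob.glob('*.zip'):
    with zipfile.ZipFile(path) as feed:
        routes = {r['route_id']: r['route_type'] for r in read(feed, 'routes.txt')}
        trips = {t['trip_id']: routes.get(t['route_id']) for t in read(feed, 'trips.txt')}
        stops = {s['stop_id']: s for s in read(feed, 'stops.txt')}

        # A stop counts as a bus/rail stop if any trip of that mode serves it.
        for st in read(feed, 'stop_times.txt'):
            rtype = trips.get(st['trip_id'])
            stop = stops.get(st['stop_id'])
            if stop is None:
                continue
            key = (path, stop['stop_id'])  # avoid stop_id collisions across feeds
            if rtype in BUS_TYPES:
                bus_stops[key] = stop
            elif rtype in RAIL_TYPES:
                rail_stops[key] = stop

for filename, collected in (('bus.csv', bus_stops), ('rail.csv', rail_stops)):
    with open(filename, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['stop_id', 'stop_name', 'lat', 'lon'])  # WGS 84, as GTFS requires
        for (_, stop_id), stop in collected.items():
            writer.writerow([stop_id, stop['stop_name'], stop['stop_lat'], stop['stop_lon']])
```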

For Minneapolis, GTFS is here.

Residence locations

We are using the Census Bureau's TIGER/Line files pre-joined with demographic data. We are interested only in the population of each block. We use the following files for each area:

  • Washington, DC: The files for Maryland, Virginia and the District of Columbia.
  • San Francisco Bay Area, CA: The file for California.
  • Santa Barbara, CA: The file for California.

If you're working on a multi-state area, first merge the layers: place all of the shapefiles in one directory, then merge them in QGIS (Vector > Data Management Tools > Merge Shapefiles to One). Save the result to a temporary directory and add it to the map. If you're working in a very large state (e.g. California), first select the blocks you want, then export the selection to a new vector file.
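If you'd rather script this step than click through QGIS, the same merge-and-select can be done with geopandas. The sketch below is an alternative to the QGIS workflow above, not part of it; the file names and bounding box are placeholders.

```python
# Alternative to the QGIS merge: combine state block shapefiles and clip to the study area.
# File names and the bounding box are placeholders -- substitute your own.
import geopandas as gpd
import pandas as pd

states = ['tabblock2010_11_pophu.shp',  # DC
          'tabblock2010_24_pophu.shp',  # Maryland
          'tabblock2010_51_pophu.shp']  # Virginia

blocks = gpd.GeoDataFrame(pd.concat([gpd.read_file(f) for f in states], ignore_index=True))

# For a very large state, keep only blocks within a study-area bounding box.
# TIGER blocks are in lon/lat, so the box is in degrees (xmin:xmax, ymin:ymax).
blocks = blocks.cx[-77.6:-76.7, 38.5:39.2]

blocks.to_file('merged_blocks.shp')
```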

One more wrinkle is that you'll need a CSVT file alongside the employment CSV (same base name, .csvt extension). QGIS assumes all fields in a CSV are strings unless told otherwise, which would leave the job counts as text rather than numbers. So simply create a CSVT file with this single line: "String","Integer","Integer","Integer" (hat tip: Anita Graser).

Then join the employment data to the blocks and extract the block centroids:

  1. Add the employment CSV to the map using Layer > Add Delimited Text Layer, specify that it has no geometry, and give it a reasonable name like jobs.
  2. Join it to the blocks data using the properties window for the blocks layer, with w_geocode as the join field and BLOCKID10 as the target field.
  3. Export the joined layer using the UTM coordinate system (Zone 18N for DC, EPSG:32618) to a temporary directory, add it to the map, and turn off rendering.
  4. Open the attribute table of the blocks and select all records where the jobs variable IS NULL (using Select by Attributes). Use the field calculator to set the field to zero, then save the shapefile.
  5. Finally, extract the centroids of each block using Vector > Geometry tools > Polygon centroids. There is no need to reproject to WGS 84; OTP will handle that.
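The join, null-fill, reprojection, and centroid extraction can likewise be scripted. The sketch below is an alternative to the QGIS steps above, with assumed file and column names for the jobs CSV (in particular, the jobs column name is a placeholder).

```python
# Alternative to the QGIS join: attach job counts to blocks, fill missing values with zero,
# project to UTM, and write out block centroids. File and column names are assumptions.
import geopandas as gpd
import pandas as pd

blocks = gpd.read_file('merged_blocks.shp')
jobs = pd.read_csv('jobs.csv', dtype={'w_geocode': str})  # keep the block ID as a string

# Join on the block GEOID: w_geocode in LODES, BLOCKID10 in the TIGER blocks.
blocks = blocks.merge(jobs, left_on='BLOCKID10', right_on='w_geocode', how='left')

# Blocks with no matching LODES record have no jobs.
blocks['jobs'] = blocks['jobs'].fillna(0).astype(int)

# Reproject to UTM Zone 18N for DC (EPSG:32618) and take centroids.
centroids = blocks.to_crs(epsg=32618)
centroids['geometry'] = centroids.geometry.centroid
centroids.to_file('block_centroids.shp')
```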

Employment data

We get the total number of jobs from the Census Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES), as was done in Rixey 2013. These are at the census block level as well. The script fetchLodes.sh will download all of the needed files and unzip them. Note that the LODES technical documentation is incorrect; the files are actually at http://lehd.ces.census.gov/data/lodes/LODES7/{state}/wac/filename.csv.gz. The script handles all of the URLs for you.
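For reference, the corrected URL pattern looks roughly like the sketch below. The segment (S000 = all jobs), job type (JT00 = all jobs), and year shown here are assumptions; fetchLodes.sh remains the authoritative list of what the project actually downloads.

```python
# Illustration of the corrected LODES7 WAC URL pattern, not a replacement for fetchLodes.sh.
# The segment, job type, and year below are assumptions.
import urllib.request

states = ['dc', 'md', 'va', 'ca', 'mn']
segment, jobtype, year = 'S000', 'JT00', 2011

for st in states:
    filename = f'{st}_wac_{segment}_{jobtype}_{year}.csv.gz'
    url = f'http://lehd.ces.census.gov/data/lodes/LODES7/{st}/wac/{filename}'
    urllib.request.urlretrieve(url, filename)
```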

If you're using multiple CSV files, confirm that the column orders are the same, then merge them as follows: first copy one of the files to mergedFile.csv, then for each remaining file run tail -n +2 file.csv >> mergedFile.csv to append its rows without the header line.
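If you'd rather check the column order and merge in a single scripted step, a small pandas sketch (with placeholder file names) would be:

```python
# Alternative to the tail-based merge: verify the column order matches, then concatenate.
# The glob pattern and output name are placeholders.
import glob
import pandas as pd

files = sorted(glob.glob('*_wac_*.csv'))
frames = [pd.read_csv(f, dtype={'w_geocode': str}) for f in files]

# All files must share the same columns in the same order before merging.
columns = list(frames[0].columns)
assert all(list(df.columns) == columns for df in frames), 'column order differs between files'

pd.concat(frames, ignore_index=True).to_csv('mergedFile.csv', index=False)
```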