Embedded SQL - brianhigh/data-workshop GitHub Wiki

  1. Embedding SQL in other environments (Stata, R, etc.)
  2. Using a SQL query to populate a data structure
  3. Examples with RStudio and Stata
  4. HoE: Try embedded SQL with R, Stata, SAS, SPSS, Python, etc.
  5. Discussion
  6. Action Items (readings, videos and tasks)

Examples

Example 1: Running a SQL query of Firefox history from R

install.packages("RSQLite")
library(RSQLite)
hist <- dbConnect(SQLite(),'places.sqlite')
> dbListTables(hist)
 [1] "moz_anno_attributes" "moz_annos"           "moz_bookmarks"      
 [4] "moz_bookmarks_roots" "moz_favicons"        "moz_historyvisits"  
 [7] "moz_hosts"           "moz_inputhistory"    "moz_items_annos"    
[10] "moz_keywords"        "moz_places"          "sqlite_sequence"    
[13] "sqlite_stat1"       
> dbListFields(hist,'moz_places')
 [1] "id"              "url"             "title"           "rev_host"       
 [5] "visit_count"     "hidden"          "typed"           "favicon_id"     
 [9] "frecency"        "last_visit_date" "guid"           
> dbGetQuery(hist,'SELECT substr(url,0,26) as link,frecency from moz_places where link like "http%" order by frecency desc limit 3')
                         link frecency
1     https://www.google.com/     2100
2  http://www.washington.edu/     2000
3    http://www.slashdot.org/      150

See also:

Example 2: Using R to create a SQL table from a "data.frame"

We will be creating a SQL table from an R data.frame and running this SQL query on it:

SELECT row_names AS State, Murder, Assault, 
   ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio 
FROM arrests 
ORDER BY MurderAssaultRatio DESC 
LIMIT 10

Our query will SELECT the top-ten states having the highest ratio of murders to assaults, which we will report as a percentage. This sort of analysis is may lead to further inquiries about which state may have a "murder problem" or a "reporting problem".

Here is how we will create the database and run the query in R.

> data("USArrests")
> str(USArrests)
'data.frame':	50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> library("RSQLite")
Loading required package: DBI
> drv <- dbDriver("SQLite")
> sqlfile <- tempfile(tmpdir="~", fileext=".sqlite")
> sqlfile
[1] "~/file1ea87682fba.sqlite"
> con <- dbConnect(drv, dbname = sqlfile)
> data(USArrests)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> query <- "SELECT sql from sqlite_master WHERE type='table' and name = 'arrests';"
> cat(dbGetQuery(con, query)$sql)
CREATE TABLE arrests 
( row_names TEXT,
	Murder REAL,
	Assault INTEGER,
	UrbanPop INTEGER,
	Rape REAL 
);
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder"    "Assault"   "UrbanPop"  "Rape"     
> query <- "SELECT row_names AS State, Murder, Assault, 
                ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio 
            FROM arrests 
            ORDER BY MurderAssaultRatio DESC 
            LIMIT 10"
> dbGetQuery(con, query)
           State Murder Assault MurderAssaultRatio
1         Hawaii    5.3      46               11.5
2       Kentucky    9.7     109                8.9
3        Georgia   17.4     211                8.2
4      Tennessee   13.2     188                7.0
5  West Virginia    5.7      81                7.0
6        Indiana    7.2     113                6.4
7          Texas   12.7     201                6.3
8      Louisiana   15.4     249                6.2
9    Mississippi   16.1     259                6.2
10          Ohio    7.3     120                6.1

And we can see that this same query result can be produced from the Bash shell using the sqlite3 command, using the same SQL SELECT statement. First we will set .header on, .mode column, and .width 15 so that the formatting of the output will be similar. We do this with an "init file" which we will create with echo, which is a shell command that sends a text string to a file. The commands look like this:

$ echo -e ".header on\n.mode column\n.width 15" > sqlite.init
$ QUERY="SELECT row_names AS State, Murder, Assault, 
                ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio 
            FROM arrests 
            ORDER BY MurderAssaultRatio DESC 
            LIMIT 10"
$ sqlite3 -init sqlite.init ~/file1ea87682fba.sqlite "$QUERY"
-- Loading resources from sqlite.init
State            Murder      Assault     MurderAssaultRatio
---------------  ----------  ----------  ------------------
Hawaii           5.3         46          11.5              
Kentucky         9.7         109         8.9               
Georgia          17.4        211         8.2               
Tennessee        13.2        188         7.0               
West Virginia    5.7         81          7.0               
Indiana          7.2         113         6.4               
Texas            12.7        201         6.3               
Louisiana        15.4        249         6.2               
Mississippi      16.1        259         6.2               
Ohio             7.3         120         6.1

See also:

Example 3: Connecting to MySQL, Creating a Table, and Running a Query

The R syntax to perform the same operations using MySQL as the database is almost the same as for SQLite. The only difference is there is not need to create the SQLite file and the dbConnect() function needs different parameters. Here we are connecting to a MySQL server on the same host (MyServer) as our R session and our username is MyUser and password is MyPassword. We are assuming that the "MyUser" user and "MyUser" database already exist. For example, they may been created with these SQL commands from the mysql command-line utility as a user with enhanced privileges.

CREATE USER 'MyUser'@'localhost' IDENTIFIED BY 'MyPassword';
CREATE DATABASE MyUser;
GRANT ALL ON MyUser.* TO 'MyUser'@'localhost';

Here are the R commands...

> install.packages("RMySQL")
> library("RMySQL")
Loading required package: DBI
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, host="TheServer", dbname="MyUser", user="MyUser", password="MyPassword")
> data("USArrests")
> str(USArrests)
'data.frame':	50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> dbListTables(con)
character(0)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder"    "Assault"   "UrbanPop"  "Rape"     
> dbGetQuery(con, "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10")
           State Murder Assault MurderAssaultRatio
1         Hawaii    5.3      46               11.5
2       Kentucky    9.7     109                8.9
3        Georgia   17.4     211                8.2
4      Tennessee   13.2     188                7.0
5  West Virginia    5.7      81                7.0
6        Indiana    7.2     113                6.4
7          Texas   12.7     201                6.3
8      Louisiana   15.4     249                6.2
9    Mississippi   16.1     259                6.2
10          Ohio    7.3     120                6.1

See also:

Example 4: Connecting to PostgreSQL, Creating a Table, and Running a Query

Here we will create the same table and run the same (or similar) query as in the last example, but using PostgreSQL from the same host the as database server runs on (TheServer). (See the comment in the code for connecting from another host.) Also, we are assuming that our account and default database already exists.

Here is the query, slightly modified to work with PostgreSQL:

SELECT "row.names" AS "State", "Murder", "Assault", 
   ROUND(CAST(100 * "Murder" / "Assault" AS NUMERIC), 1) AS "MurderAssaultRatio" 
FROM private.arrests 
ORDER BY "MurderAssaultRatio" DESC 
LIMIT 10

You will notice that we had to quote the column names, the underscore (_) in "row.names" has been replaced by a period (.) and we had to CAST the percentage calculation result. Also, we have to refer to the table as private.arrests since we are creating it in our private schema. Using this schema is more secure, as the default schema is public. However, it will make the commands a little more complex.

Since the private schema may not already exist, you may wish to run these commands from the psql utility first. This should only need to be done once.

$ psql "sslmode=require"
psql (9.1.12)
Type "help" for help.

MyUsername=> CREATE SCHEMA private AUTHORIZATION MyUsername;
ERROR:  schema "private" already exists
MyUsername=> \q

You will also see, below, how the database connection is a little more complex than with SQLite, but very similar to connections with MySQL databases.

> data("USArrests")
> str(USArrests)
'data.frame':	50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> install.packages("RPostgreSQL")
> library(RPostgreSQL)
Loading required package: DBI
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv)

# Or, if you are connecting from another host...
# con <- dbConnect(drv,host="TheServer",user="YourUsername", password="YourPassword")

# In either case, we did not specify the database name with dbname="MyDatabase" because 
# we are using the default in this example. We will however, choose to specify the `private` 
# schema in subsequent commands.

> dbWriteTable(con, c("private","arrests"), USArrests)
[1] TRUE
> rs <- dbSendQuery(con, 'SELECT "row.names" AS "State", "Murder", "Assault", ROUND(CAST(100 * "Murder" / "Assault" AS NUMERIC), 1) AS "MurderAssaultRatio" FROM private.arrests ORDER BY "MurderAssaultRatio" DESC LIMIT 10')
> df <- fetch(rs, n = 10)
> df
           State Murder Assault MurderAssaultRatio
1         Hawaii    5.3      46               11.5
2       Kentucky    9.7     109                8.9
3        Georgia   17.4     211                8.2
4  West Virginia    5.7      81                7.0
5      Tennessee   13.2     188                7.0
6        Indiana    7.2     113                6.4
7          Texas   12.7     201                6.3
8      Louisiana   15.4     249                6.2
9    Mississippi   16.1     259                6.2
10          Ohio    7.3     120                6.1
> dbClearResult(rs)
[1] TRUE

If we connect from another host, we have to embed our database password in our code, which is not very secure. So, it is better to run the code from the same host that runs the server. That way, we do not have to authenticate within the code. If you do store the password in your code, be very careful about securing your code and any history files generated by your statistics package. Some will save a history file without prompting you, and that file will contain any command you may have typed at the prompt. Our server is configured to encrypt all network traffic (through SSL) by default. Make sure your server is configured this way before connecting to it with a password.

Interestingly, it is a lot less work to run from the PostgreSQL command-line utility.

$ psql "sslmode=require"
psql (9.1.12)
Type "help" for help.

MyUsername=> SELECT "row.names" AS "State", "Murder", "Assault", ROUND(CAST(100 * "Murder" / "Assault" AS NUMERIC), 1) AS "MurderAssaultRatio" FROM private.arrests ORDER BY "MurderAssaultRatio" DESC LIMIT 10;
     State     | Murder | Assault | MurderAssaultRatio 
---------------+--------+---------+--------------------
 Hawaii        |    5.3 |      46 |               11.5
 Kentucky      |    9.7 |     109 |                8.9
 Georgia       |   17.4 |     211 |                8.2
 West Virginia |    5.7 |      81 |                7.0
 Tennessee     |   13.2 |     188 |                7.0
 Indiana       |    7.2 |     113 |                6.4
 Texas         |   12.7 |     201 |                6.3
 Louisiana     |   15.4 |     249 |                6.2
 Mississippi   |   16.1 |     259 |                6.2
 Ohio          |    7.3 |     120 |                6.1
(10 rows)

MyUsername=> \q

See also: