Syllabus - brianhigh/data-workshop GitHub Wiki

Data Management for Researchers

This is a tentative syllabus for a collaborative, hands-on course in data management for academic scientific research projects.

Please see the rationale page for why we are running this course. You may also wish to view some participant profiles to get an idea of who might want to be involved in a course like this. Throughout the course, we will develop case studies based on real research projects our participants are currently managing.

We will meet once a week for a presentation and workshop, followed by a quick summary discussion. Before parting, we will agree on action items (i.e., "homework") to prepare for the next meeting.

10 min. - Review of last meeting and "homework"
15 min. - Presentation of new material (see outline below)
15 min. - A guided "hands-on exercise" (laptop or pen/paper)
10 min. - Discussion: share exercise results and choose action items
-------
50 min.

Textbook

We will be using the following as a textbook for our workshop sessions:

Practical Computing for Biologists

The handy reference tables from the appendices can be downloaded freely here:

http://practicalcomputing.org/files/PCfB_Appendices.pdf

We will not have time to review much of this material during our workshops. Instead, we will assign readings from the text and will refer to (and use) the information and techniques it describes. Ideally, participants will have already covered this material in a previous course, as it lays a foundation in the computer skills needed for data management and analysis. These skills include navigating filesystems, using a command-line interface (CLI) known as the "shell" (Terminal), matching text with regular expressions, creating data pipelines, shell scripting, and installing software. We will have some time in our meetings to answer questions about these topics. The chapter on relational databases, however, will be covered in our workshops and expanded upon with material from other sources.

Skipping the Python Chapters

Sadly, we will have to skip the middle section of the book, which covers programming in Python. Python is a great language choice for the book and for your data work, but we simply do not have time to cover it. We encourage you to consider learning Python, along with R, if you do not already know these two important data science languages.

Optional "Clearly Explained" Database Books

For more in-depth coverage of database design and SQL, please consider (optional):

... both by Jan L. Harrington, who really does "clearly explain" things. The used prices for these are very affordable - $8 to $12 each.

eBooks from SPL (O'Reilly Safari)

Most other course materials will be freely available over the Internet. Some resources, however, will be accessed as eBooks through the Seattle Public Library (SPL). If you do not already have an SPL card, you can register to get one here (restrictions apply):

http://www.spl.org/using-the-library/get-started/get-a-library-card

We should also recognize that the UW offers access to some eBook collections with material on data management topics. However, the selection is comparatively limited.

Preparing Your Computer

The textbook recommends some software in Chapter 1 and Appendix 1. We encourage you to install the editors and command-line utilities mentioned in those sections. In addition, those who will be using a Linux server through an X2Go connection should install X2Go-Client. We have loaded all of the software you might need for the course on our departmental bioinformatics server, so those who have departmental accounts can do all of their coursework on this server through an X2Go connection or simply an SSH (Terminal) connection. OS X and Linux come with a suitable SSH client; Windows users may wish to install PuTTY. For those who do not already have access to a Linux environment, the textbook describes how to create a virtual machine running Linux (with VirtualBox).

Learning Objectives

Participants in this course should expect to learn:

When to consider the use of a database system for scientific research projects
How to determine project requirements and anticipate disk, memory and processing needs
The basics of data security in networked environments
Practical skills in managing, converting, and processing data files
How to use a command-line interface (CLI), such as the Bash, R, and SQL interactive consoles
Basic database programming using the SQL language
How to design and implement a relational database
How to connect to and use a database from various statistical applications
How websites are built on (and from) database systems (and other web technologies)
Basic systems administration skills such as installing software and configuring services
Familiarity with virtual machine (VM) technology and how to use it for data system development
How to use collaborative project management applications and revision control systems
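
To give a rough idea of what "basic database programming" and "designing a relational database" can look like in practice, here is a minimal sketch. It is an illustration only, not course material: it assumes Python's built-in sqlite3 module rather than whichever database system the workshops end up using, and the table names (site, sample), the column names, and the measurement values are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the foreign key
    conn.executescript("""
        CREATE TABLE site (
            site_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );
        CREATE TABLE sample (
            sample_id INTEGER PRIMARY KEY,
            site_id   INTEGER NOT NULL REFERENCES site(site_id),
            lead_ppb  REAL
        );
    """)
    # Made-up example rows: two sites, three samples.
    conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "North"), (2, "South")])
    conn.executemany(
        "INSERT INTO sample (site_id, lead_ppb) VALUES (?, ?)",
        [(1, 4.0), (1, 6.0), (2, 10.0)],
    )
    conn.commit()
    return conn


def mean_by_site(conn):
    """Join the two tables and return the mean measurement for each site."""
    return conn.execute("""
        SELECT site.name, AVG(sample.lead_ppb)
        FROM sample
        JOIN site ON sample.site_id = site.site_id
        GROUP BY site.name
        ORDER BY site.name
    """).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for name, mean in mean_by_site(conn):
        print(name, mean)
```

The two-table layout keeps site details in one place and links each sample to its site by key. The JOIN and GROUP BY query then summarizes the samples per site without storing the site name in every sample row.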

Course Outline

Exact topics, exercises, dates and times TBD.

Session 1: Data System Essentials
Session 2: Mobile Data Collection
Session 3: Systems Analysis and Design
Session 4: Introduction to Relational Databases
Session 5: Building Database Tables
Session 6: Database Applications and SQL
Session 7: Web-enabled Data: Applications and Frameworks
Session 8: Cleaning Data and Applied SQL
Session 9: Project Management (PM) and Version Control Systems (VCS)
Session 10: Data Security and System Administration