Data System Essentials - brianhigh/data-workshop GitHub Wiki
Course Introduction
Participant Introductions
Please introduce yourself and share your:
- Degree program
- Research topic (in general)
- Your particular research project
- The types of data or data systems you will be using
- What you hope to get out of this course
Course Structure
- Meeting once a week for ten weeks
- Casual "guided" study-group approach
- Presentations, demos, hands-on exercises, discussions and "homework"
- Materials: A textbook, eBooks, websites, and online videos
Presentation
Types of data systems
- Different types
- Pluses and minuses
- Examples
Sidebar: Data-driven Websites
Most interactive websites are "database-driven".
Cloud Databases
Some interactive "cloud" databases and web applications let you analyze and visualize your own data, and data shared by others, online, with sophisticated query languages and programming support:
- eScience SQLShare - slide presentation
- Google Fusion Tables - examples
- YQL Data Tables - intro screencast
Think twice about storing sensitive data in the cloud. If in doubt (and you should be), store your sensitive data in a secured environment, managed by your own organization, according to your specific security requirements, and those required by the funding agency, your research institution, and any applicable laws.
Databases versus Spreadsheets
Databases
- Manageable
- Organized
- Standardized
- Scalable
- Accessible
Spreadsheets
- Convenient
- Interactive
- Visual
- Flexible
- Portable
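To make the "Standardized" point about databases concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `survey` and its columns are hypothetical, chosen only for illustration. The idea is that a database can refuse a value that breaks a declared rule, whereas a spreadsheet cell accepts whatever is typed unless validation is set up separately.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: "answer" may only hold 'yes' or 'no'.
conn.execute(
    "CREATE TABLE survey ("
    " subject_id INTEGER PRIMARY KEY,"
    " answer TEXT NOT NULL CHECK (answer IN ('yes', 'no'))"
    ")"
)

conn.execute("INSERT INTO survey (subject_id, answer) VALUES (1, 'yes')")

try:
    # 'maybe' violates the CHECK constraint, so the database refuses it.
    conn.execute("INSERT INTO survey (subject_id, answer) VALUES (2, 'maybe')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM survey").fetchone()[0]
print("Bad value rejected:", rejected)
print("Rows stored:", row_count)
```

Only the valid row is stored; the rule lives with the data rather than in the habits of whoever is typing.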
Sidebar: File types and formats
- Line endings (Windows: CRLF; Mac OS X/Unix/Linux: LF)
- Human-readable and computer-readable formats
- Text editors versus word processors
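The line-ending difference is easy to check for yourself. The following is a small sketch, not a complete tool: the function names `count_line_endings` and `to_unix` are made up for this example, and the sample bytes stand in for a file you would normally read in binary mode (for example with `open(path, "rb")`), so that Python does not translate the line endings for you.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows) and bare LF (Unix-style) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract those from the LF total.
    lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "LF": lf}


def to_unix(data: bytes) -> bytes:
    """Convert CRLF line endings to LF."""
    return data.replace(b"\r\n", b"\n")


# Sample text with mixed endings: two CRLF lines and one LF line.
sample = b"id,name\r\n1,Ada\r\n2,Grace\n"
print(count_line_endings(sample))           # {'CRLF': 2, 'LF': 1}
print(count_line_endings(to_unix(sample)))  # {'CRLF': 0, 'LF': 3}
```

A file with mixed counts has usually been edited on more than one operating system, which can confuse some command-line tools.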
Hands-on Group Exercise
Use Case Diagrams (Wikipedia) - focus on the "what" and not the "how"
Working as a group, make a list of the ways that you and others ("actors") will interact ("actions") with your data system. Draw a simple use case diagram with stick figures (actors) and circles (actions). All of the circles should be enclosed in a "system boundary" box, with the stick figures outside of the box. Lines should connect the actors to their actions. Label each line with the information communicated along it.
Some examples that might appear in a (rather contrived) use case diagram for a research project:
- Researcher proposes experimental design.
- Principal investigator approves experimental design.
- Researcher creates survey.
- Subject takes survey.
- Subject provides survey results.
- Researcher analyzes statistical results.
- Researcher produces manuscript.
- Principal investigator reviews manuscript.
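Before drawing, it can help to group the actions by actor, since each actor becomes one stick figure with a line to each of its circles. The sketch below does this for the example list above; the variable names are arbitrary, and this is just one possible way to organize the list.

```python
from collections import defaultdict

# (actor, action) pairs taken from the example list above.
use_cases = [
    ("Researcher", "proposes experimental design"),
    ("Principal investigator", "approves experimental design"),
    ("Researcher", "creates survey"),
    ("Subject", "takes survey"),
    ("Subject", "provides survey results"),
    ("Researcher", "analyzes statistical results"),
    ("Researcher", "produces manuscript"),
    ("Principal investigator", "reviews manuscript"),
]

# Group the actions under each actor, preserving the original order.
actions_by_actor = defaultdict(list)
for actor, action in use_cases:
    actions_by_actor[actor].append(action)

for actor, actions in actions_by_actor.items():
    print(f"{actor}: {len(actions)} action(s)")
    for action in actions:
        print(f"  - {action}")
```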
A sample diagram will be shown on the projection screen.
Discussion: We will project your drawings on the screen and discuss them.
Action Items (readings, videos and tasks)
Tasks
Database behind your favorite website
Find out through Internet research what database system (product name, database type, etc.) underlies your favorite or most-visited website. Examples might be a webmail, search, social, video/movie/music/store, blog, forum, or news website. (Since there are links to information about this on Facebook below, pick another site if that was your favorite.) If the site is popular, you will likely find a blog, news article or conference presentation mentioning the technology that the site uses, including its back-end database system. Look up the database system product name in Wikipedia. Try to determine why that product was chosen over the other alternatives. Be ready to share this information in the next class session in a one-minute verbal presentation.
Limits to Excel as a database
Find out the actual limits on MS Excel (max. file size, number of rows, etc.) that would make it unusable as a database if those limits were exceeded. For the Excel experts (bonus points): How do you link spreadsheets by matching columns, control the allowed values that can be entered in a column, protect cells that contain formulas from being changed, restrict who can modify or view certain spreadsheets, and access the linked spreadsheets from other applications (like websites or statistics programs) over a network?
Use case diagram for your project
Produce a Use Case Diagram for your research study data system. You will present this (for one minute) in the next class session.
Document your data sources in a wiki
Use your wiki in Redmine (or GitHub) to document the list of the data sources you will be working with in your project. Note the file names and locations, file types/applications, organizations/persons/processes they came from, and what you will use them for (i.e., what you will do to/with them). Estimate how much storage space your project will consume (megabytes, gigabytes?), how you will need to access your data (from campus, from off campus, from a mobile device, using what software?) and what sorts of security protections you will need (human subject identifiers?). The wiki language supports tables, which might be a good way to format the information in the wiki. Otherwise, section headings and lists might work okay too. Later you will use this wiki to further elucidate your "data dictionary". Be ready to verbally summarize this in the next session (1 minute presentation).
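If your data files already sit in one folder, a short script can give you a first storage estimate and a table to paste into the wiki. This is a rough sketch under a few assumptions: the function names `inventory` and `to_markdown_table` are made up for this example, the folder path `"."` is a placeholder, and the output uses Markdown table syntax, which GitHub wikis accept by default but which a Redmine wiki may not, depending on how its markup is configured.

```python
import os
from collections import defaultdict


def inventory(folder):
    """Return {extension: [file_count, total_bytes]} for all files under folder."""
    totals = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that cannot be read (e.g. broken links)
            totals[ext][0] += 1
            totals[ext][1] += size
    return dict(totals)


def to_markdown_table(totals):
    """Format the inventory as a Markdown table, one row per file type."""
    lines = ["| File type | Files | Size (MB) |", "| --- | --- | --- |"]
    for ext in sorted(totals):
        count, size = totals[ext]
        lines.append(f"| {ext} | {count} | {size / 1_000_000:.2f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # "." is a placeholder; point this at your own project folder.
    print(to_markdown_table(inventory(".")))
```

The script only counts files and bytes. Where each file came from, what it will be used for, and what protections it needs still have to be written in by hand.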
Get example files for textbook exercises
Download Examples from the textbook and extract the example files from the "pcfb_examples.zip" file to the folder "pcfb". Put that folder in whichever environment you will be working in. For now, this will probably be your "Documents" folder on your own computer or your "home directory" on Plasmid. (We will discuss how to transfer files to and from Plasmid in class.)
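Most operating systems can extract a zip file from the file manager, but the step can also be scripted. The sketch below uses Python's standard `zipfile` module; the function name `extract_examples` is made up for this example, and it assumes that "pcfb_examples.zip" has already been downloaded to the current folder and that you want the files under "Documents/pcfb" in your home directory.

```python
import zipfile
from pathlib import Path


def extract_examples(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    zip_file = Path("pcfb_examples.zip")  # assumed to be in the current folder
    if zip_file.exists():
        names = extract_examples(zip_file, Path.home() / "Documents" / "pcfb")
        print(f"Extracted {len(names)} items")
    else:
        print("pcfb_examples.zip not found; download it first.")
```

If the archive already contains its own top-level folder, the files will end up one level deeper than expected, so check the result and adjust the destination if needed.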
Readings
- Read: In the PCfB textbook: "Before You Begin", pp. 1-6; Chapters 1-3, pp. 9-43; and Appendix 1, pp. 451-453 (for Windows and Linux users only). Work through the examples on your own computer (or Plasmid).
- Skim: Data Management (Wikipedia)
- Skim: Data System (Wikipedia)
- Explore: ODK
Watch
- Watch: Database programming tutorial: What are databases? Video
- Watch (one or two): ODK Videos
- Watch (one or two): Google Fusion Tables Example Videos
Quotes
It is estimated that 40% of the defects that make it into the testing phase of enterprise software have their root cause in errors in the original requirements documents.
From: Obamacare's Website Is Crashing Because Backend Was Doomed In The Requirements Stage (Forbes)
See also
- Google and Facebook Team Up to Modernize Old-School Databases
- WebScaleSQL: MySQL for Facebook-sized databases
- What database actually FACEBOOK uses?
- What database does Facebook use?
- NYT: Healthcare.gov Project Chaos Due Partly To Unorthodox Database Choice (Slashdot)
- DailyViz Fusion Tables Examples
- Google Fusion Table map visualization tutorial Video
- Google Fusion Tables Tutorial With Circle of Blue Video
- Topics in Data Management
- That information, These data?
- Is "Data" Singular or Plural?