Course Rationale
Experiences from a University Department (UW DEOHS)
Researchers in our department may discover they have difficulty managing data for three reasons:
- Popular tools such as Excel spreadsheets can become overburdened with large amounts of data.
- Collaboration can be difficult when you need to share, import, and export data in different formats.
- Without automating common tasks, work can become tedious, laborious, and inefficient.
These problems can be addressed successfully with appropriate tools and sufficient training in their use. We will illustrate this with a case study (based on a true story).
A. Case Study: Needing Help
A new researcher has come to the department to begin a new project. She will be working with health and environmental data and will perform statistical analysis on these data. In her education or previous work, she had used a statistics program such as Stata, SPSS, or SAS. Starting with the environmental data, she opens her statistics program and discovers that she cannot read the data files: because the format is non-standard, there is no import function or conversion tool available for them.
So she spends a few days or weeks writing an import script in her statistics program and is finally able to read the data. The script takes several days to import everything, and once the import finishes, the statistics program becomes too slow to use because the data have consumed all of the computer's memory. Adding enough extra memory to fill every available slot in the computer helps a little, but not enough: a simple query of the data still takes several hours to run.
B. Case Study: Getting Help
She decides to ask a friend for help. The friend, one of her collaborators, looks at the raw data and her import script and realizes that the code could have been much shorter, simpler, and easier to debug had she used standard language features like loops and functions to minimize copy-and-paste coding. The friend also sees that the script tries to read all of the data into memory at once before it begins to convert the format of each data record. Since each line of raw data can be converted independently of the others, the friend recommends re-coding the script to read the data one line at a time, convert each line with a function, and write the output to a separate file, all within a single program loop, as sketched below.
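As a rough illustration, a line-at-a-time conversion might look like the following minimal Python sketch. The file names, the delimiter, and the conversion rule are all hypothetical; the point is simply that only one record is held in memory at any moment.

```python
# Minimal sketch: convert a large raw data file one record at a time so that
# only a single line is in memory at once. The file names, the "|" delimiter,
# and the output format are hypothetical examples.

def convert_record(line):
    """Convert one raw record into a comma-separated output record."""
    fields = line.rstrip("\n").split("|")  # assumed '|'-delimited raw data
    return ",".join(fields) + "\n"

with open("raw_environmental_data.txt") as infile, \
        open("converted_data.csv", "w") as outfile:
    for line in infile:                    # reads one line at a time
        outfile.write(convert_record(line))
```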
While the data can be read very quickly this way, the system is still slow at running queries on the output file, and the researcher is unsure how to link this file with her health data records. So the two of them investigate using a database to store and link the data, as a "back end" to the statistics program. After getting help setting up the database, they find that queries take only seconds, or at most a few minutes, and that data sets can easily be linked to one another. Her collaborators can also access the database simultaneously over the network, so they can all use the data without interfering with each other, even though they each use different statistics programs.
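The sketch below gives a hedged sense of the database "back end" idea, using Python with SQLite standing in for a shared database server (a real multi-user setup would more likely use a networked server, though the SQL is much the same). All file, table, and column names here are hypothetical.

```python
# Rough sketch: load the converted records into a database table and link
# them to health records with a SQL join. SQLite stands in for a shared
# database server; every file, table, and column name is hypothetical.
import csv
import sqlite3

con = sqlite3.connect("project.db")
con.execute("""CREATE TABLE IF NOT EXISTS env_measurements
               (site_id TEXT, sample_date TEXT, pm25 REAL)""")
con.execute("""CREATE TABLE IF NOT EXISTS health_records
               (subject_id TEXT, site_id TEXT, visit_date TEXT)""")
# In practice, health_records would be loaded from her health data files.

# Load the converted environmental data produced by the loop above.
with open("converted_data.csv") as f:
    con.executemany("INSERT INTO env_measurements VALUES (?, ?, ?)",
                    (row[:3] for row in csv.reader(f)))
con.commit()

# An index on the join key keeps queries fast; the join links each health
# record to the environmental measurements for the same site and date.
con.execute("CREATE INDEX IF NOT EXISTS ix_env_site ON env_measurements (site_id)")
query = """SELECT h.subject_id, h.visit_date, e.pm25
             FROM health_records AS h
             JOIN env_measurements AS e
               ON e.site_id = h.site_id AND e.sample_date = h.visit_date"""
for row in con.execute(query):
    print(row)
con.close()
```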
C. Where to Get Help
While DEOHS offers computing support, we are not adequately staffed to manage everyone's data -- that is, to create programs and databases for each research project and support those customized tools throughout the life of the project. What we can do is provide consultation and pointers to help you manage your own databases and code.
Many researchers in our department have little experience with formal data management, and relatively few have any formal training in programming or programming "best practices". So, despite years of training in scientific research methods and statistics, very few are adequately prepared to manage the large amounts of data their research can produce, or to do so efficiently in a "multi-user" collaborative environment.
This mismatch between training and skill requirements will only become more serious as new techniques produce ever greater amounts of data. As we seek to analyze more and more of this new information, we should also bridge the skills gap with more training in data management.
The first place to start is to find the community of users of your favorite statistics application and ask how they manage data issues, handle larger projects, and collaborate with other users from within that particular software application. Although you may find experts willing to help you on campus, these communities can generally be found online with an internet search. Identify community forums where these topics are discussed, join in, and post your questions to the community.
Ask about programming in that application's environment, as most statistics programs have an embedded scripting language of some kind. Active community members may be very skilled at coding for that specific application and may offer expert advice, or might point you to helpful blogs and tutorials online.
Also consider using a different application if you find the user community frequently going elsewhere for satisfactory solutions to data management challenges like yours. Use the best tool for the job, even if that means learning a new tool, and avoid the messy, costly work-arounds that come from clinging too long to the "one true tool".
Second, invest some time in data management training. Learn the basics of relational database technology and the nearly universal SQL database language. While you can learn the basics from books and tutorials on the web, we encourage you to take a course or two on the subject. These are valuable skills in every technical field, and having them will enhance your career.
D. Data Management Courses
You may find training specific to a particular software application offered directly by the software vendor. However, a more general and widely applicable background in data management can be gained by taking courses offered locally here at the UW, through nearby community colleges, or online through "distance learning" programs.
Some possibilities offered on the UW campus for "non-majors" are CSE 414; INFO 240 and INFO 245; and IS 310 and IS 445; as well as a certificate program in Database Management. Seattle Central Community College offers four courses on this topic (ITC 220, 222, 224, and 226), and Bellevue Community College offers two data management programs with several course offerings. You may also consider online courses such as the UW Information School's INFX 502, INFX 543, and INFX 563 series.
Data management technology has become nearly universal through the common use of "relational database" structures and a standardized language called SQL. Most common statistics applications support a relational database back end, accessed through embedded SQL commands or through a more "native" interface. Taking an introductory course on database design and SQL will give you a data management foundation that can be applied directly to any of these popular statistics programs.
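As a small, hedged illustration of this pattern, the sketch below uses Python with pandas standing in for the statistics program and SQLite standing in for the database back end; the table and column names are the hypothetical ones from the case study above. The idea is the same in Stata, SPSS, SAS, or R: the statistics environment sends a SQL query and receives only the rows it needs.

```python
# Sketch of a statistics environment using a relational back end: pandas
# (standing in for a statistics package) sends a SQL query to the database
# and receives only the summarized rows it needs, rather than loading the
# whole data set into memory. Table and column names are hypothetical.
import sqlite3
import pandas as pd

con = sqlite3.connect("project.db")
monthly = pd.read_sql(
    """SELECT site_id,
              strftime('%Y-%m', sample_date) AS month,
              AVG(pm25) AS mean_pm25
         FROM env_measurements
        GROUP BY site_id, month""",
    con)
con.close()
print(monthly.head())
```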