Meet your dataset - YKul/Tutorials GitHub Wiki

The idea of data might conjure notions of measurements and recorded information. Throughout this program you will come to appreciate that data can tell 'stories', and like any communication, it only conveys meaning when the information is structured in a way that we can parse and decode.

Before we start looking at our data, consider the following dataset of 100 samples:
[image: the raw, unstructured list of 100 values]

It's hard for us to parse the unstructured data to see anything meaningful.

What if we had recorded this data as a tally?
[image: the same data recorded as a tally]
(Note: Each "0" represents one observation of the value to the left of the "|" on the corresponding row. Those already familiar with stem plots will recognize that this is not technically the correct interpretation, but it is close enough for this simple example.)

By categorically sorting the data points and encoding the observations in a way that is optimized for our computationally fast visual system, we can quickly estimate a middle point and get a sense of the distribution without much thought at all.
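The tally idea above is easy to reproduce programmatically. Here is a minimal sketch using a small hypothetical sample of observations (not the tutorial's actual dataset) to build the same kind of "0"-per-observation tally:

```python
from collections import Counter

# Hypothetical sample of observations (toy data, not the tutorial's dataset)
values = [3, 1, 2, 3, 2, 3, 4, 2, 3, 5, 1, 4, 3, 2, 4]

counts = Counter(values)
for value in sorted(counts):
    # One "0" per observation, mirroring the tally shown in the image above
    print(f"{value} | {'0' * counts[value]}")
```

Sorting the categories and printing one mark per observation is all it takes for the eye to pick out the mode and the rough shape of the distribution.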

While computers are better at processing numbers than visual data, the organization of data similarly helps them (and their users) parse feature-rich datasets, optimize algorithms, and make interpretations.

In this tutorial we will explore representative example files for this project in order to better understand how the data is organized.

Learning Objectives

Tasks

1. Access the data set

Note: The data sets that were sent around by e-mail may use an encoding that is not reliably compatible with the NGS shiny platform. They should work if you simply downloaded the dataset. However, if you open the files to view their contents, NGS may crash when you try to upload them again afterwards.

I have a more stable version of the dataset that was sent to me in a separate e-mail earlier; these files do not exhibit the same instability. I have a suspicion about what is going on, which I will touch on later, because it is somewhat relevant to the topic...
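Without presuming what the actual cause of the instability is, two common culprits for this kind of cross-tool quirk are an invisible byte-order mark (BOM) at the start of the file and Windows-style line endings. A quick, generic way to check the raw bytes of a file (a diagnostic sketch, not a diagnosis of the NGS issue):

```python
def sniff_text_file(data: bytes) -> dict:
    """Report BOM presence and line-ending style for raw file bytes."""
    return {
        "utf8_bom": data.startswith(b"\xef\xbb\xbf"),  # UTF-8 byte-order mark
        "line_endings": "CRLF" if b"\r\n" in data else "LF",
    }

# Example with in-memory bytes; in practice read them with open(path, "rb")
sample = b"\xef\xbb\xbfsample\tTemp\r\nsample_1\t97\r\n"
print(sniff_text_file(sample))
```

Opening a file in a GUI editor and re-saving it can silently add or strip these invisible bytes, which would be consistent with the "works until you open it" behaviour described above.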

  1. The data is hosted at https://github.com/YKul/Tutorials. Simply click the green <> Code button and select "Download ZIP"
  2. Extract the downloaded zip file. You may use whatever method you find easiest. There are ways to do this through a bash terminal, but they may vary from system to system.
  3. Be aware that mosom_ex_otu_input.csv is a very LONG file (4113 characters per line). Some graphical spreadsheet/text-editor software has trouble loading lines this long. If you want to open this file, save any work you have open in case you're forced to restart. This is one situation where we might choose to work at the command line, though the commands may differ slightly depending on your system. I will upload screenshots of the file to avoid the trouble.
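If you do want to peek at the wide file without risking a GUI editor, a short script that keeps only the first few fields of each row is one safe approach. This sketch simulates the file's contents in memory (the mock rows are invented; the real mosom_ex_otu_input.csv is far wider):

```python
import csv
import io

# Mock stand-in for mosom_ex_otu_input.csv; substitute
# open("mosom_ex_otu_input.csv", newline="") for the real file.
mock_csv = io.StringIO(
    "sample,1,2,3,4,5\n"
    "sample_1,0,12,3,0,7\n"
    "sample_2,5,0,0,15,2\n"
)

preview = []
for row in csv.reader(mock_csv):
    preview.append(row[:4])  # keep only the first 4 columns per row
for row in preview:
    print(row)
```

Because the script never holds more than one truncated row at a time, line length is no longer a problem.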

2. Open the files

  1. Before you open the files, notice the file extensions (you may have to enable a setting to show file extensions in your own file manager)
    [image: file listing showing the .txt, .csv, and .tsv extensions]
    .txt is probably familiar to us. This is just a plain text file. Unlike an MS Word .docx, these files have minimal formatting data and can only include text characters.
    .csv and .tsv are 'Comma-Separated Values' and 'Tab-Separated Values'. These are special text documents that represent data tables (like a spreadsheet). Each line in the file represents a row in the table, and each cell in the row is divided by a designated character (`,` for `.csv` and a tab white-space character for `.tsv`). This will become clear when you see the same file open as plain text and as a spreadsheet.

  2. Open mosom_ex_metadata.txt as both a plain-text and a spreadsheet. Because of its .txt extension your computer will probably default to opening it with a text editor. However, you can still launch Excel and import it (or right-click and go through open with... or your system's equivalent). You may need to tell Excel that this is a delimited file with Tab delimiters (Try This).

  3. It's quite easy to see how that file translates into a table. Now here's what mosom_ex_otu_input.csv looks like:
    [screenshot (2024-01-22): mosom_ex_otu_input.csv open in a plain-text editor]
    Note that the line count is on the far left. Even though my text editor wraps long lines for readability, they are still considered one continuous line. This is possible because newlines, tabs, and all other whitespace are actually encoded as special characters in the text line. This is how Excel is able to tell Tab delimiters apart from spaces in text.
    Here's a truncated screenshot of mosom_ex_otu_input.csv open as a spreadsheet:
    [image: truncated spreadsheet view of mosom_ex_otu_input.csv]
    Notice that the first row has incrementing numbers. They go up to 1042. What do you think these represent?

  4. Now open mosom_ex_taxonomy.tsv any way you want and have a look around.
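The delimiter distinction from step 1 can be demonstrated directly. This sketch parses the same tiny two-row table twice, once comma-separated and once tab-separated (toy data; the real files are mosom_ex_otu_input.csv and mosom_ex_taxonomy.tsv), and also shows that a tab really is a character (`\t`) embedded in the line, which is how Excel can tell tab delimiters apart from spaces:

```python
import csv
import io

# The same table, once as CSV and once as TSV (toy data)
csv_text = "sample,OTU_1,OTU_2\nsample_1,0,12\n"
tsv_text = "sample\tOTU_1\tOTU_2\nsample_1\t0\t12\n"

csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_rows == tsv_rows)            # same table, different delimiter
print(repr(tsv_text.splitlines()[0]))  # the tab is a literal '\t' character
```

Only the delimiter changes; the parsed table is identical either way.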

3. Connect some dots

How are these three files organized so the computer can connect the information with the samples? Use the three files to answer the following:

(Click to show answer) How many OTUs of Family: Rhodocyclaceae Genus: Thauera were in sample_2? 15
(Click to show answer) What temperature and ammonia (NH3-N) level are associated with this data? Temperature = 97, NH3-N = 0
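The cross-file lookups above can be sketched in code. The miniature stand-in tables below use an invented layout and toy values (check the column names against your actual downloads); the point is how the shared keys, sample names and OTU column numbers, let the three files be connected:

```python
import csv
import io

# Miniature stand-ins for the three files (toy data, invented layout)
otu_table = io.StringIO(
    "sample,1,2\n"
    "sample_1,0,3\n"
    "sample_2,15,7\n"
)
taxonomy = io.StringIO(
    "OTU\tFamily\tGenus\n"
    "1\tRhodocyclaceae\tThauera\n"
    "2\tComamonadaceae\tAcidovorax\n"
)
metadata = io.StringIO(
    "sample\tTemperature\tNH3-N\n"
    "sample_1\t97\t0\n"
    "sample_2\t97\t0\n"
)

# Which OTU column corresponds to Genus Thauera?
tax_rows = list(csv.DictReader(taxonomy, delimiter="\t"))
otu_id = next(r["OTU"] for r in tax_rows if r["Genus"] == "Thauera")

# Count of that OTU in sample_2
counts = {r["sample"]: r for r in csv.DictReader(otu_table)}
print(counts["sample_2"][otu_id])

# Metadata for the same sample
meta = {r["sample"]: r for r in csv.DictReader(metadata, delimiter="\t")}
print(meta["sample_2"]["Temperature"])
```

The taxonomy file maps names to OTU column numbers, the OTU table maps sample names to counts per column, and the metadata file maps the same sample names to measurements; any question above is answered by chaining those lookups.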

4. Describe what you have learned about the dataset

Now that you have some familiarity with how the NGS input files are organized, write a brief set of instructions on how to format and organize a dataset for input to NGS. This should be intended for someone who, like yourself, has never done this sort of analysis before and wishes to use the NGS platform to analyze their own data. You may find it helpful to use a spreadsheet to make a generalized version of each file, using variables like $SAMPLE_NAME_INT to indicate where labels and headings correspond between the different files. Be sure to explain how columns are separated, and whether fields can contain any characters or numbers only (for our purposes we will assume that fields which contain only numbers in our files must contain only numbers). Also specify whether those numbers have decimals, and to how many places.

This will be a starting point for instructions that we will build into a wiki, but for now it is more of an exercise to encourage engagement with the files, and to understand how they are organized and how the algorithms will navigate them to pull out and match data. You are free to do this independently, or to collaborate.
