Meet your dataset - YKul/Tutorials GitHub Wiki
The idea of data might conjure notions of measurements and recorded information. Throughout this program you will come to appreciate that data can tell 'stories', and like any communication, it only conveys meaning when the information is structured in a way that we can parse and decode.
Before we start looking at our data, consider the following dataset of 100 samples:
It's hard for us to parse the unstructured data to see anything meaningful.
What if we had recorded this data as a tally?
(Note: Each "0" represents an observation of the value to the left of "|" on the corresponding row. Those who are already familiar with stem-plots will understand that this is not technically the correct interpretation, but it is close enough for this simple example)
By categorically sorting the data points and encoding the observations in a way that is optimized for our computationally-fast visual system, we can quickly calculate an approximate middle-point and sense of distribution without much thought at all.
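The tally idea above can be sketched in a few lines of Python. This is only an illustration: the 100 samples here are randomly generated stand-ins (an assumption, not the dataset shown in this tutorial), and `tally` is a hypothetical helper name.

```python
import random
from collections import Counter


def tally(values, mark="0"):
    """Return one text row per distinct value: the value, a bar, then one mark per observation."""
    counts = Counter(values)
    width = max(len(str(v)) for v in counts)
    return [f"{str(v).rjust(width)} | {mark * counts[v]}" for v in sorted(counts)]


# Hypothetical data: 100 integer samples, NOT the tutorial's dataset.
rng = random.Random(0)
samples = [round(rng.gauss(10, 2)) for _ in range(100)]

# Printing the rows makes the middle and spread of the data visible at a glance.
for row in tally(samples):
    print(row)
```

The sorting and counting do the same work as recording the observations by hand: the raw list and the tally hold the same information, but only the tally is arranged for quick reading.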
While computers are better at processing numbers than visual data, the organization of data similarly helps them (and their users) parse feature-rich datasets, optimize algorithms, and make interpretations.
In this tutorial we will explore representative example files for this project in order to better understand how the data is organized.
Note: The datasets that were sent around by e-mail may use an encoding that is not reliably compatible with the NGS Shiny platform. They should work if you upload them straight after downloading. However, if you open the files to view their contents, they may crash NGS when you try to upload them again afterwards.
I have a more stable version of the dataset that was sent to me in a separate, earlier e-mail. These files do not exhibit the same instability. I have a suspicion about what's going on, which I will touch on later because it is somewhat relevant to the topic...
- The data is hosted at https://github.com/YKul/Tutorials. Simply click the green "<> Code" button and select "Download ZIP".
- Extract the downloaded zip file. You may use whatever method you find easiest. There are ways to do this through a bash terminal, but they may vary from system to system.
- Be aware that mosom_ex_otu_input.csv has very LONG lines (4113 characters per line). Some graphical spreadsheet and text editor software can have trouble loading files like this. If you want to open this file, save any work you have going in case you're forced to restart. This is one situation where we might choose to work on the command line, but the commands might differ slightly depending on your system. I will upload screenshots of the file to avoid the trouble.
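If you would rather not depend on system-specific terminal commands, the extraction step and a safe look at the long file can both be sketched with Python's standard library. The function names `extract_zip` and `line_stats` are hypothetical helpers, and the paths in the usage comment are assumptions about where your download landed.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Directory entries end with "/"; keep only real files.
    return [dest / name for name in names if not name.endswith("/")]


def line_stats(path, encoding="utf-8"):
    """Return (line_count, longest_line_length) by streaming the file one line at a time."""
    count = 0
    longest = 0
    with open(path, encoding=encoding, errors="replace", newline="") as handle:
        for line in handle:
            count += 1
            longest = max(longest, len(line.rstrip("\r\n")))
    return count, longest


# Example (paths are assumptions; adjust them to your own download location):
# for path in extract_zip("Tutorials-main.zip", "tutorial_data"):
#     print(path, line_stats(path))
```

Because `line_stats` only reads the file and never writes to it, it reports how many lines there are and how long the longest one is without loading the file into a graphical editor.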
- Before you open the files, notice the file extensions (you may have to enable a setting to show file extensions in your own file manager).
  - .txt is probably familiar to us. This is just a plain text file. Unlike an MS Word .docx, these files have minimal formatting data and can only include text characters.
  - .csv and .tsv are "Comma Separated Values" and "Tab Separated Values". These are special text documents that represent data tables (like a spreadsheet). Each line in the file represents a row in the table, and each cell in the row is divided by a designated character (a comma for .csv and a tab character for .tsv). This will become clear when you see the same file open as plain text and as a spreadsheet.
- Open mosom_ex_metadata.txt as both plain text and a spreadsheet. Because of its .txt extension, your computer will probably default to opening it with a text editor. However, you can still launch Excel and import it (or right-click and go through "Open with..." or your system's equivalent). You may need to tell Excel that this is a delimited file with Tab delimiters.
- It's quite easy to see how that file translates into a table. Now here's what mosom_ex_otu_input.csv looks like:
  Note that the line count is on the far left. Even though my text editor wraps long lines for readability, each is still considered one continuous line. This is possible because newlines, tabs, and all other whitespace are actually encoded as special characters in the text itself. This is how Excel is able to tell Tab delimiters apart from spaces in text.
  Here's a truncated screenshot of mosom_ex_otu_input.csv open as a spreadsheet:
  Notice that the first row has incrementing numbers. They go up to 1042. What do you think these represent?
- Now open mosom_ex_taxonomy.tsv any way you want and have a look around.
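The delimiter idea from the steps above can be sketched with Python's built-in csv module. The small table here is made up for illustration (its column names and values are assumptions, not taken from the project files), and `read_table` is a hypothetical helper.

```python
import csv
import io

# Hypothetical two-row table; the column names and values are made up for illustration.
tsv_text = "sample\tsite name\tvalue\nsample_1\tnorth pond\t3.5\n"

# repr() shows whitespace as escape sequences: \t for a tab, \n for a newline.
# The ordinary space inside "site name" stays a space, so it is not mistaken for a delimiter.
print(repr(tsv_text))


def read_table(text, delimiter):
    """Split delimited text into a list of rows, each row a list of cell strings."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# The same table parsed as tab-separated and as comma-separated text.
tsv_rows = read_table(tsv_text, "\t")
csv_rows = read_table(tsv_text.replace("\t", ","), ",")
print(tsv_rows)
print(tsv_rows == csv_rows)
```

Both formats describe the same grid of cells; only the character that marks the cell boundaries differs, which is why a spreadsheet program needs to be told which delimiter to expect.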
How are these three files organized so the computer can connect the information with the samples?
Use the three files to answer the following:
- How many OTUs of Family: Rhodocyclaceae Genus: Thauera were in sample_2? Answer: 15
- What temperature is associated with this data? Ammonia (NH3-N)? Answer: Temperature = 97, NH3-N = 0
Now that you have some familiarity with how the NGS input files are organized, write a brief set of instructions on how to format and organize a dataset for input to NGS. This should be intended for someone who, like yourself, has never done this sort of analysis before, and wishes to use the NGS platform to analyze their own data. You may find it helpful to use a spreadsheet to make a generalized version of each file using variables like $SAMPLE_NAME_INT to indicate where labels and headings correspond between the different files. Be sure to explain how columns are separated, and whether fields can contain characters, or numbers only (for our purposes we will assume that fields which only contain numbers in our file must only contain numbers). Also specify if those numbers contain decimals, and how many places.
This will be a starting point for instructions that we will build into a wiki, but for now it is more of an exercise to encourage engagement with the files, and to understand how they are organized and how the algorithms will navigate them to pull out and match data. You are free to do this independently, or to collaborate.
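When checking which fields hold numbers only and how many decimal places they use, a short script can serve as a second pair of eyes. The sketch below is one possible approach, not part of the NGS platform: `describe_columns` is a hypothetical helper, and it assumes the first row of the file is a header row and that numbers are written as plain digits with an optional decimal point.

```python
import csv
import re

# Matches plain integers and decimals such as 97, -3 or 96.25 (an assumed number format).
NUMBER = re.compile(r"^-?\d+(\.(\d+))?$")


def describe_columns(path, delimiter, encoding="utf-8"):
    """Summarise each column: numeric-only or not, and the most decimal places seen."""
    with open(path, encoding=encoding, newline="") as handle:
        rows = list(csv.reader(handle, delimiter=delimiter))
    header, body = rows[0], rows[1:]  # assumes the first row holds the column headings
    summary = {}
    for index, name in enumerate(header):
        cells = [row[index] for row in body if index < len(row)]
        matches = [NUMBER.match(cell.strip()) for cell in cells]
        numeric = bool(cells) and all(matches)
        decimals = max((len(m.group(2) or "") for m in matches if m), default=0)
        summary[name] = {
            "numeric_only": numeric,
            "max_decimals": decimals if numeric else None,
        }
    return summary


# Example (file name from this tutorial; run it from the folder you extracted):
# print(describe_columns("mosom_ex_metadata.txt", "\t"))
```

Treat the output as a prompt for your written instructions rather than a replacement for them: the script reports what one example file happens to contain, and you still need to decide what the format requires.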