Course 7‐2 - Forestreee/Data-Analytics GitHub Wiki
Google Data Analytics Professional
[Data Analysis with R Programming]
Programming using RStudio
WEEK2 - Using R can help you complete your analysis efficiently and effectively. In this part of the course, you’ll explore the fundamental concepts associated with R. You’ll learn about functions and variables for calculations and other programming. In addition, you'll discover R packages, which are collections of R functions, code and sample data that you’ll use in RStudio.
Learning Objectives
- Describe the contents and components of the tidyverse package for R
- Describe the concept of packages in R programming language
- Describe the use of operators to complete calculations in the R programming language
- Describe the fundamental concepts associated with programming in R including functions, variables, data types, pipes, and vectors
- Install and load the tidyverse package
- Use the browseVignettes("packagename") function to read through vignettes of a loaded package
- Locate resources for help using R
Understand basic Programming concepts
Programming using RStudio
Hi and welcome back. We've given you a big-picture overview of R and RStudio. Now we'll turn our focus to the actual programming and coding you'll do using RStudio. I went pretty far in my career not knowing programming before it became clear I needed to learn it. Getting to know R was such a valuable learning experience. It took some time, and I reached out to more-experienced R users with lots of questions. Eventually, it all came together for me. Being open to learning new skills is such an important part of your career. Now I'm able to help you learn some new skills too.
I'll start by sharing the fundamentals of programming using R in RStudio.
Earlier, we explained how R is like the engine of a car and RStudio is like the accelerator, steering wheel, and dashboard all in one. Getting to know fundamentals will help you keep your R car running smoothly.
These fundamentals are both similar to and different from the other analysis platforms you've come to know well: spreadsheets and SQL. Then we'll move on to coding in RStudio. We'll discuss the syntax for performing calculations and the standards and naming conventions for all code.
We'll also explore the R tool known as a pipe, which you'll use to make a sequence of code easier to work with and read.
Then we'll check out R packages. While these packages won't be delivered to your door, they are delivered by the R community. These packages contain reusable functions and more, and are usually built by users for users like yourself.
We'll get to know a collection of packages called the Tidyverse. You'll learn how to install the Tidyverse so you can start using it in RStudio. We'll also work with some of the more popular Tidyverse packages like ggplot2 for visualization. You'll be able to carry over what you've learned about RStudio to the next part of the program, where you'll start working with data.
As we explained earlier, for this program, we'll use the in-browser version of RStudio: RStudio Cloud. But RStudio is also available to be downloaded. So let's get going. See you soon.
Programming fundamentals
Hey there. Anytime you're learning a new skill from cooking to driving to dancing, you should always start with the fundamentals. Programming with R is no different.
To build this foundation, you'll get familiar with the basic concepts of R, including functions, comments, variables, data types, vectors, and pipes. Some of these terms might sound familiar. For example, we've come across functions in spreadsheets and SQL.
As a quick refresher, functions are a body of reusable code used to perform specific tasks in R. Functions begin with function names like print or paste, and are usually followed by one or more arguments in parentheses.
An argument is information that a function in R needs in order to run. Here's a simple function in action. Feel free to join in and try it yourself in RStudio using your cloud account. Check out the reading for more details on how to get started.
You can pause the video anytime you need to. We'll open RStudio Cloud to get started.
We'll start our function in the console with the function name print. This function name will return whatever we include in the values in parentheses. We'll type an open parenthesis followed by a quotation mark. Both the close parenthesis and end quote automatically pop up because RStudio recognizes this syntax. Now we just have to add the text string. We'll type Coding in R. Then we'll press enter. Success! The code returns the words "Coding in R."
If you want to find out more about the print function or any function, all you have to do is type a question mark, the function name, and a set of parentheses. This returns a page in the Help window, which helps you learn more about the functions you're working with.
Keep in mind that functions are case-sensitive, so typing Print with a Capital P brings back an error message.
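As a minimal sketch of the steps described above, the following calls print with a text string and then shows what happens when the function name is capitalized. The tryCatch wrapper and the variable name message_text are additions for illustration only, so the script keeps running after the error.

```r
# print() returns whatever is passed inside the parentheses
print("Coding in R")

# Function names are case-sensitive: Print() is not the same as print().
# tryCatch() captures the error message instead of stopping the script.
message_text <- tryCatch(
  Print("Coding in R"),
  error = function(e) conditionMessage(e)
)
print(message_text)
```

Typing ?print in the console opens the help page for the function.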
Functions are great, but it can be pretty time-consuming to type out lots of values. To save time, we can use variables to represent the values. This lets us call out the values any time we need to with just the variable. Earlier, we learned about variables in SQL. A variable is a representation of a value in R that can be stored for use later during programming. Variables can also be called objects.
As a data analyst, you'll find variables are very useful when programming. For example, if you want to filter a dataset, just assign a variable to the function you used to filter the data. That way, all you have to do is use that variable to filter the data later.
When naming a variable in R, you can use a short phrase. A variable name should start with a letter and can also contain numbers and underscores. So the variable 5penguin wouldn't work well because it starts with a number. Also just like functions, variable names are case-sensitive. Using all lower case letters is good practice whenever possible.
Now, before we get to coding a variable, let's add a comment. Comments are helpful when you want to describe or explain what's going on in your code. Use them as much as possible so that you and everyone can understand the reasoning behind it. Comments should be used to make an R script more readable. A comment shouldn't be treated as code, so we'll put a # in front of it. Then we'll add our comment. Here's an example of a variable.
Now let's go ahead with our example. It makes sense to use a variable name to connect to what the variable is representing. So we'll type the variable name first_variable. Then after the variable name, we'll type a < sign, followed by a -. This is the assignment operator. It assigns the value to the variable. It looks like an arrow, which makes sense, since it's pointing from the value to the variable. There are other assignment operators that work too, but it's always good to stick with just one type in your code.
Next, we'll add the value that our variable will represent. We'll use the text, "This is my variable." If we type the variable and hit Run, it will return the value that the variable represents. This is a very basic way of using a variable. You'll learn more ways of using variables in your code soon.
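The comment, variable name, assignment operator, and value described above could be put together like this. It is a small sketch that follows the names used in the video.

```r
# Here's an example of a variable
first_variable <- "This is my variable"

# Typing the variable name returns the value it represents
first_variable
```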
For now, let's assign a variable to a different data type, numeric. We'll name this second_variable, and type our assignment operator. We'll give it the numeric value 12.5. The Environment pane in the upper-right part of our workspace now shows both of our variables and their values.
There are other data types in R like logical, date, and date time. R has a few options for dealing with these data types. We'll explore them later. With functions, comments, variables, and data types, you've got a good foundation for working with R. We'll revisit these throughout this program, and show you how they're used in different ways during analysis.
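A quick way to see these data types is the class function, which reports the type of a value. In this sketch, second_variable comes from the video, while is_complete and start_date are hypothetical names and values added for illustration.

```r
# Numeric value, as in the video
second_variable <- 12.5

# Hypothetical examples of two other data types
is_complete <- TRUE                      # logical
start_date <- as.Date("2021-06-15")      # date

# class() reports the data type of each variable
class(second_variable)
class(is_complete)
class(start_date)
```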
Let's finish up with two more fundamental concepts, vectors and pipes. Simply put, a vector is a group of data elements of the same type stored in a sequence in R. You can make a vector using the combine function. In R this function is just the letter c followed by the values you want in your vector inside parentheses.
All right, let's create a vector. Imagine this vector is for a set of measurement data that we need to analyze. We'll start our code with the variable vec_1 to assign to the vector. Then we'll type c and the open parenthesis. Then we'll type our list of numbers separated by commas. We'll then close our parentheses and press enter.
This time when we type our variable and press enter, it returns our vector. We can use this vector anywhere in our analysis with only its variable name vec_1. The values in the vector will automatically be applied to our analysis.
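Here is a small sketch of creating and reusing a vector. The name vec_1 comes from the video; the numbers themselves are made-up measurements chosen for illustration.

```r
# Combine several (hypothetical) measurements into one numeric vector
vec_1 <- c(13, 48.5, 71, 101.5, 2)

# Typing the variable name returns the vector
vec_1

# The vector can be used anywhere a set of values is needed
length(vec_1)
mean(vec_1)
```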
That brings us to the last of our fundamentals, pipes. A pipe is a tool in R for expressing a sequence of multiple operations. A pipe is represented by a % sign, followed by a > sign, and another % sign. It's used to apply the output of one function into another function. Pipes can make your code easier to read and understand.
For example, this pipe filters and sorts the data. Later, we'll learn how each part of the pipe works.
So there they are, the super six fundamentals: functions, comments, variables, data types, vectors, and pipes. They all work together as a foundation for using R. It's a lot to take in, so feel free to watch any of these videos again if you need a refresher. When you're ready, there's so much more to know about R and RStudio. So let's get to it.
Vectors and lists in R (Reading)
Dates and times in R (Reading)
Other common data structures (Reading)
Test your knowledge on programming concepts (Practice Quiz)
Explore coding in R
Operators and calculations
Hi again. We've shown you how your work as a data analyst can be done in different ways using different tools. That's true in this program, and it'll be just as true when you start your job. Operators and calculations are two concepts we've checked out before. Coming up, we'll go back to them and learn how to use operators in R for a range of tasks, including calculation. An operator is one of the key components of a calculation.
If we've got a bunch of sales figures that we want to include in a vector, we can use an assignment operator to assign them to a variable. Here's an example. Now, whenever we want to use these sales figures, we just type the variable we assigned. Next, let's check out arithmetic operators. These operators are used to complete math calculations, and they might seem familiar. Plus signs do addition on variables, and minus signs do subtraction. We use an asterisk to perform multiplication, and a slash performs division. There are other arithmetic operators too, but these are enough to get you started.
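As a rough sketch of the operators just described, the following assigns a vector of sales figures to a variable and applies the four arithmetic operators. The variable name sales and the figures are hypothetical; the on-screen values from the video are not reproduced here.

```r
# Assignment operator: store hypothetical sales figures in a vector
sales <- c(1200, 1550, 980, 1700)

# Arithmetic operators are applied to every element of the vector
sales + 100   # addition
sales - 100   # subtraction
sales * 2     # multiplication
sales / 4     # division
```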
Let's try a calculation for our sales data in RStudio. Feel free to follow along on your own. As we go through these steps, we'll complete our work in a script to make sure our calculations are saved. As an analyst developing code in R, you'll spend most of your time in scripts. When you save a script, you'll have a complete record of your work. You'll use the console mostly to show the results of your programming. Also, even though we're not doing a deep analysis here, it's still a good idea to save our work for easy access later if we need it.
You may notice the calculations in R work in a similar way to calculations in spreadsheets and SQL. It's helpful to make connections across the tools that you're working with.
Let's do one more calculation using our total sales from the first two quarters, represented by midyear_sales. We'll multiply it by 2 to get a general idea of total sales for the year. We'll use an asterisk as our arithmetic operator. You'll find there are other ways to perform these types of calculations, but these are great examples of how the operators work, both for calculations and other operations.
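The calculation above might look like the following in a script. The quarterly figures and the names quarter_1_sales, quarter_2_sales, and yearend_sales are assumptions made for this sketch. Note that a variable name cannot contain a hyphen, because R reads a hyphen as the subtraction operator.

```r
# Hypothetical sales totals for the first two quarters
quarter_1_sales <- 35657.98
quarter_2_sales <- 43810.55

# Add the two quarters to get sales through the middle of the year
midyear_sales <- quarter_1_sales + quarter_2_sales

# Multiply by 2 for a general idea of total sales for the year
yearend_sales <- midyear_sales * 2
yearend_sales
```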
For now, let's save our script so that we can use these same variables again if we need to do more work in our sales data.
Just like in other formats, we simply click "Save As," then type a file name, and our file extension is automatically applied. We'll close our script. When we're ready for more sales data analysis, we can open it again using the file menu.
There are other categories of operators that you'll learn about later, but knowing how assignment and arithmetic operators help you program calculations is a good place to start. We're definitely moving forward in R and RStudio. Let's keep it rolling by learning more about pipes, another great tool in R. See you soon.
Logical operators and conditional statements (Reading)
Guide: Keeping your code readable (Reading)
Hands-On Activity: R sandbox (Practice Quiz)
Queries and programming (Discussion Prompt)
Now that you have written queries using SQL and used code to program in R, you may have noticed some similarities between the two. Submit a response of two or more paragraphs (100-150 words total) discussing any similarities you may have come across. Then, visit the discussion forum to review what others have written, and respond to at least two posts with your own thoughts.
Basic Concepts of R (Ungraded Plugin)
Test your knowledge on coding in R (Practice Quiz)
Learning about R packages
The gift that keeps on giving
Hello there. I have to say, getting a package delivered to you is one of life's simple pleasures. It doesn't matter if it's a surprise package or something you ordered yourself; it's exciting to open your package to discover what's inside. No wonder those unboxing videos on YouTube are so popular.
R has a different kind of package that R users can open. These packages are units of reproducible code, and they make it easier to keep track of code. They're created by members of the R community to keep track of the R functions that they write and reuse. These community members might then make the packages available to other users. It's one of the great things about being part of this community.
Packages in R include reusable R functions and documentation about the functions, including how to use them. They also contain sample data sets and tests for checking your code to make sure it does what you want it to do.
By default, R includes a set of packages called "base R" that are available to use in RStudio when you start your first programming session. There are also recommended packages that are installed but not loaded. Before using functions from one of these packages, you'd have to load it with the library command, like library("foo"), for example.
Let's find out which packages we already have in RStudio. We'll work in our console instead of a script for now because we're practicing and don't need to save this code for later.
To check out our packages, we'll just run the command installed.packages(), and there's our list.
Let's focus on the "Package" and "Priority" columns.
The "Package" column gives the name of the package, like "cluster" or "graphics." The "Priority" column tells us what's needed to use functions from the package. If you come across the word "base" in a "Priority" column and the package is already installed and loaded, you can use all of the functions of that package as soon as you open RStudio. If you find the word "recommended," then the package is installed but not loaded.
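A minimal sketch of inspecting those two columns from the console is shown below. The variable names pkgs, base_pkgs, and recommended_pkgs are introduced here for illustration; the actual list depends on what is installed on your machine.

```r
# installed.packages() returns a matrix with one row per installed package
pkgs <- installed.packages()

# Look at just the "Package" and "Priority" columns
head(pkgs[, c("Package", "Priority")])

# Split the package names by priority (which() skips missing priorities)
base_pkgs <- pkgs[which(pkgs[, "Priority"] == "base"), "Package"]
recommended_pkgs <- pkgs[which(pkgs[, "Priority"] == "recommended"), "Package"]

base_pkgs
recommended_pkgs
```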
You'll also notice a list of packages in the bottom right part of our workspace. This list includes a brief description of each package.
To load "class" and other packages that are installed but not yet loaded, we'll need to use the library function, followed by the name of the package. Now the "class" package has a check next to it, so it's been successfully loaded for use. If you want to learn even more about your loaded packages, you can click on their names in the "Packages" tab. This opens the "Help" tab and shows topics related to the package you selected.
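Loading a package from the console can be sketched as follows, assuming the recommended package "class" is installed, as it is in a standard R setup. The search function lists everything currently loaded, which is the console equivalent of the check mark in the "Packages" tab.

```r
# Load the "class" package so its functions can be used
library(class)

# search() lists attached packages; a loaded package appears as "package:<name>"
"package:class" %in% search()
```

Running help(package = "class") in RStudio opens the same help topics as clicking the package name.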
You can also use the help function in your programming to call up the "Help" tab. All the pre-installed packages give you tons of useful functions. There are even more packages that'll further expand your programming abilities.
You can find thousands of R packages just by doing an online search. One of the most commonly used sources of packages is CRAN. CRAN stands for Comprehensive R Archive Network. It's an online archive with R packages' source code, manuals, and documentation. When you start working with R, you'll be able to do your own searches to find packages in CRAN or elsewhere. It's almost always easier to just search with your favorite search engine, though. Packages are a pretty big part of R; they give you most of what you need to complete your programming throughout the data analysis process. Who knows, you might even turn your own code into packages for others to use.
Up next, we'll keep unpacking R packages. See you soon.
Available R packages (Reading)
Welcome to the tidyverse
Welcome back. As we discussed earlier, packages are a big part of what makes R so great.
Packages offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking operability, and sample data sets. And for lots of data analysts, at the top of the list of useful packages is tidyverse.
Tidyverse is actually a collection of packages in R with a common design philosophy for data manipulation, exploration, and visualization.
Using tidyverse can help you work your way through pretty much the entire data analysis process. The packages in tidyverse work together naturally. I started learning about tidyverse when I was working on a survey project. It felt like I was stepping into a more advanced zone of R. I understood the basics, but now I was finding out how the tidyverse improves on the basics. That's when I got even more excited about working in R. I realized that the more I put into learning about the tidyverse, the more I get out of it. On top of that, the community support for tidyverse is strong too. It's one of the reasons why tidyverse is considered a key part of programming for most R users. The principles associated with tidyverse, which you'll learn both here and at your job, have been widely adopted by the R community. You'll find lots of tutorials and examples related to the tidyverse online that show you these principles and how they're applied to data analytics.
Okay, let's install the tidyverse. You can follow along on your own, using your RStudio Cloud account. Check out the reading for more details. Earlier, you learned how to find Base R packages using the installed.packages() function.
To install packages like the tidyverse that aren't in Base R, we'll use the install.packages() function. As we discussed earlier, this function calls the tidyverse and other packages from CRAN. Let's talk about why CRAN was created. Since packages not in Base R are mostly made by R users, people need a reliable way to check and validate submitted code. CRAN makes sure any R content open to the public meets the required quality standards. So, if it's sourced through CRAN, you can feel good that the package is authentic and valid.
Another major source of packages and other R content is GitHub.
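Now, we'll get back to installing the tidyverse. We'll first type install.packages. Then, between the parentheses, we'll type tidyverse in quotes. The quotes are needed here, because install.packages() takes the package name as a text string; without them, R looks for an object named tidyverse and returns an error. Quotes are optional in the library function, but using them is a good way to make sure that we are accurate. We'll press Enter and wait for RStudio to install tidyverse.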
When we click on our packages tab, we come across a lot of new packages on the list. That's tidyverse. You might have noticed that none of the packages are checked off. We need to load them first before we can use them. But that's a mighty long list. So, let's just load the package named tidyverse for now, using the library function. The return shows that not only was tidyverse loaded, but eight other packages were too. It also shows a list of conflicts. Conflicts happen when packages have functions with the same names as other functions. Basically, the last package loaded is the one whose functions will be used, so we'll stick with the tidyverse functions. But it's important to note that these messages only appear once. So, as you get more used to R, you'll be able to figure out if you want to use certain functions over others. The loaded packages are ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These packages are the core of the tidyverse because you'll use them in almost every analysis. All of them work together to make your data analysis smooth and efficient.
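In the console, the two steps above are simply install.packages("tidyverse") followed by library(tidyverse). The sketch below wraps them in checks so it can also run safely inside a script: the variable pkg and the requireNamespace and interactive guards are additions for illustration, not part of the video.

```r
pkg <- "tidyverse"  # the package name is passed as a quoted string

# Install from CRAN only if the package is missing. The interactive() check
# keeps a non-interactive script from starting a long download on its own.
if (!requireNamespace(pkg, quietly = TRUE) && interactive()) {
  install.packages(pkg)
}

# Load the package if it is available; the startup message lists the
# core packages that were attached and any function-name conflicts.
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)
}
```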
With these packages, tidyverse helps you do everything from importing and transforming data to exploring and visualizing it. We'll check out this core of packages soon, and we'll use them even more as we continue working in RStudio. If you're working on your own in R, you can check out some of the other packages too. The packages available in tidyverse change a lot, but you can always check for updates by running tidyverse_update() in your console. You can then update the packages in a couple of ways. If you use the update.packages() function, it'll update all of your packages. That might take a while. So, if you just want to update one package, you can use the install.packages() function again with the package name in quotes as your argument in parentheses. You should update packages regularly to make sure you've got the latest version in your code.
Conflict notifications are just one type of message that can show up in the console. You might find warnings and error messages as well. A quick search using the help tab will usually tell you what the message means and what, if anything, you'll need to do to address it.
Coming up, we'll keep moving through the tidyverse. You'll find out more about why tidyverse is such an integral part of R. See you.
Question:
Tidyverse is a collection of packages in R with a common design philosophy. The tidyverse packages are especially useful for data manipulation, exploration, and visualization.
Hands-On Activity: Installing and loading tidyverse (Practice Quiz)
Test your knowledge on R packages (Reading)
Explore the tidyverse
More on the tidyverse
Great, you're back. Have you ever taken a tour of a famous landmark or an unfamiliar city? It can be pretty exciting. You get to learn all about the features of the landmark or city. Eventually, you get to know them pretty well, and you can share what you learned with others. Well we're here to take a different kind of tour: a tour of the tidyverse. For this tour, we won't be traveling anywhere special, but we will help you learn about the exciting tidyverse features. And once you know them a little better, you can most definitely share what you learned with others.
For this tour we'll focus on the core packages of tidyverse we discussed earlier: ggplot2, tidyr, readr, dplyr, tibble, purrr, stringr and forcats. We also learned how to install and load them in RStudio. Once they're loaded, you won't need to do anything else with their actual packages. They'll do their thing as you program. So what is their thing?
Well, it depends, but there are four packages that are an essential part of the workflow for data analysts: ggplot2, dplyr, tidyr and readr. You'll most likely use these more often than the others.
The ggplot2 package is used for data visualization, specifically plots.
With ggplot2, you can create a variety of data viz by applying different visual properties to the data variables.
Here's an example of ggplot2 in action. You'll have your own chance to use ggplot2 later.
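The on-screen example from the video is not reproduced in these notes, so here is a stand-in sketch. It assumes ggplot2 is installed and uses the built-in ToothGrowth data set; the choice of columns and the variable name tooth_plot are illustrative only.

```r
library(ggplot2)

# Map data variables to visual properties: dose on the x-axis,
# tooth length on the y-axis, and supplement type as the point color
tooth_plot <- ggplot(data = ToothGrowth,
                     aes(x = dose, y = len, color = supp)) +
  geom_point()

# Printing the plot object draws it in the Plots tab in RStudio
tooth_plot
```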
The tidyr package is used for data cleaning to make tidy data. We covered tidy or clean data earlier, but as a quick reminder, it's data where every part of a data table or data frame is the right type in the right place.
It works with wide and long data to make sure this happens.
Next, we have readr, which is used for importing data. The most common function from readr is read_csv. This will import a CSV file into R. A CSV file contains data separated by commas in a table format.
To accurately read a dataset with readr, you combine the function with a column specification. The column specification describes how each column should be converted to the most appropriate data type. It's good to keep in mind this isn't usually necessary because readr will figure it out for you automatically. We'll come across readr functions as we continue to explore R.
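To make the column specification concrete, the sketch below writes a tiny CSV file to a temporary location and reads it back with read_csv. It assumes readr is installed; the file contents and the names csv_path and teeth are made up for illustration.

```r
library(readr)

# Create a small, hypothetical CSV file to import
csv_path <- tempfile(fileext = ".csv")
writeLines(c("supp,dose,len",
             "VC,0.5,4.2",
             "OJ,1.0,19.7"), csv_path)

# The column specification states the data type for each column.
# Leaving out col_types lets readr guess the types automatically.
teeth <- read_csv(
  csv_path,
  col_types = cols(
    supp = col_character(),
    dose = col_double(),
    len  = col_double()
  )
)
teeth
```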
Now on to dplyr, which offers a consistent set of functions that help you complete some common data manipulation tasks. For example, the select function picks variables based on their names, and the filter function finds cases where certain conditions are true.
And, yes, dplyr is another package we'll get to later. There's plenty to look forward to, so that's the fab four of the tidyverse. They'll all make your programming in R more straightforward and efficient.
The other four packages are definitely useful, too, but you might not use them as often.
The tibble package works with data frames.
The purrr package works with functions and vectors, helping make your code easier to write and more expressive.
The stringr package includes functions that make it easier to work with strings.
The forcats package provides tools that solve common problems with factors.
As a quick reminder, factors store categorical data in R where the data values are limited and usually based on a finite group like country or year. Using the tidyverse and its packages will help you fine-tune your analysis.
And besides tidyverse, you also learned the fundamentals of R from variables to vectors and more.
You explored the different operators in R and saw how they can help you complete calculations.
You had the chance to check out pipes and how they can make your programming more efficient.
And you unpacked packages to find out how they're a big part of what you can do in R.
We've covered a lot of ground in just a few videos, so this might be a good time for you to do a little review. You can rewatch videos and revisit any other resources that can help you get an even better grasp of all the terms, concepts and processes that are part of R. Looking ahead, you'll start working with data in R including a more thorough exploration of how tidyverse impacts your process. You'll see tibble, readr and other tidyverse packages in action. And you'll find out how to clean and organize your data in R. All this and more coming up. I'll see you soon.
Question:
The ggplot2 package is used for data visualization, specifically plots. You can use ggplot2 to create a lot of different visualizations by applying different properties to the data variables.
Question2:
The read_csv() function is a part of the readr package. It imports a .CSV file for use in R.
Working with pipes
Hi again. Earlier, we introduced something called pipes. A pipe is a tool in R that helps make your code more efficient and easier to read and understand. In this video, we'll explore pipes in more detail.
As a quick reminder, pipes express a sequence of multiple operations. In other words, a pipe takes the output of one statement and makes it the input of the next statement. So, instead of typing out functions contained inside other functions, you could use the pipe operator to do the same work. In programming, we describe functions written that way as nested. Nested describes code that performs a particular function and is contained within code that performs a broader function.
You can think of a pipe as a way to code the phrase "and then." Say you've got sales data, and you need to find the mean or average. You can create a pipe by calling up the data, then grouping the data, and then summarizing the grouped data using a mean function.
Let's check out an example. First, we'll open RStudio. Then we'll start a new script so we can save our work. We'll save it as "Tooth Growth Exploration." We'll use the ToothGrowth data set, which is already installed in R. This data set contains data about the effect of vitamin C on the growth of teeth in guinea pigs. It's a well-known data set that'll help us learn about how pipes work.
To load any data set already installed, we use the data function. We then add the name of the data set, "ToothGrowth." Now that the data is loaded, we can check it out with the View function. Notice that View begins with a capital V. It's a good reminder that functions and variables are case-sensitive in R.
In a script, we use the Run button to run our code. The return usually shows up in the console, but with View, a new tab appears in the script pane showing the contents of the data set.
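A sketch of loading and inspecting the data set follows. View opens a tab in RStudio, so it is left as a comment here; head and str are used instead so the same code also works outside RStudio.

```r
# Load the built-in data set into the workspace
data("ToothGrowth")

# In RStudio, View(ToothGrowth) opens the data in a new tab.
# head() and str() show a preview and the structure in the console.
head(ToothGrowth)
str(ToothGrowth)
```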
Now, let's say we need to filter and sort this data to organize it for analysis. Without pipes, we could do this either by saving intermediate data frames or by nesting commands (we'll look more into data frames soon). Let's start by filtering the data set. Note that we'll want to first install and load the package that provides the filter function we need. Installing a package may take a few moments. This function comes as part of the dplyr package.
We'll assign a name to the new data set and then apply the filter function. This filters the data so that we only see rows where the dose of vitamin C is exactly 0.5. This includes both types of vitamin C used in the study: orange juice, or OJ in our data set, and ascorbic acid, or VC.
**Next, we'll sort it with the arrange function.** We'll include the name of the filtered data set followed by the column name we want to sort by, in this case "len", which stands for "length of tooth". When we run this, the return appears in the console. The data is arranged in ascending order by length. The return only shows rows where the dose amount is 0.5, so the data has been filtered and sorted based on our code.
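The filter-then-sort steps described above can be sketched with an intermediate data set. This assumes dplyr is already installed; the names filtered_tg and sorted_tg are chosen here for illustration.

```r
library(dplyr)
data("ToothGrowth")

# Step 1: keep only the rows where the dose is exactly 0.5
filtered_tg <- filter(ToothGrowth, dose == 0.5)

# Step 2: sort the filtered rows by tooth length, in ascending order
sorted_tg <- arrange(filtered_tg, len)
sorted_tg
```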
**Let's try another way to get the same return.** We'll use a nested function, which is a function that is completely contained within another function. Here's the nested function for filtering and sorting this data set. Notice that the filter function from our previous code is the nested function. With nested functions, we read from the inside out. The code filters the data first, then it arranges or sorts it. Now, let's run this. We tweaked the code, but we get the same result.
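A nested version of the same filter and sort might look like this, again assuming dplyr is installed. The variable name nested_tg is added only so the result can be inspected.

```r
library(dplyr)
data("ToothGrowth")

# Read from the inside out: filter() runs first, then arrange() sorts its result
nested_tg <- arrange(filter(ToothGrowth, dose == 0.5), len)
nested_tg
```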
Now we'll use a pipe. As a quick reminder, the operator used to call out a pipe is a percentage sign followed by a greater than sign and another percentage. You can also use keyboard shortcuts to insert pipe operators (Control + Shift + M for PCs and Chromebooks, and Command + Shift + M for Macs). We'll start this pipe by assigning it to a variable. Then we'll type the name of the data set we're pulling data from, "ToothGrowth." We'll use our keyboard shortcut to add the pipe operator after that.
Now we can press Enter to go to the next line. RStudio automatically indents the next line, recognizing that it's part of the pipe. Next, we'll filter the data. We don't have to call out the data set inside parentheses like we did in the earlier example because we started our pipe with it. The pipe automatically applies the data set to each step.
All right, let's finish up our pipe on a new line with the arrange function and sort the data. Since this is our last line of code, we don't need a pipe operator. Finally, click run and presto, we get the same return as our other methods.
Our pipe is set up to call the data set, filter the data set, and then sort the data set. All three methods work, but you can see how pipes help make your programming more efficient and less cluttered. This means fewer chances for mistakes and better readability for anyone looking at your code. And because of the structure of a pipe, we can easily add to or change the code without having to start over.
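Written out, the pipe described above could look like the following sketch. It assumes dplyr is installed (loading dplyr also makes the %>% operator available); the variable name filtered_toothgrowth is an assumption made for this example.

```r
library(dplyr)
data("ToothGrowth")

# Call the data set, and then filter it, and then sort it.
# Every line except the last ends with the pipe operator.
filtered_toothgrowth <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

filtered_toothgrowth
```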
So, let's do that. Building on our example, let's say we also wanted to compute the average tooth length for each of the two supplements used in the study: orange juice (OJ) and ascorbic acid (VC). We'll replace the arrange function with the group_by function. This will group our results by the two supplements, so we type the supplement column in the parentheses. We're adding a pipe this time because we have another line of code to add. So, we group by and then we summarize, with the arguments inside the function's parentheses. The summarize call looks pretty complex, but its arguments basically tell R what to do with missing values and make sure the data is grouped the right way. Now we'll run our new pipe and get the average length of tooth when the dose is equal to 0.5 for each of our supplements. Nice!
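The extended pipe might be written as in the sketch below, assuming dplyr version 1.0 or later. The names avg_tooth_length and mean_len are chosen here for illustration. In summarize, na.rm = TRUE tells mean to skip missing values, and .groups = "drop" returns an ungrouped result.

```r
library(dplyr)
data("ToothGrowth")

# Filter to the 0.5 dose, and then group by supplement (the supp column),
# and then compute the average tooth length within each group
avg_tooth_length <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  group_by(supp) %>%
  summarize(mean_len = mean(len, na.rm = TRUE), .groups = "drop")

avg_tooth_length
```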
Now, there are a couple of things to remember when using pipes.
First, it's important to add the pipe operator at the end of each line of the pipe operation except the last one.
Another good rule of thumb is to check your code after you've programmed your pipe. Remember, RStudio automatically indents lines of code that are part of a pipe. If a line in your code isn't indented, it probably hasn't been added to the pipe. That could lead to an error statement.
Then you can revisit the pipe operation to check for parts of your code to fix. With the other methods we showed you, it'd be more of a challenge to find the messy parts. That's another reason to use pipes when you can. Pipes, and the functions that are part of the piping process, are building blocks for putting together analyses in R.
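To make the first rule concrete, here is a small sketch (assuming dplyr is installed; the variable name `result` is just for illustration). The broken version is left commented out so the example still runs.

```r
library(dplyr)

# Broken on purpose (commented out): there is no %>% after filter(),
# so the pipe ends on that line. arrange(len) then runs on its own,
# without a data set, and R returns an error.
#
# result <- ToothGrowth %>%
#   filter(dose == 0.5)
#   arrange(len)

# Fixed: every line except the last one ends with %>%.
result <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)
```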
Question:
A nested function is a function contained within code that performs a broader function. The nested function performs its own specific function within the code.
Question2:
The pipe operator is %>%. You can use it in R programming to call out a pipe to express a sequence of multiple operations.
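As an illustration of both answers, the sketch below writes the same filter-and-sort operation once as a nested function and once as a pipe. It assumes dplyr is installed; `nested_result` and `piped_result` are hypothetical names.

```r
library(dplyr)

# Nested form: filter() sits inside arrange(), so the result of the
# inner call becomes the input to the outer call. Read it inside out.
nested_result <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Pipe form: the same sequence of operations, read top to bottom.
piped_result <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

# Check that both approaches return the same data frame.
identical(nested_result, piped_result)
```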
In upcoming videos, you'll learn how you can use these building blocks to clean, transform and analyze your data. For now, feel free to take your time reviewing and maybe even practicing with the functions, operations, and other elements in R and RStudio that we've already covered.
R resources for more help (Reading)
Connor: Coding tips
Hi, I'm Connor and I'm a Marketing Analytics Manager at Google Cloud. I was running into barriers of not being able to do certain analysis because it was too time-consuming with my limited technical knowledge. So I started to teach myself things like SQL to help me access data through the current company's database that I had so that I could manipulate that data to better understand it. I can tell you at first it is an incredibly frustrating thing to move through because it takes a lot of time and effort to do something that seems very simple or something that would be very easy to do in spreadsheets, but may be very difficult to do at first as you're learning how to code. But also one of the most fulfilling things that I've ever done because once you're able to understand something, it opens up an entire new realm.
Learning coding was revolutionary for my job. I remember when I first started as an analyst, all of the data that I used was in spreadsheets and I had to run analysis and create formulas to manipulate the data, understand the data, and even analyze the data. Now, when we started to get more and more data, the formulas that I would have run would take hours, and I remember at one point I spent a few hours creating a formula and then executed it, and it took over ten hours to run. So I left my computer open and let it run throughout the night, woke up and it was still running. Fast forward a year: after I'd learned SQL and Python, I was able to run the same type of analysis in milliseconds. So it really comes down to understanding what it is that you're trying to do. Coding helps you manipulate and analyze data at a rate that would be very difficult to reach without coding knowledge.
An important aspect of any type of script, or when you are coding is to structure it for overall readability. More often than not, you're going to be working on a team. Now it's important that when you're writing a script, that you understand how it works, but also that somebody else who you work with can also come and understand what it is that you are trying to do within that script. Now, it's very important that it not only works and is efficient, but it is also not too verbose, meaning that it is not overly complex. So an important aspect of readability is if you are looking through your code and you realize that I've written the same thing multiple times, or I'm using the same logic or algorithm multiple times, that is a point in time where you can really consolidate your code and make it a lot more concise, which vastly helps with readability, vastly helps anybody who comes and is trying to read your code, and that includes you two weeks from now. Because I can promise you when you start coding, you will realize that what makes sense to you right now may not make sense to you three weeks from now.
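Here is one possible sketch of consolidating repeated logic, written in R with the built-in ToothGrowth data set. The helper name `mean_len_for_dose` is hypothetical and only meant to illustrate the idea.

```r
# Repetitive version: the same logic is written out once per dose.
mean_low  <- mean(ToothGrowth$len[ToothGrowth$dose == 0.5])
mean_mid  <- mean(ToothGrowth$len[ToothGrowth$dose == 1])
mean_high <- mean(ToothGrowth$len[ToothGrowth$dose == 2])

# Consolidated version: the shared logic lives in one helper function.
mean_len_for_dose <- function(dose_value) {
  mean(ToothGrowth$len[ToothGrowth$dose == dose_value])
}

# Apply the helper to every dose in one line.
dose_means <- sapply(c(0.5, 1, 2), mean_len_for_dose)
```

If the calculation ever needs to change, it now only has to be changed in one place.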
An important aspect for readability and overall understanding of your code is using comments. Comments are a way to write something out in a standardized language like English, in a way that somebody can understand but that the computer doesn't pick up as actual code. So explaining every line that you write, or explaining an entire section of your code, in a comment allows somebody to walk themselves through your code and read exactly what it is that you are trying to accomplish with the code that you've written. Now without comments, you are leaving it up to the person to really follow your code and understand it themselves, which may not be an easy task for somebody because they may have a different way of coding the same thing that you are doing. Documenting your work is an important aspect.
Documentation will explain in depth exactly what your code is doing, why it was built, what is the purpose for it, and any limitations.
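As a small illustration in R, where comments start with `#`, the sketch below pairs a short header stating purpose and limitations with line-level comments. The analysis itself is made up for this example and uses the built-in ToothGrowth data set.

```r
# Purpose: count observations with above-average tooth length,
#          split by supplement type.
# Limitation: the cutoff is the overall mean, not a per-dose mean.

# Overall mean tooth length across all observations
overall_mean <- mean(ToothGrowth$len)

# TRUE where tooth length exceeds the overall mean
above_average <- ToothGrowth$len > overall_mean

# Number of above-average observations for each supplement
table(ToothGrowth$supp[above_average])
```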
The last one is a rather difficult concept to understand as you are first diving into learning a coding language, and that is building it for scalability as well as making it dynamic. Now when I say building something for scalability, what I mean is, if you are building a specific script of code to solve a task that you're running now, what you want to be sure of and answer is, will this code, or could this code be used in the future for something else? Now if it is, it's important that you make your code available to be scalable. That means that it is efficiently run so that if the size of the data that it is running any manipulations on increases, it doesn't bog down your code too much and that it can handle large data loads as well as small. Another aspect to that is making your code dynamic. What that means is not hard-coding any values within your code that don't change when they need to. So these are just a few of the best practices and as you continue down your path as a data analyst, you'll pick up many, many more. There's always more to learn, there's always more to understand, but this should help you in beginning down your path to understanding coding.
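One way to picture the difference between hard-coded and dynamic code is the R sketch below. It assumes dplyr is installed, and the function name `summarize_dose` and its parameters are hypothetical.

```r
library(dplyr)

# Hard-coded: this only ever answers the question for a dose of 0.5.
# ToothGrowth %>% filter(dose == 0.5)

# Dynamic: the data set and the dose are passed in as arguments,
# so the same code can be reused when either one changes.
summarize_dose <- function(data, dose_value) {
  data %>%
    filter(dose == dose_value) %>%
    group_by(supp) %>%
    summarize(mean_len = mean(len), .groups = "drop")
}

low_dose  <- summarize_dose(ToothGrowth, 0.5)
high_dose <- summarize_dose(ToothGrowth, 2)
```

The same function works for any dose value and for any data set that has `dose`, `supp`, and `len` columns.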