Course 3‐1 231119.sun. 231121,22am(tue‐wed) - Forestreee/Data-Analytics GitHub Wiki
Google Data Analytics Professional
Prepare Data for Exploration
WEEK1 - Data types and structures
We all generate lots of data in our daily lives. In this part of the course, you’ll check out how we generate data and how analysts decide which data to collect for analysis. You’ll also learn about structured and unstructured data, data types, and data formats as you start thinking about how to prepare your data for exploration.
Learning Objectives
- Explain how data is generated as a part of our daily activities with reference to the types of data generated.
- Explain factors that should be considered when making decisions about data collection.
- Explain the difference between structured and unstructured data.
- Discuss the difference between data and data types.
- Explain the relationship between data types, fields, and values.
- Discuss wide and long data formats with references to organization and purpose.
Data exploration
Introduction to data exploration
You're working on a project. You've asked all the right questions, applied structured thinking, and you're completely in sync with your stakeholders. You're off to a great start. But there's another step in the process: preparing the data correctly. This is where understanding the different types of data and data structures comes in. Knowing this lets you figure out what type of data is right for the question you're answering. Plus, you'll gain practical skills about how to extract, use, organize, and protect your data.
" Now we'll learn more about the data that you'll need to tell the best story possible. But before we do that, I'd love to tell you my story. I use analytics to help healthcare companies develop digital marketing solutions that make their business and their brands stronger. My team and I find business and media opportunities based on the latest industry and data insights. I've been working in healthcare for about five years, and it's great. I really enjoy being able to use data to help spark change in such an important industry. As you'll discover in this course, data can be the main character in a very powerful story. I absolutely love using analysis to tell that story in a way that's compelling and informative. Here's a real life example of how I've used data to tell a story. In my job, we analyze Medicare enrollment data over time and make connections to how people research Medicare plans on Google. As people 65 and older become more informed decision makers for their health, I use the data to learn if there's an increase in Medicare enrollments and what part Google searches play if there is an increase in demand. Now it's very important that I make sure the data is relevant and valid. I also have to pay attention to questions around access and equity while maintaining the privacy of those conducting searches. The happy ending of my story is that the data in my findings is useful to medical professionals and their patients."
You'll learn to identify how data is generated and collected, and you'll explore different formats, types and structures of data. We'll make sure you know how to choose and use data that'll help you understand and respond to a business problem. And because not all data fits each need, you'll learn how to analyze data for bias and credibility. We'll also explore what clean data means. But wait, there's more. You'll also get up close and personal with databases. We'll cover what they are and how analysts use them. You'll even get to extract your own data from a database using a couple of tools that you're already familiar with: spreadsheets and SQL. The last few things we'll cover are the basics of data organization and the process of protecting your data. Data works best when it's organized. And if you're organizing your data, you'll want to protect it too. I'll show you how to do both and apply it to your own analysis.
Course Syllabus
Course Content Course 3 - Prepare Data for Exploration
- Understanding data types and structures: We all generate lots of data in our daily lives. In this part of the course, you will check out how we generate data and how analysts decide which data to collect for analysis. You’ll also learn about structured and unstructured data, data types, and data formats as you start thinking about how to prepare your data for exploration.
- Understanding bias, credibility, privacy, ethics, and access: When data analysts work with data, they always check that the data is unbiased and credible. In this part of the course, you will learn how to identify different types of bias in data and how to ensure credibility in your data. You will also explore open data and the relationship between and importance of data ethics and data privacy.
- Databases: Where data lives: When you are analyzing data, you will access much of the data from a database. It’s where data lives. In this part of the course, you will learn all about databases, including how to access them and extract, filter, and sort the data they contain. You will also check out metadata to discover the different types and how analysts use them.
- Organizing and protecting your data: Good organization skills are a big part of most types of work, and data analytics is no different. In this part of the course, you will learn the best practices for organizing data and keeping it secure. You will also learn how analysts use file naming conventions to help them keep their work organized.
- Engaging in the data community (optional): Having a strong online presence can be a big help for job seekers of all kinds. In this part of the course, you will explore how to manage your online presence. You will also discover the benefits of networking with other data analytics professionals.
- Completing the Course Challenge: At the end of this course, you will be able to apply what you have learned in the Course Challenge. The Course Challenge will ask you questions about the key concepts and then will give you an opportunity to put them into practice as you go through two scenarios.
Hallie: Fascinating data insights
"Healthcare is just a really fascinating place in the US. It's a really incredible industry to work in because it is so historically traditional, and healthcare companies, unlike other tech companies, just really have not used data to inform decisions. When I was in college, I had a professor who didn't want us to have textbooks because he just said the healthcare industry was changing so rapidly, and it wouldn't make sense to have a textbook, which is just a static piece of text when things were just really evolving. So I would say healthcare and data and the two together is a newer concept using big data, using machine learning, and artificial intelligence to help the healthcare industries. I started analyzing large sums of patient data. That was the first time I had really worked with such huge datasets, and I found it really fascinating that we can take all of these datasets and synthesize them and allow us to really deliver some cool insights and trends to our hospital systems. That was the first time I started thinking about data analysis, data analytics, as a possible career for me. That's really what brought me to this analytical lead role at Google where I could take that knowledge and that skill set of analyzing datasets and do that on a daily basis, so that really, every conversation I was having with the client was a data-informed conversation. I work within the healthcare vertical. We have companies who market on our platforms, like Google Search and YouTube. We help them understand the healthcare industry so that they can better market to the audience that they're trying to reach. Whether you're a healthcare insurer or you're a health care provider, maybe a hospital system, they all have different needs on how they want to reach their audience using Google's platforms. We help them optimize their marketing spend, but we also do a lot of research in the healthcare industry. 
Some user research, some understanding of how users are really just searching on Google to give them a sense of what's really happening in the industry and how they can market effectively. I would say that my technical skills with data analytics came with time. The most important skill I found, which has also come with time and grown with me, is just the creativity side of data analysis. I mean, you can really learn a lot of the SQL skills and R, and I know some of that is within the course. But really, the creativity side is something that just comes with experience. When you're looking at a dataset, you might look at it one way and analyze it one way and then have someone else look at it or look at it a week later, and then all of a sudden the trend that you're seeing is completely different. You have to take a lot of these pieces of information, these nuggets, I like to call them, and just piece together a really nice narrative using data. That skill set is something I learned when I was working in consulting, and I've taken that to Google and really been able to polish a lot of those skills and some of the more technical skills. Technical and the creative side are what I've grown to love. My name is Hallie. I'm an analytical lead at Google working specifically in the healthcare vertical."
Collecting data
Data collection in our world
Right now data is being generated all around the world and we're talking tons of data. Every minute of every day millions of texts and hundreds of millions of emails are sent. On top of that, millions of online searches are made and videos viewed and those numbers are only growing. That's a lot of data. Let's learn more about how it's made and used. In this video, we'll talk about the ways that data can be generated and how industries collect data themselves. Every piece of information is data. All that data is usually generated as a result of our activity in the world. These days, we spend a lot of time online. With social media and mobile devices, millions and millions of people are adding to the huge amount of data out there, each and every day. Think about it like this. Every digital photo online is one piece of data. Every photo itself holds even more data, from the number of pixels to the colors contained in each of those pixels. But that's not the only way data is made. We can also generate data by collecting information. This data generation and collection comes with a few more things to think about. It needs to be done with consideration to ethics so that we maintain people's rights and privacy. We'll learn more about that later on.
For now, let's check out a real world example. The United States Census Bureau uses forms to collect data about the country's population. This data is used for a number of reasons, like funding for schools, hospitals, and fire departments.
The Bureau also collects information about things like U.S. businesses, creating their own data in the process. The great thing about this is that others can then use the data for their own needs, including analysis. The annual business survey is used to figure out the needs of businesses and how to provide them with resources to help them succeed.
I actually generate data in the analytics I do for the health care industry. We run a lot of surveys to learn how patients feel about certain things related to their health care. For example, one survey asked how patients feel about telemedicine versus in-person doctor visits. The data we collected helps the companies we work with improve the care that their patients receive. Survey data is just one example.
There's all kinds of data being generated all the time, and there's lots of different ways to collect it.
Even something as simple as an interview can help someone collect data. Imagine you're in a job interview. To impress the hiring manager, you want to share information about yourself. The hiring manager collects that data and analyzes it to help them decide whether to hire you or not. But it goes both ways. You could also collect your own data about the company to help you decide if the company is a good fit for you.
Or you can use the data you collect to come up with thoughtful questions to ask the interviewer. Scientists also generate data. They use a lot of observations in their work. For example, they might collect data by studying animal behavior or looking at bacteria under a microscope. Earlier we talked about the forms that the U.S. Census Bureau uses to collect data. Forms, questionnaires and surveys are commonly used ways to collect and generate data.
One thing to note: data that's generated online doesn't always happen directly. Have you ever wondered why some online ads seem to make really accurate suggestions or how some websites remember your preferences? This is done using cookies, which are small files stored on computers that contain information about users. Cookies can help inform advertisers about your personal interests and habits based on your online surfing, without personally identifying you.
As a real world analyst, you'll have all kinds of data right at your fingertips and lots of it too. Knowing how it's been generated can help add context to the data, and knowing how to collect it can make the data analysis process more efficient.
Determining what data to collect
As a data analyst, you'll need to decide what kind of data to collect and use for every project. With a nearly endless amount of data out there, this can be quite a bit of a data dilemma, but there's good news. In this video, you'll learn which factors to consider when collecting data. Usually, you'll have a head start in figuring out the right data for the job, because the data you need will be given to you, or your business task or problem will narrow down your choices.
Let's start with a question like, what's causing increased rush hour traffic in your city?
- How the data will be collected: First, you need to know how the data will be collected. You might use observations of traffic patterns to count the number of cars on city streets during particular times.
- Choose the data sources: You notice that cars are getting backed up on a specific street. That brings us to data sources.
In our traffic example, your observations would be first-party data. This is data collected by an individual or group using their own resources. Collecting first-party data is typically the preferred method because you know exactly where it came from.
You might also have second-party data, which is data collected by a group directly from its audience and then sold. In our example, if you aren't able to collect your own data, you might buy it from an organization that's led traffic pattern studies in your city. This data didn't start with you, but it's still reliable because it came from a source that has experience with traffic analysis.
The same can't always be said about third-party data, or data collected from outside sources that did not collect it directly. This data might have come from a number of different sources before you investigated it. It might not be as reliable, but that doesn't mean it can't be useful. You'll just want to make sure you check it for accuracy, bias, and credibility.
- Decide what data to use: As a data analyst, it's your job to decide what data to use, and that means choosing the data that can help you find answers and solve problems and not getting distracted by other data. In our traffic example, financial data probably wouldn't be that helpful, but existing data about high volume traffic times would be.
- How much data to collect: Now let's talk about how much data to collect. In data analytics, a population refers to all possible data values in a certain data set. If you're analyzing data about car traffic in a city, your population would be all the cars in that area. But collecting data from the entire population can be pretty challenging.
That's why a sample can be useful. You might collect a data sample about one spot in the city and analyze the traffic there, or you might pull a random sample from all existing data in the population. How you choose your sample will depend on your project.
- Select the right data type: As you collect data, you'll also want to make sure you select the right data type. For traffic data, an appropriate data type could be the dates of traffic records stored in a date format. The dates could help you figure out which days of the week are likely to have a high volume of traffic in the future.
- Determine the time frame: Finally, you need to determine the time frame for data collection. In our example, if you needed an answer immediately, you'd have to use historical data, which is data that already exists. But let's say you needed to track traffic patterns over a long period of time. That might affect the other decisions you make during data collection.
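The sampling and data type points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `population` records, the `car_id` and `seen_on` field names, the sample size of 50, and the seed are all made up for the example and do not come from the course.

```python
import random
from collections import Counter
from datetime import date

# Hypothetical population: one record per car observation, tagged with a date.
population = [
    {"car_id": i, "seen_on": date(2023, 1, 1 + i % 28)}
    for i in range(1000)
]

# Pull a random sample instead of analyzing the whole population.
# A fixed seed keeps the draw repeatable for this illustration.
rng = random.Random(42)
sample = rng.sample(population, k=50)

# Because the dates are stored as real date values (not text),
# the day of the week can be derived directly from each record.
weekday_counts = Counter(record["seen_on"].strftime("%A") for record in sample)

print(len(sample), "of", len(population), "records sampled")
print(weekday_counts.most_common(3))
```

Storing the dates as `date` objects rather than strings is what makes the weekday breakdown a one-liner, which is the practical payoff of choosing the right data type up front.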
Selecting the right data
Following are some data-collection considerations to keep in mind for your analysis:
How the data will be collected: Decide if you will collect the data using your own resources or receive (and possibly purchase) it from another party. Data that you collect yourself is called first-party data.
Data sources: If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.
Solving your business problem: Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem question. For example, if you are analyzing trends over time, make sure you use time series data (in other words, data that includes dates).
How much data to collect: If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs.
Time frame: If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.
Use the flowchart below if data collection relies heavily on how much time you have:
Differentiate between data formats and structures
Discover data formats
I don't know about you, but when I'm choosing a movie to watch, I sometimes get stuck between a couple of choices. If I'm in the mood for excitement or suspense, I might go for a thriller, but if I need a good laugh, I'll choose a comedy. If I really can't decide between two movies, I might even use some of my data analysis skills to compare and contrast them. Come to think of it, there really needs to be more movies about data analysts. I'd watch that, but since we can't watch a movie about data, at least not yet, we'll do the next best thing: watch data about movies! We're going to take a look at this spreadsheet with movie data. We know we can compare different movies and movie genres. Turns out, you can do the same with data and data formats.
Let's use our movie data spreadsheet to understand how that works. We'll start with quantitative and qualitative data.
If we check out column A, we'll find titles of the movies. This is qualitative data because it can't be counted, measured, or easily expressed using numbers. Qualitative data is usually listed as a name, category, or description.
In our spreadsheet, the movie titles and cast members are qualitative data.
Next up is quantitative data, which can be measured or counted and then expressed as a number. This is data with a certain quantity, amount, or range.
In our spreadsheet here, the last two columns show the movies' budget and box office revenue. The data in these columns is listed in dollars, which can be counted, so we know that data is quantitative.
We can go even deeper into quantitative data and break it down into discrete or continuous data.
Let's check out discrete data first. This is data that's counted and has a limited number of values. Going back to our spreadsheet, we'll find each movie's budget and box office returns in columns M and N. These are both examples of discrete data that can be counted and have a limited number of values. For example, the amount of money a movie makes can only be represented with exactly two digits after the decimal to represent cents. There can't be anything between one and two cents.
Continuous data can be measured using a timer, and its value can be shown as a decimal with several places.
Let's imagine a movie about data analysts that I'm definitely going to star in someday. You could express that movie's run time as 110.0356 minutes. You could even add fractional data after the decimal point if you needed to.
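As a rough illustration of the discrete versus continuous distinction, the Python sketch below stores money as a whole number of cents and run time as a decimal value. The box office figure is invented for the example; only the 110.0356-minute run time comes from the text.

```python
# Discrete: money is counted in whole cents, so values come in fixed steps.
# The amount below is a made-up figure used only for illustration.
box_office_cents = 123_456_789          # $1,234,567.89 stored as cents
next_value = box_office_cents + 1       # the next possible amount; nothing in between

# Continuous: run time is measured, so it can take any value in a range.
run_time_minutes = 110.0356             # the run time mentioned in the text
run_time_seconds = run_time_minutes * 60

print(f"${box_office_cents / 100:,.2f}")
print(isinstance(box_office_cents, int), isinstance(run_time_minutes, float))
```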
There's also nominal and ordinal data.
Nominal data is a type of qualitative data that's categorized without a set order. In other words, this data doesn't have a sequence. Here's a quick example. You ask people if they've watched a given movie. Their responses would be in the form of nominal data. They could respond "Yes," "No," or "Not sure." These choices don't have a particular order.
Ordinal data, on the other hand, is a type of qualitative data with a set order or scale. If you asked a group of people to rank a movie from 1 to 5, some might rank it as a 2, others a 4, and so on. These rankings are in order of how much each person liked the movie.
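One way to see the practical difference between nominal and ordinal data is to look at which operations make sense on each. The following is a minimal Python sketch with made-up survey responses: counting works for both, but sorting and taking a median are only meaningful for the ordered rankings.

```python
from collections import Counter
from statistics import median

# Nominal: categories without an order. Counting is meaningful, ranking is not.
# These responses are hypothetical.
watched = ["Yes", "No", "Not sure", "Yes", "Yes"]
watched_counts = Counter(watched)

# Ordinal: rankings on a 1-to-5 scale have a set order,
# so sorting them and finding the middle value make sense.
rankings = [2, 4, 5, 3, 4]
sorted_rankings = sorted(rankings)
median_ranking = median(rankings)

print(watched_counts)
print(sorted_rankings, median_ranking)
```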
Now let's talk about internal data, which is data that lives within a company's own systems.
For example, if a movie studio had compiled all of the data in the spreadsheet using only their own collection methods, then it would be their internal data. The great thing about internal data is that it's usually more reliable and easier to collect.
But in this spreadsheet, it's more likely that the movie studio had to use data owned or shared by other studios and sources because it includes movies they didn't make. That means they'd be collecting external data. External data is, you guessed it, data that lives and is generated outside of an organization.
External data becomes particularly valuable when your analysis depends on as many sources as possible.
A great thing about this data is that it's structured. Structured data is data that's organized in a certain format, such as rows and columns.
Spreadsheets and relational databases are two examples of software that can store data in a structured way. You might remember our earlier exploration of structured thinking, which helps you add a framework to a problem so that you can solve it in an organized and logical manner.
You can think of structured data in the same way. Having a framework for the data makes the data easily searchable and more analysis-ready. As a data analyst, you'll work with a lot of structured data, which will usually be in the form of a table, spreadsheet or relational database.
But sometimes you'll come across unstructured data. This is data that is not organized in any easily identifiable manner. Audio and video files are examples of unstructured data because there's no clear way to identify or organize their content. Unstructured data might have internal structure, but the data doesn't fit neatly in rows and columns like structured data.
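To make the contrast concrete, here is a small Python sketch. The movie rows, field names, and byte values are all hypothetical. The structured rows share the same named fields, so they can be filtered directly, while the raw bytes standing in for an audio clip have no fields to search on.

```python
# Structured: every row has the same named fields (columns),
# so the data can be searched and filtered directly.
movies = [
    {"title": "Movie A", "genre": "Thriller", "budget_usd": 50_000_000},
    {"title": "Movie B", "genre": "Comedy", "budget_usd": 20_000_000},
    {"title": "Movie C", "genre": "Thriller", "budget_usd": 35_000_000},
]
thrillers = [row["title"] for row in movies if row["genre"] == "Thriller"]

# Unstructured: placeholder bytes standing in for an audio clip.
# There are no rows or columns here, so there is nothing to filter on
# until the content is processed into a structured form.
audio_clip = b"\x00\x01\x02\x03"

print(thrillers)
print(len(audio_clip), "bytes with no fields to query")
```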
Data formats in practice
When you think about the word "format," a lot of things might come to mind. Think of an advertisement for your favorite store. You might find it in the form of a print ad, a billboard, or even a commercial. The information is presented in the format that works best for you to take it in. The format of a dataset is a lot like that, and choosing the right format will help you manage and use your data in the best way possible.
Data format examples: As with most things, it is easier for definitions to click when we can pair them with real life examples. Review each definition first and then use the examples to lock in your understanding of each data format.
Primary vs. Secondary
Internal vs. External
Continuous vs Discrete
Qualitative vs. Quantitative
Nominal vs. Ordinal
Structured vs. Unstructured
Self-Reflection: Unstructured data
Data structuring with Quick, Draw!
In this self-reflection, you will explore the nature of unstructured data through a crowd-sourced dataset.
Quick, Draw! is a neural network dataset that has millions of pictures drawn by people separated into categories like plants, animals, or vehicles. On the Quick, Draw! website, you can view a large dataset of hundreds of thousands of real doodles made by people on the internet. You can also draw your own. Through this process, you can train a neural network to recognize objects and learn more about the importance of structured data.
- Visit the Quick, Draw! website.
-> GitHub page: Get the data
Reflection
Understanding structured data
Most of the data being generated right now is actually unstructured. Audio files, video files, emails, photos, and social media are all examples of unstructured data. These can be harder to analyze in their unstructured format.
But here's the good news, you'll be working with structured data most of the time. For example, if you need to analyze data about the unstructured data in emails, photos, and social media sites, it'll most likely be structured for analysis before you even get to it. Because of that, I want to explore structured data a bit more.
As a quick refresher, structured data is data organized in a format like rows and columns. But there's definitely more to it than that.
Structured data works nicely within a data model, which is a model that is used for organizing data elements and how they relate to one another.
What are data elements? They're pieces of information, such as people's names, account numbers, and addresses. Data models help to keep data consistent and provide a map of how data is organized. This makes it easier for analysts and other stakeholders to make sense of their data and use it for business purposes. In addition to working well within data models, structured data is also useful for databases. This makes it easy for analysts to enter, query, and analyze the data whenever they need to. This also helps make data visualization pretty easy because structured data can be applied directly to charts, graphs, heat maps, dashboards and most other visual representations of data.
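A data model can be sketched in code as a set of record types plus the links between them. The Python example below is only one possible illustration: the `Customer` and `Account` classes, their fields, and the sample values are hypothetical, chosen to mirror the names, account numbers, and addresses mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str
    address: str


@dataclass(frozen=True)
class Account:
    account_number: str
    customer_id: int  # relates each account back to one customer


# Hypothetical data elements organized according to the model above.
customers = [
    Customer(1, "Ana Lopez", "12 Elm St"),
    Customer(2, "Sam Lee", "9 Oak Ave"),
]
accounts = [
    Account("ACC-001", 1),
    Account("ACC-002", 1),
    Account("ACC-003", 2),
]


def accounts_for(customer_id: int) -> list[str]:
    """Return the account numbers that belong to the given customer."""
    return [a.account_number for a in accounts if a.customer_id == customer_id]


print(accounts_for(1))
```

Because every record follows the same shape and the relationship is explicit, a query like `accounts_for` stays simple and consistent.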
Alright, so now we know that spreadsheets and databases that store data sets are widely used sources of structured data. After you explore some other data structures, you'll check out more data types using a spreadsheet. The adventure continues!
The structure of data
Data is everywhere and it can be stored in lots of ways. Two general categories of data are:
- Structured data: Organized in a certain format, such as rows and columns.
- Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you're creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you're using unstructured data.
Here's a refresher on the characteristics of structured and unstructured data:
Structured data: As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
Unstructured data: Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.
The fairness issue: The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others. And as you're learning, an unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
Differentiating Data Types
Data type definitions
Test: data formats and structures
Explore data types, fields, and values
Know the type of data you're working with
By now you've learned a lot about data. From generated data, to collected data, to data formats, it's good to know as much as you can about the data you'll use for analysis. We'll talk about another way you can describe data: the data type.
A data type is a specific kind of data attribute that tells what kind of value the data is. In other words, a data type tells you what kind of data you're working with. Data types can be different depending on the query language you're using. For example, SQL allows for different data types depending on which database you're using. For now though, let's focus on the data types that you'll use in spreadsheets.
To help us out, we'll use a spreadsheet that's already filled with data. We'll call it "Worldwide Interests in Sweets through Google Searches."
Now a data type in a spreadsheet can be one of three things: a number, a text or string, or a Boolean. You might find spreadsheet programs that classify them a bit differently or include other types, but these value types cover just about any data you'll find in spreadsheets.
We'll look at all of these in just a bit. Looking at columns B, D, and F, we find number data types. Each number represents the search interest for the terms "cupcakes," "ice cream," and "candy" for a specific week. The closer a number is to 100, the more popular that search term was during that week. One hundred represents peak popularity. Keep in mind that in this case, 100 is a relative value, not the actual number of searches. It represents the maximum number of searches during a certain time. Think of it like a percentage on a test. All other searches are then also valued out of 100. You might notice this in other data sets as well. Gold star for 100!
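The idea of a relative value out of 100 can be illustrated with a short Python sketch. The raw weekly counts below are invented, and scaling by the peak week is a simplification for illustration; the text does not spell out the exact calculation behind the search interest numbers.

```python
# Hypothetical raw search counts per week (not real data).
weekly_searches = {"week 1": 1200, "week 2": 3000, "week 3": 750}

# Scale every week relative to the busiest week, so the peak becomes 100.
peak = max(weekly_searches.values())
search_interest = {
    week: round(count / peak * 100)
    for week, count in weekly_searches.items()
}

print(search_interest)
```

The scaled numbers say how each week compares with the peak, not how many searches actually happened.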
If you needed to, you could change the numbers into percents or other formats, like currency. These are all examples of number data types.
In column H, the data shows the most popular treat for each week, based on the search data. So as we'll find in cell H4 for the week beginning July 28th, 2019, the most popular treat was ice cream.
This is an example of a text data type, or a string data type, which is a sequence of characters and punctuation that contains textual information. In this example, that information would be the treats and people's names. These can also include numbers, like phone numbers or numbers in street addresses. But these numbers wouldn't be used for calculations. In this case they're treated like text, not numbers.
In columns C, E, and G, it seems like we've got some text. But the text here isn't a text or string data type. Instead, it's a Boolean data type.
A Boolean data type is a data type with only two possible values: true or false. Columns C, E, and G show Boolean data for whether the search interest for each week is at least 50 out of 100.
Here's how it works. To get this data, we've created a formula that calculates whether the search interest data in columns B, D, and F is 50 or greater.
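The same check can be sketched outside a spreadsheet. In the Python example below, the list of search interest values is hypothetical, and the comparison `value >= 50` stands in for the spreadsheet formula, whose exact form is not shown in the text (something like `=B4>=50` would be an assumption).

```python
# Hypothetical search interest values for four weeks.
search_interest = [14, 52, 50, 49]

# Each comparison produces a Boolean: True if the value is 50 or greater.
at_least_50 = [value >= 50 for value in search_interest]

print(at_least_50)
```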
In cell B4, the search interest is 14. In cell C4, we find the word false because, for this week of data, the search interest is less than 50. For each cell in columns C, E, and G, the only two possible values are true or false. We could change the formula so other words appear in these cells instead, but it's still Boolean data. You'll get a chance to read more about the Boolean data type soon. Let's talk about a common issue that people encounter in spreadsheets: mistaking data types for cell values.
For example, in cell B57, we can create a formula to calculate data in other cells. This will give us the average of the search interests in cupcakes across all weeks in the dataset, which is about 15. The formula works because we calculated using a number data type.
But if we tried it with a text or string data type, like the data in column C, we'd get an error. Error values usually happen if a mistake is made in entering the values in the cells. The more you know your data types and which ones to use, the fewer errors you'll run into.
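The same idea can be shown outside a spreadsheet. The sketch below is an analogy in Python, not a reproduction of spreadsheet behavior: the values and the helper `average` are hypothetical, and the exact error a spreadsheet reports will differ.

```python
def average(values):
    """Arithmetic mean; only meaningful for number data."""
    return sum(values) / len(values)


search_interest = [14, 16, 15]              # hypothetical number values
flags_as_text = ["FALSE", "FALSE", "TRUE"]  # hypothetical text values

print(average(search_interest))  # 15.0

try:
    average(flags_as_text)
    text_failed = False
except TypeError as error:
    # Python refuses to add text to a number, much like a spreadsheet error value.
    text_failed = True
    print("Cannot average text:", error)
```

The calculation works on the number data and fails on the text data, which is the distinction between a data type and the value you happen to see in a cell.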
Understanding Boolean logic
Boolean logic example
Imagine you are shopping for shoes, and are considering certain preferences:
- You will buy the shoes only if they are pink and grey
- You will buy the shoes if they are entirely pink or entirely grey, or if they are pink and grey
- You will buy the shoes if they are grey, but not if they have any pink
Below are Venn diagrams that illustrate these preferences. AND is the center of the Venn diagram, where two conditions overlap. OR includes either condition. NOT includes only the part of the Venn diagram that doesn't contain the exception.
The AND operator
Your condition is "If the color of the shoe has any combination of grey and pink, you will buy them." The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say "IF (Color="Grey") AND (Color="Pink") then buy them." The AND operator lets you stack multiple conditions.
Below is a simple truth table that outlines the Boolean logic at work in this statement. In the Color is Grey column, there are two pairs of shoes that meet the color condition. And in the Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND Pink column, there is only one pair of shoes that meets both conditions. So, according to the Boolean logic of the statement, there is only one pair marked true. In other words, there is one pair of shoes that you can buy.
The OR operator
The OR operator lets you move forward if either one of your two conditions is met. Your condition is "If the shoes are grey or pink, you will buy them." The Boolean statement would be "IF (Color="Grey") OR (Color="Pink") then buy them." Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.
The NOT operator
Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is "You will buy any grey shoe except for those with any traces of pink in them." Your Boolean statement would be "IF (Color="Grey") AND (Color=NOT "Pink") then buy them." Now, all of the grey shoes that aren't pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.
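The three operators can be tried out with a small sketch. The shoes below are hypothetical and are not the rows of the truth tables in this reading; each pair simply records whether it contains grey and whether it contains pink.

```python
# Hypothetical pairs of shoes: does each pair contain grey and/or pink?
shoes = [
    {"name": "A", "grey": True, "pink": False},
    {"name": "B", "grey": True, "pink": True},
    {"name": "C", "grey": False, "pink": True},
    {"name": "D", "grey": False, "pink": False},
]

# AND: both conditions must be true.
grey_and_pink = [s["name"] for s in shoes if s["grey"] and s["pink"]]
# OR: at least one condition must be true.
grey_or_pink = [s["name"] for s in shoes if s["grey"] or s["pink"]]
# NOT: grey shoes, minus any that contain pink.
grey_not_pink = [s["name"] for s in shoes if s["grey"] and not s["pink"]]

print(grey_and_pink)  # ['B']
print(grey_or_pink)   # ['A', 'B', 'C']
print(grey_not_pink)  # ['A']
```

With this sample, AND is the narrowest filter, OR is the widest, and NOT removes the grey pair that also contains pink.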
The power of multiple conditions
For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: "IF ((Color = "Grey") OR (Color = "Pink")) AND (Waterproof="True")." Notice that you can use parentheses to group your conditions together.
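A short sketch shows why the parentheses matter. The shoe data is hypothetical; the point is that in Python, as in many query languages, `and` is evaluated before `or` unless parentheses say otherwise.

```python
# Hypothetical shoes with a color and a waterproof flag.
shoes = [
    {"name": "A", "color": "Grey", "waterproof": True},
    {"name": "B", "color": "Pink", "waterproof": False},
    {"name": "C", "color": "Blue", "waterproof": True},
    {"name": "D", "color": "Grey", "waterproof": False},
]

# (Grey OR Pink) AND Waterproof, grouped as in the statement above.
grouped = [
    s["name"]
    for s in shoes
    if (s["color"] == "Grey" or s["color"] == "Pink") and s["waterproof"]
]

# Without parentheses this is read as: Grey OR (Pink AND Waterproof).
ungrouped = [
    s["name"]
    for s in shoes
    if s["color"] == "Grey" or s["color"] == "Pink" and s["waterproof"]
]

print(grouped)    # ['A']
print(ungrouped)  # ['A', 'D']
```

Shoe D is grey but not waterproof, so it only slips through when the grouping is left out.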
Whether you are doing a search for new shoes or applying this logic to your database queries, Boolean logic lets you create multiple conditions to filter your results. And now that you know a little more about how Boolean logic is used, you can start using it!
Additional Reading/Resources
- Learn about who pioneered Boolean logic in this historical article: Origins of Boolean Algebra in the Logic of Classes.
- Find more information about using AND, OR, and NOT from these tips for searching with Boolean operators.
Data table components
Here's a riddle for you. What do a music playlist, a calendar agenda, and an email inbox have in common? I'll give you a hint. It's not a weekly jam session. The answer is they're all arranged in tables. Go ahead and check out your email inbox or a favorite playlist, or look at your calendar agenda. There are tables in every one! A data table, or tabular data, has a very simple structure. It's arranged in rows and columns.
You can call the rows "records" and the columns "fields." They basically mean the same thing, but records and fields can be used for any kind of data table, while rows and columns are usually reserved for spreadsheets. When talking about structured databases, people in data analytics usually go with "records" and "fields." Sometimes a field can also refer to a single piece of data, like the value in a cell. In any case, you'll hear both versions of these terms used throughout this program and your job.
Let's go back to our playlist example. We'll use the new terms we just introduced. So each song is a record. Each record has the same fields as the other records in the same order. In other words, the playlist has the same information about each song.
Each song characteristic, like the title and the artist, is a field. Each separate field has the same data type, but different fields can have different types. Let me show you what I mean. For the song list, the song titles are a text or string type, while the song's length could be a number type if you're using it for calculations. Or it could be a date and time type. The column for favorites is Boolean since it has two possible values: favorite or not favorite.
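One way to picture records, fields, and data types is a small typed structure. This is a sketch under assumed names: the `Song` class, its field names, and the two songs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Song:
    """One record; every record has the same fields in the same order."""
    title: str           # text (string) field
    artist: str          # text (string) field
    length_seconds: int  # number field, usable in calculations
    favorite: bool       # Boolean field: favorite or not favorite


playlist = [
    Song("Song One", "Artist A", 215, True),
    Song("Song Two", "Artist B", 187, False),
]

total_length = sum(song.length_seconds for song in playlist)
favorites = [song.title for song in playlist if song.favorite]

print(total_length)  # 402
print(favorites)     # ['Song One']
```

Because the length is stored as a number, it can be summed; because favorite is Boolean, it can be used directly as a filter.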
The records in a spreadsheet might be about all sorts of things: clients, products, invoices, or anything else. Each record has several fields, which reveal more about the clients, products, or invoices. The value in every cell contains a specific piece of data, like the address of a client or the dollar amount of an invoice.
As a data analyst, lots of data will come your way, and records, fields, and values in data tables will help you navigate analysis. Understanding the structures of the tables you're working with is a part of that. And hopefully, while you're working hard on your analysis and those tables, you can have a little fun with a different data table: the one with your favorite playlist!
Hands-On Activity: Applying a function
You will find the spreadsheet calculated zero for the sum. This is because the program was asked to sum strings. When a given cell contains a string, the program considers the numerical value of the cell as zero.
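The behavior described above can be imitated in Python. This is only a sketch of the idea, not how any spreadsheet program is implemented: the function `spreadsheet_like_sum` and the expense values are hypothetical.

```python
def spreadsheet_like_sum(cells):
    """Add up numeric cells; text cells contribute nothing, i.e. count as zero."""
    return sum(cell for cell in cells if isinstance(cell, (int, float)))


expenses_as_text = ["12.50", "8.00", "20.25"]           # values stored as strings
expenses_as_numbers = [float(cell) for cell in expenses_as_text]

print(spreadsheet_like_sum(expenses_as_text))     # 0
print(spreadsheet_like_sum(expenses_as_numbers))  # 40.75
```

Converting the strings to a number type is what makes the total meaningful.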
Template: Example Spreadsheet - Entertainment Expenses
Meet wide and long data
You probably use the words "wide" and "long" all the time. You might use "wide" to describe the size of something from side to side, like a wide river. But a river can also travel great distances, so you might call it "long" as well. But the words "wide" and "long" can be used to describe data, too. So I am here to help you understand wide data and long data. So far you've dealt with data arranged mostly in a wide format.
Wide data
With wide data, every data subject has a single row with multiple columns to hold the values of various attributes of the subject. Here's some wide data in a spreadsheet.
You might remember we discussed this data about the population of Latin and Caribbean countries earlier. For this data set, each row provides all of the population information about one country. Each column shows the population for a different year.
Wide data lets you easily identify and quickly compare different columns. In our example, the data is arranged alphabetically by country, so you can compare the annual populations of Antigua and Barbuda, Aruba, and the Bahamas by just checking out the values in each column.
The wide data format also makes it easy to find and compare the countries' populations at different periods of time.
For example, by sorting the data, we discover that Brazil had the highest population of all countries in 2010, and the British Virgin Islands had the lowest population of all countries in 2013.
Long data
Okay, now let's explore this data in a long format. Here the data is no longer organized into columns by year. All the years are now in one column with each country, like Argentina, appearing in multiple rows, one for each year of data. This is how long data usually looks.
Long data is data in which each row is a one-time point per subject, so each subject will have data in multiple rows. Our spreadsheet is formatted to show each year of population data.
Here we see Antigua and Barbuda first. Long data is a great format for storing and organizing data when there are multiple variables for each subject at each time point that we want to observe. With this long data format, we can store and analyze all of this data using fewer columns. Plus, if we added a new variable, like the average age of a population, we'd only need one more column. If we'd used a wide data format instead, we would have needed 10 more columns, one for each year. The long data format keeps everything nice and compact. If you're wondering which format you should use, the simple answer is, "it depends."
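To make the two layouts concrete, here is a minimal sketch that reshapes wide rows into long rows. The country names and population figures are placeholders, not the values from the spreadsheet, and `wide_to_long` is a hypothetical helper.

```python
# Wide: one row per country, one column per year (placeholder values).
wide = [
    {"country": "Country A", "2010": 100, "2011": 110},
    {"country": "Country B", "2010": 200, "2011": 205},
]


def wide_to_long(rows, id_field, value_name):
    """Turn each year column into its own row: one row per subject per year."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], "year": int(key), value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide, "country", "population")
for row in long_rows:
    print(row)
```

Two wide rows with two year columns become four long rows, and adding another variable to the long layout would only need one more key per row.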
Sometimes you'll have to transform wide data into a long data format, or other times vice versa. You'll probably work with both formats in your job. And you'll definitely revisit both formats again later in this program. That reminds me: earlier we defined data as a collection of facts. As you've discovered over the last few videos, that collection of facts can take on lots of different formats, structures, types, and more.
Learning about all of the ways that data can be presented will be a big help to you throughout the data analysis process. The more you work with data in all its forms, the quicker you'll start to recognize which data to use, and when to use it. And in just a bit, you'll use all that data stored in your brain to help you take an assessment. After that, you'll learn how to identify and avoid bias in data and how to embrace credibility, integrity and ethics. The data adventure moves forward. I'm so glad you're moving with it!
Transforming data
What is data transformation?
In this reading, you will explore how data is transformed and the differences between wide and long data. Data transformation is the process of changing the data’s format, structure, or values. As a data analyst, there is a good chance you will need to transform data at some point to make it easier for you to analyze it.
Data transformation usually involves:
- Adding, copying, or replicating data
- Deleting fields or records
- Standardizing the names of variables
- Renaming, moving, or combining columns in a database
- Joining one set of data with another
- Saving a file in a different format. For example, saving a spreadsheet as a comma-separated values (CSV) file.
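Two of the operations listed above, standardizing variable names and saving in CSV format, can be sketched with Python's standard `csv` module. The records, the original column names, and the new names are all hypothetical, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical records with inconsistent field names.
records = [
    {"Cust Name": "Ada", "PHONE": "555-0100"},
    {"Cust Name": "Grace", "PHONE": "555-0101"},
]

# Standardize the names of the variables.
rename = {"Cust Name": "customer_name", "PHONE": "phone"}
standardized = [
    {rename[field]: value for field, value in record.items()}
    for record in records
]

# Save in comma-separated values (CSV) format, here to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["customer_name", "phone"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(standardized)
csv_text = buffer.getvalue()

print(csv_text)
```

Swapping the buffer for `open("customers.csv", "w", newline="")` would write the same content to a file; that file name is also just an example.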
Why transform data?
Goals for data transformation might be:
- Data organization: better organized data is easier to use
- Data compatibility: different applications or systems can then use the same data
- Data migration: data with matching formats can be moved from one system to another
- Data merging: data with the same organization can be merged together
- Data enhancement: data can be displayed with more detailed fields
- Data comparison: apples-to-apples comparisons of the data can then be made
Data transformation example: data merging
Mario is a plumber who owns a plumbing company. After years in the business, he buys another plumbing company. Mario wants to merge the customer information from his newly acquired company with his own, but the other company uses a different database. So, Mario needs to make the data compatible. To do this, he has to transform the format of the acquired company’s data. Then, he must remove duplicate rows for customers they had in common. When the data is compatible and together, Mario’s plumbing company will have a complete and merged customer database.
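Mario's two steps, transforming the format and then removing duplicates, might look like the sketch below. Everything specific here is an assumption for illustration: the customer records, the field names in each database, and the choice of phone number as the key that identifies a shared customer.

```python
# Hypothetical customer records from each company's database.
own_customers = [
    {"name": "Ana Lopez", "phone": "555-0100"},
    {"name": "Ben Kim", "phone": "555-0101"},
]
acquired_customers = [
    {"customer": "BEN KIM", "tel": "555-0101"},
    {"customer": "Cy Diaz", "tel": "555-0102"},
]


def to_own_format(record):
    """Transform an acquired record into the format of Mario's own database."""
    return {"name": record["customer"].title(), "phone": record["tel"]}


# Merge, keeping the first record seen for each phone number (assumed unique key).
merged = {}
for record in own_customers + [to_own_format(r) for r in acquired_customers]:
    merged.setdefault(record["phone"], record)

customers = list(merged.values())
print(customers)
```

Ben Kim appears in both sources but only once in the result, so the merged list holds three customers.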
Data transformation example: data organization (long to wide)
To make it easier to create charts, you may also need to transform long data to wide data. Consider the following example of transforming stock prices (collected as long data) to wide data.
Long data is data where each row contains a single data point for a particular item. In the long data example below, individual stock prices (data points) have been collected for Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.
Long data example: Stock prices
Wide data is data where each row contains multiple data points for the particular items identified in the columns.
Wide data example: Stock prices
With data transformed to wide data, you can create a chart comparing how each company's stock changed over the same period of time.
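The long-to-wide transformation can be sketched in plain Python. The dates and prices below are made-up placeholders, not real stock prices and not the values from the example tables; only the ticker symbols come from the reading.

```python
# Long data: one row per (date, symbol) pair. Prices are made-up placeholders.
long_prices = [
    {"date": "2020-01-01", "symbol": "AAPL", "price": 1.0},
    {"date": "2020-01-01", "symbol": "AMZN", "price": 2.0},
    {"date": "2020-01-01", "symbol": "GOOGL", "price": 3.0},
    {"date": "2020-01-02", "symbol": "AAPL", "price": 1.5},
    {"date": "2020-01-02", "symbol": "AMZN", "price": 2.5},
    {"date": "2020-01-02", "symbol": "GOOGL", "price": 3.5},
]

# Wide data: one row per date, with one column per symbol.
wide_by_date = {}
for row in long_prices:
    wide_row = wide_by_date.setdefault(row["date"], {"date": row["date"]})
    wide_row[row["symbol"]] = row["price"]

wide_rows = list(wide_by_date.values())
for row in wide_rows:
    print(row)
```

Six long rows collapse into two wide rows, with each company's price side by side for the same date, which is the shape a comparison chart usually expects.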
You might notice that all the data included in the long format is also in the wide format. But wide data is easier to read and understand. That is why data analysts typically transform long data to wide data more often than they transform wide data to long data. The following table summarizes when each format is preferred:
Hands-on Activity: Introduction to Kaggle
- Check out this brief introductory video to learn more about Kaggle.
- Go to kaggle.com
- If you want some inspiration, check out the profile of Kaggle’s Community Advocate, Jesse Mostipak!
Explore Kaggle notebooks
Step 4: Review suggested notebooks
If you’re looking for specific suggestions, check out the following notebooks:
- gganimate by Meg Risdal
- Getting staRted in R by Rachael Tatman
- Writing Hamilton Lyrics with TensorFlow/R by Ana Sofia Uzsoy
- Dive into dplyr (tutorial #1) by Jesse Mostipak
Spend some time checking out a couple of notebooks to get an idea of the work that Kagglers share online—and that you’ll be able to create by the time you’ve finished this course!
Edit a notebook
Working with datasets in notebooks
Confirmation and reflection
Test on data types, fields, and values
Weekly Challenge