3.1.3. Differentiate between data formats and structures - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Data formats in practice

When you think about the word "format," a lot of things might come to mind. Think of an advertisement for your favorite store. You might find it in the form of a print ad, a billboard, or even a commercial. The information is presented in the format that works best for you to take it in. The format of a dataset is a lot like that, and choosing the right format will help you manage and use your data in the best way possible.

Data format examples

As with most things, definitions click more easily when we can pair them with real-life examples. Review each definition first and then use the examples to lock in your understanding of each data format.

| Data format classification | Definition | Examples |
| --- | --- | --- |
| Primary data | Collected by a researcher from first-hand sources | Data from an interview you conducted; data from a survey returned from 20 participants; data from questionnaires you got back from a group of workers |
| Secondary data | Gathered by other people or from other research | Data you bought from a local data analytics firm's customer profiles; demographic data collected by a university; census data gathered by the federal government |

| Data format classification | Definition | Examples |
| --- | --- | --- |
| Internal data | Data that lives inside a company's own systems | Wages of employees across different business units tracked by HR; sales data by store location; product inventory levels across distribution centers |
| External data | Data that lives outside of a company or organization | National average wages for the various positions throughout your organization; credit reports for customers of an auto dealership |

| Data format classification | Definition | Examples |
| --- | --- | --- |
| Continuous data | Data that is measured and can have almost any numeric value | Height of kids in third grade classes (52.5 inches, 65.7 inches); runtime markers in a video; temperature |
| Discrete data | Data that is counted and has a limited number of values | Number of people who visit a hospital on a daily basis (10, 20, 200); room's maximum capacity allowed; tickets sold in the current month |
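
The difference between measured and counted values can be sketched in Python, where measurements are usually stored as floats and counts as integers. The sample values come from the examples above; the variable names and the `looks_discrete` helper are hypothetical and only a rough heuristic, since a measured value can happen to be a whole number.

```python
# Continuous data: measured, can take almost any numeric value.
heights_in = [52.5, 65.7]        # heights in inches

# Discrete data: counted, limited to whole-number values.
daily_visitors = [10, 20, 200]   # hospital visitors per day


def looks_discrete(values):
    """Rough heuristic (an assumption, not a formal test): all values are whole numbers."""
    return all(float(v).is_integer() for v in values)


print(looks_discrete(heights_in))      # False
print(looks_discrete(daily_visitors))  # True
```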

| Data format classification | Definition | Examples |
| --- | --- | --- |
| Qualitative | Subjective and explanatory measures of qualities and characteristics | Exercise activity most enjoyed; favorite brands of most loyal customers; fashion preferences of young adults |
| Quantitative | Specific and objective measures of numerical facts | Percentage of board certified doctors who are women; population of elephants in Africa; distance from Earth to Mars |

| Data format classification | Definition | Examples |
| --- | --- | --- |
| Nominal | A type of qualitative data that isn't categorized with a set order | First time customer, returning customer, regular customer; new job applicant, existing applicant, internal applicant; new listing, reduced price listing, foreclosure |
| Ordinal | A type of qualitative data with a set order or scale | Movie ratings (number of stars: 1 star, 2 stars, 3 stars); ranked-choice voting selections (1st, 2nd, 3rd); income level (low income, middle income, high income) |
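
As a minimal sketch of why the distinction matters in practice: nominal values can be counted but not ranked, while ordinal values can be sorted by their set order. The category labels come from the examples above; the sample responses and variable names are made up for illustration.

```python
from collections import Counter

# Nominal data: no inherent order, so counting is the meaningful operation.
customer_types = ["returning customer", "first time customer", "returning customer"]
type_counts = Counter(customer_types)

# Ordinal data: the categories have a set order, so sorting is meaningful.
INCOME_ORDER = ["low income", "middle income", "high income"]
responses = ["high income", "low income", "middle income", "low income"]
ranked = sorted(responses, key=INCOME_ORDER.index)

print(type_counts.most_common(1))  # [('returning customer', 2)]
print(ranked)  # ['low income', 'low income', 'middle income', 'high income']
```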

| Data format classification | Definition | Examples |
| --- | --- | --- |
| Structured data | Data organized in a certain format, like rows and columns | Expense reports; tax returns; store inventory |
| Unstructured data | Data that isn't organized in any easily identifiable manner | Social media posts; emails; videos |

Self-Reflection: Unstructured data

Question 1

Overview

Now that you have learned about unstructured data, you can pause for a moment and apply what you are learning. In this self-reflection, you will complete tasks with a neural network, consider your thoughts about data structuring, and respond to brief questions.

This self-reflection will help you develop insights into your own learning and prepare you to apply your knowledge of data structures to your interactions with unstructured data. As you complete tasks with a neural network website, you will explore concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the hard work, so make sure to get the most out of it: This reflection will help your knowledge stick!

Data structuring with Quick, Draw!

In this self-reflection, you will explore the nature of unstructured data through a crowd-sourced dataset.

Quick, Draw! is a neural network dataset that has millions of pictures drawn by people, separated into categories like plants, animals, or vehicles. On the Quick, Draw! website, you can view a large dataset of hundreds of thousands of real doodles made by people on the internet. You can also draw your own. Through this process, you can train a neural network to recognize objects and learn more about the importance of structured data.

  1. Visit the Quick, Draw! website.
  2. In the upper left-hand corner, you will notice a drop-down menu. Select a type of doodle to begin.
  3. Click on different pictures to see details about the images on your screen. For example, there are more than one hundred thousand different drawings of elephants. Scroll through the list and see if there are any that don’t belong. If you find one that doesn’t match the intended object, click on it and select Flag as inappropriate.
  4. Explore other categories of drawings. Select three categories that interest you and check out their doodles.
  5. Optional: Explore further. Click Get the data to visit the GitHub page containing the entire dataset. As you become more familiar with data projects and start creating your own, you can return to this dataset and analyze it yourself. Click Play the game to draw your own doodles and contribute to Quick, Draw!’s dataset.
  6. When you’re done, answer the reflection questions below.

Reflection

Consider the doodles you found in the Quick, Draw! dataset:

  • What did you notice as you explored drawings in different categories? Are there consistent themes among the pictures in a category?
  • If you didn’t know the category labels, how would you distinguish the pictures from each other? What would you look for?

Now, write 2-3 sentences (40-60 words) in response to each of these questions. Type your response in the text box below.

Question 2

Consider what you know about structured and unstructured data and how it connects to the Quick, Draw! website:

  • How would you describe the Quick, Draw! doodles you explored from a data point of view?
  • How are these doodles different from or similar to other types of data that you have previously encountered?
  • What about this data makes it unstructured?

Now, write 2-3 sentences (40-60 words) in response to each of these questions. Type your response in the text box below.

Please do it yourself.

The structure of data

Data is everywhere and it can be stored in lots of ways. Two general categories of data are:

  • Structured data: Organized in a certain format, such as rows and columns.
  • Unstructured data: Not organized in any easy-to-identify way.

For example, when you rate your favorite restaurant online, you're creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you're using unstructured data.

Here's a refresher on the characteristics of structured and unstructured data:

Structured data

As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
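
To illustrate why structure makes data easier to query, here is a minimal sketch using Python's standard `csv` module. The inventory rows, column names, and store names are hypothetical. The header row travels with the exported data, so a program can look up values by column name without guessing where they are.

```python
import csv
import io

# Hypothetical store inventory exported as CSV; the header row carries the structure.
exported = io.StringIO(
    "item,location,quantity\n"
    "notebook,Store A,40\n"
    "pen,Store A,0\n"
    "notebook,Store B,15\n"
)

rows = list(csv.DictReader(exported))

# Query 1: which items are out of stock, and where?
out_of_stock = [(r["item"], r["location"]) for r in rows if int(r["quantity"]) == 0]

# Query 2: total notebooks across all locations.
total_notebooks = sum(int(r["quantity"]) for r in rows if r["item"] == "notebook")

print(out_of_stock)     # [('pen', 'Store A')]
print(total_notebooks)  # 55
```

Answering the same questions from unstructured text, such as an email describing stock levels, would first require extracting the item, location, and quantity from free-form sentences.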

Unstructured data

Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.

The fairness issue

The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others. And as you're learning, an unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
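
One simple way to spot over- or under-representation is to compare each group's share of a dataset with its share of the population. The sketch below is only an illustration, not a complete fairness check; the group labels, sample, and population shares are all made-up values.

```python
from collections import Counter

# Hypothetical group label for each record in a sample.
sample_groups = ["A", "A", "A", "A", "A", "A", "A", "B", "B", "C"]

# Hypothetical share of each group in the real population.
population_share = {"A": 0.5, "B": 0.3, "C": 0.2}

counts = Counter(sample_groups)
total = len(sample_groups)

# Positive gap: group is over-represented in the sample; negative: under-represented.
representation_gap = {
    group: counts[group] / total - expected
    for group, expected in population_share.items()
}

for group, gap in representation_gap.items():
    print(f"{group}: sample share {counts[group] / total:.0%}, gap {gap:+.0%}")
```

In this made-up sample, group A makes up 70% of the records but only 50% of the population, so any analysis built on it would lean toward group A.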

Data modeling levels and techniques

This reading introduces you to data modeling and different types of data models. Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.

Important note: As a junior data analyst, you won't be asked to design a data model. But you might come across existing data models your organization already has in place.

What is data modeling?

Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.

Levels of data modeling

Each level of data modeling has a different level of detail.

  1. Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details.
  2. Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model.
  3. Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
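
As a rough illustration of the physical level, the sketch below creates two tables in an in-memory SQLite database using Python's standard `sqlite3` module. The table names, column names, and data types are hypothetical. At the conceptual level the same model would only say "customers place orders"; at the logical level it would add that each customer and order is uniquely identified and that an order belongs to one customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical model: concrete table names, column names, and data types (all hypothetical).
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        full_name   TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_total REAL NOT NULL
    );
    """
)

# Read the physical details back: (column name, declared type) for each table.
schema = {}
for table in ("customers", "orders"):
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    schema[table] = [(col[1], col[2]) for col in columns]

print(schema["customers"])  # [('customer_id', 'INTEGER'), ('full_name', 'TEXT')]
```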

More information can be found in this comparison of data models.

Data-modeling techniques

There are a lot of approaches when it comes to developing data models, but two common methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships. As a junior data analyst, you will need to understand that there are different data modeling techniques, but in practice, you will probably be using your organization’s existing technique.
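
The ideas of entities, attributes, operations, and relationships can be loosely mirrored in code. The sketch below uses Python dataclasses for two hypothetical entities, `Customer` and `Order`; it is an analogy for what a diagram would show, not a replacement for an ERD or UML diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    """Entity with two attributes."""
    order_id: int
    total: float


@dataclass
class Customer:
    """Entity with attributes, a relationship to Order, and one operation."""
    customer_id: int
    name: str
    orders: list[Order] = field(default_factory=list)  # one customer has many orders

    def lifetime_value(self) -> float:
        """Operation: sum of all order totals for this customer."""
        return sum(order.total for order in self.orders)


customer = Customer(1, "Example Customer")
customer.orders.append(Order(101, 25.0))
customer.orders.append(Order(102, 17.5))
print(customer.lifetime_value())  # 42.5
```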

You can read more about ERD, UML, and data dictionaries in this data modeling techniques article.

Data analysis and data modeling

Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!

Test your knowledge on data formats and structures

Question 1

Fill in the blank: The running time of a movie is an example of _____ data.

A. qualitative

B. nominal

C. continuous

D. discrete

The correct answer is C. continuous. Explain: Running times of movies are an example of continuous data, which is measured and can have almost any numeric value.

Question 2

What are the characteristics of unstructured data? Select all that apply.

  • Is not organized
  • May have an internal structure
  • Has a clearly identifiable structure
  • Fits neatly into rows and columns

The correct answers are "Is not organized" and "May have an internal structure". Explain: Unstructured data is not organized, although it may have an internal structure.

Question 3

Structured data enables data to be grouped together to form relations. This makes it easier for analysts to do what with the data? Select all that apply.

  • Analyze
  • Search
  • Store
  • Rewrite

The correct answers are "Analyze", "Search", and "Store". Explain: Structured data that is grouped together to form relations enables analysts to more easily store, search, and analyze the data.

Question 4

Which of the following is an example of unstructured data?

A. Contact saved on a phone

B. Email message

C. Rating of a local favorite restaurant

D. GPS location

The correct answer is B. Email message. Explain: An example of unstructured data is an email message. Other examples of unstructured data are video files and social media content.