3.2.2. Data ethics and privacy & Understanding open data - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Data ethics and privacy

Data ethics refers to

  • Well-founded standards of right and wrong that dictate how data is collected, shared, and used.

GDPR - General Data Protection Regulation of the European Union

Aspects of data ethics

  • Ownership - Individuals own the raw data they provide, and they have primary control over its usage, how it's processed, and how it's shared
  • Transaction transparency - All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data
  • Consent - An individual's right to know explicit details about how and why their data will be used before agreeing to provide it
  • Currency - Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions
  • Privacy - Preserving a data subject's information and activity any time a data transaction occurs
    • Protection from unauthorized access to our private data
    • Freedom from inappropriate use of our data
    • The right to inspect, update, or correct our data
    • Ability to give consent to use our data
    • Legal right to access our data
  • Openness (or open data) - Free access, usage, and sharing of data

Data anonymization

What is data anonymization?

You have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person's identity.

Data anonymization is the process of protecting people's private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
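
The three techniques named above (blanking, hashing, and masking) can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the function names, the sample record, and the salt value are all hypothetical, and SHA-256 is simply one common way to produce a fixed-length code. Note that hashing a low-variety value such as a phone number can often be reversed by guessing, so real projects usually need stronger safeguards.

```python
import hashlib


def blank(value: str) -> str:
    """Blanking: remove the value entirely."""
    return ""


def hash_value(value: str, salt: str = "example-salt") -> str:
    """Hashing: replace the value with a fixed-length code (64 hex characters).

    The salt here is a hypothetical placeholder, not a recommended secret.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Masking: hide all but the last `visible` characters."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


# Hypothetical record used only for illustration.
record = {"name": "Jane Doe", "email": "jane@example.com", "account": "1234567890"}

anonymized = {
    "name": blank(record["name"]),
    "email": hash_value(record["email"]),
    "account": mask(record["account"]),
}

print(anonymized["account"])  # ******7890
```

The hashed email always has the same length regardless of the input, which is what makes it usable as a stand-in code for a column while hiding the original value.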

Your role in data anonymization

Organizations have a responsibility to protect their data and the personal information that data might contain. As a data analyst, you might be expected to understand what data needs to be anonymized, but you generally wouldn't be responsible for the data anonymization itself. A rare exception might be if you work with a copy of the data for testing or development purposes. In this case, you could be required to anonymize the data before you work with it.

What types of data should be anonymized?

Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.

Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:

  • Telephone numbers
  • Names
  • License plates and license numbers
  • Social security numbers
  • IP addresses
  • Medical records
  • Email addresses
  • Photographs
  • Account numbers

For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized. Imagine a world where we all had access to each other’s addresses, account numbers, and other identifiable information. That would invade a lot of people’s privacy and make the world less safe. Data anonymization is one of the ways we can keep data private and secure!
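
As a rough illustration of being "very specific about what needs to be anonymized," the sketch below replaces a few of the data types listed above (email addresses, social security numbers, telephone numbers, and IP addresses) with labels in free text. The patterns, the `redact` function, and the sample sentence are assumptions made for this example; they cover only simple U.S.-style formats and would miss many real-world variations.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text: str) -> str:
    """Replace each matched value with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical sample text containing made-up values.
sample = "Contact jane@example.com or 555-123-4567 from 192.168.0.1, SSN 123-45-6789."
print(redact(sample))
# Contact [EMAIL] or [PHONE] from [IP], SSN [SSN].
```

Items such as names, photographs, and medical records cannot be found with simple patterns like these, which is one reason an explicit list of fields to anonymize is still needed.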

Test your knowledge on data ethics and privacy

Question 1

Fill in the blank: _____ states that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides their data.

  • Transaction transparency
  • Currency
  • Privacy
  • Openness

Correct. Transaction transparency states that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides their data.

Question 2

A data analyst removes personally identifying information from a dataset. What task are they performing?

  • Data visualization
  • Data collection
  • Data sorting
  • Data anonymization

Correct. They are performing data anonymization, which is the process of protecting people's private or sensitive data by eliminating identifying information.

Question 3

Before agreeing to complete a survey, an individual reads information about how and why the data they provide will be used. What is this concept called?

  • Currency
  • Openness
  • Consent
  • Privacy

Correct. This concept is called consent. Consent is the aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it.

Understanding open data

Interoperability is key to open data's success.

The open-data debate

Just like data privacy, open data is a widely debated topic in today’s world. Data analysts think a lot about open data, and as a future data analyst, you need to understand the basics to be successful in your new role.

In data analytics, open data is part of data ethics, which covers how data is collected, shared, and used. Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:

  • Be available and accessible to the public as a complete dataset
  • Be provided under terms that allow it to be reused and redistributed
  • Allow universal participation so that anyone can use, reuse, and redistribute the data

Data can only be considered open when it meets all three of these standards.

The open data debate: What data should be publicly available?

One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals being represented by public, open data, too.

PII and licensing, third party data, and privacy

Third-party data is collected by an entity that doesn’t have a direct relationship with the data. You might remember learning about this type of data earlier. For example, third parties might collect information about visitors to a certain website. Doing this lets these third parties create audience profiles, which help them better understand user behavior and target those users with more effective advertising.

Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important we keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more. We all want to keep this type of information about ourselves private. So it is important to find a balance between privacy and openness in public data.

Sites and resources for open data

Luckily for data analysts, there are lots of trustworthy sites and resources available for open data. It is important to remember that even reputable data needs to be constantly evaluated, but these websites are a useful starting point:

  1. U.S. government data site: Data.gov is one of the most comprehensive data sources in the U.S. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations.
  2. U.S. Census Bureau: This open data source offers demographic information from federal, state, and local governments, as well as commercial entities in the U.S.
  3. Open Data Network: This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
  4. Google Cloud Public Datasets: There is a selection of public datasets available through the Google Cloud Public Dataset Program that you can find already loaded into BigQuery.
  5. Dataset Search: Dataset Search is a search engine designed specifically for datasets; you can use it to find specific datasets.

Test your knowledge on open data

Question 1

What aspect of data ethics promotes the free access, usage, and sharing of data?

  • Privacy
  • Transaction transparency
  • Openness
  • Consent

Correct. Openness is the aspect of data ethics that promotes the free access, usage, and sharing of data.

Question 2

What are the main benefits of open data? Select all that apply.

  • Open data makes good data more widely available.
  • Open data restricts data access to certain groups of people.
  • Open data combines data from different fields of knowledge.
  • Open data increases the amount of data available for purchase.

Correct. The benefits of open data include making good data more widely available and combining data from different fields of knowledge.

Question 3

Universal participation is a standard of open data. What are the key aspects of universal participation? Select all that apply.

  • Certain groups of people must share their private data.
  • All corporations are allowed to sell open data.
  • Everyone must be able to use, reuse, and redistribute open data.
  • No one can place restrictions on data to discriminate against a person or group.

Correct. The key aspects of universal participation are that everyone must be able to use, reuse, and redistribute open data. Also, no one can place restrictions on data to discriminate against a person or group.