4.2.2.1.Hands On Activity: Clean data with spreadsheet functions - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Hands-On Activity: Clean data with spreadsheet functions

San-Francisco-Boba-Tea-Shop-Location-Info.csv

Identify the dirty elements in your data

As a data analyst, your job is to present data that is readable, accurate, and visually appealing. Cleaning your data helps you achieve this goal. The first step is to identify the dirty elements in your data.

  1. Rename your spreadsheet. Click Untitled Spreadsheet and enter a new name. You can use the name sf_boba_tea_shop_data or a similar name that describes the data your spreadsheet contains.
  2. If you want to get a better view of your data, you can make the columns wider by dragging the right boundary of the column heading. This may apply to the name (B), address (D), and lat-long (F) columns.
  3. Now, review your data and consider any problems you may need to address. The following are examples of errors that you can quickly identify and fix. This is not a comprehensive list of every potential problem, but is a great starting point for data cleaning.
  • First, there is at least one duplicate line (rows 8 and 9) in your dataset.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/FXraE4D0S9W62hOA9BvVhQ_3fa9a75ec2334248a489c7736dd69ff1_Screenshot-2021-06-30-2.29.37-PM.png?expiry=1625961600000&hmac=oQgzZuPEfUp2-ozMa2yc6u2uvQ4br-6nqIPh-vlE6jw

  • Second, all Yelp ratings should fall between 0 and 5. However, at least one rating (in cell C8) falls outside of that range.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/nirrPfsRRm2q6z37EcZtxg_f0a737934eab42099bebfcb516abbdf1_Screenshot-2021-06-30-2.30.37-PM.png?expiry=1625961600000&hmac=CmgmQeRONKhX-0_C6cRdvS2QqWHKMfl7mRzVPtfvPps

  • Finally, the data for latitude and longitude is contained in a single column (F). In order for someone to be able to use this data for analysis, the two values should be in separate columns.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/NEiZs-ZmQCaImbPmZlAmJQ_cf51dde76ccd4bd7821357a6d912f6f1_Screenshot-2021-06-30-2.31.35-PM.png?expiry=1625961600000&hmac=4_fPPmeBwjc_ZiIWH0-P3QCskH6yq-ZJLORwfY2D_JU

Now you know what issues to focus your attention on during the cleaning process.

Clean your data

Your goal is to fix these errors and help create a clean dataset for analysis. You can address each issue in turn.

Remove duplicates

The first step is to eliminate any duplicate entries from your dataset. In this case, you want to remove duplicate ID numbers and boba shop names. As a best practice, duplicates should be removed even if they are not readily apparent (as rows 8-9 happen to be).

  1. To start, select columns A and B.
  2. Then, in the menu bar, choose Data and Remove duplicates.
  3. In the pop-up window, click Data has header row. In the Columns to analyze section, make sure all the columns (id, name) are selected.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/qCFzadquTXqhc2nark16Og_4b8fd33d0c03404580c2b3c1443fbcf1_Screenshot-2021-07-08-12.32.12-PM.png?expiry=1625961600000&hmac=zyQPZw5h4F6SFvJm6RmiqaEaR4QS3iWCKQki9q6TyS0

  1. Once everything has been selected, click Remove duplicates.

  2. If done correctly, 3 duplicate rows will be removed and 604 rows will remain. (There are duplicate entries in rows 8-9, rows 20-21, and rows 25-26).

Correct the ratings data

Next, clean up any data that does not make sense. Yelp ratings should be less than 5 and greater than 0. Now, you will determine how many entries are inaccurate and correct them. You can use the COUNTIF function to perform this task.

  1. The COUNTIF function quickly counts how many items in a range of cells meet a given criterion. In cell I2, type =COUNTIF(C:C,">5"). The first entry (C:C) refers to the range where you are counting the data. In this case, the range is the entire rating column (C), which contains the Yelp ratings. The second entry refers to the criterion (>5), and tells the function to count all the values greater than 5.
  2. Press Enter. You’ll notice that the function returns a value of 9. This tells you that your dataset contains 9 entries that have a rating greater than 5.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/Jf8mhZISRri_JoWSEra4dA_c73549e7506a4efa83cf6ed7f6ed21f1_Screenshot-2021-06-30-2.38.08-PM.png?expiry=1625961600000&hmac=yXJGMQjR45Ty5y09WmSKOBrrMRgzfAdAQ-hxXgUOyk8

As a data analyst, it's your job to decide what to do with incorrect values or to ask the dataset owner for advice if you’re unsure. In this case, one effective approach would be to search on Yelp for the actual ratings. For this activity, you can just replace the incorrect ratings with the number 5. An efficient way to replace the ratings is to sort the data numerically from largest to smallest rating.

  1. Select columns A through F.

  2. Then, from the menu bar, choose Data and Sort range.

  3. In the pop-up window, check the box next to Data has header row. Sort by rating from Z →A. This way, the highest ratings will be listed first.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/pvoPt_PkSba6D7fz5Hm2OQ_376639396af44a30b48ead45ed9dd8f1_Screenshot-2021-06-30-2.39.21-PM.png?expiry=1625961600000&hmac=Z4yIzbHXmyzkfpo4CQyGnvWBkYX1KU3tNvIVzsXvdms

  1. Click Sort. Check out your spreadsheet. At the start of the rating column, you should now find the 9 rows that have incorrect values (rating > 5).

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/pPk8ejNsSZO5PHozbNmTQA_88214bfc74464d62b40bed78ff6145f1_Screenshot-2021-06-30-2.40.18-PM.png?expiry=1625961600000&hmac=gL0E4EH2PsN7cTrEJsxJZCxk0q76zEfSU7EXVVjAsSQ

  1. Next, select the range of cells C2:C10. Press delete to delete the values that are greater than 5.

  2. Replace all the values with the number 5. In cell C2, type 5. Then, drag the fill handle down to cell C10 to fill the remaining cells with 5.

  3. After replacing the incorrect ratings with the number 5, you may notice that the new value in cell I2 is 0. The output of the COUNTIF function now reflects the changes in your dataset. This confirms that the rating column no longer contains any values greater than 5.

  4. FInally, delete the formula from cell I2 since you don’t need this information anymore.

Clean up the latitude and longitude data

Next, clean up the latitude and longitude data by placing each value in a separate column. You can use the SPLIT function to accomplish this task.

  1. The SPLIT function divides text around a specified character or string, and puts each fragment of text into a separate cell in the row. The SPLIT function will split the single lat-long column into two separate columns, one for latitude and the other for longitude. In cell G2, type =SPLIT(F2,"-"). The first entry (F2) refers to the cell where the text is located. The second entry (“-”) refers to the fact that you are dividing the text based on the minus sign.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/ln7G6OF_SSq-xujhfykqfg_6c0b145b4ba146b8b1d47996d57116f1_Screenshot-2021-06-30-2.42.26-PM.png?expiry=1625961600000&hmac=04T5aDfTfsCQePMp-fTypcFBydGKzG84-FRMZIHxU4A

  1. Press Enter. The result shows each fragment of text in a different cell.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/o7HEj_-oQMyxxI__qFDMBg_968abf1e70e945858978b76cea3899f1_Screenshot-2021-06-30-2.43.09-PM.png?expiry=1625961600000&hmac=2cwurwAdZmSo6Z_5P6MBgWQBbkHk9G8rAT4VpxPfZLI

  1. Select cell G2 again. In cell G2, double-click on the fill handle to split all the remaining lat-long entries.

  2. Now add column headers to the two new columns (G and H). In cell G1, type lat. In cell H1, type long.

  3. Next, replace the original lat-long data in column F with the new split entries in columns G and H. Select columns G and H, right-click, and choose Copy.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/8aKXdyYBQrmil3cmAeK5Kg_1d0fb3eacea345da80f242acc00a4bf1_Screenshot-2021-06-30-2.46.26-PM.png?expiry=1625961600000&hmac=aeAooaa7tfeVjmNthBoJPBqjWusLJdxdYPe21d2BCrY

  1. Then, select Column F, right-click, and choose Paste special and Paste values only.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/yJGweVJzSFORsHlSc-hTkA_9248bb8f70394c328746110472257df1_Screenshot-2021-06-30-2.47.25-PM.png?expiry=1625961600000&hmac=e9KUvRxL2BdFB8UZOprnBCY_fOFgKoj8Y-e08pAm1a0

  1. Now the new lat column is column F, and the new long column is column G. Adjust the width of the lat column (F) to fit the data by dragging the right boundary of the column heading.

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/wPR2exShTq60dnsUoW6uLA_459eb3a115c14002adedea32be1b35f1_Screenshot-2021-07-08-2.28.03-PM.png?expiry=1625961600000&hmac=dmJ1NqmL3IHeBwxDGfTQQ5PyFZD0tGiKgYj8w7gWrp0

  1. Next, select column H, right-click, and choose Delete column.

  2. Finally, the longitude values should be negative so that they are accurate coordinates for mapping. To make the values in the long column negative, multiply them ***by -1. In cell H2, type =G2-1. The asterisk is the operator for multiplication. Press Enter.

  3. Still in cell H2, double-click on the fill handle to fill in the rest of the values.

  4. Next, add a column header. In cell H1, type: long.

  5. Now, replace the longitude data in column G with the new data in column H. Select column H, right-click, and choose Copy.

  6. Select Column G, right-click, and choose Paste special and Paste values only.

  7. The, select column H, right-click, and choose Delete column.

Columns F and G should look like this:

https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/8zCkhpznQyewpIac51MnBA_26b68c8ae9e54a80b4378daf952a78f1_Screenshot-2021-07-08-2.26.21-PM.png?expiry=1625961600000&hmac=TUFGYfNetmc2vAotkkwoKj18E26cBii1ffnXl9gigcs

Now your data is cleaner, clearer, and easier to use.

Which of the following functions divides text around a specified character or string and puts each fragment of text into a separate cell in the row?

  • The SPLIT function

  • The CONCATENATE function

  • The TRIM function

  • The COUNTIF function

  • Correct. The SPLIT function divides text around a specified character or string, and puts each fragment of text into a separate cell in the row. Spreadsheet functions are useful tools for data cleaning, and knowing how to use functions effectively is a key part of every data analyst’s skill set.

As a data analyst, why is it important for the data you present to be readable, accurate, and visually appealing? And How can spreadsheet functions help you clean data more efficiently and effectively?

  • Cleaning data is an important part of the data analysis process. If data analysis is based on bad or “dirty” data, it may be biased, erroneous, and uninformed. Knowing how to effectively use spreadsheet functions to work with data is an essential skill for every data analyst. In upcoming activities, you will continue to learn more about spreadsheet functions and how they can help you analyze your data.