R Workout 10 - Cleaning and Transforming Data with dplyr

Title: R’s Data Dojo: Cleaning and Transforming Data with dplyr

Description:

In the R ecosystem, dplyr stands out as a go-to tool for data wrangling. Embark on this workout to learn the basics of the dplyr package, mastering essential techniques for data cleaning and transformation.

Scenario:

You’ve been handed a dataset containing sales records for a company. The data has missing values, unwanted columns, and needs aggregation based on product categories. How can you employ dplyr functions to make this dataset analysis-ready?

Objectives:

By the end of this workout, you should be able to:

  1. Understand and apply core dplyr verbs such as select, filter, mutate, and summarize.

  2. Clean and transform datasets using dplyr functions.

  3. Gain insights into the importance of data wrangling in the data analysis process.

Interactive Task:

Given your beginner’s understanding of dplyr in R, answer the following:

  1. How would you use dplyr to select only the columns “Product” and “Sales” from a dataset named sales_data?

    • Your Approach: ________________________
  2. If you want to filter out rows where the “Sales” column has values below 100, how would you achieve this using dplyr?

    • Your Approach: ________________________
  3. To compute the average sales for each product category in the dataset, which dplyr functions would you chain together?

    • Your Answer: ________________________

Questions:

  1. In dplyr, if you want to create or modify a column in a dataset, which function would you use?

    • i) select()

    • ii) change()

    • iii) mutate()

    • iv) adjust()

  2. When using dplyr to summarize data, which function allows you to compute summary statistics like mean, sum, or count for specific columns?

    • i) compute()

    • ii) summarize()

    • iii) aggregate()

    • iv) collect()

Duration: 20 minutes

Difficulty: Beginner

Period:
This workout is released on Tuesday, October 10, 2023, and will end on Friday, October 20, 2023. But you can always come back to any of the workouts and solve them.

Hi There,

Solution to this workout:

Questions:

  1. In dplyr, if you want to create or modify a column in a dataset, which function would you use?

Answer:

  • iii) mutate()
  1. When using dplyr to summarize data, which function allows you to compute summary statistics like mean, sum, or count for specific columns?

Answer:

  • ii) summarize()

Interactive Task:

  1. How would you use dplyr to select only the columns “Product” and “Sales” from a dataset named sales_data ?

Approach:
You can use the select() function from the dplyr package to choose specific columns from a dataset. To select only the columns “Product” and “Sales” from a dataset named sales_data, you can use the following code:

library(dplyr)

selected_data <- sales_data %>%
  select(Product, Sales)

In this code:

  • library(dplyr) loads the dplyr package.
  • sales_data is the name of your dataset.
  • %>% is the pipe operator, which passes the sales_data dataset to the select() function.
  • select(Product, Sales) specifies that you want to keep only the “Product” and “Sales” columns in the selected_data dataset.

After running this code, selected_data will contain only the “Product” and “Sales” columns from your sales_data dataset.

  1. If you want to filter out rows where the “Sales” column has values below 100, how would you achieve this using dplyr ?

Approach:
To filter out rows in a dataset where the “Sales” column has values below 100 using dplyr, you can use the filter() function. Here’s how you can do it:

library(dplyr)

filtered_data <- sales_data %>%
  filter(Sales >= 100)

In this code:

  • library(dplyr) loads the dplyr package.
  • sales_data is the name of your dataset.
  • %>% is the pipe operator, which passes the sales_data dataset to the filter() function.
  • filter(Sales >= 100) specifies that you want to keep only the rows where the “Sales” column is greater than or equal to 100.

After running this code, the filtered_data dataset will contain only the rows where the “Sales” column has values of 100 or more.

  1. To compute the average sales for each product category in the dataset, which dplyr functions would you chain together?

Answer:
To compute the average sales for each product category in the dataset using dplyr, you can chain together the group_by() and summarize() functions. Here’s how you can do it:

library(dplyr)

average_sales_by_category <- sales_data %>%
  group_by(ProductCategory) %>%
  summarize(AvgSales = mean(Sales, na.rm = TRUE))

In this code:

  • library(dplyr) loads the dplyr package.
  • sales_data is the name of your dataset.
  • %>% is the pipe operator, which passes the sales_data dataset to the chained operations.
  • group_by(ProductCategory) groups the data by the “ProductCategory” column.
  • summarize(AvgSales = mean(Sales, na.rm = TRUE) calculates the average sales for each product category and creates a new column named “AvgSales” to store the results.

After running this code, the average_sales_by_category dataset will contain the average sales for each product category.

Thanks for the workout.

Keith