R Workout 9 - Visualizing Data: Mastering `ggplot2`

Title: Visualizing Data with R: Mastering ggplot2

Description:

The ggplot2 package in R provides a robust and flexible platform for creating rich visualizations. In this workout, delve into the essential components of ggplot2 and learn to craft compelling data stories.

Scenario:

Imagine you’re analyzing sales data for a retailer. The dataset contains monthly sales figures, product categories, and regions. How would you use ggplot2 to visualize sales trends, compare product performance, and analyze regional differences?

Objectives:

By the end of this workout, you should be able to:

  1. Construct various plot types using ggplot2, such as scatter plots, bar plots, and line charts.

  2. Customize and refine visualizations using ggplot2 aesthetics and themes.

  3. Interpret and communicate insights derived from visualizations.

Interactive Task:

Given your understanding of ggplot2, answer the following:

  1. How would you create a line chart to visualize monthly sales trends using a dataframe sales_df with columns “Month” and “Sales”?

    • Your Code: ________________________
  2. If you want to compare sales across different product categories using a bar chart, how would you code it given the columns “Category” and “Sales” in sales_df?

    • Your Code: ________________________
  3. To analyze regional sales differences using a box plot for the columns “Region” and “Sales”, how might you go about it?

    • Your Code: ________________________

Questions:

  1. In ggplot2, which function allows you to modify plot themes and customize the appearance of your plots?

    • i) modify_theme()

    • ii) adjust_theme()

    • iii) theme()

    • iv) theme_set()

  2. When you want to overlay data points on a line chart in ggplot2, which geom function would you commonly use?

    • i) overlay()

    • ii) add_point()

    • iii) geom_point()

    • iv) scatter_overlay()

Duration: 20 minutes

Difficulty: Intermediate

Period:
This workout will be released on Wednesday, September 20, 2023, and will end on Thursday, October 05, 2023. You can still come back to any workout and solve it after it ends.

Hi @EnterpriseDNA,

Here is my solution to this workout.

Questions:

  1. In ggplot2, which function allows you to modify plot themes and customize the appearance of your plots?
    Answer:
  • iii) theme()
  2. When you want to overlay data points on a line chart in ggplot2, which geom function would you commonly use?
    Answer:
  • iii) geom_point()
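
To make the two answers concrete, here is a minimal sketch. The data frame demo_df and its values are made up for illustration. It adds geom_point() as a second layer on top of geom_line(), then contrasts theme(), which adjusts one plot, with theme_set(), which changes the default theme for later plots.

```r
library(ggplot2)

# Hypothetical data, used only to have something to draw.
demo_df <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

base_plot <- ggplot(demo_df, aes(x = x, y = y)) +
  geom_line() +
  geom_point()  # second layer: points drawn on top of the line

# theme() adjusts individual elements of this one plot.
styled_plot <- base_plot +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

# theme_set() replaces the default theme for later plots
# and returns the previous default so it can be restored.
old_theme <- theme_set(theme_minimal())
theme_set(old_theme)
```

Because theme_set() affects every later plot in the session, theme() is usually the safer choice for one-off tweaks.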
Interactive Task:

  1. How would you create a line chart to visualize monthly sales trends using a dataframe sales_df with columns “Month” and “Sales”?

Code:
library(ggplot2)

# Assuming 'sales_df' is your data frame with columns "Month" and "Sales".
# Make sure the 'Month' column is a Date or a factor in calendar order.
# If 'Month' is a factor or character column, add group = 1 inside aes()
# so that geom_line() connects the points.

ggplot(data = sales_df, aes(x = Month, y = Sales)) +
  geom_line() +
  labs(x = "Month", y = "Sales") +
  ggtitle("Monthly Sales Trends")

In this code:

  1. You load the ggplot2 library using library(ggplot2).
  2. You use the ggplot() function to specify the data source (data = sales_df) and set the aesthetics (aes(x = Month, y = Sales)), mapping “Month” to the x-axis and “Sales” to the y-axis.
  3. You add the line to the plot using geom_line().
  4. You set the x-axis and y-axis labels using labs().
  5. You set the title of the plot using ggtitle().

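Make sure that your “Month” column is a Date or a factor whose levels are in calendar order; a plain character column is sorted alphabetically along the x-axis. With a factor or character column, also add group = 1 inside aes(), otherwise geom_line() treats each month as its own group and draws no line.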
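
One way to handle the month ordering is sketched here with made-up data. The sales values are hypothetical, and the sketch assumes Month holds full English month names. It converts Month to a factor with calendar-ordered levels and uses group = 1 so the line is drawn across a discrete x-axis.

```r
library(ggplot2)

# Hypothetical sample data: month names stored as text, made-up sales values.
sales_df <- data.frame(
  Month = month.name,
  Sales = c(120, 135, 150, 145, 160, 175, 170, 180, 165, 155, 190, 210)
)

# Order the factor levels by calendar month rather than alphabetically.
sales_df$Month <- factor(sales_df$Month, levels = month.name)

# group = 1 tells geom_line() to treat all rows as a single series,
# which is needed when the x-axis is discrete.
p_line <- ggplot(sales_df, aes(x = Month, y = Sales, group = 1)) +
  geom_line() +
  labs(x = "Month", y = "Sales") +
  ggtitle("Monthly Sales Trends")

print(p_line)
```

If the data spans more than one year, a Date column is likely the better choice, since a month-name factor would fold all years onto the same twelve positions.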

  2. If you want to compare sales across different product categories using a bar chart, how would you code it given the columns “Category” and “Sales” in sales_df?

Code:
library(ggplot2)

# Assuming 'sales_df' is your data frame with columns "Category" and "Sales".

ggplot(data = sales_df, aes(x = Category, y = Sales, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(x = "Category", y = "Sales") +
  ggtitle("Sales by Category") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In this code:

  1. You load the ggplot2 library using library(ggplot2).
  2. You use the ggplot() function to specify the data source (data = sales_df) and set the aesthetics (aes(x = Category, y = Sales, fill = Category)), mapping “Category” to the x-axis, “Sales” to the y-axis, and using “Category” for fill colors.
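  3. You add the bars to the plot using geom_bar(stat = "identity"). The stat = "identity" argument makes each bar's height come directly from the “Sales” values instead of a row count; geom_col() is shorthand for the same thing. If a category has several rows, their values are stacked, so the bar shows the category total.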
  4. You set the x-axis and y-axis labels using labs().
  5. You set the title of the plot using ggtitle().
  6. You can use theme() to customize the appearance of the plot. In this example, I’ve added theme(axis.text.x = element_text(angle = 45, hjust = 1)) to rotate the x-axis labels for better readability if you have many categories.

This code will create a bar chart that compares sales across different product categories, with each category represented by a colored bar.
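
When sales_df has several rows per category, it can be clearer to total the sales first and plot one row per bar. The following is a sketch under that assumption, using made-up data. The name category_totals is hypothetical, and sorting the bars with reorder() is an optional choice, not part of the original solution.

```r
library(ggplot2)

# Hypothetical sample data with several rows per category (made-up values).
sales_df <- data.frame(
  Category = c("Toys", "Toys", "Books", "Books", "Garden"),
  Sales    = c(100, 150, 80, 40, 200)
)

# Sum sales within each category so there is one row, and one bar, per category.
category_totals <- aggregate(Sales ~ Category, data = sales_df, FUN = sum)

# reorder(Category, -Sales) sorts the bars from highest to lowest total.
p_bar <- ggplot(category_totals,
                aes(x = reorder(Category, -Sales), y = Sales, fill = Category)) +
  geom_col() +
  labs(x = "Category", y = "Total Sales") +
  ggtitle("Sales by Category")

print(p_bar)
```

Aggregating first also makes the totals available as a table, which is handy when the chart needs to be checked against the underlying numbers.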

  3. To analyze regional sales differences using a box plot for the columns “Region” and “Sales”, how might you go about it?

Code:
library(ggplot2)

# Assuming 'sales_df' is your data frame with columns "Region" and "Sales".

ggplot(data = sales_df, aes(x = Region, y = Sales)) +
  geom_boxplot(fill = "lightblue", color = "blue") +
  labs(x = "Region", y = "Sales") +
  ggtitle("Regional Sales Differences")

In this code:

  1. You load the ggplot2 library using library(ggplot2).
  2. You use the ggplot() function to specify the data source (data = sales_df) and set the aesthetics (aes(x = Region, y = Sales)), mapping “Region” to the x-axis and “Sales” to the y-axis.
  3. You add the box plots to the plot using geom_boxplot(). You can customize the fill color and outline color of the boxes by specifying the fill and color arguments, respectively.
  4. You set the x-axis and y-axis labels using labs().
  5. You set the title of the plot using ggtitle().

This code will create a box plot that visualizes the distribution of sales across different regions, helping you analyze regional sales differences and identify potential outliers or variations in the data.
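
To show how outliers surface in practice, here is a small sketch with made-up data in which one North value is far from the rest. It highlights outliers in red and lists them with base R's boxplot.stats(). That function applies a 1.5 × IQR rule similar to geom_boxplot()'s default, but it computes quartiles as hinges, so borderline points may be classified slightly differently.

```r
library(ggplot2)

# Hypothetical sample data (made-up values); 400 in the North stands apart.
sales_df <- data.frame(
  Region = rep(c("North", "South"), each = 6),
  Sales  = c(100, 110, 105, 95, 102, 400, 80, 85, 90, 88, 84, 86)
)

# List the North values that fall beyond 1.5 * IQR from the hinges.
north_sales <- sales_df$Sales[sales_df$Region == "North"]
north_outliers <- boxplot.stats(north_sales)$out

# outlier.colour draws the flagged points in a contrasting colour.
p_box <- ggplot(sales_df, aes(x = Region, y = Sales)) +
  geom_boxplot(fill = "lightblue", color = "blue", outlier.colour = "red") +
  labs(x = "Region", y = "Sales") +
  ggtitle("Regional Sales Differences")

print(p_box)
```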

Thanks for the workout.
Keith