R Workout 01 - Tidy data

The goal of this task is to manipulate the data set in such a way that:

  1. Each variable corresponds to a column.
  2. Each observation is a row.
  3. Each value is a cell.

Raw dataset:

Expected result:

20230424

Tasks:

1.Leave only headers which refers to country and import

2.Transpose the table so that the “Country” column remains unchanged, and the other columns are gathered into rows.

3.The second column should be called “Year of import” and contain the headers of the pre-existing table.

4.The “Value” column should contain numeric values.

  1. The “Year of import” column should contain only 4 digits indicating the year

  2. Sort values in ascending order by country and year

Extra task: use ggplot2 library to create a clustered bar chart like one below:

Tip: Use https://rdrr.io/snippets/, the online R editor, if you don’t have the R environment installed.

Submission

Simply post your code and a screenshot of your results.

Please format your R code and blur it or place it in a hidden section.

Dataset:

#Load below data frame
df ←
data.frame(Country = c(“Poland”,“USA”),
Import_2020 = c(12123,3232),
Import_2021 = c(2323,54546),
Population_2020 = c(37900000,320950000),
Population_2021 = c(3775000,321900000))

tibble::view(df) #Preview of dataset

Tip: Use https://rdrr.io/snippets/, the online R editor, if you don’t have the R environment installed.

This workout will be released on Monday April 24, 2023, and the author’s solution will be posted on Sunday April 30, 2023.

1 Like
library(dplyr)
library(reshape)
library(ggplot2)
library(stringr)

Country <- c("Poland", "USA")
Import_2020 <- c(12123, 3232)
Import_2021 <- c(2323, 54546)
Population_2020 <- c(37900000, 320950000)
Population_2021 <- c(3775000, 321900000)

df <- data.frame(Country, Import_2020, Import_2021, Population_2020, Population_2021 )
df <- select(df, contains("Country"), contains("Import"))
df <- melt(df, id=c("Country"))

colnames(df) <- c("Country", "year_of_import", "value")

df$year_of_import <- str_replace(df$year_of_import , "Import_", "")
as.numeric(as.character(df$year_of_import ))
as.numeric(as.character(df$value))

ggplot(df, aes(x=year_of_import , y=value, fill=Country))  + 
    geom_bar(stat="identity", position=position_dodge()) + 
    theme_light() +
    theme(legend.position = "top") + 
    labs(y = "Value of Import", x = "Year")

Used https://rdrr.io/snippets/

1 Like

Good way to start the week. Here is my code. I will say that I was a little rusty having not having used R in awhile so @AntrikshSharma’s helped kick start mine. And due to limited time, I skipped the plotting.

Summary

library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(stringr)

#Load below data frame
Country ← c(“Poland”, “USA”)
Import_2020 ← c(12123,3232)
Import_2021 ← c(2323,54546)
Population_2020 ← c(37900000,320950000)
Population_2021 ← c(3775000,321900000)

df ← data.frame(Country, Import_2020, Import_2021, Population_2020, Population_2021)

df ← select(df, Country, contains(“Import”))
df_tidy ← df %>%
gather(key = “Year”, value = “Import”, 2:3) %>%
mutate(Year = str_replace(Year, “Import_”, “”))

colnames(df_tidy) ← c(“Country”, “Year_of_Import”, “Value”)

df_tidy

3 Likes

@tweinzapfel and @AntrikshSharma ,

FYI - here’s a timesaver re: loading packages. If you install tidyverse once per the install command below, every subsequent time you can just call library(tidyverse) to load all the core tidyverse packages below

install.packages(“tidyverse”)


`library(tidyverse)` will load the core tidyverse packages:

* [ggplot2](https://ggplot2.tidyverse.org/), for data visualisation.
* [dplyr](https://dplyr.tidyverse.org/), for data manipulation.
* [tidyr](https://tidyr.tidyverse.org/), for data tidying.
* [readr](https://readr.tidyverse.org/), for data import.
* [purrr](https://purrr.tidyverse.org/), for functional programming.
* [tibble](https://tibble.tidyverse.org/), for tibbles, a modern re-imagining of data frames.
* [stringr](https://github.com/tidyverse/stringr), for strings.
* [forcats](https://github.com/tidyverse/forcats), for factors.
* [lubridate](https://github.com/tidyverse/lubridate), for date/times.
2 Likes

Hi @KrzysztofNowak ,

Thanks for the workout. Below is my solution.

Solution
library(tidyverse)
library(ggplot2)

#Load below data frame
df <- data.frame(Country = c("Poland","USA"),
           Import_2020 = c(12123,3232),
           Import_2021 = c(2323,54546),
           Population_2020 = c(37900000,320950000),
           Population_2021 = c(3775000,321900000))
#Data frame with import data only
df_import = df[,-c(4,5)]
df_import
#Unpivot data 
df_import = df_import %>% 
  pivot_longer(cols=starts_with("import"), names_to = "year_of_import",names_prefix="Import_", values_to = "values")
df_import
#Plot clustered bar chart with ggplot2 
ggplot(data = df_import, aes(x=year_of_import, y=values,fill=Country)) +
  geom_bar(position="dodge",stat="identity") +
    labs(title = "Import Statistics", subtitle = "Split by year and country",
         y= "Value of Import") + theme(legend.position = "top") + theme_classic()

2 Likes

I enjoyed giving this a go. My first real attempt at anything to do with R. Loved it. Used AI to assist me and problem solve on a number of steps

# Create the df data frame
df <- data.frame(
  Country = c("Poland", "USA"),
  Import_2020 = c(12123, 3232),
  Import_2021 = c(2323, 54546),
  Population_2020 = c(37900000, 320950000),
  Population_2021 = c(3775000, 321900000)
)

# Step 1: Filter columns related to country and import
df_filtered <- df[, c("Country", "Import_2020", "Import_2021")]

# Step 2: Reshape the data frame using gather (tidyverse package)
library(tidyverse)
df_reshaped <- df_filtered %>%
  gather(key = "Year of import", value = "Value", -Country)

# Step 3: Convert values to numeric and format year
df_reshaped$Value <- as.numeric(df_reshaped$Value)
df_reshaped$Year <- as.numeric(stringr::str_extract(df_reshaped$`Year of import`, "\\d+"))

# Step 4: Sort values in ascending order by country and year
df_sorted <- df_reshaped[order(df_reshaped$Country, df_reshaped$Year), ]

# Extra task: Create a clustered bar chart using ggplot2
library(ggplot2)
ggplot(df_sorted, aes(x = Country, y = Value, fill = `Year of import`)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Country", y = "Value", fill = "Year of import") +
  theme_minimal()

1 Like

Kind of crazy how you can ask chatGPT to interpret the results and then also tell you what is important and interesting about the data

1 Like

@SamMcKay - well done! Great to see you jumping into R. The combination of Power BI and R is nearly limitless from an analytics standpoint. Probably my favorite feature of Power BI…

@KrzysztofNowak ,

Great kickoff workout! Here’s my solution – threw little extra flair in with the ggthemes package.

Click for R code
library(tidyverse)
library(ggthemes)

#Load below data frame
df <- data.frame(
  Country = c("Poland", "USA"),
  Import_2020 = c(12123, 3232),
  Import_2021 = c(2323, 54546),
  Population_2020 = c(37900000, 320950000),
  Population_2021 = c(37750000, 321900000))


df2 <- df %>%
  select(Country, contains("Import")) %>%
  
  pivot_longer(
    cols = contains("Import"),
    names_to = "Year_of_Import",
    values_to = "Value") %>% 
  
   mutate(Year_of_Import=
       substr(
         .$Year_of_Import, 
         nchar(.$Year_of_Import) 
         - 3, 
         nchar(.$Year_of_Import)))

head( df2, 5)

ggplot(df2, aes(x = Year_of_Import, y = Value, fill = Country)) + 
  geom_col(position = "dodge") + 
  labs(x = "Year of Import", y = "Value",
  title = "Import Statistics", subtitle = "by Year and Country") +
  theme_economist_white()

image

Interesting to note on re-read this gpt output that it actually in wrong on point 3. Both countries DON’T increase yoy. Says it with such confidence too. That’s what they call a hallucination I guess!

I wonder why and/or how it gets such an obvious things wrong??

Hi!

All good! 2 notes. melt() function works great, you can be also interested in gather()/pivot_longer() - they allow to define names of variables.

1 Like

Thank you, all good! I like use of pipe!

Hello, thanks for participating.

Results are good, but I have 2 notes: 1) in first step You decided to hardcode positions. in future dataset could have more columns. More dynamic solution is to refer to headers names. You can move chart legend on top with theme(legend.position = “top”). But thanks for answering for both challange parts!

1 Like

Hi Sam, great results.

Few hints: If You believe that in future we can have more columns for next years, You may want to filter for all columns which contain/ starts with “Import”.

You may be interested in using pipe (|> or %>%) to execute code in sequences.

Also, if you want to avoid this annoying referencing to columns with " `` ", You might be interested in janitor::clean_names() function

Thank You Brian, great results:

theme_economist_white() is very interesting. I like that it makes subtitle smaller that title.

1 Like

Thanks all for participation, I am really impressed. Below is my solution

df ←
data.frame(Country = c(“Poland”,“USA”),
Import_2020 = c(12123,3232),
Import_2021 = c(2323,54546),
Population_2020 = c(37900000,320950000),
Population_2021 = c(3775000,321900000))

#Intermediate table
SelectedColumns ←
df |>
select(Country,starts_with(“Import”))
#The starts_with parameter of the select function allows us to select columns
#whose name begins with a defined string of characters

Result ← SelectedColumns |>
gather(key = “Year of import”,value = “Value”,-Country) |>
janitor::clean_names() |>
mutate(year_of_import = str_extract(string = year_of_import,pattern =“\d+”)) |>
arrange(country,year_of_import)

#The gather function allows us to perform an operation during which column headers #become rows.
#The key and value parameters specify the names of the new columns to be created.After #the “-” sign,
#we select the column that is the “anchor” of the table".

janitor replaces spaces between words with “_”. It is easier to refer to values now. Note it #makes it lowercase

str_extract fuctions extracts part of value by pattern. My pattern refers to digits (d). “\” #are escapes characters.

#Thanks to them, R does not read “d” as a letter of the alphabet but as a special #character.
#“+” means that the expression can appear once or more times.
#arrange sorts values bt country and year

__________chart ______
ggplot(data = Result,aes(x = year_of_import,y = value,fill = country)) +
geom_bar(position=“dodge”, stat=“identity”,width = 0.8) +
scale_x_discrete(name = NULL) +
scale_y_continuous(name = “Value of import”) +
labs(title = “Import statistics”,
subtitle = “Split by year and country”,
caption = “Created by Krzysztof Nowak”) +
theme_light() +
guides(fill = guide_legend(title = “Country”))
theme(legend.position = “top”)

3 Likes

@KrzysztofNowak ,

Thank you for the great workout and the detailed solution writeup.

Definitely agree that the Janitor package is a fantastic toolset for data cleaning. Just recently found that and have been learning its capabilities.

I noticed you are using the newer native pipe rather than the magrittr pipe. Are there particular reasons you prefer one to the other?

  • Brian

Looking forward to workout #2!

@KrzysztofNowak ,

Yeah, the ggthemes package has some nice embedded options. Also, check out my solution on Data Analysis #5 - between the ggpubr package (one of my favorites) and Quarto, it’s easy to create some really nice looking tables with less code than it would take in ggplot2.

  • Brian

Hi Brian, am currently trying to change my habit and use classic to minimize dependencies;)