R Workout 01 - Tidy data

KrzysztofNowak · April 24, 2023, 6:11am

The goal of this task is to manipulate the data set in such a way that:

Each variable corresponds to a column.
Each observation is a row.
Each value is a cell.

Raw dataset:

Expected result:

20230424

Tasks:

1.Leave only headers which refers to country and import

2.Transpose the table so that the “Country” column remains unchanged, and the other columns are gathered into rows.

3.The second column should be called “Year of import” and contain the headers of the pre-existing table.

4.The “Value” column should contain numeric values.

The “Year of import” column should contain only 4 digits indicating the year
Sort values in ascending order by country and year

Extra task: use ggplot2 library to create a clustered bar chart like one below:

Tip: Use https://rdrr.io/snippets/, the online R editor, if you don’t have the R environment installed.

Submission

Simply post your code and a screenshot of your results.

Please format your R code and blur it or place it in a hidden section.

Dataset:

#Load below data frame
df ←
data.frame(Country = c(“Poland”,“USA”),
Import_2020 = c(12123,3232),
Import_2021 = c(2323,54546),
Population_2020 = c(37900000,320950000),
Population_2021 = c(3775000,321900000))

tibble::view(df) #Preview of dataset

Tip: Use https://rdrr.io/snippets/, the online R editor, if you don’t have the R environment installed.

This workout will be released on Monday April 24, 2023, and the author’s solution will be posted on Sunday April 30, 2023.

AntrikshSharma · April 24, 2023, 8:03am

library(dplyr)
library(reshape)
library(ggplot2)
library(stringr)

Country <- c("Poland", "USA")
Import_2020 <- c(12123, 3232)
Import_2021 <- c(2323, 54546)
Population_2020 <- c(37900000, 320950000)
Population_2021 <- c(3775000, 321900000)

df <- data.frame(Country, Import_2020, Import_2021, Population_2020, Population_2021 )
df <- select(df, contains("Country"), contains("Import"))
df <- melt(df, id=c("Country"))

colnames(df) <- c("Country", "year_of_import", "value")

df$year_of_import <- str_replace(df$year_of_import , "Import_", "")
as.numeric(as.character(df$year_of_import ))
as.numeric(as.character(df$value))

ggplot(df, aes(x=year_of_import , y=value, fill=Country))  + 
    geom_bar(stat="identity", position=position_dodge()) + 
    theme_light() +
    theme(legend.position = "top") + 
    labs(y = "Value of Import", x = "Year")

Used https://rdrr.io/snippets/

tweinzapfel · April 24, 2023, 12:00pm

Good way to start the week. Here is my code. I will say that I was a little rusty having not having used R in awhile so @AntrikshSharma’s helped kick start mine. And due to limited time, I skipped the plotting.

Summary

library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(stringr)

#Load below data frame
Country ← c(“Poland”, “USA”)
Import_2020 ← c(12123,3232)
Import_2021 ← c(2323,54546)
Population_2020 ← c(37900000,320950000)
Population_2021 ← c(3775000,321900000)

df ← data.frame(Country, Import_2020, Import_2021, Population_2020, Population_2021)

df ← select(df, Country, contains(“Import”))
df_tidy ← df %>%
gather(key = “Year”, value = “Import”, 2:3) %>%
mutate(Year = str_replace(Year, “Import_”, “”))

colnames(df_tidy) ← c(“Country”, “Year_of_Import”, “Value”)

df_tidy

BrianJ · April 24, 2023, 4:35pm

@tweinzapfel and @AntrikshSharma ,

FYI - here’s a timesaver re: loading packages. If you install tidyverse once per the install command below, every subsequent time you can just call library(tidyverse) to load all the core tidyverse packages below

install.packages(“tidyverse”)


`library(tidyverse)` will load the core tidyverse packages:

* [ggplot2](https://ggplot2.tidyverse.org/), for data visualisation.
* [dplyr](https://dplyr.tidyverse.org/), for data manipulation.
* [tidyr](https://tidyr.tidyverse.org/), for data tidying.
* [readr](https://readr.tidyverse.org/), for data import.
* [purrr](https://purrr.tidyverse.org/), for functional programming.
* [tibble](https://tibble.tidyverse.org/), for tibbles, a modern re-imagining of data frames.
* [stringr](https://github.com/tidyverse/stringr), for strings.
* [forcats](https://github.com/tidyverse/forcats), for factors.
* [lubridate](https://github.com/tidyverse/lubridate), for date/times.

Ondrej · April 25, 2023, 7:48am

Hi @KrzysztofNowak ,

Thanks for the workout. Below is my solution.

Solution

library(tidyverse)
library(ggplot2)

#Load below data frame
df <- data.frame(Country = c("Poland","USA"),
           Import_2020 = c(12123,3232),
           Import_2021 = c(2323,54546),
           Population_2020 = c(37900000,320950000),
           Population_2021 = c(3775000,321900000))

#Data frame with import data only
df_import = df[,-c(4,5)]
df_import

#Unpivot data 
df_import = df_import %>% 
  pivot_longer(cols=starts_with("import"), names_to = "year_of_import",names_prefix="Import_", values_to = "values")
df_import

#Plot clustered bar chart with ggplot2 
ggplot(data = df_import, aes(x=year_of_import, y=values,fill=Country)) +
  geom_bar(position="dodge",stat="identity") +
    labs(title = "Import Statistics", subtitle = "Split by year and country",
         y= "Value of Import") + theme(legend.position = "top") + theme_classic()

SamMcKay · April 28, 2023, 9:28am

I enjoyed giving this a go. My first real attempt at anything to do with R. Loved it. Used AI to assist me and problem solve on a number of steps

# Create the df data frame
df <- data.frame(
  Country = c("Poland", "USA"),
  Import_2020 = c(12123, 3232),
  Import_2021 = c(2323, 54546),
  Population_2020 = c(37900000, 320950000),
  Population_2021 = c(3775000, 321900000)
)

# Step 1: Filter columns related to country and import
df_filtered <- df[, c("Country", "Import_2020", "Import_2021")]

# Step 2: Reshape the data frame using gather (tidyverse package)
library(tidyverse)
df_reshaped <- df_filtered %>%
  gather(key = "Year of import", value = "Value", -Country)

# Step 3: Convert values to numeric and format year
df_reshaped$Value <- as.numeric(df_reshaped$Value)
df_reshaped$Year <- as.numeric(stringr::str_extract(df_reshaped$`Year of import`, "\\d+"))

# Step 4: Sort values in ascending order by country and year
df_sorted <- df_reshaped[order(df_reshaped$Country, df_reshaped$Year), ]

# Extra task: Create a clustered bar chart using ggplot2
library(ggplot2)
ggplot(df_sorted, aes(x = Country, y = Value, fill = `Year of import`)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Country", y = "Value", fill = "Year of import") +
  theme_minimal()

SamMcKay · April 28, 2023, 9:32am

Kind of crazy how you can ask chatGPT to interpret the results and then also tell you what is important and interesting about the data

BrianJ · April 28, 2023, 4:07pm

@SamMcKay - well done! Great to see you jumping into R. The combination of Power BI and R is nearly limitless from an analytics standpoint. Probably my favorite feature of Power BI…

BrianJ · April 28, 2023, 8:56pm

@KrzysztofNowak ,

Great kickoff workout! Here’s my solution – threw little extra flair in with the ggthemes package.

Click for R code

library(tidyverse)
library(ggthemes)

#Load below data frame
df <- data.frame(
  Country = c("Poland", "USA"),
  Import_2020 = c(12123, 3232),
  Import_2021 = c(2323, 54546),
  Population_2020 = c(37900000, 320950000),
  Population_2021 = c(37750000, 321900000))


df2 <- df %>%
  select(Country, contains("Import")) %>%
  
  pivot_longer(
    cols = contains("Import"),
    names_to = "Year_of_Import",
    values_to = "Value") %>% 
  
   mutate(Year_of_Import=
       substr(
         .$Year_of_Import, 
         nchar(.$Year_of_Import) 
         - 3, 
         nchar(.$Year_of_Import)))

head( df2, 5)

ggplot(df2, aes(x = Year_of_Import, y = Value, fill = Country)) + 
  geom_col(position = "dodge") + 
  labs(x = "Year of Import", y = "Value",
  title = "Import Statistics", subtitle = "by Year and Country") +
  theme_economist_white()

SamMcKay · April 29, 2023, 6:45am

Interesting to note on re-read this gpt output that it actually in wrong on point 3. Both countries DON’T increase yoy. Says it with such confidence too. That’s what they call a hallucination I guess!

I wonder why and/or how it gets such an obvious things wrong??

KrzysztofNowak · April 30, 2023, 9:26am

Hi!

All good! 2 notes. melt() function works great, you can be also interested in gather()/pivot_longer() - they allow to define names of variables.

KrzysztofNowak · April 30, 2023, 9:29am

Thank you, all good! I like use of pipe!

KrzysztofNowak · April 30, 2023, 9:34am

Hello, thanks for participating.

Results are good, but I have 2 notes: 1) in first step You decided to hardcode positions. in future dataset could have more columns. More dynamic solution is to refer to headers names. You can move chart legend on top with theme(legend.position = “top”). But thanks for answering for both challange parts!

KrzysztofNowak · April 30, 2023, 9:41am

Hi Sam, great results.

Few hints: If You believe that in future we can have more columns for next years, You may want to filter for all columns which contain/ starts with “Import”.

You may be interested in using pipe (|> or %>%) to execute code in sequences.

KrzysztofNowak · April 30, 2023, 9:43am

Also, if you want to avoid this annoying referencing to columns with " `` ", You might be interested in janitor::clean_names() function

KrzysztofNowak · April 30, 2023, 9:46am

Thank You Brian, great results:

theme_economist_white() is very interesting. I like that it makes subtitle smaller that title.

KrzysztofNowak · April 30, 2023, 9:49am

Thanks all for participation, I am really impressed. Below is my solution

df ←
data.frame(Country = c(“Poland”,“USA”),
Import_2020 = c(12123,3232),
Import_2021 = c(2323,54546),
Population_2020 = c(37900000,320950000),
Population_2021 = c(3775000,321900000))

#Intermediate table
SelectedColumns ←
df |>
select(Country,starts_with(“Import”))
#The starts_with parameter of the select function allows us to select columns
#whose name begins with a defined string of characters

Result ← SelectedColumns |>
gather(key = “Year of import”,value = “Value”,-Country) |>
janitor::clean_names() |>
mutate(year_of_import = str_extract(string = year_of_import,pattern =“\d+”)) |>
arrange(country,year_of_import)

#The gather function allows us to perform an operation during which column headers #become rows.
#The key and value parameters specify the names of the new columns to be created.After #the “-” sign,
#we select the column that is the “anchor” of the table".

janitor replaces spaces between words with “_”. It is easier to refer to values now. Note it #makes it lowercase

str_extract fuctions extracts part of value by pattern. My pattern refers to digits (d). “\” #are escapes characters.

#Thanks to them, R does not read “d” as a letter of the alphabet but as a special #character.
#“+” means that the expression can appear once or more times.
#arrange sorts values bt country and year

__________chart ______
ggplot(data = Result,aes(x = year_of_import,y = value,fill = country)) +
geom_bar(position=“dodge”, stat=“identity”,width = 0.8) +
scale_x_discrete(name = NULL) +
scale_y_continuous(name = “Value of import”) +
labs(title = “Import statistics”,
subtitle = “Split by year and country”,
caption = “Created by Krzysztof Nowak”) +
theme_light() +
guides(fill = guide_legend(title = “Country”))
theme(legend.position = “top”)

BrianJ · April 30, 2023, 5:09pm

@KrzysztofNowak ,

Thank you for the great workout and the detailed solution writeup.

Definitely agree that the Janitor package is a fantastic toolset for data cleaning. Just recently found that and have been learning its capabilities.

I noticed you are using the newer native pipe rather than the magrittr pipe. Are there particular reasons you prefer one to the other?

Brian

Looking forward to workout #2!

BrianJ · April 30, 2023, 11:06pm

@KrzysztofNowak ,

Yeah, the ggthemes package has some nice embedded options. Also, check out my solution on Data Analysis #5 - between the ggpubr package (one of my favorites) and Quarto, it’s easy to create some really nice looking tables with less code than it would take in ggplot2.

Brian

KrzysztofNowak · May 2, 2023, 7:09am

Hi Brian, am currently trying to change my habit and use classic to minimize dependencies;)