Data Analysis Workout 06: Roller Coaster Analysis

Level of Difficulty:

Objective: This workout provides some practice in data cleaning and exploratory data analysis.

Download the dataset here: https://buff.ly/3k6woDG

Challenge Questions:

  1. How many columns and rows are in the dataset?
  2. Is there any missing data?
  3. Display the summary statistics of the numeric columns using the describe method.
  4. Rename the following columns:
    • coaster_name :arrow_right: Coaster_Name
    • year_introduced :arrow_right: Year_Introduced
    • opening_date_clean :arrow_right: Opening_Date
    • speed_mph :arrow_right: Speed_mph
    • height_ft :arrow_right: Height_ft
    • Inversions_clean :arrow_right: Inversions
    • Gforce_clean :arrow_right: Gforce
  5. Are there any duplicated rows?
  6. What are the top 3 years with the most roller coasters introduced?
  7. What is the average speed? Also display a plot to show its distribution.
  8. Explore the feature relationships. Are there any positively or negatively correlated relationships?
  9. Create your own question and answer it.

Simply post your code and a screenshot of your results.

Please format your Python code and blur it or place it in a hidden section.

This workout will be released on Monday May 1, 2023, and the author’s solution will be posted on Sunday May 7, 2023.

Thank you for the workout @kedeisha1

Please see my submission below:

@kedeisha1 ,

Most of the questions asked got answered in one line of R code using the dlookr package for automated Exploratory Data Analysis reporting.

Here’s the R code that generated the dlookr report and a PDF of the report itself.

library(tidyverse)
library(dlookr)

df <- read_csv("https://raw.githubusercontent.com/kedeisha1/Challenges/main/coaster_db.csv")

EDAReport <- eda_web_report(
  df, output_format = "html",
  author = "Brian Julius",
  title = "eDNA Data Analysis Workout",
  subtitle = "Rollercoaster Dataset"
)

dlookr_ EDA.html (7.9 MB)

So, given that I thought this was a really cool dataset, I wanted to try something new - telling the story of the evolution of speed, materials, and geography over the past 100 years of rollercoasters. I think it turned out pretty well…

(Be sure to unmute and crank up the volume before hitting Play…:musical_note:)

Thanks to Kedeisha Bryan for another fantastic workout with a really interesting dataset.

For those interested, this was done in Power BI using three custom visuals:

:small_orange_diamond:James Dales’ phenomenal Icon Map mapping visual

:small_orange_diamond:The Wishyoulization Animated Bar Chart Race

:small_orange_diamond:The Play Axis to synch the timing of the two different visuals

All three can be found in AppSource for free.

  • Brian

I asked Bard to analyze the data and answer the questions. It did not answer them all correctly, but it was an interesting experiment given the recent advances in AI.


Hi,

I tried to practice some regular expressions to clean the data.

Data Analysis Challenge 06 - Roller Coaster Analysis.html (1.1 MB)
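
For readers who cannot open the attachment, here is a minimal sketch of what regex-based cleaning could look like in pandas. It is not the code from the attached notebook. The sample strings and the `speed_mph_parsed` column name are made up for illustration, and the raw values in the real dataset may be formatted differently.

```python
import pandas as pd

# Hypothetical raw values; the real column contents may differ.
raw = pd.DataFrame(
    {"Speed": ["6 mph (9.7 km/h)", "50 mph (80 km/h)", "100 km/h (62 mph)", None]}
)

# Capture the number (integer or decimal) that sits directly before "mph".
# Rows with no match, including missing values, become NaN.
raw["speed_mph_parsed"] = (
    raw["Speed"]
    .str.extract(r"(\d+(?:\.\d+)?)\s*mph", expand=False)
    .astype(float)
)

print(raw)
```

Anchoring the pattern on the unit rather than on position means it still picks the mph figure when the km/h value is listed first.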

O.

Well, well, well, looks like someone had a roller coaster of a time with that dataset! I can’t believe you managed to combine speed, materials, geography, AND rollercoasters all in one project. I mean, talk about a wild ride! And with Power BI, no less? You must have nerves of steel to tackle such a project. But hey, it sounds like it turned out to be a real scream. Or should I say, a real “scream machine”! :roller_coaster:

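import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data. Assumption: this is the same raw CSV used in the R example earlier in the thread.
df = pd.read_csv("https://raw.githubusercontent.com/kedeisha1/Challenges/main/coaster_db.csv")

# 1. Number of rows and columns
num_rows, num_cols = df.shape
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

# 2. Missing data: total count of null cells across the whole frame
missing_data = df.isnull().sum().sum()
if missing_data > 0:
    print("There are missing values in the dataset.")
else:
    print("There are no missing values in the dataset.")

# 3. Summary statistics (column names as given in the challenge's rename list)
numeric_columns = ['speed_mph', 'height_ft', 'Inversions_clean', 'Gforce_clean']
summary_stats = df[numeric_columns].describe()
print(summary_stats)

# 4. Rename columns
df.rename(columns={'coaster_name': 'Coaster_Name',
                   'year_introduced': 'Year_Introduced',
                   'opening_date_clean': 'Opening_Date',
                   'speed_mph': 'Speed_mph',
                   'height_ft': 'Height_ft',
                   'Inversions_clean': 'Inversions',
                   'Gforce_clean': 'Gforce'}, inplace=True)

# 5. Duplicated rows
duplicates = df.duplicated().sum()
if duplicates > 0:
    print("There are duplicated rows in the dataset.")
else:
    print("There are no duplicated rows in the dataset.")

# 6. Top 3 years with the most roller coasters introduced
top_years = df['Year_Introduced'].value_counts().head(3)
print("Top 3 years with the most roller coasters introduced:")
print(top_years)

# 7. Average speed and its distribution
average_speed = df['Speed_mph'].mean()
print("Average speed:", average_speed)

plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Speed_mph', kde=True)
plt.title("Distribution of Roller Coaster Speed")
plt.xlabel("Speed (mph)")
plt.ylabel("Count")
plt.show()

# 8. Feature relationships; numeric_only=True skips text columns, which raise an error in pandas 2.x
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()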
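
To answer question 8 in words rather than only by reading a heatmap, one option is to rank the column pairs by the strength of their correlation. The sketch below is only an illustration: it runs on a small made-up DataFrame (not the coaster data) so that it works on its own, and the helper name `ranked_correlations` is hypothetical.

```python
from itertools import combinations

import pandas as pd


def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Return the Pearson correlation of each unique numeric column pair, strongest first."""
    corr = frame.corr(numeric_only=True)
    # combinations() yields each unordered pair once and skips the diagonal.
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    # Sort by absolute value so strong negative correlations rank next to strong positive ones.
    return pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up illustrative values, not taken from coaster_db.csv.
toy = pd.DataFrame({
    "Speed_mph": [30.0, 45.0, 60.0, 75.0, 90.0],
    "Height_ft": [40.0, 80.0, 120.0, 200.0, 300.0],
    "Year_Introduced": [2010, 2000, 1990, 1985, 1970],
    "Coaster_Name": ["A", "B", "C", "D", "E"],
})

result = ranked_correlations(toy)
print(result)
```

On the real data, the same function could be called on `df` after the rename step. The sign of each value shows whether the relationship is positive or negative.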