The dataset used for this analysis is titled “Flight Delay Data for U.S. Airports by Carrier August 2013 - August 2023” and is accessible through Kaggle.
This dataset provides detailed information on flight arrivals and delays at U.S. airports, broken down by carrier. The metrics include the number of arriving flights, delays over 15 minutes, cancellation and diversion counts, and a breakdown of delays attributed to the carrier, weather, the NAS (National Airspace System), security, and late-arriving aircraft. It supports comparing the performance of different carriers at various airports over this period and examining which factors contribute most to delays.
Our mission is to uncover the most critical factors contributing to flight delays over 15 minutes across U.S. airports.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggthemes)
library(tidyr)
library(knitr)
library(summarytools)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(patchwork)
library(ggrepel)
We begin by loading the dataset and inspecting its dimensions, column names, and first rows.
delay <- read.csv("/Users/farzane/Downloads/NewStatistical-2025/Airline_Delay_Cause 4.csv")
dim(delay) # Number of Rows and Columns of the dataset respectively
## [1] 171666 21
names(delay) # Displays the Column names of the dataset
## [1] "year" "month" "carrier"
## [4] "carrier_name" "airport" "airport_name"
## [7] "arr_flights" "arr_del15" "carrier_ct"
## [10] "weather_ct" "nas_ct" "security_ct"
## [13] "late_aircraft_ct" "arr_cancelled" "arr_diverted"
## [16] "arr_delay" "carrier_delay" "weather_delay"
## [19] "nas_delay" "security_delay" "late_aircraft_delay"
head(delay, 10) # Displays the first 10 lines of the dataset
## year month carrier carrier_name airport
## 1 2023 8 9E Endeavor Air Inc. ABE
## 2 2023 8 9E Endeavor Air Inc. ABY
## 3 2023 8 9E Endeavor Air Inc. AEX
## 4 2023 8 9E Endeavor Air Inc. AGS
## 5 2023 8 9E Endeavor Air Inc. ALB
## 6 2023 8 9E Endeavor Air Inc. ATL
## 7 2023 8 9E Endeavor Air Inc. AUS
## 8 2023 8 9E Endeavor Air Inc. AVL
## 9 2023 8 9E Endeavor Air Inc. AZO
## 10 2023 8 9E Endeavor Air Inc. BDL
## airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International 89
## 2 Albany, GA: Southwest Georgia Regional 62
## 3 Alexandria, LA: Alexandria International 62
## 4 Augusta, GA: Augusta Regional at Bush Field 66
## 5 Albany, NY: Albany International 92
## 6 Atlanta, GA: Hartsfield-Jackson Atlanta International 1636
## 7 Austin, TX: Austin - Bergstrom International 75
## 8 Asheville, NC: Asheville Regional 59
## 9 Kalamazoo, MI: Kalamazoo/Battle Creek International 62
## 10 Hartford, CT: Bradley International 30
## arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1 13 2.25 1.60 3.16 0 5.99
## 2 10 1.97 0.04 0.57 0 7.42
## 3 10 2.73 1.18 1.80 0 4.28
## 4 12 3.69 2.27 4.47 0 1.57
## 5 22 7.76 0.00 2.96 0 11.28
## 6 256 55.98 27.81 63.64 0 108.57
## 7 12 5.62 0.97 4.41 0 1.00
## 8 7 3.32 0.00 0.42 0 3.26
## 9 13 6.53 0.94 3.54 0 1.99
## 10 4 0.00 0.82 0.00 0 3.18
## arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1 2 1 1375 71 761 118
## 2 0 1 799 218 1 62
## 3 1 0 766 56 188 78
## 4 1 1 1397 471 320 388
## 5 2 0 1530 628 0 134
## 6 32 11 29768 9339 4557 4676
## 7 0 0 843 535 170 111
## 8 2 0 324 117 0 25
## 9 0 0 707 470 77 87
## 10 1 0 1421 0 532 0
## security_delay late_aircraft_delay
## 1 0 425
## 2 0 518
## 3 0 444
## 4 0 218
## 5 0 768
## 6 0 11196
## 7 0 27
## 8 0 182
## 9 0 73
## 10 0 889
The dataset consists of 171666 observations and 21 variables.
summary(delay)
## year month carrier carrier_name
## Min. :2013 Min. : 1.000 Length:171666 Length:171666
## 1st Qu.:2016 1st Qu.: 4.000 Class :character Class :character
## Median :2019 Median : 7.000 Mode :character Mode :character
## Mean :2019 Mean : 6.494
## 3rd Qu.:2021 3rd Qu.: 9.000
## Max. :2023 Max. :12.000
##
## airport airport_name arr_flights arr_del15
## Length:171666 Length:171666 Min. : 1.0 Min. : 0.00
## Class :character Class :character 1st Qu.: 50.0 1st Qu.: 6.00
## Mode :character Mode :character Median : 100.0 Median : 17.00
## Mean : 362.5 Mean : 66.43
## 3rd Qu.: 250.0 3rd Qu.: 47.00
## Max. :21977.0 Max. :4176.00
## NA's :240 NA's :443
## carrier_ct weather_ct nas_ct security_ct
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 2.16 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.0000
## Median : 6.40 Median : 0.40 Median : 3.91 Median : 0.0000
## Mean : 20.80 Mean : 2.25 Mean : 19.38 Mean : 0.1571
## 3rd Qu.: 17.26 3rd Qu.: 1.86 3rd Qu.: 11.71 3rd Qu.: 0.0000
## Max. :1293.91 Max. :266.42 Max. :1884.42 Max. :58.6900
## NA's :240 NA's :240 NA's :240 NA's :240
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## Min. : 0.00 Min. : 0.00 Min. : 0.0000 Min. : 0
## 1st Qu.: 1.23 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 335
## Median : 5.00 Median : 1.00 Median : 0.0000 Median : 1018
## Mean : 23.77 Mean : 7.53 Mean : 0.8634 Mean : 4240
## 3rd Qu.: 15.26 3rd Qu.: 4.00 3rd Qu.: 1.0000 3rd Qu.: 2884
## Max. :2069.07 Max. :4951.00 Max. :197.0000 Max. :438783
## NA's :240 NA's :240 NA's :240 NA's :240
## carrier_delay weather_delay nas_delay security_delay
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 110 1st Qu.: 0.0 1st Qu.: 34.0 1st Qu.: 0.000
## Median : 375 Median : 18.0 Median : 146.0 Median : 0.000
## Mean : 1437 Mean : 222.6 Mean : 920.6 Mean : 7.383
## 3rd Qu.: 1109 3rd Qu.: 146.0 3rd Qu.: 477.0 3rd Qu.: 0.000
## Max. :196944 Max. :31960.0 Max. :112018.0 Max. :3760.000
## NA's :240 NA's :240 NA's :240 NA's :240
## late_aircraft_delay
## Min. : 0
## 1st Qu.: 65
## Median : 320
## Mean : 1652
## 3rd Qu.: 1070
## Max. :227959
## NA's :240
To better understand the overall performance of flights, we have summarized the key categories of flights in the dataset:
library(dplyr)
flight_summary <- delay %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
total_canceled = sum(arr_cancelled, na.rm = TRUE),
total_diverted = sum(arr_diverted, na.rm = TRUE),
total_delayed = sum(arr_del15, na.rm = TRUE), # Delayed flights (15+ minutes)
on_time_flights = total_flights - (total_canceled + total_diverted + total_delayed) # On-time flights
)
print(flight_summary)
## total_flights total_canceled total_diverted total_delayed on_time_flights
## 1 62146805 1290923 148007 11375095 49332780
This summary gives an overview of the dataset, showing the scale of delays and other disruptions in the U.S. aviation system. The same totals are reshaped and plotted as a pie chart:
# Summarize data to get totals for each category
flight_summary <- delay %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
total_canceled = sum(arr_cancelled, na.rm = TRUE),
total_diverted = sum(arr_diverted, na.rm = TRUE),
total_delayed = sum(arr_del15, na.rm = TRUE)
) %>%
mutate(
on_time_flights = total_flights - (total_canceled + total_diverted + total_delayed)
) %>%
select(total_canceled, total_diverted, total_delayed, on_time_flights) %>%
pivot_longer(everything(), names_to = "category", values_to = "count")
# Create a pie chart
ggplot(flight_summary, aes(x = "", y = count, fill = category)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) + # Polar coordinates for pie chart
scale_fill_manual(values = c("total_canceled" = "red",
"total_diverted" = "darkorange",
"total_delayed" = "darkblue",
"on_time_flights" = "darkgreen")) +
labs(title = "Breakdown of Flight Categories",
fill = "Flight Category") +
theme_void() + # Clean theme for pie charts
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
geom_text(aes(label = paste0(round(count / sum(count) * 100, 1), "%")),
position = position_stack(vjust = 0.5), color = "white", size = 4)
This pie chart shows the breakdown of flight outcomes in the dataset, including on-time flights, delays over 15 minutes, cancellations, and diversions. Based on the totals printed above, about 79.4% of flights are on time, 18.3% are delayed by more than 15 minutes, 2.1% are canceled, and 0.2% are diverted.
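The percentage labels on the pie chart can be checked by hand. The following is a minimal sketch in base R, assuming the four totals printed by `flight_summary` earlier; the vector name `counts` is only illustrative.

```r
# Totals copied from the printed flight_summary output
counts <- c(
  on_time_flights = 49332780,
  total_delayed   = 11375095,
  total_canceled  = 1290923,
  total_diverted  = 148007
)

# Share of each category in percent, rounded the same way as the pie labels
shares <- round(counts / sum(counts) * 100, 1)
print(shares)
```

The four categories add up to the total number of arriving flights, so the shares sum to 100% up to rounding.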
# Aggregate arriving flights by year
flights_by_year <- delay %>%
group_by(year) %>%
summarise(total_arriving_flights = sum(arr_flights, na.rm = TRUE))
# Bar chart: Total number of arriving flights per year
ggplot(flights_by_year, aes(x = factor(year), y = total_arriving_flights)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") + # Added black border for clarity
theme_minimal() +
scale_y_continuous(labels = scales::comma) + # Format y-axis with commas
labs(title = "Total Number of Arriving Flights Per Year",
x = "Year",
y = "Number of Flights") +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12, angle = 45, hjust = 1)
)
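This bar chart shows the total number of arriving flights in the U.S. for each year from 2013 to 2023. It gives a clear view of trends in air traffic over the years. Note that 2013 and 2023 are partial years, since the dataset runs from August 2013 to August 2023, so their totals are not directly comparable with those of full years.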
monthly_summary <- delay %>%
group_by(year, month) %>%
summarise(total_flights = sum(arr_flights, na.rm = TRUE)) %>%
mutate(date = as.Date(paste(year, month, "01", sep = "-"))) # Create Date column
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# Line chart: Total flights over time (monthly)
ggplot(monthly_summary, aes(x = date, y = total_flights)) +
geom_line(color = "blue", linewidth = 1) + # linewidth replaces the deprecated size argument for lines
geom_point(size = 2, color = "blue") +
scale_y_continuous(labels = scales::comma) +
labs(title = "Monthly Total Flights Over Time (Highlighting 2020 Decline)",
x = "Date",
y = "Total Flights") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10)
)
This line chart illustrates the monthly trend in total arriving flights in the U.S., spanning from 2013 to 2023. A striking feature is the dramatic decline in flights during 2020, a direct result of the COVID-19 pandemic’s impact on global air travel. The chart also shows the recovery of air traffic in the subsequent years.
yearly_cancellations <- delay %>%
group_by(year) %>%
summarise(total_cancellations = sum(arr_cancelled, na.rm = TRUE))
ggplot(yearly_cancellations, aes(x = factor(year), y = total_cancellations)) +
geom_bar(stat = "identity", fill = "red", color = "black") +
scale_y_continuous(labels = scales::comma) +
labs(title = "Total Cancellations Per Year",
x = "Year", y = "Number of Cancellations") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10)
)
This bar chart highlights the total number of flight cancellations per year, showcasing significant fluctuations over time. The significant rise in cancellations in 2020 reflects the impact of the COVID-19 pandemic, which disrupted air travel globally. This chart provides a clear visualization of how the aviation industry was impacted during this challenging period and its gradual recovery in the following years.
yearly_delays <- delay %>%
group_by(year) %>%
summarise(total_delays = sum(arr_delay, na.rm = TRUE)) # Sum of total arrival delays
# Bar plot: Total delays per year
ggplot(yearly_delays, aes(x = factor(year), y = total_delays)) +
geom_bar(stat = "identity", fill = "orange", color = "black") +
scale_y_continuous(labels = scales::comma) +
labs(title = "Total Delays Per Year",
x = "Year", y = "Total Delay (in minutes)") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10)
)
This chart illustrates the total minutes of delays per year across U.S. flights. A noticeable drop in total delays is evident in 2020, reflecting the reduced number of flights during the pandemic. With fewer flights in operation that year, overall delays naturally decreased. As flight numbers rebounded in subsequent years, delays also increased, highlighting the correlation between flight volume and total delay time.
# Aggregate arriving flights by month
flights_by_month <- delay %>%
group_by(month) %>%
summarise(total_arriving_flights = sum(arr_flights, na.rm = TRUE))
# Define month names
month_names <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December")
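# Look up each row's month number so labels stay correct even if rows are reordered or a month is missing
flights_by_month$month <- factor(month_names[flights_by_month$month], levels = month_names)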
# Bar chart: Number of flights per month
ggplot(flights_by_month, aes(x = month, y = total_arriving_flights)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) +
labs(title = "Number of Flights by Month",
x = "Month",
y = "Number of Flights") +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
)
This bar chart provides an overview of the total number of arriving flights for each month. It highlights the seasonality in flight traffic, with peaks observed during summer months like July and August, likely driven by vacation travel. Conversely, slightly lower activity may be seen in months such as February or October, reflecting off-peak travel periods. If coverage matches the dataset title (August 2013 to August 2023), August is summed over eleven years and every other month over ten, so part of the August peak reflects coverage rather than demand.
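When month numbers are turned into labels, it is safer to look up each row's own month number than to rely on the rows being sorted and complete. The following is a minimal sketch on made-up totals, using R's built-in `month.name` constant; the `month_label` column is a hypothetical name.

```r
# Toy monthly totals, deliberately unsorted and with most months absent
flights_by_month <- data.frame(
  month = c(12, 1, 7, 2),
  total_arriving_flights = c(400, 100, 900, 150)
)

# Map each month number to its name; the factor levels keep calendar order
flights_by_month$month_label <- factor(
  month.name[flights_by_month$month],
  levels = month.name
)
print(flights_by_month)
```

Because the levels cover all twelve months, a plot built on `month_label` keeps the calendar order regardless of row order.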
delay <- delay %>%
mutate(
season = case_when(
month %in% c(12, 1, 2) ~ "Winter",
month %in% c(3, 4, 5) ~ "Spring",
month %in% c(6, 7, 8) ~ "Summer",
month %in% c(9, 10, 11) ~ "Autumn"
)
)
We’ve categorized the data into four distinct seasons (Winter, Spring, Summer, and Autumn) based on the months. This seasonal grouping helps to analyze and visualize trends more effectively, capturing potential seasonal impacts on flight operations.
Below, the total number of flights for each season is visualized, providing insights into how flight traffic varies throughout the year.
flights_by_season <- delay %>%
group_by(season) %>%
summarise(total_flights = sum(arr_flights, na.rm = TRUE))
# Visualize flights by season
ggplot(flights_by_season, aes(x = season, y = total_flights, fill = season)) +
geom_bar(stat = "identity") +
labs(title = "Total Flights Per Season", x = "Season", y = "Number of Flights") +
theme_minimal() +
scale_fill_brewer(palette = "Pastel1")
The bar chart displays the total number of flights for each season, making it easy to compare the volume of flights across different seasons.
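Since `season` is stored as a character column, ggplot2 orders the bars alphabetically (Autumn, Spring, Summer, Winter). A minimal sketch of putting them in calendar order with a factor is shown below; the totals are made up and `season_levels` is an illustrative name.

```r
# Toy seasonal totals in the alphabetical order a character column would give
flights_by_season <- data.frame(
  season = c("Autumn", "Spring", "Summer", "Winter"),
  total_flights = c(30, 20, 40, 10)
)

# Explicit factor levels control the order used on the x-axis
season_levels <- c("Winter", "Spring", "Summer", "Autumn")
flights_by_season$season <- factor(flights_by_season$season, levels = season_levels)

# Sorting by the factor follows the level order, not the alphabet
flights_by_season <- flights_by_season[order(flights_by_season$season), ]
print(flights_by_season)
```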
# Aggregate arriving flights by airline
flights_by_airline <- delay %>%
group_by(carrier_name) %>%
summarise(total_arriving_flights = sum(arr_flights, na.rm = TRUE)) %>%
arrange(desc(total_arriving_flights))
# Bar chart: Number of flights by each airline
ggplot(flights_by_airline, aes(x = reorder(carrier_name, -total_arriving_flights), y = total_arriving_flights)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) + # Format y-axis with commas
labs(title = "Number of Flights by Each Airline",
x = "Airline",
y = "Number of Flights") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(size = 10))
This chart showcases the number of flights handled by each airline over the observed period. Southwest Airlines leads as the airline with the highest number of flights, reflecting its large-scale operations and extensive network.
To remove 2020 data and ensure no anomalies interfere with the analysis:
# Subset data excluding the year 2020
delay_no2020 <- delay %>%
filter(year != 2020)
The year 2020 brought unprecedented disruptions to the aviation industry due to the COVID-19 pandemic. With strict travel restrictions, reduced demand, and widespread cancellations, flight operations during this period deviated significantly from typical patterns observed in other years. These anomalies could skew the analysis and affect the accuracy of insights derived from the data.
To ensure a more representative and consistent analysis, we have decided to exclude the data from 2020 in subsequent visualizations and calculations.
summary_filtered <- delay_no2020 %>%
summarise(
total_flights = sum(arr_flights, na.rm = TRUE),
total_canceled = sum(arr_cancelled, na.rm = TRUE),
total_diverted = sum(arr_diverted, na.rm = TRUE),
total_delayed = sum(arr_del15, na.rm = TRUE),
on_time_flights = total_flights - (total_canceled + total_diverted + total_delayed)
)
print(summary_filtered)
## total_flights total_canceled total_diverted total_delayed on_time_flights
## 1 57458451 1009889 140263 10943174 45365125
After excluding the anomalous year 2020 from the dataset, the key metrics were recalculated to provide a more consistent view of flight operations:
library(knitr)
summary_combined <- data.frame(
Metric = c("Total Flights", "Total Canceled", "Total Diverted",
"Total Delayed (15+ min)", "On-Time Flights"),
With_2020 = c(62146805, 1290923, 148007, 11375095, 49332780),
Without_2020 = c(57458451, 1009889, 140263, 10943174, 45365125)
)
kable(summary_combined,
caption = "Comparison of Flight Metrics Before and After Excluding 2020",
format = "simple")
Comparison of Flight Metrics Before and After Excluding 2020

| Metric | With_2020 | Without_2020 |
|---|---|---|
| Total Flights | 62146805 | 57458451 |
| Total Canceled | 1290923 | 1009889 |
| Total Diverted | 148007 | 140263 |
| Total Delayed (15+ min) | 11375095 | 10943174 |
| On-Time Flights | 49332780 | 45365125 |
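The two columns of the comparison also give the 2020 figures on their own, by subtraction. The sketch below derives them and compares cancellation rates; it uses only the totals in the table, and the names `only_2020` and `cancel_rate` are illustrative.

```r
# Totals copied from the comparison table
metric       <- c("flights", "canceled", "diverted", "delayed", "on_time")
with_2020    <- c(62146805, 1290923, 148007, 11375095, 49332780)
without_2020 <- c(57458451, 1009889, 140263, 10943174, 45365125)
names(with_2020) <- names(without_2020) <- metric

# 2020 on its own is the difference between the two columns
only_2020 <- with_2020 - without_2020

# Cancellation rate: canceled arrivals as a share of all arriving flights
cancel_rate <- c(
  year_2020   = only_2020[["canceled"]] / only_2020[["flights"]],
  other_years = without_2020[["canceled"]] / without_2020[["flights"]]
)
print(only_2020)
print(round(cancel_rate * 100, 2))
```

On these totals the 2020 cancellation rate comes out at roughly 6.0%, against roughly 1.8% for the remaining years, which is consistent with treating 2020 as an anomaly.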
# Group by month and calculate total delays
monthly_delays <- delay_no2020 %>%
group_by(month) %>%
summarise(total_delays = sum(arr_del15, na.rm = TRUE))
# Bar chart to visualize total delays by month
ggplot(monthly_delays, aes(x = factor(month), y = total_delays)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(
title = "Total Delays Over 15 Minutes by Month",
x = "Month",
y = "Total Delays (15+ minutes)"
) +
theme_minimal()
The pattern is logical: months with higher flight volumes often experience more delays. This is expected, as increased air traffic places more strain on resources, leading to a higher likelihood of delays. For example, summer months or holiday seasons, which tend to see more flights, also report higher delays, aligning with the earlier observation about the busiest seasons.
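Raw delay counts rise with traffic almost by construction, so it can help to look at the delay rate (delayed arrivals divided by arriving flights) as well. The following is a minimal sketch on made-up rows, in base R; `toy` and `delay_rate` are hypothetical names and the numbers are not taken from the dataset.

```r
# Toy rows: two carrier-airport records for January and two for July
toy <- data.frame(
  month       = c(1, 1, 7, 7),
  arr_flights = c(100, 300, 500, 500),
  arr_del15   = c(20, 60, 150, 250)
)

# Sum flights and delays per month, then divide to get the rate
monthly <- aggregate(cbind(arr_flights, arr_del15) ~ month, data = toy, FUN = sum)
monthly$delay_rate <- monthly$arr_del15 / monthly$arr_flights
print(monthly)
```

A month can have more delays simply because it has more flights; the rate shows whether a flight in that month is also more likely to be delayed.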
missing_values <- colSums(is.na(delay_no2020))
print(missing_values)
## year month carrier carrier_name
## 0 0 0 0
## airport airport_name arr_flights arr_del15
## 0 0 150 183
## carrier_ct weather_ct nas_ct security_ct
## 150 150 150 150
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## 150 150 150 150
## carrier_delay weather_delay nas_delay security_delay
## 150 150 150 150
## late_aircraft_delay season
## 150 0
Before analyzing the data, we checked for missing values to ensure data quality. The identifier columns (year, month, carrier, airport, and their names) are complete. arr_del15 has 183 missing values, while arr_flights and all other count and delay columns, such as carrier_ct, arr_delay, and carrier_delay, have 150 missing values each.
missing_percentage <- (colSums(is.na(delay_no2020)) / nrow(delay_no2020)) * 100
# Display the missing percentage for each column
print(missing_percentage)
## year month carrier carrier_name
## 0.00000000 0.00000000 0.00000000 0.00000000
## airport airport_name arr_flights arr_del15
## 0.00000000 0.00000000 0.09816304 0.11975891
## carrier_ct weather_ct nas_ct security_ct
## 0.09816304 0.09816304 0.09816304 0.09816304
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## 0.09816304 0.09816304 0.09816304 0.09816304
## carrier_delay weather_delay nas_delay security_delay
## 0.09816304 0.09816304 0.09816304 0.09816304
## late_aircraft_delay season
## 0.09816304 0.00000000
# Sort and display columns with missing values in descending order
missing_percentage_sorted <- sort(missing_percentage[missing_percentage > 0], decreasing = TRUE)
print(missing_percentage_sorted)
## arr_del15 arr_flights carrier_ct weather_ct
## 0.11975891 0.09816304 0.09816304 0.09816304
## nas_ct security_ct late_aircraft_ct arr_cancelled
## 0.09816304 0.09816304 0.09816304 0.09816304
## arr_diverted arr_delay carrier_delay weather_delay
## 0.09816304 0.09816304 0.09816304 0.09816304
## nas_delay security_delay late_aircraft_delay
## 0.09816304 0.09816304 0.09816304
We calculated the percentage of missing values for each column compared to the total data. Columns like arr_del15 have the highest missing percentage at 0.12%, while others, such as arr_flights and carrier_ct, have around 0.10% missing values. Most columns have minimal missing data, which will be addressed during preprocessing.
sum(is.na(delay_no2020)) # Total number of missing values in the dataset
## [1] 2283
For handling missing values, we replaced the missing entries in numeric columns with their respective median values. This ensures that the dataset remains complete while minimizing the impact on the data’s overall distribution.
# Identify numeric columns and get their names
numeric_columns <- names(delay_no2020)[sapply(delay_no2020, is.numeric)]
# Replace NA values in numeric columns with the median
delay_no2020[numeric_columns] <- lapply(delay_no2020[numeric_columns], function(x) {
replace(x, is.na(x), median(x, na.rm = TRUE))
})
colSums(is.na(delay_no2020))
## year month carrier carrier_name
## 0 0 0 0
## airport airport_name arr_flights arr_del15
## 0 0 0 0
## carrier_ct weather_ct nas_ct security_ct
## 0 0 0 0
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## 0 0 0 0
## carrier_delay weather_delay nas_delay security_delay
## 0 0 0 0
## late_aircraft_delay season
## 0 0
sum(is.na(delay_no2020))
## [1] 0
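One thing worth checking after median imputation is whether the filled values still respect the constraint that delayed arrivals cannot exceed arriving flights. The sketch below uses made-up numbers to show how a column median can break that constraint on rows with very few flights; `toy`, `imputed`, and `impossible` are hypothetical names.

```r
# Toy data: two low-traffic rows with a missing delay count
toy <- data.frame(
  arr_flights = c(1, 2, 100, 80, 120),
  arr_del15   = c(NA, NA, 25, 19, 10)
)

# Median of the observed values, used to fill the gaps
med <- median(toy$arr_del15, na.rm = TRUE)
imputed <- replace(toy$arr_del15, is.na(toy$arr_del15), med)

# Flag rows where the imputed count exceeds the number of arriving flights
impossible <- imputed > toy$arr_flights
print(data.frame(toy, imputed, impossible))
```

Rows flagged this way could be dropped, or the imputed value could be capped at `arr_flights`; which option is appropriate depends on the analysis.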
To normalize the delays for varying flight volumes, we calculate the ratio of flights delayed by 15+ minutes to the total number of arriving flights. Since this ratio can vary widely, we apply the natural logarithm to stabilize the variance.
Before calculating the ratio, we filter out rows where arr_flights or arr_del15 is zero, to avoid division errors and undefined logarithms.
\[ \text{delay\_ratio} = \log\left(\frac{\text{arr\_del15}}{\text{arr\_flights}}\right) \]
delay_no2020_ratio <- delay_no2020 %>%
filter(arr_flights > 0, arr_del15 > 0) %>%
mutate(delay_ratio = log(arr_del15 / arr_flights))
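As a worked check of the formula, the sketch below applies it to three records from the first rows of the dataset (ABE, ABY, and ATL for Endeavor Air in August 2023) and shows why zero counts have to be filtered first. The vector names are illustrative.

```r
# arr_flights and arr_del15 for ABE, ABY and ATL, taken from the head() output
arr_flights <- c(89, 62, 1636)
arr_del15   <- c(13, 10, 256)

# Natural log of the delayed share of arriving flights
delay_ratio <- log(arr_del15 / arr_flights)
print(round(delay_ratio, 6))

# Without the filter, a row with zero delays would give log(0) = -Inf
print(log(0 / 50))
```

The values are negative because the delayed share is below 1; a ratio of exactly 1 (every flight delayed) would give 0.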
After computing the ratio, inspect the first few rows and the summary statistics to ensure correctness:
# View the first few rows
head(delay_no2020_ratio)
## year month carrier carrier_name airport
## 1 2023 8 9E Endeavor Air Inc. ABE
## 2 2023 8 9E Endeavor Air Inc. ABY
## 3 2023 8 9E Endeavor Air Inc. AEX
## 4 2023 8 9E Endeavor Air Inc. AGS
## 5 2023 8 9E Endeavor Air Inc. ALB
## 6 2023 8 9E Endeavor Air Inc. ATL
## airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International 89
## 2 Albany, GA: Southwest Georgia Regional 62
## 3 Alexandria, LA: Alexandria International 62
## 4 Augusta, GA: Augusta Regional at Bush Field 66
## 5 Albany, NY: Albany International 92
## 6 Atlanta, GA: Hartsfield-Jackson Atlanta International 1636
## arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1 13 2.25 1.60 3.16 0 5.99
## 2 10 1.97 0.04 0.57 0 7.42
## 3 10 2.73 1.18 1.80 0 4.28
## 4 12 3.69 2.27 4.47 0 1.57
## 5 22 7.76 0.00 2.96 0 11.28
## 6 256 55.98 27.81 63.64 0 108.57
## arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1 2 1 1375 71 761 118
## 2 0 1 799 218 1 62
## 3 1 0 766 56 188 78
## 4 1 1 1397 471 320 388
## 5 2 0 1530 628 0 134
## 6 32 11 29768 9339 4557 4676
## security_delay late_aircraft_delay season delay_ratio
## 1 0 425 Summer -1.923687
## 2 0 518 Summer -1.824549
## 3 0 444 Summer -1.824549
## 4 0 218 Summer -1.704748
## 5 0 768 Summer -1.430746
## 6 0 11196 Summer -1.854832
# Check the summary statistics of every column, including delay_ratio
summary(delay_no2020_ratio)
## year month carrier carrier_name
## Min. :2013 Min. : 1.000 Length:148384 Length:148384
## 1st Qu.:2016 1st Qu.: 4.000 Class :character Class :character
## Median :2018 Median : 7.000 Mode :character Mode :character
## Mean :2018 Mean : 6.521
## 3rd Qu.:2021 3rd Qu.: 9.000
## Max. :2023 Max. :12.000
## airport airport_name arr_flights arr_del15
## Length:148384 Length:148384 Min. : 1.0 Min. : 1.00
## Class :character Class :character 1st Qu.: 56.0 1st Qu.: 8.00
## Mode :character Mode :character Median : 113.0 Median : 20.00
## Mean : 387.1 Mean : 73.77
## 3rd Qu.: 270.0 3rd Qu.: 53.00
## Max. :21977.0 Max. :4176.00
## carrier_ct weather_ct nas_ct security_ct
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 2.96 1st Qu.: 0.000 1st Qu.: 1.43 1st Qu.: 0.0000
## Median : 7.46 Median : 0.610 Median : 4.57 Median : 0.0000
## Mean : 22.98 Mean : 2.468 Mean : 21.42 Mean : 0.1711
## 3rd Qu.: 19.46 3rd Qu.: 2.000 3rd Qu.: 13.23 3rd Qu.: 0.0000
## Max. :1293.91 Max. :266.420 Max. :1884.42 Max. :58.6900
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## Min. : 0.00 Min. : 0.000 Min. : 0.0000 Min. : 0
## 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 453
## Median : 6.14 Median : 1.000 Median : 0.0000 Median : 1210
## Mean : 26.73 Mean : 6.802 Mean : 0.9445 Mean : 4718
## 3rd Qu.: 17.70 3rd Qu.: 4.000 3rd Qu.: 1.0000 3rd Qu.: 3277
## Max. :2069.07 Max. :1565.000 Max. :197.0000 Max. :438783
## carrier_delay weather_delay nas_delay security_delay
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.000
## 1st Qu.: 150 1st Qu.: 0.0 1st Qu.: 51 1st Qu.: 0.000
## Median : 445 Median : 27.0 Median : 175 Median : 0.000
## Mean : 1585 Mean : 244.4 Mean : 1024 Mean : 8.104
## 3rd Qu.: 1249 3rd Qu.: 169.0 3rd Qu.: 546 3rd Qu.: 0.000
## Max. :196944 Max. :31960.0 Max. :112018 Max. :3760.000
## late_aircraft_delay season delay_ratio
## Min. : 0 Length:148384 Min. :-4.771
## 1st Qu.: 110 Class :character 1st Qu.:-2.052
## Median : 401 Mode :character Median :-1.689
## Mean : 1856 Mean :-1.748
## 3rd Qu.: 1245 3rd Qu.:-1.382
## Max. :227959 Max. : 2.944
The summary of delay_ratio (a natural log, so exp() converts each value back to a share of arriving flights) shows:
The lowest value is -4.771, a delayed share of under 1%, meaning very few delays compared to total flights in some cases.
The mean delay ratio is around -1.748, a delayed share of roughly 17% (as a geometric mean), showing delays are generally low relative to flights.
Half of the data falls below -1.689 (median, about 18%), and three quarters of the values are below -1.382 (3rd quartile, about 25%).
The highest delay ratio is 2.944, which equals log(19): 19 recorded delays per arriving flight, which cannot occur in valid data.
This suggests that delays are usually low, but a few rows carry implausibly high values that need investigation.
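Because delay_ratio is on the log scale, the summary is easier to read after converting back with `exp()`. The following is a minimal sketch using the rounded statistics from the summary output; `log_stats` and `shares` are illustrative names.

```r
# Summary statistics of delay_ratio, copied from the summary() output
log_stats <- c(min = -4.771, q1 = -2.052, median = -1.689,
               mean = -1.748, q3 = -1.382, max = 2.944)

# exp() undoes the natural log and returns arr_del15 / arr_flights
shares <- exp(log_stats)
print(round(shares, 3))
```

Note that exponentiating the mean of the logs gives the geometric mean of the ratio, not its arithmetic mean. Any back-transformed value above 1 corresponds to more delayed arrivals than arriving flights.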
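To identify and investigate extreme cases, we list the rows where delay_ratio exceeds 1. Because delay_ratio is a logarithm, any value above 0 already means that arr_del15 exceeds arr_flights; a value above 1 means the recorded delays are more than e ≈ 2.72 times the number of arriving flights: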
delay_no2020_ratio %>%
filter(delay_ratio > 1) # Example: Identify extreme cases
## year month carrier carrier_name airport
## 1 2022 10 OO SkyWest Airlines Inc. PHL
## 2 2022 10 YV Mesa Airlines Inc. JAN
## 3 2022 4 YV Mesa Airlines Inc. MLB
## 4 2022 4 YV Mesa Airlines Inc. SHV
## 5 2022 1 YV Mesa Airlines Inc. BOS
## 6 2022 1 YV Mesa Airlines Inc. BTV
## 7 2022 1 YV Mesa Airlines Inc. BUF
## 8 2022 1 YX Republic Airline GRB
## 9 2021 10 OO SkyWest Airlines Inc. CYS
## 10 2021 9 OO SkyWest Airlines Inc. TYR
## 11 2021 2 AA American Airlines Inc. LIH
## 12 2019 3 EV ExpressJet Airlines Inc. FAR
## 13 2019 3 EV ExpressJet Airlines Inc. PHL
## 14 2019 2 DL Delta Air Lines Inc. CID
## 15 2019 1 OO SkyWest Airlines Inc. GPT
## 16 2019 1 EV ExpressJet Airlines Inc. SAF
## 17 2018 8 EV ExpressJet Airlines Inc. PIA
## 18 2018 7 UA United Air Lines Inc. SBA
## 19 2018 4 EV ExpressJet Airlines Inc. RSW
## 20 2017 10 EV ExpressJet Airlines Inc. JFK
## 21 2017 9 EV ExpressJet Airlines Inc. AUS
## 22 2017 9 EV ExpressJet Airlines Inc. MSY
## 23 2017 9 UA United Air Lines Inc. EGE
## 24 2017 5 VX Virgin America SLC
## 25 2016 12 OO SkyWest Airlines Inc. PNS
## 26 2016 9 AA American Airlines Inc. BTV
## 27 2016 9 EV ExpressJet Airlines Inc. RSW
## 28 2015 11 MQ Envoy Air BPT
## 29 2015 3 MQ Envoy Air GUC
## 30 2015 1 OO SkyWest Airlines Inc. EVV
## 31 2014 3 OO SkyWest Airlines Inc. GFK
## 32 2013 9 EV ExpressJet Airlines Inc. TPA
## airport_name arr_flights
## 1 Philadelphia, PA: Philadelphia International 1
## 2 Jackson/Vicksburg, MS: Jackson Medgar Wiley Evers International 1
## 3 Melbourne, FL: Melbourne International 1
## 4 Shreveport, LA: Shreveport Regional 1
## 5 Boston, MA: Logan International 2
## 6 Burlington, VT: Burlington International 1
## 7 Buffalo, NY: Buffalo Niagara International 1
## 8 Green Bay, WI: Green Bay Austin Straubel International 2
## 9 Cheyenne, WY: Cheyenne Regional/Jerry Olson Field 2
## 10 Tyler, TX: Tyler Pounds Regional 1
## 11 Lihue, HI: Lihue Airport 2
## 12 Fargo, ND: Hector International 1
## 13 Philadelphia, PA: Philadelphia International 1
## 14 Cedar Rapids/Iowa City, IA: The Eastern Iowa 1
## 15 Gulfport/Biloxi, MS: Gulfport-Biloxi International 1
## 16 Santa Fe, NM: Santa Fe Municipal 4
## 17 Peoria, IL: General Downing - Peoria International 1
## 18 Santa Barbara, CA: Santa Barbara Municipal 1
## 19 Fort Myers, FL: Southwest Florida International 1
## 20 New York, NY: John F. Kennedy International 1
## 21 Austin, TX: Austin - Bergstrom International 1
## 22 New Orleans, LA: Louis Armstrong New Orleans International 2
## 23 Eagle, CO: Eagle County Regional 1
## 24 Salt Lake City, UT: Salt Lake City International 1
## 25 Pensacola, FL: Pensacola International 1
## 26 Burlington, VT: Burlington International 1
## 27 Fort Myers, FL: Southwest Florida International 1
## 28 Beaumont/Port Arthur, TX: Jack Brooks Regional 1
## 29 Gunnison, CO: Gunnison-Crested Butte Regional 1
## 30 Evansville, IN: Evansville Regional 1
## 31 Grand Forks, ND: Grand Forks International 1
## 32 Tampa, FL: Tampa International 1
## arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1 19 0 0 0 0 0
## 2 19 0 0 0 0 0
## 3 19 0 0 0 0 0
## 4 19 0 0 0 0 0
## 5 19 0 0 0 0 0
## 6 19 0 0 0 0 0
## 7 19 0 0 0 0 0
## 8 19 0 0 0 0 0
## 9 19 0 0 0 0 0
## 10 19 0 0 0 0 0
## 11 19 0 0 0 0 0
## 12 19 0 0 0 0 0
## 13 19 0 0 0 0 0
## 14 19 0 0 0 0 0
## 15 19 0 0 0 0 0
## 16 19 0 0 0 0 0
## 17 19 0 0 0 0 0
## 18 19 0 0 0 0 0
## 19 19 0 0 0 0 0
## 20 19 0 0 0 0 0
## 21 19 0 0 0 0 0
## 22 19 0 0 0 0 0
## 23 19 0 0 0 0 0
## 24 19 0 0 0 0 0
## 25 19 0 0 0 0 0
## 26 19 0 0 0 0 0
## 27 19 0 0 0 0 0
## 28 19 0 0 0 0 0
## 29 19 0 0 0 0 0
## 30 19 0 0 0 0 0
## 31 19 0 0 0 0 0
## 32 19 0 0 0 0 0
## arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1 1 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 1 0 0 0 0 0
## 4 1 0 0 0 0 0
## 5 2 0 0 0 0 0
## 6 1 0 0 0 0 0
## 7 1 0 0 0 0 0
## 8 2 0 0 0 0 0
## 9 2 0 0 0 0 0
## 10 1 0 0 0 0 0
## 11 2 0 0 0 0 0
## 12 1 0 0 0 0 0
## 13 1 0 0 0 0 0
## 14 0 1 0 0 0 0
## 15 1 0 0 0 0 0
## 16 4 0 0 0 0 0
## 17 1 0 0 0 0 0
## 18 0 1 0 0 0 0
## 19 0 1 0 0 0 0
## 20 0 1 0 0 0 0
## 21 1 0 0 0 0 0
## 22 2 0 0 0 0 0
## 23 1 0 0 0 0 0
## 24 0 1 0 0 0 0
## 25 1 0 0 0 0 0
## 26 1 0 0 0 0 0
## 27 0 1 0 0 0 0
## 28 0 1 0 0 0 0
## 29 1 0 0 0 0 0
## 30 1 0 0 0 0 0
## 31 1 0 0 0 0 0
## 32 0 1 0 0 0 0
## security_delay late_aircraft_delay season delay_ratio
## 1 0 0 Autumn 2.944439
## 2 0 0 Autumn 2.944439
## 3 0 0 Spring 2.944439
## 4 0 0 Spring 2.944439
## 5 0 0 Winter 2.251292
## 6 0 0 Winter 2.944439
## 7 0 0 Winter 2.944439
## 8 0 0 Winter 2.251292
## 9 0 0 Autumn 2.251292
## 10 0 0 Autumn 2.944439
## 11 0 0 Winter 2.251292
## 12 0 0 Spring 2.944439
## 13 0 0 Spring 2.944439
## 14 0 0 Winter 2.944439
## 15 0 0 Winter 2.944439
## 16 0 0 Winter 1.558145
## 17 0 0 Summer 2.944439
## 18 0 0 Summer 2.944439
## 19 0 0 Spring 2.944439
## 20 0 0 Autumn 2.944439
## 21 0 0 Autumn 2.944439
## 22 0 0 Autumn 2.251292
## 23 0 0 Autumn 2.944439
## 24 0 0 Spring 2.944439
## 25 0 0 Winter 2.944439
## 26 0 0 Autumn 2.944439
## 27 0 0 Autumn 2.944439
## 28 0 0 Autumn 2.944439
## 29 0 0 Spring 2.944439
## 30 0 0 Winter 2.944439
## 31 0 0 Spring 2.944439
## 32 0 0 Autumn 2.944439
We identified rows where delay_ratio > 1. Because delay_ratio is stored on the log scale, log(arr_del15 / arr_flights), these are records in which the number of flights delayed by 15+ minutes is more than e ≈ 2.72 times the number of arriving flights, which is logically impossible. We filtered out these rows to improve data consistency.
# Filter out rows with delay_ratio > 1 (log scale)
delay_no2020_ratio <- delay_no2020_ratio %>%
  filter(delay_ratio <= 1)
This step removes the most extreme anomalies. Records with 0 < delay_ratio <= 1, where arr_del15 still exceeds arr_flights, pass this filter, as the maximum of 0.9985 in the summary below shows.
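To make the log-scale threshold concrete, here is a minimal sketch on made-up counts (the data frame `toy` and its values are hypothetical, not taken from the dataset). It flags a record as impossible when arr_del15 exceeds arr_flights and compares that with the `> 1` cut-off applied to the logged ratio.

```r
# Toy records (hypothetical values, not taken from the dataset)
toy <- data.frame(
  arr_flights = c(100, 10, 4, 1),
  arr_del15   = c(20, 15, 19, 19)
)

# Raw share of delayed arrivals and its log, as stored in delay_ratio
toy$raw_ratio   <- toy$arr_del15 / toy$arr_flights
toy$delay_ratio <- log(toy$raw_ratio)

# A row is impossible when arr_del15 > arr_flights, i.e. log ratio > 0
toy$impossible      <- toy$delay_ratio > 0
# The threshold used in the filter, applied on the log scale
toy$dropped_by_gt_1 <- toy$delay_ratio > 1

print(toy)
```

In this toy example the second record (15 delays out of 10 arrivals) is impossible but survives the `> 1` cut-off. If the goal is to drop every such record, `filter(delay_ratio <= 0)` would be the matching condition; that is a suggestion, not what the analysis here ran.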
summary(delay_no2020_ratio$delay_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.7707 -2.0523 -1.6895 -1.7491 -1.3825 0.9985
ggplot(delay_no2020_ratio, aes(x = arr_flights, y = arr_del15)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  facet_wrap(~season) +
  labs(
    title = "Seasonal Variation in Delays Over 15 Minutes",
    x = "Number of Flights (arr_flights)",
    y = "Delays > 15 Minutes (arr_del15)"
  ) +
  theme_minimal()
The plot shows the connection between the number of flights and delays over 15 minutes, separated by season (Autumn, Spring, Summer, Winter). Here’s what it tells us:
1- More Flights = More Delays
2- Seasons Look Similar
3- Outliers
To better understand the contribution of different factors to delays, we normalize each delay type by the total number of flights. This ensures that the comparison is fair across different flight volumes. Additionally, the delay ratio (log-transformed) is used as the target variable for further analysis.
delay_normalized <- delay_no2020_ratio %>%
  # Keep rows where every count is positive so that each ratio has a finite log
  filter(arr_flights > 0, carrier_ct > 0, weather_ct > 0, nas_ct > 0,
         security_ct > 0, late_aircraft_ct > 0) %>%
  mutate(
    delay_ratio = log(arr_del15 / arr_flights),           # Target variable (log-transformed)
    carrier_ratio = carrier_ct / arr_flights,             # Normalize carrier delays
    weather_ratio = weather_ct / arr_flights,             # Normalize weather delays
    nas_ratio = nas_ct / arr_flights,                     # Normalize NAS delays
    security_ratio = security_ct / arr_flights,           # Normalize security delays
    late_aircraft_ratio = late_aircraft_ct / arr_flights  # Normalize late aircraft delays
  )
To visually identify skewness for the new predictors (carrier_ratio, weather_ratio, etc.):
library(ggplot2)
# Convert data to long format for plotting
normalized_long <- delay_normalized %>%
  select(delay_ratio, carrier_ratio, weather_ratio, nas_ratio, security_ratio, late_aircraft_ratio) %>%
  pivot_longer(everything(), names_to = "RatioType", values_to = "Value")
# Plot histograms
ggplot(normalized_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.7) +
  facet_wrap(~RatioType, scales = "free") +
  labs(
    title = "Distribution of Normalized Delay Ratios",
    x = "Value",
    y = "Count"
  ) +
  theme_minimal()
This visualization displays the distributions of normalized delay ratios for various delay factors, including carrier, weather, NAS, security, and late aircraft delays. Each histogram represents the contribution of a specific factor to delays relative to the total number of flights.
Skewness:
The distributions of the five cause ratios exhibit positive skewness (right-skewed): most values are clustered near zero, with a few higher values extending to the right. This indicates that for most carrier-airport-month records, the share of flights delayed by each factor is small, but occasional outliers have much larger shares.
Notable Factors:
Carrier, late aircraft, and NAS ratios show a wider spread, indicating greater variability in their contribution to delays. Security and weather ratios are much smaller and more tightly distributed, suggesting they contribute little to delays in most records.
Implications:
The skewed distributions highlight that while delay shares are generally low, some extreme cases significantly impact overall performance. This analysis helps pinpoint which delay types have more variability and need targeted intervention to minimize their effects.
Applying log transformation to the normalized delay ratios reduces skewness and stabilizes variance, making the data more suitable for further analysis.
delay_normalized_log <- delay_normalized %>%
  mutate(
    log_carrier_ratio = log(carrier_ratio),
    log_weather_ratio = log(weather_ratio),
    log_nas_ratio = log(nas_ratio),
    log_security_ratio = log(security_ratio),
    log_late_aircraft_ratio = log(late_aircraft_ratio)
  )
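The effect of a log transformation on skewness can be quantified rather than judged by eye. The sketch below uses simulated log-normal values as a stand-in for a ratio such as carrier_ratio; the helper `skewness()` and the simulation parameters are assumptions made for illustration, not part of the analysis.

```r
# Sample skewness: third central moment divided by the 1.5 power of the variance
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
# Hypothetical right-skewed ratios (log-normal), standing in for carrier_ratio
ratio <- rlnorm(5000, meanlog = -2.7, sdlog = 0.5)

skew_raw <- skewness(ratio)       # clearly positive for right-skewed data
skew_log <- skewness(log(ratio))  # close to zero after the transformation
c(raw = skew_raw, log = skew_log)
```

Applying the same helper to the real columns (for example `skewness(delay_normalized_log$log_carrier_ratio)`) would show how far the transformation actually goes for each ratio; a long left tail after logging would show up as negative skewness.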
summary(delay_normalized_log)
## year month carrier carrier_name
## Min. :2013 Min. : 1.000 Length:16554 Length:16554
## 1st Qu.:2016 1st Qu.: 4.000 Class :character Class :character
## Median :2019 Median : 7.000 Mode :character Mode :character
## Mean :2019 Mean : 6.509
## 3rd Qu.:2021 3rd Qu.: 9.000
## Max. :2023 Max. :12.000
## airport airport_name arr_flights arr_del15
## Length:16554 Length:16554 Min. : 7 Min. : 3.0
## Class :character Class :character 1st Qu.: 291 1st Qu.: 64.0
## Mode :character Mode :character Median : 731 Median : 157.0
## Mean : 1636 Mean : 319.7
## 3rd Qu.: 2098 3rd Qu.: 412.0
## Max. :21977 Max. :4176.0
## carrier_ct weather_ct nas_ct security_ct
## Min. : 0.03 Min. : 0.010 Min. : 0.02 Min. : 0.010
## 1st Qu.: 20.79 1st Qu.: 1.520 1st Qu.: 13.49 1st Qu.: 0.460
## Median : 49.90 Median : 3.950 Median : 38.66 Median : 1.000
## Mean : 93.08 Mean : 9.985 Mean : 92.14 Mean : 1.362
## 3rd Qu.: 120.66 3rd Qu.: 10.280 3rd Qu.: 110.20 3rd Qu.: 1.600
## Max. :1293.91 Max. :266.420 Max. :1884.42 Max. :41.970
## late_aircraft_ct arr_cancelled arr_diverted arr_delay
## Min. : 0.03 Min. : 0.00 Min. : 0.000 Min. : 85
## 1st Qu.: 18.39 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 3783
## Median : 51.53 Median : 8.00 Median : 1.000 Median : 9503
## Mean : 123.10 Mean : 30.02 Mean : 4.003 Mean : 20485
## 3rd Qu.: 150.92 3rd Qu.: 27.00 3rd Qu.: 4.000 3rd Qu.: 25255
## Max. :2069.07 Max. :1565.00 Max. :197.000 Max. :438783
## carrier_delay weather_delay nas_delay security_delay
## Min. : 1 Min. : 1.0 Min. : 1 Min. : 1.00
## 1st Qu.: 1260 1st Qu.: 104.0 1st Qu.: 537 1st Qu.: 15.00
## Median : 3139 Median : 324.0 Median : 1616 Median : 32.00
## Mean : 6645 Mean : 980.4 Mean : 4385 Mean : 64.91
## 3rd Qu.: 8090 3rd Qu.: 925.0 3rd Qu.: 4942 3rd Qu.: 74.00
## Max. :196944 Max. :26446.0 Max. :112018 Max. :3760.00
## late_aircraft_delay season delay_ratio carrier_ratio
## Min. : 1 Length:16554 Min. :-3.6068 Min. :0.0004839
## 1st Qu.: 1270 Class :character 1st Qu.:-1.8157 1st Qu.:0.0456990
## Median : 3605 Mode :character Median :-1.5413 Median :0.0647796
## Mean : 8409 Mean :-1.5655 Mean :0.0715220
## 3rd Qu.: 10174 3rd Qu.:-1.2854 3rd Qu.:0.0896030
## Max. :227959 Max. :-0.3185 Max. :0.4498214
## weather_ratio nas_ratio security_ratio late_aircraft_ratio
## Min. :6.040e-06 Min. :0.000032 Min. :2.620e-06 Min. :0.0001667
## 1st Qu.:2.625e-03 1st Qu.:0.032922 1st Qu.:4.470e-04 1st Qu.:0.0479366
## Median :5.721e-03 Median :0.053118 Median :1.087e-03 Median :0.0717587
## Mean :8.681e-03 Mean :0.063571 Mean :2.478e-03 Mean :0.0790836
## 3rd Qu.:1.110e-02 3rd Qu.:0.082959 3rd Qu.:2.687e-03 3rd Qu.:0.1027813
## Max. :1.553e-01 Max. :0.372857 Max. :1.264e-01 Max. :0.4900000
## log_carrier_ratio log_weather_ratio log_nas_ratio log_security_ratio
## Min. :-7.6337 Min. :-12.017 Min. :-10.3494 Min. :-12.853
## 1st Qu.:-3.0857 1st Qu.: -5.943 1st Qu.: -3.4136 1st Qu.: -7.713
## Median :-2.7368 Median : -5.164 Median : -2.9352 Median : -6.825
## Mean :-2.7615 Mean : -5.274 Mean : -3.0137 Mean : -6.854
## 3rd Qu.:-2.4124 3rd Qu.: -4.501 3rd Qu.: -2.4894 3rd Qu.: -5.919
## Max. :-0.7989 Max. : -1.862 Max. : -0.9866 Max. : -2.069
## log_late_aircraft_ratio
## Min. :-8.6995
## 1st Qu.:-3.0379
## Median :-2.6344
## Mean :-2.7000
## 3rd Qu.:-2.2752
## Max. :-0.7134
delay_normalized_log$season <- as.factor(delay_normalized_log$season) # Convert season column to factor
summary(delay_normalized_log$season)
## Autumn Spring Summer Winter
## 3246 4063 5297 3948
library(writexl)
# Export to Excel
write_xlsx(delay_normalized_log, "delay_normalized_log.xlsx")
The output provides a detailed summary of the normalized and log-transformed data, including descriptive statistics, missing values, and distribution characteristics for each variable.
library(summarytools)
print(dfSummary(delay_normalized_log), method = 'render')
No | Variable | Freqs (% of Valid) | Valid | Missing
---|---|---|---|---
1 | year [integer] | | 16554 (100.0%) | 0 (0.0%)
2 | month [integer] | 12 distinct values | 16554 (100.0%) | 0 (0.0%)
3 | carrier [character] | | 16554 (100.0%) | 0 (0.0%)
4 | carrier_name [character] | | 16554 (100.0%) | 0 (0.0%)
5 | airport [character] | | 16554 (100.0%) | 0 (0.0%)
6 | airport_name [character] | | 16554 (100.0%) | 0 (0.0%)
7 | arr_flights [numeric] | 4813 distinct values | 16554 (100.0%) | 0 (0.0%)
8 | arr_del15 [numeric] | 1697 distinct values | 16554 (100.0%) | 0 (0.0%)
9 | carrier_ct [numeric] | 10755 distinct values | 16554 (100.0%) | 0 (0.0%)
10 | weather_ct [numeric] | 3462 distinct values | 16554 (100.0%) | 0 (0.0%)
11 | nas_ct [numeric] | 10150 distinct values | 16554 (100.0%) | 0 (0.0%)
12 | security_ct [numeric] | 844 distinct values | 16554 (100.0%) | 0 (0.0%)
13 | late_aircraft_ct [numeric] | 11139 distinct values | 16554 (100.0%) | 0 (0.0%)
14 | arr_cancelled [numeric] | 438 distinct values | 16554 (100.0%) | 0 (0.0%)
15 | arr_diverted [numeric] | 113 distinct values | 16554 (100.0%) | 0 (0.0%)
16 | arr_delay [numeric] | 12913 distinct values | 16554 (100.0%) | 0 (0.0%)
17 | carrier_delay [numeric] | 9238 distinct values | 16554 (100.0%) | 0 (0.0%)
18 | weather_delay [numeric] | 3471 distinct values | 16554 (100.0%) | 0 (0.0%)
19 | nas_delay [numeric] | 7451 distinct values | 16554 (100.0%) | 0 (0.0%)
20 | security_delay [numeric] | 562 distinct values | 16554 (100.0%) | 0 (0.0%)
21 | late_aircraft_delay [numeric] | 9951 distinct values | 16554 (100.0%) | 0 (0.0%)
22 | season [factor] | | 16554 (100.0%) | 0 (0.0%)
23 | delay_ratio [numeric] | 13455 distinct values | 16554 (100.0%) | 0 (0.0%)
24 | carrier_ratio [numeric] | 16274 distinct values | 16554 (100.0%) | 0 (0.0%)
25 | weather_ratio [numeric] | 15428 distinct values | 16554 (100.0%) | 0 (0.0%)
26 | nas_ratio [numeric] | 16251 distinct values | 16554 (100.0%) | 0 (0.0%)
27 | security_ratio [numeric] | 13709 distinct values | 16554 (100.0%) | 0 (0.0%)
28 | late_aircraft_ratio [numeric] | 16295 distinct values | 16554 (100.0%) | 0 (0.0%)
29 | log_carrier_ratio [numeric] | 16274 distinct values | 16554 (100.0%) | 0 (0.0%)
30 | log_weather_ratio [numeric] | 15425 distinct values | 16554 (100.0%) | 0 (0.0%)
31 | log_nas_ratio [numeric] | 16247 distinct values | 16554 (100.0%) | 0 (0.0%)
32 | log_security_ratio [numeric] | 13701 distinct values | 16554 (100.0%) | 0 (0.0%)
33 | log_late_aircraft_ratio [numeric] | 16296 distinct values | 16554 (100.0%) | 0 (0.0%)
Generated by summarytools 1.0.1 (R version 4.4.1)
2024-12-31
Which delay type contributes the most overall?
This pie chart visualizes the normalized contributions of different delay reasons to overall flight delays, represented as percentages for clear comparison.
# Summarize normalized delay ratios (not log-transformed)
delay_ratios <- delay_normalized %>%
  summarise(
    carrier_delay = sum(carrier_ratio),
    weather_delay = sum(weather_ratio),
    nas_delay = sum(nas_ratio),
    security_delay = sum(security_ratio),
    late_aircraft_delay = sum(late_aircraft_ratio)
  ) %>%
  pivot_longer(everything(), names_to = "DelayReason", values_to = "RatioSum") %>%
  mutate(Percentage = round(RatioSum / sum(RatioSum) * 100, 1)) # Calculate percentages
# Plot pie chart with percentages
ggplot(delay_ratios, aes(x = "", y = RatioSum, fill = DelayReason)) +
  geom_bar(stat = "identity", width = 1, color = "white") + # Ensure segments are distinguishable
  coord_polar("y", start = 0) + # Convert to pie chart
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), size = 4, color = "black") + # Add percentages
  scale_fill_brewer(palette = "Set3") + # Use a nice color palette
  labs(title = "Proportional Breakdown of Delay Reasons",
       fill = "Delay Reason") +
  theme_void() # Minimalist theme for pie chart
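The percentages in this chart come from summing per-record ratios, so a small airport-carrier record counts as much as a large one. The sketch below, on two made-up records (the values in `toy` are hypothetical), contrasts that with summing the counts first, which weights every delayed flight equally. It is meant only to show that the two definitions can rank causes differently.

```r
# Hypothetical carrier-airport-month records: one small, one large
toy <- data.frame(
  arr_flights = c(10, 1000),
  carrier_ct  = c(4, 50),
  nas_ct      = c(1, 150)
)

# Approach used for the pie chart: sum per-record ratios (records weighted equally)
ratio_sum <- c(carrier = sum(toy$carrier_ct / toy$arr_flights),
               nas     = sum(toy$nas_ct / toy$arr_flights))
share_equal <- round(ratio_sum / sum(ratio_sum) * 100, 1)

# Alternative: sum counts first (delayed flights weighted equally)
count_sum <- c(carrier = sum(toy$carrier_ct), nas = sum(toy$nas_ct))
share_weighted <- round(count_sum / sum(count_sum) * 100, 1)

rbind(share_equal, share_weighted)
```

Which definition is appropriate depends on whether the question is about a typical record or about delayed flights overall.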
Which delay type is most correlated with the total delay ratio?
# Compute correlation matrix using log-transformed ratios
correlation_matrix <- delay_normalized_log %>%
  select(delay_ratio,
         log_carrier_ratio, log_weather_ratio, log_nas_ratio,
         log_security_ratio, log_late_aircraft_ratio) %>%
  cor(use = "complete.obs")
# Visualize correlation matrix using heatmap
library(ggcorrplot)
ggcorrplot(correlation_matrix,
           lab = TRUE,                         # Add correlation values as text
           type = "lower",                     # Show lower triangle of the matrix
           title = "Correlation Matrix: Delay Ratio vs Log-Transformed Predictors",
           lab_size = 4,                       # Text size
           colors = c("blue", "white", "red"), # Diverging colors
           ggtheme = theme_minimal())
The correlation matrix shows relationships between delay causes (e.g., carrier delay, late aircraft delay) and the delay ratio. Key observation:
The delay ratio has the strongest correlation with log_carrier_ratio (0.69) and log_late_aircraft_ratio (0.69).
Other delay factors like weather and NAS show weaker correlations with the delay ratio.
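A correlation can be read as variance explained: for a single predictor, the R-squared of a simple linear regression equals the squared Pearson correlation, so a correlation of 0.69 corresponds to roughly 0.69² ≈ 0.48. The sketch below checks this identity on simulated data (the variables `x` and `y` are hypothetical, not the flight data).

```r
set.seed(3)
# Simulated predictor and response (hypothetical, not the flight data)
x <- rnorm(200)
y <- 0.7 * x + rnorm(200, sd = 0.7)

r  <- cor(x, y)
r2 <- summary(lm(y ~ x))$r.squared

# For a single predictor, R-squared equals the squared Pearson correlation
c(r_squared_from_cor = r^2, r_squared_from_lm = r2)
```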
mod1 <- lm(delay_ratio ~ log_late_aircraft_ratio,
           data = delay_normalized_log)
summary(mod1)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio, data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14723 -0.19028 -0.01037 0.17407 2.53492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.373306 0.009966 -37.46 <2e-16 ***
## log_late_aircraft_ratio 0.441547 0.003597 122.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2879 on 16552 degrees of freedom
## Multiple R-squared: 0.4766, Adjusted R-squared: 0.4765
## F-statistic: 1.507e+04 on 1 and 16552 DF, p-value: < 2.2e-16
This linear regression shows a statistically significant relationship between delay_ratio and log_late_aircraft_ratio. The intercept is -0.3733, the predicted delay_ratio when log_late_aircraft_ratio is zero (that is, when the late aircraft ratio equals 1, which lies outside the observed range). The coefficient for log_late_aircraft_ratio is 0.4415: a one-unit increase in the log of the late aircraft ratio is associated with an increase of 0.4415 in delay_ratio, which is itself on the log scale. Both the intercept and the coefficient are highly significant, with p-values less than 2e-16. The model explains about 47.66% of the variation in delay_ratio, with a residual standard error of 0.2879, and the overall model is statistically significant, as indicated by the F-statistic of 1.507e+04.
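Because both the response and the predictor are logged, the slope can be read as an elasticity: multiplying the late aircraft ratio by a factor k multiplies the predicted share of delayed flights by k raised to the slope. A small worked example, using the slope from the summary above (the helper `effect()` is hypothetical):

```r
# Slope from the mod1 summary above (log-log scale)
beta <- 0.441547

# Multiplicative change in arr_del15 / arr_flights when
# late_aircraft_ct / arr_flights is multiplied by a factor k
effect <- function(k, slope = beta) k^slope

effect(1.10)  # late aircraft ratio 10% higher
effect(2)     # late aircraft ratio doubled
```

Under this model, a 10% higher late aircraft ratio goes with a predicted delay share roughly 4% higher, and a doubling with roughly 36% higher. These are associations within the fitted model, not causal effects.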
mod2 <- update(mod1, . ~ . + log_carrier_ratio)
summary(mod2)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio,
## data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81294 -0.14725 -0.02725 0.11628 1.79444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.339674 0.010520 32.29 <2e-16 ***
## log_late_aircraft_ratio 0.313927 0.003092 101.52 <2e-16 ***
## log_carrier_ratio 0.382962 0.003776 101.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2261 on 16551 degrees of freedom
## Multiple R-squared: 0.6772, Adjusted R-squared: 0.6771
## F-statistic: 1.736e+04 on 2 and 16551 DF, p-value: < 2.2e-16
Adding the second predictor, log_carrier_ratio, improves the model’s ability to explain delay_ratio. Both predictors are statistically significant, meaning each is associated with delay_ratio after accounting for the other. The model now explains about 67.72% of the variation in delay_ratio, up from 47.66% in the previous model. The residual standard error also decreases, from 0.2879 to 0.2261, indicating a better fit. The overall model is highly significant.
mod3 <- update(mod2, . ~ . + log_nas_ratio)
summary(mod3)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio +
## log_nas_ratio, data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32838 -0.08392 -0.02596 0.04780 1.65621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.876101 0.007477 117.2 <2e-16 ***
## log_late_aircraft_ratio 0.259288 0.001987 130.5 <2e-16 ***
## log_carrier_ratio 0.392074 0.002390 164.1 <2e-16 ***
## log_nas_ratio 0.218601 0.001388 157.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.143 on 16550 degrees of freedom
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8708
## F-statistic: 3.718e+04 on 3 and 16550 DF, p-value: < 2.2e-16
Adding log_nas_ratio improves the model substantially. All three predictors (log_late_aircraft_ratio, log_carrier_ratio, and log_nas_ratio) are statistically significant. The model now explains 87.08% of the variation in delay_ratio, up from 67.72%, and the residual standard error falls from 0.2261 to 0.143, showing a better fit. The high F-statistic confirms that the model as a whole is significant.
mod4 <- update(mod3, . ~ . + log_weather_ratio)
summary(mod4)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio +
## log_nas_ratio + log_weather_ratio, data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30600 -0.07894 -0.02720 0.04387 1.46256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0413970 0.0076453 136.21 <2e-16 ***
## log_late_aircraft_ratio 0.2612920 0.0018448 141.64 <2e-16 ***
## log_carrier_ratio 0.3699147 0.0022595 163.72 <2e-16 ***
## log_nas_ratio 0.2064721 0.0013098 157.64 <2e-16 ***
## log_weather_ratio 0.0488465 0.0009475 51.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1328 on 16549 degrees of freedom
## Multiple R-squared: 0.8887, Adjusted R-squared: 0.8887
## F-statistic: 3.303e+04 on 4 and 16549 DF, p-value: < 2.2e-16
Adding log_weather_ratio improves the fit slightly. All four predictors (log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, and log_weather_ratio) are statistically significant, although the weather coefficient (0.0488) is much smaller than the other three. The model now explains 88.87% of the variation in delay_ratio, a gain of about 1.8 percentage points over the previous model, and the residual standard error falls from 0.143 to 0.1328. The high F-statistic confirms that the model as a whole is significant, and the significant weather term indicates that weather also contributes to flight delays.
mod5 <- update(mod4, . ~ . + log_security_ratio)
summary(mod5)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio +
## log_nas_ratio + log_weather_ratio + log_security_ratio, data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27554 -0.07729 -0.02739 0.04317 1.41816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1143859 0.0083421 133.59 <2e-16 ***
## log_late_aircraft_ratio 0.2681970 0.0018524 144.78 <2e-16 ***
## log_carrier_ratio 0.3570505 0.0023172 154.09 <2e-16 ***
## log_nas_ratio 0.2051379 0.0012950 158.40 <2e-16 ***
## log_weather_ratio 0.0454750 0.0009499 47.87 <2e-16 ***
## log_security_ratio 0.0162922 0.0007921 20.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1311 on 16548 degrees of freedom
## Multiple R-squared: 0.8915, Adjusted R-squared: 0.8914
## F-statistic: 2.718e+04 on 5 and 16548 DF, p-value: < 2.2e-16
By adding log_security_ratio to the model, the prediction of delay_ratio improves further. All five predictors — log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, log_weather_ratio, and log_security_ratio — are statistically significant, with very small p-values (less than 2e-16). The model now explains 89.15% of the variation in the delay ratio, a slight improvement over the previous models. The residual standard error has decreased to 0.1311, indicating a better fit. The high F-statistic (27,180) confirms the model’s significance, and adding log_security_ratio shows that security factors also contribute to flight delays.
In analyzing flight delays, it is important not to overlook the potential impact of season on the delay_ratio. Previous observations have shown that the summer season experiences the highest number of delays. Therefore, adding the season variable to the model allows us to better account for seasonal effects that may influence delays, providing a more comprehensive understanding of the factors at play.
mod6 <- update(mod5, . ~ . + season)
summary(mod6)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio +
## log_nas_ratio + log_weather_ratio + log_security_ratio +
## season, data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28620 -0.07713 -0.02650 0.04340 1.39549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0582730 0.0097867 108.134 < 2e-16 ***
## log_late_aircraft_ratio 0.2649817 0.0018753 141.302 < 2e-16 ***
## log_carrier_ratio 0.3537572 0.0023268 152.039 < 2e-16 ***
## log_nas_ratio 0.2043026 0.0012920 158.131 < 2e-16 ***
## log_weather_ratio 0.0430712 0.0009861 43.678 < 2e-16 ***
## log_security_ratio 0.0163113 0.0007890 20.674 < 2e-16 ***
## seasonSpring 0.0197835 0.0031208 6.339 2.37e-10 ***
## seasonSummer 0.0329753 0.0031943 10.323 < 2e-16 ***
## seasonWinter 0.0329753 0.0031699 10.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1306 on 16545 degrees of freedom
## Multiple R-squared: 0.8923, Adjusted R-squared: 0.8923
## F-statistic: 1.714e+04 on 8 and 16545 DF, p-value: < 2.2e-16
With season included, all five log-ratio predictors remain statistically significant. The seasonal coefficients (seasonSpring, seasonSummer, and seasonWinter) are also significant; each is measured relative to Autumn, the reference level, and Summer and Winter show the largest increases (about 0.033 each). The gains in fit are marginal: the model explains 89.23% of the variation in delay_ratio, compared with 89.15% without season, and the residual standard error moves from 0.1311 to 0.1306. The F-statistic confirms the model’s overall significance.
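The season coefficients are contrasts against a reference level. R orders factor levels alphabetically by default, which is why Autumn has no coefficient of its own in the summary above. A minimal sketch of how the baseline is chosen and how it could be changed (the vector `season` below is a toy example, not the dataset column):

```r
season <- factor(c("Winter", "Spring", "Summer", "Autumn", "Winter"))

levels(season)                    # alphabetical order; the first level is the baseline
colnames(model.matrix(~ season))  # dummy columns created for a model formula

# Choose a different reference level, e.g. Summer
season_summer_ref <- relevel(season, ref = "Summer")
levels(season_summer_ref)
```

Releveling changes which comparisons the coefficients report, but not the fitted values or the R-squared of the model.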
mod7 <- update(mod5, . ~ . + log_weather_ratio * season)
summary(mod7)
##
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio +
## log_nas_ratio + log_weather_ratio + log_security_ratio +
## season + log_weather_ratio:season, data = delay_normalized_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28241 -0.07708 -0.02618 0.04353 1.39795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0640856 0.0141812 75.035 < 2e-16 ***
## log_late_aircraft_ratio 0.2650177 0.0018755 141.304 < 2e-16 ***
## log_carrier_ratio 0.3539913 0.0023285 152.023 < 2e-16 ***
## log_nas_ratio 0.2045027 0.0012944 157.994 < 2e-16 ***
## log_weather_ratio 0.0437989 0.0020240 21.640 < 2e-16 ***
## log_security_ratio 0.0163332 0.0007892 20.695 < 2e-16 ***
## seasonSpring 0.0157279 0.0157027 1.002 0.31655
## seasonSummer 0.0110818 0.0150196 0.738 0.46063
## seasonWinter 0.0437708 0.0156687 2.794 0.00522 **
## log_weather_ratio:seasonSpring -0.0006832 0.0027381 -0.250 0.80296
## log_weather_ratio:seasonSummer -0.0043900 0.0027479 -1.598 0.11015
## log_weather_ratio:seasonWinter 0.0020923 0.0027462 0.762 0.44615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1306 on 16542 degrees of freedom
## Multiple R-squared: 0.8924, Adjusted R-squared: 0.8923
## F-statistic: 1.247e+04 on 11 and 16542 DF, p-value: < 2.2e-16
In this model, we check whether the effect of weather on delay_ratio differs by season. None of the three interaction terms between log_weather_ratio and season (Spring, Summer, Winter) is significant, so there is no evidence that the weather slope varies across seasons. Note that once the interaction is included, the seasonSpring, seasonSummer, and seasonWinter coefficients describe seasonal differences at log_weather_ratio = 0, a value outside the observed range (the maximum is -1.862), so their larger p-values here do not mean that season has stopped mattering.
Given this, it is better to drop the interaction terms and keep season as a standalone variable in our model. The adjusted R-squared is 0.8923 with or without the interaction, so the simpler model with season alone (mod6) explains the delays just as well without unnecessary complexity.
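Rather than reading the interaction coefficients one at a time, nested models can be compared with a single F-test. The sketch below does this on simulated data in which there is a season shift but no interaction; the data frame `sim` and its variables are hypothetical stand-ins, not the flight data.

```r
set.seed(42)
n <- 400
# Simulated stand-ins for log_weather_ratio (x), season, and delay_ratio (y)
sim <- data.frame(
  x      = rnorm(n),
  season = factor(sample(c("Autumn", "Spring", "Summer", "Winter"), n, replace = TRUE))
)
# The simulated truth has a Winter shift but no x-by-season interaction
sim$y <- 1 + 0.5 * sim$x + 0.3 * (sim$season == "Winter") + rnorm(n, sd = 0.5)

fit_main <- lm(y ~ x + season, data = sim)
fit_int  <- lm(y ~ x * season, data = sim)

# F-test of the three interaction coefficients taken together
cmp <- anova(fit_main, fit_int)
cmp
```

With the fitted objects from this analysis, the equivalent call would be `anova(mod6, mod7)`.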
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
residualPlots(mod6)
## Test stat Pr(>|Test stat|)
## log_late_aircraft_ratio 66.677 < 2.2e-16 ***
## log_carrier_ratio 37.595 < 2.2e-16 ***
## log_nas_ratio 102.995 < 2.2e-16 ***
## log_weather_ratio 34.258 < 2.2e-16 ***
## log_security_ratio 14.780 < 2.2e-16 ***
## season
## Tukey test 32.835 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqPlot(residuals(mod6))
## [1] 2376 3078
The QQ plot (Quantile-Quantile plot) compares the quantiles of the residuals with those of a normal distribution. If the residuals were normally distributed, the points would lie along a straight diagonal line; marked deviations from this line suggest skewness or heavy tails. The two values printed by qqPlot, 2376 and 3078, identify the most extreme residuals. Given the appearance of the QQ plot, the assumption of normally distributed residuals may not hold. Together with the curvature in the residual plots, this suggests the linear model is misspecified, which motivates trying a GAM.
Residuals
The residuals plot shows a clear curvature pattern, indicating that the relationship between the predictors and the response variable might not be entirely linear. This suggests that the current linear model may not fully capture the underlying structure of the data. To address this, we can consider fitting a Generalized Additive Model (GAM), which allows for more flexible relationships between the predictors and the response by using smooth functions. This approach can help model the non-linear trends observed in the residuals plot.
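One plausible source of this curvature is structural. If the five cause counts add up to arr_del15 (an assumption, though the column means in the summary earlier are consistent with it: 93.08 + 9.985 + 92.14 + 1.362 + 123.10 ≈ 319.7), then delay_ratio is the log of a sum of the unlogged ratios, which is not a linear function of their logs. The sketch below illustrates the identity on made-up counts; the data frame `toy` is hypothetical.

```r
# Hypothetical cause counts for three records (not taken from the dataset)
toy <- data.frame(
  arr_flights      = c(100, 500, 2000),
  carrier_ct       = c(5, 30, 120),
  weather_ct       = c(1, 4, 10),
  nas_ct           = c(3, 40, 200),
  security_ct      = c(0.5, 1, 2),
  late_aircraft_ct = c(6, 45, 150)
)
causes <- c("carrier_ct", "weather_ct", "nas_ct", "security_ct", "late_aircraft_ct")

# ASSUMPTION: arr_del15 is the sum of the five cause counts
toy$arr_del15   <- rowSums(toy[, causes])
toy$delay_ratio <- log(toy$arr_del15 / toy$arr_flights)

# Log-transformed predictors, built the same way as in delay_normalized_log
log_ratios <- log(toy[, causes] / toy$arr_flights)

# Under the assumption, the response is a log-sum-exp of the predictors,
# which is a curved (not linear) function of them
rebuilt <- log(rowSums(exp(log_ratios)))
max(abs(rebuilt - toy$delay_ratio))  # numerically zero
```

If the assumption holds for the real data, a very high R-squared from a flexible model partly reflects this accounting identity rather than a predictive relationship.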
Generalized Additive Model (GAM)
A Generalized Additive Model (GAM) extends traditional linear models by allowing non-linear relationships between the predictors and the outcome. The general formula for a GAM is:
y = β₀ + f₁(x₁) + f₂(x₂) + ... + fₚ(xₚ) + ε
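Where y is the response, β₀ is the intercept, each fₖ is a smooth function of the predictor xₖ, and ε is the error term.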
In R, we can use the mgcv package to fit a GAM, where the smooth functions fₖ are typically estimated using splines. This flexibility allows the model to capture complex, non-linear relationships between the predictors and the response variable.
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(nlme)
mod_gam <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
                 s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) +
                 season, data = delay_normalized_log)
# Summary of the GAM model
summary(mod_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
## s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) +
## season
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.570575 0.001288 -1219.784 < 2e-16 ***
## seasonSpring 0.004973 0.001672 2.975 0.00293 **
## seasonSummer 0.004620 0.001723 2.681 0.00734 **
## seasonWinter 0.009964 0.001697 5.872 4.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(log_late_aircraft_ratio) 8.578 8.949 10037.3 <2e-16 ***
## s(log_carrier_ratio) 7.716 8.582 7455.0 <2e-16 ***
## s(log_nas_ratio) 8.850 8.993 13080.6 <2e-16 ***
## s(log_weather_ratio) 8.105 8.794 543.1 <2e-16 ***
## s(log_security_ratio) 7.269 8.312 117.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.97 Deviance explained = 97%
## GCV = 0.0048334 Scale est. = 0.0048204 n = 16554
The results from the Generalized Additive Model (GAM) suggest a strong fit for predicting delay_ratio with significant non-linear relationships between the predictors and the outcome. Here’s a detailed summary of the model’s findings:
1- Parametric Coefficients:
The intercept (-1.571) is highly significant (p < 2e-16). Because mgcv centers each smooth term, the intercept is the expected delay_ratio in the reference season (Autumn) when every smooth is at its average contribution.
Season variables (Spring, Summer, and Winter) are also significant relative to Autumn. Winter shows the largest effect (estimate 0.00996, t = 5.872, p = 4.38e-09), followed by Spring (0.00497, p = 0.0029) and Summer (0.00462, p = 0.0073). All three estimates are small in absolute terms, so the seasonal shift is statistically detectable but modest.
2- Smooth Terms:
The smooth terms for each predictor (log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, log_weather_ratio, and log_security_ratio) have edf (estimated degrees of freedom) between 7.3 and 8.9. An edf of 1 would correspond to a straight line, so values this high indicate clearly non-linear fitted relationships. The edf values are also close to their Ref.df (reference degrees of freedom), which means the smooths use most of the flexibility the basis allows.
The F-statistics for all smooth terms are very high, and the p-values are all less than 2e-16, so every smooth term is statistically significant. The largest F-statistics belong to log_nas_ratio (13080.6), log_late_aircraft_ratio (10037.3), and log_carrier_ratio (7455.0); log_weather_ratio (543.1) and log_security_ratio (117.1) contribute far less.
3- Model Fit:
The Adjusted R-squared value is 0.97, which indicates that the model explains 97% of the variance in the delay ratio. This is an excellent fit, showing that the model accounts for almost all the variability in the outcome.
The Deviance explained is also 97%, which is consistent with the adjusted R-squared. The Generalized Cross Validation (GCV) score is 0.0048334. GCV is measured on the scale of the squared response, so it is mainly useful for comparing alternative models fitted to the same data rather than as an absolute indicator of generalization.
In summary, this GAM provides a strong, well-fitting model with both linear and non-linear relationships, and it effectively captures the seasonal effects on delays. The model’s performance metrics, including R-squared and Deviance explained, demonstrate that it explains the variation in the delay ratio very effectively.
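With the default basis size in mgcv (k = 10 for a one-dimensional `s()` term, so at most 9 effective degrees of freedom), edf values close to 9 can also mean that the basis is limiting the fit. The following is a minimal sketch, on simulated data, of how `k.check()` from mgcv reports this. The data frame `sim` and the model `m_sim` are hypothetical stand-ins, not the flight-delay data.

```r
library(mgcv)

# Simulated data (hypothetical; stands in for delay_normalized_log)
set.seed(42)
n <- 500
sim <- data.frame(x = runif(n))
sim$y <- sin(2 * pi * sim$x) + rnorm(n, sd = 0.3)

# One thin-plate smooth with the default basis size, as in the model above
m_sim <- gam(y ~ s(x), data = sim)

# k.check() lists, per smooth, the maximum degrees of freedom (k'), the edf,
# a k-index and a p-value; edf close to k' together with a low p-value
# suggests refitting with a larger k
kc <- k.check(m_sim)
print(kc)
```

Applied to the model above, the call would be `k.check(mod_gam)`. If a term looks constrained, a larger basis can be requested with, for example, `s(log_nas_ratio, k = 20)`.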
plot(mod_gam,shade = TRUE, shade.col = "lightblue")
1- Plot for log_late_aircraft_ratio:
This smooth function shows a positive relationship between log_late_aircraft_ratio and the delay_ratio. As the log of the late aircraft ratio increases, the delay ratio also increases, particularly after the value reaches around -5. The plot shows a gradual increase with a smooth curve, suggesting that the effect of log_late_aircraft_ratio is non-linear.
2- Plot for log_carrier_ratio:
This smooth function indicates a positive, non-linear relationship between log_carrier_ratio and delay_ratio. Similar to log_late_aircraft_ratio, the delay ratio increases as the log_carrier_ratio rises. The curve becomes steeper at higher values, showing a stronger effect as the carrier ratio increases.
3- Plot for log_nas_ratio:
This plot shows a positive relationship between log_nas_ratio and delay_ratio. As log_nas_ratio increases, the delay ratio increases as well. The curve is quite steep for lower values of log_nas_ratio and then flattens out slightly at higher values.
4- Plot for log_weather_ratio:
The smooth function for log_weather_ratio shows that weather ratio has a slight positive effect on delay_ratio. However, the relationship is almost flat, indicating that changes in weather ratio do not have a strong influence on delay ratio.
5- Plot for log_security_ratio:
The log_security_ratio also shows a very small positive effect on delay_ratio. The curve is almost flat, meaning that security ratio has a minimal non-linear effect on the delay ratio, with only small increases in the delay ratio as the security ratio rises.
Each plot represents the non-linear relationship between the respective predictor and the delay ratio, where the shaded areas indicate confidence intervals around the smooth terms. The smooth curves highlight how each predictor affects the response variable without assuming a linear relationship.
qqPlot(residuals(mod_gam))
## [1] 6046 11891
This is a QQ plot for the residuals of the GAM. The points are the observed residual quantiles, and the blue line is the reference line expected under a normal distribution. The two numbers printed above (6046 and 11891) are the indices of the two most extreme residuals.
In the middle range the residuals follow the reference line closely, so most of them are approximately normally distributed.
At both extremes the residuals depart from the line. The tails are therefore heavier than those of a normal distribution, which points to possible outliers or extreme values in the data.
To investigate further and identify the outliers in the data, I examined the residuals using different thresholds. These thresholds represent different quantiles of the absolute residuals, and any data point with an absolute residual greater than the threshold is considered an outlier.
# Extract residuals from the GAM model
residuals <- residuals(mod_gam)
# Try different thresholds for outliers
thresholds <- c(0.85,0.88,0.90, 0.92, 0.95, 0.98, 0.99, 0.995) # Different quantiles
# Identify outliers based on absolute residuals
outlier_sets <- lapply(thresholds, function(t) {
which(abs(residuals) > quantile(abs(residuals), t))
})
# Compare the number of outliers identified at each threshold
sapply(outlier_sets, length)
## [1] 2483 1987 1656 1325 828 332 166 83
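The results showed the following number of outliers at each threshold:
85th percentile: 2483
88th percentile: 1987
90th percentile: 1656
92nd percentile: 1325
95th percentile: 828
98th percentile: 332
99th percentile: 166
99.5th percentile: 83
The number of identified outliers decreases as the threshold becomes stricter, with only 83 data points flagged at the 99.5% threshold.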
By examining the outliers at various thresholds, we can decide how to handle them in the model, either by removing, transforming, or keeping them, depending on their impact on the analysis.
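A quantile threshold flags a fixed share of the rows whatever the residuals look like, so the counts above are close to (1 - threshold) times 16554. The sketch below illustrates this on simulated, perfectly normal residuals and contrasts it with a robust z-score based on the median absolute deviation. The MAD rule and the cutoff of 3 are assumptions introduced for illustration; they are not part of the original analysis.

```r
# Simulated residuals (hypothetical data, not the flight-delay residuals)
set.seed(1)
res_sim <- rnorm(10000)

# Quantile rule as used above: flags 15% of the rows by construction
flag_quantile <- abs(res_sim) > quantile(abs(res_sim), 0.85)

# Alternative (assumption): robust z-score using the median absolute deviation;
# it flags only points that lie far from the bulk of the residuals
robust_z <- (res_sim - median(res_sim)) / mad(res_sim)
flag_mad <- abs(robust_z) > 3

c(quantile_rule = sum(flag_quantile), mad_rule = sum(flag_mad))
```

On data without any real outliers, the quantile rule still flags 1500 of the 10000 rows, while the MAD rule flags only a small fraction. This is worth keeping in mind when interpreting the 2483 rows flagged at the 85% threshold.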
# Extract residuals
residuals <- residuals(mod_gam)
# Identify the largest positive and negative residuals
outliers <- which(abs(residuals) > quantile(abs(residuals), 0.85)) # Top 15% residuals
data_outliers <- delay_normalized_log[outliers, ]
# View these outliers
#print(data_outliers)
# Load writexl library
library(writexl)
# Save the influential data points to an Excel file
write_xlsx(data_outliers, path = "data_outliers.xlsx")
I decided to check the top 15% of residuals to find the observations that could be affecting the model. I selected the rows whose absolute residual exceeded the 85th percentile of the absolute residuals (the largest 15%, 2483 rows). Then, I saved these outliers in a new dataset and exported them to an Excel file for further review. This way, I can examine the outliers and decide if they should be removed or handled differently in the analysis.
table(data_outliers$season)
##
## Autumn Spring Summer Winter
## 553 577 792 561
The table shows the distribution of outliers across seasons: Autumn 553, Spring 577, Summer 792, and Winter 561.
Summer has the highest number of outliers. Since outliers are defined here by large residuals, this suggests that the model fits summer observations less well than those from other seasons. Possible reasons include higher traffic or weather conditions, although the table alone cannot establish the cause. The counts for Autumn, Winter, and Spring are similar to one another. Investigating the reasons behind the higher count in Summer could provide useful insights for improving the model or understanding seasonal variations in flight delays.
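Raw counts also depend on how many rows each season contributes, so a season with more observations will collect more outliers even at the same rate. A minimal sketch of turning counts into per-season rates with `prop.table()` follows. The data frame `toy` and its column `flagged` are hypothetical and chosen so that the rates are equal although the counts differ.

```r
# Toy data (hypothetical): one row per observation with a season label and
# a logical flag marking the rows treated as outliers
toy <- data.frame(
  season  = rep(c("Autumn", "Spring", "Summer", "Winter"),
                times = c(40, 40, 60, 40)),
  flagged = c(rep(c(TRUE, FALSE), times = c(6, 34)),   # Autumn
              rep(c(TRUE, FALSE), times = c(6, 34)),   # Spring
              rep(c(TRUE, FALSE), times = c(9, 51)),   # Summer
              rep(c(TRUE, FALSE), times = c(6, 34)))   # Winter
)

counts <- table(toy$season, toy$flagged)   # raw counts per season
rates  <- prop.table(counts, margin = 1)   # share of each season's rows
round(rates[, "TRUE"], 3)
```

In the analysis itself, the same idea would be `prop.table(table(delay_normalized_log$season, delay_normalized_log$outlier_flag), margin = 1)`, once the `outlier_flag` column has been created.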
table(data_outliers$carrier)
##
## 9E AA AS B6 DL EV G4 HA MQ NK OH OO QX UA US VX WN YV YX
## 26 195 133 209 195 6 148 86 98 341 57 323 8 17 38 69 397 43 94
table(data_outliers$airport)
##
## ABE ABQ ACK ACY AEX AGS ALB AMA ANC ASE ATL ATW AUS AVL AVP AZA BDL BET BFL BHM
## 1 19 1 1 2 1 10 3 34 8 72 1 18 3 2 7 9 1 1 3
## BIL BIS BLI BLV BNA BOI BOS BPT BQK BQN BRW BTR BTV BUF BUR BWI BZN CAE CAK CHO
## 1 2 2 2 22 7 38 2 1 12 4 3 1 6 19 29 3 1 4 2
## CHS CID CLE CLT CMH COS COU CRP CVG DAL DAY DCA DEN DFW DSM DTW ECP EGE ELP EUG
## 7 3 18 40 15 4 1 2 16 13 2 21 40 62 10 55 4 1 11 2
## EVV EWR FAI FAR FAT FAY FCA FLG FLL FNT FSD FSM GCK GEG GFK GJT GNV GPT GRK GRR
## 1 81 1 4 2 2 1 4 43 1 2 2 1 9 1 4 2 2 2 7
## GSO GSP HDN HHH HNL HOU HPN HRL HVN IAD IAG IAH ICT IDA ILM IMT IND ISP ITO JAN
## 2 3 1 1 50 11 4 3 1 9 1 27 4 3 3 1 17 6 6 4
## JAX JFK JMS JNU KOA LAN LAS LAX LBB LBE LCK LEX LFT LGA LGB LIH LIT MAF MCI MCO
## 18 39 1 7 7 1 41 90 2 1 2 1 4 79 6 10 7 6 11 31
## MDT MDW MEM MFE MFR MGM MHT MIA MKE MLI MLU MOT MRY MSN MSP MSY MTJ MVY MYR OAJ
## 1 9 7 5 4 2 1 18 11 4 1 1 3 5 44 22 1 1 17 1
## OAK OGG OKC OMA ONT ORD ORF ORH PBI PDX PGD PHL PHX PIA PIE PIT PLN PNS PRC PSC
## 25 36 5 3 18 75 4 2 19 38 8 31 58 1 5 12 1 9 1 3
## PSE PSG PSM PSP PVD PVU RAP RDM RDU RFD RIC RNO ROC ROW RSW SAN SAT SAV SBN SBP
## 4 2 1 3 7 2 1 3 23 2 7 14 1 3 15 33 7 2 3 1
## SDF SEA SFB SFO SGF SHV SIT SJC SJT SJU SLC SMF SNA SPI SPS SRQ STL STT STX SWF
## 12 47 5 89 5 2 1 31 2 28 60 42 18 1 1 7 16 13 4 2
## SYR TLH TPA TRI TUL TUS TVC TXK TYR TYS USA VPS WRG XNA YAK YUM
## 5 1 19 1 10 8 1 3 1 3 1 10 2 6 1 1
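The tables show the distribution of outliers by carrier and airport.
1- By Carrier: WN has the most outliers (397), followed by NK (341) and OO (323). B6 (209), AA (195), and DL (195) form the next group, while EV (6), QX (8), and UA (17) have the fewest.
2- By Airport: The outliers are spread across many airports, most of which have only a handful each. The highest counts are at large hubs: LAX (90), SFO (89), EWR (81), LGA (79), ORD (75), and ATL (72).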
table(data_outliers$month)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 181 184 192 198 187 281 281 230 205 173 175 196
table(data_outliers$year)
##
## 2013 2014 2015 2016 2017 2018 2019 2021 2022 2023
## 131 223 161 186 180 213 213 560 324 292
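The tables display the distribution of outliers by month and year.
1- By Month: June and July have the most outliers (281 each), followed by August (230). The remaining months range from 173 (October) to 205 (September).
2- By Year: 2021 has by far the most outliers (560), followed by 2022 (324) and 2023 (292). The earlier years range from 131 (2013) to 223 (2014), and 2020 does not appear in the table. The dataset runs from August 2013 to August 2023, so 2013 and 2023 are only partly covered.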
# Count outliers by year and month from data_outliers
outlier_time_trend <- data_outliers %>%
group_by(year, month) %>%
summarise(outlier_count = n())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# Sort the outlier counts by descending order
outlier_time_trend_sorted <- outlier_time_trend %>%
arrange(desc(outlier_count))
# Display the sorted data
cat("The outlier counts sorted by month and year:\n")
## The outlier counts sorted by month and year:
print(outlier_time_trend_sorted)
## # A tibble: 109 × 3
## # Groups: year [10]
## year month outlier_count
## <int> <int> <int>
## 1 2021 7 80
## 2 2021 6 79
## 3 2023 7 67
## 4 2021 3 56
## 5 2023 6 56
## 6 2021 4 51
## 7 2021 5 44
## 8 2013 12 43
## 9 2021 2 43
## 10 2021 10 43
## # ℹ 99 more rows
# Print the month and year with the highest number of outliers
top_outlier <- outlier_time_trend_sorted[1, ]
cat("\nThe highest number of outliers occurred in:\n")
##
## The highest number of outliers occurred in:
print(top_outlier)
## # A tibble: 1 × 3
## # Groups: year [1]
## year month outlier_count
## <int> <int> <int>
## 1 2021 7 80
2021 clearly has a high number of outliers, particularly in July and June, with 80 and 79 outliers respectively, and it accounts for seven of the ten highest year-month counts. Only July 2023 (67), June 2023 (56), and December 2013 (43) reach the top ten from other years. The persistently high counts in 2021 suggest that the model fits this year less well, possibly because of ongoing pandemic-related disruptions or operational challenges that persisted into 2021.
# Add a flag to differentiate outliers from non-outliers
delay_normalized_log$outlier_flag <- ifelse(rownames(delay_normalized_log) %in% rownames(data_outliers), "Outlier", "Non-Outlier")
# Summary statistics for predictors (grouped by outlier_flag)
library(dplyr)
summary_table <- delay_normalized_log %>%
group_by(outlier_flag) %>%
summarise(
avg_weather_ratio = mean(log_weather_ratio, na.rm = TRUE),
avg_carrier_ratio = mean(log_carrier_ratio, na.rm = TRUE),
avg_nas_ratio = mean(log_nas_ratio, na.rm = TRUE),
avg_security_ratio = mean(log_security_ratio, na.rm = TRUE),
avg_late_aircraft_ratio = mean(log_late_aircraft_ratio, na.rm = TRUE)
)
print(summary_table)
## # A tibble: 2 × 6
## outlier_flag avg_weather_ratio avg_carrier_ratio avg_nas_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 Non-Outlier -5.28 -2.74 -2.98
## 2 Outlier -5.25 -2.88 -3.23
## # ℹ 2 more variables: avg_security_ratio <dbl>, avg_late_aircraft_ratio <dbl>
I flagged the top 15% of absolute residuals as outliers and compared them with the non-outliers on the average log ratio of each delay cause. Here is what the output shows:
Differences Between Outliers and Non-Outliers:
Outliers have lower average log carrier ratios (-2.88 vs. -2.74) and lower average log NAS ratios (-3.23 vs. -2.98) than non-outliers, so the poorly fitted observations tend to be those with a smaller share of carrier and NAS delays.
The average log weather ratio is almost identical in the two groups (-5.25 for outliers vs. -5.28 for non-outliers), which implies that weather is not what separates the poorly fitted observations from the rest.
The security and late-aircraft averages are cut off in the printed tibble, so the comparison for those two predictors cannot be read from the output above.
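The printed tibble hides its last two columns. One way to see every group mean is to compute the summary as a plain data frame, which always prints all columns. The sketch below uses `aggregate()` from base R on a hypothetical stand-in for `delay_normalized_log`. All values in `toy` are simulated and carry no information about the real data.

```r
# Toy stand-in (simulated, hypothetical values) for the flagged dataset
set.seed(7)
toy <- data.frame(
  outlier_flag            = rep(c("Non-Outlier", "Outlier"), times = c(85, 15)),
  log_weather_ratio       = rnorm(100, -5.3, 1),
  log_carrier_ratio       = rnorm(100, -2.8, 1),
  log_nas_ratio           = rnorm(100, -3.0, 1),
  log_security_ratio      = rnorm(100, -7.0, 1),
  log_late_aircraft_ratio = rnorm(100, -2.7, 1)
)

# aggregate() returns a data.frame: one row per group, one column per predictor
group_means <- aggregate(. ~ outlier_flag, data = toy, FUN = mean)
print(group_means)
```

With the tibble computed above, `print(summary_table, width = Inf)` or `as.data.frame(summary_table)` would reveal the truncated columns without recomputing anything.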
To compare the results of the GAM model with and without the outliers, I first created two models: one with only outliers and one with the data excluding the outliers. By doing this, I can analyze how the outliers influence the model performance and the relationships between the predictors and the delay ratio.
Model with Outliers
First, I fitted the GAM to the outliers only. The model was trained using the data_outliers dataset, which contains only the observations flagged as outliers (top 15% of absolute residuals, n = 2483). Here is the summary of the model:
library(mgcv)
# Rebuild the GAM model with outliers
mod_gam_outlier <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio)
+ season,data = data_outliers)
# Summary of the new model
summary(mod_gam_outlier)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
## s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) +
## season
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.619251 0.006964 -232.501 < 2e-16 ***
## seasonSpring -0.001726 0.009346 -0.185 0.85353
## seasonSummer 0.012196 0.009671 1.261 0.20739
## seasonWinter 0.030254 0.009562 3.164 0.00158 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(log_late_aircraft_ratio) 7.964 8.730 531.084 < 2e-16 ***
## s(log_carrier_ratio) 7.300 8.340 465.504 < 2e-16 ***
## s(log_nas_ratio) 8.582 8.951 931.856 < 2e-16 ***
## s(log_weather_ratio) 5.761 6.984 39.321 < 2e-16 ***
## s(log_security_ratio) 4.850 5.995 5.732 6.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.943 Deviance explained = 94.4%
## GCV = 0.023599 Scale est. = 0.023233 n = 2483
This model shows the relationships between the variables and the delay ratio when only the outliers are considered. All five smooth terms remain significant, with log_nas_ratio, log_late_aircraft_ratio, and log_carrier_ratio having by far the largest F-statistics. Among the seasons, only Winter differs significantly from the Autumn baseline (estimate 0.030, p = 0.0016); Spring and Summer do not (p = 0.85 and p = 0.21).
Model without Outliers
Next, I built a GAM model without outliers. To do this, I filtered out the outliers from the original dataset and trained a new model using the cleaned data. The following code removes the outliers and then fits the model:
# Filter to exclude outliers
data_no_outliers <- delay_normalized_log %>%
filter(!(rownames(delay_normalized_log) %in% rownames(data_outliers)))
# Check the size of the dataset
dim(data_no_outliers)
## [1] 14071 34
After filtering out the outliers, the dataset shrinks from 16554 to 14071 rows, as expected. The model built on the cleaned data follows.
This model has a higher adjusted R-squared (0.989) than both the full-data model (0.97) and the outlier-only model (0.943). Part of this gain is mechanical, because the removed rows are exactly those that the full model fitted worst. The relationships between the predictors and the delay ratio are similar, but all three season effects are now highly significant, with Winter (0.0074) and Spring (0.0069) close to each other.
library(mgcv)
# Rebuild the GAM model without outliers
mod_gam_clean <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio)
+ season,data = data_no_outliers)
# Summary of the new model
summary(mod_gam_clean)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
## s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) +
## season
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5628531 0.0007111 -2197.773 < 2e-16 ***
## seasonSpring 0.0068734 0.0009179 7.489 7.38e-14 ***
## seasonSummer 0.0047597 0.0009473 5.024 5.11e-07 ***
## seasonWinter 0.0074353 0.0009302 7.993 1.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(log_late_aircraft_ratio) 8.820 8.991 27936.7 <2e-16 ***
## s(log_carrier_ratio) 8.713 8.975 19223.2 <2e-16 ***
## s(log_nas_ratio) 8.985 9.000 31444.8 <2e-16 ***
## s(log_weather_ratio) 8.634 8.963 1297.7 <2e-16 ***
## s(log_security_ratio) 7.605 8.534 347.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.989 Deviance explained = 98.9%
## GCV = 0.0012273 Scale est. = 0.0012232 n = 14071
By comparing the models with and without outliers, I can evaluate how the outliers affect the fit. The model without outliers has a lower GCV (0.0012273 vs. 0.0048334 for the full-data model) and a higher adjusted R-squared. These figures are not a like-for-like comparison, however: the models are fitted to different subsets, and the outliers were defined as the rows with the largest residuals, so removing them lowers the residual variance by construction. The improvement shows that the bulk of the data is fitted very closely, but it does not by itself demonstrate better generalization.
However, the model with outliers, despite its higher GCV (0.023599), still provides valuable insights into the behavior of extreme delays. It helps in understanding how outliers influence flight delay predictions and their potential impact on the airline industry. In conclusion, while excluding outliers improves the model’s generalization, keeping outliers can be useful for examining the effects of extreme delays on the overall system.
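How much of such an improvement can come from the trimming step itself can be illustrated on simulated data that contains no outliers at all. The sketch below is an illustration under stated assumptions: it uses `lm()` instead of `gam()` to stay dependency-free, and the data frame `sim` is hypothetical.

```r
# Simulated data from a correctly specified linear model, no true outliers
set.seed(2024)
n <- 2000
sim <- data.frame(x = runif(n))
sim$y <- 2 * sim$x + rnorm(n)

fit_full <- lm(y ~ x, data = sim)

# Drop the 15% of rows with the largest absolute residuals, then refit
abs_res  <- abs(residuals(fit_full))
keep     <- abs_res <= quantile(abs_res, 0.85)
fit_trim <- lm(y ~ x, data = sim[keep, ])

r2_full <- summary(fit_full)$r.squared
r2_trim <- summary(fit_trim)$r.squared
c(full = r2_full, trimmed = r2_trim)
```

The trimmed fit reports a higher R-squared even though nothing was wrong with the removed rows, so a rise from 0.97 to 0.989 after removing the top 15% of residuals is at least partly expected.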
par(mfrow = c(1, 2))
# Plot QQ plot for residuals of the first model before extracting 15% outliers (mod_gam)
qqPlot(residuals(mod_gam), main = "QQ Plot for mod_gam")
## [1] 6046 11891
# Plot QQ plot for residuals of the second model after extracting outliers (mod_gam_clean)
qqPlot(residuals(mod_gam_clean), main = "QQ Plot for mod_gam_clean")
## [1] 10030 12929
Left plot (mod_gam, all data): The QQ plot for the model fitted to all data shows larger deviations from the blue line, particularly at the extremes. This indicates heavier tails than a normal distribution, driven by the observations with large residuals.
Right plot (mod_gam_clean, outliers removed): After excluding outliers, the residuals follow the normal reference line more closely, and the deviations at the extremes are smaller. This is partly expected, since the rows removed were those with the largest residuals under mod_gam, so the improvement should not be read as independent evidence of a better model.
Now, we aim to assess the performance of the Generalized Additive Model (GAM) by evaluating its predictive accuracy through cross-validation, using Root Mean Squared Error (RMSE) as the evaluation metric. The model is trained on subsets of the data (training set) and evaluated on separate unseen data (test set) to understand how well it generalizes. The code below applies this procedure to the dataset without outliers (data_no_outliers).
To ensure the model’s performance is robust and not overly influenced by the specific data split, we will perform k-fold cross-validation. In this method, the dataset is divided into k equal-sized folds, and the model is trained on k-1 folds while being tested on the remaining fold. This process is repeated k times, each time using a different fold as the test set and the remaining data for training. The results from all folds are then averaged to provide a more reliable estimate of model performance. Cross-validation helps detect overfitting and provides a clearer picture of how the model might perform on new, unseen data.
The Root Mean Squared Error (RMSE) will be used to evaluate the model’s predictive accuracy. RMSE measures the average magnitude of the model’s prediction errors and gives a clear indication of how close the model’s predictions are to the actual values. The formula for RMSE is:
\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
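Where:
\(n\) is the number of observations in the test set,
\(y_i\) is the observed value of delay_ratio for observation \(i\),
\(\hat{y}_i\) is the value predicted by the model for observation \(i\).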
Lower RMSE values signify better model performance, indicating that the model’s predictions are closer to the actual values.
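As a small worked example of the formula, take three hypothetical observations with prediction errors of 0, 0, and -2. The squared errors are 0, 0, and 4, their mean is 4/3, and the RMSE is the square root of 4/3, about 1.155. The vectors `y_obs` and `y_pred` are made up for illustration.

```r
# Hypothetical observed and predicted values
y_obs  <- c(1, 2, 3)
y_pred <- c(1, 2, 5)

# Errors are 0, 0, -2; squared errors are 0, 0, 4; their mean is 4/3
rmse <- sqrt(mean((y_obs - y_pred)^2))
rmse
```

The single large error dominates the result, which shows how RMSE penalizes large misses more heavily than small ones.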
The training process involves fitting the model to the data, learning the relationships between the input features and the target variable. The model is then tested on data it has never seen during training, allowing us to gauge how well it performs on unseen data. Running the same procedure on the full dataset (delay_normalized_log) as well would show whether excluding outliers changes out-of-sample accuracy. Note that the outliers were identified from a model fitted to all rows, so the cross-validated error on data_no_outliers is likely to be optimistic.
library(mgcv)
# Set up k-fold cross-validation
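set.seed(123) # For reproducibility (no effect below, since the fold assignment is not random)
k <- 5 # Number of folds
# Folds are k contiguous blocks of rows in the current row order (no shuffling)
folds <- cut(seq(1, nrow(data_no_outliers)), breaks = k, labels = FALSE)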
# Initialize a vector to store RMSE values for each fold
rmse_values <- numeric(k)
# Perform cross-validation
for (i in 1:k) {
# Split data into training and test sets
test_indices <- which(folds == i)
test_data <- data_no_outliers[test_indices, ]
train_data <- data_no_outliers[-test_indices, ]
# Fit the GAM model on the training set
mod_gam_cv <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) +
s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio)
+ season, data = train_data)
# Predict on the test set
predictions <- predict(mod_gam_cv, newdata = test_data)
# Calculate the Root Mean Squared Error (RMSE) for this fold
rmse_values[i] <- sqrt(mean((test_data$delay_ratio - predictions)^2))
}
# Calculate the average RMSE across all folds
avg_rmse <- mean(rmse_values)
print(avg_rmse)
## [1] 0.03585187
# Residuals from the last fold only (test_data and predictions are overwritten on each iteration)
residuals_cv <- test_data$delay_ratio - predictions
# Residuals vs Index
plot(residuals_cv, main = "Residuals vs Index", xlab = "Index", ylab = "Residuals", pch = 19, col = "blue")
The Residuals vs Index plot visualizes the differences between the observed and predicted values (residuals). Because test_data and predictions are overwritten in each iteration, the plot covers only the last fold (i = 5), not all cross-validated predictions. In this plot, the residuals are randomly scattered around zero, with no apparent pattern in the errors. This suggests that, for this fold, the model has captured the main structure in the data.
The random spread of residuals means that the model’s predictions are not consistently off in one direction, and there are no obvious trends that would suggest model misspecification.
Overall, the average RMSE of 0.0359 is close to the in-sample residual standard deviation of mod_gam_clean (the square root of the scale estimate 0.0012232, about 0.035), so the model predicts held-out rows of the cleaned data about as well as it fits the training rows. This result applies to the dataset with outliers removed and to contiguous, unshuffled folds.
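The loop above assigns folds as contiguous blocks of rows and keeps residuals from the final fold only. Below is a minimal sketch of a variant that shuffles the fold labels and pools the out-of-fold residuals from every fold. The helper `cv_rmse()`, its arguments, and the simulated data are hypothetical, and `lm()` stands in for the model so that the example has no dependencies.

```r
# k-fold CV with shuffled folds. fit_fun takes a training data frame and
# returns a fitted model that works with predict(..., newdata = ).
cv_rmse <- function(data, response, fit_fun, k = 5, seed = 123) {
  set.seed(seed)
  # Random fold labels of (nearly) equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  oof_resid <- numeric(nrow(data))   # out-of-fold residual for every row
  fold_rmse <- numeric(k)
  for (i in seq_len(k)) {
    test_idx <- which(folds == i)
    fit  <- fit_fun(data[-test_idx, , drop = FALSE])
    pred <- predict(fit, newdata = data[test_idx, , drop = FALSE])
    oof_resid[test_idx] <- data[[response]][test_idx] - pred
    fold_rmse[i] <- sqrt(mean(oof_resid[test_idx]^2))
  }
  list(fold_rmse = fold_rmse, mean_rmse = mean(fold_rmse), residuals = oof_resid)
}

# Demo on simulated data (hypothetical), with lm() as the model
set.seed(1)
sim <- data.frame(x = runif(300))
sim$y <- 1 + 2 * sim$x + rnorm(300, sd = 0.5)
res <- cv_rmse(sim, "y", function(d) lm(y ~ x, data = d))
res$mean_rmse
```

With mgcv loaded, the same helper could be called on `data_no_outliers` or `delay_normalized_log` with `"delay_ratio"` as the response and a `fit_fun` that wraps the `gam()` call used above. `plot(res$residuals)` would then show residuals from all folds rather than the last one.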