Dataset Information

The dataset used for this analysis is titled “Flight Delay Data for U.S. Airports by Carrier August 2013 - August 2023” and is accessible through Kaggle.

This dataset provides detailed information on flight arrivals and delays at U.S. airports, broken down by carrier. It includes the number of arriving flights, the number of flights delayed by 15 minutes or more, cancellation and diversion counts, and a breakdown of delays by cause: carrier, weather, NAS (National Airspace System), security, and late aircraft arrivals. This makes it possible to compare the performance of different carriers at various airports over the period and to examine which factors contribute to delays in the aviation industry.

Our mission is to uncover the most critical factors contributing to flight delays over 15 minutes across U.S. airports.

Load necessary libraries

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggthemes)
library(tidyr)
library(knitr)
library(summarytools)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(patchwork)
library(ggrepel)

Let’s start the analysis by loading the data and taking a first look at its structure.

Load the dataset

delay <- read.csv("/Users/farzane/Downloads/NewStatistical-2025/Airline_Delay_Cause 4.csv")

dim(delay) # Number of Rows and Columns of the dataset respectively

## [1] 171666     21

names(delay) # Displays the Column names of the dataset

##  [1] "year"                "month"               "carrier"            
##  [4] "carrier_name"        "airport"             "airport_name"       
##  [7] "arr_flights"         "arr_del15"           "carrier_ct"         
## [10] "weather_ct"          "nas_ct"              "security_ct"        
## [13] "late_aircraft_ct"    "arr_cancelled"       "arr_diverted"       
## [16] "arr_delay"           "carrier_delay"       "weather_delay"      
## [19] "nas_delay"           "security_delay"      "late_aircraft_delay"

head(delay, 10) # Displays the first 10 lines of the dataset

##    year month carrier      carrier_name airport
## 1  2023     8      9E Endeavor Air Inc.     ABE
## 2  2023     8      9E Endeavor Air Inc.     ABY
## 3  2023     8      9E Endeavor Air Inc.     AEX
## 4  2023     8      9E Endeavor Air Inc.     AGS
## 5  2023     8      9E Endeavor Air Inc.     ALB
## 6  2023     8      9E Endeavor Air Inc.     ATL
## 7  2023     8      9E Endeavor Air Inc.     AUS
## 8  2023     8      9E Endeavor Air Inc.     AVL
## 9  2023     8      9E Endeavor Air Inc.     AZO
## 10 2023     8      9E Endeavor Air Inc.     BDL
##                                                   airport_name arr_flights
## 1  Allentown/Bethlehem/Easton, PA: Lehigh Valley International          89
## 2                       Albany, GA: Southwest Georgia Regional          62
## 3                     Alexandria, LA: Alexandria International          62
## 4                  Augusta, GA: Augusta Regional at Bush Field          66
## 5                             Albany, NY: Albany International          92
## 6        Atlanta, GA: Hartsfield-Jackson Atlanta International        1636
## 7                 Austin, TX: Austin - Bergstrom International          75
## 8                            Asheville, NC: Asheville Regional          59
## 9          Kalamazoo, MI: Kalamazoo/Battle Creek International          62
## 10                         Hartford, CT: Bradley International          30
##    arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1         13       2.25       1.60   3.16           0             5.99
## 2         10       1.97       0.04   0.57           0             7.42
## 3         10       2.73       1.18   1.80           0             4.28
## 4         12       3.69       2.27   4.47           0             1.57
## 5         22       7.76       0.00   2.96           0            11.28
## 6        256      55.98      27.81  63.64           0           108.57
## 7         12       5.62       0.97   4.41           0             1.00
## 8          7       3.32       0.00   0.42           0             3.26
## 9         13       6.53       0.94   3.54           0             1.99
## 10         4       0.00       0.82   0.00           0             3.18
##    arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1              2            1      1375            71           761       118
## 2              0            1       799           218             1        62
## 3              1            0       766            56           188        78
## 4              1            1      1397           471           320       388
## 5              2            0      1530           628             0       134
## 6             32           11     29768          9339          4557      4676
## 7              0            0       843           535           170       111
## 8              2            0       324           117             0        25
## 9              0            0       707           470            77        87
## 10             1            0      1421             0           532         0
##    security_delay late_aircraft_delay
## 1               0                 425
## 2               0                 518
## 3               0                 444
## 4               0                 218
## 5               0                 768
## 6               0               11196
## 7               0                  27
## 8               0                 182
## 9               0                  73
## 10              0                 889

Data Description

The dataset consists of 171,666 observations and 21 variables. Judging by the rows printed above, each observation summarizes one carrier at one airport in one month.

year: Year of the observation.
month: Month of the observation (1 to 12).
carrier: Carrier code.
carrier_name: Carrier name.
airport: Airport code.
airport_name: Airport name.
arr_flights: Number of arriving flights.
arr_del15: Number of arriving flights delayed by 15 minutes or more.
carrier_ct: Number of delayed flights attributed to the carrier.
weather_ct: Number of delayed flights attributed to weather.
nas_ct: Number of delayed flights attributed to the NAS (National Airspace System).
security_ct: Number of delayed flights attributed to security.
late_aircraft_ct: Number of delayed flights attributed to a late-arriving aircraft.
arr_cancelled: Number of flights canceled.
arr_diverted: Number of flights diverted.
arr_delay: Total arrival delay, in minutes.
carrier_delay: Delay minutes attributed to the carrier.
weather_delay: Delay minutes attributed to weather.
nas_delay: Delay minutes attributed to the NAS.
security_delay: Delay minutes attributed to security.
late_aircraft_delay: Delay minutes attributed to a late-arriving aircraft.

The five cause counts can be fractional (for example, 2.25), and in the rows printed above they add up to arr_del15.
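As a quick consistency check on these definitions, the sketch below re-enters four of the rows printed by head(delay, 10) above (rows 1, 2, 3, and 6) and tests whether the five cause counts add up to arr_del15 and the five cause delays add up to arr_delay. The object name sample_rows and the 0.05 rounding tolerance are illustrative choices, not part of the original analysis.

```r
# Values copied from rows 1, 2, 3, and 6 of head(delay, 10)
sample_rows <- data.frame(
  arr_del15           = c(13, 10, 10, 256),
  carrier_ct          = c(2.25, 1.97, 2.73, 55.98),
  weather_ct          = c(1.60, 0.04, 1.18, 27.81),
  nas_ct              = c(3.16, 0.57, 1.80, 63.64),
  security_ct         = c(0, 0, 0, 0),
  late_aircraft_ct    = c(5.99, 7.42, 4.28, 108.57),
  arr_delay           = c(1375, 799, 766, 29768),
  carrier_delay       = c(71, 218, 56, 9339),
  weather_delay       = c(761, 1, 188, 4557),
  nas_delay           = c(118, 62, 78, 4676),
  security_delay      = c(0, 0, 0, 0),
  late_aircraft_delay = c(425, 518, 444, 11196)
)

ct_cols    <- c("carrier_ct", "weather_ct", "nas_ct", "security_ct", "late_aircraft_ct")
delay_cols <- c("carrier_delay", "weather_delay", "nas_delay", "security_delay",
                "late_aircraft_delay")

# Gap between the summed cause counts and the reported number of delayed flights
count_gap <- abs(rowSums(sample_rows[ct_cols]) - sample_rows$arr_del15)

# Gap between the summed cause minutes and the reported total delay minutes
minute_gap <- abs(rowSums(sample_rows[delay_cols]) - sample_rows$arr_delay)

# Counts are rounded to two decimals, so allow a small tolerance (an assumption)
print(all(count_gap < 0.05))
print(all(minute_gap == 0))
```

The same comparison could be run on the full data frame to find rows where the breakdown and the totals disagree.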

Exploratory Data Analysis

summary(delay)

##       year          month          carrier          carrier_name      
##  Min.   :2013   Min.   : 1.000   Length:171666      Length:171666     
##  1st Qu.:2016   1st Qu.: 4.000   Class :character   Class :character  
##  Median :2019   Median : 7.000   Mode  :character   Mode  :character  
##  Mean   :2019   Mean   : 6.494                                        
##  3rd Qu.:2021   3rd Qu.: 9.000                                        
##  Max.   :2023   Max.   :12.000                                        
##                                                                       
##    airport          airport_name        arr_flights        arr_del15      
##  Length:171666      Length:171666      Min.   :    1.0   Min.   :   0.00  
##  Class :character   Class :character   1st Qu.:   50.0   1st Qu.:   6.00  
##  Mode  :character   Mode  :character   Median :  100.0   Median :  17.00  
##                                        Mean   :  362.5   Mean   :  66.43  
##                                        3rd Qu.:  250.0   3rd Qu.:  47.00  
##                                        Max.   :21977.0   Max.   :4176.00  
##                                        NA's   :240       NA's   :443      
##    carrier_ct        weather_ct         nas_ct         security_ct     
##  Min.   :   0.00   Min.   :  0.00   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:   2.16   1st Qu.:  0.00   1st Qu.:   1.00   1st Qu.: 0.0000  
##  Median :   6.40   Median :  0.40   Median :   3.91   Median : 0.0000  
##  Mean   :  20.80   Mean   :  2.25   Mean   :  19.38   Mean   : 0.1571  
##  3rd Qu.:  17.26   3rd Qu.:  1.86   3rd Qu.:  11.71   3rd Qu.: 0.0000  
##  Max.   :1293.91   Max.   :266.42   Max.   :1884.42   Max.   :58.6900  
##  NA's   :240       NA's   :240      NA's   :240       NA's   :240      
##  late_aircraft_ct  arr_cancelled      arr_diverted        arr_delay     
##  Min.   :   0.00   Min.   :   0.00   Min.   :  0.0000   Min.   :     0  
##  1st Qu.:   1.23   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.:   335  
##  Median :   5.00   Median :   1.00   Median :  0.0000   Median :  1018  
##  Mean   :  23.77   Mean   :   7.53   Mean   :  0.8634   Mean   :  4240  
##  3rd Qu.:  15.26   3rd Qu.:   4.00   3rd Qu.:  1.0000   3rd Qu.:  2884  
##  Max.   :2069.07   Max.   :4951.00   Max.   :197.0000   Max.   :438783  
##  NA's   :240       NA's   :240       NA's   :240        NA's   :240     
##  carrier_delay    weather_delay       nas_delay        security_delay    
##  Min.   :     0   Min.   :    0.0   Min.   :     0.0   Min.   :   0.000  
##  1st Qu.:   110   1st Qu.:    0.0   1st Qu.:    34.0   1st Qu.:   0.000  
##  Median :   375   Median :   18.0   Median :   146.0   Median :   0.000  
##  Mean   :  1437   Mean   :  222.6   Mean   :   920.6   Mean   :   7.383  
##  3rd Qu.:  1109   3rd Qu.:  146.0   3rd Qu.:   477.0   3rd Qu.:   0.000  
##  Max.   :196944   Max.   :31960.0   Max.   :112018.0   Max.   :3760.000  
##  NA's   :240      NA's   :240       NA's   :240        NA's   :240       
##  late_aircraft_delay
##  Min.   :     0     
##  1st Qu.:    65     
##  Median :   320     
##  Mean   :  1652     
##  3rd Qu.:  1070     
##  Max.   :227959     
##  NA's   :240

Summarize the flight categories

To better understand the overall performance of flights, we have summarized the key categories of flights in the dataset:

library(dplyr)

flight_summary <- delay %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    total_canceled = sum(arr_cancelled, na.rm = TRUE),
    total_diverted = sum(arr_diverted, na.rm = TRUE),
    total_delayed = sum(arr_del15, na.rm = TRUE),  # Delayed flights (15+ minutes)
    on_time_flights = total_flights - (total_canceled + total_diverted + total_delayed)  # On-time flights
  )

print(flight_summary)

##   total_flights total_canceled total_diverted total_delayed on_time_flights
## 1      62146805        1290923         148007      11375095        49332780

This summary gives an overview of the dataset, showing the scale of delays and other disruptions in the U.S. aviation system. Here are the results:

Total Flights: 62,146,805
Canceled Flights: 1,290,923
Diverted Flights: 148,007
Delayed Flights: 11,375,095
On-Time Flights: 49,332,780
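To put these totals in proportion, the short sketch below converts them to shares of all arriving flights. The numbers are copied from the flight_summary output above; the names totals and shares are illustrative.

```r
# Totals copied from the flight_summary output above
total_flights <- 62146805
totals <- c(
  canceled = 1290923,
  diverted = 148007,
  delayed  = 11375095,
  on_time  = 49332780
)

# The four categories should account for every arriving flight
categories_add_up <- sum(totals) == total_flights

# Share of each category, in percent of all arriving flights
shares <- round(100 * totals / total_flights, 1)
print(categories_add_up)
print(shares)
```

Rounded to one decimal, roughly four in five flights are on time and a little under one in five is delayed by 15 minutes or more.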

Pie Chart of Flight Categories

# Summarize data to get totals for each category
flight_summary <- delay %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    total_canceled = sum(arr_cancelled, na.rm = TRUE),
    total_diverted = sum(arr_diverted, na.rm = TRUE),
    total_delayed = sum(arr_del15, na.rm = TRUE)
  ) %>%
  mutate(
    on_time_flights = total_flights - (total_canceled + total_diverted + total_delayed)
  ) %>%
  select(total_canceled, total_diverted, total_delayed, on_time_flights) %>%
  pivot_longer(everything(), names_to = "category", values_to = "count")

# Create a pie chart
ggplot(flight_summary, aes(x = "", y = count, fill = category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +  # Polar coordinates for pie chart
  scale_fill_manual(values = c("total_canceled" = "red",
                               "total_diverted" = "darkorange",
                               "total_delayed" = "darkblue",
                               "on_time_flights" = "darkgreen")) +
  labs(title = "Breakdown of Flight Categories",
       fill = "Flight Category") +
  theme_void() +  # Clean theme for pie charts
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  ) +
  geom_text(aes(label = paste0(round(count / sum(count) * 100, 1), "%")),
            position = position_stack(vjust = 0.5), color = "white", size = 4)

This pie chart shows the breakdown of flight outcomes in the dataset, including on-time flights, delays over 15 minutes, cancellations, and diversions. It highlights how most flights are on time, while a smaller portion face delays, cancellations, and diversions.

Total number of arriving flights per year

# Aggregate arriving flights by year
flights_by_year <- delay %>%
  group_by(year) %>%
  summarise(total_arriving_flights = sum(arr_flights, na.rm = TRUE))

# Bar chart: Total number of arriving flights per year
ggplot(flights_by_year, aes(x = factor(year), y = total_arriving_flights)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +  # Added black border for clarity
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +  # Format y-axis with commas
  labs(title = "Total Number of Arriving Flights Per Year",
       x = "Year",
       y = "Number of Flights") +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 12),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1)
  )

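This bar chart shows the total number of arriving flights in the U.S. for each year from 2013 to 2023, giving a clear view of trends in air traffic over the years. Because the data run from August 2013 to August 2023, the first and last bars cover only part of a year and are not directly comparable to the full years in between.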

Exploring Monthly Trends in Total Flights

monthly_summary <- delay %>%
  group_by(year, month) %>%
  summarise(total_flights = sum(arr_flights, na.rm = TRUE)) %>%
  mutate(date = as.Date(paste(year, month, "01", sep = "-")))  # Create Date column

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

# Line chart: Total flights over time (monthly)
ggplot(monthly_summary, aes(x = date, y = total_flights)) +
  geom_line(color = "blue", linewidth = 1) +  # `linewidth` replaces `size` for lines since ggplot2 3.4.0
  geom_point(size = 2, color = "blue") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Monthly Total Flights Over Time (Highlighting 2020 Decline)",
       x = "Date",
       y = "Total Flights") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10)
  )

This line chart illustrates the monthly trend in total arriving flights in the U.S. from 2013 to 2023. A striking feature is the dramatic decline in flights during 2020, a direct result of the COVID-19 pandemic’s impact on global air travel. The chart also shows the recovery of air traffic in the subsequent years.
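The date column behind this chart is built by pasting the year, the month, and a fixed day of "01". As a small illustration, the sketch below compares that approach with a variant that zero-pads the month and states the format explicitly. The example years and months are arbitrary, and the names date_pasted and date_padded are illustrative.

```r
year  <- c(2013, 2020, 2023)
month <- c(8, 4, 12)

# Approach used above: paste the parts and let as.Date() parse the result
date_pasted <- as.Date(paste(year, month, "01", sep = "-"))

# Variant: zero-pad the month and state the format explicitly
date_padded <- as.Date(
  sprintf("%d-%02d-01", as.integer(year), as.integer(month)),
  format = "%Y-%m-%d"
)

print(date_pasted)
print(identical(date_pasted, date_padded))
```

Both give the first day of each month, so either can serve as the x-axis of a monthly time series.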

Annual Trends in Flight Cancellations

yearly_cancellations <- delay %>%
  group_by(year) %>%
  summarise(total_cancellations = sum(arr_cancelled, na.rm = TRUE))

ggplot(yearly_cancellations, aes(x = factor(year), y = total_cancellations)) +
  geom_bar(stat = "identity", fill = "red", color = "black") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Total Cancellations Per Year",
       x = "Year", y = "Number of Cancellations") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10)
  )

This bar chart shows the total number of flight cancellations per year, which fluctuates considerably over time. The sharp rise in cancellations in 2020 reflects the impact of the COVID-19 pandemic, which disrupted air travel globally. The chart shows how the aviation industry was affected during this period and how it gradually recovered in the following years.

Total Delays Per Year

yearly_delays <- delay %>%
  group_by(year) %>%
  summarise(total_delays = sum(arr_delay, na.rm = TRUE))  # Sum of total arrival delays

# Bar plot: Total delays per year
ggplot(yearly_delays, aes(x = factor(year), y = total_delays)) +
  geom_bar(stat = "identity", fill = "orange", color = "black") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Total Delays Per Year",
       x = "Year", y = "Total Delay (in minutes)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10)
  )

This chart illustrates the total minutes of delays per year across U.S. flights. A noticeable drop in total delays is evident in 2020, reflecting the reduced number of flights during the pandemic. With fewer flights in operation that year, overall delays naturally decreased. As flight numbers rebounded in subsequent years, delays also increased, highlighting the correlation between flight volume and total delay time.
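Since total delay minutes move with flight volume, it can help to look at rates as well as totals. The following is a minimal sketch of that normalization in base R. The toy data frame holds made-up numbers, and the column names delay_rate and minutes_per_flight are illustrative, not part of the original analysis.

```r
# Hypothetical toy rows (not taken from the dataset), two per year
toy <- data.frame(
  year        = c(2019, 2019, 2020, 2020),
  arr_flights = c(100, 300, 50, 150),
  arr_del15   = c(20, 60, 5, 15),
  arr_delay   = c(1000, 3000, 200, 600)
)

# Sum flights, delayed flights, and delay minutes within each year
yearly <- aggregate(cbind(arr_flights, arr_del15, arr_delay) ~ year,
                    data = toy, FUN = sum)

# Share of flights delayed 15+ minutes, and delay minutes per arriving flight
yearly$delay_rate         <- yearly$arr_del15 / yearly$arr_flights
yearly$minutes_per_flight <- yearly$arr_delay / yearly$arr_flights

print(yearly)
```

With the real data, toy would be replaced by delay. Note that the formula interface of aggregate() drops rows with a missing value in any of the listed columns by default, which differs slightly from summing each column with na.rm = TRUE.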

Monthly Distribution of Arriving Flights

# Aggregate arriving flights by month
flights_by_month <- delay %>%
  group_by(month) %>%
  summarise(total_arriving_flights = sum(arr_flights, na.rm = TRUE))

# Define month names
month_names <- c("January", "February", "March", "April", "May", "June",
                 "July", "August", "September", "October", "November", "December")

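# Index the names by month number so each label matches its own row,
# even if a month is missing or the rows are not in calendar order
flights_by_month$month <- factor(month_names[flights_by_month$month], levels = month_names)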

# Bar chart: Number of flights per month
ggplot(flights_by_month, aes(x = month, y = total_arriving_flights)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Number of Flights by Month",
       x = "Month",
       y = "Number of Flights") +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
  )

This bar chart provides an overview of the total number of arriving flights for each month. It highlights the seasonality in flight traffic, with peaks observed during summer months like July and August, likely driven by vacation travel. Conversely, slightly lower activity may be seen in months such as February or October, reflecting off-peak travel periods. This seasonal variation offers insights into travel demand patterns across the year.
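One caveat when reading these monthly totals: if the data cover every month from August 2013 through August 2023, as the dataset title indicates, then August is counted once more than any other month. The sketch below counts how often each calendar month occurs in that span. Complete monthly coverage is an assumption here, and the names months_covered and occurrences are illustrative.

```r
# Every calendar month from August 2013 through August 2023, inclusive
# (assumes the dataset has no gaps in that range)
months_covered <- seq(as.Date("2013-08-01"), as.Date("2023-08-01"), by = "month")

# How many times each month number ("01" to "12") occurs in that span
occurrences <- table(format(months_covered, "%m"))

print(length(months_covered))
print(occurrences)
```

Dividing each monthly total by the number of times that month occurs gives an average per month that is easier to compare across the year.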

Season

delay <- delay %>%
  mutate(
    season = case_when(
      month %in% c(12, 1, 2) ~ "Winter",
      month %in% c(3, 4, 5) ~ "Spring",
      month %in% c(6, 7, 8) ~ "Summer",
      month %in% c(9, 10, 11) ~ "Autumn"
    )
  )

We’ve categorized the data into four seasons (Winter, Spring, Summer, and Autumn) based on the month. This seasonal grouping helps to analyze and visualize trends more effectively, capturing potential seasonal impacts on flight operations.

Below, the total number of flights for each season is visualized, providing insights into how flight traffic varies throughout the year.

flights_by_season <- delay %>%
  group_by(season) %>%
  summarise(total_flights = sum(arr_flights, na.rm = TRUE))

# Visualize flights by season
ggplot(flights_by_season, aes(x = season, y = total_flights, fill = season)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Flights Per Season", x = "Season", y = "Number of Flights") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1")

The bar chart displays the total number of flights for each season, making it easy to compare the volume of flights across different seasons.
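Because season is stored as plain text, ggplot2 places the bars in alphabetical order (Autumn, Spring, Summer, Winter). One way to get calendar order is to store the season as a factor with explicit levels. The sketch below does this with a base R lookup that mirrors the case_when() rules above. The helper month_to_season and the choice to start the order at Winter are illustrative assumptions.

```r
# Desired display order for the seasons (an illustrative choice)
season_levels <- c("Winter", "Spring", "Summer", "Autumn")

# Season for each month number 1 to 12, matching the case_when() rules above
season_by_month <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Autumn", "Autumn", "Autumn",
                     "Winter")

# Hypothetical helper: map month numbers to a factor with ordered levels
month_to_season <- function(month) {
  factor(season_by_month[month], levels = season_levels)
}

example_seasons <- month_to_season(c(12, 1, 4, 8, 10))
print(example_seasons)
```

Assigning the result to delay$season before plotting would keep the same grouping while fixing the bar order.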

Number of Flights by Each Airline

# Aggregate arriving flights by airline
flights_by_airline <- delay %>%
  group_by(carrier_name) %>%
  summarise(total_arriving_flights = sum(arr_flights, na.rm = TRUE)) %>%
  arrange(desc(total_arriving_flights))

# Bar chart: Number of flights by each airline
ggplot(flights_by_airline, aes(x = reorder(carrier_name, -total_arriving_flights), y = total_arriving_flights)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +  # Format y-axis with commas
  labs(title = "Number of Flights by Each Airline",
       x = "Airline",
       y = "Number of Flights") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(size = 10))

This chart showcases the number of flights handled by each airline over the observed period. Southwest Airlines leads as the airline with the highest number of flights, reflecting its large-scale operations and extensive network.

Subsetting Out Year 2020:

We remove the 2020 records so that pandemic-related anomalies do not distort the analysis:

# Subset data excluding the year 2020
delay_no2020 <- delay %>% 
  filter(year != 2020)

Why Remove 2020 Data?

The year 2020 brought unprecedented disruptions to the aviation industry due to the COVID-19 pandemic. With strict travel restrictions, reduced demand, and widespread cancellations, flight operations during this period deviated significantly from typical patterns observed in other years. These anomalies could skew the analysis and affect the accuracy of insights derived from the data.

To ensure a more representative and consistent analysis, we have decided to exclude the data from 2020 in subsequent visualizations and calculations.

Updated Metrics Excluding 2020

summary_filtered <- delay_no2020 %>%
  summarise(
    total_flights = sum(arr_flights, na.rm = TRUE),
    total_canceled = sum(arr_cancelled, na.rm = TRUE),
    total_diverted = sum(arr_diverted, na.rm = TRUE),
    total_delayed = sum(arr_del15, na.rm = TRUE),
    on_time_flights = total_flights - (total_canceled + total_diverted + total_delayed)
  )

print(summary_filtered)

##   total_flights total_canceled total_diverted total_delayed on_time_flights
## 1      57458451        1009889         140263      10943174        45365125

After excluding the anomalous year 2020 from the dataset, the key metrics were recalculated to provide a more consistent view of flight operations:

Total Flights: 57,458,451
Canceled Flights: 1,009,889
Diverted Flights: 140,263
Delayed Flights (15+ minutes): 10,943,174
On-Time Flights: 45,365,125

Comparison of Flight Metrics Before and After Excluding 2020

library(knitr)

summary_combined <- data.frame(
  Metric = c("Total Flights", "Total Canceled", "Total Diverted", 
             "Total Delayed (15+ min)", "On-Time Flights"),
  With_2020 = c(62146805, 1290923, 148007, 11375095, 49332780),
  Without_2020 = c(57458451, 1009889, 140263, 10943174, 45365125)
)

kable(summary_combined, 
      caption = "Comparison of Flight Metrics Before and After Excluding 2020", 
      format = "simple")

Comparison of Flight Metrics Before and After Excluding 2020

Metric                     With_2020   Without_2020
------------------------  ----------  -------------
Total Flights               62146805       57458451
Total Canceled               1290923        1009889
Total Diverted                148007         140263
Total Delayed (15+ min)     11375095       10943174
On-Time Flights             49332780       45365125
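Subtracting the second column from the first isolates the 2020 figures, which makes the effect of the exclusion easier to see. The sketch below does that arithmetic with the numbers from the table above and compares cancellation and delay rates. The names only_2020, rate, and rates are illustrative.

```r
# Totals copied from the comparison table above
with_2020 <- c(flights = 62146805, canceled = 1290923, diverted = 148007,
               delayed = 11375095, on_time = 49332780)
without_2020 <- c(flights = 57458451, canceled = 1009889, diverted = 140263,
                  delayed = 10943174, on_time = 45365125)

# Figures for 2020 alone, obtained by subtraction
only_2020 <- with_2020 - without_2020

# Cancellation and 15+ minute delay rates, in percent of arriving flights
rate <- function(x) round(100 * x[c("canceled", "delayed")] / x[["flights"]], 2)
rates <- rbind(only_2020 = rate(only_2020), other_years = rate(without_2020))

print(only_2020)
print(rates)
```

In this arithmetic, 2020 shows a higher cancellation rate but a lower 15+ minute delay rate than the other years combined, which supports treating it separately.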

Total Delays by Month (after excluding the pandemic year)

# Group by month and calculate total delays
monthly_delays <- delay_no2020 %>%
  group_by(month) %>%
  summarise(total_delays = sum(arr_del15, na.rm = TRUE))

# Bar chart to visualize total delays by month
ggplot(monthly_delays, aes(x = factor(month), y = total_delays)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(
    title = "Total Delays Over 15 Minutes by Month",
    x = "Month",
    y = "Total Delays (15+ minutes)"
  ) +
  theme_minimal()

The pattern is logical: months with higher flight volumes often experience more delays. This is expected, as increased air traffic places more strain on resources, leading to a higher likelihood of delays. For example, summer months or holiday seasons, which tend to see more flights, also report higher delays, aligning with the earlier observation about the busiest seasons.

Preprocessing

missing_values <- colSums(is.na(delay_no2020))

print(missing_values)

##                year               month             carrier        carrier_name 
##                   0                   0                   0                   0 
##             airport        airport_name         arr_flights           arr_del15 
##                   0                   0                 150                 183 
##          carrier_ct          weather_ct              nas_ct         security_ct 
##                 150                 150                 150                 150 
##    late_aircraft_ct       arr_cancelled        arr_diverted           arr_delay 
##                 150                 150                 150                 150 
##       carrier_delay       weather_delay           nas_delay      security_delay 
##                 150                 150                 150                 150 
## late_aircraft_delay              season 
##                 150                   0

Before analyzing the data, we checked for missing values to ensure data quality. arr_del15 has 183 missing values. Every other count and delay column, including arr_flights, carrier_ct, arr_delay, and carrier_delay, has 150. The identifier columns (year, month, carrier, airport, and their names) and season have none.

Preprocessing: Missing Values Percentage

missing_percentage <- (colSums(is.na(delay_no2020)) / nrow(delay_no2020)) * 100

# Display the missing percentage for each column
print(missing_percentage)

##                year               month             carrier        carrier_name 
##          0.00000000          0.00000000          0.00000000          0.00000000 
##             airport        airport_name         arr_flights           arr_del15 
##          0.00000000          0.00000000          0.09816304          0.11975891 
##          carrier_ct          weather_ct              nas_ct         security_ct 
##          0.09816304          0.09816304          0.09816304          0.09816304 
##    late_aircraft_ct       arr_cancelled        arr_diverted           arr_delay 
##          0.09816304          0.09816304          0.09816304          0.09816304 
##       carrier_delay       weather_delay           nas_delay      security_delay 
##          0.09816304          0.09816304          0.09816304          0.09816304 
## late_aircraft_delay              season 
##          0.09816304          0.00000000

# Sort and display columns with missing values in descending order
missing_percentage_sorted <- sort(missing_percentage[missing_percentage > 0], decreasing = TRUE)
print(missing_percentage_sorted)

##           arr_del15         arr_flights          carrier_ct          weather_ct 
##          0.11975891          0.09816304          0.09816304          0.09816304 
##              nas_ct         security_ct    late_aircraft_ct       arr_cancelled 
##          0.09816304          0.09816304          0.09816304          0.09816304 
##        arr_diverted           arr_delay       carrier_delay       weather_delay 
##          0.09816304          0.09816304          0.09816304          0.09816304 
##           nas_delay      security_delay late_aircraft_delay 
##          0.09816304          0.09816304          0.09816304

We then expressed the missing values in each column as a percentage of all rows. arr_del15 has the highest share at about 0.12%, while the other affected columns, such as arr_flights and carrier_ct, are each missing about 0.10% of their values. Missing data is therefore minimal, and it is handled in the imputation step that follows.

sum(is.na(delay_no2020)) # Total number of missing values in the dataset

## [1] 2283
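The total of 2,283 can be reproduced from the per-column counts, and the percentages imply the number of rows in delay_no2020. The sketch below works through that arithmetic using only figures printed above. The variable names are illustrative, and the row count is inferred from the rounded percentage rather than read from the data.

```r
# Missing-value counts copied from the colSums(is.na(...)) output above
n_missing_del15 <- 183
n_missing_other <- 150
n_other_columns <- 14  # columns other than arr_del15 that have missing values

# One column with 183 missing values plus fourteen columns with 150 each
total_missing <- n_missing_del15 + n_other_columns * n_missing_other

# Row count implied by 150 missing values being 0.09816304 percent of all rows
implied_rows <- round(n_missing_other / (0.09816304 / 100))

# Percentage for arr_del15 recomputed from the implied row count
pct_del15 <- 100 * n_missing_del15 / implied_rows

print(total_missing)
print(implied_rows)
print(pct_del15)
```

The difference between the two counts also shows that 33 rows are missing arr_del15 while their other columns are present.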

Impute

For handling missing values, we replaced the missing entries in numeric columns with their respective median values. This ensures that the dataset remains complete while minimizing the impact on the data’s overall distribution.

# Identify numeric columns and get their names
numeric_columns <- names(delay_no2020)[sapply(delay_no2020, is.numeric)]

# Replace NA values in numeric columns with the median
delay_no2020[numeric_columns] <- lapply(delay_no2020[numeric_columns], function(x) {
  replace(x, is.na(x), median(x, na.rm = TRUE))
})

colSums(is.na(delay_no2020))

##                year               month             carrier        carrier_name 
##                   0                   0                   0                   0 
##             airport        airport_name         arr_flights           arr_del15 
##                   0                   0                   0                   0 
##          carrier_ct          weather_ct              nas_ct         security_ct 
##                   0                   0                   0                   0 
##    late_aircraft_ct       arr_cancelled        arr_diverted           arr_delay 
##                   0                   0                   0                   0 
##       carrier_delay       weather_delay           nas_delay      security_delay 
##                   0                   0                   0                   0 
## late_aircraft_delay              season 
##                   0                   0

sum(is.na(delay_no2020))

## [1] 0
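To illustrate why the median is a reasonable fill value for columns like arr_flights, where the summary above shows a mean well above the median, the sketch below imputes a small skewed vector both ways. The toy values and the helper names impute_median and impute_mean are hypothetical.

```r
# Hypothetical helpers: fill missing entries with the median or the mean
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
impute_mean   <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Made-up, right-skewed flight counts with one missing entry
toy_flights <- c(30, 60, 90, NA, 1636)

filled_median <- impute_median(toy_flights)
filled_mean   <- impute_mean(toy_flights)

print(filled_median)  # the NA is replaced by the median of the observed values
print(filled_mean)    # the NA is replaced by the mean, pulled up by the outlier
```

One thing to keep in mind is that each column is imputed separately, so an imputed row need not satisfy relationships that hold in complete rows, such as the cause counts adding up to arr_del15.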

Compute Log Ratio of Delays

To normalize the delays for varying flight volumes, we calculate the ratio of flights delayed by 15+ minutes to the total number of arriving flights. Since this ratio can vary widely, we apply the natural logarithm to stabilize the variance.

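Before calculating the ratio, we keep only the rows where both arr_flights and arr_del15 are greater than zero. A zero in arr_flights would mean dividing by zero, and a zero in arr_del15 would give log(0), which is negative infinity.

We then compute: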

\[ \text{delay_ratio} = \log\left(\frac{\text{arr_del15}}{\text{arr_flights}}\right) \]

delay_no2020_ratio <- delay_no2020 %>%
  filter(arr_flights > 0, arr_del15 > 0) %>%
  mutate(delay_ratio = log(arr_del15 / arr_flights))
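As a worked example of this transformation, the sketch below applies it to the first row of the dataset shown earlier (13 delayed flights out of 89 arrivals) and shows what happens when no flights are delayed. The helper name log_delay_ratio is illustrative.

```r
# Hypothetical helper wrapping the formula used in the mutate() call above
log_delay_ratio <- function(arr_del15, arr_flights) {
  log(arr_del15 / arr_flights)
}

# First row of the dataset: 13 delayed flights out of 89 arrivals
first_row_ratio <- log_delay_ratio(13, 89)

# exp() undoes the log and recovers the raw proportion of delayed flights
first_row_share <- exp(first_row_ratio)

# With zero delayed flights the log is -Inf, which is why such rows are filtered out
zero_delay_ratio <- log_delay_ratio(0, 89)

print(first_row_ratio)
print(first_row_share)
print(zero_delay_ratio)
```

Because the ratio is always at most 1, the log values are zero or negative, and values closer to zero mean a larger share of delayed flights.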

Verify the Results:

After computing the ratio, inspect the first few rows and the summary statistics to ensure correctness:

# View the first few rows
head(delay_no2020_ratio)

##   year month carrier      carrier_name airport
## 1 2023     8      9E Endeavor Air Inc.     ABE
## 2 2023     8      9E Endeavor Air Inc.     ABY
## 3 2023     8      9E Endeavor Air Inc.     AEX
## 4 2023     8      9E Endeavor Air Inc.     AGS
## 5 2023     8      9E Endeavor Air Inc.     ALB
## 6 2023     8      9E Endeavor Air Inc.     ATL
##                                                  airport_name arr_flights
## 1 Allentown/Bethlehem/Easton, PA: Lehigh Valley International          89
## 2                      Albany, GA: Southwest Georgia Regional          62
## 3                    Alexandria, LA: Alexandria International          62
## 4                 Augusta, GA: Augusta Regional at Bush Field          66
## 5                            Albany, NY: Albany International          92
## 6       Atlanta, GA: Hartsfield-Jackson Atlanta International        1636
##   arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1        13       2.25       1.60   3.16           0             5.99
## 2        10       1.97       0.04   0.57           0             7.42
## 3        10       2.73       1.18   1.80           0             4.28
## 4        12       3.69       2.27   4.47           0             1.57
## 5        22       7.76       0.00   2.96           0            11.28
## 6       256      55.98      27.81  63.64           0           108.57
##   arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1             2            1      1375            71           761       118
## 2             0            1       799           218             1        62
## 3             1            0       766            56           188        78
## 4             1            1      1397           471           320       388
## 5             2            0      1530           628             0       134
## 6            32           11     29768          9339          4557      4676
##   security_delay late_aircraft_delay season delay_ratio
## 1              0                 425 Summer   -1.923687
## 2              0                 518 Summer   -1.824549
## 3              0                 444 Summer   -1.824549
## 4              0                 218 Summer   -1.704748
## 5              0                 768 Summer   -1.430746
## 6              0               11196 Summer   -1.854832

# Check the summary statistics of all columns (delay_ratio is the last one)
summary(delay_no2020_ratio)

##       year          month          carrier          carrier_name      
##  Min.   :2013   Min.   : 1.000   Length:148384      Length:148384     
##  1st Qu.:2016   1st Qu.: 4.000   Class :character   Class :character  
##  Median :2018   Median : 7.000   Mode  :character   Mode  :character  
##  Mean   :2018   Mean   : 6.521                                        
##  3rd Qu.:2021   3rd Qu.: 9.000                                        
##  Max.   :2023   Max.   :12.000                                        
##    airport          airport_name        arr_flights        arr_del15      
##  Length:148384      Length:148384      Min.   :    1.0   Min.   :   1.00  
##  Class :character   Class :character   1st Qu.:   56.0   1st Qu.:   8.00  
##  Mode  :character   Mode  :character   Median :  113.0   Median :  20.00  
##                                        Mean   :  387.1   Mean   :  73.77  
##                                        3rd Qu.:  270.0   3rd Qu.:  53.00  
##                                        Max.   :21977.0   Max.   :4176.00  
##    carrier_ct        weather_ct          nas_ct         security_ct     
##  Min.   :   0.00   Min.   :  0.000   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:   2.96   1st Qu.:  0.000   1st Qu.:   1.43   1st Qu.: 0.0000  
##  Median :   7.46   Median :  0.610   Median :   4.57   Median : 0.0000  
##  Mean   :  22.98   Mean   :  2.468   Mean   :  21.42   Mean   : 0.1711  
##  3rd Qu.:  19.46   3rd Qu.:  2.000   3rd Qu.:  13.23   3rd Qu.: 0.0000  
##  Max.   :1293.91   Max.   :266.420   Max.   :1884.42   Max.   :58.6900  
##  late_aircraft_ct  arr_cancelled       arr_diverted        arr_delay     
##  Min.   :   0.00   Min.   :   0.000   Min.   :  0.0000   Min.   :     0  
##  1st Qu.:   2.00   1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:   453  
##  Median :   6.14   Median :   1.000   Median :  0.0000   Median :  1210  
##  Mean   :  26.73   Mean   :   6.802   Mean   :  0.9445   Mean   :  4718  
##  3rd Qu.:  17.70   3rd Qu.:   4.000   3rd Qu.:  1.0000   3rd Qu.:  3277  
##  Max.   :2069.07   Max.   :1565.000   Max.   :197.0000   Max.   :438783  
##  carrier_delay    weather_delay       nas_delay      security_delay    
##  Min.   :     0   Min.   :    0.0   Min.   :     0   Min.   :   0.000  
##  1st Qu.:   150   1st Qu.:    0.0   1st Qu.:    51   1st Qu.:   0.000  
##  Median :   445   Median :   27.0   Median :   175   Median :   0.000  
##  Mean   :  1585   Mean   :  244.4   Mean   :  1024   Mean   :   8.104  
##  3rd Qu.:  1249   3rd Qu.:  169.0   3rd Qu.:   546   3rd Qu.:   0.000  
##  Max.   :196944   Max.   :31960.0   Max.   :112018   Max.   :3760.000  
##  late_aircraft_delay    season           delay_ratio    
##  Min.   :     0      Length:148384      Min.   :-4.771  
##  1st Qu.:   110      Class :character   1st Qu.:-2.052  
##  Median :   401      Mode  :character   Median :-1.689  
##  Mean   :  1856                         Mean   :-1.748  
##  3rd Qu.:  1245                         3rd Qu.:-1.382  
##  Max.   :227959                         Max.   : 2.944

Summary of Delay Ratio

The summary of the delay_ratio shows the following (values are on the log scale, so exp() converts them back to the share of arrivals delayed):

The lowest value is -4.771 (under 1% of arrivals delayed), meaning very few delays compared to total flights in some cases.
The mean is -1.748 (about 17% after back-transforming), showing delays are generally low relative to flights.
Half of the data falls below the median of -1.689 (about 18%), and three quarters of the values are below the 3rd quartile of -1.382 (about 25%).
The highest value is 2.944, which corresponds to 19 recorded delays per arriving flight.
This suggests that delays are usually low, but some records report more delays than arrivals, which is not possible and needs to be investigated.
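
The back-transformation from the log scale to a share of arrivals can be reproduced in a few lines. This is a minimal sketch using the rounded values from the summary output; the names log_stats, delayed_share and first_row_log are introduced here for illustration only.

```r
# Rounded delay_ratio statistics copied from the summary output
log_stats <- c(min = -4.771, q1 = -2.052, median = -1.689,
               mean = -1.748, q3 = -1.382, max = 2.944)

# Back-transform: exp(log(p)) = p, the share of arrivals delayed 15+ minutes
delayed_share <- exp(log_stats)
round(delayed_share, 3)

# Round trip for the first row of head(): 13 delayed out of 89 arrivals
first_row_log <- log(13 / 89)
first_row_log
```

Note that exp() of the mean log ratio is the geometric mean of the shares, not their arithmetic mean. Any back-transformed value above 1 signals more recorded delays than arrivals.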

Filtering for Extreme Cases

To identify and investigate extreme cases, we list the rows where delay_ratio exceeds 1. Because delay_ratio is a natural log, this corresponds to arr_del15 / arr_flights > e ≈ 2.72, that is, far more recorded delays than arriving flights:

delay_no2020_ratio %>%
    filter(delay_ratio > 1)  # Example: Identify extreme cases

##    year month carrier             carrier_name airport
## 1  2022    10      OO    SkyWest Airlines Inc.     PHL
## 2  2022    10      YV       Mesa Airlines Inc.     JAN
## 3  2022     4      YV       Mesa Airlines Inc.     MLB
## 4  2022     4      YV       Mesa Airlines Inc.     SHV
## 5  2022     1      YV       Mesa Airlines Inc.     BOS
## 6  2022     1      YV       Mesa Airlines Inc.     BTV
## 7  2022     1      YV       Mesa Airlines Inc.     BUF
## 8  2022     1      YX         Republic Airline     GRB
## 9  2021    10      OO    SkyWest Airlines Inc.     CYS
## 10 2021     9      OO    SkyWest Airlines Inc.     TYR
## 11 2021     2      AA   American Airlines Inc.     LIH
## 12 2019     3      EV ExpressJet Airlines Inc.     FAR
## 13 2019     3      EV ExpressJet Airlines Inc.     PHL
## 14 2019     2      DL     Delta Air Lines Inc.     CID
## 15 2019     1      OO    SkyWest Airlines Inc.     GPT
## 16 2019     1      EV ExpressJet Airlines Inc.     SAF
## 17 2018     8      EV ExpressJet Airlines Inc.     PIA
## 18 2018     7      UA    United Air Lines Inc.     SBA
## 19 2018     4      EV ExpressJet Airlines Inc.     RSW
## 20 2017    10      EV ExpressJet Airlines Inc.     JFK
## 21 2017     9      EV ExpressJet Airlines Inc.     AUS
## 22 2017     9      EV ExpressJet Airlines Inc.     MSY
## 23 2017     9      UA    United Air Lines Inc.     EGE
## 24 2017     5      VX           Virgin America     SLC
## 25 2016    12      OO    SkyWest Airlines Inc.     PNS
## 26 2016     9      AA   American Airlines Inc.     BTV
## 27 2016     9      EV ExpressJet Airlines Inc.     RSW
## 28 2015    11      MQ                Envoy Air     BPT
## 29 2015     3      MQ                Envoy Air     GUC
## 30 2015     1      OO    SkyWest Airlines Inc.     EVV
## 31 2014     3      OO    SkyWest Airlines Inc.     GFK
## 32 2013     9      EV ExpressJet Airlines Inc.     TPA
##                                                       airport_name arr_flights
## 1                     Philadelphia, PA: Philadelphia International           1
## 2  Jackson/Vicksburg, MS: Jackson Medgar Wiley Evers International           1
## 3                           Melbourne, FL: Melbourne International           1
## 4                              Shreveport, LA: Shreveport Regional           1
## 5                                  Boston, MA: Logan International           2
## 6                         Burlington, VT: Burlington International           1
## 7                       Buffalo, NY: Buffalo Niagara International           1
## 8           Green Bay, WI: Green Bay Austin Straubel International           2
## 9                Cheyenne, WY: Cheyenne Regional/Jerry Olson Field           2
## 10                                Tyler, TX: Tyler Pounds Regional           1
## 11                                        Lihue, HI: Lihue Airport           2
## 12                                 Fargo, ND: Hector International           1
## 13                    Philadelphia, PA: Philadelphia International           1
## 14                    Cedar Rapids/Iowa City, IA: The Eastern Iowa           1
## 15              Gulfport/Biloxi, MS: Gulfport-Biloxi International           1
## 16                                Santa Fe, NM: Santa Fe Municipal           4
## 17              Peoria, IL: General Downing - Peoria International           1
## 18                      Santa Barbara, CA: Santa Barbara Municipal           1
## 19                 Fort Myers, FL: Southwest Florida International           1
## 20                     New York, NY: John F. Kennedy International           1
## 21                    Austin, TX: Austin - Bergstrom International           1
## 22      New Orleans, LA: Louis Armstrong New Orleans International           2
## 23                                Eagle, CO: Eagle County Regional           1
## 24                Salt Lake City, UT: Salt Lake City International           1
## 25                          Pensacola, FL: Pensacola International           1
## 26                        Burlington, VT: Burlington International           1
## 27                 Fort Myers, FL: Southwest Florida International           1
## 28                  Beaumont/Port Arthur, TX: Jack Brooks Regional           1
## 29                   Gunnison, CO: Gunnison-Crested Butte Regional           1
## 30                             Evansville, IN: Evansville Regional           1
## 31                      Grand Forks, ND: Grand Forks International           1
## 32                                  Tampa, FL: Tampa International           1
##    arr_del15 carrier_ct weather_ct nas_ct security_ct late_aircraft_ct
## 1         19          0          0      0           0                0
## 2         19          0          0      0           0                0
## 3         19          0          0      0           0                0
## 4         19          0          0      0           0                0
## 5         19          0          0      0           0                0
## 6         19          0          0      0           0                0
## 7         19          0          0      0           0                0
## 8         19          0          0      0           0                0
## 9         19          0          0      0           0                0
## 10        19          0          0      0           0                0
## 11        19          0          0      0           0                0
## 12        19          0          0      0           0                0
## 13        19          0          0      0           0                0
## 14        19          0          0      0           0                0
## 15        19          0          0      0           0                0
## 16        19          0          0      0           0                0
## 17        19          0          0      0           0                0
## 18        19          0          0      0           0                0
## 19        19          0          0      0           0                0
## 20        19          0          0      0           0                0
## 21        19          0          0      0           0                0
## 22        19          0          0      0           0                0
## 23        19          0          0      0           0                0
## 24        19          0          0      0           0                0
## 25        19          0          0      0           0                0
## 26        19          0          0      0           0                0
## 27        19          0          0      0           0                0
## 28        19          0          0      0           0                0
## 29        19          0          0      0           0                0
## 30        19          0          0      0           0                0
## 31        19          0          0      0           0                0
## 32        19          0          0      0           0                0
##    arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay
## 1              1            0         0             0             0         0
## 2              1            0         0             0             0         0
## 3              1            0         0             0             0         0
## 4              1            0         0             0             0         0
## 5              2            0         0             0             0         0
## 6              1            0         0             0             0         0
## 7              1            0         0             0             0         0
## 8              2            0         0             0             0         0
## 9              2            0         0             0             0         0
## 10             1            0         0             0             0         0
## 11             2            0         0             0             0         0
## 12             1            0         0             0             0         0
## 13             1            0         0             0             0         0
## 14             0            1         0             0             0         0
## 15             1            0         0             0             0         0
## 16             4            0         0             0             0         0
## 17             1            0         0             0             0         0
## 18             0            1         0             0             0         0
## 19             0            1         0             0             0         0
## 20             0            1         0             0             0         0
## 21             1            0         0             0             0         0
## 22             2            0         0             0             0         0
## 23             1            0         0             0             0         0
## 24             0            1         0             0             0         0
## 25             1            0         0             0             0         0
## 26             1            0         0             0             0         0
## 27             0            1         0             0             0         0
## 28             0            1         0             0             0         0
## 29             1            0         0             0             0         0
## 30             1            0         0             0             0         0
## 31             1            0         0             0             0         0
## 32             0            1         0             0             0         0
##    security_delay late_aircraft_delay season delay_ratio
## 1               0                   0 Autumn    2.944439
## 2               0                   0 Autumn    2.944439
## 3               0                   0 Spring    2.944439
## 4               0                   0 Spring    2.944439
## 5               0                   0 Winter    2.251292
## 6               0                   0 Winter    2.944439
## 7               0                   0 Winter    2.944439
## 8               0                   0 Winter    2.251292
## 9               0                   0 Autumn    2.251292
## 10              0                   0 Autumn    2.944439
## 11              0                   0 Winter    2.251292
## 12              0                   0 Spring    2.944439
## 13              0                   0 Spring    2.944439
## 14              0                   0 Winter    2.944439
## 15              0                   0 Winter    2.944439
## 16              0                   0 Winter    1.558145
## 17              0                   0 Summer    2.944439
## 18              0                   0 Summer    2.944439
## 19              0                   0 Spring    2.944439
## 20              0                   0 Autumn    2.944439
## 21              0                   0 Autumn    2.944439
## 22              0                   0 Autumn    2.251292
## 23              0                   0 Autumn    2.944439
## 24              0                   0 Spring    2.944439
## 25              0                   0 Winter    2.944439
## 26              0                   0 Autumn    2.944439
## 27              0                   0 Autumn    2.944439
## 28              0                   0 Autumn    2.944439
## 29              0                   0 Spring    2.944439
## 30              0                   0 Winter    2.944439
## 31              0                   0 Spring    2.944439
## 32              0                   0 Autumn    2.944439

Handling Extreme Cases

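We identified rows where delay_ratio > 1. These are logically inconsistent, as the number of flights delayed by 15+ minutes cannot exceed the total number of arriving flights (on the log scale, any delay_ratio above 0 already violates this). In all 32 rows listed above, arr_del15 equals 19 while arr_flights is between 1 and 4 and every cause count and delay-minute column is 0, which suggests that these arr_del15 values are not genuine counts. To address this, we filtered out these rows.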

# Filter out rows with delay_ratio > 1
delay_no2020_ratio <- delay_no2020_ratio %>%
    filter(delay_ratio <= 1)

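This step removes the most extreme anomalies. As the summary below shows, the maximum delay_ratio is now 0.9985, which is still above 0, so at least one row with arr_del15 greater than arr_flights remains in the data.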

summary(delay_no2020_ratio$delay_ratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.7707 -2.0523 -1.6895 -1.7491 -1.3825  0.9985
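
Since delay_ratio is a log, a cut-off of 1 only removes rows with arr_del15 / arr_flights above e ≈ 2.72. If the intent is to drop every row with more delays than arrivals, a stricter variant could look like the sketch below. The toy data frame and the names toy, toy_threshold_1 and toy_consistent are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy rows: the second and third have more recorded delays than arrivals
toy <- data.frame(
  arr_flights = c(100, 1, 10, 50),
  arr_del15   = c(20, 19, 15, 50)
) %>%
  mutate(delay_ratio = log(arr_del15 / arr_flights))

# Cut-off of 1 on the log scale: only removes rows with a ratio above e
toy_threshold_1 <- toy %>% filter(delay_ratio <= 1)

# Stricter variant: keep rows where delays do not exceed arrivals,
# which is the same as delay_ratio <= 0 on the log scale
toy_consistent <- toy %>% filter(arr_del15 <= arr_flights)

nrow(toy_threshold_1)
nrow(toy_consistent)
```

Applying the stricter filter to delay_no2020_ratio would change the row counts and summaries reported in the rest of this analysis.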

ggplot(delay_no2020_ratio, aes(x = arr_flights, y = arr_del15)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  facet_wrap(~season) +
  labs(
    title = "Seasonal Variation in Delays Over 15 Minutes",
    x = "Number of Flights (arr_flights)",
    y = "Delays > 15 Minutes (arr_del15)"
  ) +
  theme_minimal()

The plot shows the connection between the number of flights and delays over 15 minutes, separated by season (Autumn, Spring, Summer, Winter). Here’s what it tells us:

1- More Flights = More Delays:

In every season, when the number of flights increases, the number of delays also goes up. This is expected.

2- Seasons Look Similar:

The trend between flights and delays is very similar across all seasons, with no major differences.

3- Outliers:

Some points show unusually many delays for very few flights, while others show relatively few delays despite a very high number of flights.
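
The impression that the seasons behave similarly can also be checked numerically by comparing the pooled share of delayed arrivals per season. This is a minimal sketch on made-up numbers; season_toy, season_share and delayed_share are illustrative names, and with the real data delay_no2020_ratio would take the place of season_toy.

```r
library(dplyr)

# Made-up records, two per season
season_toy <- data.frame(
  season      = c("Winter", "Winter", "Spring", "Spring",
                  "Summer", "Summer", "Autumn", "Autumn"),
  arr_flights = c(100, 300, 200, 200, 500, 500, 150, 250),
  arr_del15   = c(25, 55, 30, 50, 110, 90, 20, 40)
)

# Pooled share of arrivals delayed 15+ minutes within each season
season_share <- season_toy %>%
  group_by(season) %>%
  summarise(
    flights       = sum(arr_flights),
    delayed       = sum(arr_del15),
    delayed_share = delayed / flights
  )

season_share
```

Similar delayed_share values across seasons would support the visual reading; clearly different values would suggest that the scatter plots hide a seasonal effect.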

Normalizing Delay Predictors by Flight Volume

To better understand the contribution of different factors to delays, we normalize each delay type by the total number of flights. This ensures that the comparison is fair across different flight volumes. Additionally, the delay ratio (log-transformed) is used as the target variable for further analysis.

delay_normalized <- delay_no2020_ratio %>%
  filter(arr_flights > 0, carrier_ct > 0, weather_ct > 0, nas_ct > 0,
         security_ct > 0, late_aircraft_ct > 0) %>%  # Keep positive counts so the later log() is finite
  mutate(
    delay_ratio = log(arr_del15 / arr_flights),   # Target variable (recomputed; same as the existing column)
    carrier_ratio = carrier_ct / arr_flights,     # Normalize carrier delays
    weather_ratio = weather_ct / arr_flights,     # Normalize weather delays
    nas_ratio = nas_ct / arr_flights,             # Normalize NAS delays
    security_ratio = security_ct / arr_flights,   # Normalize security delays
    late_aircraft_ratio = late_aircraft_ct / arr_flights  # Normalize late aircraft delays
  )
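
Requiring every cause count to be strictly positive can remove a large share of rows, because some causes are often zero (security_ct, for example, has a median of 0 in the earlier summary). One way to see which condition is most restrictive is to compute the share of rows that pass each one. This is a sketch on toy data; counts_toy, share_positive and share_all_positive are illustrative names.

```r
# Toy cause counts; zeros are common for weather and security
counts_toy <- data.frame(
  carrier_ct       = c(2.5, 1.0, 0.0, 3.2, 4.1),
  weather_ct       = c(0.0, 0.4, 0.0, 1.1, 0.0),
  nas_ct           = c(1.2, 0.8, 2.0, 0.5, 3.3),
  security_ct      = c(0.0, 0.0, 0.0, 0.2, 0.0),
  late_aircraft_ct = c(3.0, 2.2, 1.0, 4.0, 0.0)
)

# Share of rows with a strictly positive value, per column
share_positive <- colMeans(counts_toy > 0)
share_positive

# Share of rows that pass all conditions at once (what filter() keeps)
share_all_positive <- mean(rowSums(counts_toy > 0) == ncol(counts_toy))
share_all_positive
```

Running the same two lines on the five count columns of delay_no2020_ratio would show how much of the data the combined filter retains and which cause drives the reduction.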

Check Skewness

To visually identify skewness for the new predictors (carrier_ratio, weather_ratio, etc.):

library(ggplot2)

# Convert data to long format for plotting
normalized_long <- delay_normalized %>%
  select(delay_ratio, carrier_ratio, weather_ratio, nas_ratio, security_ratio, late_aircraft_ratio) %>%
  pivot_longer(everything(), names_to = "RatioType", values_to = "Value")

# Plot histograms
ggplot(normalized_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.7) +
  facet_wrap(~RatioType, scales = "free") +
  labs(
    title = "Distribution of Normalized Delay Ratios",
    x = "Value",
    y = "Count"
  ) +
  theme_minimal()

This visualization displays the distributions of the log-transformed target (delay_ratio) and of the normalized delay ratios for the five delay factors: carrier, weather, NAS, security, and late aircraft. Each ratio histogram shows the number of delays attributed to one factor relative to the total number of arriving flights.

Key Observations:

Skewness:

The distributions of the five cause ratios exhibit positive skewness (right-skewed): most values are clustered near zero, with a few higher values extending to the right. This indicates that for most carrier-airport-month records, delays attributed to these factors are a small share of arrivals, but occasional outliers have much larger shares.

Notable Factors:

Carrier, late aircraft, and NAS delays show a wider spread, indicating greater variability in their contribution to delays. Security and weather delays have much smaller normalized ratios concentrated close to zero, suggesting they account for only a small share of delayed arrivals.

Implications:

The skewed distributions highlight that while delays are generally low, some extreme cases significantly impact overall performance. This analysis helps pinpoint which delay types have more variability and may need targeted intervention to minimize their effects.

Log Transformation of Delay Ratios

Applying log transformation to the normalized delay ratios reduces skewness and stabilizes variance, making the data more suitable for further analysis.

delay_normalized_log <- delay_normalized %>%
  mutate(
    log_carrier_ratio = log(carrier_ratio),  
    log_weather_ratio = log(weather_ratio),
    log_nas_ratio = log(nas_ratio),
    log_security_ratio = log(security_ratio),
    log_late_aircraft_ratio = log(late_aircraft_ratio)
  )
summary(delay_normalized_log)

##       year          month          carrier          carrier_name      
##  Min.   :2013   Min.   : 1.000   Length:16554       Length:16554      
##  1st Qu.:2016   1st Qu.: 4.000   Class :character   Class :character  
##  Median :2019   Median : 7.000   Mode  :character   Mode  :character  
##  Mean   :2019   Mean   : 6.509                                        
##  3rd Qu.:2021   3rd Qu.: 9.000                                        
##  Max.   :2023   Max.   :12.000                                        
##    airport          airport_name        arr_flights      arr_del15     
##  Length:16554       Length:16554       Min.   :    7   Min.   :   3.0  
##  Class :character   Class :character   1st Qu.:  291   1st Qu.:  64.0  
##  Mode  :character   Mode  :character   Median :  731   Median : 157.0  
##                                        Mean   : 1636   Mean   : 319.7  
##                                        3rd Qu.: 2098   3rd Qu.: 412.0  
##                                        Max.   :21977   Max.   :4176.0  
##    carrier_ct        weather_ct          nas_ct         security_ct    
##  Min.   :   0.03   Min.   :  0.010   Min.   :   0.02   Min.   : 0.010  
##  1st Qu.:  20.79   1st Qu.:  1.520   1st Qu.:  13.49   1st Qu.: 0.460  
##  Median :  49.90   Median :  3.950   Median :  38.66   Median : 1.000  
##  Mean   :  93.08   Mean   :  9.985   Mean   :  92.14   Mean   : 1.362  
##  3rd Qu.: 120.66   3rd Qu.: 10.280   3rd Qu.: 110.20   3rd Qu.: 1.600  
##  Max.   :1293.91   Max.   :266.420   Max.   :1884.42   Max.   :41.970  
##  late_aircraft_ct  arr_cancelled      arr_diverted       arr_delay     
##  Min.   :   0.03   Min.   :   0.00   Min.   :  0.000   Min.   :    85  
##  1st Qu.:  18.39   1st Qu.:   2.00   1st Qu.:  0.000   1st Qu.:  3783  
##  Median :  51.53   Median :   8.00   Median :  1.000   Median :  9503  
##  Mean   : 123.10   Mean   :  30.02   Mean   :  4.003   Mean   : 20485  
##  3rd Qu.: 150.92   3rd Qu.:  27.00   3rd Qu.:  4.000   3rd Qu.: 25255  
##  Max.   :2069.07   Max.   :1565.00   Max.   :197.000   Max.   :438783  
##  carrier_delay    weather_delay       nas_delay      security_delay   
##  Min.   :     1   Min.   :    1.0   Min.   :     1   Min.   :   1.00  
##  1st Qu.:  1260   1st Qu.:  104.0   1st Qu.:   537   1st Qu.:  15.00  
##  Median :  3139   Median :  324.0   Median :  1616   Median :  32.00  
##  Mean   :  6645   Mean   :  980.4   Mean   :  4385   Mean   :  64.91  
##  3rd Qu.:  8090   3rd Qu.:  925.0   3rd Qu.:  4942   3rd Qu.:  74.00  
##  Max.   :196944   Max.   :26446.0   Max.   :112018   Max.   :3760.00  
##  late_aircraft_delay    season           delay_ratio      carrier_ratio      
##  Min.   :     1      Length:16554       Min.   :-3.6068   Min.   :0.0004839  
##  1st Qu.:  1270      Class :character   1st Qu.:-1.8157   1st Qu.:0.0456990  
##  Median :  3605      Mode  :character   Median :-1.5413   Median :0.0647796  
##  Mean   :  8409                         Mean   :-1.5655   Mean   :0.0715220  
##  3rd Qu.: 10174                         3rd Qu.:-1.2854   3rd Qu.:0.0896030  
##  Max.   :227959                         Max.   :-0.3185   Max.   :0.4498214  
##  weather_ratio         nas_ratio        security_ratio      late_aircraft_ratio
##  Min.   :6.040e-06   Min.   :0.000032   Min.   :2.620e-06   Min.   :0.0001667  
##  1st Qu.:2.625e-03   1st Qu.:0.032922   1st Qu.:4.470e-04   1st Qu.:0.0479366  
##  Median :5.721e-03   Median :0.053118   Median :1.087e-03   Median :0.0717587  
##  Mean   :8.681e-03   Mean   :0.063571   Mean   :2.478e-03   Mean   :0.0790836  
##  3rd Qu.:1.110e-02   3rd Qu.:0.082959   3rd Qu.:2.687e-03   3rd Qu.:0.1027813  
##  Max.   :1.553e-01   Max.   :0.372857   Max.   :1.264e-01   Max.   :0.4900000  
##  log_carrier_ratio log_weather_ratio log_nas_ratio      log_security_ratio
##  Min.   :-7.6337   Min.   :-12.017   Min.   :-10.3494   Min.   :-12.853   
##  1st Qu.:-3.0857   1st Qu.: -5.943   1st Qu.: -3.4136   1st Qu.: -7.713   
##  Median :-2.7368   Median : -5.164   Median : -2.9352   Median : -6.825   
##  Mean   :-2.7615   Mean   : -5.274   Mean   : -3.0137   Mean   : -6.854   
##  3rd Qu.:-2.4124   3rd Qu.: -4.501   3rd Qu.: -2.4894   3rd Qu.: -5.919   
##  Max.   :-0.7989   Max.   : -1.862   Max.   : -0.9866   Max.   : -2.069   
##  log_late_aircraft_ratio
##  Min.   :-8.6995        
##  1st Qu.:-3.0379        
##  Median :-2.6344        
##  Mean   :-2.7000        
##  3rd Qu.:-2.2752        
##  Max.   :-0.7134
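
Whether the log transformation actually reduces skewness can be checked with a numeric skewness measure. The helper below is a plain moment-based sample skewness written in base R; sample_skewness is a name introduced here, not a function from any of the loaded packages, and the data are simulated rather than taken from the real ratios.

```r
# Moment-based sample skewness: third central moment / (second central moment)^1.5
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
# Simulated right-skewed values standing in for a normalized delay ratio
ratio_sim <- rlnorm(1000, meanlog = -3, sdlog = 0.6)

skew_raw <- sample_skewness(ratio_sim)        # positive for right-skewed data
skew_log <- sample_skewness(log(ratio_sim))   # close to 0 after the log

c(raw = skew_raw, log = skew_log)
```

On the real data, the same helper could be applied to each ratio column and its log counterpart, for example sample_skewness(delay_normalized_log$carrier_ratio) against sample_skewness(delay_normalized_log$log_carrier_ratio).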

Converting Season from Character to Factor

delay_normalized_log$season <- as.factor(delay_normalized_log$season) # Convert season column to factor
summary(delay_normalized_log$season)

## Autumn Spring Summer Winter 
##   3246   4063   5297   3948
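
as.factor() orders the levels alphabetically, so Autumn becomes the first (reference) level. If a calendar order or a specific reference season is preferred for plots or models, the levels can be set explicitly. This is a small sketch on a toy vector; season_chr, season_alpha and season_ordered are illustrative names.

```r
season_chr <- c("Summer", "Winter", "Autumn", "Spring", "Summer")

# Default: alphabetical levels
season_alpha <- as.factor(season_chr)
levels(season_alpha)

# Explicit calendar order, with Winter as the first level
season_ordered <- factor(
  season_chr,
  levels = c("Winter", "Spring", "Summer", "Autumn")
)
levels(season_ordered)
table(season_ordered)
```

The choice of first level matters mainly for regression output, where it is the baseline that the other seasons are compared against.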

library(writexl)

# Export to Excel
write_xlsx(delay_normalized_log, "delay_normalized_log.xlsx")

Comprehensive Summary of Normalized and Log-Transformed Data

The output provides a detailed summary of the normalized and log-transformed data, including descriptive statistics, missing values, and distribution characteristics for each variable.

library(summarytools)
print(dfSummary(delay_normalized_log), method = 'render')

Data Frame Summary

delay_normalized_log

Dimensions: 16554 x 33
Duplicates: 0

Variable                           Stats / Values                        Freqs (% of Valid)      Valid            Missing

year [integer]                     Mean (sd) : 2018.6 (3.1)              2013 : 660 (4.0%)       16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 2013 ≤ 2019 ≤ 2023   2014 : 1389 (8.4%)
                                   IQR (CV) : 5 (0)                      2015 : 1533 (9.3%)
                                                                         2016 : 1326 (8.0%)
                                                                         2017 : 1306 (7.9%)
                                                                         2018 : 1917 (11.6%)
                                                                         2019 : 2016 (12.2%)
                                                                         2021 : 2313 (14.0%)
                                                                         2022 : 2345 (14.2%)
                                                                         2023 : 1749 (10.6%)

month [integer]                    Mean (sd) : 6.5 (3.3)                 12 distinct values      16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 1 ≤ 7 ≤ 12
                                   IQR (CV) : 5 (0.5)

carrier [character]                1. WN                                 3381 (20.4%)            16554 (100.0%)   0 (0.0%)
                                   2. AA                                 2976 (18.0%)
                                   3. B6                                 1615 (9.8%)
                                   4. NK                                 1424 (8.6%)
                                   5. OO                                 1366 (8.3%)
                                   6. DL                                 888 (5.4%)
                                   7. MQ                                 842 (5.1%)
                                   8. OH                                 729 (4.4%)
                                   9. AS                                 656 (4.0%)
                                   10. G4                                580 (3.5%)
                                   [ 9 others ]                          2097 (12.7%)

carrier_name [character]           1. Southwest Airlines Co.             3381 (20.4%)            16554 (100.0%)   0 (0.0%)
                                   2. American Airlines Inc.             2976 (18.0%)
                                   3. JetBlue Airways                    1615 (9.8%)
                                   4. Spirit Air Lines                   1424 (8.6%)
                                   5. SkyWest Airlines Inc.              1366 (8.3%)
                                   6. Delta Air Lines Inc.               888 (5.4%)
                                   7. Envoy Air                          817 (4.9%)
                                   8. PSA Airlines Inc.                  729 (4.4%)
                                   9. Alaska Airlines Inc.               656 (4.0%)
                                   10. Allegiant Air                     580 (3.5%)
                                   [ 10 others ]                         2122 (12.8%)

airport [character]                1. LAX                                585 (3.5%)              16554 (100.0%)   0 (0.0%)
                                   2. ORD                                475 (2.9%)
                                   3. LAS                                449 (2.7%)
                                   4. ATL                                429 (2.6%)
                                   5. DFW                                397 (2.4%)
                                   6. FLL                                392 (2.4%)
                                   7. MCO                                382 (2.3%)
                                   8. DEN                                367 (2.2%)
                                   9. PHX                                353 (2.1%)
                                   10. LGA                               348 (2.1%)
                                   [ 275 others ]                        12377 (74.8%)

airport_name [character]           1. Los Angeles, CA: Los Ange          585 (3.5%)              16554 (100.0%)   0 (0.0%)
                                   2. Chicago, IL: Chicago O'Ha          475 (2.9%)
                                   3. Atlanta, GA: Hartsfield-J          429 (2.6%)
                                   4. Dallas/Fort Worth, TX: Da          397 (2.4%)
                                   5. Fort Lauderdale, FL: Fort          392 (2.4%)
                                   6. Orlando, FL: Orlando Inte          382 (2.3%)
                                   7. Denver, CO: Denver Intern          367 (2.2%)
                                   8. Phoenix, AZ: Phoenix Sky           353 (2.1%)
                                   9. New York, NY: LaGuardia            348 (2.1%)
                                   10. San Francisco, CA: San Fr         348 (2.1%)
                                   [ 286 others ]                        12478 (75.4%)

arr_flights [numeric]              Mean (sd) : 1635.5 (2329.2)           4813 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 7 ≤ 731 ≤ 21977
                                   IQR (CV) : 1807.5 (1.4)

arr_del15 [numeric]                Mean (sd) : 319.7 (418.1)             1697 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 3 ≤ 157 ≤ 4176
                                   IQR (CV) : 348 (1.3)

carrier_ct [numeric]               Mean (sd) : 93.1 (116.2)              10755 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 49.9 ≤ 1293.9
                                   IQR (CV) : 99.9 (1.2)

weather_ct [numeric]               Mean (sd) : 10 (17.9)                 3462 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 4 ≤ 266.4
                                   IQR (CV) : 8.8 (1.8)

nas_ct [numeric]                   Mean (sd) : 92.1 (138.7)              10150 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 38.7 ≤ 1884.4
                                   IQR (CV) : 96.7 (1.5)

security_ct [numeric]              Mean (sd) : 1.4 (1.7)                 844 distinct values     16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 1 ≤ 42
                                   IQR (CV) : 1.1 (1.2)

late_aircraft_ct [numeric]         Mean (sd) : 123.1 (178)               11139 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 51.5 ≤ 2069.1
                                   IQR (CV) : 132.5 (1.4)

arr_cancelled [numeric]            Mean (sd) : 30 (70.7)                 438 distinct values     16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 8 ≤ 1565
                                   IQR (CV) : 25 (2.4)

arr_diverted [numeric]             Mean (sd) : 4 (9.7)                   113 distinct values     16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 1 ≤ 197
                                   IQR (CV) : 4 (2.4)

arr_delay [numeric]                Mean (sd) : 20484.6 (30161.4)         12913 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 85 ≤ 9503 ≤ 438783
                                   IQR (CV) : 21471.8 (1.5)

carrier_delay [numeric]            Mean (sd) : 6644.9 (10382)            9238 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 1 ≤ 3139 ≤ 196944
                                   IQR (CV) : 6829.8 (1.6)

weather_delay [numeric]            Mean (sd) : 980.4 (2001.3)            3471 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 1 ≤ 324 ≤ 26446
                                   IQR (CV) : 821 (2)

nas_delay [numeric]                Mean (sd) : 4385 (7565.2)             7451 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 1 ≤ 1616.5 ≤ 112018
                                   IQR (CV) : 4405 (1.7)

security_delay [numeric]           Mean (sd) : 64.9 (109.2)              562 distinct values     16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 1 ≤ 32 ≤ 3760
                                   IQR (CV) : 59 (1.7)

late_aircraft_delay [numeric]      Mean (sd) : 8409.4 (12884.4)          9951 distinct values    16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 1 ≤ 3605 ≤ 227959
                                   IQR (CV) : 8903.5 (1.5)

season [factor]                    1. Autumn                             3246 (19.6%)            16554 (100.0%)   0 (0.0%)
                                   2. Spring                             4063 (24.5%)
                                   3. Summer                             5297 (32.0%)
                                   4. Winter                             3948 (23.8%)

delay_ratio [numeric]              Mean (sd) : -1.6 (0.4)                13455 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: -3.6 ≤ -1.5 ≤ -0.3
                                   IQR (CV) : 0.5 (-0.3)

carrier_ratio [numeric]            Mean (sd) : 0.1 (0)                   16274 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 0.1 ≤ 0.4
                                   IQR (CV) : 0 (0.5)

weather_ratio [numeric]            Mean (sd) : 0 (0)                     15428 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 0 ≤ 0.2
                                   IQR (CV) : 0 (1.1)

nas_ratio [numeric]                Mean (sd) : 0.1 (0)                   16251 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 0.1 ≤ 0.4
                                   IQR (CV) : 0.1 (0.7)

security_ratio [numeric]           Mean (sd) : 0 (0)                     13709 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 0 ≤ 0.1
                                   IQR (CV) : 0 (1.8)

late_aircraft_ratio [numeric]      Mean (sd) : 0.1 (0)                   16295 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: 0 ≤ 0.1 ≤ 0.5
                                   IQR (CV) : 0.1 (0.5)

log_carrier_ratio [numeric]        Mean (sd) : -2.8 (0.5)                16274 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: -7.6 ≤ -2.7 ≤ -0.8
                                   IQR (CV) : 0.7 (-0.2)

log_weather_ratio [numeric]        Mean (sd) : -5.3 (1.1)                15425 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: -12 ≤ -5.2 ≤ -1.9
                                   IQR (CV) : 1.4 (-0.2)

log_nas_ratio [numeric]            Mean (sd) : -3 (0.8)                  16247 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: -10.3 ≤ -2.9 ≤ -1
                                   IQR (CV) : 0.9 (-0.3)

log_security_ratio [numeric]       Mean (sd) : -6.9 (1.4)                13701 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: -12.9 ≤ -6.8 ≤ -2.1
                                   IQR (CV) : 1.8 (-0.2)

log_late_aircraft_ratio [numeric]  Mean (sd) : -2.7 (0.6)                16296 distinct values   16554 (100.0%)   0 (0.0%)
                                   min ≤ med ≤ max: -8.7 ≤ -2.6 ≤ -0.7
                                   IQR (CV) : 0.8 (-0.2)

Generated by summarytools 1.0.1 (R version 4.4.1)
2024-12-31

Proportional Breakdown of Delay Causes

This pie chart visualizes the normalized contributions of different delay reasons to overall flight delays, represented as percentages for clear comparison.

Which delay type contributes the most overall?

# Summarize normalized delay ratios (not log-transformed)
delay_ratios <- delay_normalized %>%
  summarise(
    carrier_delay = sum(carrier_ratio),
    weather_delay = sum(weather_ratio),
    nas_delay = sum(nas_ratio),
    security_delay = sum(security_ratio),
    late_aircraft_delay = sum(late_aircraft_ratio)
  ) %>%
  pivot_longer(everything(), names_to = "DelayReason", values_to = "RatioSum") %>%
  mutate(Percentage = round(RatioSum / sum(RatioSum) * 100, 1))  # Calculate percentages

# Plot pie chart with percentages
ggplot(delay_ratios, aes(x = "", y = RatioSum, fill = DelayReason)) +
  geom_bar(stat = "identity", width = 1, color = "white") +  # Ensure segments are distinguishable
  coord_polar("y", start = 0) +  # Convert to pie chart
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), size = 4, color = "black") +  # Add percentages
  scale_fill_brewer(palette = "Set3") +  # Use a nice color palette
  labs(title = "Proportional Breakdown of Delay Reasons",
       fill = "Delay Reason") +
  theme_void()  # Minimalist theme for pie chart

Correlation analysis of delay_ratio and the log-transformed predictors:

Which delay type is most correlated with the total delay ratio?

# Compute correlation matrix using log-transformed ratios
correlation_matrix <- delay_normalized_log %>%
  select(delay_ratio, 
         log_carrier_ratio, log_weather_ratio, log_nas_ratio, 
         log_security_ratio, log_late_aircraft_ratio) %>%
  cor(use = "complete.obs")

# Visualize correlation matrix using heatmap
library(ggcorrplot)

ggcorrplot(correlation_matrix, 
           lab = TRUE,                 # Add correlation values as text
           type = "lower",             # Show lower triangle of the matrix
           title = "Correlation Matrix: Delay Ratio vs Log-Transformed Predictors",
           lab_size = 4,               # Text size
           colors = c("blue", "white", "red"), # Diverging colors
           ggtheme = theme_minimal())

The correlation matrix shows relationships between delay causes (e.g., carrier delay, late aircraft delay) and the delay ratio. Key observation:
The delay ratio has the strongest correlation with log_carrier_ratio (0.69) and log_late_aircraft_ratio (0.69).
Other delay factors like weather and NAS show weaker correlations with the delay ratio.

Starting the Model: Multiple Linear Regression

mod1<- lm(delay_ratio ~ log_late_aircraft_ratio, 
                 data = delay_normalized_log)

summary(mod1)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio, data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14723 -0.19028 -0.01037  0.17407  2.53492 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.373306   0.009966  -37.46   <2e-16 ***
## log_late_aircraft_ratio  0.441547   0.003597  122.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2879 on 16552 degrees of freedom
## Multiple R-squared:  0.4766, Adjusted R-squared:  0.4765 
## F-statistic: 1.507e+04 on 1 and 16552 DF,  p-value: < 2.2e-16

This linear regression shows a significant relationship between delay_ratio and log_late_aircraft_ratio. The intercept, -0.3733, is the predicted delay_ratio when log_late_aircraft_ratio is zero; that value lies outside the observed range of the predictor (maximum -0.7), so the intercept is an extrapolation rather than a quantity of direct interest. The coefficient for log_late_aircraft_ratio is 0.4415, meaning that a one-unit increase in the log of the late aircraft ratio is associated with an average increase of 0.4415 in delay_ratio. Both the intercept and the coefficient are highly significant, with p-values below 2e-16. The model explains about 47.66% of the variation in delay_ratio, with a residual standard error of 0.2879, and the F-statistic (1.507e+04 on 1 and 16552 degrees of freedom) confirms that the model as a whole is statistically significant.
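As a quick worked example of how the fitted line is read, the sketch below plugs the median of log_late_aircraft_ratio into the reported equation. The coefficients are copied from the output above, and the median (-2.6) is the rounded value from the data summary, so the result is only approximate.

```r
# Coefficients copied from the summary(mod1) output
intercept <- -0.373306
slope     <-  0.441547

# Median of log_late_aircraft_ratio as reported (rounded) in the data summary
x_median <- -2.6

# Predicted delay_ratio at the median of the predictor
predicted_delay_ratio <- intercept + slope * x_median
print(round(predicted_delay_ratio, 3))
```

The prediction comes out close to the reported median of delay_ratio (-1.5), which is what one would expect from a line fitted through the bulk of the data.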

Adding Carrier Delays

mod2<-update(mod1,.~.+log_carrier_ratio)

summary(mod2)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio, 
##     data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81294 -0.14725 -0.02725  0.11628  1.79444 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.339674   0.010520   32.29   <2e-16 ***
## log_late_aircraft_ratio 0.313927   0.003092  101.52   <2e-16 ***
## log_carrier_ratio       0.382962   0.003776  101.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2261 on 16551 degrees of freedom
## Multiple R-squared:  0.6772, Adjusted R-squared:  0.6771 
## F-statistic: 1.736e+04 on 2 and 16551 DF,  p-value: < 2.2e-16

Adding the second predictor, log_carrier_ratio, improves the model’s ability to explain delay_ratio. Both predictors, log_late_aircraft_ratio and log_carrier_ratio, are statistically significant (p < 2e-16). The model now explains about 67.72% of the variation in the delay ratio, up from 47.66% in the previous model, and the residual standard error drops from 0.2879 to 0.2261, indicating a better fit. The overall model is highly significant, and including both predictors gives more accurate predictions of the delay ratio.

Adding NAS Delays

mod3 <- update(mod2, . ~ . + log_nas_ratio)

summary(mod3)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio + 
##     log_nas_ratio, data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32838 -0.08392 -0.02596  0.04780  1.65621 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.876101   0.007477   117.2   <2e-16 ***
## log_late_aircraft_ratio 0.259288   0.001987   130.5   <2e-16 ***
## log_carrier_ratio       0.392074   0.002390   164.1   <2e-16 ***
## log_nas_ratio           0.218601   0.001388   157.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.143 on 16550 degrees of freedom
## Multiple R-squared:  0.8708, Adjusted R-squared:  0.8708 
## F-statistic: 3.718e+04 on 3 and 16550 DF,  p-value: < 2.2e-16

Adding log_nas_ratio improves the model considerably. All three predictors (log_late_aircraft_ratio, log_carrier_ratio, and log_nas_ratio) are statistically significant. The model now explains 87.08% of the variation in the delay ratio, up from 67.72%, and the residual standard error falls from 0.2261 to 0.143, showing a better fit. The high F-statistic confirms that the model as a whole is significant, and log_nas_ratio clearly helps in predicting the delay ratio more accurately.

Adding Weather Delays

mod4 <- update(mod3, . ~ . + log_weather_ratio)

summary(mod4)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio + 
##     log_nas_ratio + log_weather_ratio, data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30600 -0.07894 -0.02720  0.04387  1.46256 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.0413970  0.0076453  136.21   <2e-16 ***
## log_late_aircraft_ratio 0.2612920  0.0018448  141.64   <2e-16 ***
## log_carrier_ratio       0.3699147  0.0022595  163.72   <2e-16 ***
## log_nas_ratio           0.2064721  0.0013098  157.64   <2e-16 ***
## log_weather_ratio       0.0488465  0.0009475   51.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1328 on 16549 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8887 
## F-statistic: 3.303e+04 on 4 and 16549 DF,  p-value: < 2.2e-16

Adding log_weather_ratio improves the model modestly. All four predictors (log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, and log_weather_ratio) are statistically significant. The model now explains 88.87% of the variation in the delay ratio, up from 87.08%, and the residual standard error falls from 0.143 to 0.1328. The gain is much smaller than in the previous steps, and the weather coefficient (0.0488) is the smallest of the four, but the significant coefficient indicates that weather is also associated with flight delays.

Adding Security Delays

mod5 <- update(mod4, . ~ . + log_security_ratio)
summary(mod5)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio + 
##     log_nas_ratio + log_weather_ratio + log_security_ratio, data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27554 -0.07729 -0.02739  0.04317  1.41816 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.1143859  0.0083421  133.59   <2e-16 ***
## log_late_aircraft_ratio 0.2681970  0.0018524  144.78   <2e-16 ***
## log_carrier_ratio       0.3570505  0.0023172  154.09   <2e-16 ***
## log_nas_ratio           0.2051379  0.0012950  158.40   <2e-16 ***
## log_weather_ratio       0.0454750  0.0009499   47.87   <2e-16 ***
## log_security_ratio      0.0162922  0.0007921   20.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1311 on 16548 degrees of freedom
## Multiple R-squared:  0.8915, Adjusted R-squared:  0.8914 
## F-statistic: 2.718e+04 on 5 and 16548 DF,  p-value: < 2.2e-16

Adding log_security_ratio improves the prediction of delay_ratio slightly. All five predictors (log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, log_weather_ratio, and log_security_ratio) are statistically significant, with p-values below 2e-16. The model now explains 89.15% of the variation in the delay ratio, up from 88.87%, and the residual standard error decreases from 0.1328 to 0.1311. The high F-statistic (27,180) confirms the model’s significance, and the significant coefficient for log_security_ratio indicates that security factors also contribute to flight delays, although their coefficient (0.0163) is the smallest of the five.

Season

In analyzing flight delays, it is important not to overlook the potential impact of season on the delay_ratio. Previous observations have shown that the summer season experiences the highest number of delays. Therefore, adding the season variable to the model allows us to better account for seasonal effects that may influence delays, providing a more comprehensive understanding of the factors at play.

mod6 <- update(mod5, . ~ . +season)
summary(mod6)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio + 
##     log_nas_ratio + log_weather_ratio + log_security_ratio + 
##     season, data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28620 -0.07713 -0.02650  0.04340  1.39549 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.0582730  0.0097867 108.134  < 2e-16 ***
## log_late_aircraft_ratio 0.2649817  0.0018753 141.302  < 2e-16 ***
## log_carrier_ratio       0.3537572  0.0023268 152.039  < 2e-16 ***
## log_nas_ratio           0.2043026  0.0012920 158.131  < 2e-16 ***
## log_weather_ratio       0.0430712  0.0009861  43.678  < 2e-16 ***
## log_security_ratio      0.0163113  0.0007890  20.674  < 2e-16 ***
## seasonSpring            0.0197835  0.0031208   6.339 2.37e-10 ***
## seasonSummer            0.0329753  0.0031943  10.323  < 2e-16 ***
## seasonWinter            0.0329753  0.0031699  10.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1306 on 16545 degrees of freedom
## Multiple R-squared:  0.8923, Adjusted R-squared:  0.8923 
## F-statistic: 1.714e+04 on 8 and 16545 DF,  p-value: < 2.2e-16

With season included, all predictors (log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, log_weather_ratio, and log_security_ratio) remain statistically significant. The seasonal terms seasonSpring, seasonSummer, and seasonWinter are also significant; each is measured relative to the baseline season, Autumn, with Summer and Winter showing the largest increases. The model explains 89.23% of the variation in the delay ratio, up from 89.15%, and the residual standard error is reduced from 0.1311 to 0.1306. The improvement in fit is therefore small, but the F-statistic confirms the model’s overall significance and the seasonal terms are individually significant.
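To see how much each step contributes, the fit statistics reported in the summaries above can be collected in one table. This is only a bookkeeping sketch: the numbers are typed in from the printed outputs (rounded as shown there), and the column names are illustrative.

```r
# R-squared and residual standard error copied from summary(mod1) ... summary(mod6)
fit_summary <- data.frame(
  model     = c("mod1", "mod2", "mod3", "mod4", "mod5", "mod6"),
  added     = c("log_late_aircraft_ratio", "log_carrier_ratio", "log_nas_ratio",
                "log_weather_ratio", "log_security_ratio", "season"),
  r_squared = c(0.4766, 0.6772, 0.8708, 0.8887, 0.8915, 0.8923),
  rse       = c(0.2879, 0.2261, 0.1430, 0.1328, 0.1311, 0.1306)
)

# Gain in R-squared relative to the previous model (NA for the first model)
fit_summary$delta_r2 <- c(NA, diff(fit_summary$r_squared))

print(fit_summary)
```

The table makes it clear that carrier and NAS delays account for most of the gain after the first predictor, while weather, security, and season each add less than two percentage points.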

Interaction Between Weather and Season

mod7 <- update(mod5, . ~ . +log_weather_ratio*season)
summary(mod7)

## 
## Call:
## lm(formula = delay_ratio ~ log_late_aircraft_ratio + log_carrier_ratio + 
##     log_nas_ratio + log_weather_ratio + log_security_ratio + 
##     season + log_weather_ratio:season, data = delay_normalized_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28241 -0.07708 -0.02618  0.04353  1.39795 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.0640856  0.0141812  75.035  < 2e-16 ***
## log_late_aircraft_ratio         0.2650177  0.0018755 141.304  < 2e-16 ***
## log_carrier_ratio               0.3539913  0.0023285 152.023  < 2e-16 ***
## log_nas_ratio                   0.2045027  0.0012944 157.994  < 2e-16 ***
## log_weather_ratio               0.0437989  0.0020240  21.640  < 2e-16 ***
## log_security_ratio              0.0163332  0.0007892  20.695  < 2e-16 ***
## seasonSpring                    0.0157279  0.0157027   1.002  0.31655    
## seasonSummer                    0.0110818  0.0150196   0.738  0.46063    
## seasonWinter                    0.0437708  0.0156687   2.794  0.00522 ** 
## log_weather_ratio:seasonSpring -0.0006832  0.0027381  -0.250  0.80296    
## log_weather_ratio:seasonSummer -0.0043900  0.0027479  -1.598  0.11015    
## log_weather_ratio:seasonWinter  0.0020923  0.0027462   0.762  0.44615    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1306 on 16542 degrees of freedom
## Multiple R-squared:  0.8924, Adjusted R-squared:  0.8923 
## F-statistic: 1.247e+04 on 11 and 16542 DF,  p-value: < 2.2e-16

In this model, we check whether the effect of weather on delay_ratio differs by season. None of the three interaction terms between log_weather_ratio and season (Spring, Summer, Winter) is statistically significant (p-values of 0.80, 0.11, and 0.45). Once the interaction is included, the season main effects describe the seasonal difference at log_weather_ratio = 0, which lies outside the observed range of that predictor (maximum -1.9), so their loss of significance should not be read as evidence that season is unimportant. The fit is essentially unchanged: R-squared moves from 0.8923 to 0.8924 and the residual standard error stays at 0.1306.

Given this, it is better to drop the interaction terms and keep season as a standalone variable in the model. The interaction adds three parameters without adding meaningful information, so we keep the simpler model with season alone (mod6), which explains the delays just as well without introducing unnecessary complexity.
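A joint test of all interaction terms is a more direct way to make this decision than reading individual p-values. The sketch below illustrates the idea on simulated data (the data frame sim and its columns are invented for illustration and are not the flight data): an additive model and an interaction model are compared with a nested F-test via anova().

```r
set.seed(42)
n <- 400

# Simulated data: one numeric predictor and a four-level season factor
sim <- data.frame(
  log_weather = rnorm(n, mean = -5, sd = 1),
  season = factor(sample(c("Autumn", "Spring", "Summer", "Winter"), n, replace = TRUE))
)

# Response with additive season shifts and no true interaction
season_shift <- c(Autumn = 0, Spring = 0.02, Summer = 0.03, Winter = 0.03)
sim$y <- 1 + 0.05 * sim$log_weather +
  unname(season_shift[as.character(sim$season)]) +
  rnorm(n, sd = 0.1)

fit_additive    <- lm(y ~ log_weather + season, data = sim)
fit_interaction <- lm(y ~ log_weather * season, data = sim)

# Nested F-test: do the three interaction terms jointly improve the fit?
comparison <- anova(fit_additive, fit_interaction)
print(comparison)
```

On the models in this analysis, the equivalent call would be anova(mod6, mod7), since mod6 is nested within mod7.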

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

residualPlots(mod6)

##                         Test stat Pr(>|Test stat|)    
## log_late_aircraft_ratio    66.677        < 2.2e-16 ***
## log_carrier_ratio          37.595        < 2.2e-16 ***
## log_nas_ratio             102.995        < 2.2e-16 ***
## log_weather_ratio          34.258        < 2.2e-16 ***
## log_security_ratio         14.780        < 2.2e-16 ***
## season                                                
## Tukey test                 32.835        < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

qqPlot(residuals(mod6))

## [1] 2376 3078

The QQ plot (Quantile-Quantile plot) compares the quantiles of the residuals with those of a normal distribution. If the residuals were normally distributed, the points would lie along a straight diagonal line. Points that deviate significantly from this line suggest that the residuals do not follow a normal distribution, which could indicate skewness or heavy tails. Given the curvature in the residual plots and the departures in the QQ plot, the normality assumption for the residuals may not hold. Part of this departure may come from non-linearity that the linear model misses, which is another reason to consider a GAM.

Residuals

The residual plots show a clear curvature pattern, and the curvature tests reported by residualPlots(mod6) are significant for every numeric predictor (p < 2.2e-16), indicating that the relationship between the predictors and the response is not entirely linear. This suggests that the current linear model does not fully capture the underlying structure of the data. To address this, we can consider fitting a Generalized Additive Model (GAM), which allows for more flexible relationships between the predictors and the response by using smooth functions. This approach can help model the non-linear trends observed in the residual plots.
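The idea behind a curvature check can be illustrated on simulated data: if the true relationship is curved, the residuals of a straight-line fit show a systematic pattern, and adding a squared term improves the fit significantly. The variables below are invented for illustration; this is a sketch of the principle, not a reproduction of the tests that car::residualPlots runs.

```r
set.seed(1)

# Simulated predictor and a response with a curved (quadratic) relationship
x <- runif(300, min = -3, max = 0)
y <- 0.5 * x + 0.3 * x^2 + rnorm(300, sd = 0.1)

fit_linear <- lm(y ~ x)
fit_quad   <- lm(y ~ x + I(x^2))

# Residuals of the straight-line fit against fitted values: curvature is visible
plot(fitted(fit_linear), residuals(fit_linear),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# F-test for the added squared term; a small p-value signals curvature
curvature_p <- anova(fit_linear, fit_quad)$`Pr(>F)`[2]
print(curvature_p)
```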

Generalized Additive Model (GAM)

A Generalized Additive Model (GAM) extends traditional linear models by allowing non-linear relationships between the predictors and the outcome. The general formula for a GAM is:

                        y = β₀ + f₁(x₁) + f₂(x₂) + ... + fₚ(xₚ) + ε

Where:

y is the dependent variable (for example, delay ratio),
x₁, x₂, …, xₚ are the predictors (such as log_late_aircraft_ratio, log_carrier_ratio, etc.),
f₁, f₂, …, fₚ are smooth functions applied to each predictor, allowing them to have a non-linear impact on y,
β₀ is the intercept, and
ε is the error term.

In R, we can use the mgcv package to fit a GAM, where the smooth functions fₖ are typically estimated using splines. This flexibility allows the model to capture complex, non-linear relationships between the predictors and the response variable.
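Before fitting the full model, a minimal sketch on simulated data shows what the smooth term buys. The data (sim, x, y) are invented for illustration: the response follows a sine curve, which a straight line cannot capture, and the GAM is compared with a linear model by AIC.

```r
library(mgcv)

set.seed(123)
n <- 500

# Simulated data with a clearly non-linear relationship
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

fit_lm  <- lm(y ~ x, data = sim)       # straight-line fit
fit_gam <- gam(y ~ s(x), data = sim)   # smooth function of x, estimated with a spline

# Effective degrees of freedom of the smooth; a value near 1 would mean a straight line
smooth_edf <- summary(fit_gam)$edf
print(smooth_edf)

# Lower AIC indicates the better trade-off between fit and complexity
print(AIC(fit_lm, fit_gam))
```

The same comparison (AIC, or the residual standard error) can be applied to a linear model and a GAM fitted to the same response.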

GAM

library(mgcv)

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:dplyr':
## 
##     collapse

## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

library(nlme)



mod_gam <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
                 s(log_nas_ratio) +  s(log_weather_ratio) + s(log_security_ratio) 
                   + season, data = delay_normalized_log)


# Summary of the GAM model
summary(mod_gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
##     s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) + 
##     season
## 
## Parametric coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  -1.570575   0.001288 -1219.784  < 2e-16 ***
## seasonSpring  0.004973   0.001672     2.975  0.00293 ** 
## seasonSummer  0.004620   0.001723     2.681  0.00734 ** 
## seasonWinter  0.009964   0.001697     5.872 4.38e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                              edf Ref.df       F p-value    
## s(log_late_aircraft_ratio) 8.578  8.949 10037.3  <2e-16 ***
## s(log_carrier_ratio)       7.716  8.582  7455.0  <2e-16 ***
## s(log_nas_ratio)           8.850  8.993 13080.6  <2e-16 ***
## s(log_weather_ratio)       8.105  8.794   543.1  <2e-16 ***
## s(log_security_ratio)      7.269  8.312   117.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.97   Deviance explained =   97%
## GCV = 0.0048334  Scale est. = 0.0048204  n = 16554

The results from the Generalized Additive Model (GAM) suggest a strong fit for predicting delay_ratio with significant non-linear relationships between the predictors and the outcome. Here’s a detailed summary of the model’s findings:

1- Parametric Coefficients:

The intercept is highly significant with a very small p-value (<2e-16), indicating that the baseline level of the delay ratio is significantly different from zero.
Season variables (Spring, Summer, and Winter) are also significant relative to the baseline season, Autumn. Winter shows the strongest effect, with the largest t-value (5.872) and the smallest p-value (4.38e-09), although the estimated shift (0.00996) is small in absolute terms. Spring and Summer are also significant (p = 0.00293 and p = 0.00734), with smaller effects.

2- Smooth Terms:

The smooth terms for each predictor (log_late_aircraft_ratio, log_carrier_ratio, log_nas_ratio, log_weather_ratio, and log_security_ratio) have edf (estimated degrees of freedom) between 7.3 and 8.9. An edf of 1 corresponds to a straight line, so values this far above 1 indicate clearly non-linear relationships between these predictors and the outcome. The edf values are also close to their Ref.df and to the maximum allowed by the basis dimension (9 under the mgcv default of k = 10), so it is worth checking with gam.check() whether a larger basis is needed.
The F-statistics for all smooth terms are extremely high, and the p-values are all less than 2e-16, so each smooth term is statistically significant. Together with the edf values, this indicates that each predictor has a strong, non-linear association with the delay ratio.

3- Model Fit:

The Adjusted R-squared value is 0.97, which indicates that the model explains 97% of the variance in the delay ratio, up from 89.23% for the linear model mod6. This is an excellent fit, showing that the model accounts for almost all the variability in the outcome.
The Deviance explained is also 97%, which further supports the idea that the model fits the data very well. The Generalized Cross Validation (GCV) score is 0.0048334. GCV is on the scale of the response and is mainly useful for comparing candidate models fitted to the same data, so a low value on its own does not show that the model generalizes well or avoids overfitting.

In summary, this GAM provides a strong, well-fitting model with both linear and non-linear relationships, and it effectively captures the seasonal effects on delays. The model’s performance metrics, including R-squared and Deviance explained, demonstrate that it explains the variation in the delay ratio very effectively.
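The improvement over the linear model can also be expressed on the scale of the residuals. For a Gaussian GAM, the scale estimate is the estimated residual variance, so its square root is comparable to the residual standard error of an lm fit. The sketch below uses only the two numbers printed in the summaries above.

```r
# Values copied from the model summaries
gam_scale_est <- 0.0048204   # "Scale est." from summary(mod_gam)
lm_rse        <- 0.1306      # residual standard error from summary(mod6)

# Residual standard deviation implied by the GAM scale estimate
gam_rse <- sqrt(gam_scale_est)

# Ratio below 1 means the GAM residuals are smaller on average
rse_ratio <- gam_rse / lm_rse

print(round(c(gam_rse = gam_rse, lm_rse = lm_rse, ratio = rse_ratio), 4))
```

The GAM's residual standard deviation is roughly half that of the linear model, consistent with the jump in explained variance.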

plot(mod_gam,shade = TRUE, shade.col = "lightblue")

1- Plot for log_late_aircraft_ratio:

This smooth function shows a positive relationship between log_late_aircraft_ratio and the delay_ratio. As the log of the late aircraft ratio increases, the delay ratio also increases, particularly after the value reaches around -5. The plot shows a gradual increase with a smooth curve, suggesting that the effect of log_late_aircraft_ratio is non-linear.

2- Plot for log_carrier_ratio:

This smooth function indicates a positive, non-linear relationship between log_carrier_ratio and delay_ratio. Similar to log_late_aircraft_ratio, the delay ratio increases as the log_carrier_ratio rises. The curve becomes steeper at higher values, showing a stronger effect as the carrier ratio increases.

3- Plot for log_nas_ratio:

This plot shows a positive relationship between log_nas_ratio and delay_ratio. As log_nas_ratio increases, the delay ratio increases as well. The curve is quite steep for lower values of log_nas_ratio and then flattens out slightly at higher values.

4- Plot for log_weather_ratio:

The smooth function for log_weather_ratio shows that weather ratio has a slight positive effect on delay_ratio. However, the relationship is almost flat, indicating that changes in weather ratio do not have a strong influence on delay ratio.

5- Plot for log_security_ratio:

The log_security_ratio also shows a very small positive effect on delay_ratio. The curve is almost flat, meaning that security ratio has a minimal non-linear effect on the delay ratio, with only small increases in the delay ratio as the security ratio rises.

Each plot represents the non-linear relationship between the respective predictor and the delay ratio, where the shaded areas indicate confidence intervals around the smooth terms. The smooth curves highlight how each predictor affects the response variable without assuming a linear relationship.

qqPlot(residuals(mod_gam))

## [1]  6046 11891

This is a QQ plot for the residuals of the GAM model. The black points show the observed quantiles of the residuals, and the blue line shows the quantiles expected under a normal distribution.

The residuals follow the expected pattern in the middle range, so most of them are approximately normally distributed. At both extremes, however, the points depart from the blue line, which suggests the presence of outliers or extreme values in the data.

To investigate further and identify the outliers in the data, I examined the residuals using different thresholds. These thresholds represent different quantiles of the absolute residuals, and any data point with an absolute residual greater than the threshold is considered an outlier.

Here is the R code I used to detect outliers based on different thresholds:

# Extract residuals from the GAM model (named gam_resid to avoid masking residuals())
gam_resid <- residuals(mod_gam)

# Quantile thresholds to try when flagging outliers
thresholds <- c(0.85, 0.88, 0.90, 0.92, 0.95, 0.98, 0.99, 0.995)

# For each threshold, flag rows whose absolute residual exceeds that quantile
outlier_sets <- lapply(thresholds, function(t) {
  which(abs(gam_resid) > quantile(abs(gam_resid), t))
})

# Compare the number of outliers identified at each threshold
sapply(outlier_sets, length)

## [1] 2483 1987 1656 1325  828  332  166   83

The results showed the following number of outliers at each threshold:

At the 85th percentile (0.85), 2,483 outliers were detected.
At the 88th percentile (0.88), 1,987 outliers were detected.
At the 90th percentile (0.90), 1,656 outliers were detected.
At the 92nd percentile (0.92), 1,325 outliers were detected.
At the 95th percentile (0.95), 828 outliers were detected.
At the 98th percentile (0.98), 332 outliers were detected.
At the 99th percentile (0.99), 166 outliers were detected.
At the 99.5th percentile (0.995), 83 outliers were detected.

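The number of flagged points decreases as the threshold becomes stricter, with only 83 data points flagged at the 99.5% threshold. Because each threshold is a quantile of the absolute residuals, it flags roughly (1 - threshold) × 16,554 observations by construction, so these counts reflect the chosen cutoff rather than a natural break in the data.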

By examining the outliers at various thresholds, we can decide how to handle them in the model, either by removing, transforming, or keeping them, depending on their impact on the analysis.
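The counts above can be reproduced from the sample size alone, which shows that they are determined by the quantile cutoff. The sketch assumes R's default quantile definition (type 7) and no ties among the absolute residuals near each cutoff; the reported counts are copied from the output above.

```r
n <- 16554
thresholds <- c(0.85, 0.88, 0.90, 0.92, 0.95, 0.98, 0.99, 0.995)
reported   <- c(2483, 1987, 1656, 1325, 828, 332, 166, 83)

# Under type 7, the p-quantile sits at fractional position (n - 1) * p + 1
# in the sorted data; every value ranked above that position exceeds it
position <- (n - 1) * thresholds + 1
expected <- n - floor(position)

print(data.frame(thresholds, reported, expected))
```

Since the expected and reported counts agree, the choice of threshold, not the data, fixes how many points are labelled as outliers under this rule.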

# Extract residuals (named gam_resid to avoid masking residuals())
gam_resid <- residuals(mod_gam)

# Flag rows whose absolute residual is in the top 15%
outliers <- which(abs(gam_resid) > quantile(abs(gam_resid), 0.85))
data_outliers <- delay_normalized_log[outliers, ]

# View these outliers
#print(data_outliers)

# Load writexl library
library(writexl)

# Save the flagged rows to an Excel file
write_xlsx(data_outliers, path = "data_outliers.xlsx")

I decided to check the top 15% of residuals to find the extreme values that could be affecting the model. I selected the observations whose absolute residuals were above the 85th percentile (the largest 15% in absolute value, covering both large positive and large negative residuals). Then, I saved these outliers in a new dataset and exported them to an Excel file for further review. This way, I can examine the outliers and decide if they should be removed or handled differently in the analysis.

table(data_outliers$season)

## 
## Autumn Spring Summer Winter 
##    553    577    792    561

The table shows the distribution of outliers across different seasons:

Autumn: 553 outliers
Spring: 577 outliers
Summer: 792 outliers
Winter: 561 outliers

Summer has the highest number of outliers (792), but Summer also has the most observations in the data (5,297 of 16,554, or 32.0%), and it accounts for a nearly identical share of the outliers (792 of 2,483, or 31.9%). Relative to the number of observations in each season, the outlier rate is about 15.0% in Summer, 14.2% in Spring and Winter, and 17.0% in Autumn, so Autumn, not Summer, is the most over-represented season. The raw counts therefore do not show that summer delays are more extreme. Comparing rates rather than counts gives a fairer picture of seasonal variation in the model’s large residuals, and investigating why Autumn has a higher rate could provide useful insights for improving the model.
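Raw outlier counts depend on how many observations each season contributes, so it helps to put the two side by side. The sketch below uses the per-season totals from the data summary and the outlier counts from the table above; the data frame and column names are illustrative.

```r
season_counts <- data.frame(
  season   = c("Autumn", "Spring", "Summer", "Winter"),
  total    = c(3246, 4063, 5297, 3948),   # observations per season (data summary)
  outliers = c(553, 577, 792, 561)        # flagged rows per season (table above)
)

# Outlier rate within each season, and each season's share of data and of outliers
season_counts$outlier_rate      <- season_counts$outliers / season_counts$total
season_counts$share_of_data     <- season_counts$total / sum(season_counts$total)
season_counts$share_of_outliers <- season_counts$outliers / sum(season_counts$outliers)

print(season_counts, digits = 3)
```

A season is over-represented among the outliers only when its share of outliers exceeds its share of the data.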

table(data_outliers$carrier)

## 
##  9E  AA  AS  B6  DL  EV  G4  HA  MQ  NK  OH  OO  QX  UA  US  VX  WN  YV  YX 
##  26 195 133 209 195   6 148  86  98 341  57 323   8  17  38  69 397  43  94

table(data_outliers$airport)

## 
## ABE ABQ ACK ACY AEX AGS ALB AMA ANC ASE ATL ATW AUS AVL AVP AZA BDL BET BFL BHM 
##   1  19   1   1   2   1  10   3  34   8  72   1  18   3   2   7   9   1   1   3 
## BIL BIS BLI BLV BNA BOI BOS BPT BQK BQN BRW BTR BTV BUF BUR BWI BZN CAE CAK CHO 
##   1   2   2   2  22   7  38   2   1  12   4   3   1   6  19  29   3   1   4   2 
## CHS CID CLE CLT CMH COS COU CRP CVG DAL DAY DCA DEN DFW DSM DTW ECP EGE ELP EUG 
##   7   3  18  40  15   4   1   2  16  13   2  21  40  62  10  55   4   1  11   2 
## EVV EWR FAI FAR FAT FAY FCA FLG FLL FNT FSD FSM GCK GEG GFK GJT GNV GPT GRK GRR 
##   1  81   1   4   2   2   1   4  43   1   2   2   1   9   1   4   2   2   2   7 
## GSO GSP HDN HHH HNL HOU HPN HRL HVN IAD IAG IAH ICT IDA ILM IMT IND ISP ITO JAN 
##   2   3   1   1  50  11   4   3   1   9   1  27   4   3   3   1  17   6   6   4 
## JAX JFK JMS JNU KOA LAN LAS LAX LBB LBE LCK LEX LFT LGA LGB LIH LIT MAF MCI MCO 
##  18  39   1   7   7   1  41  90   2   1   2   1   4  79   6  10   7   6  11  31 
## MDT MDW MEM MFE MFR MGM MHT MIA MKE MLI MLU MOT MRY MSN MSP MSY MTJ MVY MYR OAJ 
##   1   9   7   5   4   2   1  18  11   4   1   1   3   5  44  22   1   1  17   1 
## OAK OGG OKC OMA ONT ORD ORF ORH PBI PDX PGD PHL PHX PIA PIE PIT PLN PNS PRC PSC 
##  25  36   5   3  18  75   4   2  19  38   8  31  58   1   5  12   1   9   1   3 
## PSE PSG PSM PSP PVD PVU RAP RDM RDU RFD RIC RNO ROC ROW RSW SAN SAT SAV SBN SBP 
##   4   2   1   3   7   2   1   3  23   2   7  14   1   3  15  33   7   2   3   1 
## SDF SEA SFB SFO SGF SHV SIT SJC SJT SJU SLC SMF SNA SPI SPS SRQ STL STT STX SWF 
##  12  47   5  89   5   2   1  31   2  28  60  42  18   1   1   7  16  13   4   2 
## SYR TLH TPA TRI TUL TUS TVC TXK TYR TYS USA VPS WRG XNA YAK YUM 
##   5   1  19   1  10   8   1   3   1   3   1  10   2   6   1   1

The two tables above show the distribution of outliers by carrier and by airport.

1- By Carrier:

The table lists the number of outliers for each airline. WN (Southwest Airlines) has the most outliers (397), followed by NK (Spirit Airlines) with 341 and OO (SkyWest Airlines) with 323. B6 (JetBlue), AA (American Airlines), and DL (Delta) also show sizeable counts (209, 195, and 195, respectively). Carriers that serve many airports contribute more rows to the dataset, which may partly explain their higher counts; raw counts alone do not show whether a carrier is over-represented among the outliers.

2- By Airport:

This table shows the number of outliers by airport code. For example, LAX (Los Angeles International Airport) has 90 outliers, SFO (San Francisco International Airport) has 89, and EWR (Newark Liberty International Airport) has 81, while MCO (Orlando International Airport) has 31. Some airports, like ABE (Lehigh Valley International) and ACK (Nantucket Memorial), have a single outlier each, likely because of lower flight volumes or more predictable operations. These counts suggest that larger, busier airports and airlines tend to have more outliers, possibly due to higher traffic or other operational challenges. Exploring why these specific carriers and airports have more outliers could help identify patterns or issues influencing delays.
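Raw counts favor carriers and airports that contribute many rows to the dataset. A minimal sketch of how the counts could be ranked and converted to outlier rates is shown below. It uses a small made-up data frame (`toy`, with hypothetical `carrier` and `is_outlier` columns) rather than the real data, so the numbers are purely illustrative.

```r
# Toy data: one row per carrier-airport-month record (values are made up)
toy <- data.frame(
  carrier    = c("WN", "WN", "WN", "WN", "NK", "NK", "HA", "HA", "HA", "HA"),
  is_outlier = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Raw outlier counts per carrier, largest first
outlier_counts <- sort(table(toy$carrier[toy$is_outlier]), decreasing = TRUE)
print(outlier_counts)

# Outlier rate per carrier: share of each carrier's rows that are flagged
outlier_rate <- sort(tapply(toy$is_outlier, toy$carrier, mean), decreasing = TRUE)
print(round(outlier_rate, 2))
```

In this toy example WN and NK tie on raw counts, but NK has the higher rate because every one of its rows is flagged.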

table(data_outliers$month)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 181 184 192 198 187 281 281 230 205 173 175 196

table(data_outliers$year)

## 
## 2013 2014 2015 2016 2017 2018 2019 2021 2022 2023 
##  131  223  161  186  180  213  213  560  324  292

The tables display the distribution of outliers by month and year.

1- By Month:

The number of outliers is relatively consistent throughout the year, with the highest counts observed in June and July (281 outliers each), followed by August (230). This suggests that delays tend to be more extreme during the summer months, possibly due to higher traffic and weather-related disruptions commonly experienced during this time.

2- By Year:

The distribution of outliers shows that 2021 has the highest number of outliers (560), well above 2022 (324) and 2023 (292). Note that 2013 and 2023 are partial years, since the dataset runs from August 2013 to August 2023, and that 2020 was excluded from the analysis. The 2021 peak indicates that external factors, such as operational challenges linked to ongoing industry adjustments and recovery, may have continued to influence delays that year. Overall, the data suggests that both the summer months and 2021 experienced more extreme delays, highlighting the potential impact of seasonal and operational factors on flight performance.

# Count outliers by year and month from data_outliers
outlier_time_trend <- data_outliers %>%
  group_by(year, month) %>%
  summarise(outlier_count = n())

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

# Sort the outlier counts by descending order
outlier_time_trend_sorted <- outlier_time_trend %>%
  arrange(desc(outlier_count))

# Display the sorted data
cat("The outlier counts sorted by month and year:\n")

## The outlier counts sorted by month and year:

print(outlier_time_trend_sorted)

## # A tibble: 109 × 3
## # Groups:   year [10]
##     year month outlier_count
##    <int> <int>         <int>
##  1  2021     7            80
##  2  2021     6            79
##  3  2023     7            67
##  4  2021     3            56
##  5  2023     6            56
##  6  2021     4            51
##  7  2021     5            44
##  8  2013    12            43
##  9  2021     2            43
## 10  2021    10            43
## # ℹ 99 more rows

# Print the month and year with the highest number of outliers
top_outlier <- outlier_time_trend_sorted[1, ]
cat("\nThe highest number of outliers occurred in:\n")

## 
## The highest number of outliers occurred in:

print(top_outlier)

## # A tibble: 1 × 3
## # Groups:   year [1]
##    year month outlier_count
##   <int> <int>         <int>
## 1  2021     7            80

2021 dominates the ranking: July and June 2021 have the highest counts (80 and 79 outliers, respectively), and 2021 accounts for seven of the ten highest month-year counts. The only other entries in the top ten are July and June 2023 (67 and 56) and December 2013 (43). The persistently high outlier counts in 2021 suggest that this year might still have been impacted by external factors, possibly the pandemic or operational challenges that persisted into 2021.
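The same year-month ranking can be produced without triggering the grouped-output message. The sketch below uses base R on a made-up `toy_outliers` data frame (hypothetical values) and also reports how many of the top-ranked month-year cells fall in each year.

```r
# Toy outlier records (made-up values); one row per flagged observation
toy_outliers <- data.frame(
  year  = c(2021, 2021, 2021, 2021, 2021, 2023, 2023, 2013),
  month = c(7, 7, 7, 6, 6, 7, 7, 12)
)

# Count rows per year-month cell, then sort by descending count
counts <- aggregate(n ~ year + month, data = transform(toy_outliers, n = 1), FUN = sum)
counts <- counts[order(-counts$n), ]
names(counts)[names(counts) == "n"] <- "outlier_count"
print(counts)

# How many of the top 3 cells belong to each year
top_cells <- head(counts, 3)
print(table(top_cells$year))
```

With dplyr loaded, `count(data_outliers, year, month, sort = TRUE, name = "outlier_count")` should give an equivalent ungrouped result in one call.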

Outlier Analysis in Flight Delay Data

# Add a flag to differentiate outliers from non-outliers
delay_normalized_log$outlier_flag <- ifelse(rownames(delay_normalized_log) %in% rownames(data_outliers), "Outlier", "Non-Outlier")

# Summary statistics for predictors (grouped by outlier_flag)
library(dplyr)
summary_table <- delay_normalized_log %>%
  group_by(outlier_flag) %>%
  summarise(
    avg_weather_ratio = mean(log_weather_ratio, na.rm = TRUE),
    avg_carrier_ratio = mean(log_carrier_ratio, na.rm = TRUE),
    avg_nas_ratio = mean(log_nas_ratio, na.rm = TRUE),
    avg_security_ratio = mean(log_security_ratio, na.rm = TRUE),
    avg_late_aircraft_ratio = mean(log_late_aircraft_ratio, na.rm = TRUE)
  )
print(summary_table)

## # A tibble: 2 × 6
##   outlier_flag avg_weather_ratio avg_carrier_ratio avg_nas_ratio
##   <chr>                    <dbl>             <dbl>         <dbl>
## 1 Non-Outlier              -5.28             -2.74         -2.98
## 2 Outlier                  -5.25             -2.88         -3.23
## # ℹ 2 more variables: avg_security_ratio <dbl>, avg_late_aircraft_ratio <dbl>

I flagged the top 15% of extreme residuals as outliers and compared them to the non-outliers to understand why there are so many. Here’s what I found:

Differences Between Outliers and Non-Outliers:

In the columns shown above, outliers have lower average log carrier and NAS ratios than non-outliers (-2.88 vs. -2.74 and -3.23 vs. -2.98), so the extreme residuals are not associated with larger carrier or NAS delay shares.
The average log weather ratio is almost identical in the two groups (-5.25 for outliers vs. -5.28 for non-outliers), which implies that weather is unlikely to be the main driver of extreme delays.
The security and late aircraft averages are truncated in the printed output, so their differences cannot be read from this table; the full tibble has to be printed to compare them.
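The printed tibble hides two of the five averages, and matching on row names can break silently if rows are reordered or row names are reset. A hedged sketch of an alternative is shown below: it flags outliers through an explicit key column and prints every column of the summary. The data frame `toy`, the key `row_id`, the vector `outlier_ids`, and all values are hypothetical.

```r
# Toy stand-in for delay_normalized_log (values are made up)
toy <- data.frame(
  row_id                  = 1:6,
  log_carrier_ratio       = c(-2.7, -2.8, -2.6, -2.9, -3.0, -2.8),
  log_late_aircraft_ratio = c(-2.5, -2.4, -2.6, -2.2, -2.1, -2.3)
)
outlier_ids <- c(4, 5)  # hypothetical keys of the rows flagged as outliers

# Flag by key rather than by row name
toy$outlier_flag <- ifelse(toy$row_id %in% outlier_ids, "Outlier", "Non-Outlier")

# Group means for every predictor; a plain data frame prints all columns
summary_all <- aggregate(
  cbind(log_carrier_ratio, log_late_aircraft_ratio) ~ outlier_flag,
  data = toy, FUN = mean
)
print(summary_all)
```

For the original tibble, `print(summary_table, width = Inf)` also displays all columns.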

To compare the results of the GAM model with and without the outliers, I first created two models: one with only outliers and one with the data excluding the outliers. By doing this, I can analyze how the outliers influence the model performance and the relationships between the predictors and the delay ratio.

Model with Outliers

First, I fitted the GAM on the outliers alone. The model was trained on the data_outliers dataset, which contains only the observations flagged as outliers (the top 15% of residuals). Here is the summary of the model:

library(mgcv)

# Rebuild the GAM model with outliers

mod_gam_outlier <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
                 s(log_nas_ratio) +  s(log_weather_ratio) + s(log_security_ratio) 
                   + season,data = data_outliers)

# Summary of the new model
summary(mod_gam_outlier)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
##     s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) + 
##     season
## 
## Parametric coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  -1.619251   0.006964 -232.501  < 2e-16 ***
## seasonSpring -0.001726   0.009346   -0.185  0.85353    
## seasonSummer  0.012196   0.009671    1.261  0.20739    
## seasonWinter  0.030254   0.009562    3.164  0.00158 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                              edf Ref.df       F  p-value    
## s(log_late_aircraft_ratio) 7.964  8.730 531.084  < 2e-16 ***
## s(log_carrier_ratio)       7.300  8.340 465.504  < 2e-16 ***
## s(log_nas_ratio)           8.582  8.951 931.856  < 2e-16 ***
## s(log_weather_ratio)       5.761  6.984  39.321  < 2e-16 ***
## s(log_security_ratio)      4.850  5.995   5.732 6.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.943   Deviance explained = 94.4%
## GCV = 0.023599  Scale est. = 0.023233  n = 2483

This model shows the relationships between the predictors and the delay ratio when only the outliers are considered. All five smooth terms are significant, with log_nas_ratio, log_late_aircraft_ratio, and log_carrier_ratio showing by far the largest F statistics. Among the seasonal terms, only winter differs significantly from the baseline season (estimate 0.030, p = 0.0016); spring and summer do not.

Model without Outliers

Next, I built a GAM model without outliers. To do this, I filtered out the outliers from the original dataset and trained a new model using the cleaned data. The following code removes the outliers and then fits the model:

# Filter to exclude outliers
data_no_outliers <- delay_normalized_log %>%
  filter(!(rownames(delay_normalized_log) %in% rownames(data_outliers)))

# Check the size of the dataset
dim(data_no_outliers)

## [1] 14071    34

After filtering out the outliers, 14,071 rows remain. The model fitted to this cleaned data is shown below.

Its adjusted R-squared is higher (0.989 compared to 0.943 for the outlier-only model). The relationships between the predictors and the delay ratio are similar, but the seasonal effects differ: all three seasonal coefficients are significant here, whereas only winter is significant in the outlier-only model.

library(mgcv)

# Rebuild the GAM model without outliers

mod_gam_clean <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
                 s(log_nas_ratio) +  s(log_weather_ratio) + s(log_security_ratio) 
                   + season,data = data_no_outliers)

# Summary of the new model
summary(mod_gam_clean)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
##     s(log_nas_ratio) + s(log_weather_ratio) + s(log_security_ratio) + 
##     season
## 
## Parametric coefficients:
##                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  -1.5628531  0.0007111 -2197.773  < 2e-16 ***
## seasonSpring  0.0068734  0.0009179     7.489 7.38e-14 ***
## seasonSummer  0.0047597  0.0009473     5.024 5.11e-07 ***
## seasonWinter  0.0074353  0.0009302     7.993 1.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                              edf Ref.df       F p-value    
## s(log_late_aircraft_ratio) 8.820  8.991 27936.7  <2e-16 ***
## s(log_carrier_ratio)       8.713  8.975 19223.2  <2e-16 ***
## s(log_nas_ratio)           8.985  9.000 31444.8  <2e-16 ***
## s(log_weather_ratio)       8.634  8.963  1297.7  <2e-16 ***
## s(log_security_ratio)      7.605  8.534   347.5  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.989   Deviance explained = 98.9%
## GCV = 0.0012273  Scale est. = 0.0012232  n = 14071

Comparing the two models shows how the outliers affect the fit. The model without outliers has a much lower GCV (0.0012273 versus 0.023599) and a higher adjusted R-squared. Two caveats apply. First, the two GCV scores are computed on different subsets of the data, so they are not strictly comparable. Second, the outliers were defined as the observations with the largest residuals, so removing them is expected to tighten the fit; the improvement does not by itself show that the cleaned model generalizes better.

The outlier-only model still provides valuable insight into the behavior of extreme delays and how the predictors relate to the delay ratio in those cases. In conclusion, excluding outliers gives a tighter fit for typical observations, while keeping them is useful for examining the effects of extreme delays on the overall system.
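To illustrate why a tighter fit after residual-based trimming should be read with care, here is a small simulation under stated assumptions: the data are synthetic, contain no genuine outliers, and use a single hypothetical predictor `x`. Dropping the 15% of rows with the largest absolute residuals still raises the adjusted R-squared.

```r
library(mgcv)

set.seed(1)
n <- 500
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.5)   # simulated data, no real outliers
sim <- data.frame(x = x, y = y)

# Fit on all rows, then drop the 15% of rows with the largest absolute residuals
fit_full <- gam(y ~ s(x), data = sim)
abs_res  <- abs(residuals(fit_full))
keep     <- abs_res <= quantile(abs_res, 0.85)
fit_trim <- gam(y ~ s(x), data = sim[keep, ])

# Adjusted R-squared rises even though nothing was wrong with the dropped rows
r2_full <- summary(fit_full)$r.sq
r2_trim <- summary(fit_trim)$r.sq
print(c(full = r2_full, trimmed = r2_trim))
```

A fairer comparison would score both models on the same held-out rows, including rows that the trimming step would have removed.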

par(mfrow = c(1, 2))

# QQ plot for residuals of the first model, before extracting the 15% outliers (mod_gam)
# qqPlot() comes from the car package
car::qqPlot(residuals(mod_gam), main = "QQ Plot for mod_gam")

## [1]  6046 11891

# QQ plot for residuals of the second model, after extracting outliers (mod_gam_clean)
# qqPlot() comes from the car package
car::qqPlot(residuals(mod_gam_clean), main = "QQ Plot for mod_gam_clean")

## [1] 10030 12929

Left plot (mod_gam, with outliers): The QQ plot for the model fitted before the outliers were removed shows larger deviations from the blue line, particularly at the extremes. This indicates heavier tails than a normal distribution would produce, driven by the observations with the largest residuals.
Right plot (mod_gam_clean, without outliers): After excluding outliers, the residuals follow the normal reference line more closely, and the deviations at the extremes are smaller. Part of this improvement is expected, because the observations removed were precisely those with the largest residuals.

Next, we assess the predictive accuracy of the Generalized Additive Model (GAM) through cross-validation, using Root Mean Squared Error (RMSE) as the evaluation metric. The model is trained on subsets of the data (training set) and evaluated on separate unseen data (test set) to understand how well it generalizes. The code in this section runs the cross-validation on data_no_outliers; repeating it on the full dataset would allow a direct comparison of performance with and without outliers.

Cross-Validation and RMSE

To ensure the performance estimate is robust and not overly influenced by a specific data split, we perform k-fold cross-validation. The dataset is divided into k roughly equal-sized folds, and the model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, each time using a different fold as the test set. The results from all folds are then averaged to provide a more reliable estimate of model performance. Cross-validation helps detect overfitting and provides a clearer picture of how the model might perform on new, unseen data.
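As a minimal sketch of the fold assignment described above, the rows can be given shuffled fold labels so that each fold is a random subset rather than a contiguous block. The row count `n` is a hypothetical placeholder.

```r
set.seed(123)           # makes the random assignment reproducible
n <- 23                 # hypothetical number of rows
k <- 5                  # number of folds

# Repeat the labels 1..k to length n, then shuffle them across rows
folds <- sample(rep(seq_len(k), length.out = n))

# Fold sizes differ by at most one row
fold_sizes <- table(folds)
print(fold_sizes)

# Rows used for testing and training in the first iteration
test_idx  <- which(folds == 1)
train_idx <- which(folds != 1)
```

Shuffling matters when the rows are stored in a meaningful order, such as by year and month, because contiguous folds would then hold out whole time periods.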

The Root Mean Squared Error (RMSE) will be used to evaluate the model’s predictive accuracy. RMSE measures the average magnitude of the model’s prediction errors and gives a clear indication of how close the model’s predictions are to the actual values. The formula for RMSE is:

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

Where:

\(y_i\) is the actual value for observation \(i\),
\(\hat{y}_i\) is the predicted value for observation \(i\),
\(n\) is the total number of observations.

Lower RMSE values signify better model performance, indicating that the model’s predictions are closer to the actual values.
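The formula translates directly into a short helper. The function name `rmse` and the example vectors are illustrative only.

```r
# RMSE as defined above: square root of the mean squared prediction error
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)

# Errors are 0, 0 and -2, so the mean squared error is 4/3
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones of the same total size.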

Training and Testing

The training process involves fitting the model to the data, learning the relationships between the input features and the target variable. The model is then tested on data it has never seen during training, allowing us to gauge how well it performs on unseen data. Here the procedure is applied to the data without outliers; running the same loop on the entire dataset would show whether excluding outliers improves out-of-sample accuracy.

library(mgcv)

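# Set up k-fold cross-validation
set.seed(123)  # Has no effect here: the fold assignment below is deterministic
k <- 5  # Number of folds
# Folds are contiguous blocks of rows in their current order (rows are not shuffled)
folds <- cut(seq(1, nrow(data_no_outliers)), breaks = k, labels = FALSE)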

# Initialize a vector to store RMSE values for each fold
rmse_values <- numeric(k)

# Perform cross-validation
for (i in 1:k) {
  # Split data into training and test sets
  test_indices <- which(folds == i)
  test_data <- data_no_outliers[test_indices, ]
  train_data <- data_no_outliers[-test_indices, ]
  
  # Fit the GAM model on the training set
  mod_gam_cv <- gam(delay_ratio ~ s(log_late_aircraft_ratio) + s(log_carrier_ratio) + 
                 s(log_nas_ratio) +  s(log_weather_ratio) + s(log_security_ratio) 
                   + season, data = train_data)
  
  # Predict on the test set
  predictions <- predict(mod_gam_cv, newdata = test_data)
  
  # Calculate the Root Mean Squared Error (RMSE) for this fold
  rmse_values[i] <- sqrt(mean((test_data$delay_ratio - predictions)^2))
}

# Calculate the average RMSE across all folds
avg_rmse <- mean(rmse_values)
print(avg_rmse)

## [1] 0.03585187

# Residuals for the last fold only: after the loop, test_data and predictions hold fold k
residuals_cv <- test_data$delay_ratio - predictions

# Residuals vs Index
plot(residuals_cv, main = "Residuals vs Index", xlab = "Index", ylab = "Residuals", pch = 19, col = "blue")
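A residual plot that covers every fold needs the out-of-fold residuals to be collected inside the loop. The sketch below shows one way to do that on simulated data. As stated assumptions, `lm()` stands in for `gam()` to keep the example dependency-free, and the data frame `sim` and its columns are hypothetical.

```r
set.seed(42)
n <- 100
sim <- data.frame(x = runif(n))
sim$y <- 2 * sim$x + rnorm(n, sd = 0.1)   # simulated data

k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

oof_residuals <- numeric(n)   # one out-of-fold residual per row
rmse_values   <- numeric(k)

for (i in seq_len(k)) {
  test_idx <- which(folds == i)
  fit <- lm(y ~ x, data = sim[-test_idx, ])            # stand-in for gam()
  pred <- predict(fit, newdata = sim[test_idx, ])
  oof_residuals[test_idx] <- sim$y[test_idx] - pred    # store in original row order
  rmse_values[i] <- sqrt(mean((sim$y[test_idx] - pred)^2))
}

# Every row now has a residual from a model that never saw it
plot(oof_residuals, main = "Out-of-fold residuals vs Index",
     xlab = "Index", ylab = "Residuals", pch = 19, col = "blue")
```

Pooling the residuals this way makes the diagnostic plot reflect all k test sets instead of only the final one.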

Conclusion

The Residuals vs Index plot visualizes the differences between the observed and predicted values (residuals) for the final cross-validation fold. In this plot, the residuals are randomly scattered around zero, with no apparent pattern in the errors. This is a positive result: it suggests that the model has captured the main relationships in the data.

The random spread of residuals means that the model’s predictions are not consistently off in one direction, and there are no obvious trends that would suggest model misspecification.

Overall, the model predicts the held-out data well, with an average RMSE of about 0.036 across the five folds. Because the cross-validation was run on the data with the outliers removed, this result describes the model’s reliability for typical observations rather than for extreme delays.

Flight Delay Data for U.S. Airports by Carrier August 2013 - August 2023

Farzaneh Yousefi

2024-12-24
