Storm Events Database#

The purpose of this notebook is to download, assemble, and work with the Storm Events dataset. The notebook covers the following steps:

  1. Connect to the NCEI server (https://www.ncei.noaa.gov) that hosts the Storm Events CSV files.

  2. Download zipped CSV files of type “details”, “fatalities”, and “locations”.

  3. Decompress and load each of these files.

  4. Join them on the EVENT_ID column.

  5. Delete the individual CSV and GZ files.

  6. Save the combined CSV file.

  7. Work with the data

The database currently contains data from January 1950 to April 2024 and is updated monthly. Data on all event types starts in 1996, so we will only download files from 1996 and after. Full documentation can be found here: https://www.ncdc.noaa.gov/stormevents/details.jsp

While there is some overlap with SHELDUS, the storm events database offers more detailed information at finer geographic scales.

Needed Imports#

import os
import pandas as pd
import time
import glob
import requests
from bs4 import BeautifulSoup

Connect to the website and get a list of files#

url = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
ext = "csv.gz"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")
csv_files_https = [url  + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
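
Each filename encodes the file type, the data year (the "d" field), and the file's creation date (the "c" field), e.g. StormEvents_details-ftp_v1.0_d2021_c20240716.csv.gz. As a quick sanity check, here is a minimal sketch that pulls the data year out of a filename; the filtering step below relies on the same trick:

# Minimal sketch: the fourth underscore-delimited field ("d2021") holds the
# data year, so slice off the leading "d" to recover it
sample = "StormEvents_details-ftp_v1.0_d2021_c20240716.csv.gz"
year = sample.split("_")[3][1:5]
print(year)  # -> 2021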

Filter files by types and download#

To keep the example manageable, we are going to download all zipped files from 2021 to the present; set start_year below to 1996 to pull the full post-1995 record.

# Set the data directory
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

# Filter to the three file types and keep only files from start_year onward
storm_files = {}
file_types = ["details", "fatalities", "locations"]
start_year = "2021"
for file_type in file_types:
    storm_files[file_type] = pd.Series(
        [filename for filename in csv_files_https if file_type in filename]
    )
    # Filenames look like StormEvents_details-ftp_v1.0_d2021_c20240716.csv.gz,
    # so the fourth underscore-delimited field ("d2021") carries the data year
    years = storm_files[file_type].str.split("_").str[3].str[1:5]
    storm_files[file_type] = storm_files[file_type][years >= start_year]

def download_and_combine(files, prefix, delay=5):
    """Read each remote CSV, pause between requests, and save the combined file."""
    dataframes = []
    for csv_file in files:
        print(csv_file)
        # pandas reads the gzipped CSV straight from the URL
        dataframes.append(pd.read_csv(csv_file))
        time.sleep(delay)  # be polite to the server
    if dataframes:
        combined_df = pd.concat(dataframes, ignore_index=True)
        combined_df.to_csv(os.path.join(data_dir, f"{prefix}.csv"), index=False)

for file_type in file_types:
    print(file_type)
    download_and_combine(storm_files[file_type], file_type, delay=3)
details
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2024_c20240816.csv.gz
fatalities
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2024_c20240816.csv.gz
locations
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2024_c20240816.csv.gz
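
Long download sessions occasionally hit transient network errors. A minimal retry wrapper, sketched below (read_csv_with_retries is a hypothetical helper, not part of the notebook above), would make the loop more robust:

import time
import pandas as pd

def read_csv_with_retries(url, retries=3, delay=5):
    """Try to read a remote CSV a few times before giving up."""
    for attempt in range(retries):
        try:
            return pd.read_csv(url)
        except Exception as exc:  # e.g., HTTP errors or dropped connections
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(delay)
    raise RuntimeError(f"Could not download {url} after {retries} attempts")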

Load the CSVs into pandas and join using EVENT_ID#

# Step 4: Load the CSVs into pandas and join on 'EVENT_ID'
df_details = pd.read_csv(os.path.join(data_dir, "details.csv"), dtype=str)
df_fatalities = pd.read_csv(os.path.join(data_dir, "fatalities.csv"), dtype=str)
df_locations = pd.read_csv(os.path.join(data_dir, "locations.csv"), dtype=str)

merged_df = df_details.merge(df_fatalities, on="EVENT_ID", how="left")\
                      .merge(df_locations, on="EVENT_ID", how="left")
merged_df.head(5)
BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME EPISODE_ID_x EVENT_ID STATE STATE_FIPS ... YEARMONTH EPISODE_ID_y LOCATION_INDEX RANGE AZIMUTH LOCATION LATITUDE LONGITUDE LAT2 LON2
0 202112 11 349 202112 11 350 165322 999750 TENNESSEE 47 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 202112 11 249 202112 11 254 165322 999613 TENNESSEE 47 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 202112 11 325 202112 11 327 165322 999636 TENNESSEE 47 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 202112 11 232 202112 11 239 165322 999604 TENNESSEE 47 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 202112 6 724 202112 6 724 165321 999306 TENNESSEE 47 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 71 columns
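
Both merges are left joins, so events with no matching fatality or location records simply carry NaN in those columns, as the preview shows. A quick way to gauge how many rows did pick up a location record, using the LOCATION_INDEX column that comes from the locations file:

# Fraction of merged rows with a matching location record; LOCATION_INDEX
# originates in the locations file, so it is NaN where no location matched
print(merged_df['LOCATION_INDEX'].notna().mean())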

Deleting individual files#

Let’s clean up the downloaded files, since we no longer need them and they are just taking up space. Note that pandas streamed the .csv.gz files directly from the URLs, so no .gz files were ever saved locally; the .gz cleanup below is just a precaution.

# List of individual files to delete
individual_files = ["details.csv", "fatalities.csv", "locations.csv"]
gz_files = [file.replace('.csv', '.csv.gz') for file in individual_files]

# Combine both lists
files_to_delete = individual_files + gz_files

# Delete specified files
for file in files_to_delete:
    file_path = os.path.join(data_dir, file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Deleted: {file_path}")
    else:
        print(f"{file_path} not found.")

# Additionally, ensure all .gz files in the directory are deleted
gz_files_in_dir = glob.glob(os.path.join(data_dir, "*.gz"))
for gz_file in gz_files_in_dir:
    os.remove(gz_file)
    print(f"Deleted: {gz_file}")
Deleted: ./data/details.csv
Deleted: ./data/fatalities.csv
Deleted: ./data/locations.csv
./data/details.csv.gz not found.
./data/fatalities.csv.gz not found.
./data/locations.csv.gz not found.

Working with the dataset#

Let’s focus on events in the area where the survey was administered (Southern California) and look at events since the start of the year. For this we’ll use the county FIPS codes for the counties where the survey is administered.

Using FIPS codes#

The Storm Events Database has a column (CZ_FIPS) that holds the last three digits of the FIPS code. We’ll do a boolean filter using that together with the state code (STATE_FIPS), which is 06 for California, although the dataset drops the leading zero, so we match against '6'.
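
One caveat: the details file also has a CZ_TYPE column, and when CZ_TYPE is 'Z' the CZ_FIPS value refers to an NWS forecast zone rather than a county, so county matching is only exact for county-type ('C') rows. If you need standard five-digit county FIPS codes, here is a minimal sketch, assuming the all-string dtypes used above, that zero-pads and concatenates the two columns:

# Minimal sketch, assuming the all-string dtypes used above: rebuild full
# five-digit county FIPS codes by zero-padding the state and county parts
full_fips = merged_df['STATE_FIPS'].str.zfill(2) + merged_df['CZ_FIPS'].str.zfill(3)
# Only meaningful for county-type rows (CZ_TYPE == 'C')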

socal_fips_codes = ['083', '111', '037', '031', '073', '043', '053', '027', '071', '069', '065', '029', '019', '039', '047', '107', '059', '025', '079']
# Define the start and end dates as strings in YYYYMM (year/month) format
start_date = '202401'
end_date = '202404'
# Filter the DataFrame for rows between the start and end dates
# (string comparison works here because every column was loaded as a string)
filtered_df = merged_df[(merged_df['BEGIN_YEARMONTH'] >= start_date) & (merged_df['END_YEARMONTH'] <= end_date)]
# Filter the DataFrame based on the list of county FIPS codes and the state code
fips = filtered_df[(filtered_df['CZ_FIPS'].isin(socal_fips_codes)) & (filtered_df['STATE_FIPS'] == '6')]
fips.head()
fips.head()
BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME EPISODE_ID_x EVENT_ID STATE STATE_FIPS ... YEARMONTH EPISODE_ID_y LOCATION_INDEX RANGE AZIMUTH LOCATION LATITUDE LONGITUDE LAT2 LON2
261072 202401 3 606 202401 3 806 186834 1146696 CALIFORNIA 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
261672 202403 2 1738 202403 2 1738 188027 1154871 CALIFORNIA 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262197 202402 6 430 202402 6 430 187518 1151475 CALIFORNIA 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262340 202402 19 1315 202402 19 1515 187711 1152775 CALIFORNIA 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262341 202402 19 1319 202402 19 1519 187711 1152776 CALIFORNIA 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 71 columns

# What kinds of events occurred?  Group by the 'EVENT_TYPE' column
event_counts = fips.groupby('EVENT_TYPE').size().reset_index(name='Count')
event_counts
EVENT_TYPE Count
0 Flash Flood 16
1 Flood 9
2 Funnel Cloud 5
3 Hail 1
4 Thunderstorm Wind 3
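
Note that the locations join duplicated each event once per location row (the flash-flood table below shows four rows per EVENT_ID), so these counts are event-location pairs rather than distinct events. To count distinct events instead, a quick sketch:

# Drop duplicate EVENT_IDs before grouping to count distinct events
event_counts_unique = (fips.drop_duplicates(subset='EVENT_ID')
                           .groupby('EVENT_TYPE').size()
                           .reset_index(name='Count'))
event_counts_unique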

At this point, you can drill down to determine if the events in the dataset are relevant.

# Subset the DataFrame based on flash floods
fips_flash_flood = fips[fips['EVENT_TYPE'] == 'Flash Flood']
fips_flash_flood
BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME EPISODE_ID_x EVENT_ID STATE STATE_FIPS ... YEARMONTH EPISODE_ID_y LOCATION_INDEX RANGE AZIMUTH LOCATION LATITUDE LONGITUDE LAT2 LON2
276360 202402 4 1956 202402 4 2115 189405 1164907 CALIFORNIA 6 ... 202402 189405 1 1.71 W VENTURA 34.2815 -119.2999 3416890 11917994
276361 202402 4 1956 202402 4 2115 189405 1164907 CALIFORNIA 6 ... 202402 189405 2 1.66 W VENTURA 34.278 -119.2991 3416680 11917946
276362 202402 4 1956 202402 4 2115 189405 1164907 CALIFORNIA 6 ... 202402 189405 3 1.54 W VENTURA 34.278 -119.2969 3416680 11917814
276363 202402 4 1956 202402 4 2115 189405 1164907 CALIFORNIA 6 ... 202402 189405 4 1.57 W VENTURA 34.2816 -119.2974 3416896 11917844
276595 202402 5 243 202402 5 430 189405 1164909 CALIFORNIA 6 ... 202402 189405 1 10.45 SE SANTA SUSANA 34.1802 -118.5529 3410812 11833174
276596 202402 5 243 202402 5 430 189405 1164909 CALIFORNIA 6 ... 202402 189405 2 10.43 SE SANTA SUSANA 34.181 -118.5525 3410860 11833150
276597 202402 5 243 202402 5 430 189405 1164909 CALIFORNIA 6 ... 202402 189405 3 10.71 SE SANTA SUSANA 34.1806 -118.5468 3410836 11832808
276598 202402 5 243 202402 5 430 189405 1164909 CALIFORNIA 6 ... 202402 189405 4 10.73 SE SANTA SUSANA 34.18 -118.5469 3410800 11832814
276840 202402 4 1735 202402 4 1930 189405 1164922 CALIFORNIA 6 ... 202402 189405 1 1.5 NNE EL RIO 34.25 -119.16 3415000 1199600
276841 202402 4 1735 202402 4 1930 189405 1164922 CALIFORNIA 6 ... 202402 189405 2 1.5 NNE EL RIO 34.2497 -119.1588 3414982 1199528
276842 202402 4 1735 202402 4 1930 189405 1164922 CALIFORNIA 6 ... 202402 189405 3 1.52 S SATICOY 34.2581 -119.1525 3415486 1199150
276843 202402 4 1735 202402 4 1930 189405 1164922 CALIFORNIA 6 ... 202402 189405 4 1.43 S SATICOY 34.2595 -119.1535 3415570 1199210
279229 202402 4 2037 202402 4 2200 189405 1164888 CALIFORNIA 6 ... 202402 189405 1 4.0 WSW BARDSDALE 34.34 -118.98 3420400 11858800
279230 202402 4 2037 202402 4 2200 189405 1164888 CALIFORNIA 6 ... 202402 189405 2 3.78 WSW BARDSDALE 34.341 -118.9762 3420460 11858572
279231 202402 4 2037 202402 4 2200 189405 1164888 CALIFORNIA 6 ... 202402 189405 3 3.82 WSW BARDSDALE 34.3395 -118.9758 3420370 11858548
279232 202402 4 2037 202402 4 2200 189405 1164888 CALIFORNIA 6 ... 202402 189405 4 3.93 WSW BARDSDALE 34.3391 -118.9779 3420346 11858674

16 rows × 71 columns

# Let's look at injuries, deaths, property damage, and narrative related to these flash floods.
print(fips_flash_flood[['EVENT_TYPE', 'LOCATION', 'INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT', 'EVENT_NARRATIVE', 'DAMAGE_PROPERTY']])  
         EVENT_TYPE      LOCATION INJURIES_DIRECT INJURIES_INDIRECT  \
276360  Flash Flood       VENTURA               0                 0   
276361  Flash Flood       VENTURA               0                 0   
276362  Flash Flood       VENTURA               0                 0   
276363  Flash Flood       VENTURA               0                 0   
276595  Flash Flood  SANTA SUSANA               0                 0   
276596  Flash Flood  SANTA SUSANA               0                 0   
276597  Flash Flood  SANTA SUSANA               0                 0   
276598  Flash Flood  SANTA SUSANA               0                 0   
276840  Flash Flood        EL RIO               0                 0   
276841  Flash Flood        EL RIO               0                 0   
276842  Flash Flood       SATICOY               0                 0   
276843  Flash Flood       SATICOY               0                 0   
279229  Flash Flood     BARDSDALE               0                 0   
279230  Flash Flood     BARDSDALE               0                 0   
279231  Flash Flood     BARDSDALE               0                 0   
279232  Flash Flood     BARDSDALE               0                 0   

       DEATHS_DIRECT DEATHS_INDIRECT  \
276360             0               0   
276361             0               0   
276362             0               0   
276363             0               0   
276595             0               0   
276596             0               0   
276597             0               0   
276598             0               0   
276840             0               0   
276841             0               0   
276842             0               0   
276843             0               0   
279229             0               0   
279230             0               0   
279231             0               0   
279232             0               0   

                                          EVENT_NARRATIVE DAMAGE_PROPERTY  
276360  Significant flooding was reported in downtown ...           0.00K  
276361  Significant flooding was reported in downtown ...           0.00K  
276362  Significant flooding was reported in downtown ...           0.00K  
276363  Significant flooding was reported in downtown ...           0.00K  
276595  Significant flooding was reported near the com...           0.00K  
276596  Significant flooding was reported near the com...           0.00K  
276597  Significant flooding was reported near the com...           0.00K  
276598  Significant flooding was reported near the com...           0.00K  
276840  Flooding closed Vineyard Avenue between Montgo...           0.00K  
276841  Flooding closed Vineyard Avenue between Montgo...           0.00K  
276842  Flooding closed Vineyard Avenue between Montgo...           0.00K  
276843  Flooding closed Vineyard Avenue between Montgo...           0.00K  
279229  Flash flooding and debris flows were reported ...           0.00K  
279230  Flash flooding and debris flows were reported ...           0.00K  
279231  Flash flooding and debris flows were reported ...           0.00K  
279232  Flash flooding and debris flows were reported ...           0.00K  
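
The DAMAGE_PROPERTY column stores damage as strings such as '0.00K', where the suffix gives the magnitude (K = thousands, M = millions, B = billions per the documentation). A minimal parser sketch, with parse_damage as a hypothetical helper:

# Hypothetical helper: convert damage strings like '0.00K' or '1.5M' to dollars
def parse_damage(value):
    if pd.isna(value) or value == '':
        return 0.0
    multipliers = {'K': 1e3, 'M': 1e6, 'B': 1e9}
    suffix = value[-1].upper()
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    return float(value)

fips_flash_flood['DAMAGE_PROPERTY'].apply(parse_damage).sum()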

Saving the dataset#

In case we want to use it later.

# In case we want to save this dataset for later use, let's go ahead and save it.

file_name = f'stormevents_{start_year}_2024.csv'
merged_df.to_csv(file_name, index=False)
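
To pick the file back up in a later session with the same all-string dtypes used above:

# Reload the saved dataset later, keeping every column as a string
merged_df = pd.read_csv(file_name, dtype=str)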