Storm Events Database#
The purpose of this notebook is to download, assemble, and work with the Storm Events dataset. The notebook walks through the following steps:
Connect to the NCEI archive at https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/ and list the available files.
Download the zipped CSV files of type “details”, “fatalities”, and “locations”.
Decompress and load each of these files.
Join them using the EVENT_ID column.
Delete the individual CSV and .gz files.
Save the combined CSV file.
Work with the data.
The database currently contains data from January 1950 to April 2024 and is updated monthly. Data on all event types only goes back to 1996, so there is little reason to download files from before then; in this notebook we download files from 2021 onward. Full documentation can be found here: https://www.ncdc.noaa.gov/stormevents/details.jsp
While there is some overlap with SHELDUS, the Storm Events Database offers more detailed information at finer geographic scales.
Needed Imports#
import os
import pandas as pd
import time
import glob
import requests
from bs4 import BeautifulSoup
Connect to the website and get a list of files#
url = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
ext = "csv.gz"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")
csv_files_https = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
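As a quick sanity check (a minimal sketch, assuming the request succeeded and the directory layout has not changed), you can confirm how many compressed files were found:
# How many .csv.gz links did we find, and what do they look like?
print(f"Found {len(csv_files_https)} compressed CSV files")
print(csv_files_https[:3])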
Filter files by type and download#
We are going to download all zipped files from 2021 to the present.
# Set the directory for downloaded data
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

# Filter the file list: keep only the three file types we need, from start_year onward
storm_files = {}
file_types = ["details", "fatalities", "locations"]
start_year = "2021"

for file_type in file_types:
    storm_files[file_type] = pd.Series(
        [filename for filename in csv_files_https if file_type in filename]
    )
    # Filenames look like StormEvents_<type>-ftp_v1.0_dYYYY_cYYYYMMDD.csv.gz,
    # so the year sits inside the fourth underscore-separated piece, after the "d"
    years = storm_files[file_type].str.split("_").str[3].str[1:5]
    storm_files[file_type] = storm_files[file_type][years >= start_year]
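The positional split above depends on the exact filename layout. A regex that captures the four-digit year after "_d" does the same job slightly more defensively (an equivalent sketch, not a required change):
# Alternative year extraction: capture the digits after "_d" with a regex
for file_type in file_types:
    all_files = pd.Series([f for f in csv_files_https if file_type in f])
    years = all_files.str.extract(r"_d(\d{4})_")[0]
    storm_files[file_type] = all_files[years >= start_year]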
def download_and_combine(files, prefix, timeout=5):
    """Download each compressed CSV, concatenate them, and save one CSV per type."""
    dataframes = []
    for csv_file in files:
        print(csv_file)
        # pandas reads the remote .csv.gz directly and handles decompression
        dataframes.append(pd.read_csv(csv_file))
        # pause between requests so we don't hammer the NCEI server
        time.sleep(timeout)
    if dataframes:
        combined_df = pd.concat(dataframes, ignore_index=True)
        combined_df.to_csv(os.path.join(data_dir, f"{prefix}.csv"), index=False)
for file_type in file_types:
    print(file_type)
    download_and_combine(storm_files[file_type], file_type, timeout=3)
details
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2024_c20240816.csv.gz
fatalities
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2024_c20240816.csv.gz
locations
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2024_c20240816.csv.gz
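If the NCEI server drops a connection partway through, pd.read_csv will raise and the loop stops. A minimal retry wrapper is one way to harden this (a sketch with a hypothetical helper, not part of the original notebook):
# Hypothetical helper: retry a flaky download a few times before giving up
def read_csv_with_retry(url, retries=3, delay=5):
    for attempt in range(1, retries + 1):
        try:
            return pd.read_csv(url)
        except Exception as exc:
            print(f"Attempt {attempt} failed for {url}: {exc}")
            time.sleep(delay)
    raise RuntimeError(f"Could not download {url} after {retries} attempts")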
Load the CSVs into pandas and join using EVENT_ID#
# Load the combined CSVs into pandas and join them on 'EVENT_ID'.
# Reading everything as strings keeps FIPS codes and YEARMONTH columns comparable later.
df_details = pd.read_csv(os.path.join(data_dir, "details.csv"), dtype=str)
df_fatalities = pd.read_csv(os.path.join(data_dir, "fatalities.csv"), dtype=str)
df_locations = pd.read_csv(os.path.join(data_dir, "locations.csv"), dtype=str)

merged_df = df_details.merge(df_fatalities, on="EVENT_ID", how="left") \
                      .merge(df_locations, on="EVENT_ID", how="left")
merged_df.head(5)
|   | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID_x | EVENT_ID | STATE | STATE_FIPS | ... | YEARMONTH | EPISODE_ID_y | LOCATION_INDEX | RANGE | AZIMUTH | LOCATION | LATITUDE | LONGITUDE | LAT2 | LON2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202112 | 11 | 349 | 202112 | 11 | 350 | 165322 | 999750 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 202112 | 11 | 249 | 202112 | 11 | 254 | 165322 | 999613 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 202112 | 11 | 325 | 202112 | 11 | 327 | 165322 | 999636 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 202112 | 11 | 232 | 202112 | 11 | 239 | 165322 | 999604 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 202112 | 6 | 724 | 202112 | 6 | 724 | 165321 | 999306 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 71 columns
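Because a single event can have several location records (and several fatality records), these left joins can yield more than one row per EVENT_ID, as the flash-flood rows later in this notebook show. A quick way to see how much duplication the joins introduced (a small sketch):
# Count rows that repeat an EVENT_ID already seen earlier in the merged frame
dup_rows = merged_df["EVENT_ID"].duplicated().sum()
print(f"{dup_rows} rows repeat an EVENT_ID that already appeared")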
Deleting individual files#
Let’s clean up the downloaded per-type files; now that the merged DataFrame is in memory, they are just taking up space.
# List of individual files to delete
individual_files = ["details.csv", "fatalities.csv", "locations.csv"]
gz_files = [file.replace('.csv', '.csv.gz') for file in individual_files]

# Combine both lists
files_to_delete = individual_files + gz_files

# Delete specified files
for file in files_to_delete:
    file_path = os.path.join(data_dir, file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Deleted: {file_path}")
    else:
        print(f"{file_path} not found.")

# Additionally, ensure all .gz files in the directory are deleted
gz_files_in_dir = glob.glob(os.path.join(data_dir, "*.gz"))
for gz_file in gz_files_in_dir:
    os.remove(gz_file)
    print(f"Deleted: {gz_file}")
Deleted: ./data/details.csv
Deleted: ./data/fatalities.csv
Deleted: ./data/locations.csv
./data/details.csv.gz not found.
./data/fatalities.csv.gz not found.
./data/locations.csv.gz not found.
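To confirm the cleanup left only the files we expect, you can list what remains in the data directory (a small sketch):
# Show whatever is still sitting in ./data after the cleanup
print(glob.glob(os.path.join(data_dir, "*")))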
Working with the dataset#
Let’s focus on events in the area where the survey was administered (Southern California) and look at events since the start of the year. For this we’ll use county FIPS codes for the counties where the survey is administered.
Using FIPS codes#
The Storm Events Database has a column (CZ_FIPS) that stores the last three digits of the county FIPS code. We’ll do a boolean filter on that together with the state code (STATE_FIPS), which is 06 for California, although the dataset drops the leading zero and stores it as 6.
# Three-digit county FIPS codes for the Southern California survey counties
socal_fips_codes = ['083', '111', '037', '031', '073', '043', '053', '027', '071', '069', '065', '029', '019', '039', '047', '107', '059', '025', '079']

# Define the start and end dates as strings. The format is YYYYMM (year and month)
start_date = '202401'
end_date = '202404'

# Filter the DataFrame for rows between the start and end dates
filtered_df = merged_df[(merged_df['BEGIN_YEARMONTH'] >= start_date) & (merged_df['END_YEARMONTH'] <= end_date)]

# Keep the listed counties within California (STATE_FIPS is stored as '6')
fips = filtered_df[(filtered_df['CZ_FIPS'].isin(socal_fips_codes)) & (filtered_df['STATE_FIPS'] == '6')]
fips.head()
|   | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID_x | EVENT_ID | STATE | STATE_FIPS | ... | YEARMONTH | EPISODE_ID_y | LOCATION_INDEX | RANGE | AZIMUTH | LOCATION | LATITUDE | LONGITUDE | LAT2 | LON2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261072 | 202401 | 3 | 606 | 202401 | 3 | 806 | 186834 | 1146696 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 261672 | 202403 | 2 | 1738 | 202403 | 2 | 1738 | 188027 | 1154871 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262197 | 202402 | 6 | 430 | 202402 | 6 | 430 | 187518 | 1151475 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262340 | 202402 | 19 | 1315 | 202402 | 19 | 1515 | 187711 | 1152775 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262341 | 202402 | 19 | 1319 | 202402 | 19 | 1519 | 187711 | 1152776 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 71 columns
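If you later need standard five-digit county FIPS codes (for example, to merge with SHELDUS or Census data), you can zero-pad and recombine the two columns. A hedged sketch, assuming both columns were read as strings (as they were with dtype=str above) and restricting to county-type records:
# Rebuild 5-digit FIPS codes: 2-digit state + 3-digit county.
# Caution: CZ_FIPS is only a county code for county-type records (CZ_TYPE == 'C').
county_rows = fips[fips['CZ_TYPE'] == 'C']
fips_full = county_rows['STATE_FIPS'].str.zfill(2) + county_rows['CZ_FIPS'].str.zfill(3)
print(fips_full.head())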
# What kinds of events occurred? Group by the 'EVENT_TYPE' column
event_counts = fips.groupby('EVENT_TYPE').size().reset_index(name='Count')
event_counts
|   | EVENT_TYPE | Count |
|---|---|---|
| 0 | Flash Flood | 16 |
| 1 | Flood | 9 |
| 2 | Funnel Cloud | 5 |
| 3 | Hail | 1 |
| 4 | Thunderstorm Wind | 3 |
At this point, you can drill down to determine whether the events in the dataset are relevant to your analysis.
# Subset the DataFrame based on flash floods
fips_flash_flood = fips[fips['EVENT_TYPE'] == 'Flash Flood']
fips_flash_flood
|   | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID_x | EVENT_ID | STATE | STATE_FIPS | ... | YEARMONTH | EPISODE_ID_y | LOCATION_INDEX | RANGE | AZIMUTH | LOCATION | LATITUDE | LONGITUDE | LAT2 | LON2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 276360 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 1.71 | W | VENTURA | 34.2815 | -119.2999 | 3416890 | 11917994 |
| 276361 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 1.66 | W | VENTURA | 34.278 | -119.2991 | 3416680 | 11917946 |
| 276362 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 1.54 | W | VENTURA | 34.278 | -119.2969 | 3416680 | 11917814 |
| 276363 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 1.57 | W | VENTURA | 34.2816 | -119.2974 | 3416896 | 11917844 |
| 276595 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 10.45 | SE | SANTA SUSANA | 34.1802 | -118.5529 | 3410812 | 11833174 |
| 276596 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 10.43 | SE | SANTA SUSANA | 34.181 | -118.5525 | 3410860 | 11833150 |
| 276597 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 10.71 | SE | SANTA SUSANA | 34.1806 | -118.5468 | 3410836 | 11832808 |
| 276598 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 10.73 | SE | SANTA SUSANA | 34.18 | -118.5469 | 3410800 | 11832814 |
| 276840 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 1.5 | NNE | EL RIO | 34.25 | -119.16 | 3415000 | 1199600 |
| 276841 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 1.5 | NNE | EL RIO | 34.2497 | -119.1588 | 3414982 | 1199528 |
| 276842 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 1.52 | S | SATICOY | 34.2581 | -119.1525 | 3415486 | 1199150 |
| 276843 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 1.43 | S | SATICOY | 34.2595 | -119.1535 | 3415570 | 1199210 |
| 279229 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 4.0 | WSW | BARDSDALE | 34.34 | -118.98 | 3420400 | 11858800 |
| 279230 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 3.78 | WSW | BARDSDALE | 34.341 | -118.9762 | 3420460 | 11858572 |
| 279231 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 3.82 | WSW | BARDSDALE | 34.3395 | -118.9758 | 3420370 | 11858548 |
| 279232 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 3.93 | WSW | BARDSDALE | 34.3391 | -118.9779 | 3420346 | 11858674 |
16 rows × 71 columns
# Let's look at injuries, deaths, property damage, and narrative related to these flash floods.
print(fips_flash_flood[['EVENT_TYPE', 'LOCATION', 'INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT', 'EVENT_NARRATIVE', 'DAMAGE_PROPERTY']])
EVENT_TYPE LOCATION INJURIES_DIRECT INJURIES_INDIRECT \
276360 Flash Flood VENTURA 0 0
276361 Flash Flood VENTURA 0 0
276362 Flash Flood VENTURA 0 0
276363 Flash Flood VENTURA 0 0
276595 Flash Flood SANTA SUSANA 0 0
276596 Flash Flood SANTA SUSANA 0 0
276597 Flash Flood SANTA SUSANA 0 0
276598 Flash Flood SANTA SUSANA 0 0
276840 Flash Flood EL RIO 0 0
276841 Flash Flood EL RIO 0 0
276842 Flash Flood SATICOY 0 0
276843 Flash Flood SATICOY 0 0
279229 Flash Flood BARDSDALE 0 0
279230 Flash Flood BARDSDALE 0 0
279231 Flash Flood BARDSDALE 0 0
279232 Flash Flood BARDSDALE 0 0
DEATHS_DIRECT DEATHS_INDIRECT \
276360 0 0
276361 0 0
276362 0 0
276363 0 0
276595 0 0
276596 0 0
276597 0 0
276598 0 0
276840 0 0
276841 0 0
276842 0 0
276843 0 0
279229 0 0
279230 0 0
279231 0 0
279232 0 0
EVENT_NARRATIVE DAMAGE_PROPERTY
276360 Significant flooding was reported in downtown ... 0.00K
276361 Significant flooding was reported in downtown ... 0.00K
276362 Significant flooding was reported in downtown ... 0.00K
276363 Significant flooding was reported in downtown ... 0.00K
276595 Significant flooding was reported near the com... 0.00K
276596 Significant flooding was reported near the com... 0.00K
276597 Significant flooding was reported near the com... 0.00K
276598 Significant flooding was reported near the com... 0.00K
276840 Flooding closed Vineyard Avenue between Montgo... 0.00K
276841 Flooding closed Vineyard Avenue between Montgo... 0.00K
276842 Flooding closed Vineyard Avenue between Montgo... 0.00K
276843 Flooding closed Vineyard Avenue between Montgo... 0.00K
279229 Flash flooding and debris flows were reported ... 0.00K
279230 Flash flooding and debris flows were reported ... 0.00K
279231 Flash flooding and debris flows were reported ... 0.00K
279232 Flash flooding and debris flows were reported ... 0.00K
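DAMAGE_PROPERTY is stored as text with a magnitude suffix (for example 0.00K or 1.5M), so it cannot be compared numerically as-is. A hedged sketch of one way to convert it, assuming only K/M/B suffixes appear:
# Convert damage strings like "0.00K" or "1.5M" into dollar amounts (NaN if unparseable)
def parse_damage(value):
    multipliers = {'K': 1e3, 'M': 1e6, 'B': 1e9}
    if not isinstance(value, str) or value == '':
        return float('nan')
    suffix = value[-1].upper()
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    try:
        return float(value)
    except ValueError:
        return float('nan')

print(fips_flash_flood['DAMAGE_PROPERTY'].apply(parse_damage).head())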
Saving the dataset#
In case we want to use the merged dataset later, let’s save it to disk.
# Save the merged dataset for later use
file_name = f'stormevents_{start_year}_2024.csv'
merged_df.to_csv(file_name, index=False)
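When reloading the file in a later session, reading it back with string dtypes keeps the FIPS and YEARMONTH comparisons above behaving the same way (a small sketch):
# Reload later with string dtypes so the filtering logic above still works
reloaded = pd.read_csv(file_name, dtype=str)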