Storm Events Database#
The purpose of this notebook is to download, assemble, and work with the Storm Events dataset. The notebook walks through the following steps:
Connect to the NCEI archive at https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/ and list the available files.
Download the zipped CSV files of type “details”, “fatalities”, and “locations”.
Decompress and load each of these files.
Join them using the EVENT_ID column.
Delete the individual CSV and .gz files.
Save the combined CSV file.
Work with the data.
The database currently contains data from January 1950 to April 2024 and is updated monthly. Data on all event types only goes back to 1996, so there is little reason to download files from before then; in this notebook we download files from 2021 onward. Full documentation can be found here: https://www.ncdc.noaa.gov/stormevents/details.jsp
While there is some overlap with SHELDUS, the Storm Events Database offers more detailed information at finer geographic scales.
Needed Imports#
import os
import pandas as pd
import time
import glob
import requests
from bs4 import BeautifulSoup
Connect to the website and get a list of files#
url = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
ext = "csv.gz"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")
csv_files_https = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
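As a quick sanity check (a minimal sketch, assuming the request succeeded and the directory layout has not changed), you can confirm how many compressed files were found:
# How many .csv.gz links did we find, and what do they look like?
print(f"Found {len(csv_files_https)} compressed CSV files")
print(csv_files_https[:3])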
Filter files by type and download#
We are going to download all zipped files from 2021 to the present.
# Set the directory for downloaded data
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

# Filter the file list: keep only the three file types we need, from start_year onward
storm_files = {}
file_types = ["details", "fatalities", "locations"]
start_year = "2021"

for file_type in file_types:
    storm_files[file_type] = pd.Series(
        [filename for filename in csv_files_https if file_type in filename]
    )
    # Filenames look like StormEvents_<type>-ftp_v1.0_dYYYY_cYYYYMMDD.csv.gz,
    # so the year sits inside the fourth underscore-separated piece, after the "d"
    years = storm_files[file_type].str.split("_").str[3].str[1:5]
    storm_files[file_type] = storm_files[file_type][years >= start_year]
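The positional split above depends on the exact filename layout. A regex that captures the four-digit year after "_d" does the same job slightly more defensively (an equivalent sketch, not a required change):
# Alternative year extraction: capture the digits after "_d" with a regex
for file_type in file_types:
    all_files = pd.Series([f for f in csv_files_https if file_type in f])
    years = all_files.str.extract(r"_d(\d{4})_")[0]
    storm_files[file_type] = all_files[years >= start_year]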
def download_and_combine(files, prefix, timeout=5):
    """Download each compressed CSV, concatenate them, and save one CSV per type."""
    dataframes = []
    for csv_file in files:
        print(csv_file)
        # pandas reads the remote .csv.gz directly and handles decompression
        dataframes.append(pd.read_csv(csv_file))
        # pause between requests so we don't hammer the NCEI server
        time.sleep(timeout)
    if dataframes:
        combined_df = pd.concat(dataframes, ignore_index=True)
        combined_df.to_csv(os.path.join(data_dir, f"{prefix}.csv"), index=False)
for file_type in file_types:
    print(file_type)
    download_and_combine(storm_files[file_type], file_type, timeout=3)
details
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2024_c20240816.csv.gz
fatalities
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d2024_c20240816.csv.gz
locations
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2021_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2022_c20240716.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2023_c20240816.csv.gz
https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d2024_c20240816.csv.gz
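If the NCEI server drops a connection partway through, pd.read_csv will raise and the loop stops. A minimal retry wrapper is one way to harden this (a sketch with a hypothetical helper, not part of the original notebook):
# Hypothetical helper: retry a flaky download a few times before giving up
def read_csv_with_retry(url, retries=3, delay=5):
    for attempt in range(1, retries + 1):
        try:
            return pd.read_csv(url)
        except Exception as exc:
            print(f"Attempt {attempt} failed for {url}: {exc}")
            time.sleep(delay)
    raise RuntimeError(f"Could not download {url} after {retries} attempts")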
Load the CSVs into pandas and join using EVENT_ID#
# Load the combined CSVs into pandas and join them on 'EVENT_ID'.
# Reading everything as strings keeps FIPS codes and YEARMONTH columns comparable later.
df_details = pd.read_csv(os.path.join(data_dir, "details.csv"), dtype=str)
df_fatalities = pd.read_csv(os.path.join(data_dir, "fatalities.csv"), dtype=str)
df_locations = pd.read_csv(os.path.join(data_dir, "locations.csv"), dtype=str)

merged_df = df_details.merge(df_fatalities, on="EVENT_ID", how="left") \
                      .merge(df_locations, on="EVENT_ID", how="left")
merged_df.head(5)
|   | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID_x | EVENT_ID | STATE | STATE_FIPS | ... | YEARMONTH | EPISODE_ID_y | LOCATION_INDEX | RANGE | AZIMUTH | LOCATION | LATITUDE | LONGITUDE | LAT2 | LON2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202112 | 11 | 349 | 202112 | 11 | 350 | 165322 | 999750 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 202112 | 11 | 249 | 202112 | 11 | 254 | 165322 | 999613 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 202112 | 11 | 325 | 202112 | 11 | 327 | 165322 | 999636 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 202112 | 11 | 232 | 202112 | 11 | 239 | 165322 | 999604 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 202112 | 6 | 724 | 202112 | 6 | 724 | 165321 | 999306 | TENNESSEE | 47 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 71 columns
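Because a single event can have several location records (and several fatality records), these left joins can yield more than one row per EVENT_ID, as the flash-flood rows later in this notebook show. A quick way to see how much duplication the joins introduced (a small sketch):
# Count rows that repeat an EVENT_ID already seen earlier in the merged frame
dup_rows = merged_df["EVENT_ID"].duplicated().sum()
print(f"{dup_rows} rows repeat an EVENT_ID that already appeared")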
Deleting individual files#
Let’s clean up the downloaded per-type files; now that the merged DataFrame is in memory, they are just taking up space.
# List of individual files to delete
individual_files = ["details.csv", "fatalities.csv", "locations.csv"]
gz_files = [file.replace('.csv', '.csv.gz') for file in individual_files]

# Combine both lists
files_to_delete = individual_files + gz_files

# Delete specified files
for file in files_to_delete:
    file_path = os.path.join(data_dir, file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Deleted: {file_path}")
    else:
        print(f"{file_path} not found.")

# Additionally, ensure all .gz files in the directory are deleted
gz_files_in_dir = glob.glob(os.path.join(data_dir, "*.gz"))
for gz_file in gz_files_in_dir:
    os.remove(gz_file)
    print(f"Deleted: {gz_file}")
Deleted: ./data/details.csv
Deleted: ./data/fatalities.csv
Deleted: ./data/locations.csv
./data/details.csv.gz not found.
./data/fatalities.csv.gz not found.
./data/locations.csv.gz not found.
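To confirm the cleanup left only the files we expect, you can list what remains in the data directory (a small sketch):
# Show whatever is still sitting in ./data after the cleanup
print(glob.glob(os.path.join(data_dir, "*")))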
Working with the dataset#
Let’s focus on events in the area where the survey was administered (Southern California) and look at events since the start of the year. For this we’ll use county FIPS codes for the counties where the survey is administered.
Using FIPS codes#
The Storm Events Database has a column (CZ_FIPS) that stores the last three digits of the county FIPS code. We’ll do a boolean filter on that together with the state code (STATE_FIPS), which is 06 for California, although the dataset drops the leading zero and stores it as 6.
# Three-digit county FIPS codes for the Southern California survey counties
socal_fips_codes = ['083', '111', '037', '031', '073', '043', '053', '027', '071', '069', '065', '029', '019', '039', '047', '107', '059', '025', '079']

# Define the start and end dates as strings. The format is YYYYMM (year and month)
start_date = '202401'
end_date = '202404'

# Filter the DataFrame for rows between the start and end dates
filtered_df = merged_df[(merged_df['BEGIN_YEARMONTH'] >= start_date) & (merged_df['END_YEARMONTH'] <= end_date)]

# Keep the listed counties within California (STATE_FIPS is stored as '6')
fips = filtered_df[(filtered_df['CZ_FIPS'].isin(socal_fips_codes)) & (filtered_df['STATE_FIPS'] == '6')]
fips.head()
|   | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID_x | EVENT_ID | STATE | STATE_FIPS | ... | YEARMONTH | EPISODE_ID_y | LOCATION_INDEX | RANGE | AZIMUTH | LOCATION | LATITUDE | LONGITUDE | LAT2 | LON2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261072 | 202401 | 3 | 606 | 202401 | 3 | 806 | 186834 | 1146696 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 261672 | 202403 | 2 | 1738 | 202403 | 2 | 1738 | 188027 | 1154871 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262197 | 202402 | 6 | 430 | 202402 | 6 | 430 | 187518 | 1151475 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262340 | 202402 | 19 | 1315 | 202402 | 19 | 1515 | 187711 | 1152775 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262341 | 202402 | 19 | 1319 | 202402 | 19 | 1519 | 187711 | 1152776 | CALIFORNIA | 6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 71 columns
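If you later need standard five-digit county FIPS codes (for example, to merge with SHELDUS or Census data), you can zero-pad and recombine the two columns. A hedged sketch, assuming both columns were read as strings (as they were with dtype=str above) and restricting to county-type records:
# Rebuild 5-digit FIPS codes: 2-digit state + 3-digit county.
# Caution: CZ_FIPS is only a county code for county-type records (CZ_TYPE == 'C').
county_rows = fips[fips['CZ_TYPE'] == 'C']
fips_full = county_rows['STATE_FIPS'].str.zfill(2) + county_rows['CZ_FIPS'].str.zfill(3)
print(fips_full.head())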
# What kinds of events occurred? Group by the 'EVENT_TYPE' column
event_counts = fips.groupby('EVENT_TYPE').size().reset_index(name='Count')
event_counts
|   | EVENT_TYPE | Count |
|---|---|---|
| 0 | Flash Flood | 16 |
| 1 | Flood | 9 |
| 2 | Funnel Cloud | 5 |
| 3 | Hail | 1 |
| 4 | Thunderstorm Wind | 3 |
At this point, you can drill down to determine whether the events in the dataset are relevant to your analysis.
# Subset the DataFrame based on flash floods
fips_flash_flood = fips[fips['EVENT_TYPE'] == 'Flash Flood']
fips_flash_flood
|   | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID_x | EVENT_ID | STATE | STATE_FIPS | ... | YEARMONTH | EPISODE_ID_y | LOCATION_INDEX | RANGE | AZIMUTH | LOCATION | LATITUDE | LONGITUDE | LAT2 | LON2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 276360 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 1.71 | W | VENTURA | 34.2815 | -119.2999 | 3416890 | 11917994 |
| 276361 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 1.66 | W | VENTURA | 34.278 | -119.2991 | 3416680 | 11917946 |
| 276362 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 1.54 | W | VENTURA | 34.278 | -119.2969 | 3416680 | 11917814 |
| 276363 | 202402 | 4 | 1956 | 202402 | 4 | 2115 | 189405 | 1164907 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 1.57 | W | VENTURA | 34.2816 | -119.2974 | 3416896 | 11917844 |
| 276595 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 10.45 | SE | SANTA SUSANA | 34.1802 | -118.5529 | 3410812 | 11833174 |
| 276596 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 10.43 | SE | SANTA SUSANA | 34.181 | -118.5525 | 3410860 | 11833150 |
| 276597 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 10.71 | SE | SANTA SUSANA | 34.1806 | -118.5468 | 3410836 | 11832808 |
| 276598 | 202402 | 5 | 243 | 202402 | 5 | 430 | 189405 | 1164909 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 10.73 | SE | SANTA SUSANA | 34.18 | -118.5469 | 3410800 | 11832814 |
| 276840 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 1.5 | NNE | EL RIO | 34.25 | -119.16 | 3415000 | 1199600 |
| 276841 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 1.5 | NNE | EL RIO | 34.2497 | -119.1588 | 3414982 | 1199528 |
| 276842 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 1.52 | S | SATICOY | 34.2581 | -119.1525 | 3415486 | 1199150 |
| 276843 | 202402 | 4 | 1735 | 202402 | 4 | 1930 | 189405 | 1164922 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 1.43 | S | SATICOY | 34.2595 | -119.1535 | 3415570 | 1199210 |
| 279229 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 1 | 4.0 | WSW | BARDSDALE | 34.34 | -118.98 | 3420400 | 11858800 |
| 279230 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 2 | 3.78 | WSW | BARDSDALE | 34.341 | -118.9762 | 3420460 | 11858572 |
| 279231 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 3 | 3.82 | WSW | BARDSDALE | 34.3395 | -118.9758 | 3420370 | 11858548 |
| 279232 | 202402 | 4 | 2037 | 202402 | 4 | 2200 | 189405 | 1164888 | CALIFORNIA | 6 | ... | 202402 | 189405 | 4 | 3.93 | WSW | BARDSDALE | 34.3391 | -118.9779 | 3420346 | 11858674 |
16 rows × 71 columns
# Let's look at injuries, deaths, property damage, and narrative related to these flash floods.
print(fips_flash_flood[['EVENT_TYPE', 'LOCATION', 'INJURIES_DIRECT', 'INJURIES_INDIRECT', 'DEATHS_DIRECT', 'DEATHS_INDIRECT', 'EVENT_NARRATIVE', 'DAMAGE_PROPERTY']])
EVENT_TYPE LOCATION INJURIES_DIRECT INJURIES_INDIRECT \
276360 Flash Flood VENTURA 0 0
276361 Flash Flood VENTURA 0 0
276362 Flash Flood VENTURA 0 0
276363 Flash Flood VENTURA 0 0
276595 Flash Flood SANTA SUSANA 0 0
276596 Flash Flood SANTA SUSANA 0 0
276597 Flash Flood SANTA SUSANA 0 0
276598 Flash Flood SANTA SUSANA 0 0
276840 Flash Flood EL RIO 0 0
276841 Flash Flood EL RIO 0 0
276842 Flash Flood SATICOY 0 0
276843 Flash Flood SATICOY 0 0
279229 Flash Flood BARDSDALE 0 0
279230 Flash Flood BARDSDALE 0 0
279231 Flash Flood BARDSDALE 0 0
279232 Flash Flood BARDSDALE 0 0
DEATHS_DIRECT DEATHS_INDIRECT \
276360 0 0
276361 0 0
276362 0 0
276363 0 0
276595 0 0
276596 0 0
276597 0 0
276598 0 0
276840 0 0
276841 0 0
276842 0 0
276843 0 0
279229 0 0
279230 0 0
279231 0 0
279232 0 0
EVENT_NARRATIVE DAMAGE_PROPERTY
276360 Significant flooding was reported in downtown ... 0.00K
276361 Significant flooding was reported in downtown ... 0.00K
276362 Significant flooding was reported in downtown ... 0.00K
276363 Significant flooding was reported in downtown ... 0.00K
276595 Significant flooding was reported near the com... 0.00K
276596 Significant flooding was reported near the com... 0.00K
276597 Significant flooding was reported near the com... 0.00K
276598 Significant flooding was reported near the com... 0.00K
276840 Flooding closed Vineyard Avenue between Montgo... 0.00K
276841 Flooding closed Vineyard Avenue between Montgo... 0.00K
276842 Flooding closed Vineyard Avenue between Montgo... 0.00K
276843 Flooding closed Vineyard Avenue between Montgo... 0.00K
279229 Flash flooding and debris flows were reported ... 0.00K
279230 Flash flooding and debris flows were reported ... 0.00K
279231 Flash flooding and debris flows were reported ... 0.00K
279232 Flash flooding and debris flows were reported ... 0.00K
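DAMAGE_PROPERTY is stored as text with a magnitude suffix (for example 0.00K or 1.5M), so it cannot be compared numerically as-is. A hedged sketch of one way to convert it, assuming only K/M/B suffixes appear:
# Convert damage strings like "0.00K" or "1.5M" into dollar amounts (NaN if unparseable)
def parse_damage(value):
    multipliers = {'K': 1e3, 'M': 1e6, 'B': 1e9}
    if not isinstance(value, str) or value == '':
        return float('nan')
    suffix = value[-1].upper()
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    try:
        return float(value)
    except ValueError:
        return float('nan')

print(fips_flash_flood['DAMAGE_PROPERTY'].apply(parse_damage).head())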
Saving the dataset#
In case we want to use the merged dataset later, let’s save it to disk.
# Save the merged dataset for later use
file_name = f'stormevents_{start_year}_2024.csv'
merged_df.to_csv(file_name, index=False)
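When reloading the file in a later session, reading it back with string dtypes keeps the FIPS and YEARMONTH comparisons above behaving the same way (a small sketch):
# Reload later with string dtypes so the filtering logic above still works
reloaded = pd.read_csv(file_name, dtype=str)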