Project for EECS 398-003 at the University of Michigan By David Lisbonne
In this project, I investigate a comprehensive dataset published by Purdue University that catalogued power outages across the United States, along with a suite of external variables recorded for each outage. These additional features include geographical location of the outages, regional climate classifications, customer distributions, electricity consumption patterns and economic characteristics of the states affected by the outages.
First, I needed to clean the data and make an initial foray into analyzing it. The dataset contains 1,534 rows and 57 columns, with some of the more niche columns (for example, HURRICANE_NAME) being largely null. As a result, my first step was to sanitize, organize, and normalize the dataset.
Then, I performed a univariate analysis, focusing on the distribution of causes in the “CAUSE.CATEGORY” column of the dataset. This was of particular interest because I thought it might be a key variable for a model to leverage while learning.
Next, I wanted to dive deeper with a bivariate analysis, leveraging combined and related features to better understand patterns in the dataset and thus build a better predictive model. First, I looked at correlations between pairs of variables using scatter plots: outage start time vs. outage duration, outage month vs. outage duration, and outage cause category vs. outage duration. My hypotheses for these pairs were as follows: Do outages that begin outside of working hours take longer to fix? Is there a correlation between outage duration and peak energy-grid usage hours? Do higher-usage months, like the winter months, correlate with longer or shorter outages? And do certain causes of outages correlate with longer fix times, e.g., hurricanes compared to vandalism?
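Scatter plots like these are straightforward to produce with matplotlib. The sketch below is illustrative rather than the project's actual plotting code; the column names follow the dataset, and the helper works for any of the three feature-vs-duration comparisons:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def scatter_vs_duration(df: pd.DataFrame, x_col: str):
    """Scatter one candidate feature against the target, OUTAGE.DURATION."""
    fig, ax = plt.subplots()
    ax.scatter(df[x_col], df["OUTAGE.DURATION"], alpha=0.5)
    ax.set_xlabel(x_col)
    ax.set_ylabel("OUTAGE.DURATION (minutes)")
    return ax
```

Calling `scatter_vs_duration(df, "MONTH")`, for example, reproduces the month-vs-duration comparison.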
Then I expanded to a multivariate analysis and looked at combinations of three variables. The first combination of features I examined was climate category and cause category. I was particularly curious whether harsher climates experiencing weather-related outages took longer to fix than warmer climate areas. Finally, I looked at cause category together with cause category detail to investigate whether more granular documentation helped in finding correlations.
Below is a table outlining the columns of interest in this dataset:
| Column | Description |
|---|---|
| 'MONTH' | The month when the power outage occurred. |
| 'U.S._STATE' | The state in which the power outage took place. |
| 'CLIMATE.REGION' | One of the nine U.S. climate regions as defined by the National Centers for Environmental Information. |
| 'ANOMALY.LEVEL' | The Oceanic Niño Index (ONI) value, indicating whether the climate was experiencing cold (La Niña) or warm (El Niño) episodes during the season. |
| 'OUTAGE.START.DATE' | The specific day of the year when the power outage began. |
| 'OUTAGE.START.TIME' | The exact time of day when the outage started. |
| 'CAUSE.CATEGORY' | A categorical classification of the main cause behind the power outage event. |
| 'OUTAGE.DURATION' | The length of time (in minutes) for which the power outage lasted. |
| 'CUSTOMERS.AFFECTED' | The number of residential, commercial, and industrial customers affected by the power outage. |
| 'TOTAL.CUSTOMERS' | The total number of customers (residential, commercial, and industrial) served by electricity providers in the state. |
| 'RES.CUST.PCT' | The percentage of a region’s customers who are residential. |
It is critical when working on any data science analysis of a large dataset to thoroughly sanitize and normalize the data. This is especially true when trying to later build a predictive model for generalizing on unseen data, as the more regularized the training data the better the model can recognize true patterns within the data.
There were two kinds of imputation necessary for this dataset: numerical and categorical. For numerical imputation I decided to use the mean of all rows for the given state, to reduce the scope of the means to a more representative range. For categorical imputation, I took it one step further, devising a more detailed custom scheme. First, I calculated the most common value of the column in question, “CAUSE.CATEGORY.DETAIL”, for each postal code and annual quarter (computed by binning the month values into four quarters). Then, I merged that grouped result back into the main DataFrame. Finally, I filled the missing values of the “CAUSE.CATEGORY.DETAIL” column with the corresponding group modes, and dropped the temporary imputation column from the DataFrame. Below is the head of the cleaned DataFrame with a subset of columns for readability:
| | MONTH | ANOMALY.LEVEL | CLIMATE.CATEGORY | RES.CUST.PCT | CAUSE.CATEGORY |
|---|---|---|---|---|---|
| 1 | 7 | -0.3 | normal | 88.9448 | severe weather |
| 2 | 5 | -0.1 | normal | 88.8335 | intentional attack |
| 3 | 10 | -1.5 | cold | 88.9206 | severe weather |
| 4 | 6 | -0.1 | normal | 88.8954 | severe weather |
| 5 | 7 | 1.2 | warm | 88.8216 | severe weather |
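The imputation scheme described above can be sketched in pandas. This is an illustrative reconstruction, not the project's exact code: `POSTAL.CODE`, `U.S._STATE`, and `CAUSE.CATEGORY.DETAIL` follow the dataset's column names, and `OUTAGE.DURATION` stands in for any numerical column being imputed:

```python
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Numerical: per-state mean. Categorical: per (postal code, quarter) mode."""
    df = df.copy()
    # Numerical imputation: fill with the mean of the outage's state
    df["OUTAGE.DURATION"] = df.groupby("U.S._STATE")["OUTAGE.DURATION"].transform(
        lambda s: s.fillna(s.mean())
    )
    # Bin months 1-12 into four annual quarters (1-4)
    df["QUARTER"] = (df["MONTH"] - 1) // 3 + 1
    # Most common cause detail per (postal code, quarter) group
    modes = (
        df.groupby(["POSTAL.CODE", "QUARTER"])["CAUSE.CATEGORY.DETAIL"]
        .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA)
        .rename("DETAIL.MODE")
        .reset_index()
    )
    # Merge the group modes back, fill the gaps, and drop the helper columns
    df = df.merge(modes, on=["POSTAL.CODE", "QUARTER"], how="left")
    df["CAUSE.CATEGORY.DETAIL"] = df["CAUSE.CATEGORY.DETAIL"].fillna(df["DETAIL.MODE"])
    return df.drop(columns=["QUARTER", "DETAIL.MODE"])
```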
Before delving into building the predictive models, I was curious to see how anomaly levels and outage durations related to the cause category of an outage. To better understand this, I created a pivot table with cause category as the index, and then displayed the ranges of anomaly levels, as well as the median outage duration, for each category. This pivot table is shown below.
| CAUSE.CATEGORY | ANOMALY.LEVEL (max) | ANOMALY.LEVEL (min) | OUTAGE.DURATION (median) |
|---|---|---|---|
| equipment failure | 1.3 | -1.4 | 221 |
| fuel supply emergency | 2.0 | -1.4 | 3960 |
| intentional attack | 2.3 | -1.3 | 56 |
| islanding | 2.0 | -1.5 | 77.5 |
| public appeal | 2.0 | -1.4 | 455 |
| severe weather | 2.3 | -1.6 | 2460 |
| system operability disruption | 2.3 | -1.5 | 215 |
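A table of this shape can be produced directly with pandas' `pivot_table`. The sketch below is a minimal reconstruction, assuming a DataFrame `df` with the relevant columns:

```python
import pandas as pd

def anomaly_duration_pivot(df: pd.DataFrame) -> pd.DataFrame:
    """Range of anomaly levels and median outage duration per cause category."""
    return df.pivot_table(
        index="CAUSE.CATEGORY",
        values=["ANOMALY.LEVEL", "OUTAGE.DURATION"],
        # A dict aggfunc applies different aggregations to different columns
        aggfunc={"ANOMALY.LEVEL": ["max", "min"], "OUTAGE.DURATION": "median"},
    )
```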
One of the most interesting columns of the dataset is cause category, and so I wanted to understand the univariate distribution of the column. This would help me gain insight into whether it would be a good training feature – for example, if the column were very heavily weighted toward one cause, it would be hard for a model to learn anything from that feature. Below is the histogram of the cause category column values.
Figure 1: Univariate distribution of cause category
As mentioned in the introduction, I performed three bivariate analyses to better understand correlations between features I expected might be important and my ultimate target variable, “outage duration”. The first examined the outage start time vs. the outage duration, under the hypothesis that outages occurring outside of working hours might take longer to fix. This theory doesn’t appear to be well supported by the data, and the plot shows weak correlation between the timing of an outage and its duration.
Figure 2: Plot of outage start time vs outage duration
The next analysis plotted the outage month vs. the outage duration, testing the hypothesis that harsher environmental conditions in winter might make repairs more difficult. This also proved to have only a weak direct correlation, largely because it is the number of outages, rather than their duration, that varies with the month.
Figure 3: Plot of outage month vs outage duration
Finally, the last bivariate analysis I looked into was the outage cause category vs. the outage duration. Here, the hypothesis was that harsher, more extreme causes, like severe weather, might also correlate with longer outages. This proved to be true, especially as shown by the medians of the outage durations by category.
Figure 4: Bar chart of median outage durations by cause category
I also examined two multivariate relationships: the first between cause category and climate category vs. outage duration, and the second between cause category and cause category detail (a more granular description of the cause) vs. outage duration. My thought process for the first analysis was to understand whether certain types of outage causes are exacerbated by specific climate conditions, resulting in longer outages. This relationship proved clearest for fuel supply emergencies, where normal climate conditions had the shortest durations. This makes sense: colder environments likely have higher fuel requirements, so when shortages occur it takes longer to assemble the requisite fuel to end the outage. Similarly, warmer climates are likely less prepared for such an eventuality, so any shortage of fuel would take longer to recover from, given the small supply they likely had originally. Below is the box plot for my first analysis.
Figure 5: Multivariate box plot showing the relationship between cause category and climate category vs. outage duration.
I aim to predict the outage duration based on historical data and various features such as weather conditions, region demographics, and outage characteristics.
This is a regression problem, since the response variable (outage duration) is continuous.
The target variable is the outage duration (a continuous variable representing the duration in minutes of a given outage).
The outage duration is a key metric for utility companies, as it helps to assess the severity of the outage. Knowing this allows better resource allocation, such as deploying repair crews, communicating with customers, and prioritizing areas for restoration. It also provides valuable insights into the reliability of the power grid in different regions and under different conditions.
When making the prediction, I would only have access to the features available before or during the early stages of the outage.
We can approach this problem using a regression model. We’ll train the model using the selected features and predict the continuous response variable: outage duration.
For a regression problem, the most common evaluation metrics are Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
The goal is to minimize these metrics, and depending on the application, we might prefer one over the other based on whether we want to penalize larger errors more (RMSE) or simply understand average performance (MAE). It’s worth noting that I also normalize the MAE to gain a better intuition for the quality of the model’s prediction, given that the target variable’s values are very large.
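These metrics are simple to compute by hand. One caveat: the write-up does not specify which normalization it applies to the MAE, so the sketch below divides by the target's range, which is one common choice:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and a range-normalized MAE (normalization choice assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))            # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # penalizes large errors more
    nmae = mae / (y_true.max() - y_true.min())  # scale-free, easier to interpret
    return mae, rmse, nmae
```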
The baseline model uses a RandomForestRegressor to predict the outage duration, trained on a handful of numerical and categorical features from the dataset.
The baseline model provides a solid starting point, with reasonable accuracy based on the performance metrics. The model utilizes simple preprocessing steps (scaling and encoding) and does not incorporate advanced feature engineering or hyperparameter tuning.
Based on its performance, the baseline model can be considered a reasonable first step, and its accuracy is already quite good: its normalized error is about 4%. However, further improvements can be made by exploring additional feature engineering (e.g., combining features or adding new ones) and performing hyperparameter tuning to optimize the RandomForestRegressor for better predictive power.
While the model’s performance can be seen as satisfactory, especially in comparison to a non-optimized or simple model, there is still room for improvement with more sophisticated techniques, such as better feature transformations or using different algorithms.
For the final model, I chose the RandomForestRegressor, an ensemble learning algorithm that combines multiple decision trees to improve predictive performance and reduce overfitting. It is particularly effective for regression tasks with complex relationships between features and target variables. I also created two new features, the log of the number of customers affected and the number of residential customers per region, and used those to improve my model. Both of these give the model a better sense of the importance of an outage, using the number of customers as a proxy. My hope is that this, in combination with the climate features, will better round out the model’s ability to learn from prior data, so that on new, unseen data it can generalize better using broad population metrics.
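The two engineered features can be sketched as follows; the new column names (`LOG.CUSTOMERS.AFFECTED`, `RES.CUSTOMERS`) are illustrative choices of mine, not names from the dataset:

```python
import numpy as np
import pandas as pd

def add_customer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the two engineered features used by the final model (sketch)."""
    df = df.copy()
    # log1p handles rows with zero customers affected gracefully
    df["LOG.CUSTOMERS.AFFECTED"] = np.log1p(df["CUSTOMERS.AFFECTED"])
    # Approximate residential customer count from the total and the percentage
    df["RES.CUSTOMERS"] = df["TOTAL.CUSTOMERS"] * df["RES.CUST.PCT"] / 100
    return df
```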
- n_estimators: Number of trees in the forest. Best value: 100
- max_depth: Maximum depth of each tree. Best value: 20
- min_samples_split: Minimum samples required to split an internal node. Best value: 2
- min_samples_leaf: Minimum samples required at a leaf node. Best value: 2

I used GridSearchCV with 3-fold cross-validation to search over the hyperparameter grid and select the best combination. The performance was evaluated using Negative Mean Absolute Error (neg_mean_absolute_error) as the scoring metric.
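The search can be sketched as below. The grid here is a reduced, illustrative one (the actual search presumably covered more values around the reported best parameters):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Reduced illustrative grid; values include the reported best parameters.
param_grid = {
    "n_estimators": [100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,  # 3-fold cross-validation
    scoring="neg_mean_absolute_error",  # GridSearchCV maximizes, hence "neg"
)
# After search.fit(X_train, y_train), search.best_params_ holds the winner.
```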
The Final Model outperformed the baseline by using optimal hyperparameters and additional feature transformations, leading to better generalization and lower error. I am happy with the Final Model’s performance, cutting down NMAE by 50%.
However, the model certainly isn’t perfect, and can no doubt be improved upon with more complex methods and more training data. To get a sense of its practical accuracy, here is a chart of the residuals of the final model.
Figure 6: Histogram of residuals for final model