More than a month has passed, and the investigation into how and why the blackout happened is still in its early stages. Nonetheless, a few pieces of the jigsaw are in place, and we can start to consider underlying causes that might have led to this situation. Of course, there is still a great deal of work to be done in piecing together all the relevant events, and understanding the links between them.

Because of the extent of the work remaining, any discussion of the causes of the blackout involves some conjecture at the moment. The full detailed picture of what happened will take some time to emerge, and even then there could be many interpretations of the details. The need for solutions to avoid future occurrences, however, is urgent. The search for ways forward can start before every detail of the event is filled in.

It is most likely that the cause of the blackout will turn out to involve complex interactions among:

  • chance occurrences and coincidences
  • random events made more likely by conditions in the electrical system and the environment
  • cascading chains of events
  • human interventions

There appear to have been two distinct periods in the time leading up to the blackout. Up to about 3.40pm¹, there were a number of seemingly unconnected events in the Midwest. Separated in time and geography, these events conspired to create a situation with a high risk of a large-scale cascading outage. Then, from about 3.40pm onwards, incidents occurred with increasing frequency, locally at first, eventually cascading to create the biggest blackout in North American history.

Two questions are important. First, how did circumstances arise in which an event on such a scale could happen? Second, why did the cascade propagate through the network so effectively?

Unfortunate Chance, or a Weakness Exposed?
Considering the events between around noon and the "watershed" around 3.40pm, we see isolated events occurring across Indiana, Ohio and Michigan. Some of these were mini-cascades, others apparently isolated occurrences. All were contained, causing no widespread outages. Among these events were:

  • transmission lines tripping in Indiana, with at least one trip propagating from the high-voltage to lower-voltage levels in a mini-cascade
  • two generating units in Ohio tripping
  • one generating unit in Michigan tripping
  • brush fires under an Ohio transmission line causing the line to trip
  • other lines tripping in Ohio

Did all these events really happen independently? If so, the blackout was a very unfortunate accident. The probability that these events, or others creating a similar likelihood of collapse, would coincide is extremely small. The system is designed, built and operated so that such a coincidence is vanishingly improbable.
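
As a purely illustrative calculation (the individual probabilities here are assumptions, not measurements): if each of, say, five such events independently had a one-in-a-hundred chance of occurring on a given afternoon, the probability of all five coinciding would be

\[
P = p^5 = (10^{-2})^5 = 10^{-10},
\]

roughly one afternoon in ten billion. Even with far more generous individual probabilities, the joint probability of a purely independent coincidence remains uncomfortably small as an explanation.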

However, the other possibility is that these events are not all entirely independent. By its nature, the transmission network interconnects a very large number of components, and each component has an influence on the others. In an unstressed system, one component failure (for example, a line tripping or a generator outage) has little influence on other components, but in a system that is highly stressed, failure of one component has a much greater influence on the rest of the system. Thus, in a stressed system, failure of one component can increase the likelihood of other subsequent failures. So the evolution of the high-risk condition arising around 3.40pm may not have been as unlikely as traditional reliability analyses would suggest, because of some form of underlying stress in the system.
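
The effect can be sketched in a toy model. The Python fragment below is purely illustrative: the network, the loads, the trip-probability curve and the equal-sharing redistribution rule are all invented assumptions, not a representation of the actual grid. It simply shows how the same random mechanism that fizzles out in a lightly loaded system can cascade in a heavily loaded one.

```python
import random

def simulate_cascade(loads, capacity, trip_prob, seed=None):
    """Toy cascading-failure model (illustrative assumptions only).

    loads:     per-line loading (arbitrary units)
    capacity:  per-line rating; loading above it forces a trip
    trip_prob: maps loading fraction -> probability of a random trip
    """
    rng = random.Random(seed)
    loads = list(loads)
    alive = list(range(len(loads)))
    tripped = []
    while alive:
        # A line trips if overloaded, or randomly with odds that grow with stress.
        newly = [i for i in alive
                 if loads[i] > capacity[i]
                 or rng.random() < trip_prob(loads[i] / capacity[i])]
        if not newly:
            break
        for i in newly:
            alive.remove(i)
            tripped.append(i)
        if alive:
            # Crude redistribution: tripped lines' load is shared by survivors.
            shed = sum(loads[i] for i in newly)
            for i in alive:
                loads[i] += shed / len(alive)
    return tripped

def trip_prob(frac):
    # Hypothetical curve: no random trips below half load, rising odds above.
    return 0.0 if frac < 0.5 else 0.05 * frac

light = simulate_cascade([40] * 10, [100] * 10, trip_prob, seed=1)
heavy = simulate_cascade([85] * 10, [100] * 10, trip_prob, seed=1)
print(f"{len(light)} lines lost when lightly loaded; {len(heavy)} when heavily loaded")
```

With these invented numbers the lightly loaded run never trips at all (the random-trip probability is zero below half load), while the heavily loaded run can collapse completely: stress converts an isolated random failure into a cascade.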

The system may have been stressed by one or more factors, including:

  • high loading of transmission and generation facilities
  • depressed voltages
  • high temperature and humidity
  • dynamic interactions between interconnected generators and other machines

Certainly, there were some heavily loaded lines and some unusual power flows in the Midwest. With the Davis-Besse nuclear plant out of action, Ohio generation and voltage support reserves were not high for summer demand. Loading on West-East lines was relatively high.

Depressed voltages stress the system because they force generators to work at the limits of their capability to support voltage. But depressed voltage tends to be a local issue. Voltage was clearly low in areas of Indiana from around noon, but it is not clear that Ohio or Michigan had low voltage until well into the afternoon, after certain lines tripped between 3pm and 3.40pm.
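
The mechanism can be illustrated with the standard round-rotor machine model (a textbook relation, not data from the event). The reactive power a generator can feed into the grid is approximately

\[
Q \approx \frac{V E_f \cos\delta - V^2}{X_s},
\]

where \(V\) is the terminal voltage, \(E_f\) the internal EMF set by the field current, \(\delta\) the rotor angle and \(X_s\) the synchronous reactance. \(E_f\) is capped by field-winding heating limits, so a machine can only raise \(Q\) to support a sagging voltage up to that cap; once excitation limiters act, its voltage support is effectively exhausted and the burden shifts to neighbouring machines.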

Temperature and humidity may have contributed to system stress, increasing air-conditioning loads and reducing the cooling capability of some generation. But there were no record temperatures that day, nor was the load especially high for that time of year.

The possibility of dynamic interactions between generators interconnected by the grid is of particular interest. These interactions can occur over a wide region. Because incidents are rarely traced back to this mechanism, it is often overlooked, particularly in regions such as the Northeast where there is no history of known problems. But oscillations between interconnected machines are most likely to cause trouble when parts of a system are stressed by loading and voltage issues, power flow patterns are unusual, and bottlenecks exist in the grid. All of these conditions were present, increasingly so as the afternoon progressed. The possibility of system stress induced by this mechanism should not be ignored.
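
The underlying mechanics are captured by the classical swing equation for two interconnected machine groups (again a textbook model, not a finding about the event):

\[
M \frac{d^2\delta}{dt^2} + D \frac{d\delta}{dt} = P_m - \frac{E_1 E_2}{X}\sin\delta,
\]

where \(\delta\) is the angle between the two groups, \(M\) their inertia, \(D\) the damping, \(P_m\) the mechanical power input and \((E_1 E_2 / X)\sin\delta\) the electrical power carried by a tie of reactance \(X\). Heavy loading pushes \(\delta\) up the sine curve while depressed voltages shrink \(E_1 E_2\); both weaken the restoring term, so oscillations between the machine groups damp more slowly and a sufficiently large disturbance can pull them out of step.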

The way in which the events before the 3.40pm watershed were interrelated is not yet known. But it is most likely that these seemingly isolated events were related, through some form of stress to the grid.

The Cascade Proper
The stage was set by around 3.40pm for a large-scale blackout. Then several lines at sub-transmission voltage level tripped in quick succession, along with some events at transmission level. Near Cleveland, the lower-voltage grid started to separate from the high-voltage transmission grid, resulting in overloads in the lower-voltage network and further trips. Once voltage could not be maintained in the region with the line outages, generation began to trip, causing a slide that resulted in the blackout not only of eastern Michigan and northern Ohio, but also of New York and Ontario.

The final event that blacked out eastern Michigan was the tripping of the interconnection between Michigan and Ontario. Power that had been flowing into eastern Michigan from the north suddenly shifted to flow south of Lake Erie through New York into Ontario. New York operators saw flows into Canada increase by 500-600MW.

Within seconds, however, the New York operators saw the power flow reverse, as a huge power surge swept back from Ontario into New York, with an associated frequency spike to 63Hz. This enormous power swing appears to be a dynamic effect associated with the changing shape of the grid around the Great Lakes, including the loss of a vital power route through Michigan to Ontario. What happened in that power surge is unlikely to be explained as a simple redistribution of power flows. More likely, it was the system's response in trying to maintain synchronism between the interconnected areas of Ontario and New York through a weak link, following a huge, sudden change in the shape of the grid.
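
The same power-angle relation suggests why (as illustration only). The synchronizing stiffness of a tie is the slope of its power transfer,

\[
\frac{\partial P}{\partial \delta} = \frac{E_1 E_2}{X}\cos\delta,
\]

so when the route through Michigan was lost, the remaining ties south of Lake Erie faced a larger effective reactance \(X\) and a suddenly larger angle \(\delta\) to hold together. The stiffness collapsed just as the largest restoring power was demanded, and a violent swing rather than a quiet redistribution of flows is exactly what such a model predicts.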

Managing Blackout Risks
From an engineering perspective, there are two ways to reduce the risks of future blackouts. One is to strengthen the electricity infrastructure, and the other is to address the methods used to control and manage the existing facilities.

There is a clear need for more investment in electricity infrastructure. Strengthening the transmission grid and adding appropriately located and specified generation would certainly improve the situation. But it is not the whole solution, since:

  • Major infrastructure investments have long lead times. In the meantime, there could be another blackout.
  • Strengthening the grid without addressing the methods and rules to manage it could lead to the same risks recurring as load growth fills up the spare capacity.

The tried and tested methods to ensure secure operation of the system broadly involve the use of system models to determine boundaries, or constraints, within which the system must be operated. Then the control operators (and also automatic protection mechanisms) ensure that the system stays within these defined boundaries.
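
To make "operating within boundaries" concrete, here is a minimal N-1 screening loop in Python, the kind of check such system models support. The three-bus network, its ratings and the DC power-flow approximation are illustrative assumptions, not a description of any operator's actual tools.

```python
import numpy as np

# Hypothetical 3-bus system: bus 0 generates 300MW, buses 1 and 2 consume.
injections = np.array([300.0, -100.0, -200.0])          # MW, must sum to zero
lines = [(0, 1, 0.10, 250.0),                           # (from, to, x p.u., rating MW)
         (0, 2, 0.10, 250.0),
         (1, 2, 0.05, 150.0)]

def dc_flows(lines, injections, slack=0):
    """Solve the DC power flow B'theta = P and return per-line flows (MW)."""
    n = len(injections)
    B = np.zeros((n, n))
    for f, t, x, _ in lines:
        B[f, f] += 1 / x; B[t, t] += 1 / x
        B[f, t] -= 1 / x; B[t, f] -= 1 / x
    keep = [i for i in range(n) if i != slack]           # slack bus angle = 0
    theta = np.zeros(n)
    theta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], injections[keep])
    return [(theta[f] - theta[t]) / x for f, t, x, _ in lines]

print("intact flows:", [f"{fl:.0f}MW" for fl in dc_flows(lines, injections)])

# N-1 screen: does any single line outage overload a surviving line?
for out in range(len(lines)):
    remaining = [l for i, l in enumerate(lines) if i != out]
    flows = dc_flows(remaining, injections)
    bad = [(l[:2], round(fl)) for l, fl in zip(remaining, flows) if abs(fl) > l[3]]
    print(f"outage of line {lines[out][:2]}:", "secure" if not bad else f"overloads {bad}")
```

In this invented example the intact network is comfortably within ratings, yet two of the three single outages would overload a surviving line; an operator would be required to constrain the pre-fault operating point (reduce transfers, redispatch generation) until every credible contingency was survivable.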

It is worth noting that on 14th August there were issues in both of the above areas. The IT system in the Midwest ISO used to define secure constraints was out of action for an hour and a half. Later, FirstEnergy's monitoring system, which lets control room operators view what is happening on the grid, developed problems. Even aside from these failures, the systems normally available to operators in the region would not have made it easy for them to take in the big picture of the state of the whole system quickly.

These standard industry practices must be rigorously applied. In particular, maintaining detailed and validated models of the network is a vital piece of good practice, and it is certainly not a trivial task. Accurate prediction of the dynamic behaviour of the grid is an especially difficult part of it.

By improving the management and operation of these security mechanisms, the risk of outages can certainly be reduced. But this does not address the core issue in large-scale cascading blackouts: that underlying system conditions and interactions can, in certain circumstances, weaken the grid in such a way that the probability of catastrophic failure is increased.

In the discussion of the 2003 blackout above, some potential linking factors were listed that could contribute to conditions in which a blackout becomes possible. There is a need for intelligent grid monitoring tools capable of giving early warning of conditions in which there is a heightened risk of instability and cascading failures. Such tools could empower operators across a wide interconnected region to detect and mitigate these risk conditions. Monitoring systems have been developed that address aspects of blackout risk with early-warning and risk-mitigation facilities. These are not yet in widespread use, but could be deployed in a short timeframe in response to the urgent need to prevent recurrences.
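
As a sketch of the kind of indicator such a tool might compute, the Python fragment below combines a few measured symptoms of stress into a single early-warning score. The quantities, thresholds and weights are invented for illustration and do not correspond to any deployed product.

```python
from dataclasses import dataclass

@dataclass
class GridSnapshot:
    line_loading: dict           # line id -> loading as fraction of rating
    bus_voltage: dict            # bus id -> voltage in per-unit
    oscillation_damping: float   # damping ratio of dominant inter-area mode

def stress_index(snap, v_low=0.95, load_high=0.9, damp_low=0.05):
    """Combine simple per-quantity alarms into one 0..1 stress score.

    All thresholds and weights here are illustrative assumptions.
    """
    overloads = sum(1 for v in snap.line_loading.values() if v > load_high)
    low_volts = sum(1 for v in snap.bus_voltage.values() if v < v_low)
    score = 0.4 * min(overloads / max(len(snap.line_loading), 1), 1.0)
    score += 0.4 * min(low_volts / max(len(snap.bus_voltage), 1), 1.0)
    if snap.oscillation_damping < damp_low:  # poorly damped inter-area mode
        score += 0.2
    return min(score, 1.0)

snap = GridSnapshot(
    line_loading={"A-B": 0.97, "B-C": 0.88, "A-C": 0.93},
    bus_voltage={"A": 1.00, "B": 0.93, "C": 0.94},
    oscillation_damping=0.03,
)
s = stress_index(snap)
print(f"stress index {s:.2f}" + (" - ALERT" if s > 0.5 else ""))
```

A real system would work from wide-area synchronized measurements and validated models rather than these toy thresholds, but the principle is the same: surface a composite picture of stress before any single alarm fires.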

While the need for grid reinforcement is clear, and the need for a sound regulatory environment is vital, we must ensure that the core issues of large-blackout risk are addressed. The engineering management and oversight of the grid must provide appropriate, enforceable rules that constrain the system in a way that effectively reduces large-blackout risks. In addition, the best possible tools must be made available to enable system planners and operators to follow these rules effectively.

¹ All times are in Eastern Daylight Time.