Therac-25 and industrial design engineering of socio-technical systems

The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited (AECL), following the Therac-6 and Therac-20 units that AECL had developed with CGR of France. It was involved in at least six known accidents between 1985 and 1987, in which patients were given massive overdoses of radiation, in some cases on the order of hundreds of grays. At least three patients died as a result of the overdoses. These accidents highlighted the dangers of software control of safety-critical systems, and they have become a standard case study in risk engineering.

The Therac-25, a medical linear accelerator, was the successor to AECL's earlier models, the Therac-6 and Therac-20. These machines accelerated electrons to create high-energy beams that destroyed tumors. For shallow tissue, the electron beam was used directly; to reach deeper tissue, the beam was converted into X-rays.

The Therac‐25 was computer controlled from a separate room to protect the operator from any unnecessary doses of radiation. Patients usually came in for a series of low energy radiation treatments to gradually and safely remove any remaining cancerous growth.

The Therac‐25 had two main types of operation:

A low-energy mode, which consisted of an electron beam of 200 rads aimed directly at the patient; and
A high-energy mode, which used the full power of the machine at 25 million electron volts. When this mode was used on patients, a metal plate was inserted between the beam and the patient to transform the beam into X-rays.

Problem description

The machine offered two modes of radiation therapy:

Direct electron-beam therapy, which delivered low doses of high-energy (5 MeV to 25 MeV) electrons over short periods of time;
Megavolt X-ray therapy, which delivered X-rays produced by colliding high-energy (25 MeV) electrons into a “target”.

When operating in direct electron-beam therapy mode, a low-powered electron beam was emitted directly from the machine and then spread to a safe concentration using scanning magnets. When operating in megavolt X-ray mode, the machine was designed to rotate four components into the path of the electron beam: a target, which converted the electron beam into X-rays; a flattening filter, which equalized the X-ray beam intensity; a set of movable blocks (also called a collimator), which shaped the X-ray beam; and an X-ray ion chamber, which measured the strength of the beam.

The accidents occurred when the high-power electron beam was activated for X-ray therapy without the target having been rotated into place. The machine’s software did not detect this condition and therefore did not prevent the patient from receiving a potentially lethal dose of radiation. The high-powered electron beam struck the patients directly, causing the sensation of an intense electric shock as well as thermal and radiation burns. In some cases, the injured patients later died from radiation poisoning.

The Hardware

AECL combined forces with a French company, CGR, and created two linacs before the Therac-25: the Therac-6 and the Therac-20.

The Therac-6 is a 6 million electron volt (6-MeV) accelerator that produces X-rays only; and
the Therac-20 is a 20-MeV accelerator that produces X-rays or electrons.

An electron volt (eV) is the energy gained by an electron moving through a potential difference of 1 volt. Eventually, after the companies ended their partnership, AECL developed the Therac-25.
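To make the units used in this article concrete, the following Python sketch converts the beam energies quoted above from MeV to joules and converts a dose in rads to grays. The function names are illustrative only; the conversion factors (1 eV = 1.602176634e-19 J, 1 Gy = 100 rad) are the standard definitions.

```python
# Exact SI value of the elementary charge in coulombs; 1 eV is this many joules.
ELECTRON_VOLT_IN_JOULES = 1.602176634e-19


def mev_to_joules(mev: float) -> float:
    """Convert an energy in MeV to joules."""
    return mev * 1e6 * ELECTRON_VOLT_IN_JOULES


def rad_to_gray(rad: float) -> float:
    """Convert an absorbed dose in rad to gray (1 Gy = 100 rad)."""
    return rad / 100.0


if __name__ == "__main__":
    for name, mev in [("Therac-6", 6), ("Therac-20", 20), ("Therac-25", 25)]:
        print(f"{name}: {mev} MeV = {mev_to_joules(mev):.3e} J per electron")
    print(f"200 rad = {rad_to_gray(200)} Gy")
```

The energy per electron is tiny; what makes the beam dangerous is the number of electrons delivered, which is why beam current matters so much in the discussion that follows.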

Like the Therac-20, the Therac-25 is a dual-mode machine, but it requires much less space because it has a unique design structure.

The Therac-25 uses two magnets to bend the electron beam through 180 degrees and 270 degrees before it reaches its target. A turntable controls which mode the machine will use by positioning the appropriate elements in the beam path.

When the machine is in electron mode, magnets on the turntable spread the beam to a safe concentration. In electron mode, various levels of energy are available.
In photon mode, a much greater electron-beam current is needed because a “beam flattener” is used to produce a consistent treatment area. Only one level of energy (25-MeV) is available in photon mode. If the beam flattener is not in position, a dangerously high output rate will occur; this is a significant hazard of a dual-mode machine, because it is possible that not all the devices will be lined up properly and a high output could occur.

The turntable also includes a third mode, the field-light position, which uses a light to help position patients correctly. When the machine is in field light position, no mechanism is used to control the beam concentration because no beam is expected. This produces another possible hazard of the machine, in the event that a beam is incorrectly produced.
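The hazard described above amounts to a missing consistency check between the selected beam mode and the sensed turntable position. The Python sketch below shows one way such a check could look. It is not the Therac-25's logic (which was written in assembly language and is not reproduced in this article); every name in it is hypothetical.

```python
from enum import Enum


class TurntablePosition(Enum):
    ELECTRON = "electron"        # scanning magnets in the beam path
    PHOTON = "photon"            # target and beam flattener in the beam path
    FIELD_LIGHT = "field_light"  # positioning light only, no beam expected


class BeamMode(Enum):
    ELECTRON = "electron"
    PHOTON = "photon"


# Hypothetical mapping: the only turntable position considered safe per mode.
REQUIRED_POSITION = {
    BeamMode.ELECTRON: TurntablePosition.ELECTRON,
    BeamMode.PHOTON: TurntablePosition.PHOTON,
}


def beam_permitted(mode: BeamMode, sensed: TurntablePosition) -> bool:
    """Allow the beam only when the sensed position matches the mode."""
    return REQUIRED_POSITION[mode] is sensed


if __name__ == "__main__":
    # Photon-level beam with the turntable in the wrong place must be refused.
    print(beam_permitted(BeamMode.PHOTON, TurntablePosition.ELECTRON))
    print(beam_permitted(BeamMode.PHOTON, TurntablePosition.FIELD_LIGHT))
    print(beam_permitted(BeamMode.PHOTON, TurntablePosition.PHOTON))
```

Because the field-light position never appears as a required position, no beam mode can be enabled while the turntable is there.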

The Therac-25 is enclosed in a radiation treatment room in order to prevent unnecessary radiation exposure to individuals working near the machine. The machine operator has contact with the patient through visual and audio monitors located within the treatment room.

The Software

The design of real-time computing systems is the most challenging and complex task that can be undertaken by a software engineer. By its very nature, software for real-time systems makes demands on analysis, design, and testing techniques that are unknown in other application areas.

The Therac-25’s software was developed from the Therac-20’s software, which was developed from the Therac-6’s software. One programmer, over several years, revised the Therac-6 software into the Therac-25 software. An important difference between the Therac-20 software and the Therac-25 software is the overall role that each plays in the machine. In the Therac-20, the role of software is limited. The software simply adds convenience to the hardware. However, in the Therac-25, software exclusively performs many of the critical safety checks of the system; these safety checks are also included in the hardware of the Therac-20, but were not included in the Therac-25 hardware. The Therac-25 software is responsible for:

monitoring the machine status
accepting input about the treatment
setting the machine up for the treatment
turning on the treatment beam
turning off the treatment beam, either after a successful treatment or under a malfunction
detecting hardware malfunction and delivering diagnostic messages and either a pause or suspend of treatment

The last two responsibilities reveal some of the ways that the software is responsible for the safety of the system.

The Therac-25 runs on a custom-designed real-time operating system. The software has four major components:

stored data,
a scheduler,
a set of critical and non-critical tasks, and
interrupt services. The interrupt services include:
- a treatment console screen interrupt handler, and
- a treatment console keyboard interrupt handler.

The scheduler directs all non-interrupt events and orders simultaneous events. Tasks are divided into critical and non-critical categories. Every 0.1 seconds tasks are initiated and critical tasks are executed first, with non-critical tasks taking up any remaining time. Critical tasks include:

The treatment monitor (Treat) directs and monitors patient setup and treatment
The servo task regulates gun emission, dose rate, symmetry, machine motions, and machine parameters, and does some error handling
The housekeeper task checks setup verification and takes care of system-status interlocks and limit checks

Non-critical tasks include:

Treatment console keyboard processor which acts as the interface between the software and the operator
Treatment console screen processor
Calibration processor which allows the operator to examine and change system setup parameters and interlock limits
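The scheduling policy described above (a 0.1-second cycle in which critical tasks run first and non-critical tasks use whatever time is left) can be sketched in Python as follows. This is a simplified model, not the Therac-25 scheduler: the task names come from the text, but the execution costs and the `plan_frame` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import List

FRAME_MS = 100  # 0.1-second cycle given in the text


@dataclass
class Task:
    name: str
    cost_ms: int    # assumed execution time, purely illustrative
    critical: bool


def plan_frame(tasks: List[Task], frame_ms: int = FRAME_MS) -> List[str]:
    """Return the names of the tasks that run in one frame, in order."""
    ran: List[str] = []
    remaining = frame_ms
    # Critical tasks always run first, regardless of the remaining budget.
    for task in tasks:
        if task.critical:
            ran.append(task.name)
            remaining -= task.cost_ms
    # Non-critical tasks run only if they fit in the time that is left.
    for task in tasks:
        if not task.critical and task.cost_ms <= remaining:
            ran.append(task.name)
            remaining -= task.cost_ms
    return ran


# Task names follow the text; the costs are invented for the example.
DEMO_TASKS = [
    Task("keyboard processor", 10, critical=False),
    Task("treatment monitor", 30, critical=True),
    Task("screen processor", 20, critical=False),
    Task("servo", 30, critical=True),
    Task("calibration processor", 5, critical=False),
    Task("housekeeper", 20, critical=True),
]

if __name__ == "__main__":
    print(plan_frame(DEMO_TASKS))
```

With these invented costs the screen processor is skipped in the frame because only 10 ms remain when its turn comes. Note that the sketch says nothing about how tasks share variables, which is where the actual defects lay.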

The software of the Therac-25 also controls the positioning of the turntable, a possible hazard discussed previously, and checks the position of the turntable so that all necessary devices are in place.

The Therac-25 software also contained several “user-friendly” features. During system testing, operators complained that it took too long to enter the treatment plan, since it had to be done twice: once in the treatment room and a second time at a terminal outside of the room. For convenience, AECL redesigned the software so operators could simply use a set of carriage returns, at the terminal outside the treatment room, to verify the data input within the room. Another “convenient” feature of the Therac-25 involved a “proceed” key. There were two ways that the Therac-25 could shut down:

A treatment suspend, which indicated a serious error and required a complete system restart; or
A treatment pause, which was apparently not as serious and required only a single-key command (the “P” key) to restart the machine, with all treatment specifications remaining intact.

A treatment pause could occur five times before the machine required a complete system restart. With a treatment pause, a simple error message would appear, for example “malfunction” followed by the number of the malfunction. However, the user’s manual gave no indication of what each malfunction number meant.
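The pause and suspend behavior can be modelled as a small state machine. The Python sketch below is a toy model under one stated assumption: the fifth pause turns into a suspend, which matches the account of the second accident later in this article but is only one reading of "five times". The class and function names are hypothetical.

```python
MAX_PAUSES = 5  # from the text: five pauses before a full restart is required


class TreatmentSession:
    """Toy model of the pause/suspend behaviour described above."""

    def __init__(self) -> None:
        self.pauses = 0
        self.state = "ready"

    def malfunction(self, code: int) -> str:
        """Register a malfunction; return the message shown to the operator."""
        if self.state == "suspended":
            return "SUSPENDED: restart required"
        self.pauses += 1
        # Assumption: the fifth pause becomes a suspend.
        self.state = "suspended" if self.pauses >= MAX_PAUSES else "paused"
        return f"MALFUNCTION {code}"  # the number alone, with no explanation

    def proceed(self) -> bool:
        """Model the operator pressing 'P'; only a paused session resumes."""
        if self.state == "paused":
            self.state = "ready"
            return True
        return False


def run_malfunctions(count: int, code: int = 54) -> str:
    """Simulate `count` malfunctions, pressing 'P' after each; return state."""
    session = TreatmentSession()
    for _ in range(count):
        session.malfunction(code)
        session.proceed()
    return session.state


if __name__ == "__main__":
    print(run_malfunctions(4))
    print(run_malfunctions(5))
```

The model makes the usability problem visible: nothing distinguishes a harmless pause from a dangerous one, so pressing “P” is always the path of least resistance.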

Changes from Previous Machines

The Therac‐25 massively overdosed patients at least six times between June 1985 and January 1987. Each overdose was over 100 times the normal therapeutic dose and resulted in the patient’s severe injury or even death. Overdoses occurred primarily because of the bugs in the Therac‐25’s software and because the manufacturer did not follow proper software engineering practices.

The following features of the Therac-25 are relevant to the accidents:

The Therac-25 was designed to be computer controlled. In contrast, the previous machines had separate pieces of machinery and hardware to monitor safety factors.
The software was given both a larger role in operator convenience and more responsibility for controlling safety. The designers believed that they could save time and money in the Therac-25 by using only software safety control.
Some of the old software used in the Therac-6 and Therac-20 was reused in the Therac-25. A bug that was discovered in the Therac-25 was later also found in the Therac-20.

The Accidents

Six accidents involving enormous radiation overdoses to patients took place between 1985 and 1987. In this section I will simply give a brief overview of the accidents.

The first accident occurred at Kennestone Regional Oncology Center in Marietta, Georgia. On June 3, 1985, a 61-year-old woman was receiving follow-up treatment after a malignant tumor was removed from her breast. When the machine was activated, she felt “a tremendous rush of heat…” She told the operator of the Therac-25 “you burned me.” Although she later developed reddening and swelling in the center of the treatment area, AECL denied that the machine had burned the patient. The swelling was attributed to a normal treatment reaction. Eventually, her shoulder froze and she began to experience spasms. She was admitted to the hospital, but her doctors continued to send her for Therac-25 radiation treatments. Eventually the patient’s breast had to be removed, and she completely lost the use of her shoulder and arm.

The second accident occurred at the Ontario Cancer Foundation clinic in Canada. On July 26, 1985, a 40-year-old patient received her 24th Therac-25 treatment. During the treatment, the machine paused and issued an “H-tilt” error message. The operator pushed the “P” button, since the machine indicated that no dose had been delivered to the patient. The machine continued to shut down, and the operator pushed the “P” button each time until the machine suspended after the fifth attempt. Each time the machine indicated that no dose had been given to the patient. The operator of the Therac-25 was used to this type of behavior from the machine and called the technician, who found nothing wrong with the machine. This too was a common situation. The patient, however, complained of an “electric tingling shock” in her hip. Eventually radiation overexposure was suspected and the patient was hospitalized. She died three months later of cancer, but a total hip replacement would have been necessary had she lived.

The third accident involved a woman who developed red parallel stripes on her hip, the treatment area. She was treated at the Yakima Valley Memorial Hospital in 1985. Her doctors continued to order treatments for her even after these stripes appeared. Radiation overexposure was not considered as a cause until over a year later. Eventually, the patient received surgical treatment for minor disability and scarring.

Another Therac-25 accident, the fourth in the series, occurred at the East Texas Cancer Center in March of 1986. A male patient was to receive therapy on his upper back. The Therac-25 operator had typed in incorrect treatment information by indicating X-ray mode instead of electron mode. She used the “cursor up” key to edit the mode entry, then quickly pressed “enter” (one of the user-friendly features) and started treatment. The machine shut down with a treatment pause, and a “Malfunction 54” error message was displayed on the screen. This error message indicated that either a dose too high or a dose too low had been delivered. Since an underdose value appeared on the screen and the operator was used to quirks in the machine, she hit the “P” key to continue with the treatment. The machine repeated the “Malfunction 54” error message and indicated that the same underdose had been delivered. The operator had no contact with the patient, because the usual audio and video monitors were not working properly. After the first attempt at treatment, the patient felt an “electric shock”, or as if “someone had poured hot coffee” on his back. He knew this was not normal and began to get up from the treatment table when the second treatment was delivered. The patient felt a tremendous shock in his arm, and felt that “his hand was leaving his body”. He had to pound on the treatment room door to get the operator’s attention. The patient eventually lost the use of his left arm and both legs, was unable to speak, and had several other complications. He died from complications five months later.

A fifth accident, the second at the East Texas Cancer Center, occurred in April of 1986, just one month later. As in the previous accident, the same operator entered the wrong mode of treatment, quickly edited in the correct mode, and hit a quick series of enter keys. The machine shut down again with a “Malfunction 54” message. This time, however, the intercom was working and the operator heard a loud noise followed by moaning from the patient. The patient was receiving radiation on the side of his face. He died three weeks after the accident, after falling into a coma and suffering severe neurological damage (Leveson and Turner, 1993, p. 28).

The last of the accidents occurred at the Yakima Valley Memorial Hospital. On January 17, 1987 an operator placed a patient on the turntable in the field-light position for small position verification doses. After attempting to administer the treatment dose, the machine shut down with a quick malfunction message and a treatment pause. The operator pushed the “P” button, and the machine paused again. The machine indicated that the patient had received his prescribed 7 rad of treatment. The patient, however, complained of a “burning sensation” and died three months later from complications related to the overdose.

Therac-25: Radiation Accident Summary
Date of accident	Location	Extent of injuries to patient	Months after the first accident
June 3, 1985	Marietta, GA	Breast removal, loss of use of arm	0
July 26, 1985	Ontario, Canada	Total hip replacement needed	1
January 6, 1986	Yakima, WA	Minor disability and scarring	7
March 21, 1986	Tyler, TX	Death	9
April 11, 1986	Tyler, TX	Death	10
January 17, 1987	Yakima, WA	Death	19

Software Bugs

The Therac-25 inherited some functionality from the Therac-20 codebase. The inherited Therac-20 code proved to contain bugs, which had gone undetected because the Therac-20 was designed with hardware safety interlocks. In contrast, the Therac-25 placed more confidence in software and removed the hardware safety mechanisms. There were two main software bugs.

The first bug was a concurrency bug attributed to a race condition.
The second bug also depended on timing: the software could allow the operator to activate the device in an erroneous state without warning.

Race Condition

The first of these errors involved the entry of treatment data by the machine operator. Once an operator enters treatment information at the terminal outside the treatment room, the magnets used to filter and control radiation levels are set. There are several magnets, and the process takes about 8 seconds. If the operator changes the treatment information very quickly, within 1 second, the change is registered. If the operator is slower and takes more than 8 seconds, the change is also registered. However, if the change occurs later within the 8 seconds it takes to set the magnets, the change is not detected; the magnets continue to be set for the original entry, and thus the level of radiation is set improperly. This is the main hazard of a dual-mode system, and it is what happened in the fourth and fifth accidents. Once the magnets are set, no test is performed to confirm that the treatment information entered matches how the magnets are set. Another variable, which controls whether photon or electron mode is to be used, does detect the operator edit and sets the mode to the edited mode. As mentioned earlier, photon mode requires a much higher electron-beam current than electron mode to produce the same output. Therefore, if the beam is set for photon mode but the turntable is set up for electron mode, a radiation overdose occurs. Here is the detailed process:

The operator entered all of the prescription data and confirmed it by moving the cursor to the command line. If the operator then noticed a mistake and corrected it by moving the cursor off the command line, all within a specific short period of time, a race condition could occur.
The task called “Treat”, which contains several subroutines, controls the various phases of treatment. A variable called “Tphase” indicates which subroutine Treat should execute next. After one of these subroutines has executed, Treat reschedules itself.
One of Treat’s subroutines, called “Datent”, communicates with the concurrently running keyboard handler task through a shared variable called Data Entry Complete.
The keyboard handler recognizes when data entry is complete and changes Data Entry Complete. Treat detects the change in Data Entry Complete and sets the Tphase variable from 1 (Datent) to 3 (Set Up Test).
Upon entry into Datent, the Mode/Energy Offset variable is checked. If it has been set, the subroutine uses the high-order byte to index into a table of preset operating parameters. When all of the parameters are set, the “Magnet” subroutine is executed, which is responsible for setting the bending magnets.
It takes Magnet approximately 8 seconds to set the bending magnets. A subroutine called “Ptime”, called by Magnet, introduces the time delay. Since several magnets need to be set, Ptime is called several times. A flag indicating that the bending magnets are being set is initialized upon entry to Magnet and cleared at the end of Ptime. Ptime also checks a shared variable, set by the keyboard handler, that records whether the operator has made any edits. If there are edits, Ptime clears the bending magnet flag and exits to Magnet, which then exits to Datent. But Ptime checks the edit variable only if the bending magnet flag is set.
Because Ptime clears the bending magnet flag at the end of its first execution, any edits performed during each succeeding pass through Ptime will not be recognized.

Thus, an edit change in the mode or energy will not be detected by Datent and the incorrect treatment parameters will be used.
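The flag logic in the steps above can be replayed deterministically. The following Python sketch is a simplified model of the description in this article, not the original assembly code: it treats each call to Ptime as one pass, lets the caller choose the pass during which the operator's edit arrives, and reports whether the edit is noticed. The function name, the parameters, and the `fixed` variant (which keeps the flag set until all magnets are done) are illustrative assumptions.

```python
from typing import Optional


def run_magnet(num_magnets: int,
               edit_during_pass: Optional[int],
               fixed: bool = False) -> bool:
    """Return True if an operator edit is noticed while magnets are set.

    edit_during_pass is the 1-based Ptime pass in which the keyboard
    handler records an edit, or None if the operator makes no edit.
    """
    bending_magnet_flag = True   # set on entry to Magnet
    edit_pending = False         # shared variable written by the keyboard handler
    for pass_number in range(1, num_magnets + 1):
        # One call to Ptime: the delay for a single magnet.
        if edit_during_pass == pass_number:
            edit_pending = True  # operator edits while Ptime is waiting
        if bending_magnet_flag and edit_pending:
            return True          # edit seen: exit back to Datent
        if not fixed:
            # The flaw: the flag is cleared at the end of every Ptime call,
            # so later passes never look at edit_pending.
            bending_magnet_flag = False
    # In the fixed variant the flag would be cleared only here.
    return False


if __name__ == "__main__":
    print(run_magnet(3, edit_during_pass=1))              # early edit
    print(run_magnet(3, edit_during_pass=2))              # edit in later pass
    print(run_magnet(3, edit_during_pass=2, fixed=True))  # same edit, fixed
```

An edit during the first pass is caught, while the same edit one pass later is silently lost, which is the window the fast-typing operator in Tyler fell into.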

Erroneous State

The second of the software errors, which caused the sixth and possibly other accidents, involved the nature of the real-time system. When the turntable is in test mode, a variable called Class3 is set to a non-zero value. As long as the operator is testing the position of the light beam, the variable increments. Once testing procedures are complete, the variable is set to zero and the radiation beam is allowed to pass. Class3, however, was stored in one byte of memory. As a result, every 256th increment results in the value of zero being assigned to it. In the sixth accident, the operator pushed the set button at the exact moment that Class3 rolled over to zero. As a result, a full prescription beam was released without any of the beam flatteners in place. Here is the detailed process:

After the prescription data is entered and verified by Datent, Tphase is changed and Set Up Test is entered. Every pass through Set Up Test increments a variable called “Class3”, which acts as the upper collimator position check. If Class3 is non‐zero, there is an inconsistency and the treatment should not proceed. A zero value for Class3 indicates that the relevant parameters are consistent with treatment, and the software does not inhibit the beam.
After setting the Class3 variable, Set Up Test next checks for any malfunctions in the system by checking another shared variable called F$mal to see if it has a non‐zero value. A non‐zero value for this variable indicates that the machine is not ready for treatment, and Set Up Test is rescheduled. When the value is zero, the Set Up Test sets Tphase to 2 which schedules the Set Up Done subroutine and the treatment is allowed to continue.
The upper collimator position check is performed by a subroutine of Hkeper called LmtChk. LmtChk first checks the Class3 variable. If Class3 contains a non‐zero value, LmtChk calls the Check Collimator (ChkCol) subroutine. If Class3 is zero, ChkCol is bypassed and the upper collimator position check is not performed. The ChkCol subroutine sets or resets bit 9 of the F$mal shared variable depending on the position of the upper collimator, which in turn is checked by the Set Up Test subroutine of Treat to decide whether to reschedule itself or to proceed to Set Up Done.
During machine setup, Set Up Test will be executed several hundred times because it reschedules itself waiting for other events to occur. In the code, the Class3 variable is incremented by one in each pass through Set Up Test. Since the Class3 variable is one byte, it can only contain a maximum value of 255. Thus, on every 256th pass through Set Up Test, the variable will overflow and have a zero value. This means that when the variable overflows to zero, the upper collimator will not be checked and an upper collimator fault will not be detected.

The problem occurred when the operator hit the “set” button at the precise moment that Class3 rolled over to zero.
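The overflow is easy to reproduce. The Python sketch below models Class3 as an unsigned byte and lists the passes through Set Up Test on which it reads zero, that is, the passes on which the collimator check would be skipped. The function is an illustration written for this article; the `increment=False` branch shows one possible remedy, assigning a fixed non-zero value instead of counting.

```python
from typing import List


def passes_with_check_bypassed(total_passes: int,
                               increment: bool = True) -> List[int]:
    """Return the Set Up Test passes on which Class3 reads zero.

    Class3 is modelled as an unsigned byte, so it wraps modulo 256.
    """
    class3 = 0
    bypassed: List[int] = []
    for pass_number in range(1, total_passes + 1):
        if increment:
            class3 = (class3 + 1) & 0xFF  # one-byte counter: 255 + 1 wraps to 0
        else:
            class3 = 1                    # alternative: fixed non-zero value
        if class3 == 0:
            bypassed.append(pass_number)  # collimator check skipped here
    return bypassed


if __name__ == "__main__":
    print(passes_with_check_bypassed(600))
    print(passes_with_check_bypassed(600, increment=False))
```

With several hundred passes during setup, the counter crosses zero at passes 256 and 512, so the unsafe window opens more than once in an ordinary setup.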

Causes of the Failure

The Therac-25 incidents demonstrate that several failings in the manufacturer’s attitude and practices led to the accidents:

Overconfidence in the software’s abilities
Unreasonably low risk assessments
Failure to properly assess the old software when reusing it for new machinery
Poorly designed error and warning messages
Failure to fix or even understand the frequently recurring problems
Failure to install proper hardware to catch safety glitches
Refusal by the manufacturer to believe that the machine could fail
Lack of communication and organization between hospitals, government and manufacturer

The researchers also found several engineering issues:

The design did not have any hardware interlocks to prevent the electron-beam from operating in its high-energy mode without the target in place.
The engineer had reused software from older models. These models had hardware interlocks that masked their software defects. Those hardware safeties had no way of reporting that they had been triggered, so there was no indication of the existence of faulty software commands.
The hardware provided no way for the software to verify that sensors were working correctly. The table-position system was the first implicated in Therac-25’s failures; the manufacturer gave it redundant switches to cross-check their operation.
The equipment control task did not properly synchronize with the operator interface task, so that race conditions occurred if the operator changed the setup too quickly. This was evidently missed during testing, since it took some practice before operators were able to work quickly enough for the problem to occur.
The software set a flag variable by incrementing it. Occasionally an arithmetic overflow occurred, causing the software to bypass safety checks.
The software was written in assembly language. While this was more common at the time than it is today, assembly language is harder to debug than most high-level languages.
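Two of the issues above, interlocks that could not report when they had tripped and sensors that could not be cross-checked, can be illustrated together. The Python sketch below is a hypothetical design, not a description of any AECL hardware or software: it permits the beam only when two redundant position switches agree with each other and with the commanded position, and it records every refusal.

```python
from typing import List, Optional


def cross_checked_position(switch_a: int, switch_b: int) -> Optional[int]:
    """Return the turntable position only if both redundant switches agree."""
    return switch_a if switch_a == switch_b else None


class ReportingInterlock:
    """Interlock that blocks the beam and also records every time it trips."""

    def __init__(self) -> None:
        self.trip_log: List[str] = []

    def permit_beam(self, commanded: int, switch_a: int, switch_b: int) -> bool:
        sensed = cross_checked_position(switch_a, switch_b)
        if sensed is None:
            self.trip_log.append(
                f"sensor disagreement: {switch_a} vs {switch_b}")
            return False
        if sensed != commanded:
            self.trip_log.append(
                f"position {sensed} does not match commanded {commanded}")
            return False
        return True


if __name__ == "__main__":
    interlock = ReportingInterlock()
    print(interlock.permit_beam(commanded=2, switch_a=2, switch_b=2))
    print(interlock.permit_beam(commanded=2, switch_a=2, switch_b=1))
    print(interlock.permit_beam(commanded=2, switch_a=1, switch_b=1))
    print(interlock.trip_log)
```

The log is the important part: an interlock that silently absorbs a faulty command, as on the earlier machines, hides the very defects that later became fatal.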

Industrial Design Engineering: Safety is a Socio-Technical System Property

The safety of the Therac-25 is not really a property of the machine alone. Accidents that go unreported contribute to (or at least fail to stop) later accidents. When the TV camera in the room is unplugged, the operator cannot see that the patient is in trouble. So safety is really a property of the entire technical and social system (socio-technical system).

The Therac-25 Medical Linear Accelerator is a large machine that sits in a room designed just for it. We may think of the machine itself or the machine-in-the-room as the system. But the larger system, or the Socio-Technical system, that we need to think about includes:

Hardware: The mechanics of the machine itself, including its associated computer
Software: the operating system of the computer and the operating system of the machine
Physical surroundings: the room with its shielding, cameras, locking doors, etc.
People: operators, medical physicists, doctors, engineers, salespeople, managers at AECL, regulators
Institutions: AECL, regulators, each medical facility, associations of operators, etc.
Procedures
- Management models: AECL’s model of how risk is managed
- Reporting relationships: who was required to report accidents to whom
- Documentation requirements: for the software, for the facilities, for the government regulators
- Data flow: how different parts of AECL shared information, how information was shared among agencies and organizations, how data was used by the Therac software.
- Rules & norms: what patients are “normally” told, what operator & physicist responsibilities are, expectations set for the programmer
Laws and regulations: reporting requirements, regulators’ enforcement mechanisms, medical liability law
Data: what data was collected in the regulators’ approval process, how data was used in the Therac software

The following table presents some of these items in a schematic form.

Therac-25: The Socio-Technical System

Component	Elements
The Machine	Supporting systems (video, audio, etc.); hardware; software systems
Hospitals and Clinics	Doctors, medical physicists; management, user groups; operators, reporting procedures
Atomic Energy of Canada Limited	Management, reporting procedures; design teams, sales staff, support and field engineers
Medical Device Regulation	Regulators; reporting procedures

When addressing complex societal problems, Industrial Design Engineering has been recognized in literature as a solution likely to achieve better and more sustainable results than a traditional product design approach. Industrial Design Engineering considers different system hierarchies within a particular socio-technical system. A socio-technical system is a number of clustered elements, such as technology, policies, user practices, markets, culture and infrastructure, which are linked together to attain a specific functionality in a system. By broadening the scope and complexity of design practice, Industrial Design Engineering increases the capacity of the (socio-technical) system to address its function.

Industrial Design Engineering recognizes that solely (re)designing products to be affordable is not enough to guarantee their adoption, and thus not enough for the system to fully accomplish its function. Instead, it designs a coherent combination of processes and products that together fulfill the function of the system. For designers this means handling a larger degree of complexity and making more sustainable change by considering value creation over a long-term timeframe and involving a larger network of stakeholders. Relying only on existing product development knowledge (i.e. methods, tools and techniques) restricts the design process and makes it harder to understand the local context. Industrial Design Engineering encourages designers to consider aspects beyond technology, related to business, lifecycle and stakeholders’ motivations. Because designers and researchers are typically educated to apply traditional product development, universities increasingly gain relevance in this novel innovation network as essential partners for system change.

See More

https://www.crcpress.com/Industrial-Design-Engineering-Inventive-Problem-Solving/Wang/p/book/9781498709590