ME: 8259A Bug - fordsfords/fordsfords.github.io GitHub Wiki
This is just a geezer's "war story" from the misty past. I'll say it was 1987, +/- 2 years (it was definitely prior to Siemens moving to Hoffman Estates in 1989).
The product was "Digitron", an X-ray system based on a Multibus II backplane and an 8086 (or was it 80286?) CPU board running iRMX/86. I think the CPU board was off-the-shelf from Intel, but most of the rest of the boards were custom, designed in-house. The CPU board used an 8259A interrupt controller chip (datasheet here or here) that supported 8 vectored interrupts. There was a specific hardware handshake between the 8259A and the CPU chips that let the 8259A tell the CPU the interrupt vector. The interrupt line had to be asserted and then held active while the handshake took place. The ISR (Interrupt Service Routine - low-level software interrupt handler) would typically interact with the interrupting hardware to release the asserted interrupt prior to returning from the ISR.
At some point, we discovered there was a problem. We got frequent "spurious interrupts" to interrupt level 7. I *think* these spurious interrupts degraded system performance to the degree that sometimes the CPU couldn't keep up with its work, resulting in a system failure. But I'm not sure -- maybe they just didn't like having the mysterious interrupts. At any rate, I was either assigned to it (probable), or I just looked into it on my own (not sure, but I doubt it).
According to the 8259A datasheet, spurious interrupts are the result of an interrupt line being asserted, but then released before the 8259A could complete the proper handshake. Essentially the chip wasn't smart enough to remember which interrupt line was asserted if it went away too quickly. So the 8259A just defaults to level 7.
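A toy Python model may make that datasheet rule concrete. The function name and the list-of-booleans view of the eight request inputs are my own assumptions, and the model ignores masking, priority rotation, and the edge/level mode setting. It is only a sketch of the "default to level 7" behavior, not a description of the chip's internals.

```python
def acknowledge(ir_lines_at_inta):
    """Return (level, spurious) for a simplified 8259A at acknowledge time.

    ir_lines_at_inta: 8 booleans, True if that request input is still
    asserted when the CPU acknowledges. Fixed priority is assumed,
    with input 0 the highest.
    """
    if len(ir_lines_at_inta) != 8:
        raise ValueError("expected exactly 8 request inputs")
    for level, asserted in enumerate(ir_lines_at_inta):
        if asserted:
            return level, False  # a real request is still present
    # Nothing is asserted any more, so the chip reports level 7 by default.
    return 7, True
```

In this model, a request that disappears before the acknowledge is indistinguishable from "nothing asserted", which is why every too-short pulse shows up as level 7 no matter which line it arrived on.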
I don't remember how I narrowed it down, but I somehow identified the peripheral board that was responsible.
For most of the peripheral boards, I believe there was a single source of interrupt, which used an interrupt line on the Multibus. But there was one custom board (don't remember which one) where they wanted multiple sources of interrupt, so the hardware designer included an 8259A on that board. Ideally, it would have been wired to the CPU board's 8259A in its cascade arrangement, but the Multibus didn't allow for that. So they left it to the software. An ISR for the board's interrupt level would read the status of the peripheral 8259A to determine which of the board's interrupts had fired. The ISR would then call the correct handler for that "sub-interrupt".
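The software side of that arrangement can be sketched roughly as follows. I don't have the original code, so this is a Python model under stated assumptions: `write_port`, `read_port`, and `handlers` are hypothetical stand-ins for the board's I/O accesses and handler table, and I am assuming the ISR used the 8259A's poll command (OCW3 with the P bit set, 0x0C) followed by a non-specific end-of-interrupt (OCW2, 0x20). The poll word layout (bit 7 = request pending, bits 2..0 = level) follows the 8259A datasheet.

```python
POLL_COMMAND = 0x0C      # OCW3 with the P (poll) bit set
NONSPECIFIC_EOI = 0x20   # OCW2 non-specific end-of-interrupt (assumed clear step)


def decode_poll_word(poll_word):
    """Split the poll byte into (pending, level)."""
    pending = bool(poll_word & 0x80)  # bit 7: a request is pending
    level = poll_word & 0x07          # bits 2..0: highest-priority level
    return pending, level


def board_isr(write_port, read_port, handlers):
    """Poll the peripheral 8259A once and dispatch to the sub-interrupt handler.

    Returns the level that was serviced, or None if nothing was pending.
    """
    write_port(POLL_COMMAND)                         # ask the chip for a poll word
    pending, level = decode_poll_word(read_port())   # the status read
    if not pending:
        return None
    handlers[level]()                                # the "sub-interrupt" handler
    write_port(NONSPECIFIC_EOI)                      # the write that clears it
    return level
```

The status read in the middle of `board_isr` is the bus cycle that matters for the rest of this story.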
Using an analog storage scope, I was able to prove that the peripheral board's 8259A did something wrong when used in its polled mode. The peripheral board's 8259A asserted the interrupt level, which led to the CPU board properly decoding the interrupt level and invoking the ISR. The ISR then performed the polling sequence, which consisted of reading the status and then writing something to clear the interrupt. However, the scope showed that during the status read operation, while the Multibus read line was asserted, the 8259A released its interrupt output. When the read completed, the 8259A re-asserted its interrupt. This "glitch" informed the CPU board's 8259A that there was another interrupt starting. Then, when the ISR cleared the interrupt, the 8259A again released its interrupt. But from the CPU board's 8259A's point of view, that "second" interrupt was not asserted long enough for it to handshake with the CPU, so it was treated as a spurious interrupt.
(Aside: I use the word "glitch" to describe the behavior, but that's not the right terminology. A glitch is typically caused by a hardware race condition and would have zero width if all hardware had zero propagation delay. This wasn't a glitch because the release and re-assert of the interrupt line was tied to the bus read line. But the behavior resembles a glitch, so I'll keep using the word.)
I designed a simple workaround that consisted of a chip (I think it was a triple 3-input NAND gate, possibly open collector). The interrupt line was active low, so by driving it with an AND gate, it was possible to force it active (low). I glued the chip upside-down onto the CPU board and wire-wrapped directly to the pins. One NAND gate was used as an inverter to make another NAND gate into an AND circuit. One input to the resulting AND was driven by the interrupt line from the Multibus, and the other input was driven by an output line from a PIO chip that the CPU board came with but that wasn't being used. I assume I had to cut at least one trace and solder wire-wrap wire to pads.
The PIO output bit was normally kept high so that when the peripheral board asserted an interrupt, the interrupt was delivered to the CPU. When the ISR started executing, the code wrote a "0" to the PIO bit, which forced the AND output to stay low. Then the 8259A was polled, which glitched the Multibus interrupt line, but the AND gate kept the interrupt active, masking the glitch. Then the ISR wrote a "1" to the PIO and cleared the interrupt, which released the Multibus interrupt line. No more spurious interrupt.
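The gate logic and the masking sequence can be checked with a small Python model. Everything here is a sketch: the function names are mine, the unused NAND inputs are assumed to be tied high, and `TIMELINE` is a hypothetical step-by-step rendering of the sequence described above (one sample per step), not a capture from the scope.

```python
def nand3(a, b, c):
    """One gate of a triple 3-input NAND package (inputs and output are 0 or 1)."""
    return 0 if (a and b and c) else 1


def masked_irq(bus_irq_n, pio_bit):
    """AND of the active-low bus interrupt line and the PIO bit, from two NANDs."""
    nand_out = nand3(bus_irq_n, pio_bit, 1)  # NAND of the two signals
    return nand3(nand_out, 1, 1)             # second gate as an inverter -> AND


def falling_edges(samples):
    """Count high-to-low transitions, i.e. apparent new interrupt requests."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev == 1 and cur == 0)


# (bus_irq_n, pio_bit) per step: idle, board asserts, ISR writes PIO=0,
# status read lets the line go high, read ends and line goes low again,
# ISR writes PIO=1, interrupt cleared and line released.
TIMELINE = [(1, 1), (0, 1), (0, 0), (1, 0), (0, 0), (0, 1), (1, 1)]

raw = [irq for irq, _ in TIMELINE]
masked = [masked_irq(irq, pio) for irq, pio in TIMELINE]
print("raw   :", raw, "falling edges:", falling_edges(raw))
print("masked:", masked, "falling edges:", falling_edges(masked))
```

In this model the raw line shows two falling edges (the real request plus the one created by the status read), while the gated line shows only one, which is the whole point of holding the PIO bit low for the duration of the poll.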
Kludge? Hell yes! And a hardware engineer assigned to the problem figuratively patted me on the head and said they would devise a "proper" solution to the spurious interrupt problem. After several weeks, that "proper" solution consisted of using a wire-wrap socket with its pins bent upwards so that instead of wire-wrapping directly to the chip's pins, they wire-wrapped to proper posts.
Back in those days, digital cameras were not common consumer items, so I have no copy of the picture I took of the glitch. And I'm not confident that all the details above are remembered correctly. (E.g. I kind of remember it was a NOR gate, but that doesn't make logical sense. Unless maybe I used all 3 gates and Boolean algebra to make an AND out of NOR gates? I don't remember. But for sure the point was to mask the glitch during the execution of the ISR.)