Spurious and Spuriouser

An ass-biting from the ARM7-TDMI Spurious Interrupt

 

Recently I was tasked with updating an old (>10 years) embedded design (hereafter “The Device”) based on the NXP LPC2210, an ARM7-TDMI processor. This was a project I more or less finished, but the product was cancelled after less than a year, and the design shelved.

With the development of a new sensor, it was decided that the project be revived as a testbed, while a new version of the product based on a current ST Micro ARM Cortex processor was designed.

I set to work, getting an update of the original Windows IDE I used, rebuilding the code, and implementing the modifications required.

Did I mention that this thing was the first ARM7 device I had ever worked with?  Subsequent to it, I did several more products using the LPC2210, LPC2220, and NXP’s ARM9, the LPC3000, but this one had remained frozen in time, the code untouched.

My janitorial instincts kicked in, and I started going through the code, cleaning up some design tradeoffs / kludges I had left behind.  This included changing the keypad handling from polled to interrupt-driven.  I can’t remember why I hadn’t done it that way in the first place, but I didn’t like the erratic response of the polled method, so ripped and replaced it.

Another aspect of the device was a proprietary communications protocol running over “two-wire”, or half-duplex RS‑485.  This talked to a PC application that was also pretty long in the tooth (it was originally written for Windows NT, I think).  With some tinkering, I got that package running more or less successfully on Windows 11, and started using it to hammer on the device, requesting and displaying various measurements.

After a few hundred, or a few thousand transactions, the PC would report timeouts from the device, which had apparently stopped responding to data requests.

Now, the RS‑485 interface is serial, and I wasn’t running it especially fast, 38.4 kbps, but half-duplex means that only one device may transmit at a time.  RS‑485 transceivers typically have a Transmit Enable/Driver Enable (TxE or DE) pin, which turns on the line driver that level-shifts the bits from the micro’s Transmit (Tx) pin onto the wire.  When the micro wants to send bytes, it must first drive TxE to the enable state, then send the data.  Only after the last stop bit of the last byte has been sent does the micro disable the driver, allowing bytes to be received again.

In the device, the TxE was tied to a standard GPIO pin, so my code would put the message to transmit into a buffer, drive the GPIO pin high, and write the first byte of the message directly to the on-chip UART’s Transmit Holding Register (THR).
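That transmit-start sequence can be sketched in a few lines of C. Register names and addresses follow the NXP LPC22xx user manual; the function name, buffer, and the choice of P0.8 as the TxE pin are hypothetical stand-ins for the real driver.

```c
/* Sketch of kicking off a half-duplex RS-485 transmission on the LPC2210.
   Register addresses are from the LPC22xx user manual; the TxE pin
   assignment (P0.8) and the buffer layout are hypothetical. */
#define IO0SET    (*(volatile unsigned long *)0xE0028004)
#define U0THR     (*(volatile unsigned char *)0xE000C000)
#define RS485_TXE (1UL << 8)   /* hypothetical GPIO driving the DE pin */

static unsigned char tx_buf[64];
static volatile int  tx_len, tx_pos;

void start_transmit(const unsigned char *msg, int len)
{
    int i;
    for (i = 0; i < len; i++)   /* copy the message into the Tx buffer */
        tx_buf[i] = msg[i];
    tx_len = len;
    tx_pos = 1;                 /* the first byte goes straight to the THR */

    IO0SET = RS485_TXE;         /* enable the RS-485 line driver first */
    U0THR  = tx_buf[0];         /* prime the UART; THRE interrupts do the rest */
}
```

From here the UART hardware and the THRE interrupt carry the rest of the message out, as described below.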

Now, the LPC2210 has two UARTs, which are pretty much identical in operation to the 16550A UART, which was descended from the 8250 UART used in the original IBM-PC.  The 16550A has both Receive and Transmit FIFOs, on-chip buffers that can hold up to sixteen bytes of data received or waiting to be transmitted. The UART can be configured to generate an interrupt when 1, 4, 8, or 14 characters are waiting in the Receive FIFO. This allows operation at higher speeds with much less risk of losing characters due to overrun.
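Selecting the Rx FIFO trigger level is a one-register affair. The U0FCR bit layout here is from the LPC22xx user manual; the function name is a placeholder.

```c
/* Sketch of enabling the UART0 FIFOs with an 8-character Rx trigger on
   the LPC2210 (U0FCR bit layout from the LPC22xx user manual). */
#define U0FCR (*(volatile unsigned char *)0xE000C008)

void uart0_fifo_init(void)
{
    /* bit 0: enable FIFOs; bits 1-2: reset Rx/Tx FIFOs;
       bits 7:6 = 10b: interrupt when 8 characters are waiting */
    U0FCR = 0x01 | 0x02 | 0x04 | (2u << 6);
}
```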

Once that first byte of the message is written to the THR, the UART begins clocking the bits out at the baud rate it is configured for. After the last bit is sent, the UART checks the Transmit FIFO for any characters waiting. If there are any, the UART grabs the First In (the “FI” in FIFO), places it in the THR, and the cycle continues. If there are none, the UART asserts a Transmit Holding Register Empty (THRE) interrupt.

NXP’s LPC2xxx family pairs the ARM7TDMI core with a useful peripheral, the Vectored Interrupt Controller (VIC).  This allows the programmer to assign an interrupt source, such as a GPIO pin or a UART, to a particular Vector, or slot. Then the address of an Interrupt Service Routine (ISR) or handler is assigned to the same vector. If enabled, when an interrupt is asserted, the processor saves the contents of the most critical registers and jumps to the address of the associated ISR. The ISR code does whatever it is supposed to, writes a dummy value to a VIC register, VICVectAddr, which resets the VIC hardware for the next interrupt, and executes the return-from-interrupt sequence (the ARM7 has no dedicated RTI instruction; the return is a special instruction such as SUBS PC, LR, #4), which causes the processor to restore the register contents and continue execution where it left off.
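Hooking a source into the VIC looks roughly like the following. The register addresses and UART0’s channel number (6) are from the LPC22xx user manual; the handler names and the choice of slot 0 are my placeholders.

```c
/* Sketch of assigning UART0 to VIC slot 0 on the LPC22xx.  Addresses and
   the UART0 channel number are from the user manual; handler names and
   the slot choice are placeholders. */
#define VICIntEnable   (*(volatile unsigned long *)0xFFFFF010)
#define VICDefVectAddr (*(volatile unsigned long *)0xFFFFF034)
#define VICVectAddr0   (*(volatile unsigned long *)0xFFFFF100)
#define VICVectCntl0   (*(volatile unsigned long *)0xFFFFF200)
#define UART0_CHANNEL  6

void uart0_isr(void);   /* __attribute__((interrupt("IRQ"))) with arm-none-eabi-gcc */
void default_isr(void); /* likewise */

void vic_init(void)
{
    VICVectAddr0   = (unsigned long)uart0_isr;   /* slot 0 -> UART0 handler */
    VICVectCntl0   = 0x20 | UART0_CHANNEL;       /* bit 5 enables the slot */
    VICDefVectAddr = (unsigned long)default_isr; /* catch-all for unvectored IRQs */
    VICIntEnable   = 1UL << UART0_CHANNEL;       /* unmask UART0 in the VIC */
}
```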

The VIC also prioritizes interrupts.  If two or more sources try to interrupt at the same time, the interrupt at the lowest numbered vector gets serviced.

Now, the UART can generate an interrupt for several reasons, but it can only trigger the VIC with a single vector/handler. So it’s left to the ISR to read the Interrupt Identification Register of the UART to find the reason, then take the appropriate steps. When my ISR sees the ID is for THRE, it writes as many bytes from the message buffer to the THR as the Tx FIFO can hold, and returns from the interrupt.

After the last byte of the message has been sent, when the next THRE interrupt occurs, the ISR checks that the last byte has cleared the UART, turns off the TxE pin of the RS‑485 line driver, and returns.

Receiving a message is similar. A command from the PC comes into the RS‑485 transceiver, which is in the receive state (transmitter NOT enabled) and whose Receive (Rx) pin is tied to the processor’s UART Rx pin. Each character is detected by the UART and placed in the Rx FIFO, which has been configured to interrupt when it reaches 8 bytes.  The same ISR gets called; when it checks the ID of the interrupt and sees that it is due to Received Data Available (RDA), it copies the bytes into a received message buffer until there are none waiting, sets a “message ready” flag, then returns from the interrupt.
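Putting the transmit and receive sides together, the single UART ISR dispatches on the Interrupt ID field of U0IIR (bits 3:1). This is a simplified sketch, not the device’s actual code: register names and bit positions are from the LPC22xx user manual, while the buffers, the rx_store() helper, and the TxE pin are hypothetical. It also handles the Character Timeout (CTI) case, which is explained further on.

```c
/* Sketch of a single UART0 ISR dispatching on the U0IIR ID field.
   Register layouts per the LPC22xx user manual; driver state is mocked. */
#define U0RBR       (*(volatile unsigned char *)0xE000C000)
#define U0THR       (*(volatile unsigned char *)0xE000C000)
#define U0IIR       (*(volatile unsigned char *)0xE000C008)
#define U0LSR       (*(volatile unsigned char *)0xE000C014)
#define IO0CLR      (*(volatile unsigned long *)0xE002800C)
#define VICVectAddr (*(volatile unsigned long *)0xFFFFF030)
#define RS485_TXE   (1UL << 8)   /* hypothetical GPIO driving the DE pin */

enum { ID_THRE = 1, ID_RDA = 2, ID_CTI = 6 };   /* U0IIR bits 3:1 */

/* Pure helper: extract the interrupt ID field from a raw U0IIR value. */
unsigned uart_int_id(unsigned iir) { return (iir >> 1) & 0x7; }

static unsigned char tx_buf[64], rx_buf[64];
static int tx_len, tx_pos, rx_len;
static void rx_store(unsigned char c) { if (rx_len < 64) rx_buf[rx_len++] = c; }

void uart0_isr(void)   /* __attribute__((interrupt("IRQ"))) in the real build */
{
    switch (uart_int_id(U0IIR)) {
    case ID_RDA:                     /* Rx FIFO hit its trigger level... */
    case ID_CTI:                     /* ...or bytes sat unread too long */
        while (U0LSR & 0x01)         /* LSR bit 0: received data ready */
            rx_store(U0RBR);
        break;
    case ID_THRE:                    /* Tx FIFO has gone empty */
        if (tx_pos < tx_len) {
            int room = 16;           /* an empty 16550-style FIFO takes 16 bytes */
            while (tx_pos < tx_len && room--)
                U0THR = tx_buf[tx_pos++];
        } else if (U0LSR & 0x40) {   /* LSR bit 6 (TEMT): last stop bit sent */
            IO0CLR = RS485_TXE;      /* drop the RS-485 driver enable */
        }
        break;
    }
    VICVectAddr = 0;                 /* reset the VIC priority hardware */
}
```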

So far, so good.  It all works. I get a message from the PC into my receive buffer, I process the message, put the response into the transmit buffer, and send it off.

Until it would stop responding to messages.

I combed through the code.  Running the device with a debugger and a JTAG probe showed the rest of the code stalling, waiting for that message ready flag. I had written a timeout test in case that happened, which would wait a predetermined amount of time for the complete message to be received, and abandon the message if it took too long.  So the device continued to operate, but wouldn’t communicate.

I checked the status of the UART.  Everything looked normal, but the UART was not receiving characters.

In desperation, I tried something. To help users properly connect and test the RS‑485 connection, I provided a “Test” function which, when selected from a menu, would transmit in ASCII the name and ID of the device, along with the baud rate it was set for.  This allowed the connection and settings to be checked with Tera Term, PuTTY, or even HyperTerminal, back when that was a thing.

The PC application, after timing out, would automatically retry after an excruciating 15 seconds.  It would then try a command repeatedly, time out again, wait 15 seconds, etc.

I selected the Test menu item.  The PC timeout counted down to zero. The next retry succeeded, and everything worked great, until it would eventually break again after a few minutes or tens of minutes. I could restore comms with the test function, but it would always fail soon after.

What was going on?  The Test function uses the exact same functions to send its message as the PC interface code does.

Now, the PC interface is a strict Client-Server relationship.  The PC is the client, the device is the server, and the server doesn’t send anything unless the client asks for it. Once the device went deaf, it would never transmit again, until I pressed the Test button.

I put an oscilloscope onto the RS‑485 lines. I could tell when either the PC or the device was transmitting because they did so at different voltage levels.  On the PC side, I was using an RS‑232 ↔ RS‑485 converter that was externally powered, while the device transceiver only had a 3.3V rail to operate from, so the signals differed greatly in amplitude. I could see each request and response. In quiet moments, the signal would return close to zero volts, which is normal for RS‑485 when all devices are listening. But when the PC client eventually timed out, I noticed that the flat line was at approximately 3 volts. The device’s RS‑485 transmitter was ON. Every 15 seconds I would see the PC client fruitlessly try again, but the device was stuck in transmit, so would never hear the message.

When I pressed Test, the code would turn the transmitter on (which it already was), send the test message, then turn the transmitter OFF. It was clear from the scope. The commands and responses would resume.

So WHY was the transmitter getting stuck on?

Now, I had several ISRs assigned in the VIC. One handled the UART, as described.  Another was connected to a hardware timer, which would generate an interrupt every 10 milliseconds. That ISR would increment a counter, which served as the system elapsed time.

The keypad was a small array of six switches in two columns and three rows, wired to more GPIO pins. A falling edge on either column would trigger an interrupt, which would then scan the row inputs to determine which key had been pressed.

Another ISR handled sending and receiving SPI data used with an LCD, an EEPROM, and an A/D converter.

The system timer and the keypad appeared to be functioning correctly even after the UART hang, so the VIC had to be working.

Now, there’s one last interrupt handler: the default handler.  The VIC has a dedicated Default Vector Address register (VICDefVectAddr). As described in the user manual for the LPC21xx and LPC22xx, this register holds the address of the ISR for non-vectored IRQs. What that means is, when an interrupt occurs that the VIC doesn’t have a slot for, or that slot is empty, the VIC jumps to the address stored in VICDefVectAddr.

Now, I had read the manual ten years ago, and dutifully created a default handler.  Here’s the code I wrote for that:

void DefaultIntHandler ( )
{
  return;
}
 

Now, this code is wrong on so many levels.

First, I neglected to tell the compiler that this was an ISR.  GCC has an __attribute__((interrupt("IRQ"))) function attribute which automatically inserts code to preserve the registers on entry, restore them on exit, and return with the special interrupt return sequence.  Now, since this handler does absolutely nothing, that wouldn’t make much difference; the processor saves the critical flags and the address to resume from, so it won’t necessarily crash.

But even more importantly, I failed to write to VICVectAddr before returning from the ISR. This means that the VIC is left in a state where its priority hardware isn’t reset. As near as I can tell, that means the VIC will no longer service lower priority interrupts.

But wait a minute. Even if this code is a train wreck, it shouldn’t matter because I don’t have any interrupt sources that don’t have a proper ISR assigned in the VIC.  Enter the Spurious Interrupt.

As the name implies, a spurious interrupt is one that isn’t anticipated, and reasonably should never occur.  And yet, they can and do.  It’s a problem NXP was aware of, enough so that they wrote Application Note AN10414, entitled “Handling of spurious interrupts in the LPC2000”.

In the introduction, they write (emphasis added):

Spurious interrupts can occur in the LPC2000 just like in any ARM7TDMI-S based microcontroller using the Vectored Interrupt Controller (VIC) and if handled correctly, spurious interrupts can be serviced just like any other interrupt request.

In the LPC2000, spurious interrupts have occurred while using the watchdog and the UART peripherals.

Since the root cause of spurious interrupts lies in the interaction of the VIC and the ARM7 core, it is recommended to always program a small handler to service these interrupts.

 

Oh.

The app note continues:

How can spurious interrupts occur?

Let’s consider a real-life application:

1. Vectored Interrupt Controller (VIC) detects an IRQ interrupt request and sends the IRQ signal to the core.
2. Core latches the IRQ state.
3. Processing continues for a few cycles due to pipelining.
4. IRQ Handler loads Interrupt Service Routine (ISR) address from VIC.

Spurious interrupts can occur if, in step 3 the VIC state changes. This could happen under the following conditions:

1. The instruction being executed in step 3 disables interrupts.
2. The interrupt which caused the IRQ signal in the first place got cleared.

The first condition may take place while using a watchdog in the application. The second condition may take place when the RDA/CTI interrupt is enabled in the UART.

Please note that using these peripherals does not necessarily mean that spurious interrupts will always take place. The timing of the interrupt coupled with the instructions in the pipeline would lead to the occurrence of spurious interrupt. It is recommended to program a spurious interrupt handler if the above peripherals are used.

 

Now, if you’ve been paying attention, you remember that I do use the RDA interrupt to read in the messages sent from the client. What’s CTI?  That’s the Character Timeout Interrupt, which is necessitated by the use of a Receive FIFO. Remember I set the Rx FIFO to interrupt when there were eight characters in the FIFO?  So what happens if the number of bytes in a message isn’t exactly eight or a multiple of eight?

Let’s imagine the client sends a request that happens to be 13 bytes long. The first eight bytes are received and stuffed into the FIFO, so the UART asserts an interrupt. My ISR gets called, reads those eight out of the FIFO, clears the UART interrupt flag, and returns from the ISR.  Now the remainder of the message arrives. Those five bytes get placed in the Rx FIFO. The 8-byte threshold hasn’t been reached, so how will I ever be notified to read them?  That’s the purpose of the CTI: if one or more bytes are sitting in the FIFO but no attempt to read them occurs within 3.5 to 4.5 character times (~0.9 to 1.2 milliseconds at 38.4 kbps), the UART will assert an interrupt with the CTI ID.  When my ISR handles the CTI, it does exactly the same thing as an RDA: read characters into the message buffer until there are none left.
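As a sanity check on those numbers: with 8N1 framing, one character is 10 bit times (start bit, 8 data bits, stop bit), and the 16550-style character timeout spans 3.5 to 4.5 character times.

```c
/* One character time in milliseconds: bits_per_char bit periods at the
   given baud rate (10 bits for 8N1: start + 8 data + stop). */
double char_time_ms(double baud, int bits_per_char)
{
    return 1000.0 * bits_per_char / baud;
}
```

At 38.4 kbps that gives about 0.26 ms per character, so the CTI window works out to roughly 0.91 to 1.17 ms, matching the ~0.9 to 1.2 ms quoted above.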

It’s not exactly clear why RDA/CTI can generate a spurious interrupt, though I suspect problems can arise because another byte can come in immediately after the UART has asserted a CTI, causing it to de-assert the interrupt, but before the processor core has read the vector from the VIC.

Whatever the cause, the UART would generate a spurious interrupt, my inept default ISR would get called, and with the VIC priority hardware never reset, the transmit interrupt would stop being serviced, leaving the RS‑485 transmitter on.

So, my less than elegant but serviceable solution was to rewrite the default ISR. I took the Scorched Earth route:

  • Read any characters waiting in the UART and discard them.

  • Turn off the transmitter.

  • Set the message ready flag in case I’m waiting for it.  This might mean I’ll try to process an incomplete message, but there is plenty of validation testing on messages, so it will get discarded.

  • Clear all the other possible interrupt sources (SPI, keypad pins, and the system timer), and finally write the dummy value to VICVectAddr as required.
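The steps above can be sketched as follows. Register names are from the LPC22xx user manual; the TxE pin and the msg_ready flag are hypothetical stand-ins for the real driver state, and clearing the SPI, keypad, and timer sources is elided to a comment.

```c
/* Sketch of the rewritten, scorched-earth default handler. */
#define U0RBR       (*(volatile unsigned char *)0xE000C000)
#define U0LSR       (*(volatile unsigned char *)0xE000C014)
#define IO0CLR      (*(volatile unsigned long *)0xE002800C)
#define VICVectAddr (*(volatile unsigned long *)0xFFFFF030)
#define RS485_TXE   (1UL << 8)   /* hypothetical GPIO driving the DE pin */

extern volatile int msg_ready;   /* flag the main loop waits on */

void __attribute__((interrupt("IRQ"))) DefaultIntHandler(void)
{
    while (U0LSR & 0x01)         /* drain and discard the Rx FIFO */
        (void)U0RBR;

    IO0CLR = RS485_TXE;          /* force the RS-485 transmitter off */
    msg_ready = 1;               /* release anyone waiting on a message;
                                    validation will discard a partial one */

    /* ...also clear SPI, keypad pin, and system timer interrupt flags... */

    VICVectAddr = 0;             /* the write I originally forgot */
}
```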

 

The result is no more hung communication.  I get a dropped message occasionally, but the PC interface was designed to expect unreliability, and recovers transparently.

 

Now, an alternative approach would be to duplicate the UART ISR in the default ISR, but what if the source of the spurious interrupt isn’t the UART? The ISR would have to verify that the UART was actually asserting an interrupt, which kind of defeats the purpose of using vectored interrupts in the first place…