Guide to hardware troubleshooting (Part 2)
First, a quick review. Here are the 11 steps I described in Part 1:
Step #1: Picture success
Step #2: Keep notes
Step #3: Reproduce the problem
Step #4: Gather the evidence
Step #5: Try the easy stuff first
Step #6: Break the problem down
Step #7: Talk it over with a colleague
Step #8: Apply the fix
Step #9: Try to break it again
Step #10: Remember 'disappearing' bugs are still there if you haven't fixed them
Step #11: Celebrate
Bug example #1—crossed lines
The concept was pretty simple within a complex system. Fred had designed a video surveillance board that took two analogue camera inputs, digitized them, compressed them into MPEG4, and then streamed that data over Ethernet to a PC.
The project was running late because the original aim of streaming over USB had been abandoned due to the low performance of the MPEG4 compressor's USB interface. So the marginally more expensive option of Ethernet was used instead. But now the Ethernet was behaving strangely.
An experienced engineer with years of Ethernet designs under his belt, Fred thought of Ethernet as one of those interfaces that "just worked" (unlike a certain other well-known interface with a three-letter name). In this case, the MPEG4 SoCs streamed their data via a 3-port hub. The software supplied with the SoC had turned out to be pretty dodgy and unreliable, but as a core function, the Ethernet solution should have worked out of the box. Instead, something was causing the Ethernet output to be horribly corrupted. Time for the 11 steps.
Fred's colleague, John, was writing the software that handled the Ethernet output on the SoCs. He was good at getting to the bottom of bugs, so Step 1 was not a problem. (Throughout this article, refer to the list at the top to remind yourself what each step entails, or go to Part 1 for details.) The real question was how long the debugging process would take.
He started by using Wireshark to record what data was actually being received, then he entered this data into a spreadsheet (Steps 2, 3, and 4). Initially this was a mess, as the board was booting and attempting to pass streaming video data to a known IP address. John stripped the software back to the basics (Steps 5 and 6), running just a bootloader to try and establish a connection through a broadcast write.
John wrote to Fred: "I've made a hex dump of all frames received by device, and imported them into Wireshark. It seems that they are all valid Ethernet frames (or at least they start as valid Ethernet frames). At the moment I'm comparing what I've got with what is actually transmitted in the network (imported data with Wireshark capture on my PC network card). On the first look, there is some similarity, but many frames are missing, and those received are truncated. It doesn't seem that the data within the frames is damaged, just truncated.
"Further analysis shows that all frames have 1 to 6B chopped off the end of the frame."
John could see that the problem was on the RX (receive) side of the Ethernet interface. The TX (transmit) data was getting through fine, and the RX side was receiving broadcast packets but not unicasts. Even the broadcast packets that did arrive, however, were strangely truncated. Having already waded through a swamp of software bugs in the vendor's SDK, John had at first assumed that this was another software issue. But it looked like a hardware bug. Was it a problem with the hub, or with the SoC itself? It was time to talk it over with Fred (Step 7). Together, they pored over the schematics once again (Figures 1 and 2).
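The comparison John described in his message, along with the broadcast/unicast split he observed, can be sketched in a few lines of Python. This is only an illustration of the idea, not John's actual tooling: the helper names (`dest_kind`, `find_truncations`) and the sample frames are made up, and it assumes the transmitted and received frames are already available as raw byte strings.

```python
from collections import Counter

BROADCAST = b"\xff" * 6
HEADER_LEN = 14  # destination MAC + source MAC + EtherType


def dest_kind(frame):
    """Classify a frame by its destination MAC address (first 6 bytes)."""
    if len(frame) < 6:
        return "runt"
    dst = frame[:6]
    if dst == BROADCAST:
        return "broadcast"
    # The least significant bit of the first octet marks a group address.
    if dst[0] & 0x01:
        return "multicast"
    return "unicast"


def find_truncations(sent_frames, received_frames):
    """Pair each received frame with the first sent frame it is a prefix of.

    Returns (received_index, sent_index, missing_bytes) tuples.
    """
    matches = []
    for r_idx, rx in enumerate(received_frames):
        if len(rx) < HEADER_LEN:
            continue  # too short to match reliably
        for s_idx, tx in enumerate(sent_frames):
            if tx.startswith(rx):
                matches.append((r_idx, s_idx, len(tx) - len(rx)))
                break
    return matches


# Made-up frames for illustration only: two broadcasts and one unicast.
sent = [
    bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(range(46)),
    bytes.fromhex("ffffffffffff" "020000000002" "0806") + bytes(range(46, 92)),
    bytes.fromhex("020000000099" "020000000001" "0800") + bytes(range(92, 138)),
]
# Pretend the device saw only the broadcasts, each with its tail chopped off.
received = [sent[0][:-3], sent[1][:-6]]

truncations = find_truncations(sent, received)
for r_idx, s_idx, missing in truncations:
    print(f"received #{r_idx} = sent #{s_idx} minus {missing} bytes")

sent_kinds = Counter(dest_kind(f) for f in sent)
received_kinds = Counter(dest_kind(f) for f in received)
print("sent:", dict(sent_kinds))
print("received:", dict(received_kinds))
```

On real data, the two lists would come from the PC-side capture and the device-side hex dump. A pattern in which every match is a few bytes short, and no unicast frame ever shows up on the receive side, is the kind of evidence that points away from corrupted payloads and toward the receive path itself.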
Figures 1 and 2 show the incorrect and correct implementations, respectively, of a segment of the Ethernet hub schematic. Can you spot the difference?
Figure 1: Wrong implementation of Ethernet wiring.
Figure 2: Right implementation of Ethernet hub wiring.
Did you find it?