Bugs in software are bad, bugs in hardware are even worse, but inconsistent results are what make debugging a total nightmare.
This is not my first post on crazy voodoo wasting precious time. I previously approached a similar issue here, so you can call this article "voodoo reloaded" without being far from the truth. Sometimes the devices we spend so many hours building, using the best of our logic and know-how, seem to defy all that is normal, and there remains very little we can do to fix the issues.
This time, things are different (or maybe not?). I was working on the uRADMonitor model D. To summarise, it is a 60x110mm board with an Atmega2561, a 2500mAh battery, two inverters based on the LTC3440 for 3.3V / 5.0V used independently, a TP4056 battery charger, a DS1337 RTC (I2C), an EEPROM (I2C), an FT232 USB2Serial, a software configurable high voltage inverter with multiplier (configured for 480V), a microSD slot (SPI), a NEO6M GPS (UART), an ESP8266-ESP03 Wifi board (UART), a Bosch BME680 (I2C), a Sharp GP2Y1010AU0F dust sensor and an ILI9341 2.4" LCD (SPI) with touchscreen. All in all, a complex portable device with sensors and Internet connectivity, which passed through several iterations intended to make the hardware and the features better.
I finally got to the fifth variant, which was performing well. I assembled two PCBs and completed all the software: drivers for the various modules and sensors, a FAT32 implementation for the SD card, a GPS NMEA parser, an LCD driver with a minimalistic library, everything. All of this worked just fine.
I pushed it to production, but not before making some changes to the PCB layout: nothing fancy, just moving the speaker a little and adjusting the size of the SMD pads. The factory then finished the assembly of a few devices.
However, the very same code failed to run. Here's the crazy part:
1. If the firmware code inits the BME680 (over I2C) and tries to write something to the SD card, the device will NOT start. Before the I2C code there is some LCD code that puts some text on screen. That does not run either; nothing works.
2. If I take out the BME680 init code and its read function, the code can write to the SD card just fine and everything works, for all the modules and components.
This behaviour does not occur on the first two assembled test units. And debugging is slow because of the big distance between my location and the factory in China.
The usual suspects
1. Different Atmega2561 microcontroller batch with some weird memory issues (unlikely)
2. Code memory corruption (unlikely) - the code is nothing but a collection of well verified libraries for all the modules used.
With help from the excellent EEVblog forum, here are some additional possible causes:
3. Atmega2561 used on 3.3V. The datasheet specifies a narrower supply range for the Atmega2561-xx, 4.5V - 5.5V, and that might be responsible for the weird behaviour. Going outside the specified rating even a little means basically anything can happen, and the units that work might be doing so out of pure luck. Makes sense. However, further discussion, and also my test results for the two working boards, seem to point away from this explanation: "The datasheets aren't very clear about the voltage ratings, but from what I have heard, both variants are actually identical. They are only being tested under different conditions. If all boards fail the same way it is clearly a software bug or a hardware issue (like supply voltage dropping when everything is active at the same time)." And: "It almost looks like they designed the chip for 2.7-5.5V but experienced some problems and changed the specifications to 4.5-5.5V. In the errata for ATmega2560 for older revisions: Part does not work under 2.4 volts". I was able to use them perfectly on 3.3V with 14.7MHz crystals.
4. Perhaps just a simple stack overflow issue. The amount of static allocation has reached the point where the stack collides with it and things get trashed. Possible.
5. Running out of memory (RAM)?
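Suspects 4 and 5 come down to simple arithmetic. With the usual avr-libc layout, static data sits at the bottom of SRAM and the stack grows down from the top, so whatever .data, .bss and any heap take is no longer available to the stack. Below is a minimal sketch of that budget in Python, assuming the section sizes are read from `avr-size` output. The function names and the example numbers are hypothetical, not measurements from this firmware; the 8 KB figure is the internal SRAM of the Atmega2561.

```python
# Rough SRAM budget check for an AVR firmware image.
# Function names and example figures are illustrative assumptions only.

ATMEGA2561_SRAM_BYTES = 8 * 1024  # internal SRAM of the ATmega2561


def stack_headroom(data_bytes, bss_bytes, heap_bytes=0,
                   sram_bytes=ATMEGA2561_SRAM_BYTES):
    """Bytes left for the stack after static and heap allocations."""
    return sram_bytes - (data_bytes + bss_bytes + heap_bytes)


def fits(data_bytes, bss_bytes, worst_case_stack, heap_bytes=0,
         sram_bytes=ATMEGA2561_SRAM_BYTES):
    """True if the estimated worst-case stack depth fits in the headroom."""
    headroom = stack_headroom(data_bytes, bss_bytes, heap_bytes, sram_bytes)
    return headroom >= worst_case_stack


if __name__ == "__main__":
    # Hypothetical sizes: .data = 1200 bytes, .bss = 5800 bytes,
    # and a guessed worst-case stack depth of 1500 bytes.
    print("headroom:", stack_headroom(1200, 5800), "bytes")
    print("fits:", fits(1200, 5800, worst_case_stack=1500))
```

The worst-case stack depth is the hard number to obtain; the sketch only makes explicit how little margin is left once large static buffers (an SD card sector buffer, for instance) are added.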
Driving towards a solution
So what I did was to work with the factory and try all the good advice from EEVblog. First, several boards were modified for the following combinations:
1. Atmega1281 microcontroller with 7.37MHz crystal running on 3.3V
2. Atmega2561V-8AU instead of the Atmega2561 non-V previously used, with a 14.7MHz crystal (out of spec)
3. Atmega2561V-8AU with a proper 7.37MHz crystal
The behaviour and results were identical. The code and the various modules (LCD, RTC, dust sensor, 480V inverter, tube counter, GPS, Wifi) all work OK, but when the BME680 code is initialised and then the SD card is initialised, the unit won't start (even though there is LCD display code before the BME680 and SD card init, nothing gets on the screen).
So weird. Still, at this point I can only conclude it is a software issue. But ... the same code works perfectly on my two test units.
What I did was to pay for express shipping and have the factory send me two units.
The final findings
Two weeks later, I got the boards, preprogrammed with different test firmware variants. These were the original variant with the Atmega2561 set to run at 14.7MHz. I powered them up and saw all the issues presented here, which until then I had only observed remotely. The board that didn't call the BME680 code was working, while the other board, with the full firmware code, would not work. I flashed them both with the latest firmware. Surprise! They were now fully functional.
Yes, it was a software issue. But the software issue was in the programming software used to write the HEX to the microcontroller. The factory engineers were using PROGISP 1.72, while I was using avrdude. This was the only difference, and it eluded my investigation efforts for so long. Like in the previous post, the code was being downloaded and worked in almost all cases, which absolved this part of the process of any suspicion.
So we had some code that works in a sequential manner, meaning it does some things (LCD text display), and then it does some more things (BME680 sensor reading). We add a third set of things (SD card ops), but then the entire thing breaks. We no longer see even the first set of things being executed. So of course you would assume it's a software glitch, except that the very same program runs perfectly on another set of devices (not one, but 3 other devices). So OK, then you figure it must be related to hardware. So you have some devices sent all the way around the globe, to test the code on the suspect hardware. And when you finally do that, you see that everything is working properly. Who would have blamed the hex programmer's software?
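One check that can rule the programming tool in or out early is to read the flash back from a factory-programmed unit and compare it byte for byte with the HEX file that was supposed to be written. The sketch below is a minimal Intel HEX comparer in Python, given for illustration only; the function names `parse_intel_hex` and `diff_images` are made up here. With avrdude, a read-back can be taken with something like `avrdude -p m2561 -c <programmer> -U flash:r:readback.hex:i`, where the programmer id depends on your setup. The check covers flash contents only, not fuses or lock bits, and this post does not establish what exactly PROGISP did differently, so treat it as one test among several.

```python
# Compare two Intel HEX images, e.g. the build output and a flash read-back.
# Illustrative sketch; function names are assumptions, not an existing tool.

def parse_intel_hex(text):
    """Return {absolute_address: byte} for the data records in Intel HEX text."""
    memory = {}
    base = 0  # upper address bits, set by record types 02 and 04
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"line {line_no}: record does not start with ':'")
        record = bytes.fromhex(line[1:])
        if len(record) < 5 or len(record) != record[0] + 5:
            raise ValueError(f"line {line_no}: wrong record length")
        if sum(record) & 0xFF != 0:
            raise ValueError(f"line {line_no}: bad checksum")
        offset = int.from_bytes(record[1:3], "big")
        rtype = record[3]
        data = record[4:-1]
        if rtype == 0x00:    # data record
            for i, value in enumerate(data):
                memory[base + offset + i] = value
        elif rtype == 0x01:  # end of file
            break
        elif rtype == 0x02:  # extended segment address
            base = int.from_bytes(data, "big") << 4
        elif rtype == 0x04:  # extended linear address
            base = int.from_bytes(data, "big") << 16
        # types 03 and 05 carry start addresses, not flash contents
    return memory


def diff_images(expected, actual, erased=0xFF):
    """List (address, expected_byte, actual_byte) for every differing address.

    Addresses absent from one image are treated as erased flash (0xFF),
    so a full-chip read-back can be compared with a shorter build output.
    """
    mismatches = []
    for addr in sorted(set(expected) | set(actual)):
        want = expected.get(addr, erased)
        got = actual.get(addr, erased)
        if want != got:
            mismatches.append((addr, want, got))
    return mismatches
```

Usage would be along the lines of `diff_images(parse_intel_hex(open("firmware.hex").read()), parse_intel_hex(open("readback.hex").read()))`, with both file names being placeholders; an empty list means the flash matches the build.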