A Bug Hunting Story
A colleague and I recently had to track down a tricky bug in our system, and I thought the story might be interesting.
Our system includes a number of custom hardware devices that we communicate with via TCP/IP sockets. When we installed the latest (unreleased) version of our software on a system, we started getting periodic communication errors with one of the devices; all of the other devices were working fine. Sometimes the problem would show up immediately; other times, it would take a few minutes.
This particular device is notoriously flaky, so we suspected the hardware at first. However, when we went back to the released version of the software, the errors went away. Moving forward again, the problems reappeared. Obviously, the problem was in the software.
We looked at all of the changes to our code since the last release, but there was nothing suspicious there. However, we had upgraded from Visualworks Smalltalk 7.7.1 to 7.10 during this time, so it wasn’t just our code that we had to worry about.
We decided to try to narrow down where the problem was introduced. If
you’re using git, you can use git bisect
for this. This was in our
Smalltalk code, which has its own version control system (Store), so
we had to do this manually. We tested the version of our software
from just before the upgrade to 7.10, and then tested the version
right after the upgrade. The pre-upgrade version worked fine, and the
post-upgrade version didn’t. So we were pretty sure that something
in VW 7.10 was causing our problem.
We started running our code in a development image, using a simple workspace script to exercise the code in question while cutting out any extraneous activity. We inserted debugging output and set breakpoints in the debugger. As with many bugs, trying to step through the code in the debugger caused everything to work just fine. We weren’t getting very far with this approach.
We decided to look at the network traffic between our software and the device. Perhaps we could see a pattern there. We used tcpdump to get traces of the network traffic. Using these traces, we could see what a normal communication pattern looked like, and then we could see what was different when the problem started.
We were able to determine that our software would start behaving as if communication with the device had timed out. However, this was happening earlier than it should have, as if the timeout had expired too early.
This information allowed us to focus on the parts of our code where the timeout behavior was implemented. We tried to set a breakpoint, but that hung our image. Timer expiry runs at the highest possible priority in Visualworks, so if you suspend that process, nothing else in the image can run. We had to resort to debugging output instead.
We started dumping the value of Time microsecondClock
at various
points in the process. When the problem occurred, we noticed that
timers were being scheduled to expire up to 100 seconds in the past.
That wasn’t right. My colleague also noticed that all of the timers
were being set to expire at exactly the same time.
At this point, we were pretty sure we knew what the problem was: some kind of numerical representation issue where significant digits were being lost; we just had to find where it was.
We finally zeroed in on the method
ClassicTimerSupport>>enableTimer:after:every:
which looks like this:
In this code, Time microsecondClock
returns a large positive integer
representing the number of microseconds since 1901 began.
initialPeriod
is a Duration object representing the desired delay.
In VW 7.10, unlike in VW 7.7.1, Duration>>asMicroseconds
can
sometimes return a Float
instead of an Integer
. For the device we
were dealing with, that was exactly what was happening.
If you add a very large positive integer to a floating point value in Smalltalk, the addition is done by promoting the integer to a float, then adding. The microsecond clock value is too large to be represented exactly as a floating point value, so this promotion lost some significant digits off the end. Adding in a relatively small number of microseconds does essentially nothing. You can see the net effect by running the following in a workspace:
Our solution was to override the problem method by sending #floor
to
the result of initialPeriod asMicroseconds
before adding it to the
clock value. So, we replaced the method with this:
Or, in other words, we removed two characters: a (
and a )
. But
it took quite a while to learn which two characters we needed to remove.
We’ve submitted this issue to Cincom and they are working on it.