A Bug Hunting Story

A colleague and I recently had to track down a tricky bug in our system, and I thought the story might be interesting.

Our system includes a number of custom hardware devices that we communicate with via TCP/IP sockets. When we installed the latest (unreleased) version of our software on a system, we started getting periodic communication errors with one of the devices; all of the other devices were working fine. Sometimes the problem would show up immediately; other times, it would take a few minutes.

This particular device is notoriously flaky, so we suspected the hardware at first. However, when we went back to the released version of the software, the errors went away. Moving forward again, the problems reappeared. Obviously, the problem was in the software.

We looked at all of the changes to our code since the last release, but there was nothing suspicious there. However, we had upgraded from Visualworks Smalltalk 7.7.1 to 7.10 during this time, so it wasn’t just our code that we had to worry about.

We decided to try to narrow down where the problem was introduced. If you’re using git, you can use git bisect for this. This was in our Smalltalk code, which has its own version control system (Store), so we had to do this manually. We tested the version of our software from just before the upgrade to 7.10, and then tested the version right after the upgrade. The pre-upgrade version worked fine, and the post-upgrade version didn’t. So we were pretty sure that something in VW 7.10 was causing our problem.

We started running our code in a development image, using a simple workspace script to exercise the code in question while cutting out any extraneous activity. We inserted debugging output and set breakpoints in the debugger. As with many bugs, trying to step through the code in the debugger caused everything to work just fine. We weren’t getting very far with this approach.

We decided to look at the network traffic between our software and the device. Perhaps we could see a pattern there. We used tcpdump to get traces of the network traffic. Using these traces, we could see what a normal communication pattern looked like, and then we could see what was different when the problem started.

We were able to determine that our software would start behaving as if communication with the device had timed out. However, this was happening earlier than it should have, as if the timeout had expired too early.

This information allowed us to focus on the parts of our code where the timeout behavior was implemented. We tried to set a breakpoint, but that hung our image. Timer expiry runs at the highest possible priority in Visualworks, so if you suspend that process, nothing else in the image can run. We had to resort to debugging output instead.

We started dumping the value of Time microsecondClock at various points in the process. When the problem occurred, we noticed that timers were being scheduled to expire up to 100 seconds in the past. That wasn’t right. My colleague also noticed that all of the timers were being set to expire at exactly the same time.

At this point, we were pretty sure we knew what the problem was: some kind of numerical representation issue where significant digits were being lost; we just had to find where it was.

We finally zeroed in on the method ClassicTimerSupport>>enableTimer:after:every: which looks like this:

ClassicTimerSupport>>enableTimer:after:every: (broken)

enableTimer: aTimer after: initialPeriod every: aPeriod
	self
		enableTimer: aTimer
		atMicroseconds: (Time microsecondClock + initialPeriod asMicroseconds)
				floor
		every: aPeriod

In this code, Time microsecondClock returns a large positive integer representing the number of microseconds since 1901 began. initialPeriod is a Duration object representing the desired delay. In VW 7.10, unlike in VW 7.7.1, Duration>>asMicroseconds can sometimes return a Float instead of an Integer. For the device we were dealing with, that was exactly what was happening.

If you add a very large positive integer to a floating point value in Smalltalk, the addition is done by promoting the integer to a float, then adding. The microsecond clock value is too large to be represented exactly as a floating point value, so this promotion lost some significant digits off the end. Adding in a relatively small number of microseconds does essentially nothing. You can see the net effect by running the following in a workspace:

Losing Microseconds

clock := Time microsecondClock.
difference := clock - (clock asFloat floor).
"48199885"

Our solution was to override the problem method by sending #floor to the result of initialPeriod asMicroseconds before adding it to the clock value. So, we replaced the method with this:

ClassicTimerSupport>>enableTimer:after:every: (fixed)

enableTimer: aTimer after: initialPeriod every: aPeriod
	self
		enableTimer: aTimer
		atMicroseconds: Time microsecondClock + initialPeriod asMicroseconds floor
		every: aPeriod

Or, in other words, we removed two characters: a ( and a ). But it took quite a while to learn which two characters we needed to remove.

We’ve submitted this issue to Cincom and they are working on it.

Comments