
Fixing bugs on our brand-new Java application sometimes takes us back to
the 1960s. There are often bugs that only show up at customer sites,
whether due to scale of implementation or manner of use or some external
environmental factors, and the only way to fix them is to make one's
Best Guess™ at where the program is going wrong, make a
change, compile it, and send it to the customer to try. This means that
any typographical or cut-n-paste error will not be caught at our
site--as long as it compiles--but the new code won't fix the problem. So
we have to try all over again, with at least a 24-hour turnaround before
the new code can go out (in the case I'm working with right now, it's
more like 72 hours--they won't let me send it until Friday) and an
additional several hours or days before we can tell whether it worked.
Might as well be writing code in longhand, submitting it to a typist,
and waiting for a scheduled compile and a scheduled run.

These bugs are usually members of what I call the "Mysterious" class:
bugs that leave clear evidence that they've occurred, but no trace of
how. Either there's a stack trace but the error is not reproducible, or
there's no stack trace at all and we only see data integrity issues
after the fact, with no way to reproduce them. Sometimes, if
I'm lucky, I can find an obvious glaring hole in the code that updates
the data in question, but usually it appears to be fine.

Much of my role here seems to be "collector of mysterious bugs". I end
up with a large number of open Change Requests all reporting the same
irreproducible problem, and poke at them from time to time hoping that
something might have changed in the meantime allowing the bug to be
reproduced, though usually nothing has. And then, miraculously, I get
another CR for the same issue that has just one more vital piece of the
scenario, and voila! Half an hour later, the whole issue is resolved.

A recent example of this type was a situation where all the CRs
reported "The SO closes but the accounts are left pending." (Details as
to what this means are not really relevant.) I could see that the data
was indeed left in the invalid state they claimed, but every time I
repaired the data and did the same process myself--using their software
and everything--the process worked perfectly. Until this week, when I
finally hit one that *repeatedly* threw an error. After the error, if
you continued, you'd get the stated result. And the same thing happened
to me on the retry, over and over. Woohoo! Fourteen "Mystery" CRs, all
closed.