mcottondesign

Loving Open-Source One Anonymous Function at a Time.

What is the weirdest bug you've ever dealt with?

I was asked this once at an informal interview and didn't have an immediate answer. I told a story about the time I was trying to debug RF signals and the ultimate cause was another remote using the same hardware ID. It was extremely improbable, but it turned out to be the cause.

Since then I've had more time to reflect on how I would answer that question if I were asked again.

The weirdest bugs I've ever dealt with all have a common theme: distributed systems. My last two startups both used distributed architectures, and that turned somewhat simple tasks into very difficult ones.

In the old days of a monolithic app server (Django, Rails, PHP) all the logs were in one place. As you scaled you would need to add a load balancer and then spread the requests over multiple app servers. Reasoning about the path an HTTP request took was still fairly simple. At this level a relational database is perfectly able to handle the load.

In a distributed system, HTTP is used not just for the initial request but also for communication between servers. This means you now have to deal with internal HTTP responses, and they have different meanings from the external request/response that you're used to. For example, what should you do when an internal call to the cache returns a 401, 403, or 502? Do the unauthorized/forbidden responses mean that the external client's request should get a 401/403, or do they mean your app server is the one that is unauthorized? What about a 502 Bad Gateway or a 504 Gateway Timeout? You now have to deal with timing and latency between internal services.
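One way to make those questions concrete is to decide, for each internal status, who is actually at fault before answering the external client. Below is a minimal sketch in Python, not a definitive policy. The function name `translate_internal_status` is hypothetical, and the mapping rests on an assumption: the app server calls internal services with its own credentials, so an internal 401/403 is a server-side misconfiguration rather than the end user's problem. If your services forward the user's credentials instead, the 401/403 branch would need to change.

```python
from http import HTTPStatus


def translate_internal_status(internal_status: int) -> HTTPStatus:
    """Pick the status to send the external client, given the status an
    internal service returned to the app server.

    Hypothetical helper for illustration; the mapping is an assumption.
    """
    if 200 <= internal_status < 300:
        # The internal call succeeded, so the external request can succeed.
        return HTTPStatus.OK

    if internal_status in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        # Assumption: the app server's own credentials were rejected.
        # That is our fault, not the client's, so report a server error.
        return HTTPStatus.INTERNAL_SERVER_ERROR

    if internal_status in (HTTPStatus.BAD_GATEWAY, HTTPStatus.SERVICE_UNAVAILABLE):
        # An upstream dependency is broken or unavailable.
        return HTTPStatus.BAD_GATEWAY

    if internal_status == HTTPStatus.GATEWAY_TIMEOUT:
        # An upstream dependency was too slow to answer.
        return HTTPStatus.GATEWAY_TIMEOUT

    # Anything unexpected from an internal service is treated as our error.
    return HTTPStatus.INTERNAL_SERVER_ERROR
```

Even a small table like this forces the decision to be made once, in one place, instead of differently in every handler that happens to call the cache.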

Distributed computing and networking are a reality of life for most new projects. It is something we'll have to learn and deal with. In the meantime, it is also the source of my weirdest bugs.