November 12, 2009

Rule 1 of Programming: It's Always Your Fault (Almost)

Over the past week or so, I've been working on putting together some micro benchmarks for Project Darkstar. There has been a significant uptick in forum activity lately relating to stress testing and performance issues. In particular, we've seen many questions along the lines of: "I can only connect X users to my darkstar server, what's wrong?" First of all, this is great news. It means that people are making significant progress with their Darkstar-based games/applications and are working to push the limits of the technology. However, I also think that this is completely the wrong question to ask. As I've demonstrated before, Project Darkstar has a pretty high ceiling for raw capacity in terms of number of users. A properly tuned app with a light load can easily handle tens of thousands of users per node. However, connecting mostly idle users to a mostly idle server is not very interesting. These capacity numbers naturally decrease as the number of messages between the clients and server and the amount of processing per message increase. This seems obvious, but people still ask the capacity question as though all games developed with Darkstar are going to have identical limitations. This is simply not the case.

With that said, though, we can still strive to identify upper bounds on Project Darkstar's performance at a more fine-grained level. Project Darkstar is an event-driven, transactional system, so none of its operations are free. With these micro benchmarks, I'm hoping to establish a relative cost for each of the operations on the DataManager, the ChannelManager, and the TaskManager. For example, how expensive is it to retrieve an object using DataManager.getBinding() vs. ManagedReference.get()? How much overhead is involved in each transaction? How expensive is it to create a Channel or send a message on a Channel? With more or fewer users? While retrieving data from Darkstar's data store should be an order of magnitude faster than using J2EE and an RDBMS, it is also likely an order of magnitude slower than retrieving data from an unsynchronized data structure that is already in memory. This is information that users really need to be aware of and put into perspective when designing their game, structuring their tasks, and establishing their own expectations of what the performance should be like.

So over the past couple of days, I've been debugging a problem in these benchmarks. In one particular test, I was attempting to measure the raw execution time per call to DataManager.getBinding() from the Project Darkstar API. The test was pretty simple: I set a large number of name bindings in one set of setup transactions, then timed the execution of another set of transactions that made some subset of calls to getBinding() on the names I had just set up. Taking into account previously measured transaction overhead, I could then come up with a reasonable estimate of the cost per operation. Seems easy, right? Well, it turns out that I hit a snag. In running this test, I repeatedly got into a situation where a seemingly random name binding was not being set properly during setup. Most of the calls to getBinding() would work fine, but a couple were throwing NameNotBoundException. What? This didn't make much sense. I went back and looked over my code many times; I tried a myriad of variations, logging output, and print lines, but still no luck. I was still getting NameNotBoundException for what seemed like a random name in the sequence. Hmmph.
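The cost estimate itself is just arithmetic: subtract the measured per-transaction overhead from the total elapsed time, then divide by the number of operations. Here's a minimal sketch of that calculation; the numbers in the usage example are invented for illustration, not measured Darkstar results:

```java
public class PerOpCost {
    /**
     * Estimates the cost of a single operation, given the total elapsed time
     * for a timed run, the previously measured per-transaction overhead, and
     * the shape of the run (transactions x operations per transaction).
     */
    static double perOpCostMicros(double totalMicros,
                                  double txOverheadMicros,
                                  int numTransactions,
                                  int opsPerTransaction) {
        // Remove the fixed per-transaction overhead, leaving only the time
        // actually spent inside the operations being measured.
        double opMicros = totalMicros - txOverheadMicros * numTransactions;
        return opMicros / ((double) numTransactions * opsPerTransaction);
    }

    public static void main(String[] args) {
        // Hypothetical run: 100 transactions of 1,000 getBinding() calls
        // took 250,000 us total, with 50 us of overhead per transaction.
        System.out.println(perOpCostMicros(250_000, 50, 100, 1_000));
        // prints 2.45 (estimated microseconds per call)
    }
}
```

The subtlety is that the transaction overhead has to be measured separately (e.g., by timing empty transactions); otherwise it silently inflates the per-operation estimate.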

At this point, I went through a whole series of exercises, all centered on one assumption: that my code was right. I tested the native edition vs. the Java edition of Berkeley DB (BDB), suspecting maybe there was a weird bug in one of them: same result. I tried longer transactions, more operations, larger serialized data objects: same result. I tried running my benchmarks in different orders: same result. I even started writing test cases for DataManager.setBinding() that simulated transaction rollback and retry, large numbers of consecutive calls to setBinding(), and binding and rebinding of the same name. I thought I was going to uncover some weird corner-case bug. But those tests were passing! I was at a loss. After probably two days of sporadic attempts at debugging this, I finally went back and looked really hard at my own test code. And... I found a bug (doh!). It turns out that I was being too cute with my setup transactions, and was modifying a non-local counter variable inside of my anonymous nested transaction class. In random situations, this task would abort and retry (a normal Darkstar operation), but since it was modifying a variable that lived outside of the task itself, that value was not being rolled back. The result was that a name binding would be skipped periodically (exactly the behavior that I was seeing).
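To see why a counter that lives outside the task causes skipped bindings, here's a minimal, self-contained simulation of the failure mode. This is plain Java standing in for Darkstar's scheduler; the retry loop, the deterministic abort schedule, and the names are all invented for illustration. The data store rolls back the writes of an aborted attempt, but an increment on a variable outside the task survives into the retry:

```java
import java.util.ArrayList;
import java.util.List;

public class SkippedBindingDemo {

    /**
     * Runs 6 "transactions" of 5 setBinding-style writes each. Every even
     * transaction aborts on its first attempt and gets retried, mimicking
     * Darkstar's abort-and-retry behavior.
     */
    static List<String> runSetup(boolean rollBackCounter) {
        List<String> bound = new ArrayList<>();
        int counter = 0;  // lives OUTSIDE the tasks, like my buggy benchmark
        for (int tx = 0; tx < 6; tx++) {
            boolean firstAttempt = true;
            while (true) {  // retry loop, standing in for the task scheduler
                int counterAtStart = counter;
                List<String> txWrites = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    // Side effect on the outer counter happens here.
                    txWrites.add("name-" + counter++);
                }
                if (firstAttempt && tx % 2 == 0) {
                    // Simulated abort: the data store discards txWrites...
                    firstAttempt = false;
                    if (rollBackCounter) {
                        counter = counterAtStart;  // the fix: restore it too
                    }
                    continue;  // ...and the scheduler retries the task,
                               // but the outer counter kept its increments.
                }
                bound.addAll(txWrites);  // commit
                break;
            }
        }
        return bound;
    }
}
```

With rollBackCounter set to false (my bug), each retried attempt resumes counting from where the aborted attempt left off, so names like name-0 through name-4 are never bound, and a later getBinding() on them would throw NameNotBoundException. Restoring the counter on abort yields the contiguous name-0 through name-29 sequence the setup was supposed to produce.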

So here's my question: why did I assume that code I wrote in less than a day was more likely to be correct than Berkeley DB itself, a project that's been developed and tested for a couple of decades? Why did I assume that code I wrote in less than a day was more likely to be correct than Project Darkstar's Data Service code, which has been developed and tested for years? I mean, I knew better than to think that Tim's code was the likely culprit, but I still started writing test cases thinking I was going to heroically find some obscure bug. This, my friends, is a violation of the number one rule of programming: if you're having problems, It's Always Your Fault (almost). I mean, don't get me wrong, I've found (and reported) bugs in well-established open source projects before, but those situations are actually few and far between. I also don't mean to suggest that Project Darkstar is bug free. I do think, though, that sometimes it's too tempting to conclude that there's a bug in that library you're using, or that there's a performance limitation in that technology that is fundamentally impossible to overcome. Maybe that's true, but 99% of the time, it's your fault.

And with regard to those micro benchmarks, I'm hoping to publish some results soon (assuming I don't get hung up on any more boneheaded mistakes!).
