September 25, 2009

Austin GDC 2009

Last week, a number of us from the Project Darkstar team were in Austin for the Austin Game Developers Conference. Like last year, we had a large booth on the expo floor. However, while last year we were largely focused on demonstrating Project Darkstar's ability to scale and distribute load across multiple cores and processors on a single node, this year our focus was on showcasing the team's progress on multi-node capabilities. Here's a recap of the week's events:

If you've been following along with Project Darkstar's progress over the years, you know that transparent multi-node scaling is one of its main attractions as a platform. On the other hand, you should also know that, with Project Darkstar being a still-maturing research project in Sun Labs, these features are not yet done. In fact, we have still not proven whether what we are attempting can even be done. Our fearless lead architect, Jim Waldo, has put together an excellent series of posts on his blog outlining why and how we hope to achieve this multi-node scaling. He calls it the "Four Miracles."

Our goal for Austin was to put together a compelling visual demonstration of our current progress towards achieving these "Four Miracles." Specifically, we hoped to show how Ann's miracle of transparently moving clients from one node to another in a cluster worked in tandem with Keith's miracle of monitoring a node's health and intelligently making decisions about which clients to relocate and when. Also, while Jane's miracle of detecting and organizing clients into affinity groups based on social networking algorithms is not complete, we hoped to portray how it should work by rigging up an app that formed affinity groups based on a player's location in the game (specifically, which chat room it was in).

In the weeks leading up to the conference, Keith and I came up with two demos that for the most part seemed to hit the mark. Using the JMX facilities already built into Darkstar, I hacked up a monitoring GUI that tapped into a running Darkstar cluster and displayed each of the nodes as a vertical bar. The height of the bar represented the total number of clients connected to that node. The bar's coloring was selectable: either a single color representing the node's health, or colored segments representing the affinity group of each connected client. For the first demo, we had Project Snowman running in a multi-node cluster and showed how the new node health monitoring features of Darkstar caused client traffic to spill over and be intelligently distributed between the nodes. Here's a quick video of Keith talking through it on the expo floor in Austin:

The second demo had Darkchat running on a multi-node cluster and was designed to show how clients would relocate between nodes depending on which rooms they were connected to in the application. Here's another quick video given by yours truly on the floor at Austin:

We had both of these demos running throughout the week on the floor, and the response from people who came by was generally positive. One of the main differences that I noticed between this year and last year was the amount of quality traffic that came through the booth. I talked to a lot of people at Austin last year, but a large percentage of them had never heard of Project Darkstar or were just vaguely familiar with it. This year many people came by who were already committed to a project using Darkstar, or were very interested in our progress, or were familiar with the technology and had a strong desire to learn more. I think the best piece of anecdotal evidence was one person from the Intel booth who came by and said "Oh, so this is for real now?", referring, of course, to the significant headway that we are finally making and showing on the multi-node scaling capabilities of the Project Darkstar platform.

In a tough year for everyone in just about every industry, I think going to this conference and putting together these demos has injected some energy into both the Project Darkstar community and the team. A few more personal observations:
  • The expo floor seemed smaller this year and overall attendance appeared to be down. Not too surprising, but hopefully a sign of things past and not things to come.
  • A little tidbit on some of the pains of getting the demos set up. Leading up to the conference, we had everything running just fine in Burlington and the demos packaged up and ready to go. After a few hours of pulling cables, moving pods around, and booting up systems, we fired up the node health demo on Tuesday, the day before the expo floor opened. Except it didn't work. At least not like it did in Burlington. When simulating an overloaded node, instead of offloading clients onto the other node right away, there was a seemingly arbitrary delay of 20 to 30 seconds before any clients would be moved. Huh?
  • After some hours of debugging, and pulling Seth into the mix for help, we finally tracked it down. The (still unfinished) node health code offloads identities from a node when it gets overloaded. However, it doesn't just move client identities; it also moves the identities of robots and other NPCs in the system. Since each Snowman game has a number of robots, it was choosing to move the robot identities before moving the client identities. The question, of course, is why we never saw this behavior in Burlington. Well, it turns out that the order in which identities are chosen to be moved is deterministic and seemingly alphabetical. In Burlington, our simulated client players were generating identity names according to the hostname of the client machine (dstar1, dstar2, dstarX, ...). The hostnames of the machines used in Austin? x2250-1, x2250-2, etc. So in our Burlington deployment, the client identities were always getting chosen before the robot identities since they started with a d; in our Austin deployment, the client identities were always getting chosen after the robot identities since they started with an x. Unbelievable.
  • Keith gave an awesome talk during a one-hour session. He went through a number of obstacles and challenges he faced when building Darkchat for JavaOne, and I think it came across as very real and genuine.
  • One final note: it appears as though Chuck Norris still doesn't need scalable server technology. All of his CPUs run faster to get away from him. Also, any code that he writes cannot be optimized. For anyone else, though, Project Darkstar could be a solution.
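
For the curious, the identity-ordering surprise from the demo setup can be reproduced in a few lines. This is only a toy sketch in Python, not Darkstar's actual node health code: it assumes the selector simply picks identities in sorted name order, and the robot names ("robot-1", "robot-2") are made up for illustration. Only the client hostnames come from our two deployments.

```python
def eviction_order(identities):
    """Return identities in the order a name-sorted selector would move them.

    Assumption: selection is plain alphabetical order, which is what the
    behavior we observed looked like. The real code may differ.
    """
    return sorted(identities)


# Hypothetical NPC identity names; the real robot names are not shown here.
ROBOTS = ["robot-1", "robot-2"]

# Client identities were named after the client machine's hostname.
burlington = eviction_order(["dstar1", "dstar2"] + ROBOTS)
austin = eviction_order(["x2250-1", "x2250-2"] + ROBOTS)

print("Burlington:", burlington)  # clients (d...) sort ahead of the robots
print("Austin:    ", austin)      # robots sort ahead of the clients (x...)
```

With names like these, the Burlington clients are moved first, while in Austin every robot is moved before the first client, which would look just like the delay we saw on the floor.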
