May 12, 2009

Capacity Testing

There's one question that we get asked a lot about Project Darkstar: "How many users can you connect to one server?" This is a difficult question to answer, mainly because it's extremely sensitive to context. The game type, game behavior, and hardware specifications can all have an enormous effect.

Today I decided to see if I could establish an upper bound for this question. My goal was to put together an ad-hoc test to see how many idle clients I could log into a server. I used Tim's request app, which is basically a little performance-testing widget that accepts commands from clients (such as "JOIN_CHANNEL", "LEAVE_CHANNEL", etc.). It doesn't do anything when a client logs in, though, and will happily sit idle if the client never sends any commands. This makes it a perfect candidate for this test. I wrote a simple client that does nothing but log in a configurable number of users.
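The client I wrote isn't anything fancy. Here's a sketch of the idea (not the exact code I ran; the class name is made up, and the listener signatures are from memory of the stock SimpleClient/SimpleClientListener client API, so they may differ slightly between releases):

import java.io.IOException;
import java.net.PasswordAuthentication;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import com.sun.sgs.client.ClientChannel;
import com.sun.sgs.client.ClientChannelListener;
import com.sun.sgs.client.simple.SimpleClient;
import com.sun.sgs.client.simple.SimpleClientListener;

public class IdleClients {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("host", args[0]);
        props.setProperty("port", args[1]);
        int count = Integer.parseInt(args[2]);

        // keep references so the sessions stay alive for the duration of the test
        List<SimpleClient> clients = new ArrayList<SimpleClient>();
        for (int i = 0; i < count; i++) {
            final String name = "idle-" + i;
            SimpleClient client = new SimpleClient(new SimpleClientListener() {
                public PasswordAuthentication getPasswordAuthentication() {
                    // the password doesn't matter for this test
                    return new PasswordAuthentication(name, new char[0]);
                }
                public void loggedIn() {
                    System.out.println(name + " logged in");
                }
                public void loginFailed(String reason) {
                    System.out.println(name + " login failed: " + reason);
                }
                // the rest of the session callbacks: do nothing, we're idle
                public ClientChannelListener joinedChannel(ClientChannel channel) {
                    return null;
                }
                public void receivedMessage(ByteBuffer message) { }
                public void reconnecting() { }
                public void reconnected() { }
                public void disconnected(boolean graceful, String reason) { }
            });
            client.login(props);
            clients.add(client);
        }

        // park forever so the sessions stay connected
        try {
            Thread.sleep(Long.MAX_VALUE);
        } catch (InterruptedException ignored) {
        }
    }
}

Here's what I found: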

Machine configurations (1 used as server, 4 as clients):
Sun Blade 6220
2 dual-core AMD 2200 CPUs, 2.8GHz
16GB RAM
Solaris 10u6

Maximum connected clients:
32bit JVM, 128MB max heap : ~800
32bit JVM, 1GB max heap : ~6000
32bit JVM, 2GB max heap : ~12000

I could tell when a limit was reached because, each time, the server would throw an exception that looked something like this:

[INFO] SEVERE: acceptor error on 0.0.0.0/0.0.0.0:11469
[INFO] java.lang.OutOfMemoryError: Direct buffer memory
[INFO]  at java.nio.Bits.reserveMemory(Bits.java:633)
[INFO]  at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95)
[INFO]  at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
[INFO]  at com.sun.sgs.impl.protocol.simple.AsynchronousMessageChannel.<init>(AsynchronousMessageChannel.java:86)
[INFO]  at com.sun.sgs.impl.protocol.simple.SimpleSgsProtocolImpl.<init>(SimpleSgsProtocolImpl.java:167)
[INFO]  at com.sun.sgs.impl.protocol.simple.SimpleSgsProtocolImpl.<init>(SimpleSgsProtocolImpl.java:139)
[INFO]  at com.sun.sgs.impl.protocol.simple.SimpleSgsProtocolAcceptor$ConnectionHandlerImpl.newConnection(SimpleSgsProtocolAcceptor.java:316)
[INFO]  at com.sun.sgs.impl.transport.tcp.TcpTransport$AcceptorListener.completed(TcpTransport.java:331)
[INFO]  at com.sun.sgs.impl.nio.AsyncGroupImpl$CompletionRunner.run(AsyncGroupImpl.java:161)
[INFO]  at com.sun.sgs.impl.nio.Reactor$ReactiveAsyncKey.runCompletion(Reactor.java:858)
[INFO]  at com.sun.sgs.impl.nio.Reactor$PendingOperation$1.done(Reactor.java:630)
[INFO]  at java.util.concurrent.FutureTask$Sync.innerSet(FutureTask.java:251)
[INFO]  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[INFO]  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[INFO]  at com.sun.sgs.impl.nio.Reactor$PendingOperation.selected(Reactor.java:563)
[INFO]  at com.sun.sgs.impl.nio.Reactor$ReactiveAsyncKey.selected(Reactor.java:803)
[INFO]  at com.sun.sgs.impl.nio.Reactor.performWork(Reactor.java:323)
[INFO]  at com.sun.sgs.impl.nio.ReactiveChannelGroup$Worker.run(ReactiveChannelGroup.java:268)
[INFO]  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[INFO]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[INFO]  at java.lang.Thread.run(Thread.java:619)

From these numbers and the exception above, it looks like the maximum capacity of the server closely correlates with the configured maximum heap size, which makes sense. However, a couple of things are odd:
  • Why are the numbers so low? The clients aren't doing anything once they log in, and yet they seem to eat up memory very quickly.
  • Even though increasing the heap size helps, connecting to the server JVM using JConsole shows that the memory usage never comes close to the max heap limit.
After digging through the stack trace as well as the Darkstar I/O code, I discovered that the culprit lies in our use of DirectByteBuffers. First, for each client that connects, a DirectByteBuffer of length 128K is allocated to serve as buffer space for incoming packets. Second, memory for DirectByteBuffers is allocated outside the Java heap, so it never shows up in heap usage, which makes the JVM confusing to monitor. The JVM does enforce a separate ceiling on total direct buffer memory, and on this JVM that ceiling appears to track the configured max heap size, which would explain why raising the heap limit raises the connection count even though the heap itself stays nearly empty.
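To see this behavior in isolation, here's a tiny standalone sketch (nothing to do with Darkstar's own code; the class name is made up) that allocates 128K direct buffers in a loop until it dies. Run it with something like java -Xmx256m DirectBufferDemo and it hits the same "Direct buffer memory" error while the heap usage it reports stays tiny:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // hold references so nothing gets garbage collected
        List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();
        try {
            while (true) {
                // same size as Darkstar's default per-connection read buffer
                buffers.add(ByteBuffer.allocateDirect(128 * 1024));
            }
        } catch (OutOfMemoryError e) {
            Runtime rt = Runtime.getRuntime();
            long heapUsedMB = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("Failed after " + buffers.size()
                    + " direct buffers (" + (buffers.size() * 128 / 1024)
                    + "MB outside the heap); heap in use is only ~" + heapUsedMB + "MB");
        }
    }
}

The direct memory ceiling can also be set explicitly with -XX:MaxDirectMemorySize, independent of the heap size.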

Fortunately, there are a couple of things I can do with this information to help improve my numbers. First, Project Darkstar provides a configuration property (com.sun.sgs.impl.protocol.simple.read.buffer.size) that controls the read buffer size. Instead of 128K, I switched it to 8K, its specified minimum. In most games, packet sizes should be very small, much smaller than 128K, so changing this limit should be an acceptable solution in many cases. Second, and more of a big-hammer approach, is to switch to a 64bit JVM, which would allow us to configure a heap limit greater than 2GB.
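Setting the property is just an extra line in the server's properties file. A sketch (the file name and application settings here are made up; only the last line matters):

# RequestApp.properties (hypothetical); the app settings are the usual ones, shown for context
com.sun.sgs.app.name=RequestApp
com.sun.sgs.app.port=11469
# shrink the per-connection read buffer from the 128K default to the 8K minimum
com.sun.sgs.impl.protocol.simple.read.buffer.size=8192

Here's what I observed with these changes: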

Maximum connected clients:
32bit JVM, 2GB max heap, com.sun.sgs.impl.protocol.simple.read.buffer.size=8192 : ~64000
64bit JVM, 16GB max heap, com.sun.sgs.impl.protocol.simple.read.buffer.size=131072 : ~64000

In both of these cases, when the limit was reached the server threw a different exception:
[INFO] SEVERE: acceptor error on 0.0.0.0/0.0.0.0:11469
[INFO] java.io.IOException: Too many open files
[INFO]  at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
[INFO]  at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
[INFO]  at com.sun.sgs.impl.nio.AsyncServerSocketChannelImpl$1.call(AsyncServerSocketChannelImpl.java:254)
[INFO]  at com.sun.sgs.impl.nio.AsyncServerSocketChannelImpl$1.call(AsyncServerSocketChannelImpl.java:251)
[INFO]  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[INFO]  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[INFO]  at com.sun.sgs.impl.nio.Reactor$PendingOperation.selected(Reactor.java:563)
[INFO]  at com.sun.sgs.impl.nio.Reactor$ReactiveAsyncKey.selected(Reactor.java:803)
[INFO]  at com.sun.sgs.impl.nio.Reactor.performWork(Reactor.java:323)
[INFO]  at com.sun.sgs.impl.nio.ReactiveChannelGroup$Worker.run(ReactiveChannelGroup.java:268)
[INFO]  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[INFO]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[INFO]  at java.lang.Thread.run(Thread.java:619)

This is a much better-looking number, and the exception also suggests that we're now running into a different problem, most likely the OS's limit on open file descriptors. That limit is configurable as well, but I haven't tried to raise it (there's a sketch of how it might be done on Solaris after the list below). A few closing thoughts:
  • The current, default implementation of the server allocates a fixed, and perhaps overly conservative, buffer for each connected client. This can be tweaked with the com.sun.sgs.impl.protocol.simple.read.buffer.size property to reduce memory usage.
  • Properly tweaking this property gives us an upper bound of approximately 64000 connected clients (on Solaris 10, without making an effort to increase the max file descriptor setting for the OS).
  • It should be noted that the server handled login storms with minimal effort. In the final tests I bombarded the server with about 20000 logins at a time, and under those circumstances each client saw a login round trip (login initiation to login completion) of anywhere from 500 milliseconds to 15 seconds.
  • The clients were overloaded long before the server. I was unable to spin up more than 2000 clients per JVM before hitting out-of-memory errors, and I was forced to manage 20 to 30 client JVMs spread across 4 machines in these tests. A bit of a pain, and it suggests both that the client could/should be optimized and that some automation would be helpful.
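For reference, here's roughly what raising the descriptor limit looks like on Solaris 10. I haven't actually tried it for these tests, so treat the numbers as assumptions on my part:

# raise the soft limit in the shell that launches the server (capped by the hard limit)
ulimit -n 65536

* or raise the system-wide defaults in /etc/system (comment lines there start with "*") and reboot
set rlim_fd_max=131072
set rlim_fd_cur=65536

Since I topped out right around 64000 connections, I'm presumably bumping into the default hard limit of 65536, so going further would mean the /etc/system route (or a process.max-file-descriptor resource control) rather than just ulimit.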

3 comments :

  1. com.sun.sgs.impl.protocol.simple.read.buffer.size is a great tip!

  2. Great blog on Darkstar! Keep up the good work and good code. Your classes in the Snowman example served as a great tutorial for me. Thanks
