Today I decided to see if I could establish an upper bound for this question. My goal was to put together an ad-hoc test to see how many idle clients I could log into a server. I used Tim's request app, which is basically a little performance testing widget that accepts commands from clients (such as "JOIN_CHANNEL", "LEAVE_CHANNEL", etc.). It doesn't do anything when a client logs in, though, and will happily sit idle if the client never sends a command. This makes it a perfect candidate for this test. I wrote a simple client that does nothing but log in a configurable number of users.
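For flavor, here's a minimal sketch of what such an idle-client driver can look like. The SimpleClient and SimpleClientListener classes (and the host/port login properties) are the real Project Darkstar client API; the driver class itself and its argument handling are just illustrative, not the exact code I used:

// IdleClientDriver.java -- illustrative sketch, not the exact test client.
// Logs in a configurable number of users and then leaves them idle.
import java.io.IOException;
import java.net.PasswordAuthentication;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import com.sun.sgs.client.ClientChannel;
import com.sun.sgs.client.ClientChannelListener;
import com.sun.sgs.client.simple.SimpleClient;
import com.sun.sgs.client.simple.SimpleClientListener;

public class IdleClientDriver {

    // Listener that completes the login handshake and then ignores everything.
    static class IdleListener implements SimpleClientListener {
        private final String name;
        IdleListener(String name) { this.name = name; }

        public PasswordAuthentication getPasswordAuthentication() {
            return new PasswordAuthentication(name, new char[0]);
        }
        public void loggedIn() { }
        public void loginFailed(String reason) {
            System.err.println(name + ": login failed: " + reason);
        }
        public ClientChannelListener joinedChannel(ClientChannel channel) {
            return null; // idle clients never join channels
        }
        public void receivedMessage(ByteBuffer message) { }
        public void reconnecting() { }
        public void reconnected() { }
        public void disconnected(boolean graceful, String reason) {
            System.err.println(name + ": disconnected: " + reason);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String host = args[0];
        String port = args[1];
        int numClients = Integer.parseInt(args[2]);

        Properties props = new Properties();
        props.setProperty("host", host);
        props.setProperty("port", port);

        List<SimpleClient> clients = new ArrayList<SimpleClient>();
        for (int i = 0; i < numClients; i++) {
            // Each SimpleClient opens its own connection and logs in asynchronously.
            SimpleClient client = new SimpleClient(new IdleListener("idle-" + i));
            client.login(props);
            clients.add(client); // keep a reference so the connection stays reachable
        }
        Thread.sleep(Long.MAX_VALUE); // keep the JVM (and its connections) alive
    }
}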
Here's what I found:

Machine configurations (1 used as server, 4 as clients):
- Sun Blade 6220
- 2 dual-core AMD 2200 processors, 2.8GHz
- 16GB RAM
- Solaris 10u6
Maximum connected clients:
- 32-bit JVM, 128MB max heap: ~800
- 32-bit JVM, 1GB max heap: ~6000
- 32-bit JVM, 2GB max heap: ~12000
I could tell when the limit was reached because, each time, the server would throw an exception that looked something like this:
[INFO] SEVERE: acceptor error on 0.0.0.0/0.0.0.0:11469
[INFO] java.lang.OutOfMemoryError: Direct buffer memory
[INFO] at java.nio.Bits.reserveMemory(Bits.java:633)
[INFO] at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95)
[INFO] at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
[INFO] at com.sun.sgs.impl.protocol.simple.AsynchronousMessageChannel.<init>(AsynchronousMessageChannel.java:86)
[INFO] at com.sun.sgs.impl.protocol.simple.SimpleSgsProtocolImpl.<init>(SimpleSgsProtocolImpl.java:167)
[INFO] at com.sun.sgs.impl.protocol.simple.SimpleSgsProtocolImpl.<init>(SimpleSgsProtocolImpl.java:139)
[INFO] at com.sun.sgs.impl.protocol.simple.SimpleSgsProtocolAcceptor$ConnectionHandlerImpl.newConnection(SimpleSgsProtocolAcceptor.java:316)
[INFO] at com.sun.sgs.impl.transport.tcp.TcpTransport$AcceptorListener.completed(TcpTransport.java:331)
[INFO] at com.sun.sgs.impl.nio.AsyncGroupImpl$CompletionRunner.run(AsyncGroupImpl.java:161)
[INFO] at com.sun.sgs.impl.nio.Reactor$ReactiveAsyncKey.runCompletion(Reactor.java:858)
[INFO] at com.sun.sgs.impl.nio.Reactor$PendingOperation$1.done(Reactor.java:630)
[INFO] at java.util.concurrent.FutureTask$Sync.innerSet(FutureTask.java:251)
[INFO] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[INFO] at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[INFO] at com.sun.sgs.impl.nio.Reactor$PendingOperation.selected(Reactor.java:563)
[INFO] at com.sun.sgs.impl.nio.Reactor$ReactiveAsyncKey.selected(Reactor.java:803)
[INFO] at com.sun.sgs.impl.nio.Reactor.performWork(Reactor.java:323)
[INFO] at com.sun.sgs.impl.nio.ReactiveChannelGroup$Worker.run(ReactiveChannelGroup.java:268)
[INFO] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[INFO] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[INFO] at java.lang.Thread.run(Thread.java:619)
From these numbers and the exception above, it looks like the maximum capacity of the server closely correlates with the configured maximum heap size, which makes sense. However, a few things stood out as odd:
- Why are the numbers so low? The clients aren't doing anything once they log in, and yet they eat up memory surprisingly quickly.
- Even though increasing the heap size helps, connecting to the server JVM with JConsole shows that heap usage never comes close to the configured maximum (the short demo after this list hints at why).
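The common thread is that the OutOfMemoryError above complains about "Direct buffer memory", not the Java heap. Direct ByteBuffers are allocated outside the heap and are capped separately by the JVM (the cap can be set with -XX:MaxDirectMemorySize, and here it appears to track the maximum heap size): ~12000 clients times the 128K read buffer allocated per connection is roughly 1.5GB against a 2GB limit. The following standalone demo, purely an illustration and not part of the test, reproduces the same failure mode:

// DirectBufferDemo.java -- illustration only, not part of the test harness.
// Direct ByteBuffers live outside the Java heap but are capped separately
// (configurable with -XX:MaxDirectMemorySize). Run with a small heap, e.g.
//   java -Xmx64m DirectBufferDemo
// and it fails with the same "Direct buffer memory" OutOfMemoryError while
// the heap itself stays nearly empty.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferDemo {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();
        int count = 0;
        try {
            while (true) {
                // 128K per buffer, the same size as the server's default read buffer
                buffers.add(ByteBuffer.allocateDirect(128 * 1024));
                count++;
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Allocated " + count + " direct 128K buffers before: " + e);
        }
    }
}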
Fortunately, there are a couple of things I can do with this information to improve my numbers. First, Project Darkstar provides a configuration property (com.sun.sgs.impl.protocol.simple.read.buffer.size) that controls the read buffer size. Instead of 128K, I switched it to 8K, its specified minimum. In most games, packet sizes should be very small, much smaller than 128K, so lowering this limit may be an acceptable solution in many cases. Second, and more of a big-hammer approach, is to switch to a 64-bit JVM, which allows a heap limit greater than 2GB. Here's what I observed with these changes:
Maximum connected clients:
- 32-bit JVM, 2GB max heap, com.sun.sgs.impl.protocol.simple.read.buffer.size=8192: ~64000
- 64-bit JVM, 16GB max heap, com.sun.sgs.impl.protocol.simple.read.buffer.size=131072: ~64000
In both of these cases, hitting the limit produced a different exception this time:
[INFO] SEVERE: acceptor error on 0.0.0.0/0.0.0.0:11469
[INFO] java.io.IOException: Too many open files
[INFO] at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
[INFO] at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
[INFO] at com.sun.sgs.impl.nio.AsyncServerSocketChannelImpl$1.call(AsyncServerSocketChannelImpl.java:254)
[INFO] at com.sun.sgs.impl.nio.AsyncServerSocketChannelImpl$1.call(AsyncServerSocketChannelImpl.java:251)
[INFO] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
[INFO] at java.util.concurrent.FutureTask.run(FutureTask.java:138)
[INFO] at com.sun.sgs.impl.nio.Reactor$PendingOperation.selected(Reactor.java:563)
[INFO] at com.sun.sgs.impl.nio.Reactor$ReactiveAsyncKey.selected(Reactor.java:803)
[INFO] at com.sun.sgs.impl.nio.Reactor.performWork(Reactor.java:323)
[INFO] at com.sun.sgs.impl.nio.ReactiveChannelGroup$Worker.run(ReactiveChannelGroup.java:268)
[INFO] at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
[INFO] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
[INFO] at java.lang.Thread.run(Thread.java:619)
This is a much better looking number, and the new exception suggests that we're now running into a different problem, most likely the OS's limit on open file descriptors. (With 8K buffers, 64000 clients need only about 500MB of direct buffer memory, and with a 16GB heap even the 128K buffers fit comfortably, so memory is no longer the bottleneck.) The file descriptor limit is likely configurable as well, but I haven't tried to increase it. A few closing thoughts:
- The current, default implementation of the server allocates a fixed, perhaps overly conservative, buffer size for each connected client. This can be tweaked with the com.sun.sgs.impl.protocol.simple.read.buffer.size property to reduce memory usage (a sample configuration follows this list).
- Properly tweaking this property gives us an upper bound of approximately 64000 connected clients (on Solaris 10, without making an effort to increase the max file descriptor setting for the OS).
- It should be noted that the server handled login storms with minimal effort. In the final tests I bombarded the server with about 20000 logins at a time. Under these circumstances, login round trips (login initiation to login completion) ranged anywhere from 500 milliseconds to 15 seconds per client.
- The clients were overloaded long before the server. I was unable to spin up more than 2000 clients per JVM before hitting out of memory errors, and was forced to manage 20 to 30 client JVMs spread across 4 machines in these tests. This was a bit of a pain, and suggests both that the client could/should be optimized and that some automation would be helpful.
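For reference, here's roughly what the buffer-size tweak from the second round of tests looks like in a server properties file. Only the last line is the tuning knob discussed above; the application name, root, and listener entries are placeholders for illustration (the port matches the one in the logs):

# Illustrative Darkstar application properties file; the app-specific values below are placeholders.
com.sun.sgs.app.name=RequestApp
com.sun.sgs.app.root=data/requestapp
com.sun.sgs.app.listener=com.example.RequestAppListener
com.sun.sgs.app.port=11469
# Shrink the per-connection read buffer from the 128K default to its 8K minimum.
com.sun.sgs.impl.protocol.simple.read.buffer.size=8192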
Comments:
- "com.sun.sgs.impl.protocol.simple.read.buffer.size is a great tip!"
- "Good job!"
- "Great blog on Darkstar! Keep up the good work and good code. Your classes in the Snowman example served as a great tutorial for me. Thanks"