There's more to configuring ThreadPools than thread pool size - No Fluff Just Stuff


Posted by: Billy Newport on May 13, 2011

We just found a customer issue and it's worth talking about. The customer was preloading data into WebSphere eXtreme Scale. They were multi-threading the preload to speed it up, using a default fixed-size ExecutorService as the pool. They were fetching records from the database and then doing a submit call on the pool to push them into the grid.

They were running out of memory on the preload client. The reason is as follows. The thread pool size was fixed, so once all of those threads were busy, every additional submit simply added the item to the pool's work queue. The queue kept getting longer and therefore held more and more items. In a preload scenario, each item might be 10k records to push into the grid. Items that big take a lot of space, and they pile up fast if you can fetch from the database more quickly than the pool can drain the queue.
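To make the failure mode concrete, here is a minimal sketch (not the customer's code) showing why a default fixed-size pool behaves this way: Executors.newFixedThreadPool backs the pool with an unbounded LinkedBlockingQueue, so once the worker threads are busy, every further submit is buffered in memory without limit:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueueDemo {
    public static void main(String[] args) throws Exception {
        // A default fixed-size pool: 2 threads, UNBOUNDED work queue.
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(2);

        CountDownLatch block = new CountDownLatch(1);
        // Occupy both worker threads so further submits can only queue.
        for (int i = 0; i < 2; i++) {
            pool.execute(() -> {
                try { block.await(); } catch (InterruptedException e) { }
            });
        }
        // 1000 more submissions: none are rejected, all sit in memory.
        // If each were a 10k-record batch, memory would grow accordingly.
        for (int i = 0; i < 1000; i++) {
            pool.execute(() -> { });
        }
        System.out.println("queued tasks: " + pool.getQueue().size());
        block.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Running this prints `queued tasks: 1000`: the pool happily accepted everything, which is exactly the memory runaway described above when each queued item is large.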

This means that if the thread pool is undersized, the pool will queue up lots of items, and when the items are big, your preload client can run out of memory pretty easily. There are a couple of things to do here, but the first is to fix the thread pool so that this can't happen. If you initialize the pool like this, you'll get better behavior:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bound the queue: once 3x numThreads jobs are waiting for a thread,
// further submissions are handled by the rejection policy below.
LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>(numThreads * 3);

// Once the queue reports that it is full, CallerRunsPolicy will run the job
// on the submitter thread instead of throwing RejectedExecutionException.
ExecutorService p = new ThreadPoolExecutor(numThreads, numThreads, 2L,
                TimeUnit.MINUTES, queue,
                new ThreadPoolExecutor.CallerRunsPolicy());

This shows us creating a LinkedBlockingQueue with a specific maximum capacity. I used 3 times the thread pool size, but you'll need to figure out what works best for your workload. The major change here is the CallerRunsPolicy parameter. Now that the thread pool uses a fixed-size queue, the pool would normally throw a RejectedExecutionException once the queue fills; at that point we'd have N running tasks plus M items on the queue. CallerRunsPolicy means that instead of throwing an exception, the job is executed on the submitter thread. This prevents the memory runaway that was happening in the case I described earlier and throttles the preload operation itself rather than letting it go crazy.
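The throttling effect can be seen directly. In this sketch (names like `CallerRunsDemo` are mine, not from the original post), the pool's threads and bounded queue are deliberately filled, and the next submission is verified to run on the submitting thread rather than be queued or rejected:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CallerRunsDemo {
    public static void main(String[] args) throws Exception {
        int numThreads = 2;
        LinkedBlockingQueue<Runnable> queue =
                new LinkedBlockingQueue<Runnable>(numThreads * 3);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                numThreads, numThreads, 2L, TimeUnit.MINUTES, queue,
                new ThreadPoolExecutor.CallerRunsPolicy());

        CountDownLatch block = new CountDownLatch(1);
        // Fill the pool completely: 2 tasks running + 6 tasks queued.
        for (int i = 0; i < numThreads + numThreads * 3; i++) {
            pool.execute(() -> {
                try { block.await(); } catch (InterruptedException e) { }
            });
        }
        // The next task can't be queued, so CallerRunsPolicy runs it
        // right here on the submitting (main) thread, throttling us.
        AtomicBoolean ranOnCaller = new AtomicBoolean(false);
        String caller = Thread.currentThread().getName();
        pool.execute(() ->
                ranOnCaller.set(Thread.currentThread().getName().equals(caller)));
        System.out.println("ran on submitter: " + ranOnCaller.get());
        block.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

This prints `ran on submitter: true`: while the submitter is busy running the overflow job itself, it isn't fetching more records from the database, which is precisely the back-pressure we want.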

The other thing to watch here is batch size: use larger batches. If you are using WXSUtils.putAll_noLoader to bulk load items into the grid, make sure you have around 1000 key/value pairs PER partition. This is a good rule of thumb to start from when tuning. If you have 200 partitions and a PutAll batch size of 1000 items, that's only 5 per partition, which isn't enough to really get the benefits of batching the puts. A number per partition in the hundreds is probably better.
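A quick back-of-the-envelope check of those numbers (plain arithmetic, not a WXSUtils call):

```java
public class BatchSizing {
    public static void main(String[] args) {
        int partitions = 200;
        int batchSize = 1000;
        // A 1000-item batch spread over 200 partitions is only
        // 5 key/value pairs per partition: too small to batch well.
        System.out.println("pairs per partition: " + batchSize / partitions);
        // To reach ~1000 pairs per partition, the total batch
        // would need to be partitions * 1000 items.
        System.out.println("batch needed: " + partitions * 1000);
    }
}
```

So with 200 partitions, hitting the 1000-pairs-per-partition rule of thumb implies batches on the order of 200,000 items, which is why undersized batches barely benefit from batching at all.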

If you have a grid with 200 partitions, then a thread pool size of 32 isn't great either. You should have at least as many threads as partitions; otherwise you'd overrun the pool instantly when doing any bulk operation. The default pool size for WXSUtils is indeed 32, which is too low for production. You can specify the size of the pool using a constructor parameter or by editing wxsutils.properties if that's how you are connecting to the grid.

I've already updated wxsutils to configure the thread pool as described above, given this recent experience. I have not changed the default thread pool size from 32, but that's something I'm considering too. Maybe I'll default it to one or two times the number of partitions in the grid, bounded by some 'reasonable' number.

Anyway, it's worth pointing out the problem and how it's resolved for everybody's education. Unbounded memory use doesn't require a growable thread pool: even a fixed-size pool can become memory-unbounded pretty easily, as in this case, when it cannot service its queue as fast as items are submitted.

Billy Newport

About Billy Newport

Billy is a Distinguished Engineer at IBM. He's been at IBM since 2001. Billy was the lead on the WorkManager/ Scheduler APIs which were later standardized by IBM and BEA and are now the subject of JSR 236 and JSR 237. Billy lead the design of the WebSphere 6.0 non blocking IO framework (channel framework) and the WebSphere 6.0 high availability/clustering (HAManager). Billy currently works on WebSphere XD and ObjectGrid. He's also the lead persistence architect and runtime availability/scaling architect for the base application server.

Before IBM, Billy worked as an independent consultant at investment banks, telcos, publishing companies and travel reservation companies. He wrote video games in C and assembler on the ZX Spectrum, Atari ST and Commodore Amiga as a teenager. He started programming on an Apple IIe when he was eleven; his first programming language was 6502 assembler.

Billy's current interests are lightweight non-invasive middleware, complex event processing systems and grid-based OLTP frameworks.
