I'm not so sure anymore. I was recently investigating this with a large customer using a big set of Lucene indexes, about 25GB in total. They kept the index on disk and copied it into a RAMDirectory at startup. This gave fast search times, but the JVM heap was very large, well over 40GB, and garbage collection pauses were long because of the large heap.
So, the plan was to store the index in a remote IBM WebSphere eXtreme Scale grid instead. Lucene makes this pretty easy. I was able to write a Lucene directory store for WXS and measure the performance. In theory, we'd be storing the 25GB in a collection of JVMs. Let's say 7 JVMs with heaps in the 9GB range. The data would be replicated as normal, so we're storing about 50GB of data in 63GB of JVM heap. The clients running the Lucene engine and the Lucene plugin for WXS (wxslucene) would now only need a small amount of memory, and everything is great.
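The core idea of a grid-backed directory is to chop each index file into fixed-size blocks and key them by file name and block number, so a read at any offset becomes a lookup of the block containing it. Here's a minimal sketch of that addressing scheme; the names (`GridBlockStore`, `BLOCK_SIZE`) are hypothetical, not the actual wxslucene code, and a plain `HashMap` stands in for the remote WXS map.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: how a grid-backed Lucene directory might key
// index data. A HashMap stands in for the remote grid map.
public class GridBlockStore {
    static final int BLOCK_SIZE = 4096; // assumed block size

    private final Map<String, byte[]> grid = new HashMap<>();

    private static String key(String fileName, long blockNumber) {
        return fileName + "#" + blockNumber;
    }

    // Write one block of an index file to the grid.
    public void putBlock(String fileName, long blockNumber, byte[] data) {
        grid.put(key(fileName, blockNumber), data.clone());
    }

    // Random-access read: find which block holds the byte offset, fetch it.
    public byte[] readBlockAt(String fileName, long offset) {
        long blockNumber = offset / BLOCK_SIZE;
        return grid.get(key(fileName, blockNumber));
    }

    public static void main(String[] args) {
        GridBlockStore store = new GridBlockStore();
        byte[] block = new byte[BLOCK_SIZE];
        block[0] = 42;
        store.putBlock("_0.frq", 1, block);
        // Offset 5000 falls in block 1 (5000 / 4096 == 1).
        System.out.println(store.readBlockAt("_0.frq", 5000)[0]);
    }
}
```

In a real implementation each `grid.get` is a network call to a WXS container, which is exactly where the costs discussed below come from.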
But performance can be a problem. We found we needed a near cache on the Lucene side to get acceptable performance. This shouldn't be a surprise. The cost in the RAMDirectory to find a block is practically zero. Fetching that block from a network-remote JVM simply costs more, and when the baseline is zero, even 2-3ms is a lot slower. It's hard to beat zero.
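Rough, assumed numbers make the point. With the index in heap, per-query block I/O is effectively free; fetch the same blocks over the network at a couple of milliseconds each and the cost scales with every block the query touches:

```java
public class BlockFetchCost {
    // Total block-fetch time for a query, given a per-block latency in ms.
    // Both inputs here are assumptions for illustration, not measurements.
    static double queryIoMs(int blocksTouched, double perBlockMs) {
        return blocksTouched * perBlockMs;
    }

    public static void main(String[] args) {
        int blocksTouched = 5_000; // assumed: blocks a complex query reads
        System.out.printf("RAMDirectory: ~%.0f ms%n", queryIoMs(blocksTouched, 0.0));
        System.out.printf("Remote grid:  ~%.0f ms%n", queryIoMs(blocksTouched, 2.5));
    }
}
```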
We enabled a near cache, which helped significantly with some queries, but other queries touched a lot of blocks in the indexes, and the game then was to make the near cache bigger and bigger to get higher hit rates. This worked for some simpler queries, but more complex queries still caused high near cache miss rates and performed poorly.
We could have simply made the near caches large enough to hold most of the blocks, but that's kind of pointless: the whole point of the exercise was to reduce the amount of memory per client JVM, not use the same as before PLUS the memory requirements of the data grid. We found a pretty dramatic fall-off in performance once the hit rate fell even a small amount from the high 90s. For some of the queries it would be better described as a cliff. Performance was very sensitive to the hit rate being extremely high.
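The cliff is easy to see in the expected cost per block. With a near-cache hit costing roughly nothing and a miss costing a ~2ms grid round trip (both assumed numbers, not measurements from this test), the miss term dominates so completely that dropping from a 99% to a 95% hit rate makes block access about 5x slower:

```java
public class NearCacheCliff {
    // Expected per-block access cost for a given near-cache hit rate.
    // Latencies are assumptions for illustration: ~1us for a near-cache hit,
    // ~2ms (2000us) for a remote grid fetch on a miss.
    static double avgCostMicros(double hitRate) {
        double localMicros = 1;
        double remoteMicros = 2000;
        return hitRate * localMicros + (1 - hitRate) * remoteMicros;
    }

    public static void main(String[] args) {
        for (double h : new double[] {0.999, 0.99, 0.95, 0.90}) {
            System.out.printf("hit rate %.1f%% -> %.0f us per block%n",
                    h * 100, avgCostMicros(h));
        }
    }
}
```

Because the average is dominated by `(1 - hitRate) * remoteMicros`, every point of hit rate lost adds another 20us per block, which is why the queries that touched many blocks fell off a cliff.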
So, in the end, we were unable to get the desired level of performance for the more complex queries, and a competitor reached the same conclusion with their directory implementation. After speaking with Shay Banon (Apache Compass, ElasticSearch, Lucene expert), he basically agreed and said the way to go with Lucene is to shard it and keep the search engine close to the data: run parallel queries across all the shards, keeping the data and the processing within a single box as much as possible, and then combine the results. This is basically what Google does, after all, so it shouldn't be a surprise, but it was a useful exercise.
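That scatter-gather pattern can be sketched like this. The names here (`Shard`, `Hit`) are hypothetical stand-ins: in a real system each shard would wrap a local Lucene `IndexSearcher` sitting next to its slice of the index, and the hits would be Lucene `ScoreDoc`s.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scatter-gather sketch: each shard searches its own local
// data in parallel, then the per-shard top-N lists are merged by score.
public class ScatterGather {
    record Hit(String docId, float score) {}

    interface Shard { List<Hit> search(String query, int topN); }

    static List<Hit> search(List<Shard> shards, String query, int topN)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            // Scatter: one local search per shard, all running in parallel.
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (Shard s : shards) {
                futures.add(pool.submit(() -> s.search(query, topN)));
            }
            // Gather: merge the per-shard results and re-sort by score.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : futures) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble(Hit::score).reversed());
            return merged.subList(0, Math.min(topN, merged.size()));
        } finally {
            pool.shutdown();
        }
    }
}
```

The key property is that each shard only ever reads blocks that live in its own process, so the per-block network fetches that killed the grid directory never happen; the network carries only the small merged result lists.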
I've also seen a Cassandra-based store for Lucene, Lucassandra or something, but it's going to have the same problems as above, except it'll probably be even slower than a memory data grid (don't start benchmarking this, for Pete's sake; moving a byte[] from memory to a socket is faster than anything that's disk-based, as a general rule).
The code for the WXS Lucene directory is here. It may work well for certain types of Lucene query, but I couldn't say that in general it's always better, and I think the same applies to anybody's data grid directory implementation, unless there is something else going on that can be played to.