Lucene and Amazon S3

I spent some time trying to have the ability to store Lucene index on Amazon S3 service. Amazon S3 is a really cool idea, and having the ability to store Lucene index on top of it will provide a simple way to allow storing Lucene index in a distributed environment supporting HA. It will also make a lot of sense for applications deployed on Amazon EC2, since working with S3 from EC2 is free.

It was pretty simply to implement Lucene Directory interface on top of Amazon S3. A bucket is considered to be a Lucene index, and each file has one file object that holds its meta data, and 0 or more file objects holding portions of it (naturally, it is configurable). This, with Compass support for such storage, and Compass local cache support, should provide minor performance overhead when switching from local file system to S3.

Even before I embarked on this quick hacking session, the main thing I was concerned about was how to implement locking on top of S3. There is no formal locking API for it, but I heard somewhere that bucket creation is atomic. Assuming that it is, a very simple locking support can be done (creating a bucket and succeeding indicates a lock obtained, failure means it is locked already, deletion of a bucket releases the lock). Sadly, this is not the case and bucket creation is certainly not atomic. Funnily enough, it does not even fail when trying to create an already existing bucket.

So for now I shelved the implementation. It would be great if the good people at Amazon would allow for simple locking support. I understand that this is not simple to do in a distributed environment (hey, I work at GigaSpaces), but it must be there in some form, it will make S3 much a more attractive offer.

Shay Banon's complete blog can be found at: http://www.kimchy.org

About Shay Banon

Shay is the founder of the Compass open source project, a unique solution enabling search capabilities into any application model. He started working on mission critical real time C/C++ systems, later moving to Java (and never looked back). Within the Java world, Shay has worked on a propriety implementation of a distributed rule engine(RETE) server, your typical Java based web projects, and messaging based projects within the financial industry. Currently, Shay is a System Architect at GigaSpaces, GigaSpaces provides a single platform for end-to-end scalability of high performance and stateful distributed applications. GigaSpaces’ unique approach enables developers to Write their business logic Once and then seamlessly Scale out the application linearly Anywhere.

Why Attend the NFJS Tour?

» Cutting-Edge Technologies
» Agile Practices
» Peer Exchange

Current Topics:

Languages on the JVM: Scala, Groovy, Clojure
Enterprise Java
Core Java, Java 8
Agility
Testing: Geb, Spock, Easyb
REST
NoSQL: MongoDB, Cassandra
Hadoop
Spring 4
Cloud
Automation Tools: Gradle, Git, Jenkins, Sonar
HTML5, CSS3, AngularJS, jQuery, Usability
Mobile Apps - iPhone and Android
More...

Learn More »

Posted by: Shay Banon on November 16, 2007

About Shay Banon

Why Attend the NFJS Tour?

Current Topics: