Bloom Filters for efficient fraud detection with IBM WebSphere eXtreme Scale

I've written and uploaded a Bloom Filter implementation and uploaded it as part of the wxsutils library I keep on github. You can look at the source code here. If you want to use it with your WebSphere eXtreme Scale project just grab the jar from here and add it to your class path.

A bloom filter is best explained elsewhere in terms of how it works. The use cases for where it's useful are worth going over again here though.

Cassandra uses a bloom filter to track all the keys where are stored on disk in a space efficient data structure, i.e. the bloom filter. It uses this to avoid disk I/Os that would otherwise be used looking up keys definitely not on the disk.

Another more customer friendly example is a fraud detection scenario. Lets say you're a credit card company that has the spending history over 6 months of all your customers. You want to catch purchases to vendors that are not in that history. One way would be to keep all the customer vendor history in a database or in a datagrid and check if a new transaction is for an existing relationship. If not then you'd escalate it for scrutiny. For a large credit card company, this can easily be billions of rows. Lets say 200 million credit cards with 200 vendors each? That could be a lot of memory.

A better approach would be to create a custom bloom filter for each customer. The bloom filter is initialized for the number of actual vendors the customer deals with, maybe 200 vendors on average. Each vendor is then added to the bloom filter. This filter is a byte array when this is done. It's about 256 bytes. Pretty small to store 200 vendor strings, no? This filter can then be loaded in to a DataGrid and when a new transaction comes in, the filter for that customer can check with a 97% accuracy whether the vendor was used before by the customer. This is very accurate and was achieved with a huge memory saving and we can improve the accuracy further at the expense of memory if needed. This is the beauty of bloom filters. This example can be used for:

Tracking unusual telephone numbers for cell phone fraud
Cities/zip codes that the card was used in
Categories of vendor the customer deals with (i.e. electronics is common but not motorsports)
and so on.

Basically any time you want a check a list of things for whether something is present or not present then it's a very memory efficient way to do this.

Another use case would be a security system that wants to track every password every used by a user and try to prevent them using one every again or prevent a customer from using banned passwords. A bloom filter could be used to track existing passwords or the list of banned passwords and allow those lists to be checked quickly without taking a lot of memory or disk space. It has huge advantages in the existing password case as the passwords are not physically stored, the bloom filter is like a fixed size set of password hashes.

IBM WebSphere eXtreme Scale customers can create Maps keyed by customer name whose value is a Bloom Filter. A WXS agent can be used to quickly check if a vendor or some related entity probably not in the 'list' of valid things for that customer. A batch job can generate the bloom filters from the actual vendor data and the grid can be preloaded with the resulting bloom filters periodically. The actual vendor data stays out of the grid.

Security is also improved with a Bloom filter. The grid is not storing any sensitive information. The actual vendors the customer deals with are NOT there. The bloom filter is just a bit array that can be used to check if a vendor is not there, it's not a list of vendors that could be hacked.

Bloom Filters are a very useful data structure to save memory as well as CPU path when checking lists of things for exclusion. Applications can use DataGrids such as IBM WebSphere eXtreme Scale to build extremely scalable, high performance systems exploiting techniques like BloomFilters for fraud detection as well as other scenarios.

Billy Newport's complete blog can be found at: http://www.devwebsphere.com/devwebsphere

About Billy Newport

Billy is a Distinguished Engineer at IBM. He's been at IBM since 2001. Billy was the lead on the WorkManager/ Scheduler APIs which were later standardized by IBM and BEA and are now the subject of JSR 236 and JSR 237. Billy lead the design of the WebSphere 6.0 non blocking IO framework (channel framework) and the WebSphere 6.0 high availability/clustering (HAManager). Billy currently works on WebSphere XD and ObjectGrid. He's also the lead persistence architect and runtime availability/scaling architect for the base application server.

Before IBM, Billy worked as an independant consultant at investment banks, telcos, publishing companies and travel reservation companies. He wrote video games in C and assembler on the ZX Spectrum, Atari ST and Commodore Amiga as a teenager. He started programming on an Apple IIe when he was eleven, his first programming language was 6502 assembler.

Billys current interests are lightweight non invasive middleware, complex event processing systems and grid based OLTP frameworks.

Why Attend the NFJS Tour?

» Cutting-Edge Technologies
» Agile Practices
» Peer Exchange

Current Topics:

Languages on the JVM: Scala, Groovy, Clojure
Enterprise Java
Core Java, Java 8
Agility
Testing: Geb, Spock, Easyb
REST
NoSQL: MongoDB, Cassandra
Hadoop
Spring 4
Cloud
Automation Tools: Gradle, Git, Jenkins, Sonar
HTML5, CSS3, AngularJS, jQuery, Usability
Mobile Apps - iPhone and Android
More...

Learn More »

Bloom Filters for efficient fraud detection with IBM WebSphere eXtreme Scale

Posted by: Billy Newport on

About Billy Newport

Why Attend the NFJS Tour?

Current Topics: