Wednesday, August 26, 2009

I just realized databases aren't always the way to go.

My molecular database for Bioclipse has been storing the molecules in Blobs inside the database. Since it is based on HSQLDB — in order to be easy to set up — I have been suffering from some scalability problems when the database grows big. Things like starting and stopping the database takes a long time and memory usage is high. Since it has to share memory with the rest of Bioclipse this has led to some trouble...

Anyhow, I have now made some experiments with storing the molecules as cml files outside the database instead of as Blobs inside it. So far I have only got create and retrieve up and running — no delete or update. But this was just what was needed to start doing some benchmarking. I have performed each step three times and things go faster and faster for each time. This is to be expected since the JVM will optimize things that run often.

The first operation is to import the data. I am using the Drugbank small_mol_drugs.sdf file from the Bioclispe drugbank sample data plugin for these tests. It contains about 1000 compunds and is about 7 MB in size. Figure 1 shows that the import time is about the same, maybe a little higher, for the file based approach. This is to be expected since most time is spent calculating fingerprints and such, so no speedup here but at least about the same.

Figure 1: Import time is negligibly higher for the file approach

To really show the speedup I performed a SMARTS query. The query I used was "CC=O", a fairly naive one but as it is mainly the loading we are looking at here that's fine. The SMARTS query method loads CML for each molecule, instantiates a CDK object and performs SMARTS matching using CDK code for that. Figure 2 enlightens the difference in time for that SMARTS query when the molecules are stored in the database and as files.

Figure 2: SMARTS querying time is significantly lower for the file approach

This is where files really pay off. With the help of YourKit I measured how much time was spent on doing the actual SMARTS matching in the two cases. Figure 3 shows that for files we are spending about 70% of the time on SMARTS matching compared to about 10% for the Blob appraoch.


Figure 3: In the files case about 70% of the time is spent doing actual SMARTS matching compared to about 10% when the molecules are stored in the databse

So with the files outside the database Bioclipse is not only more stable and quicker to start the SMARTS querying also performs much better. Now I just have to implement update and delete and make sure the files are stored in many folder enough to keep this approach scaling for huge amounts of data as well.


  1. Hm,

    what about SMARTS search *inside* the database ?

    For a biggish setup you'd go postgres (or such)
    anyway, which can embed Java stuff.


  2. I am considering switching out HSQLDB and use Mychem instead. But that means a complicated setup for the end user I am afraid... It needs to be installed separately.

    Perhaps a feature to migrate an outgrown HSQLDB based database to Mychem could be a way to go in the future.

    Or did you suggest SMARTS matching inside HSQLDB? That might be one way to go but then the molecules really have to be in the database and since that gave memory trouble I am not sure that is the path to success in this case...
    Besides wouldn't it be more or less the same thing? I mean all molecules till has to be read and instntiated first...