Friday, October 30, 2009

I just created yet another little Bioclipse-manager feature

Do you hate it when a long-running job suddenly reports its result while you are deep in work on something else? Bioclipse 1 had a way around that and now so does Bioclipse 2: the result is reported only when the user wants it, by clicking in the progress view.

I have made this possible using an annotation. If you have a manager method taking a BioclipseUIJob that shows the result from the method (run as a job), you can annotate the method (in the interface) with @SilentNotification to have the result pop up only when the user clicks the link in the progress view. I also added a small timeout, so if the job finishes really fast the result pops up at once. The default timeout is half a second, and it can be changed by setting the silentAfter parameter of the annotation. There is also a message parameter with the default value "Job completed, results available!", which is simply the message shown in the progress view when the job is finished.
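A minimal sketch of how such an annotation and its timeout check could look. The parameter names silentAfter and message and the default message come from the description above. The millisecond unit, the INotifyingManager interface and the SilentNotificationDemo helper are assumptions made for illustration and are not the actual Bioclipse code (the BioclipseUIJob parameter is left out to keep the sketch self-contained).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Sketch only: parameter names follow the post, the unit (ms) is assumed.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SilentNotification {
    long silentAfter() default 500;
    String message() default "Job completed, results available!";
}

// Hypothetical manager interface, not part of Bioclipse.
interface INotifyingManager {
    @SilentNotification
    void slowCalculation(String input);

    @SilentNotification(silentAfter = 2000, message = "Done!")
    void verySlowCalculation(String input);

    void plainCalculation(String input);
}

// Hypothetical helper showing how a dispatcher might read the annotation.
final class SilentNotificationDemo {

    // True if the result should pop up at once: either the method is
    // not annotated, or the job finished before the timeout.
    static boolean popsUpAtOnce(String methodName, long elapsedMillis) {
        SilentNotification n = find(methodName);
        return n == null || elapsedMillis < n.silentAfter();
    }

    // The progress view message, or null for methods without annotation.
    static String messageFor(String methodName) {
        SilentNotification n = find(methodName);
        return n == null ? null : n.message();
    }

    private static SilentNotification find(String methodName) {
        try {
            Method m = INotifyingManager.class
                           .getMethod(methodName, String.class);
            return m.getAnnotation(SilentNotification.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                "No such method: " + methodName, e);
        }
    }
}
```

The annotation needs runtime retention here because the check reads it through reflection when the job finishes.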

Thursday, October 15, 2009

I just had to get this out of me

It is Thursday evening. An evening after a day when nothing has worked. And even though I would have preferred writing this with a nice 10 year old Scotch whisky for company, I will not.

On the topic of a structure database

In an already too distant past a project for a molecular structure database was started. It got the name StructureDB, for lack of a better name, and StructureDB it has remained. From the beginning the idea was to have a system that was easy to get started with: no fancy installation and stuff, just fire it up and start playing around. So we settled for HSQLDB. However, we also wanted a server version, which we thought we were gonna use MySQL for. So far so good. Then we started designing a fancy model with auditing and annotations and stuff, because those things are a must-have for a big system. In order to do auditing we needed users, different users that is -- which were to log in to the system. So the model turned out to be something like this:

Fast forward to today. As I said, I spent the whole day struggling with things that didn't work. Now, struggling with things that don't work is business as usual. What was different today was that the things that didn't work were things that had been working fine up until today. At least that is what I thought. One of the things I was fighting with was creating a default 'local' user for each new database instance and how to keep the auditing correct with regard to who created this user. I was trying to make it so that it created itself. This had been working fine earlier but seemed to be misbehaving in some cases, and while I was messing around, exploring possibilities for how to solve it, the whole thing literally came crashing down upon me. The last thing that happened before I went home was that it somehow used up 500MB of memory while loading a 5MB file into memory.

Anyway, I gave up and thought something like it's clearly a bad day and I better sleep on this. However during the bike ride home a voice in my head told me:
'-You are doing it wrong!'
'Why?' I asked the voice. What was I doing wrong?

Have you spotted it yet? Well, I will tell you now. StructureDB today is a one-user system running locally on one client. There is absolutely no need for users, and no need for auditing. There simply is no point in being able to go: "This is wrong, who did this?" to the system because the answer is always gonna be: "You did it!". Furthermore, the fancy ChoiceAnnotation based on predefined values is probably not the way we want to work with molecules either. Normally we import the molecules from something like a huge SDF file and then we do searches on them. Maybe we calculate some properties for them and store them. But there is no reason not to simply use text fields or number fields for this. We don't need a predefined enumeration of valid property values.

The only reason for keeping the user system was for that distant day when we were going to run this all on a MySQL installation and share data among many clients. But that day is looking more and more distant to me.

So I need to bring out the red pen and cross out some things from that diagram:

Now this should make a lot of things a lot easier. Whenever I find the time to get back to this, that is...

...and who knows, it wouldn't surprise me if it actually brings that MySQL day a bit closer too!

Wednesday, August 26, 2009

I just realized databases aren't always the way to go.

My molecular database for Bioclipse has been storing the molecules in Blobs inside the database. Since it is based on HSQLDB (in order to be easy to set up) I have been suffering from some scalability problems when the database grows big. Things like starting and stopping the database take a long time and memory usage is high. Since it has to share memory with the rest of Bioclipse this has led to some trouble...

Anyhow, I have now made some experiments with storing the molecules as CML files outside the database instead of as Blobs inside it. So far I have only got create and retrieve up and running, with no delete or update. But this was just what was needed to start doing some benchmarking. I have performed each step three times and things go faster each time. This is to be expected since the JVM will optimize things that run often.

The first operation is to import the data. I am using the Drugbank small_mol_drugs.sdf file from the Bioclipse drugbank sample data plugin for these tests. It contains about 1000 compounds and is about 7 MB in size. Figure 1 shows that the import time is about the same, maybe a little higher, for the file based approach. This is to be expected since most of the time is spent calculating fingerprints and such, so no speedup here, but at least it is about the same.

Figure 1: Import time is negligibly higher for the file approach

To really show the speedup I performed a SMARTS query. The query I used was "CC=O", a fairly naive one, but as it is mainly the loading we are looking at here that's fine. The SMARTS query method loads CML for each molecule, instantiates a CDK object and performs SMARTS matching using CDK code for that. Figure 2 shows the difference in time for that SMARTS query when the molecules are stored in the database and as files.

Figure 2: SMARTS querying time is significantly lower for the file approach

This is where files really pay off. With the help of YourKit I measured how much time was spent on doing the actual SMARTS matching in the two cases. Figure 3 shows that for files we are spending about 70% of the time on SMARTS matching compared to about 10% for the Blob approach.


Figure 3: In the files case about 70% of the time is spent doing actual SMARTS matching compared to about 10% when the molecules are stored in the database

So with the files outside the database, Bioclipse is not only more stable and quicker to start; the SMARTS querying also performs much better. Now I just have to implement update and delete, and make sure the files are spread over enough folders to keep this approach scaling for huge amounts of data as well.
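One common way to spread files over many folders is to derive nested subfolder names from a hash of the file name. The sketch below only illustrates that idea. The class name ShardedCmlStore, the two-level layout and the use of String.hashCode are assumptions of mine, not what StructureDB actually does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical store: spreads CML files over up to 256 * 256 folders.
final class ShardedCmlStore {

    private final Path root;

    ShardedCmlStore(Path root) {
        this.root = root;
    }

    // Maps a molecule id to root/xx/yy/<id>.cml, where xx and yy are
    // two bytes of the id's hash code written as hexadecimal.
    Path pathFor(String moleculeId) {
        int h = moleculeId.hashCode();
        String first  = String.format("%02x", (h >>> 8) & 0xff);
        String second = String.format("%02x", h & 0xff);
        return root.resolve(first)
                   .resolve(second)
                   .resolve(moleculeId + ".cml");
    }

    // Writes the CML text, creating the subfolders when needed.
    Path save(String moleculeId, String cml) throws IOException {
        Path target = pathFor(moleculeId);
        Files.createDirectories(target.getParent());
        Files.write(target, cml.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

Because the folder is computed from the id alone, retrieving a molecule needs no directory listing: pathFor gives the location directly.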

Wednesday, August 12, 2009

I just created two small dialogs for Bioclipse

I have made two dialogs for picking molecules from the Bioclipse workbench - PickMoleculeDialog and PickMoleculesDialog. The first is for picking one molecule: it only shows files containing one molecule and allows picking one of them. The other is for picking many molecules: it shows files containing one or many molecules, and one or many files can be selected. Both dialogs can be resized and the tree viewer rescales to fit in the resized dialog.

Wednesday, April 15, 2009

I just came up with yet-another-way-of-making-a-Bioclipse-Manager

There are a few things that are not so very nice with the way of implementing a manager called "The New World Order". Before you all give up, muttering something about things changing all the time, I want to take this opportunity to say that it is not so easy for a bear of very little brain and I need a few iterations to get things decent. Furthermore, I want to say that I am not forcing you to do your managers in a certain way and that there is nothing stopping you from doing your manager without Spring and all my fancy inventions -- of course you won't get recording, automagic job creation, translation from String to IFile and all that stuff, you will have to do it yourself. Nevertheless, if that is what you want I won't stop you.

Now for the list of things that are not optimal with "The New World Order".
  1. Most importantly. All the methods defined in the interface but not implemented in the manager. Yes those pesky "This manager method should not have been called"-ones.
  2. Ola raised the problem of calling multiple long running jobs in parallel from one job and then waiting for all of them to finish. Something I can definitely see would be useful when doing things like QSAR calculations on any computer with more than one core (basically that means any machine these days...).

So much for the background, here comes the suggestion. First of all, in order to get rid of all the methods that we don't want to implement, the manager will not implement the manager-interface -- no more XManager implements IXManager. When doing managers in this way the coupling between the manager and the interface is loose. The actual dispatching of methods would be done by a MethodInterceptor which would catch all method calls on the manager and call the right method with the right arguments.

Basically I see the need for three different sorts of methods when dealing with long running operations. I will show this with two examples. First a method receiving a BioObject and returning another BioObject.

Methods on the interface

public IMolecule
    generate3dCoordinates( IMolecule molecule );

public void
    generate3dCoordinates( IMolecule molecule,
                           BioclipseUIJob uiJob );

public BioclipseJob
    generate3dCoordinates( IMolecule molecule,
                           String jobName );

First we have the standard method that will be used from JavaScript and that will run in the GUI thread (freezing Bioclipse) if called from Java. Next is another old friend but in a slightly different appearance. This is the method used when writing actions in context menus, for example. It is void (we can't hang around waiting for the result) and creates a Job. The code updating the GUI afterwards is given through the BioclipseUIJob -- in the method named runInUI. Finally we have a new friend. This method returns a job and is meant to be used from other manager methods. For example, imagine a method in some fancy Manager. Our method needs to generate 3d coordinates for a bunch of molecules and then do some fancy calculations on all of them. The idea is that you can write it a little bit like this:

BioclipseJob job1
    = cdk.generate3dCoordinates(mol1, "first job");
BioclipseJob job2
    = cdk.generate3dCoordinates(mol2, "second job");
IMolecule mol1With3d = job1.getResult();
IMolecule mol2With3d = job2.getResult();

So each cdk.generate3dCoordinates call will be a job of its own. By the way, the String jobName parameter must be there or the third method's signature will clash with the first one.
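The same start-both-then-wait pattern can be sketched with plain java.util.concurrent classes. Here Future stands in for BioclipseJob and a simple string transformation stands in for the 3d coordinate generation. Both are placeholders for illustration, not the Bioclipse API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelJobsSketch {

    // Placeholder for a long running manager method.
    static String generate3dCoordinates(String molecule) {
        return molecule + " (3D)";
    }

    // Starts both calculations, then waits for both results.
    static String[] runBoth(String mol1, String mol2) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<String> task1 = () -> generate3dCoordinates(mol1);
            Callable<String> task2 = () -> generate3dCoordinates(mol2);

            // Both jobs are submitted before either result is requested,
            // so they are free to run at the same time.
            Future<String> job1 = pool.submit(task1);
            Future<String> job2 = pool.submit(task2);

            // get() blocks until the corresponding job has finished.
            return new String[] { job1.get(), job2.get() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The important detail is the ordering: asking for the first result before starting the second job would run the two calculations one after the other.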

All this will be accomplished by one method on the actual manager implementation. It will look something like this:

Method in the Manager class

public IMolecule
    generate3dCoordinates( IMolecule molecule,
                           IProgressMonitor monitor );

All the translation from the methods declared in the interface to this method will be done by the MethodInterceptor which of course comes in one flavor for Java and one for JavaScript.
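To make the dispatching idea concrete, here is a minimal sketch that uses a JDK dynamic proxy in place of the MethodInterceptor mentioned above. The interface, the manager class, the ProgressMonitor placeholder and the rule "append a monitor argument and look up the method by name" are all assumptions for illustration. A real interceptor would also have to handle the BioclipseUIJob and BioclipseJob variants.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Placeholder for IProgressMonitor.
final class ProgressMonitor { }

// Hypothetical interface: this is what callers see.
interface ICoordinateManager {
    String generate3dCoordinates(String molecule);
}

// The manager does NOT implement the interface. It only has the
// single method that takes the extra monitor argument.
final class CoordinateManager {
    public String generate3dCoordinates(String molecule,
                                        ProgressMonitor monitor) {
        return molecule + " (3D)";
    }
}

final class DispatchSketch {

    static ICoordinateManager wrap(Object manager) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Let toString, equals and hashCode go to the manager.
            if (method.getDeclaringClass() == Object.class) {
                return method.invoke(manager, args);
            }
            Object[] in = args == null ? new Object[0] : args;

            // Target signature: interface parameters plus a monitor.
            Class<?>[] types = method.getParameterTypes();
            Class<?>[] targetTypes = new Class<?>[types.length + 1];
            System.arraycopy(types, 0, targetTypes, 0, types.length);
            targetTypes[types.length] = ProgressMonitor.class;

            Object[] targetArgs = new Object[in.length + 1];
            System.arraycopy(in, 0, targetArgs, 0, in.length);
            targetArgs[in.length] = new ProgressMonitor();

            Method target = manager.getClass()
                                   .getMethod(method.getName(), targetTypes);
            try {
                return target.invoke(manager, targetArgs);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the manager's own exception
            }
        };
        return (ICoordinateManager) Proxy.newProxyInstance(
            ICoordinateManager.class.getClassLoader(),
            new Class<?>[] { ICoordinateManager.class },
            handler);
    }
}
```

With this in place, DispatchSketch.wrap(new CoordinateManager()).generate3dCoordinates("mol1") ends up in the two-argument method on the manager, even though the manager class never declares the one-argument version.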

Let's look at one more example. This one is for working with files and contains the String -> IFile conversion.

Methods on the interface

public IMolecule loadMolecule( String path );

public void loadMolecule( IFile file,
                          BioclipseUIJob uiJob );

public BioclipseJob loadMolecule( IFile file,
                                  String jobName );

This is basically business as usual by now. The String -> IFile conversion would also be done by the very same MethodInterceptor.

Method in the Manager class

public IMolecule loadMolecule( IFile file,
                               IProgressMonitor monitor );

So, what does this solution lack? :)

Tuesday, April 7, 2009

I just wish I could reach the Eclipse help articles writers with this post

Today I printed an Eclipse help article and read it. On page 10 out of 13 I found what I was interested in. Some code. But the dabbler who wrote the code used a line width of something like 94 columns or so, and of course the printed version does not show the end of such crazy long rows. So the thing in the article that I was actually interested in was lost when I printed it.

Now if only people could keep their code within a limited number of columns. That would not only make it possible to print the code, but also read the code on screen without having to scroll horizontally.

Friday, April 3, 2009

I just need to make sure all the manager implementations are up to date now I suppose

I am beginning to feel happy with the Bioclipse manager documentation on the Bioclipse wiki. There are probably a zillion small errors and things to add to it, but at least it contains what I right now think are the most important parts. Now I must make sure that all managers that are going to be included in the Bioclipse 2 release candidate are implemented according to this scheme. Time to file bug reports! :)

Wednesday, April 1, 2009

I just started to (finally) write the documentation for Bioclipse managers

I have been putting it off for far too long now. But now (finally) I have started to write the text on How to make a manager on the Bioclipse wiki.

I kept telling myself that there was no point in doing it since I want to modify the process anyway. I guess that will have to be post Bioclipse 2.0 though. The main problem with the current solution is that it contains methods with this not so very nice look:
public void remove( String filePath ) {
    throw new IllegalStateException("This method should not be called");
}

The only way to get rid of these methods is to not have the manager class implement the manager interface, though. Although this sounds a bit strange I am sure it can be done, but I am equally sure that it is not something that should be given time / energy right now.

Tuesday, March 31, 2009

I just made a big commit on Jmol stuff for Bioclipse

During the last couple of weeks I have given the Jmol editor in Bioclipse some love. Today I made a big commit. It all started when I tried opening a multi-model PDB file in it and started figuring out how to browse it using the outline. That was very unintuitive -- and still is, because I haven't fixed that yet... :)

What I have fixed is only the first thing that felt strange about the Jmol editor. I have been working on synchronizing selections between the outline and the Jmol editor. The Jmol outline in Bioclipse comes in two flavours depending on what is being browsed. So far I have synchronized the one shown when looking at a small molecule, so when selecting in Jmol the corresponding atom in the outline is selected. As for the other version of the outline, I have so far only made Jmol select groups instead of atoms by default when clicked on.

In the future we probably want buttons for switching between different ways to select things in Jmol. Much more of what can so far only be done using Jmol scripts could be made easily available from the Bioclipse graphical user interface.