"I just need a name I think"

Wednesday, January 19, 2011

I just got the cdk manager into bsl-runner

Finally some real progress. Today I got the cdk-manager running inside bsl-runner. To call it unstable would be an understatement, but from time to time it is there, and when it is, it is possible to create a molecule and run some cdk manager methods on it. For the future I want to get a nightly build that produces something runnable for the project up and running on http://pele.farmbio.uu.se/hudson/. I should also start working on a test suite for the project. So far it has been along the lines of "I wonder what is needed to get this running", but now that it actually runs it is time to get some tests covering the functionality, so that I can see when I break things. However, since the Bioclipse project is nearing its next release I will probably have to start spending my time on getting the next version out, and the bsl-runner is not Bioclipse 2.6 stuff. Not nearly...


P. S. Oh, if you are crazy enough to want to try and run this, check out all the HEADLESS branches (I think core and cheminformatics might be enough actually) and start the bsl-runner product found in the bsl-runner repo. The feature is not yet complete, so you will have to add all required plugins using the run configuration dialog's 'Add required' button.

Friday, January 14, 2011

I just thought I should show you...

Hm, that title is not quite true actually. It was more along the lines of egonw++ asking me to blog about Bioclipse headless. He told me that I should be friendly to potential early adopters. I told him there isn't really much to blog about yet. Which got him asking: "It compiles, no?", at which point I decided to give up. I did put up some more of a fight just for the looks of it, but he had me. So what exactly is this all about then?

I have been working on the headless branch lately (actually branches, because a lot of our repos now have headless branches). The idea is to make the Bioclipse scripting environment, known as BSL, accessible from the command line. The first step has been to refactor away all GUI dependencies from the core Bioclipse plugins, and I am far from finished with that. I have played a little with a command line loop for interactive coding, and what exists today is just the JavaScript loop made accessible from the command line. It fires up a lot of Bioclipse stuff in the background, and my hope is that publishing the very first Bioclipse manager into it is within reach. But to all of you early adopters out there (yeah, my best guess is that means you, Egon, but anyone else reading, feel free to prove me wrong :) the only thing you can expect to see if you manage to get the thing running is a JavaScript prompt looking something like this:

Wednesday, April 21, 2010

I just found an awesome way to merge git patches with opendiff

Have you ever been in the situation where someone sent you a patch and you want to take some things from it, but probably not all, and you definitely want to check each change before accepting it? If you have worked with git, of course you have.

There are many ways of doing this. However, maybe you have a favorite merge tool? For me that is opendiff, which can show my old version to the left, the new version I got from someone to the right, and the current merge of them at the bottom. I can also edit the bottom one and write whatever I want there. In other words, an awesome tool for doing manual merging and adopting some but not all of the changes that have been sent to you. It took me a while (and some help from #git on freenode) to figure out how to use opendiff for this, but in the end it was well worth it.

This is how you can do it:

1. Create a new branch and apply the patch. Copy the hash of the commit you get.

2. Check out the branch you want to apply the merged commit to.

3. Run: git difftool -t opendiff <hash> (with <hash> replaced by the hash you copied)

This will open the opendiff tool for your original compared to the new version, and when saving you can overwrite your original files and commit as usual.

Friday, October 30, 2009

I just created yet another little Bioclipse-manager feature

Do you hate it when that long-running job suddenly reports its result while you are deep in work with something else? Bioclipse 1 had it and now so does Bioclipse 2: result reporting only when the user wants it, by clicking in the progress view.

I have made this possible using an annotation. If you have a manager method taking a BioclipseUIJob that shows the result from the method (run as a job), then you can annotate the method (in the interface) with @SilentNotification to have the result pop up only when the user clicks the link in the progress view. I also added a little timeout: if the job finishes faster than the timeout, the result pops up at once. The default timeout is half a second, and it can be changed by setting the silentAfter parameter of the annotation. There is also a message parameter with the default value "Job completed, results available!", which is simply the message shown in the progress view when the job is finished.
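As a rough sketch of the idea (not the actual Bioclipse code), an annotation like this could be declared and read through reflection as shown here. Only the annotation name, the silentAfter and message parameters, and the default message come from the description above. The millisecond unit, the IExampleManager interface, and the helper class are assumptions made up for illustration, and the real manager methods would take a BioclipseUIJob rather than plain strings.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Illustrative re-declaration; the real annotation lives in Bioclipse.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SilentNotification {
    // Assumption: the value is in milliseconds (the post only says "half a second").
    long silentAfter() default 500;
    String message() default "Job completed, results available!";
}

// Hypothetical manager interface used only for this sketch.
interface IExampleManager {
    @SilentNotification(silentAfter = 2000, message = "Similarity search done")
    void slowSearch(String query);

    @SilentNotification
    void quickLookup(String name);

    void plainMethod();
}

class SilentNotificationDemo {
    /** Looks up a method of the hypothetical manager interface by name. */
    static Method find(String name) {
        for (Method m : IExampleManager.class.getMethods()) {
            if (m.getName().equals(name)) {
                return m;
            }
        }
        throw new IllegalArgumentException("No such method: " + name);
    }

    /** True if the result should be shown immediately instead of waiting for a click. */
    static boolean popUpAtOnce(Method method, long elapsedMillis) {
        SilentNotification a = method.getAnnotation(SilentNotification.class);
        if (a == null) {
            return true; // not annotated: show the result right away
        }
        return elapsedMillis < a.silentAfter();
    }

    /** Message for the progress view, or an empty string if the method is not annotated. */
    static String messageFor(Method method) {
        SilentNotification a = method.getAnnotation(SilentNotification.class);
        return a == null ? "" : a.message();
    }

    public static void main(String[] args) {
        Method quick = find("quickLookup");
        System.out.println(popUpAtOnce(quick, 120));  // finished within the default timeout
        System.out.println(popUpAtOnce(quick, 3000)); // took longer, so wait for the click
        System.out.println(messageFor(quick));
    }
}
```

The annotation needs runtime retention for the reflection lookup to see it, which is why the sketch sets RetentionPolicy.RUNTIME explicitly.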

Thursday, October 15, 2009

I just had to get this out of me

It is Thursday evening. An evening after a day when nothing has worked. And even though I would have preferred writing this with a nice 10 year old Scotch whisky in my company, I will not.

On the topic of a structure database

In an already too distant past a project for a molecular structure database was started. It got the name StructureDB, for lack of a better name, and StructureDB it has remained. From the beginning the idea was to have a system that was easy to get started with, no fancy installation and stuff, just fire it up and start playing around. So we settled for HSQLDB. However, we also wanted a server version, which we thought we were going to use MySQL for. So far so good. Then we started designing a fancy model with auditing and annotations and stuff, because those things are a must-have for a big system. In order to do auditing we needed users, different users that is, who were to log in to the system. So the model turned out to be something like this:



Fast forward to today. As I said, I spent the whole day struggling with things that didn't work. Now, struggling with things that don't work is business as usual. What was different today was that the things that didn't work were things that had been working fine up until today. At least that is what I thought. One of the things I was fighting with was creating a default 'local' user for each new database instance and how to keep the auditing correct with regard to who created this user. I was trying to make it so that it created itself. This had been working fine earlier but was misbehaving in some cases, it seemed, and while I was messing around, exploring possibilities for how to solve it, the whole thing literally came crashing down upon me. The last thing that happened before I went home was that it somehow used up 500MB of memory while loading a 5MB file into memory.

Anyway, I gave up and thought something like it's clearly a bad day and I better sleep on this. However during the bike ride home a voice in my head told me:
'-You are doing it wrong!'
'Why?' I asked the voice. What was I doing wrong?

Have you spotted it yet? Well, I will tell you now. StructureDB today is a one-user system running locally on one client. There is absolutely no need for users, and no need for auditing. There simply is no point in being able to go: "This is wrong, who did this?" to the system, because the answer is always going to be: "You did it!". Furthermore, the fancy ChoiceAnnotation based on predefined values is probably not the way we want to work with molecules either. Normally we import the molecules from something like a huge SDF file and then we do searches on them. Maybe we calculate some properties for them and store them. But there is no reason not to simply use text fields or number fields for this. We don't need a predefined enumeration of valid property values.

The only reason for keeping the user system was for that distant day when we were going to run this all on a MySQL installation and share data among many clients. But that day is looking more and more distant to me.

So I need to bring out the red pen and cross out some things from that diagram:



Now this should make a lot of things a lot easier. Whenever I find the time to get back to this, that is...

...and who knows, it wouldn't surprise me if it actually brings that MySQL day a bit closer too!

Wednesday, August 26, 2009

I just realized databases aren't always the way to go.

My molecular database for Bioclipse has been storing the molecules in Blobs inside the database. Since it is based on HSQLDB, in order to be easy to set up, I have been suffering from some scalability problems when the database grows big. Things like starting and stopping the database take a long time and memory usage is high. Since it has to share memory with the rest of Bioclipse this has led to some trouble...

Anyhow, I have now made some experiments with storing the molecules as CML files outside the database instead of as Blobs inside it. So far I have only got create and retrieve up and running, with no delete or update. But this was just what was needed to start doing some benchmarking. I have performed each step three times and things go faster each time. This is to be expected since the JVM will optimize things that run often.

The first operation is to import the data. I am using the Drugbank small_mol_drugs.sdf file from the Bioclipse drugbank sample data plugin for these tests. It contains about 1000 compounds and is about 7 MB in size. Figure 1 shows that the import time is about the same, maybe a little higher, for the file-based approach. This is to be expected since most of the time is spent calculating fingerprints and such, so no speedup here, but at least about the same.




Figure 1: Import time is negligibly higher for the file approach



To really show the speedup I performed a SMARTS query. The query I used was "CC=O", a fairly naive one, but as it is mainly the loading we are looking at here that's fine. The SMARTS query method loads CML for each molecule, instantiates a CDK object and performs SMARTS matching using CDK code for that. Figure 2 illustrates the difference in time for that SMARTS query when the molecules are stored in the database and as files.




Figure 2: SMARTS querying time is significantly lower for the file approach



This is where files really pay off. With the help of YourKit I measured how much time was spent on doing the actual SMARTS matching in the two cases. Figure 3 shows that for files we are spending about 70% of the time on SMARTS matching compared to about 10% for the Blob approach.


 

Figure 3: In the files case about 70% of the time is spent doing actual SMARTS matching compared to about 10% when the molecules are stored in the database



So with the files outside the database, Bioclipse is not only more stable and quicker to start, the SMARTS querying also performs much better. Now I just have to implement update and delete, and make sure the files are spread over enough folders to keep this approach scaling for huge amounts of data as well.
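One possible way to spread the files over many folders, purely as a sketch and not what StructureDB actually does, is to derive two folder levels from a numeric molecule id, so that no single directory grows beyond a manageable number of entries. The numeric id, the class name MoleculeFileLayout, and the two-level layout of 256 folders each are all assumptions chosen for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: maps a molecule id to a CML file path below a root folder.
class MoleculeFileLayout {

    /**
     * Uses the lowest byte of the id for the first folder level and the next
     * byte for the second level, giving 256 * 256 folders in total.
     */
    static Path pathFor(Path root, long moleculeId) {
        int level1 = (int) (moleculeId & 0xFF);
        int level2 = (int) ((moleculeId >> 8) & 0xFF);
        return root.resolve(String.format("%02x", level1))
                   .resolve(String.format("%02x", level2))
                   .resolve(moleculeId + ".cml");
    }

    public static void main(String[] args) {
        Path root = Paths.get("structures");
        // Consecutive ids land in different first-level folders.
        System.out.println(pathFor(root, 1));
        System.out.println(pathFor(root, 2));
        System.out.println(pathFor(root, 258));
    }
}
```

Using the low-order bits first means that sequentially assigned ids are distributed evenly from the very first import, rather than filling one folder before moving on to the next.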

Wednesday, August 12, 2009

I just created two small dialogs for Bioclipse



I have made two dialogs for picking molecules from the Bioclipse workbench: PickMoleculeDialog and PickMoleculesDialog. The first is for picking a single molecule; it only shows files containing one molecule and allows picking one of them. The second is for picking many molecules; it shows files containing one or many molecules, and one or many files can be selected. Both dialogs can be resized, and the tree viewer rescales to fit the resized dialog.
