The deployment of an application on the Open Science Grid (OSG)


A few months ago a biology group at the University of Utah contacted the Center for High-Performance Computing (CHPC). Their goal was to predict the MS/MS spectra for a set of 230,737 molecules. As a first step, I installed the CFM-ID code and its dependencies (see below). On our Ember cluster (Intel(R) Xeon(R) CPU X5660 @ 2.80GHz, 2 sockets of 6 cores each, 24 GB of memory) they were able to simulate, on average, the spectra of 69 molecules/hour per node. At that rate the complete molecular set would take about 3,344 node-hours, i.e. 10 Ember nodes for 334 hours. This approach did not look viable or scalable to me, since in the long run they wanted to simulate even larger sets of molecules.
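The estimate above can be reproduced with a few lines of shell arithmetic. This is only a sketch of the back-of-the-envelope calculation; the variable names are mine, and integer division is used, so the results are rounded down.

```shell
#!/usr/bin/env bash
# Numbers taken from the text above; variable names are illustrative.
molecules=230737        # size of the molecular set
rate_per_node=69        # spectra simulated per hour on one Ember node (average)
nodes=10                # number of Ember nodes assumed in the estimate

node_hours=$(( molecules / rate_per_node ))   # total node-hours (integer division)
wall_hours=$(( node_hours / nodes ))          # wall-clock hours when using all nodes

echo "node-hours: ${node_hours}, wall-clock hours on ${nodes} nodes: ${wall_hours}"
# prints: node-hours: 3344, wall-clock hours on 10 nodes: 334
```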

Because the simulations are independent of one another, I thought the Open Science Grid (OSG) would be a better fit. I encouraged the PI to create an OSG project and apply for an OSG account. In the meantime I compiled the CFM-ID code to run on the OSG resources.
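Independent simulations of this kind map naturally onto many small jobs, each working on its own slice of the input. As a minimal sketch, assuming the molecules are listed one per line in a text file (the file name, chunk size and directory below are hypothetical and not taken from the actual project), the input could be cut into per-job chunks like this:

```shell
#!/usr/bin/env bash
# Split an input list (one molecule per line) into fixed-size chunks,
# one chunk per grid job. All names here are illustrative assumptions.
make_chunks() {
    local input="$1" lines_per_chunk="$2" outdir="$3"
    mkdir -p "$outdir"
    # -l: lines per output file; -a 4: four-letter suffixes (chunk_aaaa, chunk_aaab, ...)
    split -l "$lines_per_chunk" -a 4 "$input" "$outdir/chunk_"
}

# Usage (hypothetical file and chunk size):
# make_chunks molecules.txt 50 chunks
```

Each chunk file can then be handed to a separate job, so that a slow or failed slot only affects a small part of the total set.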

The CFM-ID code had the following dependencies: boost, rdkit, lpsolve, liblbfgs, and mpich (even to run serially!). In a controlled environment such as CHPC’s Ember cluster, we do not need to worry about the presence of certain system libraries. On the OSG resources, however, there is no such guarantee.
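One way to see which shared libraries an executable cannot find on a given machine is to inspect the output of ldd. The helper below is a small sketch (the function name is mine); it reads ldd output on standard input and prints only the libraries reported as "not found".

```shell
#!/usr/bin/env bash
# Read `ldd` output on stdin and print the names of the libraries
# that the dynamic loader could not resolve on this machine.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (hypothetical, run from the directory holding the executables):
# ldd ./cfm-predict | missing_libs
```

An empty result means every shared library was resolved on that host; it says nothing about other hosts in the grid.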

Therefore, I decided to create a CFM-ID bundle containing the package and its dependencies. This would allow them to run their simulations on any OSG slot, as long as it ran CentOS 6.

This meant that I had to change the rpath of each executable after building it and its dependencies, which I did as follows (Bash shell):
for i in ./cfm-annotate ./cfm-id ./cfm-id-precomputed ./cfm-predict ./cfm-test ./cfm-train ./compute-stats ./fraggraph-gen; do
    # Set an absolute rpath with patchelf ...
    patchelf --set-rpath "$HOME/CFM-ID/cfm/12162016/lib/liblbfgs:$HOME/CFM-ID/cfm/12162016/lib/lpsolve:$HOME/CFM-ID/cfm/12162016/lib/mpich:$HOME/CFM-ID/cfm/12162016/lib/rdkit" "$i"
    # ... then replace it with paths relative to the executable's own location ($ORIGIN)
    chrpath -r '$ORIGIN/../lib/liblbfgs/lib:$ORIGIN/../lib/lpsolve/lib:$ORIGIN/../lib/mpich/lib:$ORIGIN/../lib/rdkit/lib' "$i"
done
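After rewriting the rpath it is worth confirming that no absolute path is left behind, since an absolute entry would break once the bundle is unpacked elsewhere. The following is a sketch of such a check, not part of the original workflow: the function name is mine, and it only inspects the rpath string, which can be obtained with patchelf --print-rpath.

```shell
#!/usr/bin/env bash
# Return 0 if every colon-separated rpath entry starts with the literal
# string $ORIGIN (i.e. is relative to the executable), 1 otherwise.
rpath_is_relocatable() {
    local rpath="$1" entry
    [ -n "$rpath" ] || return 1           # an empty rpath is treated as a failure
    local IFS=':'
    for entry in $rpath; do
        case "$entry" in
            '$ORIGIN'*) ;;                # relative to the binary: fine
            *) return 1 ;;                # absolute or empty entry: not relocatable
        esac
    done
    return 0
}

# Usage (assumes patchelf is installed; run from the directory with the executables):
# for i in ./cfm-predict ./cfm-id; do
#     rpath_is_relocatable "$(patchelf --print-rpath "$i")" || echo "check rpath of $i"
# done
```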

Finally, I created a compressed tar file that could easily be shipped within the OSG network:
cd $HOME/CFM-ID/cfm
tar -jcvf cfm-id.tar.bz 12162016

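The new bundle could then easily be deployed on CentOS 6-based OSG slots:
mkdir -p "$HOME/Trial" ; cd "$HOME/Trial"
cp -pR "$HOME/CFM-ID/cfm/cfm-id.tar.bz" .
# Name the archive and directory explicitly instead of relying on shell wildcards
tar -jxvf cfm-id.tar.bz
cd 12162016/bin
export PATH="$(pwd):$PATH"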

I performed some initial testing on the OSG resources and found that, on average, only 4 molecular spectra/hour were simulated. However, due to the sheer amount of resources available within the OSG framework, all the molecular spectra were simulated within 12 hours.
Besides the considerable reduction in time to solution, this offered the biology researchers a path to simulating even larger sets of molecular spectra.

In the meantime, Singularity containers have become available within the OSG framework. They would be an even better way to package the CFM-ID code and its dependencies.