EPIC - Sun Grid Engine Integration with Globus Toolkit 3
Introduction
This page describes how to configure a GT3 server to be able to submit jobs
to a local Sun Grid Engine installation.
Prerequisites
This installation guide assumes that you have the following already configured and working:
- A server containing:
  - A properly configured Globus Toolkit 3 installation. Specifically, you
    should already be able to run jobs on your server using the
    managed-job-globusrun command with the Master Managed Job Factory
    Service (MMJFS) and the 'Fork' jobmanager backend.
  - A properly configured Sun Grid Engine installation. Specifically, you
    should be able to run tasks on your SGE cluster using the qsub
    command.
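As a quick sanity check before starting, you may want to confirm that the two commands named above can be found on your PATH. The following is a minimal sketch only; the function name check_commands is made up for this example and is not part of GT3 or SGE. It checks only that the commands exist, not that they can actually run jobs.

```shell
#!/bin/sh
# Report which of the given commands cannot be found on PATH.
# Prints one "missing: NAME" line per absent command and returns
# the number of missing commands as the exit status (0 = all found).
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example, using the two commands named in the prerequisites:
#   check_commands managed-job-globusrun qsub
```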
Installation
Overview
The installation of the SGE integration software can be broken down into
several steps: downloading the software, building it on your GT3 server,
updating the configuration of the server and user hosting environment, and
finally testing to ensure it operates correctly.
Download
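You will need to download the three software packages which make up the glue
between the Globus Toolkit and the Sun Grid Engine; these are:
- globus_gram_job_manager_setup_sge-0.11.tar.gz
- mmjfs_sge_setup-0.0.tar.gz
- mjs_sge_setup-0.0.tar.gz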
The first of these contains the actual Perl glue code, which generates the SGE
job script from the RSL specification passed to the managed-job-globusrun
command on the client, submits it using qsub, and monitors its state after it
has been submitted.
The second package contains configuration information which will add the
MasterSGEManagedJobFactoryService to your GT3 installation. It is this service
that the managed-job-globusrun script will connect to when it wishes to
execute a job using SGE.
Finally, the last package will add the SGEManagedJobFactoryService to your
GT3 installation. This service provides the SGE job execution services that
will be used by the MMJFS.
Upgrading from a previous release
Before installing an updated version of the SGE job manager packages, you should first uninstall any existing version using the following command:
% gpt-uninstall globus_gram_job_manager_setup_sge
Build and install
To build and install the SGE jobmanager components, run the following
commands in the temporary directory containing the downloaded packages, as
the user who owns the Globus installation:
% gpt-build globus_gram_job_manager_setup_sge-0.11.tar.gz
% gpt-build mmjfs_sge_setup-0.0.tar.gz
% gpt-build mjs_sge_setup-0.0.tar.gz
This will build the three source components and prepare them for
installation into the GT3 deployment. Once they have finished, you should run:
% gpt-postinstall
to update the live configuration of the GT3 installation.
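If you prefer to script the build, the steps above can be sketched as a small shell function that stops at the first failed build and only then runs gpt-postinstall. This is an illustration, not part of the packages: the function names and the DRY_RUN switch are inventions of this example. With DRY_RUN=1 the commands are printed rather than executed.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
# DRY_RUN is a convenience invented for this sketch, not a gpt option.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build the three packages in the order listed above, stopping at the
# first failure, then update the live configuration.
build_sge_packages() {
    for pkg in \
        globus_gram_job_manager_setup_sge-0.11.tar.gz \
        mmjfs_sge_setup-0.0.tar.gz \
        mjs_sge_setup-0.0.tar.gz
    do
        run gpt-build "$pkg" || return 1
    done
    run gpt-postinstall
}
```

Setting DRY_RUN=1 before calling build_sge_packages lets you review the four commands before running them for real.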
Custom configuration
The gpt-build tool does not allow you to customise the configuration of the SGE package from the command line. If you wish to override any of the default settings for the jobmanager, you will need to run the configuration program again after installation. Run:
% ${GLOBUS_LOCATION}/setup/globus/setup-globus-job-manager-sge --help
for a list of configuration options.
Propagate configuration changes
Although the gpt tools have updated the server-config.wsdd in the
${GLOBUS_LOCATION} directory, the User Hosting Environment (UHE) spawned by
the MMJFS for each user will still have an out-of-date copy in that user's
${HOME}/.globus/uhe-'hostname'/ directory. To remove the old configuration
for a user, simply remove that user's uhe directory:
% rm -Rf $HOME/.globus/uhe-*/
You will also need to restart the GT3 server and UHEs so that the changes
will take effect; they only read their configuration files on startup.
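When several users have cached UHE directories, the removal step can be wrapped in a function that takes a home directory as its argument. The sketch below is an assumption-laden convenience, not part of GT3: the name clear_uhe is made up, and it assumes you have permission to delete files under the directory you pass in.

```shell
#!/bin/sh
# Remove the cached UHE directories under one home directory.
# $1: a user's home directory (defaults to the caller's $HOME).
clear_uhe() {
    home_dir=${1:-$HOME}
    for dir in "$home_dir"/.globus/uhe-*/; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -d "$dir" ] || continue
        echo "removing $dir"
        rm -Rf "$dir"
    done
}

# Example: clear the UHE cache for the current user.
#   clear_uhe
```

Only directories whose names begin with uhe- are touched; anything else under .globus is left alone.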
Testing
Once you have restarted your GT3 server with the new configuration and
services, you can test it using the managed-job-globusrun tool and test
job script provided by GT3. After setting ${GLOBUS_LOCATION} in your
environment, source the GT3 environment configuration script:
% source ${GLOBUS_LOCATION}/etc/globus-user-env.csh
(for csh users)
$ . ${GLOBUS_LOCATION}/etc/globus-user-env.sh
(for bash users)
With that done, you can now run the test job:
% managed-job-globusrun -factory HOSTNAME:PORT -type SGE -file ${GLOBUS_LOCATION}/etc/test.xml -output
(Please substitute the hostname and port of your GT3 service for
HOSTNAME and PORT above. Run managed-job-globusrun -help if you need to
see the full list of command-line options.)
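To avoid typing mistakes when substituting HOSTNAME and PORT, the command line can be assembled by a small helper that checks its arguments first. This is a sketch only: sge_test_command is a made-up name, and the host and port in the example are placeholders, not values from your installation. The helper prints the command rather than running it.

```shell
#!/bin/sh
# Print the test command line for a given factory host and port.
# Usage: sge_test_command HOST PORT
sge_test_command() {
    host=${1:-} port=${2:-}
    if [ -z "$host" ]; then
        echo "host must not be empty" >&2
        return 2
    fi
    case $port in
        ''|*[!0-9]*)
            echo "port must be a number, got '$port'" >&2
            return 2
            ;;
    esac
    if [ -z "${GLOBUS_LOCATION:-}" ]; then
        echo "GLOBUS_LOCATION is not set" >&2
        return 2
    fi
    echo "managed-job-globusrun -factory ${host}:${port} -type SGE" \
         "-file ${GLOBUS_LOCATION}/etc/test.xml -output"
}

# Example (placeholder host and port):
#   sge_test_command grid.example.org 8080
```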
This command will read in the XML-encoded RSL job specification in the file
test.xml and will submit it to the MasterSGEManagedJobFactoryService on your GT3
server. This will create a new SGEManagedJobService and feed the job
specification to the SGE script generator which will generate and submit a new
job for execution on your SGE cluster.
The SGE MJS will poll the state of your job using qstat. Once it has
determined that the job has completed, it will return the standard output of
the test script to the managed-job-globusrun program, which will print that
output on the user's terminal.
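The polling idea can be illustrated from the shell. The sketch below is not the Perl code used by the MJS; it only shows the general pattern. It assumes that qstat -j JOB_ID exits with a non-zero status once the job has left the queue, which you should verify against your own SGE release. The name wait_for_job is made up for this example.

```shell
#!/bin/sh
# Block until the job has left the queue, checking every INTERVAL seconds.
# Usage: wait_for_job JOB_ID [INTERVAL]
# Assumption: "qstat -j JOB_ID" fails once the job is no longer queued
# or running.
wait_for_job() {
    job_id=$1 interval=${2:-10}
    while qstat -j "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "job $job_id is no longer known to qstat"
}

# Example: wait for job 1234, polling every 30 seconds.
#   wait_for_job 1234 30
```

Note that a job disappearing from qstat says nothing about whether it succeeded; that would need a separate check of its output or accounting records.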
Troubleshooting
If the above test did not succeed, here are a few common error conditions that
may occur, along with steps to resolve each problem.
Problem: The managed-job-globusrun command aborts with a "Read timeout" exception.
This can occur the first time that a user tries to execute a job after the GT3 server has been started. Starting the MJS which will execute a user's job can take a significant length of time, longer than the timeout normally tolerated. Simply try running the job again.
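If you are scripting job submission, "try again" can be expressed as a small retry wrapper. The following is a generic sketch, with a made-up function name, and is not something provided by GT3. Be aware that it retries on any failure, not only on a read timeout, so keep the attempt count low.

```shell
#!/bin/sh
# Run a command up to MAX times, pausing DELAY seconds between attempts.
# Usage: retry MAX DELAY command [args...]
retry() {
    max=$1 delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example: up to 3 attempts, 10 seconds apart.
#   retry 3 10 managed-job-globusrun -factory HOSTNAME:PORT -type SGE \
#       -file ${GLOBUS_LOCATION}/etc/test.xml -output
```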
Problem: The managed-job-globusrun command aborts with an AXIS exception indicating it could not find the SGE MMJFS service.
This can occur either when the SGE packages have not been fully installed or when the configuration changes that they made to the GT3 installation have not been propagated to a user's hosting environment. Make sure that you have successfully run gpt-postinstall, that your ~/.globus/uhe-'hostname' directory has been removed, and that the UHE has been restarted.
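One way to narrow this down is to check whether a given server-config.wsdd mentions the SGE factory service at all. The sketch below assumes that the service name MasterSGEManagedJobFactoryService appears literally in the file once the packages are installed; that is an assumption of this example, as is the function name check_wsdd.

```shell
#!/bin/sh
# For each WSDD file given, report whether it mentions the SGE master
# factory service. Returns non-zero if any file lacks it or is unreadable.
# Assumption: the service name appears literally in the file.
check_wsdd() {
    status=0
    for wsdd in "$@"; do
        if grep -q 'MasterSGEManagedJobFactoryService' "$wsdd" 2>/dev/null; then
            echo "ok: $wsdd"
        else
            echo "SGE service not found: $wsdd"
            status=1
        fi
    done
    return "$status"
}

# Example: check the server copy.
#   check_wsdd "${GLOBUS_LOCATION}/server-config.wsdd"
```

If the server copy passes but jobs still fail, the cached copy in the user's UHE directory is the likely culprit.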
For further information please contact david.mcbride@imperial.ac.uk