HTCondor

HTCondor is used to distribute computing jobs on several computers. All of these jobs run with priority lower than ordinary user jobs (nice-value 19).

The purpose is that neither console user jobs nor ssh user jobs are affected by HTCondor jobs. Mouse/keyboard activity and the load level on a computing host are monitored and may lead to temporary halts of jobs managed by HTCondor. The jobs continue automatically when the situation changes, or they are moved automatically to other hosts.

Most of the Linux computers for bachelor students in the department can run HTCondor managed jobs. About 200 CPU cores are available.

License conditions

See complete license conditions.

What kind of jobs

HTCondor may be used to run most computing jobs which do not rely on active communication with a user. The computing program is submitted to HTCondor along with data and command files (e.g. keyboard input) and the resulting data appear in files.

For programs which may be relinked, e.g. written in C or C++ by the user, HTCondor can automatically store the job status periodically on disk (by default every third hour). This is called checkpointing. If the computer stops the job, HTCondor may continue the job from the last stored status. In this way the loss of computing time is limited. This is possible when linking to HTCondor libraries.

Most commercial software programs cannot be relinked by a user. In that case the user must take care of storing intermediate results to reduce the loss of computing time if a computer stops or crashes. This reduces the benefit of HTCondor to some degree, but the ability to use otherwise idle CPU cycles in a large network of machines, and the automatic restart after a machine crash, are useful anyway.
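For a program which cannot be relinked, the intermediate storage described above can be done by the program itself. The following is a minimal Python sketch, assuming a loop-style computation whose state fits in a small dictionary. The file name, the function names and the toy summation are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state to path atomically, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename within the same file system


def load_state(path, default):
    """Return the stored state, or a copy of default if nothing is stored yet."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, save_every=1000):
    """Sum the integers 0..total_steps-1, resuming from the stored state if present."""
    state = load_state(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]
```

If the job is restarted, a call such as run("state.json", 1000000) continues from the last saved step instead of from zero, provided the state file is still reachable from the host where the job restarts (an assumption about the file system setup).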

In the case of computing with MATLAB, it is also possible to use the MATLAB compiler and link with HTCondor libraries.

Available HTCondor computers

To determine whether a computer is available for HTCondor, log in on the computer and issue the command

  pgrep -l condor

in a terminal window. If the result is not empty, it means that HTCondor is running on the host.

Submit machines

To determine which machines are submit machines, give the command

condor_status -sched

which, at the time of writing this guide, yields myrkondor and svartkondor.

Only these hosts can be used to send/submit jobs to HTCondor with the command condor_submit. On myrkondor and svartkondor the command

pgrep -l condor

returns a list of which daemons are running, e.g.

17234 condor_master
910032 condor_procd
910035 condor_schedd
22510 condor_shared_p

which means that four HTCondor daemons are running. condor_master governs the other daemons. condor_schedd sends jobs to HTCondor.

There may also be a number of condor_shadow processes, one for each job running on a computing resource.

Central manager

To determine which computer has the central manager role, issue the command

condor_status -col

which at this point in time yields skogkondor and svartkondor.

On skogkondor the command pgrep -l condor returns

3091418 condor_master
3091435 condor_procd
338540 condor_shared_p
3091439 condor_collecto
3091508 condor_negotiat

condor_collector collects information from the machines. condor_negotiator negotiates with machines, on behalf of a user, to find those that can satisfy the requirements the user has given.

Computing resources

The machines used to execute jobs have the daemons condor_master, condor_startd and condor_shared_p, but not condor_schedd, condor_collector or condor_negotiator.
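The mapping from daemons to roles described in these sections can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and prefixes are compared because the pgrep listings above show names cut to 15 characters (condor_collecto, condor_negotiat).

```python
def daemon_names(pgrep_output):
    """Extract the process names from the output of `pgrep -l condor`."""
    names = set()
    for line in pgrep_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # "<pid> <name>"
            names.add(parts[1])
    return names


def roles(names):
    """Infer the HTCondor roles of a host from its set of daemon names."""
    def has(prefix):
        return any(name.startswith(prefix) for name in names)

    found = []
    if has("condor_schedd"):
        found.append("submit")
    if has("condor_collecto") or has("condor_negotiat"):
        found.append("central manager")
    if has("condor_startd"):
        found.append("execute")
    return found


sample = """3091418 condor_master
3091435 condor_procd
338540 condor_shared_p
3091439 condor_collecto
3091508 condor_negotiat
"""
print(roles(daemon_names(sample)))
```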

On a machine which is a member of the HTCondor pool, type the command

  condor_status

to get a list of machines which can run HTCondor jobs. The Linux command "man condor_status" gives more information.

Starting jobs

To start jobs, use ssh to log into myrkondor.ifi.uio.no or svartkondor.ifi.uio.no (Linux machines with 64-bit Red Hat Enterprise Linux release 7.3), then follow these steps:

  1. Create files containing keyboard commands for programs which need this.
  2. Create a file which gives HTCondor the information required to run the job. This will be the submit file. The file name may well have the extension ".cmd".

    The file needs to include a command which sets the universe HTCondor will use: "standard", "vanilla" or "java". Java programs must run in the "java" universe. Other programs which the user cannot relink, e.g. MATLAB, must run in HTCondor's vanilla universe. Programs which the user can relink should run in HTCondor's standard universe, so that automatic checkpointing can be used.

    The universe is assigned like this

      universe = vanilla
    

    A submit file may contain more commands, like giving names of files to use instead of stdin, stdout and stderr, how many times the program should run (with different input data for each run), which requirements the program has about the computing resource (e.g. machine architecture, minimum memory, fastest machine etc.) and who should receive email with information about the job.

    Look below for examples. The Linux command "man condor_submit" has more information about the content of a submit file.

  3. If the job will run in the standard universe, the program needs to be linked with the HTCondor libraries which handle checkpointing. Use the command condor_compile for this. For the most common link commands, simply insert the word condor_compile at the front of the command line. Do the same if compilation and linking are done in one step. condor_compile can be used with cc, CC, gcc, g++ and ld (the linker).

    The Linux command "man condor_compile" gives more information about compilation and linking with HTCondor libraries.

  4. Send/submit the job to HTCondor. For this, use the command condor_submit, i.e.
      condor_submit submit-file
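
The steps above can also be scripted. The following Python sketch, under the assumption that a submit file is simply a series of "name = value" lines followed by a queue command, generates the text of a submit file and wraps the condor_compile and condor_submit commands named in steps 3 and 4. The function names are hypothetical, and the submit function only works on a submit machine.

```python
import subprocess


def submit_description(commands, queue_count=1):
    """Return submit-file text: one 'name = value' line per command, then queue."""
    lines = [f"{name} = {value}" for name, value in commands.items()]
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


def condor_compile_command(link_command):
    """Prefix a compile/link command (a list of words) with condor_compile (step 3)."""
    return ["condor_compile"] + list(link_command)


def submit(path):
    """Run condor_submit on a submit file (step 4); raises if the command fails."""
    return subprocess.run(["condor_submit", path], check=True)


text = submit_description({
    "universe": "vanilla",
    "executable": "foo",
    "log": "foo.log",
})
print(text)
```

The printed text can be written to a file, e.g. foo.cmd, and passed to submit("foo.cmd").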
    

Minimalist submit file

# Use your own email address for notify_user, e.g.
notify_user    = username@ifi.uio.no
# When do you wish to receive email from HTCondor (Always, Complete, Error, Never)
# Avoid Never.
notification   = Error

executable     = foo                                                    
log            = foo.log                                                    
queue

The program named foo will be run in the standard universe (because the universe is not specified), and the log file "foo.log" will contain information about the execution. The job will be run on the same machine architecture as the machine where the condor_submit command is given (as long as the program's binary code fits the type of machine). We only use 64-bit Linux hosts with Intel-compatible hardware for HTCondor in the department.

If the standard universe will not be used, the universe must be specified with the command "universe".

Submit file for a simple MATLAB job

Note! MATLAB licenses are a scarce resource. A sure way to become unpopular is to start lots of HTCondor jobs which invoke MATLAB, thereby claiming several licenses.

To avoid using many MATLAB licenses, one can use MATLAB Compiler to generate C code for stand-alone programs. Running these does not require a MATLAB license. See the MATLAB documentation for more information about this. It also facilitates "checkpointing".

A submit file for running a single MATLAB job may look like this

# Use your own email-address for notify_user, e.g.:
notify_user    = username@ifi.uio.no

# Only receive e-mail in case of job error.
notification = Error

# Program to run.
executable = /usr/bin/matlab

# Universe without checkpointing for non-linkable job
universe   = vanilla

# Import user environment variables at the time of submission to Condor
getenv     = True

# stdin, i.e. keyboard input to MATLAB. Script, not function.
input      = keyboard_input_script.m

# stdout, i.e. screen output from MATLAB.
output     = matlab_screen_output.txt

# stderr. However, MATLAB prints error messages to screen, i.e. to stdout,
# so leave stderr at default value of /dev/null.
#error      = matlab_errors.txt

# log-file for Condor activity
log        = condor.log

# Choice of directory to start executable from (i.e. Condor does
# cd to initialdir prior to starting executable)
initialdir = $ENV(HOME)/matlab

# Arguments to pass to the executable.
# start_function can be m-file script or m-file function which must be on
# the MATLAB path. Substitute "start_function" with your own choice.
arguments  = -nodisplay -nojvm -r start_function

queue

Submit file for Java job

This shows an example where the files referred to by output and error end up in the folder given by initialdir. The Java program has the full path to the class containing "main". The program also uses other classes from myJar.jar. Standard input is taken from the file h.txt in a specified folder.

The first argument in arguments must contain the name of the class containing "main". The name must be written so that the java command is able to find it. In the example the name is given with full package qualification.

Note that executable gives the start class with the suffix .class, i.e. the file, while arguments gives the class name.

CLASSPATH will contain the files given by jar_files in addition to the elements which HTCondor needs to find its helper classes.

# Use your own email-address for notify_user, e.g.:
notify_user    = username@ifi.uio.no
notification   = Error

executable = $ENV(HOME)/src/java/mypackage/ClassName.class
universe = java

initialdir = $ENV(HOME)/project
input = $ENV(HOME)/data/h.txt
output = submit.out
error = submit.err
jar_files = $ENV(HOME)/myJar.jar
arguments = mypackage.ClassName arg0 arg1 arg2
log = project.log
queue
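
The relation between the file given in executable and the class name given first in arguments can be illustrated with a small Python sketch. It assumes that the package directories start directly below a source root (src/java in the example above). The function name and the path /home/user are hypothetical.

```python
from pathlib import PurePosixPath


def class_name(class_file, source_root):
    """Return the fully qualified class name for a .class file under source_root."""
    relative = PurePosixPath(class_file).relative_to(source_root)
    if relative.suffix != ".class":
        raise ValueError(f"not a class file: {class_file}")
    # Drop the suffix and join the directory parts with dots.
    return ".".join(relative.with_suffix("").parts)


print(class_name("/home/user/src/java/mypackage/ClassName.class",
                 "/home/user/src/java"))
```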

Submit file for multiple runs

By using the command

queue 100

the program will run 100 times. Each run is given a process number, starting from zero. This makes it possible, for example, to give a different data file to each run:

input = file.$(Process)

Other commands may also use the macro Process in this way.
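
The input files must exist before the job is submitted. A minimal Python sketch for preparing them, assuming one data set per run and the naming scheme file.0, file.1 and so on from the example above (the function name is hypothetical):

```python
import os


def write_inputs(directory, datasets, prefix="file"):
    """Write one input file per run, named prefix.0, prefix.1, ... in directory."""
    paths = []
    for process, data in enumerate(datasets):
        path = os.path.join(directory, f"{prefix}.{process}")
        with open(path, "w") as f:
            f.write(data)
        paths.append(path)
    return paths
```

Calling write_inputs(".", datasets) with 100 data sets creates the files which "queue 100" together with "input = file.$(Process)" expects.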

A submit file may contain several queue commands. The process number keeps increasing across queue commands (it does not start from zero again). All processes started by a submit file are given a common cluster number, which may also be referred to in the submit file with the macro Cluster.
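
The numbering rule can be sketched in Python as follows. This only restates the rule described above and is not HTCondor code; the function name is hypothetical.

```python
def process_numbers(queue_counts):
    """Return, for each queue command, the list of process numbers it starts."""
    result = []
    next_process = 0
    for count in queue_counts:
        result.append(list(range(next_process, next_process + count)))
        next_process += count  # numbering continues across queue commands
    return result


# A submit file with "queue 3" followed by "queue 2":
print(process_numbers([3, 2]))
```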

Monitoring a job

The following commands may be useful to monitor/administer jobs. Use "man command" for more information on the commands.

condor_status
Check status and activity on machines.
condor_q
See the queue of jobs. The option -global yields the queue for all submit machines. The options -analyze and -better-analyze can give information about why a job has not been started.
condor_history
Check data about jobs which were completed or removed.
condor_rm
Remove a job from queue.

Problems

When running Java programs on Linux machines, HTCondor is not always able to estimate the amount of memory the job needs. In the standard setup, HTCondor requires that a machine which will run a job has enough memory to hold the complete job (without resorting to disk usage). The problem is encountered especially with programs using many threads, for which the estimate becomes too large. If such a job is suspended, it may never start again because no machine has enough memory to satisfy the job's requirements.

The solution is to include a requirements command in the submit file which specifies the memory requirement. HTCondor will then ignore the requirement described in the previous paragraph. If the specified memory requirement is smaller than the job size, execution may become inefficient. If the job needs 512 MB of memory, the requirement can be written like this:

requirements = Memory > 512

With more requirements for the machine, they will all need to be included in the same requirements command, e.g.

requirements = Memory > 512 && Arch == "X86_64" && OpSys == "LINUX"
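
When a submit file is generated by a script, the single requirements command can be assembled from separate clauses. A minimal Python sketch, assuming each clause is already a valid expression (the function name is hypothetical):

```python
def requirements(*clauses):
    """Join clauses into one requirements command, each clause in parentheses."""
    if not clauses:
        raise ValueError("at least one clause is needed")
    return "requirements = " + " && ".join(f"({clause})" for clause in clauses)


print(requirements("Memory > 512", 'OpSys == "LINUX"'))
```

Putting every clause in parentheses keeps the combined expression unambiguous even if a clause itself contains || or &&.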

 

Published March 2, 2011 20:09 - Last modified June 9, 2017 16:44