HTCondor distributes computing jobs across several computers. All HTCondor jobs run at a lower priority than ordinary user jobs (nice value 19), so that neither console users nor ssh users are noticeably affected. HTCondor monitors mouse/keyboard activity and the load level on each computing host; activity may lead to temporary suspension of HTCondor-managed jobs. Suspended jobs continue automatically when the situation changes, or are moved automatically to other hosts.
Most of the Linux computers for bachelor students in the department can run HTCondor managed jobs. About 200 CPU cores are available.
- License conditions
- What kind of jobs
- User setup
- Available computers
- Starting jobs
- Monitoring a job
HTCondor may be used to run most computing jobs which do not rely on active communication with a user. The computing program is submitted to HTCondor along with data and command files (e.g. keyboard input) and the resulting data appear in files.
For programs which may be relinked, e.g. written in C or C++ by the user, HTCondor can automatically store the job status periodically on disk (by default every third hour). This is called checkpointing. If the computer stops the job, HTCondor may continue the job from the last stored status. In this way the loss of computing time is limited. This is possible when linking to HTCondor libraries.
Most commercial software programs cannot be relinked by a user. In that case the user must take care of storing intermediate results, to reduce the loss of computing time if a computer stops or crashes. This reduces the benefit of HTCondor to some degree, but the ability to use idle CPU cycles in a large network of machines, together with automatic restart after a machine crash, is still useful.
In the case of computing with MATLAB, it is also possible to use the MATLAB compiler and link with HTCondor libraries.
To determine whether a computer is available for HTCondor, log in on the computer and issue the command
pgrep -l condor
in a terminal window. If the result is not empty, it means that HTCondor is running on the host.
To determine which machines are submit machines, give the command

condor_status -schedd

which, at the time of writing this guide, yields andeskondor and gribb.
Only these hosts can be used to send/submit jobs to HTCondor with the command condor_submit. On andeskondor and gribb the command
pgrep -l condor
returns a list of which daemons are running, e.g.
17234 condor_master
910032 condor_procd
910035 condor_schedd
22510 condor_shared_p
which means that four HTCondor daemons are running.
condor_master governs the other daemons.
condor_schedd sends jobs to HTCondor.
There may also be a number of condor_shadow daemons, one for each job running on a computing resource.
To determine which computer has the central manager role, issue the command

condor_status -negotiator

which at this point in time yields kongekondor.
On kongekondor the command
pgrep -l condor

returns
3091418 condor_master
3091435 condor_procd
338540 condor_shared_p
3091439 condor_collecto
3091508 condor_negotiat
condor_collector collects information from the machines,
condor_negotiator negotiates, on behalf of a user, with machines to find those that can satisfy the requirements the user has given.
The machines used to execute jobs run the daemons condor_master, condor_startd and condor_shared_p, but not condor_schedd.
On a machine which is a member of the HTCondor pool, type the command

condor_status

to get a list of machines which can run HTCondor jobs. The Linux command "man condor_status" gives more information.
To start jobs:
- Use ssh to log into andeskondor.ifi.uio.no or gribb.ifi.uio.no (Linux machines with 64-bit Red Hat Enterprise Linux release 7.6).
- Create files containing keyboard commands for programs which need this.
- Create a file which gives HTCondor the information required to run the job. This will be the submit file. The file name may well have the extension ".cmd".
The file needs to include a command which sets the universe HTCondor will use: "standard", "vanilla" or "java". Java programs must run in the "java" universe. Other programs which cannot be relinked, e.g. MATLAB, must run in HTCondor's vanilla universe. Programs which a user can relink should run in HTCondor's standard universe, so that automatic checkpointing can be used.
The universe is assigned like this
universe = vanilla
A submit file may contain more commands, such as the names of files to use instead of stdin, stdout and stderr, how many times the program should run (with different input data for each run), which requirements the program has for the computing resource (e.g. machine architecture, minimum memory, fastest machine, etc.) and who should receive email with information about the job.
See the examples below. The Linux command "man condor_submit" has more information about the content of a submit file.
- If the job will run in the standard universe, the program needs to be linked with the HTCondor libraries which handle checkpointing. Use the command condor_compile for this. For the most common link commands, simply insert the word condor_compile at the front of the command line; do likewise if compilation and linking is done in one step. condor_compile can be used with common compilers and linkers such as gcc, g++ and ld. The Linux command "man condor_compile" gives more information about compilation and linking with HTCondor libraries.
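As a sketch, building a small C program for the standard universe could look like this (the source file foo.c and the compiler flags are only placeholders; adapt them to your own build):

```
gcc -o foo foo.c                  # ordinary build, no checkpointing
condor_compile gcc -o foo foo.c   # same build, linked with the HTCondor libraries
```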
- Send/submit the job to HTCondor. For this, use the command condor_submit with the name of the submit file as argument. A minimal submit file may look like this:
# Use your own email address for notify_user, e.g.
notify_user = email@example.com
# When do you wish to receive email from HTCondor (Always, Complete, Error, Never)
# Avoid Never.
notification = Error
executable = foo
log = foo.log
queue
The program named foo will be run in the standard universe (because the universe is not specified), and the log file "foo.log" will contain information about the execution. The job will be run on the same machine architecture as the machine where the condor_submit command is given (as long as the program's binary code fits the type of machine). We only use 64-bit Linux hosts with Intel-compatible hardware for HTCondor in the department.
If the standard universe will not be used, the universe must be specified with the command "universe".
Note! MATLAB licenses are a scarce resource. A sure way to become unpopular is to start lots of HTCondor jobs which invoke MATLAB, thereby claiming several licenses.
To avoid using many MATLAB licenses, one can use the MATLAB Compiler to generate C code for stand-alone programs. Executing these does not require a MATLAB license. See the MATLAB documentation for more information. It also facilitates checkpointing.
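As a hedged sketch (the exact options depend on your MATLAB version; the function name my_function is only a placeholder), a stand-alone program can be generated with the MATLAB Compiler like this:

```
mcc -m my_function.m    # generate a stand-alone executable from my_function.m
```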
A submit file for running a single MATLAB job may look like this
# Use your own email-address for notify_user, e.g.:
notify_user = firstname.lastname@example.org
# Only receive e-mail in case of job error.
notification = Error
# Program to run.
executable = /usr/bin/matlab
# Universe without checkpointing for non-linkable job
universe = vanilla
# Import user environment variables at the time of submission to Condor
getenv = True
# stdin, i.e. keyboard input to MATLAB. Script, not function.
input = keyboard_input_script.m
# stdout, i.e. screen output from MATLAB.
output = matlab_screen_output.txt
# stderr. However, MATLAB prints error messages to screen, stdout,
# so leave stderr at default value of /dev/null.
#error = matlab_errors.txt
# log-file for Condor activity
log = condor.log
# Choice of directory to start executable from (i.e. Condor does
# cd to initialdir prior to starting executable)
initialdir = $ENV(HOME)/matlab
# Arguments to pass to the executable.
# start_function can be m-file script or m-file function which must be on
# the MATLAB path. Substitute "start_function" with your own choice.
arguments = -nodisplay -nojvm -r start_function
queue
The following example shows a Java job where the files referred to by output and error end up in the folder given by initialdir. The Java program is given with the full path to the class containing "main". The program also uses other classes from myJar.jar. Standard input is taken from the file h.txt in a specified folder.
The first argument in arguments must contain the name of the class containing "main". The name must be written so that the java command is able to find it; in the example the name is given with full package qualification. executable gives the start class with suffix .class, i.e. the file itself, while arguments gives the class name. CLASSPATH will contain the files given by jar_files in addition to elements which HTCondor needs to find its helper classes.
# Use your own email-address for notify_user, e.g.:
notify_user = email@example.com
notification = Error
executable = $ENV(HOME)/src/java/mypackage/ClassName.class
universe = java
initialdir = $ENV(HOME)/project
input = $ENV(HOME)/data/h.txt
output = submit.out
error = submit.err
jar_files = $ENV(HOME)/myJar.jar
arguments = mypackage.ClassName arg0 arg1 arg2
log = project.log
queue
By using the command

queue 100

the program will run 100 times. Each run is given a process number, counting from zero. This makes it possible, for example, to give a different data file to each run:

input = file.$(Process)
Other commands may also use the macro $(Process) in this way.
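Such per-process input files can be generated beforehand, for instance with a small shell loop (a sketch; the file contents written here are only placeholders for your own data):

```shell
#!/bin/sh
# Create one input file per HTCondor process number, file.0 ... file.99,
# matching "input = file.$(Process)" together with "queue 100".
i=0
while [ "$i" -lt 100 ]; do
  printf 'seed=%d\n' "$i" > "file.$i"   # placeholder content; substitute real data
  i=$((i + 1))
done
```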
A submit file may contain several queue commands. The process number is increased for each queue command (it does not restart from zero). All processes started by a submit file are given a common cluster number, which may also be referred to in the submit file with the macro $(Cluster).
The following commands may be useful to monitor/administer jobs. Use "man command" for more information on the commands.
- condor_status: check status and activity on machines.
- condor_q: see the queue of jobs. The option -global yields the queue for all submit machines. The option -better-analyze can give information about why a job has not been started.
- condor_history: check data about jobs which were completed or removed.
- condor_rm: remove a job from the queue.
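A typical monitoring sequence might look like this (the job id 1234.0 is hypothetical; use the ids shown by condor_q):

```
condor_q                            # list jobs queued on this submit machine
condor_q -better-analyze 1234.0     # explain why job 1234.0 has not started
condor_rm 1234.0                    # remove job 1234.0 from the queue
```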
When running Java programs on Linux machines, HTCondor is not always able to guess the amount of memory the job needs. In the standard setup, HTCondor requires that a machine which will run a job has enough memory to hold the complete job (without resorting to disk). The problem is especially encountered with programs using many threads; the estimate is then too large. If the job becomes suspended, it may never start again, because no machine has enough memory to satisfy the job's requirements.
The solution is to include a requirements command in the submit file, specifying the memory requirement explicitly. HTCondor will then ignore the requirement described in the previous paragraph. If the specified memory requirement is smaller than the job size, execution may be inefficient. If the job needs 512 MB of memory, the requirement can be written like this:

requirements = Memory > 512
If there are more requirements for the machine, they must all be included in the same requirements command, e.g.
requirements = Memory > 512 && ((Arch == "INTEL" && OpSys == "LINUX"))