1.2. Rationale

The Cluster component submits jobs described by a given job script to the queuing system of a cluster. It can upload directories before submission and download directories after a job has terminated.

To check whether jobs have finished, the Cluster component polls the queuing system once per minute and asks for their states. The connection to the cluster is established via SSH. For the submission, a directory (sandbox-[uuid]) is created in the user’s home directory on the cluster. It serves as the current working directory for all remote command-line calls.
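The polling behaviour described above can be sketched roughly as follows. This is only an illustration, not the component's actual implementation: the function wait_for_jobs, the query_states callable (which would wrap a queuing-system status command run over SSH) and the state name "finished" are all hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, Set


def wait_for_jobs(
    job_ids: Iterable[str],
    query_states: Callable[[Set[str]], Dict[str, str]],
    finished_states: frozenset = frozenset({"finished"}),  # assumed state name
    interval_s: float = 60.0,  # "every minute", as described above
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every job has reached a finished state.

    query_states is a hypothetical callable that asks the queuing system
    for the states of the given job ids and returns {job_id: state}.
    """
    pending = set(job_ids)
    final: Dict[str, str] = {}
    while pending:
        states = query_states(pending)
        for job_id in list(pending):
            state = states.get(job_id)
            if state in finished_states:
                final[job_id] = state
                pending.discard(job_id)
        if pending:
            # Wait one polling interval before asking again.
            sleep(interval_s)
    return final
```

Passing the sleep function as a parameter keeps the sketch testable without actually waiting for a minute between polls.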

The remote directory structure is as follows:

/sandbox-[id]
    /iteration-0
        /cluster-job-0
            /input
            /output
        /cluster-job-1
            /input
            /output
        …
        /cluster-job-shared-input
    /iteration-1
        /cluster-job-0
            /input
            /output
        …
        /cluster-job-shared-input
    …
    /job

The job script is uploaded to /sandbox-[id]/job. Each job is submitted from its own directory, /sandbox-[id]/iteration-[n]/cluster-job-[n]/.
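As an illustration of this layout, the remote paths for one job could be derived as in the sketch below. The helper sandbox_paths and its parameter names are hypothetical, and the home directory is passed in explicitly because its actual location depends on the cluster.

```python
from pathlib import PurePosixPath
from typing import Dict


def sandbox_paths(home: str, sandbox_id: str, iteration: int, job_index: int) -> Dict[str, PurePosixPath]:
    """Build the remote paths for one job, following the layout above.

    PurePosixPath is used because the paths refer to the remote cluster,
    not to the local file system.
    """
    sandbox = PurePosixPath(home) / f"sandbox-{sandbox_id}"
    iteration_dir = sandbox / f"iteration-{iteration}"
    job_dir = iteration_dir / f"cluster-job-{job_index}"
    return {
        "sandbox": sandbox,
        "job_script": sandbox / "job",  # uploaded once per sandbox
        "shared_input": iteration_dir / "cluster-job-shared-input",
        "job_dir": job_dir,  # directory the job is submitted from
        "input": job_dir / "input",
        "output": job_dir / "output",
    }
```

For example, sandbox_paths("/home/user", "1234", 0, 1)["output"] yields /home/user/sandbox-1234/iteration-0/cluster-job-1/output, where /home/user and 1234 are placeholder values.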

If a job fails and the Cluster component should be marked as failed, a file named job_failed must be created in /sandbox-[id]/iteration-[n]/cluster-job-[n]/output. The content of the file is used as the error message. Output directories are not downloaded for the failed job or for any remaining jobs that terminate afterwards.
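A job that runs a Python step could signal failure along the lines of the following sketch. The helper mark_job_failed is hypothetical. Only the file name job_failed and its location in the job's output directory come from the description above, and the output directory is passed in explicitly rather than assumed.

```python
from pathlib import Path
from typing import Union


def mark_job_failed(output_dir: Union[str, Path], message: str) -> Path:
    """Create the job_failed marker file in the job's output directory.

    The file content is what would be reported as the error message.
    """
    marker = Path(output_dir) / "job_failed"
    marker.write_text(message, encoding="utf-8")
    return marker
```

A job script written in another language only needs to produce the same file, for instance by redirecting an error message into output/job_failed.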