StarPU Handbook
Execution Configuration Through Environment Variables

The behavior of the StarPU library and tools can be tuned through the following environment variables.

Configuring Workers

STARPU_NCPU

Specify the number of CPU workers (thus not including workers dedicated to controlling accelerators). Note that by default, StarPU will not allocate more CPU workers than there are physical CPUs, and that some CPUs are used to control the accelerators.

STARPU_RESERVE_NCPU

Specify the number of CPU cores that should not be used by StarPU, so the application can use starpu_get_next_bindid() and starpu_bind_thread_on() to bind its own threads.

This option is ignored if STARPU_NCPU or starpu_conf::ncpus is set.

STARPU_NCPUS

This variable is deprecated. You should use STARPU_NCPU.

STARPU_NCUDA

Specify the number of CUDA devices that StarPU can use. If STARPU_NCUDA is lower than the number of physical devices, it is possible to select which CUDA devices should be used by means of the environment variable STARPU_WORKERS_CUDAID. By default, StarPU will create as many CUDA workers as there are CUDA devices.

STARPU_NWORKER_PER_CUDA

Specify the number of workers per CUDA device, and thus the number of kernels which will be concurrently running on the devices, i.e. the number of CUDA streams. The default value is 1.

STARPU_CUDA_THREAD_PER_WORKER

Specify whether the CUDA driver should use one thread per stream (1) or a single thread to drive all the streams of a device or of all devices (0). In the latter case, STARPU_CUDA_THREAD_PER_DEV determines whether it is one thread per device or one thread for all devices. The default value is 0. Setting it to 1 is contradictory with setting STARPU_CUDA_THREAD_PER_DEV.

STARPU_CUDA_THREAD_PER_DEV

Specify whether the CUDA driver should use one thread per device (1) or a single thread to drive all the devices (0). The default value is 1. It does not make sense to set this variable if STARPU_CUDA_THREAD_PER_WORKER is set to 1 (since STARPU_CUDA_THREAD_PER_DEV is then meaningless).
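
The interaction between STARPU_NWORKER_PER_CUDA, STARPU_CUDA_THREAD_PER_WORKER and STARPU_CUDA_THREAD_PER_DEV can be summarized with a small sketch. It only mirrors the wording of the descriptions above and is not StarPU's driver code; the function name cuda_driver_threads and its parameters are made up for illustration.

```python
def cuda_driver_threads(ndevices, nworker_per_cuda=1,
                        thread_per_worker=0, thread_per_dev=1):
    """Return how many CUDA driver threads the settings imply.

    Illustrative reading of the documentation, not StarPU's implementation.
    The keyword defaults are the documented default values of the variables.
    """
    if thread_per_worker:
        # One thread per stream; STARPU_CUDA_THREAD_PER_DEV is then meaningless.
        return ndevices * nworker_per_cuda
    if thread_per_dev:
        # One thread drives all the streams of its device.
        return ndevices
    # A single thread drives all the devices.
    return 1


# Two devices with three streams each, under the three possible settings.
print(cuda_driver_threads(2, nworker_per_cuda=3))
print(cuda_driver_threads(2, nworker_per_cuda=3, thread_per_worker=1))
print(cuda_driver_threads(2, nworker_per_cuda=3, thread_per_dev=0))
```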

STARPU_CUDA_PIPELINE

Specify how many asynchronous tasks are submitted in advance on CUDA devices. This makes it possible, for instance, to overlap task management with the execution of previous tasks. It also allows concurrent execution on Fermi cards, which otherwise introduce spurious synchronizations. The default is 2. Setting the value to 0 forces synchronous execution of all tasks.

STARPU_NOPENCL

OpenCL equivalent of the environment variable STARPU_NCUDA.

STARPU_OPENCL_PIPELINE

Specify how many asynchronous tasks are submitted in advance on OpenCL devices. This makes it possible, for instance, to overlap task management with the execution of previous tasks. It also allows concurrent execution on Fermi cards, which otherwise introduce spurious synchronizations. The default is 2. Setting the value to 0 forces synchronous execution of all tasks.

STARPU_OPENCL_ON_CPUS

By default, the OpenCL driver only enables GPU and accelerator devices. By setting the environment variable STARPU_OPENCL_ON_CPUS to 1, the OpenCL driver will also enable CPU devices.

STARPU_OPENCL_ONLY_ON_CPUS

By default, the OpenCL driver enables GPU and accelerator devices. By setting the environment variable STARPU_OPENCL_ONLY_ON_CPUS to 1, the OpenCL driver will ONLY enable CPU devices.

STARPU_NMIC

MIC equivalent of the environment variable STARPU_NCUDA, i.e. the number of MIC devices to use.

STARPU_NMICTHREADS

Number of threads to use on the MIC devices.

STARPU_NMPI_MS

MPI Master Slave equivalent of the environment variable STARPU_NCUDA, i.e. the number of MPI Master Slave devices to use.

STARPU_NMPIMSTHREADS

Number of threads to use on the MPI Slave devices.

STARPU_MPI_MASTER_NODE

This variable selects which MPI node (identified by its MPI ID) will be the master.

STARPU_WORKERS_NOBIND

Setting it to non-zero will prevent StarPU from binding its threads to CPUs. This is for instance useful when running the testsuite in parallel.

STARPU_WORKERS_GETBIND

Setting it to non-zero makes StarPU use the OS-provided CPU binding to determine how many and which CPU cores it should use. This is notably useful when running several StarPU-MPI processes on the same host, to let the MPI launcher set the CPUs to be used.

STARPU_WORKERS_CPUID

Passing an array of integers in STARPU_WORKERS_CPUID specifies on which logical CPU the different workers should be bound. For instance, if STARPU_WORKERS_CPUID = "0 1 4 5", the first worker will be bound to logical CPU #0, the second worker will be bound to logical CPU #1, and so on. Note that the logical ordering of the CPUs is either determined by the OS, or provided by the library hwloc in case it is available. Ranges can be provided: for instance, STARPU_WORKERS_CPUID = "1-3 5" will bind the first three workers on logical CPUs #1, #2, and #3, and the fourth worker on logical CPU #5. Open-ended ranges can also be provided: STARPU_WORKERS_CPUID = "1-" will bind the workers starting from logical CPU #1 up to the last CPU.

Note that the first workers correspond to the CUDA workers, then come the OpenCL workers, and finally the CPU workers. For example if we have STARPU_NCUDA=1, STARPU_NOPENCL=1, STARPU_NCPU=2 and STARPU_WORKERS_CPUID = "0 2 1 3", the CUDA device will be controlled by logical CPU #0, the OpenCL device will be controlled by logical CPU #2, and the logical CPUs #1 and #3 will be used by the CPU workers.

If the number of workers is larger than the array given in STARPU_WORKERS_CPUID, the workers are bound to the logical CPUs in a round-robin fashion: if STARPU_WORKERS_CPUID = "0 1", the first and the third (resp. second and fourth) workers will be put on CPU #0 (resp. CPU #1).

This variable is ignored if the field starpu_conf::use_explicit_workers_bindid passed to starpu_init() is set.
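
The list, range and round-robin rules described for STARPU_WORKERS_CPUID can be made concrete with a short sketch. This is not StarPU's parser: parse_cpuid and bind_workers are hypothetical helpers, and the number of logical CPUs is passed in explicitly because the sketch does not query the machine.

```python
def parse_cpuid(spec, ncpus):
    """Expand a STARPU_WORKERS_CPUID-style string into logical CPU ids.

    ncpus is the number of logical CPUs; it is only needed to close an
    open-ended range such as "1-".
    """
    cpus = []
    for token in spec.split():
        if "-" in token:
            first, last = token.split("-", 1)
            # An empty upper bound means "up to the last CPU".
            end = int(last) if last else ncpus - 1
            cpus.extend(range(int(first), end + 1))
        else:
            cpus.append(int(token))
    return cpus


def bind_workers(spec, nworkers, ncpus):
    """Give each worker a CPU, wrapping around if the list is too short."""
    cpus = parse_cpuid(spec, ncpus)
    return [cpus[i % len(cpus)] for i in range(nworkers)]


print(parse_cpuid("1-3 5", ncpus=8))
print(parse_cpuid("1-", ncpus=4))
print(bind_workers("0 1", nworkers=4, ncpus=4))
```

With STARPU_NCUDA=1, STARPU_NOPENCL=1 and STARPU_NCPU=2, the four entries returned by bind_workers("0 2 1 3", 4, 4) would be read in worker order: CUDA first, then OpenCL, then the two CPU workers.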

STARPU_MAIN_THREAD_BIND

When defined, this makes StarPU bind the thread that calls starpu_initialize() to a reserved CPU, subtracted from the CPU workers.

STARPU_MAIN_THREAD_CPUID

When defined, this makes StarPU bind the thread that calls starpu_initialize() to the given CPU ID.

STARPU_MPI_THREAD_CPUID

When defined, this makes StarPU bind its MPI thread to the given CPU ID. Setting it to -1 (the default value) will use a reserved CPU, subtracted from the CPU workers.

STARPU_MPI_NOBIND

Setting it to non-zero will prevent StarPU from binding the MPI thread to a separate core. This is for instance useful when running the testsuite on a single system.

STARPU_WORKERS_CUDAID

Similarly to the STARPU_WORKERS_CPUID environment variable, it is possible to select which CUDA devices should be used by StarPU. On a machine equipped with 4 GPUs, setting STARPU_WORKERS_CUDAID = "1 3" and STARPU_NCUDA=2 specifies that 2 CUDA workers should be created, and that they should use CUDA devices #1 and #3 (the logical ordering of the devices is the one reported by CUDA).

This variable is ignored if the field starpu_conf::use_explicit_workers_cuda_gpuid passed to starpu_init() is set.
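
As a usage illustration, the two-GPU selection above can be prepared from a launcher script. The sketch below only builds the environment; starpu_env is a hypothetical helper, not part of StarPU.

```python
import os


def starpu_env(settings, base=None):
    """Return a copy of the environment with the given variables set.

    base defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # Environment values must be strings, so numbers are converted.
    env.update({name: str(value) for name, value in settings.items()})
    return env


# Use CUDA devices #1 and #3 on a machine equipped with 4 GPUs.
env = starpu_env({"STARPU_NCUDA": 2, "STARPU_WORKERS_CUDAID": "1 3"})
print(env["STARPU_NCUDA"], env["STARPU_WORKERS_CUDAID"])
```

The resulting dictionary can then be passed as the env argument of subprocess.run() when starting the application.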

STARPU_WORKERS_OPENCLID

OpenCL equivalent of the STARPU_WORKERS_CUDAID environment variable.

This variable is ignored if the field starpu_conf::use_explicit_workers_opencl_gpuid passed to starpu_init() is set.

STARPU_WORKERS_MICID

MIC equivalent of the STARPU_WORKERS_CUDAID environment variable.

This variable is ignored if the field starpu_conf::use_explicit_workers_mic_deviceid passed to starpu_init() is set.

STARPU_WORKER_TREE

Define to 1 to enable the tree iterator in schedulers.

STARPU_SINGLE_COMBINED_WORKER

If set, StarPU will create several workers which won't be able to work concurrently. It will by default create combined workers whose size goes from 1 to the total number of CPU workers in the system. STARPU_MIN_WORKERSIZE and STARPU_MAX_WORKERSIZE can be used to change this default.

STARPU_MIN_WORKERSIZE

STARPU_MIN_WORKERSIZE specifies the minimum size of the combined workers (instead of the default 2).

STARPU_MAX_WORKERSIZE

STARPU_MAX_WORKERSIZE specifies the maximum size of the combined workers (instead of the number of CPU workers in the system).

STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER

Let the user decide how many elements are allowed between combined workers created from hwloc information. For instance, in the case of sockets with 6 cores without shared L2 caches, if STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER is set to 6, no combined worker will be synthesized beyond one for the socket and one per core. If it is set to 3, 3 intermediate combined workers will be synthesized, to divide the socket cores into 3 chunks of 2 cores. If it is set to 2, 2 intermediate combined workers will be synthesized, to divide the socket cores into 2 chunks of 3 cores, and then 3 additional combined workers will be synthesized, to divide the former synthesized workers into a bunch of 2 cores, and the remaining core (for which no combined worker is synthesized since there is already a normal worker for it).

The default, 2, thus makes StarPU tend to build a binary tree of combined workers.

STARPU_DISABLE_ASYNCHRONOUS_COPY

Disable asynchronous copies between CPU and GPU devices. The AMD implementation of OpenCL is known to fail when copying data asynchronously. When using this implementation, it is therefore necessary to disable asynchronous data transfers.

STARPU_DISABLE_ASYNCHRONOUS_CUDA_COPY

Disable asynchronous copies between CPU and CUDA devices.

STARPU_DISABLE_ASYNCHRONOUS_OPENCL_COPY

Disable asynchronous copies between CPU and OpenCL devices. The AMD implementation of OpenCL is known to fail when copying data asynchronously. When using this implementation, it is therefore necessary to disable asynchronous data transfers.

STARPU_DISABLE_ASYNCHRONOUS_MIC_COPY

Disable asynchronous copies between CPU and MIC devices.

STARPU_DISABLE_ASYNCHRONOUS_MPI_MS_COPY

Disable asynchronous copies between CPU and MPI Slave devices.

STARPU_ENABLE_CUDA_GPU_GPU_DIRECT

Enable (1) or disable (0) direct CUDA transfers from GPU to GPU, without copying through RAM. Direct transfers are enabled by default. This makes it possible to test the performance effect of GPU-Direct.

STARPU_DISABLE_PINNING

Disable (1) or enable (0) pinning of host memory allocated through starpu_malloc, starpu_memory_pin and friends. Pinning is enabled by default. This makes it possible to test the performance effect of memory pinning.

STARPU_BACKOFF_MIN

Set the minimum number of cycles to pause, under exponential backoff, when spinning. Default value is 1.

STARPU_BACKOFF_MAX

Set the maximum number of cycles to pause, under exponential backoff, when spinning. Default value is 32.
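
To show how the two bounds relate, here is a minimal sketch of an exponential backoff schedule. It assumes the pause length doubles after each unsuccessful spin until it reaches the maximum; the exact growth rule used by StarPU is not stated here, and backoff_schedule is a hypothetical helper.

```python
def backoff_schedule(spins, backoff_min=1, backoff_max=32):
    """Return the pause length, in cycles, for each successive spin.

    Assumption: the pause doubles after every spin, capped at backoff_max.
    The defaults are the documented defaults of STARPU_BACKOFF_MIN and
    STARPU_BACKOFF_MAX.
    """
    delays = []
    delay = backoff_min
    for _ in range(spins):
        delays.append(delay)
        delay = min(delay * 2, backoff_max)
    return delays


print(backoff_schedule(8))
```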

STARPU_MIC_SINK_PROGRAM_NAME

todo

STARPU_MIC_SINK_PROGRAM_PATH

todo

STARPU_MIC_PROGRAM_PATH

todo

Configuring The Scheduling Engine

STARPU_SCHED

Choose between the different scheduling policies proposed by StarPU: random, work stealing, greedy, with performance models, etc.

Use STARPU_SCHED=help to get the list of available schedulers.

STARPU_MIN_PRIO

Set the minimum priority used by priority-aware schedulers.

STARPU_MAX_PRIO

Set the maximum priority used by priority-aware schedulers.

STARPU_CALIBRATE

If this variable is set to 1, the performance models are calibrated during the execution. If it is set to 2, the previous values are dropped to restart calibration from scratch. Setting this variable to 0 disables calibration; this is the default behaviour.

Note: this currently only applies to dm and dmda scheduling policies.

STARPU_CALIBRATE_MINIMUM

Define the minimum number of calibration measurements that will be made before considering that the performance model is calibrated. The default value is 10.

STARPU_BUS_CALIBRATE

If this variable is set to 1, the bus is recalibrated during initialization.

STARPU_PREFETCH

Indicate whether data prefetching should be enabled (0 means that it is disabled). If prefetching is enabled, when a task is scheduled to be executed e.g. on a GPU, StarPU will request an asynchronous transfer in advance, so that data is already present on the GPU when the task starts. As a result, computation and data transfers are overlapped. Note that prefetching is enabled by default in StarPU.

STARPU_SCHED_ALPHA

To estimate the cost of a task StarPU takes into account the estimated computation time (obtained thanks to performance models). The alpha factor is the coefficient to be applied to it before adding it to the communication part.

STARPU_SCHED_BETA

To estimate the cost of a task StarPU takes into account the estimated data transfer time (obtained thanks to performance models). The beta factor is the coefficient to be applied to it before adding it to the computation part.

STARPU_SCHED_GAMMA

Define the execution time penalty of a joule (Energy-based Scheduling).
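
The roles of the three factors can be sketched as a weighted sum. This is a simplified reading of the three descriptions above, not the exact formula used by StarPU's schedulers; task_cost is a hypothetical helper and its default weights are placeholders, not StarPU's defaults.

```python
def task_cost(compute_time, transfer_time, energy=0.0,
              alpha=1.0, beta=1.0, gamma=0.0):
    """Weighted cost of running a task on one worker.

    compute_time and transfer_time are the estimates from the performance
    models; energy is in joules, so gamma converts it to a time penalty.
    The default weights are placeholders chosen for this example.
    """
    return alpha * compute_time + beta * transfer_time + gamma * energy


# A CPU that already holds the data versus a faster GPU that must fetch it.
cpu = task_cost(compute_time=10.0, transfer_time=0.0)
gpu = task_cost(compute_time=2.0, transfer_time=5.0)
gpu_costly_transfers = task_cost(compute_time=2.0, transfer_time=5.0, beta=2.0)
print(cpu, gpu, gpu_costly_transfers)
```

In this made-up example the GPU wins with equal weights, while doubling beta makes the CPU the cheaper choice.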

STARPU_SCHED_READY

For a modular scheduler with sorted queues below the decision component, workers pick up a task which has most of its data already available. Setting this to 0 disables this.

STARPU_IDLE_POWER

Define the idle power of the machine (Energy-based Scheduling).

STARPU_PROFILING

Enable on-line performance monitoring (Enabling On-line Performance Monitoring).

Extensions

SOCL_OCL_LIB_OPENCL

The SOCL test suite is only run when the environment variable SOCL_OCL_LIB_OPENCL is defined. It should contain the location of the file libOpenCL.so of the OCL ICD implementation.

OCL_ICD_VENDORS

When using SOCL with OpenCL ICD (https://forge.imag.fr/projects/ocl-icd/), this variable may be used to point to the directory where ICD files are installed. The default directory is /etc/OpenCL/vendors. StarPU installs ICD files in the directory $prefix/share/starpu/opencl/vendors.

STARPU_COMM_STATS

Communication statistics for starpumpi (Debugging MPI) will be enabled when the environment variable STARPU_COMM_STATS is defined to a value other than 0.

STARPU_MPI_CACHE

Communication cache for starpumpi (MPI Support) will be disabled when the environment variable STARPU_MPI_CACHE is set to 0. It is enabled by default or for any other values of the variable STARPU_MPI_CACHE.
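
STARPU_COMM_STATS and STARPU_MPI_CACHE follow opposite conventions: the first is off unless set, the second is on unless set to 0. The sketch below spells this out. It compares against the string "0" only, which is a simplification; how StarPU actually parses the values is not covered here, and both function names are hypothetical.

```python
def comm_stats_enabled(env):
    """STARPU_COMM_STATS: enabled only when defined to a value other than 0."""
    value = env.get("STARPU_COMM_STATS")
    return value is not None and value != "0"


def mpi_cache_enabled(env):
    """STARPU_MPI_CACHE: enabled unless explicitly set to 0."""
    return env.get("STARPU_MPI_CACHE") != "0"


# With nothing set, statistics are off and the cache is on.
print(comm_stats_enabled({}), mpi_cache_enabled({}))
```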

STARPU_MPI_COMM

Communication trace for starpumpi (MPI Support) will be enabled when the environment variable STARPU_MPI_COMM is set to 1, and StarPU has been configured with the option --enable-verbose.

STARPU_MPI_CACHE_STATS

When set to 1, statistics are enabled for the communication cache (MPI Support). For now, it prints messages on the standard output when data are added to or removed from the received communication cache.

STARPU_MPI_PRIORITIES

When set to 0, the use of priorities to order MPI communications is disabled (MPI Support).

STARPU_MPI_NDETACHED_SEND

This sets the number of send requests that StarPU-MPI will emit concurrently. The default is 10.

STARPU_MPI_NREADY_PROCESS

This sets the number of requests that StarPU-MPI will submit to MPI before polling for termination of existing requests. The default is 10.

STARPU_MPI_FAKE_SIZE

Setting this to a number makes StarPU believe that there are that many MPI nodes, even if it was run on only one MPI node. This allows, for example, simulating the execution of one of the nodes of a big cluster without actually running the rest. It of course does not provide computation results and timing.

STARPU_MPI_FAKE_RANK

Setting this to a number makes StarPU believe that it runs as the given MPI node, even if it was run on only one MPI node. This allows, for example, simulating the execution of one of the nodes of a big cluster without actually running the rest. It of course does not provide computation results and timing.

STARPU_MPI_DRIVER_CALL_FREQUENCY

When set to a positive value, activates the interleaving of the execution of tasks with the progression of MPI communications (MPI Support). The starpu_mpi_init_conf() function must have been called by the application for that environment variable to be used. When set to 0, the MPI progression thread does not use the driver given by the user at all, and only focuses on making MPI communications progress.

STARPU_MPI_DRIVER_TASK_FREQUENCY

When set to a positive value, the mechanism that interleaves the execution of tasks with the progression of MPI communications executes several tasks before checking communication requests again (MPI Support). For this environment variable to be used, the application must have called the starpu_mpi_init_conf() function, and the STARPU_MPI_DRIVER_CALL_FREQUENCY environment variable must be set to a positive value.

STARPU_SIMGRID_TRANSFER_COST

When set to 1 (which is the default), data transfers (over PCI bus, typically) are taken into account in SimGrid mode.

STARPU_SIMGRID_CUDA_MALLOC_COST

When set to 1 (which is the default), CUDA malloc costs are taken into account in SimGrid mode.

STARPU_SIMGRID_CUDA_QUEUE_COST

When set to 1 (which is the default), CUDA task and transfer queueing costs are taken into account in SimGrid mode.

STARPU_PCI_FLAT

When unset or set to 0, the platform file created for SimGrid will contain PCI bandwidths and routes.

STARPU_SIMGRID_QUEUE_MALLOC_COST

When unset or set to 1, simulate within SimGrid the GPU transfer queueing.

STARPU_MALLOC_SIMULATION_FOLD

Define the size of the file used for folding virtual allocation, in MiB. The default is 1, thus allowing 64GiB virtual memory when Linux's sysctl vm.max_map_count value is the default 65535.
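
The 64GiB figure follows from multiplying the fold size by the number of mappings the kernel allows. The sketch below reproduces that arithmetic under the assumption that each mapping covers one fold-sized chunk; max_virtual_memory_gib is a hypothetical helper.

```python
MIB_PER_GIB = 1024


def max_virtual_memory_gib(fold_mib=1, max_map_count=65535):
    """Upper bound on the virtual memory that can be folded, in GiB.

    Assumption: each mapping covers one fold-sized chunk, so the total is
    fold size times the number of mappings allowed by vm.max_map_count.
    """
    return fold_mib * max_map_count / MIB_PER_GIB


# Just under 64 GiB with the default fold size and default max_map_count.
print(max_virtual_memory_gib())
```

Under the same assumption, raising the fold size raises the bound proportionally.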

STARPU_SIMGRID_TASK_SUBMIT_COST

When set to 1 (which is the default), task submission costs are taken into account in SimGrid mode. This provides more accurate SimGrid predictions, especially for the beginning of the execution.

STARPU_SIMGRID_FETCHING_INPUT_COST

When set to 1 (which is the default), fetching input costs are taken into account in SimGrid mode. This provides more accurate SimGrid predictions, especially regarding data transfers.

STARPU_SIMGRID_SCHED_COST

When set to 1 (0 is the default), scheduling costs are taken into account in SimGrid mode. This provides more accurate SimGrid predictions, and allows studying scheduling overhead of the runtime system. However, it also makes simulation non-deterministic.

STARPU_SINK

Variable defined by StarPU when running MPI Xeon PHI on the sink.

Miscellaneous And Debug

STARPU_HOME

Specify the main directory in which StarPU stores its configuration files. The default is $HOME on Unix environments, and $USERPROFILE on Windows environments.

STARPU_PATH

Only used on Windows environments. Specify the main directory in which StarPU is installed (Running a Basic StarPU Application on Microsoft Visual C).

STARPU_PERF_MODEL_DIR

Specify the main directory in which StarPU stores its performance model files. The default is $STARPU_HOME/.starpu/sampling.
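
The fallback chain between STARPU_PERF_MODEL_DIR, STARPU_HOME and the user's home directory can be sketched as follows. The helpers starpu_home and perf_model_dir are hypothetical, and treating an empty value as unset is an assumption of this sketch.

```python
import os


def starpu_home(env, windows=False):
    """Resolve STARPU_HOME, falling back to $HOME or $USERPROFILE."""
    fallback = "USERPROFILE" if windows else "HOME"
    # Assumption: an empty STARPU_HOME is treated like an unset one.
    return env.get("STARPU_HOME") or env[fallback]


def perf_model_dir(env, windows=False):
    """Resolve the directory holding the performance model files."""
    explicit = env.get("STARPU_PERF_MODEL_DIR")
    if explicit:
        return explicit
    return os.path.join(starpu_home(env, windows), ".starpu", "sampling")


print(perf_model_dir({"HOME": "/home/alice"}))
print(perf_model_dir({"HOME": "/home/alice", "STARPU_HOME": "/scratch/alice"}))
```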

STARPU_PERF_MODEL_HOMOGENEOUS_CPU

When this is set to 0, StarPU will assume that CPU devices do not have the same performance, and thus use different performance models for them, thus making kernel calibration much longer, since measurements have to be made for each CPU core.

STARPU_PERF_MODEL_HOMOGENEOUS_CUDA

When this is set to 1, StarPU will assume that all CUDA devices have the same performance, and thus share performance models for them, thus allowing kernel calibration to be much faster, since measurements only have to be made once for all CUDA GPUs.

STARPU_PERF_MODEL_HOMOGENEOUS_OPENCL

When this is set to 1, StarPU will assume that all OpenCL devices have the same performance, and thus share performance models for them, thus allowing kernel calibration to be much faster, since measurements only have to be made once for all OpenCL devices.

STARPU_PERF_MODEL_HOMOGENEOUS_MIC

When this is set to 1, StarPU will assume that all MIC devices have the same performance, and thus share performance models for them, thus allowing kernel calibration to be much faster, since measurements only have to be made once for all MIC devices.

STARPU_PERF_MODEL_HOMOGENEOUS_MPI_MS

When this is set to 1, StarPU will assume that all MPI Slave devices have the same performance, and thus share performance models for them, thus allowing kernel calibration to be much faster, since measurements only have to be made once for all MPI Slaves.

STARPU_HOSTNAME

When set, force the hostname to be used when dealing with performance model files. Models are indexed by machine name. When running for example on a homogeneous cluster, it is possible to share the models between machines by setting export STARPU_HOSTNAME=some_global_name.

STARPU_OPENCL_PROGRAM_DIR

Specify the directory where the OpenCL codelet source files are located. The function starpu_opencl_load_program_source() looks for the codelet in the current directory, in the directory specified by the environment variable STARPU_OPENCL_PROGRAM_DIR, in the directory share/starpu/opencl of the installation directory of StarPU, and finally in the source directory of StarPU.
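
The lookup order described above can be sketched as an ordered list of candidate paths. This illustrates the documented order only and is not the code of starpu_opencl_load_program_source(); the helper names and the install_prefix and source_dir parameters are hypothetical.

```python
import os


def opencl_source_candidates(filename, env, install_prefix, source_dir):
    """List the paths where a codelet source is looked up, in order."""
    dirs = [os.curdir]
    program_dir = env.get("STARPU_OPENCL_PROGRAM_DIR")
    if program_dir:
        # Only searched when the environment variable is set.
        dirs.append(program_dir)
    dirs.append(os.path.join(install_prefix, "share", "starpu", "opencl"))
    dirs.append(source_dir)
    return [os.path.join(d, filename) for d in dirs]


def find_opencl_source(filename, env, install_prefix, source_dir):
    """Return the first candidate that exists, or None."""
    for path in opencl_source_candidates(filename, env, install_prefix,
                                         source_dir):
        if os.path.isfile(path):
            return path
    return None


for path in opencl_source_candidates(
        "kernel.cl", {"STARPU_OPENCL_PROGRAM_DIR": "/opt/codelets"},
        install_prefix="/usr/local", source_dir="/src/starpu"):
    print(path)
```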

STARPU_SILENT

Allows disabling verbose mode at runtime when StarPU has been configured with the option --enable-verbose. Also disables the display of StarPU information and warning messages.

STARPU_MPI_DEBUG_LEVEL_MIN

Set the minimum level of debug when StarPU has been configured with the option --enable-mpi-verbose.

STARPU_MPI_DEBUG_LEVEL_MAX