StarPU Handbook
TODO: introduction covering coherency, among other things
StarPU provides several data interfaces for programmers to describe the data layout of their application. There are predefined interfaces already available in StarPU. Users can define new data interfaces as explained in Defining A New Data Interface. All functions provided by StarPU are documented in Data Interfaces. You will find a short list below.
A variable is an element of a given size in bytes, typically a scalar. Variable data is registered to StarPU with starpu_variable_data_register().
A vector is a fixed number of elements of a given size. Vector data is registered to StarPU with starpu_vector_data_register().
To register 2-D matrices, possibly with padding, one can use the matrix data interface. Matrix data is registered to StarPU with starpu_matrix_data_register().
To register 3-D blocks, possibly with padding on the Y and Z dimensions, one can use the block data interface. Block data is registered to StarPU with starpu_block_data_register().
BCSR (Blocked Compressed Sparse Row Representation) sparse matrix data can be registered to StarPU using the bcsr data interface, with starpu_bcsr_data_register().
StarPU provides an example on how to deal with such matrices in examples/spmv.
TODO
When the application allocates data, it should whenever possible use the starpu_malloc() function, which asks CUDA or OpenCL to make the allocation itself and pin the allocated memory, or use the starpu_memory_pin() function to pin memory allocated by other means, such as local arrays. Pinning is needed to permit asynchronous data transfers, i.e. to let data transfers overlap with computations. Otherwise, the trace will show that the DriverCopyAsync state takes a lot of time, because CUDA or OpenCL then reverts to synchronous transfers.
By default, StarPU leaves replicates of data wherever they were used, in case they will be re-used by other tasks, thus saving the data transfer time. When some task modifies some data, all the other replicates are invalidated, and only the processing unit which ran this task will have a valid replicate of the data. If the application knows that this data will not be re-used by further tasks, it should advise StarPU to immediately replicate it to a desired list of memory nodes (given through a bitmask). This can be understood like the write-through mode of CPU caches.
Setting bit 0 of the write-through bitmask, i.e. calling starpu_data_set_wt_mask() with the mask 1<<0, will for instance request to always automatically transfer a replicate into the main memory (node 0).
Setting the mask to ~0U will request to always automatically broadcast the updated data to all memory nodes.
Setting the write-through mask to ~0U can also be useful to make sure all memory nodes always have a copy of the data, so that it is never evicted when memory gets scarce.
Implicit data dependency computation can become expensive if a lot of tasks access the same piece of data. If no dependency is required on some piece of data (e.g. because it is only accessed in read-only mode, or because write accesses are actually commutative), use the function starpu_data_set_sequential_consistency_flag() to disable implicit dependencies on this data.
In the same vein, accumulating results into the same data can become a bottleneck. The access mode STARPU_REDUX makes it possible to optimize such accumulation (see Data Reduction). To a lesser extent, the flag STARPU_COMMUTE keeps the bottleneck (see Commute Data Access), but at least lets the accumulations happen in any order.
Applications often need a piece of data just for temporary results. In such a case, the data can be registered without an initial value, for instance by passing -1 as the home node and a null pointer to starpu_vector_data_register() to produce a vector.
StarPU will then allocate the buffer only when it is actually needed, e.g. directly on the GPU without allocating in main memory.
In the same vein, once the temporary results are no longer useful, the data should be thrown away. If the handle is not to be reused, it can be unregistered with starpu_data_unregister_submit(); the actual unregistration will be done after all tasks working on the handle terminate.
If the handle is to be reused, instead of unregistering it, it can simply be invalidated with starpu_data_invalidate_submit(); the buffers containing the current value will then be freed, and reallocated only when another task writes some value to the handle.
The scheduling policies heft, dmda and pheft perform data prefetch (see STARPU_PREFETCH): as soon as a scheduling decision is taken for a task, requests are issued to transfer its required data to the target processing unit, if needed, so that when the processing unit actually starts the task, its data will hopefully be already available and it will not have to wait for the transfer to finish.
The application may want to perform some manual prefetching, for several reasons such as excluding initial data transfers from performance measurements, or setting up an initial statically-computed data distribution on the machine before submitting tasks, which will thus guide StarPU toward an initial task distribution (since StarPU will try to avoid further transfers).
This can be achieved by giving the function starpu_data_prefetch_on_node() the handle and the desired target memory node. The starpu_data_idle_prefetch_on_node() variant can be used to issue the transfer only when the bus is idle.
Conversely, one can advise StarPU that some data will not be useful in the close future by calling starpu_data_wont_use(). StarPU will then write its value b