/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2009-2021  Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */

/*! \page TasksInStarPU Tasks In StarPU

\section TaskGranularity Task Granularity

Like any other runtime, StarPU has some overhead to manage tasks. Since
it does smart scheduling and data management, this overhead is not always
negligible. The order of magnitude of the overhead is typically a couple of
microseconds, which is actually considerably smaller than the CUDA overhead
itself. The amount of work that a task does should thus be somewhat
bigger, to make sure that the overhead becomes negligible. The offline
performance feedback can provide a measure of task length, which should thus be
checked if poor performance is observed. To get a grasp of the scalability
achievable for a given task size, one can run
<c>tests/microbenchs/tasks_size_overhead.sh</c>, which draws curves of the
speedup of independent tasks of very small sizes.

To determine what task size your application is actually using, one can use
<c>starpu_fxt_data_trace</c>, see \ref DataTrace .

The choice of scheduler also has an impact on the overhead: for instance, the
scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
not. <c>tasks_size_overhead.sh</c> can again be used to get a grasp of how much
impact that has on the target machine.

\section TaskSubmission Task Submission

To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible.
Ideally, all tasks should be
submitted first, and the application should then merely call
starpu_task_wait_for_all() or starpu_data_unregister() to wait for
termination. StarPU will then be able to rework the whole schedule, overlap
computation with communication, manage accelerator local memory usage, etc.

\section TaskPriorities Task Priorities

By default, StarPU will consider the tasks in the order they are submitted by
the application. If the application programmer knows that some tasks should
be performed in priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early
enough), the field starpu_task::priority should be set to provide the
priority information to StarPU.

\section TaskDependencies Task Dependencies

\subsection SequentialConsistency Sequential Consistency

By default, task dependencies are inferred from data dependencies (sequential
consistency) by StarPU. The application can however disable sequential
consistency for some data, and dependencies can then be expressed explicitly.

Setting (or unsetting) sequential consistency can be done at the data
level by calling starpu_data_set_sequential_consistency_flag() for a
specific data or starpu_data_set_default_sequential_consistency_flag()
for all data.

Sequential consistency can also be disabled at task
level by setting the field starpu_task::sequential_consistency to \c 0.

Sequential consistency can also be set (or unset) for each handle of a
specific task. This is done by using the field
starpu_task::handles_sequential_consistency. When set, its value
should be an array with as many elements as the task has
handles, the \c i-th element of the array being the sequential
consistency flag for the \c i-th handle of the task.
The field can easily be
set when calling starpu_task_insert() with the flag
::STARPU_HANDLES_SEQUENTIAL_CONSISTENCY:

\code{.c}
char *seq_consistency = malloc(cl.nbuffers * sizeof(char));
seq_consistency[0] = 1;
seq_consistency[1] = 1;
seq_consistency[2] = 0;
ret = starpu_task_insert(&cl,
	STARPU_RW, handleA, STARPU_RW, handleB, STARPU_RW, handleC,
	STARPU_HANDLES_SEQUENTIAL_CONSISTENCY, seq_consistency,
	0);
free(seq_consistency);
\endcode

The internal algorithm used by StarPU to set up implicit dependencies is
as follows:

\code{.c}
if (sequential_consistency(task) == 1)
    for(i=0 ; i<STARPU_TASK_GET_NBUFFERS(task) ; i++)
      if (sequential_consistency(i-th data, task) == 1)
        if (sequential_consistency(i-th data) == 1)
           create_implicit_dependency(...)
\endcode

\subsection TasksAndTagsDependencies Tasks And Tags Dependencies

One can explicitly set dependencies between tasks using
starpu_task_declare_deps() or starpu_task_declare_deps_array(). Dependencies
between tasks can also be expressed through tags, by associating a task with a
tag through the field starpu_task::tag_id and using the function
starpu_tag_declare_deps() or starpu_tag_declare_deps_array().

The termination of a task can be delayed through the function
starpu_task_end_dep_add(), which specifies the number of calls to the function
starpu_task_end_dep_release() needed to trigger the task termination.
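
The counting behaviour described above can be pictured with the following toy
model. It is a self-contained sketch and not StarPU code: the structure
<c>toy_task</c> and the <c>toy_*</c> functions are hypothetical names introduced
for illustration only, and the model assumes that a task terminates once its
kernel has finished and every registered release has been performed.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical stand-in for a task; this is not a StarPU structure.
struct toy_task
{
	int pending_end_deps; // releases still awaited before termination
	bool kernel_done;     // the computation itself has finished
	bool terminated;      // termination has been triggered
};

// Termination needs both a finished kernel and no pending end dependency.
static void toy_try_terminate(struct toy_task *t)
{
	if (t->kernel_done && t->pending_end_deps == 0)
		t->terminated = true;
}

// Models the role of starpu_task_end_dep_add(): require nb more releases.
void toy_end_dep_add(struct toy_task *t, int nb)
{
	assert(nb >= 0);
	t->pending_end_deps += nb;
}

// Models the role of starpu_task_end_dep_release(): one release was performed.
void toy_end_dep_release(struct toy_task *t)
{
	assert(t->pending_end_deps > 0);
	t->pending_end_deps--;
	toy_try_terminate(t);
}

// Models the end of the kernel execution.
void toy_kernel_finished(struct toy_task *t)
{
	t->kernel_done = true;
	toy_try_terminate(t);
}
```

In this model, a task on which <c>toy_end_dep_add(&t, 2)</c> was called stays
unterminated after its kernel finishes, and only terminates at the second call
to <c>toy_end_dep_release(&t)</c>.
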
One can
also use starpu_task_declare_end_deps() or starpu_task_declare_end_deps_array()
to delay the termination of a task until the termination of other tasks.

\section SettingManyDataHandlesForATask Setting Many Data Handles For a Task

The maximum number of data a task can manage is fixed by the macro
\ref STARPU_NMAXBUFS, whose default value can be changed
through the \c configure option \ref enable-maxbuffers "--enable-maxbuffers".

However, it is possible to define tasks managing more data by using
the field starpu_task::dyn_handles when defining a task and the field
starpu_codelet::dyn_modes when defining the corresponding codelet.

\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
	STARPU_R, STARPU_R, ...
};

struct starpu_codelet dummy_big_cl =
{
	.cuda_funcs = { dummy_big_kernel },
	.opencl_funcs = { dummy_big_kernel },
	.cpu_funcs = { dummy_big_kernel },
	.cpu_funcs_name = { "dummy_big_kernel" },
	.nbuffers = STARPU_NMAXBUFS+1,
	.dyn_modes = modes
};

task = starpu_task_create();
task->cl = &dummy_big_cl;
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
	task->dyn_handles[i] = handle;
}
starpu_task_submit(task);
\endcode

\code{.c}
starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
	handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
		  STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
		  STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
		  0);
\endcode

The whole code for this example is available in the
file <c>examples/basic_examples/dynamic_handles.c</c>.

\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is set with
starpu_codelet::nbuffers.
This field can however be set to
\ref STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers
must be set, and starpu_task::modes (or starpu_task::dyn_modes,
see \ref SettingManyDataHandlesForATask) should be used to specify the modes for
the handles.

\section UsingMultipleImplementationsOfACodelet Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:

\code{.c}
#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned int n_iterations = n/4;
    if (n % 4 != 0)
        n_iterations++;

    __m128 *VECTOR = (__m128*) vector;
    __m128 factor __attribute__((aligned(16)));
    factor = _mm_set1_ps(*(float *) cl_arg);

    unsigned int i;
    for (i = 0; i < n_iterations; i++)
        VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
\endcode

\code{.c}
struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func, scal_sse_func },
    .cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode

Schedulers which are multi-implementation aware (only <c>dmda</c> and
<c>pheft</c> for now) will use the performance models of all the
provided implementations, and pick the one which seems to be the fastest.

\section EnablingImplementationAccordingToCapabilities Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA
devices do not support double floating point precision, and thus the kernel
execution would just fail; or the device may not have enough shared memory for
the implementation being used. The field starpu_codelet::can_execute
makes it possible to express this.
For instance:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  props = starpu_cuda_get_device_properties(workerid);
  if (props->major >= 2 || props->minor >= 3)
    /* At least compute capability 1.3, supports doubles */
    return 1;
  /* Old card, does not support doubles */
  return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { gpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode

This can be essential e.g. when running on a machine which mixes various models
of CUDA devices, to benefit from the new models without crashing on old models.

Note: the function starpu_codelet::can_execute is called by the
scheduler each time it tries to match a task with a worker, and should
thus be very fast. The function starpu_cuda_get_device_properties()
provides quick access to the CUDA properties of CUDA devices to achieve
such efficiency.

Another example is to compile CUDA code for various compute capabilities,
resulting in two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
provided to StarPU by using starpu_codelet::cuda_funcs, and
starpu_codelet::can_execute can then be used to rule out the
<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  if (nimpl == 0)
    /* Trying to execute the 1.3 capability variant, we assume it is ok in all cases.  */
    return 1;
  /* Trying to execute the 2.0 capability variant, check that the card can do it.  */
  props = starpu_cuda_get_device_properties(workerid);
  if (props->major >= 2)
    /* At least compute capability 2.0, can run it */
    return 1;
  /* Old card, does not support 2.0, will not be able to execute the 2.0 variant.  */
  return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { scal_gpu_13, scal_gpu_20 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode

Another example is having specialized implementations for some given common
sizes, for instance here we have a specialized implementation for 1024x1024
matrices:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  switch (nimpl)
  {
    case 0:
      /* Trying to execute the generic capability variant.  */
      return 1;
    case 1:
    {
      /* Trying to execute the size == 1024 specific variant.  */
      struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
      return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
    }
  }
  /* Unknown variant */
  return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode

Note that the most generic variant should be provided first, as some schedulers are
not able to try the different variants.

\section InsertTaskUtility Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease
the creation and submission of tasks.

Here is the implementation of a codelet:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
        float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
        *x0 = *x0 * ifactor;
        *x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
        .cpu_funcs = { func_cpu },
        .cpu_funcs_name = { "func_cpu" },
        .nbuffers = 2,
        .modes = { STARPU_RW, STARPU_RW }
};
\endcode

And the call to the function starpu_task_insert():

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);
\endcode

The call to starpu_task_insert() is equivalent to the following
code:

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                    STARPU_VALUE, &ifactor, sizeof(ifactor),
                    STARPU_VALUE, &ffactor, sizeof(ffactor),
                    0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);
\endcode

Here is a similar call using ::STARPU_DATA_ARRAY.

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);
\endcode

If some part of the task insertion depends on the value of some computation,
the macro ::STARPU_DATA_ACQUIRE_CB can be very convenient. For
instance, assuming that the index variable <c>i</c> was registered as handle
<c>i_handle</c>:

\code{.c}
/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
\endcode

The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
acquiring data <c>i</c> for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
be executed, and is allowed to read from <c>i</c> to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler <c>gcc</c>.

StarPU also provides a utility function starpu_codelet_unpack_args() to
retrieve the ::STARPU_VALUE arguments passed to the task.
There are several ways of calling starpu_codelet_unpack_args().

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;

        /* Unpack all the arguments at once */
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;

        /* Unpack only the first argument, the trailing 0 stops the unpacking */
        starpu_codelet_unpack_args(_args, &ifactor, 0);
        /* Unpack both arguments */
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;
        char buffer[100];

        /* Unpack the first argument and copy the remaining ones into buffer */
        starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
        /* Unpack the remaining argument from buffer */
        starpu_codelet_unpack_args(buffer, &ffactor);
}
\endcode

\section GettingTaskChildren Getting Task Children

It may be interesting to get the list of tasks which depend on a given task,
notably when using implicit dependencies, since this list is computed by StarPU.
starpu_task_get_task_succs() provides it. For instance:

\code{.c}
struct starpu_task *tasks[4];
ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);
\endcode

\section ParallelTasks Parallel Tasks

StarPU can leverage existing parallel computation libraries by means of
parallel tasks. A parallel task is a task which is run by a set of CPUs
(called a parallel or combined worker) at the same time, by using an existing
parallel CPU implementation of the computation to be achieved. This can also be
useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
work collectively on a single task, the completion time of tasks on CPUs becomes
comparable to the completion time on GPUs, thus relieving granularity
discrepancy concerns.
<c>hwloc</c> support needs to be enabled to get
good performance, otherwise StarPU will not know how to best group
cores.

Two modes of execution exist to accommodate existing usages.

\subsection Fork-modeParallelTasks Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one
of the CPUs of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the number of threads it is
allowed to start to achieve the computation. The CPU binding mask for the whole
set of CPUs is already enforced, so that threads created by the function will
inherit the mask, and thus execute where StarPU expected, the OS being in charge
of choosing how to schedule threads on the corresponding CPUs. The application
can also choose to bind threads by hand, using e.g. <c>sched_getaffinity</c> to know
the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in
<c>examples/openmp/vector_scal.c</c>):

\snippet forkmode.c To be included. You should update doxygen if you see this text.

Other examples include for instance calling a BLAS parallel CPU implementation
(see <c>examples/mult/xgemm.c</c>).

\subsection SPMD-modeParallelTasks SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on
each CPU of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the total number of CPUs
involved in the combined worker, and thus the number of calls that are made in
parallel to the function, and starpu_combined_worker_get_rank() to get
the rank of the current CPU within the combined worker.
For instance:

\code{.c}
static void func(void *buffers[], void *_args)
{
    unsigned i;
    float *factor = _args;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    /* Compute slice to compute */
    unsigned m = starpu_combined_worker_get_size();
    unsigned j = starpu_combined_worker_get_rank();
    unsigned slice = (n+m-1)/m;

    for (i = j * slice; i < (j+1) * slice && i < n; i++)
        val[i] *= *factor;
}

static struct starpu_codelet cl =
{
    .modes = { STARPU_RW },
    .type = STARPU_SPMD,
    .max_parallelism = INT_MAX,
    .cpu_funcs = { func },
    .cpu_funcs_name = { "func" },
    .nbuffers = 1,
};
\endcode

Of course, this trivial example will not really benefit from parallel task
execution, and was only meant to be simple to understand.  The benefit comes
when the computation to be done is such that threads have to e.g. exchange
intermediate results, or write to the data in a complex but safe way in the same
buffer.

\subsection ParallelTasksPerformance Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
be used. When exposed to codelets with a flag ::STARPU_FORKJOIN or
::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
(parallel eager) will indeed also try to execute tasks with
several CPUs. They will automatically try the various available combined
worker sizes (making several measurements for each worker size) and
thus be able to avoid choosing a large combined worker if the codelet
does not actually scale so much.

This is however for now only a proof of concept, and has not really been optimized yet.

\subsection CombinedWorkers Combined Workers

By default, StarPU creates combined workers according to the architecture
structure as detected by <c>hwloc</c>. It means that for each object of the <c>hwloc</c>
topology (NUMA node, socket, cache, ...) a combined worker will be created.
If
some nodes of the hierarchy have a big arity (e.g. many cores in a socket
without a hierarchy of shared caches), StarPU will create combined workers of
intermediate sizes. The variable \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
makes it possible to tune the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the
tool <c>starpu_machine_display</c> (the environment variable
\ref STARPU_SCHED has to be set to a combined worker-aware scheduler such
as <c>pheft</c> or <c>peager</c>).

\subsection ConcurrentParallelTasks Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent
calls.

For instance, most OpenMP implementations (including the main ones) do not
support concurrent <c>pragma omp parallel</c> statements without nesting them in
another <c>pragma omp parallel</c> statement, but StarPU does not yet support
creating its CPU workers by using such a pragma.

Other parallel libraries are also not safe when being invoked concurrently
from different threads, due to the use of global variables in their sequential
sections for instance.

The solution is then to use only one combined worker at a time.  This can be
done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
to <c>1</c>. StarPU will then run only one parallel task at a time (but other
CPU and GPU tasks are not affected and can be run concurrently).
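
The first option can be sketched as follows. This is a minimal configuration
fragment rather than a complete application: it assumes the usual
starpu_conf_init() / starpu_init() start-up sequence, and error handling is
reduced to a single check.

```c
#include <starpu.h>

int main(void)
{
	struct starpu_conf conf;

	// Fill conf with default values before overriding a single field.
	starpu_conf_init(&conf);
	// Allow only one combined worker to run at a time.
	conf.single_combined_worker = 1;

	if (starpu_init(&conf) != 0)
		return 1;

	// ... register data and submit tasks here ...

	starpu_shutdown();
	return 0;
}
```

Setting the environment variable instead has the advantage of not requiring
the application to be recompiled.
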
The parallel
task scheduler will however still try varying combined worker
sizes to look for the most efficient ones.

\subsection SynchronizationTasks Synchronization Tasks

For the application's convenience, it may be useful to define tasks which do not
actually make any computation, but carry for instance dependencies between other
tasks or tags, or are to be submitted in callbacks, etc.

The obvious way is of course to make kernel functions empty, but such a task will
thus have to wait for a worker to become ready, transfer data, etc.

A much lighter way to define a synchronization task is to set its starpu_task::cl
field to <c>NULL</c>. The task will thus be a mere synchronization point,
without any data access or execution content: as soon as its dependencies become
available, it will terminate, call the callbacks, and release dependencies.

An intermediate solution is to define a codelet with its
starpu_codelet::where field set to \ref STARPU_NOWHERE, for instance:

\code{.c}
struct starpu_codelet cl =
{
	.where = STARPU_NOWHERE,
	.nbuffers = 1,
	.modes = { STARPU_R },
};

task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
starpu_task_submit(task);
\endcode

will create a task which simply waits for the value of <c>handle</c> to be
available for read. This task can then be depended on, etc.

*/