- /* StarPU --- Runtime system for heterogeneous multicore architectures.
- *
- * Copyright (C) 2014-2021 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
- *
- * StarPU is free software; you can redistribute it and/or modify
- * it under the terms of the GNU Lesser General Public License as published by
- * the Free Software Foundation; either version 2.1 of the License, or (at
- * your option) any later version.
- *
- * StarPU is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
- *
- * See the GNU Lesser General Public License in COPYING.LGPL for more details.
- */
- /*! \page OpenMPRuntimeSupport The StarPU OpenMP Runtime Support (SORS)
- StarPU provides the necessary routines and support to implement an OpenMP
- (http://www.openmp.org/) runtime compliant with the
- revision 3.1 of the language specification, and compliant with the
- task-related data dependency functionalities introduced in the revision
- 4.0 of the language. This StarPU OpenMP Runtime Support (SORS) has been
- designed to be targeted by OpenMP compilers such as the Klang-OMP
- compiler. Most supported OpenMP directives can be implemented either
- inline or as outlined functions.
- All functions are defined in \ref API_OpenMP_Runtime_Support.
- \section OMPImplementation Implementation Details and Specificities
- \subsection OMPMainThread Main Thread
- When using the SORS, the main thread gets involved in executing OpenMP tasks
- just like every other thread, in order to comply with the
- specification's execution model. This contrasts with StarPU's usual
- execution model, where the main thread submits tasks but does not take
- part in executing them.
- \subsection OMPTaskSemantics Extended Task Semantics
- The semantics of tasks generated by the SORS are extended with respect
- to regular StarPU tasks in that SORS' tasks may block and be preempted
- by SORS calls, whereas regular StarPU tasks cannot. SORS tasks may
- coexist with regular StarPU tasks. However, only the tasks created using
- the SORS API functions inherit these extended semantics.
- \section OMPConfiguration Configuration
- The SORS can be compiled into <c>libstarpu</c> through
- the \c configure option \ref enable-openmp "--enable-openmp".
- Conditionally compiled source code may check for the
- availability of the OpenMP Runtime Support by testing whether the C
- preprocessor macro <c>STARPU_OPENMP</c> is defined.
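- For instance, a minimal sketch of such a check (the messages are only
- illustrative):
- \code{.c}
- #include <starpu.h>
- #include <stdio.h>
- int main(void)
- {
- #ifdef STARPU_OPENMP
- /* The OpenMP Runtime Support is compiled into libstarpu. */
- printf("SORS is available\n");
- #else
- /* Fall back to a code path that does not rely on the SORS. */
- printf("SORS is not available\n");
- #endif
- return 0;
- }
- \endcode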
- \section OMPInitExit Initialization and Shutdown
- The SORS needs to be initialized and terminated with
- starpu_omp_init() / starpu_omp_shutdown() instead of
- starpu_init() / starpu_shutdown(). This requirement is necessary to make
- sure that the main thread gets the proper execution environment to run
- OpenMP tasks. These calls will usually be performed by a compiler
- runtime. Thus, they can be executed from a constructor/destructor such
- as this:
- \code{.c}
- __attribute__((constructor))
- static void omp_constructor(void)
- {
- int ret = starpu_omp_init();
- STARPU_CHECK_RETURN_VALUE(ret, "starpu_omp_init");
- }
- __attribute__((destructor))
- static void omp_destructor(void)
- {
- starpu_omp_shutdown();
- }
- \endcode
- \sa starpu_omp_init()
- \sa starpu_omp_shutdown()
- \section OMPSharing Parallel Regions and Worksharing
- The SORS provides functions to create OpenMP parallel regions as well as
- to map work onto participating workers. The current implementation does
- not provide nested active parallel regions: parallel regions may be
- created recursively; however, only the first-level parallel region may
- have more than one worker. From an internal point of view, the SORS'
- parallel regions are implemented as a set of implicit, extended semantics
- StarPU tasks, following the execution model of the OpenMP specification.
- Thus, the SORS' parallel region tasks may block and be preempted by
- SORS calls, enabling constructs such as barriers.
- \subsection OMPParallel Parallel Regions
- Parallel regions can be created with the function
- starpu_omp_parallel_region() which accepts a set of attributes as
- parameter. The execution of the calling task is suspended until the
- parallel region completes. The field starpu_omp_parallel_region_attr::cl
- is a regular StarPU codelet. However, only CPU codelets are
- supported for parallel regions.
- Here is an example of use:
- \code{.c}
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- pthread_t tid = pthread_self();
- int worker_id = starpu_worker_get_id();
- printf("[tid %p] task thread = %d\n", (void *)tid, worker_id);
- }
- void f(void)
- {
- struct starpu_omp_parallel_region_attr attr;
- memset(&attr, 0, sizeof(attr));
- attr.cl.cpu_funcs[0] = parallel_region_f;
- attr.cl.where = STARPU_CPU;
- attr.if_clause = 1;
- starpu_omp_parallel_region(&attr);
- }
- \endcode
- \sa struct starpu_omp_parallel_region_attr
- \sa starpu_omp_parallel_region()
- \subsection OMPFor Parallel For
- OpenMP <c>for</c> loops are provided by the starpu_omp_for() group of
- functions. Variants are available for inline or outlined
- implementations. The SORS supports <c>static</c>, <c>dynamic</c>, and
- <c>guided</c> loop scheduling clauses. The <c>auto</c> scheduling clause
- is implemented as <c>static</c>. The <c>runtime</c> scheduling clause
- honors the scheduling mode selected through the environment variable
- \c OMP_SCHEDULE or the starpu_omp_set_schedule() function. <c>for</c> loops with
- the <c>ordered</c> clause are also supported. An implicit barrier can be
- enforced or skipped at the end of the worksharing construct, according
- to the value of the <c>nowait</c> parameter.
- The canonical family of starpu_omp_for() functions provides each instance
- with the first iteration number and the number of iterations (possibly
- zero) to perform. The alternate family of starpu_omp_for_alt() functions
- provides each instance with the (possibly empty) range of iterations to
- perform, including the first and excluding the last.
- The family of starpu_omp_ordered() functions makes it possible to implement
- OpenMP's ordered construct, a region within a parallel for loop that is
- guaranteed to be executed in the sequential order of the loop
- iterations.
- \code{.c}
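- /* 'array', NB_ITERS and CHUNK are assumed to be declared elsewhere in the program. */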
- void for_g(unsigned long long i, unsigned long long nb_i, void *arg)
- {
- (void) arg;
- for (; nb_i > 0; i++, nb_i--)
- {
- array[i] = 1;
- }
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- starpu_omp_for(for_g, NULL, NB_ITERS, CHUNK, starpu_omp_sched_static, 0, 0);
- }
- \endcode
- \sa starpu_omp_for()
- \sa starpu_omp_for_inline_first()
- \sa starpu_omp_for_inline_next()
- \sa starpu_omp_for_alt()
- \sa starpu_omp_for_inline_first_alt()
- \sa starpu_omp_for_inline_next_alt()
- \sa starpu_omp_ordered()
- \sa starpu_omp_ordered_inline_begin()
- \sa starpu_omp_ordered_inline_end()
- \subsection OMPSections Sections
- OpenMP <c>sections</c> worksharing constructs are supported using the
- set of starpu_omp_sections() variants. The general principle is either
- to provide an array of per-section functions or a single function that
- will redirect execution to the suitable per-section function. An
- implicit barrier can be enforced or skipped at the end of the
- worksharing construct, according to the value of the <c>nowait</c>
- parameter.
- \code{.c}
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- /* The per-section functions f, g, h, i and their arguments are assumed
-  * to be defined elsewhere. */
- void (*section_funcs[4])(void *);
- void *section_args[4];
- section_funcs[0] = f;
- section_funcs[1] = g;
- section_funcs[2] = h;
- section_funcs[3] = i;
- section_args[0] = arg_f;
- section_args[1] = arg_g;
- section_args[2] = arg_h;
- section_args[3] = arg_i;
- starpu_omp_sections(4, section_funcs, section_args, 0);
- }
- \endcode
- \sa starpu_omp_sections()
- \sa starpu_omp_sections_combined()
- \subsection OMPSingle Single
- OpenMP <c>single</c> worksharing constructs are supported using the set
- of starpu_omp_single() variants. An
- implicit barrier can be enforced or skipped at the end of the
- worksharing construct, according to the value of the <c>nowait</c>
- parameter.
- \code{.c}
- void single_f(void *arg)
- {
- (void) arg;
- pthread_t tid = pthread_self();
- int worker_id = starpu_worker_get_id();
- printf("[tid %p] task thread = %d -- single\n", (void *)tid, worker_id);
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- starpu_omp_single(single_f, NULL, 0);
- }
- \endcode
- The SORS also provides dedicated support for <c>single</c> constructs
- with <c>copyprivate</c> clauses through the
- starpu_omp_single_copyprivate() function variants. The OpenMP
- <c>master</c> directive is supported as well using the
- starpu_omp_master() function variants.
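- A hedged sketch of a master region follows, assuming that the outlined
- starpu_omp_master() variant takes the region body and a user argument,
- similarly to starpu_omp_single() but without a <c>nowait</c> parameter:
- \code{.c}
- void master_f(void *arg)
- {
- (void) arg;
- printf("only the master thread of the team runs this\n");
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- /* Assumed calling convention: outlined function plus user argument. */
- starpu_omp_master(master_f, NULL);
- }
- \endcode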
- \sa starpu_omp_master()
- \sa starpu_omp_master_inline()
- \sa starpu_omp_single()
- \sa starpu_omp_single_inline()
- \sa starpu_omp_single_copyprivate()
- \sa starpu_omp_single_copyprivate_inline_begin()
- \sa starpu_omp_single_copyprivate_inline_end()
- \section OMPTask Tasks
- The SORS implements the necessary support of OpenMP 3.1 and OpenMP 4.0's
- so-called explicit tasks, together with OpenMP 4.0's data dependency
- management.
- \subsection OMPTaskExplicit Explicit Tasks
- Explicit OpenMP tasks are created with the SORS using the
- starpu_omp_task_region() function. The implementation supports
- <c>if</c>, <c>final</c>, <c>untied</c> and <c>mergeable</c> clauses
- as defined in the OpenMP specification. Unless specified otherwise by
- the appropriate clause(s), the created task may be executed by any
- participating worker of the current parallel region.
- The current SORS implementation requires explicit tasks to be created
- within the context of an active parallel region. In particular, an
- explicit task cannot be created by the main thread outside of a parallel
- region. Explicit OpenMP tasks created using starpu_omp_task_region() are
- implemented as StarPU tasks with extended semantics, and may as such be
- blocked and preempted by SORS routines.
- The current SORS implementation supports recursive explicit task
- creation, to ensure compliance with the OpenMP specification. However,
- it should be noted that StarPU is neither designed nor optimized for
- efficiently scheduling recursive task applications.
- The code below shows how to create 4 explicit tasks within a parallel
- region.
- \code{.c}
- void task_region_g(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- pthread_t tid = pthread_self();
- int worker_id = starpu_worker_get_id();
- printf("[tid %p] task thread = %d: explicit task \"g\"\n", (void *)tid, worker_id);
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- struct starpu_omp_task_region_attr attr;
- memset(&attr, 0, sizeof(attr));
- attr.cl.cpu_funcs[0] = task_region_g;
- attr.cl.where = STARPU_CPU;
- attr.if_clause = 1;
- attr.final_clause = 0;
- attr.untied_clause = 1;
- attr.mergeable_clause = 0;
- starpu_omp_task_region(&attr);
- starpu_omp_task_region(&attr);
- starpu_omp_task_region(&attr);
- starpu_omp_task_region(&attr);
- }
- \endcode
- \sa struct starpu_omp_task_region_attr
- \sa starpu_omp_task_region()
- \subsection OMPDataDependencies Data Dependencies
- The SORS implements inter-task data dependencies as specified in OpenMP
- 4.0. Data dependencies are expressed using regular StarPU data handles
- (\ref starpu_data_handle_t) plugged into the task's <c>attr.cl</c>
- codelet. The family of starpu_vector_data_register()-like functions and the
- starpu_data_lookup() function may be used to register a memory area and
- to retrieve the current data handle associated with a pointer
- respectively. The testcase <c>./tests/openmp/task_02.c</c> gives a
- detailed example of using OpenMP 4.0 task dependencies with the SORS
- implementation.
- Note: the OpenMP 4.0 specification only supports data dependencies
- between sibling tasks, that is, tasks created by the same implicit or
- explicit parent task. The current SORS implementation also only supports data
- dependencies between sibling tasks. Consequently, the behaviour is
- unspecified if dependencies are expressed between tasks that have not
- been created by the same parent task.
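- The sketch below, modeled on the pattern used in
- <c>./tests/openmp/task_02.c</c>, registers a vector as a StarPU data
- handle and attaches it to two explicit tasks; the <c>attr.handles</c>
- field name and the exact attribute layout should be checked against the
- installed StarPU headers:
- \code{.c}
- /* 'vector' and 'handle' are assumed to be declared at file scope and the
-  * handle registered before the parallel region starts, e.g. with:
-  * starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
-  *                             (uintptr_t)vector, NX, sizeof(vector[0])); */
- void task_region_g(void *buffers[], void *args)
- {
- (void) args;
- struct starpu_vector_interface *v = buffers[0];
- int *p = (int *)STARPU_VECTOR_GET_PTR(v);
- p[0] += 1;
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- struct starpu_omp_task_region_attr attr;
- memset(&attr, 0, sizeof(attr));
- attr.cl.cpu_funcs[0] = task_region_g;
- attr.cl.where = STARPU_CPU;
- attr.cl.nbuffers = 1;
- attr.cl.modes[0] = STARPU_RW;
- attr.handles = &handle; /* assumed field, cf. tests/openmp/task_02.c */
- attr.if_clause = 1;
- /* Both sibling tasks access the same handle in STARPU_RW mode and are
-  * therefore ordered by the runtime. */
- starpu_omp_task_region(&attr);
- starpu_omp_task_region(&attr);
- starpu_omp_taskwait();
- }
- \endcode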
- \subsection OMPTaskSyncs TaskWait and TaskGroup
- The SORS implements both the <c>taskwait</c> and <c>taskgroup</c> OpenMP
- task synchronization constructs specified in OpenMP 4.0, with the
- starpu_omp_taskwait() and starpu_omp_taskgroup() functions respectively.
- An example of starpu_omp_taskwait() use, creating two explicit tasks and
- waiting for their completion:
- \code{.c}
- void task_region_g(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- printf("Hello, World!\n");
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- struct starpu_omp_task_region_attr attr;
- memset(&attr, 0, sizeof(attr));
- attr.cl.cpu_funcs[0] = task_region_g;
- attr.cl.where = STARPU_CPU;
- attr.if_clause = 1;
- attr.final_clause = 0;
- attr.untied_clause = 1;
- attr.mergeable_clause = 0;
- starpu_omp_task_region(&attr);
- starpu_omp_task_region(&attr);
- starpu_omp_taskwait();
- }
- \endcode
- An example of starpu_omp_taskgroup() use, creating a task group of two explicit tasks:
- \code{.c}
- void task_region_g(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- printf("Hello, World!\n");
- }
- void taskgroup_f(void *arg)
- {
- (void)arg;
- struct starpu_omp_task_region_attr attr;
- memset(&attr, 0, sizeof(attr));
- attr.cl.cpu_funcs[0] = task_region_g;
- attr.cl.where = STARPU_CPU;
- attr.if_clause = 1;
- attr.final_clause = 0;
- attr.untied_clause = 1;
- attr.mergeable_clause = 0;
- starpu_omp_task_region(&attr);
- starpu_omp_task_region(&attr);
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- starpu_omp_taskgroup(taskgroup_f, (void *)NULL);
- }
- \endcode
- \sa starpu_omp_task_region()
- \sa starpu_omp_taskwait()
- \sa starpu_omp_taskgroup()
- \sa starpu_omp_taskgroup_inline_begin()
- \sa starpu_omp_taskgroup_inline_end()
- \section OMPSynchronization Synchronization Support
- The SORS implements objects and methods to build common OpenMP
- synchronization constructs.
- \subsection OMPSimpleLock Simple Locks
- The SORS Simple Locks are opaque starpu_omp_lock_t objects enabling multiple
- tasks to synchronize with each other, following the Simple Lock
- constructs defined by the OpenMP specification. In accordance with the
- specification, simple locks may not be acquired multiple times by the
- same task without being released in between; otherwise, deadlocks may
- result. Codes requiring the possibility to lock multiple times
- recursively should use Nestable Locks (\ref OMPNestableLock). Codes NOT
- requiring the possibility to lock multiple times recursively should use
- Simple Locks as they incur less processing overhead than Nestable Locks.
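- A minimal sketch follows, assuming the starpu_omp_*_lock() routines
- mirror their omp_*_lock() counterparts as their names suggest; the
- shared counter and the lock initialization are illustrative only:
- \code{.c}
- static starpu_omp_lock_t counter_lock; /* assumed initialized once with starpu_omp_init_lock() */
- static int counter;
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- starpu_omp_set_lock(&counter_lock);
- counter++; /* protected by the simple lock */
- starpu_omp_unset_lock(&counter_lock);
- }
- \endcode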
- \sa starpu_omp_lock_t
- \sa starpu_omp_init_lock()
- \sa starpu_omp_destroy_lock()
- \sa starpu_omp_set_lock()
- \sa starpu_omp_unset_lock()
- \sa starpu_omp_test_lock()
- \subsection OMPNestableLock Nestable Locks
- The SORS Nestable Locks are opaque starpu_omp_nest_lock_t objects enabling
- multiple tasks to synchronize with each other, following the Nestable
- Lock constructs defined by the OpenMP specification. In accordance with
- the specification, nestable locks may be acquired multiple times
- recursively by the same task without deadlocking. Nested locking and
- unlocking operations must be well parenthesized at all times, otherwise
- deadlock and/or undefined behaviour may occur. Codes requiring the
- possibility to lock multiple times recursively should use Nestable
- Locks. Codes NOT requiring the possibility to lock multiple times
- recursively should use Simple Locks (\ref OMPSimpleLock) instead, as they
- incur less processing overhead than Nestable Locks.
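- A minimal sketch of recursive acquisition follows, again assuming the
- starpu_omp_*_nest_lock() routines mirror their omp_*_nest_lock()
- counterparts:
- \code{.c}
- static starpu_omp_nest_lock_t state_lock; /* assumed initialized once with starpu_omp_init_nest_lock() */
- void update_state(int depth)
- {
- starpu_omp_set_nest_lock(&state_lock);
- if (depth > 0)
- {
- /* Re-acquiring the same nestable lock recursively is allowed. */
- update_state(depth - 1);
- }
- starpu_omp_unset_nest_lock(&state_lock);
- }
- \endcode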
- \sa starpu_omp_nest_lock_t
- \sa starpu_omp_init_nest_lock()
- \sa starpu_omp_destroy_nest_lock()
- \sa starpu_omp_set_nest_lock()
- \sa starpu_omp_unset_nest_lock()
- \sa starpu_omp_test_nest_lock()
- \subsection OMPCritical Critical Sections
- The SORS implements support for OpenMP critical sections through the
- family of \ref starpu_omp_critical functions. Critical sections may optionally
- be named. There is a single, common anonymous critical section. Mutual
- exclusion only occurs within the scope of a single critical section, either
- a named one or the anonymous one.
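- As a hedged sketch, assuming the outlined starpu_omp_critical() variant
- takes the critical section body, a user argument and the section name
- (with NULL selecting the anonymous section); the exact parameter order
- should be checked in the API reference:
- \code{.c}
- void critical_body(void *arg)
- {
- int *counter = arg;
- (*counter)++; /* executed under mutual exclusion */
- }
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- static int counter;
- starpu_omp_critical(critical_body, &counter, "my_critical");
- }
- \endcode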
- \sa starpu_omp_critical()
- \sa starpu_omp_critical_inline_begin()
- \sa starpu_omp_critical_inline_end()
- \subsection OMPBarrier Barriers
- The SORS provides the starpu_omp_barrier() function to implement
- barriers over parallel region teams. In accordance with the OpenMP
- specification, the starpu_omp_barrier() function waits for every
- implicit task of the parallel region to reach the barrier and every
- explicit task launched by the parallel region to complete, before
- returning.
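- For instance, within a parallel region implementation:
- \code{.c}
- void parallel_region_f(void *buffers[], void *args)
- {
- (void) buffers;
- (void) args;
- /* Phase 1: every implicit task of the team performs some work here. */
- starpu_omp_barrier();
- /* Phase 2: no implicit task starts this phase before all implicit tasks
-  * have completed phase 1 and all explicit tasks of the region are done. */
- }
- \endcode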
- \sa starpu_omp_barrier()
- */