/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2009-2020  Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */

/*!
\mainpage Introduction

\htmlonly
<h1><a class="anchor" id="Foreword"></a>Foreword</h1>
\endhtmlonly
\htmlinclude version.html
\htmlinclude foreword.html

\section Motivation Motivation

// This is a comment and it will be removed before the file is processed by doxygen
// complex machines with heterogeneous cores/devices

The use of specialized hardware such as accelerators or coprocessors offers an
interesting approach to overcome the physical limits encountered by processor
architects. As a result, many machines are now equipped with one or several
accelerators (e.g. a GPU), in addition to the usual processor(s). While a lot of
effort has been devoted to offloading computation onto such accelerators, very
little attention has been paid to portability concerns on the one hand, and to the
possibility of having heterogeneous accelerators and processors interact on the other hand.

StarPU is a runtime system that offers support for heterogeneous multicore
architectures. It not only offers a unified view of the computational resources
(i.e. CPUs and accelerators at the same time), but it also takes care of
efficiently mapping and executing tasks onto a heterogeneous machine while
transparently handling low-level issues such as data transfers in a portable
fashion.

// this leads to a complicated distributed memory design
// which is not (easily) manageable by hand
// added value/benefits of StarPU
//    - portability
//   - scheduling, perf. portability

\section StarPUInANutshell StarPU in a Nutshell

StarPU is a software tool aiming to allow programmers to exploit the
computing power of the available CPUs and GPUs, while relieving them
from the need to specially adapt their programs to the target machine
and processing units.

At the core of StarPU is its runtime support library, which is
responsible for scheduling application-provided tasks on heterogeneous
CPU/GPU machines.
In addition, StarPU comes with programming language
support, in the form of an OpenCL front-end (\ref SOCLOpenclExtensions).

StarPU's runtime and programming language extensions support a
task-based programming model. Applications submit computational
tasks, with CPU and/or GPU implementations, and StarPU schedules these
tasks and associated data transfers on available CPUs and GPUs.  The
data that a task manipulates are automatically transferred among
accelerators and the main memory, so that programmers are freed from the
scheduling issues and technical details associated with these transfers.

StarPU takes particular care of scheduling tasks efficiently, using
well-known algorithms from the literature (\ref TaskSchedulingPolicy).
In addition, it allows scheduling experts, such as compiler or
computational library developers, to implement custom scheduling
policies in a portable fashion (\ref HowToDefineANewSchedulingPolicy).

The remainder of this section describes the main concepts used in StarPU.

A video is available on the StarPU website
https://starpu.gitlabpages.inria.fr/ that presents these concepts in 26 minutes.

Some tutorials are also available on https://starpu.gitlabpages.inria.fr/tutorials/

// explain the notion of codelet and task (i.e. g(A, B)

\subsection CodeletAndTasks Codelet and Tasks

One of StarPU's primary data structures is the \b codelet. A codelet describes a
computational kernel that can possibly be implemented on multiple architectures
such as a CPU, a CUDA device or an OpenCL device.

// TODO insert illustration f: f_spu, f_cpu, ...

Another important data structure is the \b task. Executing a StarPU task
consists of applying a codelet to a data set, on one of the architectures on
which the codelet is implemented. A task thus describes the codelet that it
uses, but also which data are accessed, and how they are
accessed during the computation (read and/or write).
StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
operation.
The task structure can also specify a \b callback function that is
called once StarPU has properly executed the task. It also contains optional
fields that the application may use to give hints to the scheduler (such as
priority levels).

By default, task dependencies are inferred from data dependency (sequential
coherency) by StarPU. The application can however disable sequential coherency
for some data, and dependencies can be specifically expressed.
A task may be identified by a unique 64-bit number chosen by the application,
which we refer to as a \b tag.
Task dependencies can be enforced either by the means of callback functions, by
submitting other tasks, or by expressing dependencies
between tags (which can thus correspond to tasks that have not yet been submitted).

// TODO insert illustration f(Ar, Brw, Cr) + ..

// DSM
\subsection StarPUDataManagementLibrary StarPU Data Management Library

Because StarPU schedules tasks at runtime, data transfers have to be
done automatically and ``just-in-time'' between processing units,
relieving application programmers from explicit data transfers.
Moreover, to avoid unnecessary transfers, StarPU keeps data
where it was last needed, even if it was modified there, and it
allows multiple copies of the same data to reside at the same time on
several processing units as long as it is not modified.

\section ApplicationTaskification Application Taskification

TODO

// TODO: section describing what taskifying an application means: before
// porting to StarPU, turn the program into:
// "pure" functions, which only access data from their passed parameters
// a main function which just calls these pure functions
// and then it's trivial to use StarPU or any other kind of task-based library:
// simply replace calling the function with submitting a task.

\section Glossary Glossary

A \b codelet records pointers to various implementations of the same
theoretical function.

A <b>memory node</b> can be either the main RAM, GPU-embedded memory or a disk memory.

A \b bus is a link
between memory nodes.

A <b>data handle</b> keeps track of replicates of the same data (\b registered by the
application) over various memory nodes. The data management library keeps
them coherent.

The \b home memory node of a data handle is the memory node from which the data
was registered (usually the main memory node).

A \b task represents a scheduled execution of a codelet on some data handles.

A \b tag is a rendez-vous point. Tasks typically have their own tag, and can
depend on other tags. The value is chosen by the application.

A \b worker executes tasks. There is typically one per CPU computation core and
one per accelerator (for which a whole CPU core is dedicated).

A \b driver drives a given kind of worker. There are currently CPU, CUDA,
and OpenCL drivers. They usually start several workers to actually drive
them.

A <b>performance model</b> is a (dynamic or static) model of the performance of a
given codelet. Codelets can have execution time performance models as well as
energy consumption performance models.

A data \b interface describes the layout of the data: for a vector, a pointer
for the start, the number of elements and the size of elements; for a matrix, a
pointer for the start, the number of elements per row, the offset between rows,
and the size of each element; etc.
To access their data, codelet functions are
given interfaces for the local memory node replicates of the data handles of the
scheduled task.

\b Partitioning data means dividing the data of a given data handle (called
\b father) into a series of \b children data handles which designate various
portions of the former.

A \b filter is the function which computes children data handles from a father
data handle, and thus describes how the partitioning should be done (horizontal,
vertical, etc.)

\b Acquiring a data handle can be done from the main application, to safely
access the data of a data handle from its home node, without having to
unregister it.

\section ResearchPapers Research Papers

Research papers about StarPU can be found at
https://starpu.gitlabpages.inria.fr/publications/.

A good overview is available in the research report at
http://hal.archives-ouvertes.fr/inria-00467677.

\section StarPUApplications StarPU Applications

You can first have a look at the chapters \ref BasicExamples and \ref AdvancedExamples.
A tutorial is also installed in the directory <c>share/doc/starpu/tutorial/</c>.

Many examples are also available in the StarPU sources in the directory
<c>examples/</c>. Simple examples include:

<dl>
<dt> <c>incrementer/</c> </dt>
<dd> Trivial incrementation test.
</dd>
<dt> <c>basic_examples/</c> </dt>
<dd>
        Simple documented Hello world and vector/scalar product (as
        shown in \ref BasicExamples), matrix
        product examples (as shown in \ref PerformanceModelExample), an example using the blocked matrix data
        interface, an example using the variable data interface, and an example
        using different formats on CPUs and GPUs.
</dd>
<dt> <c>matvecmult/</c></dt>
<dd>
    OpenCL example from NVidia, adapted to StarPU.
</dd>
<dt> <c>axpy/</c></dt>
<dd>
    AXPY CUBLAS operation adapted to StarPU.
</dd>
<dt> <c>native_fortran/</c> </dt>
<dd>
    Example of using StarPU's native Fortran support.
</dd>
<dt> <c>fortran90/</c> </dt>
<dd>
    Example of Fortran 90 bindings, using C marshalling wrappers.
</dd>
<dt> <c>fortran/</c> </dt>
<dd>
    Example of Fortran 77 bindings, using C marshalling wrappers.
</dd>
</dl>

More advanced examples include:

<dl>
<dt><c>filters/</c></dt>
<dd>
    Examples using filters, as shown in \ref PartitioningData.
</dd>
<dt><c>lu/</c></dt>
<dd>
    LU matrix factorization, see for instance <c>xlu_implicit.c</c>
</dd>
<dt><c>cholesky/</c></dt>
<dd>
    Cholesky matrix factorization, see for instance <c>cholesky_implicit.c</c>.
</dd>
</dl>

\section FurtherReading Further Reading

The documentation chapters include

<ul>
<li> Part 1: StarPU Basics
<ul>
<li> \ref BuildingAndInstallingStarPU
<li> \ref BasicExamples
</ul>
<li> Part 2: StarPU Quick Programming Guide
<ul>
<li> \ref AdvancedExamples
<li> \ref CheckListWhenPerformanceAreNotThere
</ul>
<li> Part 3: StarPU Inside
<ul>
<li> \ref TasksInStarPU
<li> \ref DataManagement
<li> \ref Scheduling
<li> \ref SchedulingContexts
<li> \ref SchedulingContextHypervisor
<li> \ref HowToDefineANewSchedulingPolicy
<li> \ref DebuggingTools
<li> \ref OnlinePerformanceTools
<li> \ref OfflinePerformanceTools
<li> \ref FrequentlyAskedQuestions
</ul>
<li> Part 4: StarPU Extensions
<ul>
<li> \ref OutOfCore
<li> \ref MPISupport
<li> \ref FaultTolerance
<li> \ref FFTSupport
<li> \ref MICSupport
<li> \ref NativeFortranSupport
<li>
\ref SOCLOpenclExtensions
<li> \ref SimGridSupport
<li> \ref OpenMPRuntimeSupport
<li> \ref ClusteringAMachine
</ul>
<li> Part 5: StarPU Reference API
<ul>
<li> \ref ExecutionConfigurationThroughEnvironmentVariables
<li> \ref CompilationConfiguration
<li> \ref ModuleDocumentation
<li> \ref FileDocumentation
<li> \ref deprecated
</ul>
<li> Part: Appendix
<ul>
<li> \ref FullSourceCodeVectorScal
<li> \ref GNUFreeDocumentationLicense
</ul>
</ul>

Make sure to have a look at those too!

*/