- /*
- * This file is part of the StarPU Handbook.
- * Copyright (C) 2009--2011 Université de Bordeaux 1
- * Copyright (C) 2010, 2011, 2012, 2013 Centre National de la Recherche Scientifique
- * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
- * See the file version.doxy for copying conditions.
- */
- /*! \page basicExamples Basic Examples
- \section Hello_World_using_the_C_Extension Hello World using the C Extension
- This section shows how to implement a simple program that submits a task
- to StarPU using the StarPU C extension (\ref cExtensions). The complete example, and additional examples,
- are available in the <c>gcc-plugin/examples</c> directory of the StarPU
- distribution. A similar example showing how to use StarPU's API directly is given
- in \ref Hello_World_using_StarPU_API.
- The StarPU GCC plug-in (\ref cExtensions) requires GCC 4.5 or later. It makes writing a task both simpler and less error-prone.
- In a nutshell, all it takes is to declare a task, declare and define its
- implementations (for CPU, OpenCL, and/or CUDA), and invoke the task like
- a regular C function. The example below defines <c>my_task</c> which
- has a single implementation for CPU:
- \include hello_pragma.c
- The code can then be compiled and linked with GCC and the <c>-fplugin</c> flag:
- \verbatim
- $ gcc `pkg-config starpu-1.1 --cflags` hello-starpu.c \
- -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` \
- `pkg-config starpu-1.1 --libs`
- \endverbatim
- The code can also be compiled without the StarPU C extension and will
- behave as a normal sequential code.
- \verbatim
- $ gcc hello-starpu.c
- hello-starpu.c:33:1: warning: ‘task’ attribute directive ignored [-Wattributes]
- $ ./a.out
- Hello, world! With x = 42
- \endverbatim
- As can be seen above, the C extension allows programmers to
- use StarPU tasks by essentially annotating "regular" C code.
- \section Hello_World_using_StarPU_API Hello World using StarPU's API
- This section shows how to achieve the same result as in the previous
- section using StarPU's standard C API.
- \subsection Required_Headers Required Headers
- The starpu.h header should be included in any code using StarPU.
- \code{.c}
- #include <starpu.h>
- \endcode
- \subsection Defining_a_Codelet Defining a Codelet
- \code{.c}
- struct params
- {
- int i;
- float f;
- };
- void cpu_func(void *buffers[], void *cl_arg)
- {
- struct params *params = cl_arg;
- printf("Hello world (params = {%i, %f} )\n", params->i, params->f);
- }
- struct starpu_codelet cl =
- {
- .where = STARPU_CPU,
- .cpu_funcs = { cpu_func, NULL },
- .nbuffers = 0
- };
- \endcode
- A codelet is a structure that represents a computational kernel. Such a codelet
- may contain an implementation of the same kernel on different architectures
- (e.g. CUDA, x86, ...). For compatibility, make sure that the whole
- structure is properly initialized to zero, either by using the
- function starpu_codelet_init(), or by letting the
- compiler implicitly do it as exemplified above.
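- As an alternative to the designated-initializer style used above, the
- structure can be zeroed and filled in explicitly. The following is a minimal
- sketch, assumed to be equivalent to the definition above, using
- starpu_codelet_init():
- \code{.c}
- struct starpu_codelet cl;
- 
- /* Zero the structure and set default values. */
- starpu_codelet_init(&cl);
- 
- cl.where = STARPU_CPU;        /* CPU-only codelet */
- cl.cpu_funcs[0] = cpu_func;   /* single CPU implementation */
- cl.nbuffers = 0;              /* no StarPU-managed data buffer */
- \endcode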
- The <c>nbuffers</c> field specifies the number of data buffers that are
- manipulated by the codelet: here the codelet does not access or modify any data
- that is controlled by our data management library. Note that the argument
- passed to the codelet (the <c>cl_arg</c> field of the <c>starpu_task</c>
- structure) does not count as a buffer since it is not managed by our data
- management library, but just contains trivial parameters.
- \internal
- TODO need a crossref to the proper description of "where" see bla for more ...
- \endinternal
- We create a codelet which may only be executed on the CPUs. The <c>where</c>
- field is a bitmask that defines where the codelet may be executed. Here, the
- <c>STARPU_CPU</c> value means that only CPUs can execute this codelet
- (see Codelets and Tasks for more details on this field). Note that
- the <c>where</c> field is optional: when it is unset, its value is
- automatically set based on the availability of the different
- <c>XXX_funcs</c> fields, as sketched below.
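- For instance, a minimal sketch of the same codelet written without the
- <c>where</c> field; StarPU then deduces that only CPUs can run it, since only
- <c>cpu_funcs</c> is set:
- \code{.c}
- struct starpu_codelet cl =
- {
-     /* No .where field: deduced from the implementation fields below. */
-     .cpu_funcs = { cpu_func, NULL },
-     .nbuffers = 0
- };
- \endcode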
- When a CPU core executes a codelet, it calls the <c>cpu_func</c> function,
- which \em must have the following prototype:
- \code{.c}
- void (*cpu_func)(void *buffers[], void *cl_arg);
- \endcode
- In this example, we can ignore the first argument of this function which gives a
- description of the input and output buffers (e.g. the size and the location of
- the matrices) since there are none.
- The second argument is a pointer to a buffer passed as an
- argument to the codelet by the means of the <c>cl_arg</c> field of the
- <c>starpu_task</c> structure.
- \internal
- TODO rewrite so that it is a little clearer ?
- \endinternal
- Be aware that this may be a pointer to a
- \em copy of the actual buffer, and not the pointer given by the programmer:
- if the codelet modifies this buffer, there is no guarantee that the initial
- buffer will be modified as well. This implies, for instance, that the buffer
- cannot be used as a synchronization medium. If synchronization is needed, data
- has to be registered to StarPU, see \ref Vector_Scaling_Using_StarPU_API.
- \subsection Submitting_a_Task Submitting a Task
- \code{.c}
- void callback_func(void *callback_arg)
- {
- printf("Callback function (arg %x)\n", callback_arg);
- }
- int main(int argc, char **argv)
- {
- /* initialize StarPU */
- starpu_init(NULL);
- struct starpu_task *task = starpu_task_create();
- task->cl = &cl; /* Pointer to the codelet defined above */
- struct params params = { 1, 2.0f };
- task->cl_arg = &params;
- task->cl_arg_size = sizeof(params);
- task->callback_func = callback_func;
- task->callback_arg = 0x42;
- /* starpu_task_submit will be a blocking call */
- task->synchronous = 1;
- /* submit the task to StarPU */
- starpu_task_submit(task);
- /* terminate StarPU */
- starpu_shutdown();
- return 0;
- }
- \endcode
- Before submitting any tasks to StarPU, starpu_init() must be called. The
- <c>NULL</c> argument specifies that we use the default configuration. Tasks cannot
- be submitted after the termination of StarPU by a call to
- starpu_shutdown().
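- In a real application, it is good practice to check the value returned by
- starpu_init(); the following is a minimal sketch of such a check (the error
- handling policy itself is up to the application):
- \code{.c}
- int ret = starpu_init(NULL);
- if (ret != 0)
- {
-     /* StarPU could not be initialized (e.g. no worker is available). */
-     fprintf(stderr, "starpu_init failed (%d)\n", ret);
-     return 1;
- }
- \endcode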
- In the example above, a task structure is allocated by a call to
- starpu_task_create(). This function only allocates and fills the
- corresponding structure with the default settings (see Codelets and
- Tasks, starpu_task_create()), but it does not submit the task to StarPU.
- \internal
- not really clear ;)
- \endinternal
- The <c>cl</c> field is a pointer to the codelet which the task will
- execute: in other words, the codelet structure describes which computational
- kernel should be offloaded on the different architectures, and the task
- structure is a wrapper containing a codelet and the piece of data on which the
- codelet should operate.
- The optional <c>cl_arg</c> field is a pointer to a buffer (of size
- <c>cl_arg_size</c>) with some parameters for the kernel
- described by the codelet. For instance, if a codelet implements a computational
- kernel that multiplies its input vector by a constant, the constant could be
- specified by the means of this buffer, instead of registering it as a StarPU
- data. It must however be noted that StarPU avoids making copies whenever possible
- and rather passes the pointer as such, so the buffer which is pointed at must be
- kept allocated until the task terminates, and if several tasks are submitted
- with various parameters, each of them must be given a pointer to its own
- buffer, as sketched below.
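- The following is a minimal sketch of that pattern, assuming asynchronous
- submission and <c>stdlib.h</c> being included: each task gets its own
- heap-allocated parameter buffer, released from a (hypothetical) callback once
- the task is over.
- \code{.c}
- static void free_params_callback(void *arg)
- {
-     /* Release the per-task parameter buffer. */
-     free(arg);
- }
- 
- /* ... in main, after starpu_init(): ... */
- for (int i = 0; i < 10; i++)
- {
-     struct params *p = malloc(sizeof(*p));
-     p->i = i;
-     p->f = 2.0f * i;
- 
-     struct starpu_task *task = starpu_task_create();
-     task->cl = &cl;
-     task->cl_arg = p;                       /* this task's own buffer */
-     task->cl_arg_size = sizeof(*p);
-     task->callback_func = free_params_callback;
-     task->callback_arg = p;                 /* freed once the task is over */
-     starpu_task_submit(task);
- }
- \endcode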
- Once a task has been executed, an optional callback function is called.
- While the computational kernel could be offloaded on various architectures, the
- callback function is always executed on a CPU. The <c>callback_arg</c>
- pointer is passed as an argument of the callback. The prototype of a callback
- function must be:
- \code{.c}
- void (*callback_function)(void *);
- \endcode
- If the <c>synchronous</c> field is non-zero, task submission will be
- synchronous: the starpu_task_submit() function will not return until the
- task has been executed. Note that the starpu_shutdown() function does not
- guarantee that asynchronous tasks have been executed before it returns;
- starpu_task_wait_for_all() can be used to that effect, or data can be
- unregistered (starpu_data_unregister()), which will
- implicitly wait for all the tasks scheduled to work on it, unless explicitly
- disabled thanks to starpu_data_set_default_sequential_consistency_flag() or
- starpu_data_set_sequential_consistency_flag().
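- For illustration, a minimal sketch of the asynchronous variant of the example
- above, using starpu_task_wait_for_all() as a barrier:
- \code{.c}
- task->synchronous = 0;        /* starpu_task_submit() returns immediately */
- starpu_task_submit(task);
- 
- /* ... possibly submit more tasks here ... */
- 
- /* Wait until all submitted tasks have been executed. */
- starpu_task_wait_for_all();
- \endcode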
- \subsection Execution_of_Hello_World Execution of Hello World
- \verbatim
- $ make hello_world
- cc $(pkg-config --cflags starpu-1.1) $(pkg-config --libs starpu-1.1) hello_world.c -o hello_world
- $ ./hello_world
- Hello world (params = {1, 2.000000} )
- Callback function (arg 42)
- \endverbatim
- \section Vector_Scaling_Using_the_C_Extension Vector Scaling Using the C Extension
- The previous example has shown how to submit tasks. In this section,
- we show how StarPU tasks can manipulate data.
- We will first show how to use the C language extensions provided by
- the GCC plug-in (\ref cExtensions). The complete example, and
- additional examples, are available in the <c>gcc-plugin/examples</c>
- directory of the StarPU distribution. These extensions map directly
- to StarPU's main concepts: tasks, task implementations for CPU,
- OpenCL, or CUDA, and registered data buffers. The standard C version
- that uses StarPU's standard C programming interface is given in the
- next section (\ref Vector_Scaling_Using_StarPU_API).
- First of all, the vector-scaling task and its simple CPU implementation
- have to be defined:
- \code{.c}
- /* Declare the `vector_scal' task. */
- static void vector_scal (unsigned size, float vector[size],
- float factor)
- __attribute__ ((task));
- /* Define the standard CPU implementation. */
- static void
- vector_scal (unsigned size, float vector[size], float factor)
- {
- unsigned i;
- for (i = 0; i < size; i++)
- vector[i] *= factor;
- }
- \endcode
- Next, the body of the program, which uses the task defined above, can be
- implemented:
- \include hello_pragma2.c
- The <c>main</c> function above does several things:
- <ul>
- <li>
- It initializes StarPU.
- </li>
- <li>
- It allocates <c>vector</c> in the heap; it will automatically be freed
- when its scope is left. Alternatively, good old <c>malloc</c> and
- <c>free</c> could have been used, but they are more error-prone and
- require more typing.
- </li>
- <li>
- It registers the memory pointed to by <c>vector</c>. Eventually,
- when OpenCL or CUDA task implementations are added, this will allow
- StarPU to transfer that memory region between GPUs and the main memory.
- Removing this <c>pragma</c> is an error.
- </li>
- <li>
- It invokes the <c>vector_scal</c> task. The invocation looks the same
- as a standard C function call. However, it is an asynchronous
- invocation, meaning that the actual call is performed in parallel with
- the caller's continuation.
- </li>
- <li>
- It waits for the termination of the <c>vector_scal</c>
- asynchronous call.
- </li>
- <li>
- Finally, StarPU is shut down.
- </li>
- </ul>
- The program can be compiled and linked with GCC and the <c>-fplugin</c>
- flag:
- \verbatim
- $ gcc `pkg-config starpu-1.1 --cflags` vector_scal.c \
- -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` \
- `pkg-config starpu-1.1 --libs`
- \endverbatim
- And voilà!
- \subsection Adding_an_OpenCL_Task_Implementation Adding an OpenCL Task Implementation
- Now, this is all fine and great, but you certainly want to take
- advantage of these newfangled GPUs that your lab just bought, don't you?
- So, let's add an OpenCL implementation of the <c>vector_scal</c> task.
- We assume that the OpenCL kernel is available in a file,
- <c>vector_scal_opencl_kernel.cl</c>, not shown here. The OpenCL task
- implementation is similar to that used with the standard C API
- (\ref Definition_of_the_OpenCL_Kernel). It is declared and defined
- in our C file like this:
- \code{.c}
- /* The OpenCL programs, loaded from 'main' (see below). */
- static struct starpu_opencl_program cl_programs;
- static void vector_scal_opencl (unsigned size, float vector[size],
- float factor)
- __attribute__ ((task_implementation ("opencl", vector_scal)));
- static void
- vector_scal_opencl (unsigned size, float vector[size], float factor)
- {
- int id, devid, err;
- cl_kernel kernel;
- cl_command_queue queue;
- cl_event event;
- /* VECTOR is a GPU memory pointer, not a main memory pointer. */
- cl_mem val = (cl_mem) vector;
- id = starpu_worker_get_id ();
- devid = starpu_worker_get_devid (id);
- /* Prepare to invoke the kernel. In the future, this will be largely automated. */
- err = starpu_opencl_load_kernel (&kernel, &queue, &cl_programs,
- "vector_mult_opencl", devid);
- if (err != CL_SUCCESS)
- STARPU_OPENCL_REPORT_ERROR (err);
- err = clSetKernelArg (kernel, 0, sizeof (size), &size);
- err |= clSetKernelArg (kernel, 1, sizeof (val), &val);
- err |= clSetKernelArg (kernel, 2, sizeof (factor), &factor);
- if (err)
- STARPU_OPENCL_REPORT_ERROR (err);
- size_t global = 1, local = 1;
- err = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &global,
- &local, 0, NULL, &event);
- if (err != CL_SUCCESS)
- STARPU_OPENCL_REPORT_ERROR (err);
- clFinish (queue);
- starpu_opencl_collect_stats (event);
- clReleaseEvent (event);
- /* Done with KERNEL. */
- starpu_opencl_release_kernel (kernel);
- }
- \endcode
- The OpenCL kernel itself must be loaded from <c>main</c>, sometime after
- the <c>initialize</c> pragma:
- \code{.c}
- starpu_opencl_load_opencl_from_file ("vector_scal_opencl_kernel.cl",
- &cl_programs, "");
- \endcode
- And that's it. The <c>vector_scal</c> task now has an additional
- implementation, for OpenCL, which StarPU's scheduler may choose to use
- at run-time. Unfortunately, the <c>vector_scal_opencl</c> above still
- has to go through the common OpenCL boilerplate; in the future,
- additional extensions will automate most of it.
- \subsection Adding_a_CUDA_Task_Implementation Adding a CUDA Task Implementation
- Adding a CUDA implementation of the task is very similar, except that
- the implementation itself is typically written in CUDA, and compiled
- with <c>nvcc</c>. Thus, the C file only needs to contain an external
- declaration for the task implementation:
- \code{.c}
- extern void vector_scal_cuda (unsigned size, float vector[size],
- float factor)
- __attribute__ ((task_implementation ("cuda", vector_scal)));
- \endcode
- The actual implementation of the CUDA task goes into a separate
- compilation unit, in a <c>.cu</c> file. It is very close to the
- implementation when using StarPU's standard C API (\ref Definition_of_the_CUDA_Kernel).
- \code{.c}
- /* CUDA implementation of the `vector_scal' task, to be compiled with `nvcc'. */
- #include <starpu.h>
- #include <stdlib.h>
- static __global__ void
- vector_mult_cuda (unsigned n, float *val, float factor)
- {
- unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
- if (i < n)
- val[i] *= factor;
- }
- /* Definition of the task implementation declared in the C file. */
- extern "C" void
- vector_scal_cuda (unsigned size, float vector[], float factor)
- {
- unsigned threads_per_block = 64;
- unsigned nblocks = (size + threads_per_block - 1) / threads_per_block;
- vector_mult_cuda <<< nblocks, threads_per_block, 0,
- starpu_cuda_get_local_stream () >>> (size, vector, factor);
- cudaStreamSynchronize (starpu_cuda_get_local_stream ());
- }
- \endcode
- The complete source code, in the <c>gcc-plugin/examples/vector_scal</c>
- directory of the StarPU distribution, also shows how an SSE-specialized
- CPU task implementation can be added.
- For more details on the C extensions provided by StarPU's GCC plug-in,
- \ref cExtensions.
- \section Vector_Scaling_Using_StarPU_API Vector Scaling Using StarPU's API
- This section shows how to achieve the same result as explained in the
- previous section using StarPU's standard C API.
- The full source code for
- this example is given in the section Full source code for the 'Scaling a
- Vector' example.
- \subsection Source_Code_of_Vector_Scaling Source Code of Vector Scaling
- Programmers can describe the data layout of their application so that StarPU is
- responsible for enforcing data coherency and availability across the machine.
- Instead of handling complex (and non-portable) mechanisms to perform data
- movements, programmers only declare which piece of data is accessed and/or
- modified by a task, and StarPU makes sure that when a computational kernel
- starts somewhere (e.g. on a GPU), its data are available locally.
- Before submitting those tasks, the programmer first needs to declare the
- different pieces of data to StarPU using the <c>starpu_*_data_register</c>
- functions. To ease the development of applications for StarPU, it is possible
- to describe multiple types of data layout. A type of data layout is called an
- <b>interface</b>. There are different predefined interfaces available in StarPU:
- here we will consider the <b>vector interface</b>.
- The following lines show how to declare an array of <c>NX</c> elements of type
- <c>float</c> using the vector interface:
- \code{.c}
- float vector[NX];
- starpu_data_handle_t vector_handle;
- starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
- sizeof(vector[0]));
- \endcode
- The first argument, called the <b>data handle</b>, is an opaque pointer which
- designates the array in StarPU. This is also the structure which is used to
- describe which data is used by a task. The second argument is the node number
- where the data originally resides. Here it is 0 since the <c>vector</c> array is in
- the main memory. Then comes the pointer <c>vector</c> where the data can be found in main memory,
- the number of elements in the vector and the size of each element.
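- Later, once all the tasks working on the vector have completed, the handle
- should be unregistered so that StarPU brings the main-memory copy up to date
- and releases its internal resources; a minimal sketch:
- \code{.c}
- /* Flush the data back to main memory and forget about the handle. */
- starpu_data_unregister(vector_handle);
- \endcode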
- The following shows how to construct a StarPU task that will manipulate the
- vector and a constant factor.
- \code{.c}
- float factor = 3.14;
- struct starpu_task *task = starpu_task_create();
- task->cl = &cl; /* Pointer to the codelet defined below */
- task->handles[0] = vector_handle; /* First parameter of the codelet */
- task->cl_arg = &factor;
- task->cl_arg_size = sizeof(factor);
- task->synchronous = 1;
- starpu_task_submit(task);
- \endcode
- Since the factor is a mere constant float value parameter,
- it does not need a preliminary registration, and
- can just be passed through the <c>cl_arg</c> pointer like in the previous
- example. The vector parameter is described by its handle, stored in the
- <c>handles</c> array of the task structure, while the corresponding entry of
- the <c>modes</c> array of the codelet specifies how the
- kernel will access the data (<c>STARPU_R</c> for read-only, <c>STARPU_W</c> for
- write-only and <c>STARPU_RW</c> for read and write access).
- The definition of the codelet can be written as follows:
- \code{.c}
- void scal_cpu_func(void *buffers[], void *cl_arg)
- {
- unsigned i;
- float *factor = cl_arg;
- /* length of the vector */
- unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
- /* CPU copy of the vector pointer */
- float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
- for (i = 0; i < n; i++)
- val[i] *= *factor;
- }
- struct starpu_codelet cl =
- {
- .cpu_funcs = { scal_cpu_func, NULL },
- .nbuffers = 1,
- .modes = { STARPU_RW }
- };
- \endcode
- The first argument is an array that gives
- a description of all the buffers passed in the <c>task->handles</c> array. The
- size of this array is given by the <c>nbuffers</c> field of the codelet
- structure. For the sake of genericity, this array contains pointers to the
- different interfaces describing each buffer. In the case of the <b>vector
- interface</b>, the location of the vector (resp. its length) is accessible through the
- <c>ptr</c> (resp. <c>nx</c>) field of this interface. Since the vector is accessed in a
- read-write fashion, any modification will automatically affect future accesses
- to this vector made by other tasks.
- The second argument of the <c>scal_cpu_func</c> function contains a pointer to the
- parameters of the codelet (given in <c>task->cl_arg</c>), so that we read the
- constant factor from this pointer.
- \subsection Execution_of_Vector_Scaling Execution of Vector Scaling
- \verbatim
- $ make vector_scal
- cc $(pkg-config --cflags starpu-1.1) $(pkg-config --libs starpu-1.1) vector_scal.c -o vector_scal
- $ ./vector_scal
- 0.000000 3.000000 6.000000 9.000000 12.000000
- \endverbatim
- \section Vector_Scaling_on_an_Hybrid_CPU_GPU_Machine Vector Scaling on a Hybrid CPU/GPU Machine
- Contrary to the previous examples, the task submitted in this example may not
- only be executed by the CPUs, but also by a CUDA device.
- \subsection Definition_of_the_CUDA_Kernel Definition of the CUDA Kernel
- The CUDA implementation can be written as follows. It needs to be compiled with
- a CUDA compiler such as nvcc, the NVIDIA CUDA compiler driver. It must be noted
- that the vector pointer returned by STARPU_VECTOR_GET_PTR is here a pointer in GPU
- memory, so that it can be passed as such to the <c>vector_mult_cuda</c> kernel
- call.
- \code{.c}
- #include <starpu.h>
- static __global__ void vector_mult_cuda(unsigned n, float *val,
- float factor)
- {
- unsigned i = blockIdx.x*blockDim.x + threadIdx.x;
- if (i < n)
- val[i] *= factor;
- }
- extern "C" void scal_cuda_func(void *buffers[], void *_args)
- {
- float *factor = (float *)_args;
- /* length of the vector */
- unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
- /* CUDA copy of the vector pointer */
- float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
- unsigned threads_per_block = 64;
- unsigned nblocks = (n + threads_per_block-1) / threads_per_block;
- vector_mult_cuda<<<nblocks, threads_per_block, 0, starpu_cuda_get_local_stream()>>>
-         (n, val, *factor);
- cudaStreamSynchronize(starpu_cuda_get_local_stream());
- }
- \endcode
- \subsection Definition_of_the_OpenCL_Kernel Definition of the OpenCL Kernel
- The OpenCL implementation can be written as follows. StarPU provides
- tools to compile an OpenCL kernel stored in a file.
- \code{.c}
- __kernel void vector_mult_opencl(int nx, __global float* val, float factor)
- {
- const int i = get_global_id(0);
- if (i < nx) {
- val[i] *= factor;
- }
- }
- \endcode
- Contrary to CUDA and CPU, <c>STARPU_VECTOR_GET_DEV_HANDLE</c> has to be used,
- which returns a <c>cl_mem</c> (which is not a device pointer, but an OpenCL
- handle), which can be passed as such to the OpenCL kernel. The difference is
- important when using partitioning, see the chapter Partitioning Data.
- \code{.c}
- #include <starpu.h>
- extern struct starpu_opencl_program programs;
- void scal_opencl_func(void *buffers[], void *_args)
- {
- float *factor = _args;
- int id, devid, err; /* OpenCL specific code */
- cl_kernel kernel; /* OpenCL specific code */
- cl_command_queue queue; /* OpenCL specific code */
- cl_event event; /* OpenCL specific code */
- /* length of the vector */
- unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
- /* OpenCL copy of the vector pointer */
- cl_mem val = (cl_mem) STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
- { /* OpenCL specific code */
- id = starpu_worker_get_id();
- devid = starpu_worker_get_devid(id);
- err = starpu_opencl_load_kernel(&kernel, &queue, &programs,
- "vector_mult_opencl", devid); /* Name of the codelet defined above */
- if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
- err = clSetKernelArg(kernel, 0, sizeof(n), &n);
- err |= clSetKernelArg(kernel, 1, sizeof(val), &val);
- err |= clSetKernelArg(kernel, 2, sizeof(*factor), factor);
- if (err) STARPU_OPENCL_REPORT_ERROR(err);
- }
- { /* OpenCL specific code */
- size_t global=n;
- size_t local=1;
- err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &event);
- if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
- }
- { /* OpenCL specific code */
- clFinish(queue);
- starpu_opencl_collect_stats(event);
- clReleaseEvent(event);
- starpu_opencl_release_kernel(kernel);
- }
- }
- \endcode
- \subsection Definition_of_the_Main_Code Definition of the Main Code
- The CPU implementation is the same as in the previous section.
- Here is the source of the main application. You can notice that the fields
- <c>cuda_funcs</c> and <c>opencl_funcs</c> of the codelet are set to
- define the pointers to the CUDA and OpenCL implementations of the
- task.
- \code{.c}
- #include <starpu.h>
- #include <stdio.h>   /* fprintf() */
- #include <stdlib.h>  /* malloc() */
- #include <assert.h>  /* assert() */
- #define NX 2048
- extern void scal_cuda_func(void *buffers[], void *_args);
- extern void scal_cpu_func(void *buffers[], void *_args);
- extern void scal_opencl_func(void *buffers[], void *_args);
- /* Definition of the codelet */
- static struct starpu_codelet cl =
- {
- .cuda_funcs = { scal_cuda_func, NULL },
- .cpu_funcs = { scal_cpu_func, NULL },
- .opencl_funcs = { scal_opencl_func, NULL },
- .nbuffers = 1,
- .modes = { STARPU_RW }
- };
- #ifdef STARPU_USE_OPENCL
- /* The compiled version of the OpenCL program */
- struct starpu_opencl_program programs;
- #endif
- int main(int argc, char **argv)
- {
- float *vector;
- int i, ret;
- float factor=3.0;
- struct starpu_task *task;
- starpu_data_handle_t vector_handle;
- starpu_init(NULL); /* Initialising StarPU */
- #ifdef STARPU_USE_OPENCL
- starpu_opencl_load_opencl_from_file(
- "examples/basic_examples/vector_scal_opencl_codelet.cl",
- &programs, NULL);
- #endif
- vector = malloc(NX*sizeof(vector[0]));
- assert(vector);
- for(i=0 ; i<NX ; i++) vector[i] = i;
- /* Registering data within StarPU */
- starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
- NX, sizeof(vector[0]));
- /* Definition of the task */
- task = starpu_task_create();
- task->cl = &cl;
- task->handles[0] = vector_handle;
- task->cl_arg = &factor;
- task->cl_arg_size = sizeof(factor);
- /* Submitting the task */
- ret = starpu_task_submit(task);
- if (ret == -ENODEV) {
- fprintf(stderr, "No worker may execute this task\n");
- return 1;
- }
- /* Waiting for its termination */
- starpu_task_wait_for_all();
- /* Update the vector in RAM */
- starpu_data_acquire(vector_handle, STARPU_R);
- /* Access the data */
- for(i=0 ; i<NX; i++) {
- fprintf(stderr, "%f ", vector[i]);
- }
- fprintf(stderr, "\n");
- /* Release the RAM view of the data before unregistering it and shutting down StarPU */
- starpu_data_release(vector_handle);
- starpu_data_unregister(vector_handle);
- starpu_shutdown();
- return 0;
- }
- \endcode
- \subsection Execution_of_Hybrid_Vector_Scaling Execution of Hybrid Vector Scaling
- The Makefile given at the beginning of the section must be extended to
- give the rules to compile the CUDA source code. Note that the source
- file of the OpenCL kernel does not need to be compiled now; it will
- be compiled at run-time when calling the function
- starpu_opencl_load_opencl_from_file().
- \verbatim
- CFLAGS += $(shell pkg-config --cflags starpu-1.1)
- LDFLAGS += $(shell pkg-config --libs starpu-1.1)
- CC = gcc
- vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
- %.o: %.cu
- nvcc $(CFLAGS) $< -c -o $@
- clean:
- rm -f vector_scal *.o
- \endverbatim
- \verbatim
- $ make
- \endverbatim
- and to execute it, with the default configuration:
- \verbatim
- $ ./vector_scal
- 0.000000 3.000000 6.000000 9.000000 12.000000
- \endverbatim
- or for example, by disabling CPU devices:
- \verbatim
- $ STARPU_NCPU=0 ./vector_scal
- 0.000000 3.000000 6.000000 9.000000 12.000000
- \endverbatim
- or by disabling CUDA devices (which may allow the use of OpenCL,
- see \ref Enabling_OpenCL):
- \verbatim
- $ STARPU_NCUDA=0 ./vector_scal
- 0.000000 3.000000 6.000000 9.000000 12.000000
- \endverbatim
- */