Threading Toolkit Programmer’s Reference Manual

 

V1.0.1

 

Copyright © 2009

 

QuickThread Programming, LLC

 

85 Cove Lane

 

Oshkosh, WI 54902

 

USA

 

www.quickthreadprogramming.com

 

 


 

 

QuickThread is a pending trademark of

 

James G. Dempsey

QuickThread Programming, LLC

85 Cove Lane

Oshkosh, WI 54902

USA

 

The information contained herein is the intellectual property of QuickThread Programming, LLC, all rights reserved.

 

Additional information can be obtained by email at info@quickthreadprogramming.com or at the website http://www.quickthreadprogramming.com.

 

 

 


QuickThread. 5

Parallel constructs. 6

Conceptual programming technique. 7

Tasks. 9

C++ Programming with QuickThread. 10

Program Initialization. 11

Default initialization. 11

Worker thread count 11

Worker and I/O thread counts. 11

Specialty initialization. 12

Alternate specialty initialization. 13

Task evocation. 14

parallel_task. 16

Static function (non-member function) 16

Member function. 16

Lambda function (C++0x) 16

Completion routines. 23

parallel_invoke. 25

Choosing between parallel_invoke or parallel_task. 28

parallel_distribute. 29

Static function (non-member function) 29

Lambda function (C++0x) 29

Loop Parallelization. 32

parallel_for 32

Static (non-member) function: 32

Member function: 32

Lambda Function (C++0x) 32

Special notes for Lambda functions. 33

Simple parallel_for 38

Simple parallel_for with iChunk. 39

Simple parallel_for with qtControl 40

parallel_for with placement 42

parallel_for_each. 43

Static function. 43

Member function. 43

Lambda Function (C++0x) 43

parallel_reduce. 46

Static Function. 46

Member Function. 46

Lambda Function (C++0x) 46

Reduction Object Templates. 47

parallel_list 50

Static Function. 50

Member Function. 50

Lambda Function (C++0x) 50

parallel_pipeline. 51

concurrent_proxy_vector 60

Memory Allocation. 61

qtInit 65

qtPlacement 69

qtControl 71

Miscelleaneous Library Functions. 81

ChargeAffinity. 81

Lock_FIFO.. 81

LockLock. 81

qt_pointerLock. 82

AtomicAdd. 82

get_qtControl 82

qtYield. 82

qt_get_num_threads. 83

qt_get_num_io_threads. 83

qt_get_thread_num.. 83

qt_get_thread_ID. 83

qt_get_thread_AffinityMask. 83

qt_index. 83

FORTRAN Programming. 85

Program Initialization. 85

QueueMain. 87

QuickThread Interfaces. 90

QuickThread Interfaces. 90

QuickThreadInit 90

QuickThreadQueueMain. 90

T_qtControl 90

QuickThreadWaitTillDone. 93

QuickThreadSuggestAffinity. 93

QuickThreadChargeAffinity. 94

QuickThread_Initialized. 94

QuickThread_nWorkerThreads. 94

QuickThreadQueueWork. 95

QuickThreadQueueIO(qtControl, aSub[,args]) 95

QuickThreadQueueOnDone. 96

QuickThreadQueueIOOnDone. 96

QuickThreadQueueDo. 96

QuickThreadQueueDoChunk. 96

QuickThreadQueueDoChunkTemporal 97

Examples. 97

SimpleArray. 97

Example 2: OpenMP outer level parallelization. 99

Example 3: OpenMP inner level parallelization. 99

Example of Pipeline. 101

Example 4: OpenMP outer and inner level parallelization. 103

Example 5 QuickThread outer and inner level parallelization. 104


 

QuickThread

 

QuickThread is a C++ runtime library and programming paradigm for writing multithreaded applications in 32-bit and 64-bit environments using C++, FORTRAN and mixed language programs.

 

QuickThread is affinity capable supporting thread affinity, data binding affinity and NUMA support.

 

QuickThread is a tasking system using thread pools and provides exceptional control over task scheduling with respect to cache levels, core placement, and thread availability.

 

The design goal of QuickThread is to produce a minimal overhead mechanism for distributing work in a multi-socket, multi-core, multi-threaded environment.

 

 

NOTICE

The FORTRAN Application Programmers Interface is not officially supported at this time. These interfaces are residual interfaces leftover from the earliest development work on QuickThread. Feel free to experiment at your own risk. If you have a need for FORTRAN as a supported product, please contact QuickThread Programming, LLC and express your needs. Future versions of QuickThread will address this issue.

 


Parallel constructs

 

parallel_task               Schedule a single task.

 

parallel_invoke                        Invoke multiple different tasks (C++0x Lambda functions only)

 

parallel_distribute     Schedule a task team to work on different portions of same task..

 

parallel_for                 Schedule a task team to run across an iteration space divided up evenly to team members (or chunked up to team members)..

 

parallel_for_each       Schedule a task team across an iteration space divided upon demand by each team member number.

 

parallel_reduce         Schedule a task team across an iteration space divided upon demand by each team member number while performing reduction operation.

 

parallel_list                    Schedule a task team to process a singly linked list of objects.

 

parallel_pipeline         Schedule a task team to process a sequence of steps (pipes) contained within a vector (pipeline).


Conceptual programming technique

 

The above figure depicts an idealized system with eight threads (T0-T7), running on two processors, each processor with four cores, three level cache, two memory systems. Two core pairs within each processor sharing one of two L2 cache within the processor, all cores within each processor sharing a processor common L3 cache. And each processor with direct access to a local RAM (M0) and one hop access to RAM local to the other processor (M1 obversely). The above diagram can be expanded to include additional processor packages and memory systems as well as additional memory hop levels (M2, M3).

 

In the idealized system, each thread has independent data distributed amongst the various cache and memory levels and where the programming goal is to keep as many of the thread’s data (and instruction) accesses as close to its L1 as possible. When the programmer has the means to control the execution of the application in a manner complementary to this idealized system, then the application will experience maximum performance.

 

In practice, the generally used threading tools do not provide the programmer with the means to control the program execution towards this idealized system. That is until now.

 

One of the techniques employed by most of the threading tools which provides a limited measure of this control, was the switch in programming practice from:

 

using a dedicated software thread per task

 

to

 

using a pool of threads (typically with one software thread per hardware thread)

 

Then using a task scheduler within the application that schedules tasks to available threads from the thread pool. This technique exchanged a costly operating system thread context switch with a comparatively low cost task context switch within the application.

 

Additionally, when using the thread pool tasking technique, the programmer can use thread affinity to pin the software thread to a specific hardware thread (or set of threads). Using thread pinning, and when the operating system interrupts an application thread, or context switches to another application, or system task, then upon resumption of application thread, there is the benefit of a higher probability of some portion of the previously cached application data still being present in its cache system.

 

The remaining control technique for the programmer to approach the idealized system is the means to choreograph not only the task scheduling but also the task placement, interaction with other tasks, and data placement control. QuickThread offers this level of control.

 

QuickThread offers the programmer the means to:

 

  • Allocate data objects from a particular NUMA node (e.g. with most available RAM or least estimated computation load).
  • Direct execution of task or task slices for data objects allocated with placement to be restricted to, or have preference to, run on threads within the NUMA node of that data object.
  • Hot-in-cache programming considerations to direct execution of task or task slices to be restricted to, or have preference to, run on threads sharing a specific cache level with the current thread (thread issuing the task en-queue).
  • Not-in-cache programming considerations to direct execution of task or task slices to be restricted to, or have preference to, run on threads sharing a specific cache level on the processor with the most idle hardware threads at that cache level.
  • Opportunistic-in-cache task scheduling whereby loops can be conditionally split-up into multiple task slices only when, and to the extent of, threads sharing a specific cache level with the current thread are available (else the loop is run as a single task or diminished number of tasks together with the current thread).
  • Include (by direct call of task as function call by current thread) or exclude current thread in task slice-up.
  • Slice-up and distribute a task to a primary thread slice, one each, per requested cache level.
  • Slice-up primary thread slice into secondary thread slices within the cache level of the primary thread slice.
  • Opportunistic, as threads become available scheduling to reduce unnecessary thread scheduling calls.

 

With QuickThread the programmer can exert extraordinary level of control by the inclusion of a single placement directive on the parallel_for and other parallel directives.

 

The conceptual programming technique for QuickThread is a messaging system whereby you to throw objects and arguments at functions (C++) or subroutines (FORTRAN). These throw requests are placed into a queue (one of several queues). The queued subroutines, when run, can throw (en-queue) additional subroutines and arguments or perform work or do both.

 

The general queuing technique is neither strictly LIFO nor FIFO. QuickThread will en-queue the work requests in a manner that defaults to be hot-in-cache friendly (the programmer optionally can use cache directed en-queuing of work requests as well as FIFO en-queuing).

 

When thread affinity is not used, the application programmer can select between a compute class queue and an I/O class queue. There is a third queue called the compute class overflow queue which may be used in rare circumstances (i.e. to not block en-queue requests by other threads while the primary compute class queue is being allocated additional free nodes performed by the first thread to cause the overflow).

 

Affinity, when enabled, will, at programmer’s direction, pin compute class threads to one or more execution cores. Then the programmer has the choice of using affinity directed task en-queuing of compute class tasks, or non-affinity directed task en-queuing of compute class tasks. I/O class threads are not affinity pinned.

 

Tasks

Tasks in QuickThread are standard functions (C++), and/or class member functions (C++), ) that have a void return, or subroutines in FORTRAN, and take from 0 to 9 arguments. There is no distinction between a task and function/subroutine as these functions or subroutines can be called directly as well as having task requests thrown at them. For example, the parallel_for template, when creating N tasks will, en-queue N-1 tasks and then directly call the en-queued function (thus saving the overhead of one task en-queuing operation).

 

From the function’s viewpoint there is no distinction between being called as a task and being called directly from within the application.

 

There are no special considerations when writing functions callable from the task scheduler other than for the standard multi-threaded programming requirement in making the function thread-safe.

 

At this time, C++ exception handling is not performed between the en-queued task and the en-queuing task. Future revisions may address this issue.

 

Tasks begin execution upon call (by QuickThread task manager) and terminate upon return. Tasks do not directly make request for work to do, rather they begin life upon receipt of requests for work to do (with optional list of arguments). A task is not a thread waiting in an idle loop, rather, a task is code waiting for a function call.

 

Using threading model, as opposed to tasking model, the application would start many threads with each initially entering a wait state (e.g. WaitForSingleEvent). Standard threading models tend to incur a higher degree of interaction with the operating system than does a task pool system such as QuickThread.

 


C++ Programming with QuickThread

 

The C++ programmer includes the QuickThread header file plus desired template headers.

 

#include <QuickThread.h>            // Required header

#include <parallel_task.h>          // Optional template

#include <parallel_invoke.h>        // Optional template

#include <parallel_distribute.h>    // Optional template

#include <parallel_for.h>           // Optional template

#include <parallel_list.h>          // Optional template

#include <parallel_reduce.h>        // Optional template

#include <parallel_pipeline.h>      // Optional template

 

using namespace qt;                 // Optional namespace

 

 

And then link in the appropriate QuickThread.lib file (x32 or x64 for target O/S)

 

For the C++ programmer, QuickTread uses the namespace “qt” although some of the FORTRAN entry points are visible as QUICKTHREADsomenamehere. The programmer has the choice of using the verbose fully qualified names or shorthand templates. If the programmer prefers, they are free to rename the templates or add there own templates.

 

NOTICE

 

These template files are typically a shell file that declares a macro then includes the current version of the template. Applications can include specific or multiple versions of templates. This permits you to retrograde versions when necessary. Or permits you to accommodate template naming conflicts when combining QuickThread with other threading toolkits such as TBB (Threading Building Blocks).

 

Consider parallel_task.h

 

// parallel_task.h

 

#pragma once

// pick the version

#include "parallel_task_v1.h"

// define as default

#define parallel_task parallel_task_v1

 

 


Program Initialization

There are several methods to initialize your application for using QuickThread.

 

Default initialization

 

// YourApp.cpp

#include <QuickThread.h>

using namespace qt;

 

int main(void)    // or with arguments

{

      // Worker Threads = # hardware threads + 1 I/O thread

      qtInit      qtInit(-1); // -1 ==  use defaults

      // ... your program here

      return YourReturnCode;

}

Worker thread count

 

// YourApp.cpp

#include <QuickThread.h>

using namespace qt;

 

int main(void)

{

      // Worker Threads = 3 threads + 1 (or more) I/O threads

      qtInit      qtInit(3);

      // ... your program here

      return YourReturnCode;

}

Worker and I/O thread counts

 

// YourApp.cpp

#include <QuickThread.h>

using namespace qt;

 

int main(void)

{

      // Worker Threads = 3 threads + 2 I/O threads

      qtInit      qtInit(3,2);

      // ... your program here

      return YourReturnCode;

}


Specialty initialization

 

// YourApp.cpp

#include <QuickThread.h>

using namespace qt;

 

int main(void)

{

      qtInit      qtInit;     // default ctor

      // default ctor initialized qtInit to default settings

      // but does not start the QuickThread thread pool

      // Optional configuration of the qtInit object here

qtInit.nWorkerThreads = 4;

qtInit.nIoThreads = 1;

// the following not available in ctor

qtInit.DefaultSpinWait = 512;

qtInit.nAffinityGrouping = 2;

      // ...

      // Start QuickThread

      if(qtInit.StartQT())

return YourThreadingErrorReport();

 

      // ... your program here

 

      // End QuickThread

      if(qtInit.EndQT())

return YourThreadingErrorReport();

      return YourReturnCode;

}


Alternate specialty initialization

 

// YourApp.cpp

#include <QuickThread.h>

 

// The Main-level task is the only task that returns a completion code

int MainTask(void* context)

{

      // do something with context if desired

DoWork(); // or body of your Main Task

      return YourReturnCode;

}

 

// The C++ program entry point

int _tmain(int argc, _TCHAR* argv[])

{

GetCommandLineArguments(argc, argv);

// Declart a qtInit object

      qt::qtInit qtInit;

      // Modify qtInit as desired

qtInit.nWorkerThreads = 4;

qtInit.nIoThreads = 1;

// the following not available in ctor

qtInit.DefaultSpinWait = 512;

qtInit.nAffinityGrouping = 2;

      // ...

 

      // Launch Main Task

// with optional argument context

// return error code

      return qtInit.QueueMain(MainTask, NULL);

}

 

The choice of technique for initialization is left for the programmer to decide.


Task evocation

Tasks on QuickThread are any function returning void

 

      void foo(a1T a1, a2T a2, ...)

 

This may be a static function, or class member function or Lambda functions (for compilers supporting C++0x features). Functions may have from 0 to 9 arguments.

 

Functions may be overloaded (same name, different argument lists). In cases where the compiler is unable to disambiguate the function signature a cast may be required on the function name argument.

 

Internally, tasks on QuickThread, are evoked by way of a qtControl structure (described later). This qtControl structure is either implicit or explicit at programmer preference.

 

Most of your task evocations will likely be by way of the supplied templates. These templates are provided in source form and thus have the provision for user extensibility. The major templates names are:

 

parallel_task

parallel_distribute

parallel_for

parallel_for_each

parallel_list

parallel_reduce

parallel_pipeline

parallel_wait

 

The templates will accomidate either static functions, member functions or Lambda functions (for compilers supporting C++0x features).

 

Most templates accept a variable number of arguments, some optional, some required, and follow this generalized organization:

 

parallel_{template_suffix} (

{enqueuing instructions and arguments ouside control of task}

{address of object when enquing member function as task}

{address of task}

{arguments to task}

);

 


Example:

 

void Foo(int n, char* msg);

...

Foo(1, “hello world”);

 

becomes

 

parallel_task(Foo, 1, “hello world”);

and

 

class Obj_T

{

      ...

      void Fee( int a, int b, int c);

};

...

Obj->Fee(1,2,3);

 

becomes

 

parallel_task(Obj, &Obj_T::Fee, 1, 2, 3);

 

Generally the left to right sequence of the tokens is the same and therefore conversion effort in part can be done with a search and replace.


parallel_task

 

Static function (non-member function)

 

parallel_task(

[qtPlacement,]

[&qtControl,]

&fn,

[, a1[, a2[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]]]

);

 

qtPlacement  an optional placement and/or selection directive (see qtPlacement)

&qtControl    an optional address of QuickThread Control structure (see qtControl)

&fn                  a required address of Task function name

a1 through a9  are optional function arguments.

 

Member function

 

parallel_task(

[qtPlacement,]

[&qtControl,]

&Obj,

&ClassName::fn,

[, a1[, a2[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]]]]

);

 

qtPlacement              an optional placement and/or selection directive (see qtPlacement)

&qtControl                an optional address of QuickThread Control structure (see qtControl)

&Obj                            a required address of object

&ClassName::fn        a required address of function name

a1 through a9        optional function arguments.

 

The distinction between member function and static function is the presense or absence of the address of the object of the class of the member function.

Lambda function (C++0x)

 

parallel_task(

[qtPlacement,]

[&qtControl,]

[&]()

{

  // ... function body

}

);

 

qtPlacement              an optional placement and/or selection directive (see qtPlacement)

&qtControl                an optional address of QuickThread Control structure (see qtControl)

[&]( ){ ... }          required Lambda function (may use = as well as & and &x,=y, …)

 

The en-queued tasks are standard functions and/or member functions that can be called directly or enqueue using parallel_task. Lambda functions (C++0x) have no function name and take their arguments by way of a hidden object containing reference (&) or value (=) of current scoped variables.

 

NOTICE

 

The following parallel_task examples serve as a tutorial relating to use of qtPlacement and qtControl calling variations.

 

These variations are applicable to other parallel_... templates.

 

The other template descriptions will not repete this information.

 

Although this section is verbose, the qtPlacement and qtControl are important features of all templates.

 

Read this section to learn the nuances of task enqueuing that will apply to the other parallel_... templates.


Examples:

 

// static (free) functions

SimpleStaticFunction(1, msg, 101);

parallel_task(&SimpleStaticFunction, 1, msg, 101);

 

Note, most C++ compilers will permit you to drop the & on the static function name. Scoped qualified member function names require the presense of the &.

 

parallel_task(SimpleStaticFunction, 1, msg, 101);

 

When parallel_task is issued witout the address of a qtControl structure the address of the Task’s current level default control structure is used. The enqueued task will run asynchronous from the code that issued the parallel_task.

 

Using explicit qtControl

 

parallel_task(&qtControl, SimpleStaticFunction, 1, msg, 101);

 

Using explicit cache level directive (to Task’s current level default control structure)

 

parallel_task(L2$, SimpleStaticFunction, 1, msg, 101);

 

Using explicit cache level directive to explicit qtControl

 

parallel_task(L2$, &qtControl, SimpleStaticFunction, 1, msg, 101);

 

// and static member functions (no required &Obj)

CBaseClass::StaticMemberFunction(2, msg, 102);

parallel_task(&CBaseClass::StaticMemberFunction, 2, msg, 102);

 

// Simple member function (requires address of object)

a.SimpleMemberFunction( 0, msg, 100);

parallel_task(&a, &CBaseClass::SimpleMemberFunction, 0, msg, 100);

 

p->SimpleMemberFunction( 0, msg, 100);

parallel_task(p, &CBaseClass::SimpleMemberFunction, 0, msg, 100);

 

// virtual member functions

b.SimpleVirtualFunction(4, msg, 104);

parallel_task(&b, &CBaseClass::SimpleVirtualFunction, 4, msg, 104);

 

parallel_task(&c, &CDerivedClass::TrickyVirtualFunction, 7, msg, 107);

 

// parallel_task Lambda function

Total = 0;

parallel_task(

      [&]()

      {

            for(int i=0; i<nVectors; ++i)

                  Total += intVector[i];

      }

);

// *** Caution Total not complete yet

// returns from parallel_task while task in progress

 

Special Note for Overloaded Functions

 

At times you may be required to cast a function when functions are overloaded and ambiguous.

 

void Foo(int i);

void Foo(long i);

void Foo(double d);

char  arg;  // When arg is char, which Foo do you mean?

parallel_task( (void(*)(long))Foo, arg); // use void Foo(long i);

 

NOTICE

parallel_task is non-blocking. Meaning execution generaly continues past the parallel_task statement while the en-queued task runs (or is scheduled to run).

 

When parallel_task is issued without the qtControl argument, a default thread context qtControl is used. Each thread maintains a task level default qtControl structure. You can use parallel_wait();  (with a null qtControl) to insert a barrier for all sub-tasks spawned by way of the thread’s task-level default thread context qtControl. There is an implicit  parallel_wait();  at end of task for any pending sub-tasks en-queued using a task level default qtControl structure.

 

void Task1()

{

   parallel_task(Task2);                           void Task2()

   parallel_task(Task3);                           {                                   void Task3()

   DoOtherWork();                                     parallel_task(Task4);   {                      

   parallel_wait(); // for Task2 and Task3      parallel_task(Task5);     parallel_task(Task6);

   UseResultsOfTask1andTask2();           }                                     parallel_task(Task7);

}                                                                                               }

 

Task1 has a default qtControl that is different from the other tasks default qtControl structures. In the above, Task2 and Task3 are enqueue via Task1 default qtControl. Task2 and Task3 (may) run in parallel with Task1, including DoOtherWork(), up until Task1’s parallel_wait();. And at which point, Task1 blocks until Task2 and Task3 complete.

 

Task2 enqueues two tasks (Task 4 and Task5) using Task2’s default qtControl structure.

 

Task 3 enqueues Task6 and Task7.

 

Task4 and Task5 may run concurrent with Task2, the remaining part of Task1 (upto parallel_wait in Task1), Task3, Task6, Task7 and any additional tasks enqueued by those tasks.

 

Upon exit of Task3 it will wait for completion of tasks enqueued by way of Task3’s default qtControl structure. In this case, Task3 waits for completion of Task6 and Task7.

 

Upon exit of Task2 it will wait for completion of tasks enqueued by way of Task2’s default qtControl structure. In this case, Task2 waits for completion of Task4 and Task5.

 

The parallel_wait() in Task1 was inserted to explicitly wait for results from Task2 and Task3 then can make use of results generated by Task1 and Task2.

 

 

The parallel_task can accept an optional qtControl argument to be used in place of the task default qtControl structure.

 

The optonal qtControl argument provides additional control over the task enqueuing. A primary function is to provide object and/or task level placement and thread synchronization. Your task can use multiple qtControl objects for coordinating multiple task lists. Tasks enqueued via explicit qtControl can live longer than the parent task that performs the enqueue.

 

qtControl   yourControl;

...

void SomeTask()

{

      ...

      parallel_task(&yourControl, Foo);

      ...

}

 

In the above, the exit of SomeTask() does not implicitly wait for task Foo to complete.

 

To wait for Foo completion (anywhere in the program) use:

 

yourControl.WaitTillDone();

 

The qtPlacement argument specifies restrictions on which spawned threads are to be run. When used in conjunction with the qtControl argument this provides the programmer considerable control with respect to performance considerations. Examples:

 

a)       Queue the task to the thread that co-resides with the issueing thread’s L2 cache.

b)       Locate an L3 cache which has the most available threads and condition the qtControl structure to enqueue/deque to/from those groups of threads.

c)       Condition the qtControl structure to distribute current en-queued task and subsequent en-queued tasks to one thread per each L2 cache. Thus providing for each en-queued task to enqueue a sub-task to threads sharing its L2 cache using (a) above


 

Using thread private task-level default qtControl

 

      // primary tasks first, then secondary tasks

parallel_task( Task_A );

parallel_task( Task_B );

parallel_task( Task_C );

parallel_wait();

parallel_task( Task_A2 );

parallel_task( Task_B2 );

parallel_task( Task_C2 );

parallel_wait();

xxx

 

|<->|----Task_A------|         |(prior tasks)|<->|----Task_A2-----|xxx|

|<->|----Task_B-----------|                  |<->|----Task_B2---|

|<->|----Task_C----------------|             |<->|---Task_C2--|

 

|<------------------------ latency time to run xxx -------------->|

|<------------------------ total elapse time ------------------------>|

 

Notes, “|<->|“ depicts possible skew in start of task.”|(prior tasks)|  depicts potential completion time for prior task enqueued using default qtControl, and “|xxx|” depicts code that follows second parallel_wait.

 

Using specific qtControl object.

 

{     // primary tasks first, then secondary tasks

      qtControl   qtControl;

parallel_task(&qtControl, Task_A );

parallel_task(&qtControl, Task_B );

parallel_task(&qtControl, Task_C );

parallel_wait(&qtControl); // or qtControl.WaitTillDone();

parallel_task(&qtControl, Task_A2 );

parallel_task(&qtControl, Task_B2 );

parallel_task(&qtControl, Task_C2 );

}

xxx

 

|<->|--------Task_A------|         |<->|----Task_A2------|xxx|

|<->|--------Task_B-----------|    |<->|----Task_B2---|

|<->|--------Task_C----------------|<->|---Task_C2--|

 

|<------------------ latency time to run xxx ----------->|

|<------------------ total elapse time --------------------->|

|<------------------ prior total elapse time ------------------------>|

 

The first example has no consideration for additional pending tasks en-queued by the current task and where there is no requirement to suspend execution.

 

In the second example additional tasks en-queued by the current task may be pending. The programmer uses the local qtControl to en-queue and synchronize a local list of tasks. When the qtControl leaves scope, the dtor performs an implicit wait. The advantage of the second method is  Task_A2, Task_B2, Task_C2 do not have to wait for pending task level sub-tasks en-queued prior to the entry of the scope of the local qtControl structure.

 

Through use of multiple qtControl objects you can handle multiple task queues concurrently and improve your synchronization appropriately.

 

{     // primary tasks first, partial overlap with secondary tasks

      qtControl   qtControlA, qtControlB, qtControlC;

parallel_task(&qtControlA, Task_A );

parallel_task(&qtControlB, Task_B );

parallel_task(&qtControlC, Task_C );

parallel_wait(&qtControlA); // or qtControlA.WaitTillDone();

parallel_task(&qtControl, Task_A2 );

parallel_wait(&qtControlB); // or qtControlB.WaitTillDone();

parallel_task(&qtControl, Task_B2 );

parallel_wait(&qtControlC); // or qtControlC.WaitTillDone();

parallel_task(&qtControl, Task_C2 );

xxx

}

yyy

 

|<->|--------Task_A------|<->|----Task_A2------|

|<->|--------Task_B-----------|<->|----Task_B2---|

|<->|--------Task_C----------------|xxx|<->|----Task_C2--|yyy|

 

|<------------------ total elapse time --------------------->|

|<--- latency time to run xxx ---->|

|<------------ prior latency time to run xxx ----------->|

|<------------ prior total elapse time --------------------->|

|<------------ prior prior total elapse time ------------------------>|

 

Using completion nodes may eliminate skew

 

{

      // Using completion node format

      // primary tasks first, complete overlap with secondary tasks

      // inside of scope

      qtControl   qtControlA, qtControlB, qtControlC;

parallel_task(&qtControlA, Task_A);

parallel_task(OnDone$, &qtControlA, Task_A2);

parallel_task(&qtControlB, Task_B );

parallel_task(OnDone$, &qtControlB, Task_A2);

parallel_task(&qtControlC, Task_C );

parallel_task(OnDone$, &qtControlA, Task_A2);

xxx

}

yyy

 

|xxx|

|<->|--------Task_A------|----Task_A2------|

|<->|--------Task_B-----------|----Task_B2---|

|<->|--------Task_C----------------|---Task_C2--|yyy|

 


// completion nodes with controls outside of scope

qtControl   qtControlA, qtControlB, qtControlC;

{     // primary tasks first, complete overlap with secondary tasks

parallel_task(&qtControlA, Task_A);

parallel_task(OnDone$, &qtControlA, Task_A2);

parallel_task(&qtControlB, Task_B );

parallel_task(OnDone$, &qtControlB, Task_A2);

parallel_task(&qtControlC, Task_C );

parallel_task(OnDone$, &qtControlA, Task_A2);

xxx

}

yyy

 

|xxx|yyy|

|<->|--------Task_A------|----Task_A2------|

|<->|--------Task_B-----------|----Task_B2---|

|<->|--------Task_C----------------|---Task_C2--|

 

In the last example, when the qtControl objects are instantiate outside the scope of the task (as static object or passed in as an argument to the task). Then the en-queued task can return with pending sub-tasks.

 

By varying your use of the qtControl you can significantly alter latencies as well as core utilization. Control of latencies means control of “Hot in cache”.

 

Completion routines

 

When parallel_task is issued with qtPlacement attribute containing OnDone$ the en-queued task is placed into a FIFO list associated with the qtControl object. When all pending tasks complete that are/were en-queued via the specified control object, then the tasks en-queued using the OnDone$ attribute are processed in FIFO order. The OnDone$ qtPlacement attribute can be used to serialize tasks whenever that is in your design requirement.

 

Proper use of the qtControl(s) and the parallel_task with placement attribute of OnDone$ can improve concurrency. N.B. The programmer is free to use qtPlacement or other techniques to schedule the completion task to a thread other than the one it waits on.

 

As you can see, you have considerable control over the sequencing of the tasks. And consequently must take care in your programming to attain the desired results.

 

 

Example code outlines for qtControl existing outside the scope of the task:

 

void foo(qtControl* qtControl)

{

      //... other code here

      parallel_task(qtControl, Task_A);

      parallel_task(qtControl, Task_B);

      parallel_task(qtControl, Task_C);

      //... other code here

      // exit Task foo with

      // Task_A, Task_B, Task_C pending, running, or complete

}


void SomeFunction(someArgs)

{

      qtControl qtControl;    // SomeFunction scoped control

      //... other code here

      // use local control as argument to foo

      // Task_A, Task_B, Task_C enqueued via above qtControl

      // use default task level control for control of Task foo

      parallel_task(foo, &qtControl);

      // return while foo pending/running/complete

      // Task_A, Task_B, Task_C pending/running/complete

      // ... run other code here

      // wait for all pending default task level control Tasks

      // including Task foo to complete

      // but do not wait for pending tasks on qtControl

      // i.e. Task_A, Task_B, Task_C may be pending/running/complete

      parallel_wait();

      // ... other code here

      // wait for Task_A, Task_B, Task_C to complete

      // (as well as additional tasks enqueued via this qtControl)

      parallel_wait(&qtControl); // or qtControl.WaitTillDone();

      // ... other code here

} // void SomeFunction(someArgs)

 

void SomeOtherFunction(someArgs)

{

      qtControl qtControl;

      //... other code here

      // use local control as argument to foo

      // Task_A, Task_B, Task_C enqueued via above qtControl

      // use local control for control of Task foo

      parallel_task(&qtControl, foo, &qtControl);

      // return while foo pending/running/complete

      // Task_A, Task_B, Task_C pending/running/complete

      // ... other code here

      // wait for foo and Task_A, Task_B, Task_C to complete

      // (as well as additional tasks enqueued via this qtControl)

      parallel_wait(&qtControl);

      // ... other code here

} // void SomeOtherFunction(someArgs)

 

void OtherSomeOtherFunction(someArgs)

{

      // code elsewhere

      qtControl qtControl;

      //... other code here

      // use local control as argument to foo

      // Task_A, Task_B, Task_C enqueued via above qtControl

      // use local control for control of Task foo

      parallel_task(&qtControl, &foo, &qtControl);

      // return while foo pending/running/complete

      // Task_A, Task_B, Task_C pending/running/complete

      // ... other code here

      // implicit parallel_wait(&qtControl);

      // on dtor of local qtControl

      // waits for foo and Task_A, Task_B, Task_C to complete

      // (as well as additional tasks enqueued via this qtControl)

} // void OtherSomeOtherFunction(someArgs)

parallel_invoke

 

The parallel_invoke template is used to invoke two or more Lambda function tasks (C++0x) and/or void function tasks.

 

The principal differences between parallel_task and parallel_invoke are: parallel_invoke is less overhead, and parallel_invoke always blocks. In places of your code where you enqueue multiple tasks then immediately wait then consider using parallel_invoke. e.g. n-way partitioning.

 

parallel_invoke(

      [qtPlacement,]

[*](){ ...Lambda0... }, // or (void)(fn*)(void)

[*](){ ...Lambda1... }  // or (void)(fn*)(void)

...); // additional Lambda functions here

 

Where

qtPlacement              Optional qtPlacement argument

[*]                                 Standard Lambda context ([&], [=], [], [&arg1,=arg2,…)

()  no calling parameters

{ ...Lambda0... } Body of Lambda function 0

{ ...Lambda1... } Body of Lambda function 1

...                              Additional Lambda functions follow

 

A rework of one of the parallel_task samples yields

 

 

{    

      double a, b, c;

parallel_invoke(

[&](){ Task_A(a); },

[&](){ Task_B(b); },

[&](){ Task_C(c); });

      // implicit barrier

parallel_invoke(

[&](){ Task_A2(a); },

[&](){ Task_B2(b); },

[&](){ Task_C2(c); });

      // implicit barrier

}

xxx

 

|<->|--------Task_A------|         |<->|----Task_A2------|xxx|

|<->|--------Task_B-----------|    |<->|----Task_B2---|

|<->|--------Task_C----------------|<->|---Task_C2--|

 

|<------------------ latency time to run xxx ----------->|

|<------------------ total elapse time --------------------->|

|<------------------ prior total elapse time ------------------------>|


Real Example:

 

// call with corners of rectangle for interacton

void rect_interact(int i0, int i1, int j0, int j1)

{

    // if there's room,

    // further subdivide the rectangle into four smaller ones

    int di = i1 - i0; int dj = j1 - j0;

    if (di > grainsize && dj > grainsize)

    {

      int im = i0 + di/2;

      int jm = j0 + dj/2;

 

      // recursively call ourself to...

// invoke two tasks for two non-conflicting rectangles

      parallel_invoke(

[&]() {rect_interact(i0, im, j0, jm);},

[&]() {rect_interact(im, i1, jm, j1);});

      // return when done

 

      // Then, recursively call ourself to...

      // invoke two tasks for other two non-conflicting rectangles

      parallel_invoke(

[&]() {rect_interact(i0, im, jm, j1);},

[&]() {rect_interact(im, i1, j0, jm);});

      // return when done

    }

    else

    {

      // otherwise all we have left is a strip

// that can be handled locally

      for (int i = i0; i < i1; ++i)

          for (int j = j0; j < j1; ++j)

                  addAcc(i, j);

    }

}

 


Second example:

 

void

body_interactQT(int i, int j, int level)

{

  // split the interaction triangle into upper and lower triangles

  // (which can be executed in parallel) and the adjacent rectangle

  // (which will be further split)

  int d = j - i;

  if (d > 1)

  {

    int k = d/2 + i;

    if(level == 0)

    {

      // using qtPlacement of OneEach_L2$ has the effect of

      // scheduling current thread

      // plus one of a group of L2 caches that is not in my L2

      // first choice will be one of the other L2 cache threads

      // that is waiting. Second choice is closest other L2

      // Replace OneEach_L2$ with OneEach_L3$ or OneEach_M0$

      parallel_invoke(

          OneEach_L2$,

          [&](){body_interactQT(i,k,level+1);},

          [&](){body_interactQT(k,j,level+1);});

    }

    else

    {

      // for all other levels (1, 2, ...)

      // schedule to closest cache level

      parallel_invoke(

          [&](){body_interactQT(i,k,level+1);},

          [&](){body_interactQT(k,j,level+1);});

    }

    rect_interactQT(i, k, k, j);

  }

  // if d is 1 or 0 then we can skip it.

}

 

Note, if your system has multiple sockets and multiple L3 caches per socket then expand the code above such that on level == 0 you use OneEach_M0$, and when level == 1 use OneEach_L3$, and all other levels make no qualifications as to level.

 

Or in place of a series of fixed dispatches your initialization code could create a table of the levels of cache that contain different sets of threads. (following page)

 

 


        

 

    // in static data

    qtPlacement  YourTable[4]; // NULL or some qtPlacement value

    // On a 2-socket system

    // with each socket having Siamese twin dies

    // each die with L3 cache

    // two cores sharing L2

    // with HT

    // try:

    //    YourTable[0] = OneEach_M0$ (two-way split – sockets)

    //    YourTable[1] = OneEach_L3$ (two-way split – each L3)

    //    YourTable[2] = OneEach_L2$ (two-way split – each L2)

    //    YourTable[3] = OneEach_L1$ (two-way split – each L1)

    ...

 

    if(level < 4 && YourTable[level])

    {

      // OneEach_L1$, OneEach_L2$, OneEach_L3$, or OneEach_M0$

      parallel_invoke(

          YourTable[level], // qtPlacement

          [&](){body_interactQT(i,k,level+1);},

          [&](){body_interactQT(k,j,level+1);});

    }

    else

    {

      // for all other levels (or no splits)

      // schedule to closest cache level

      parallel_invoke(

          [&](){body_interactQT(i,k,level+1);},

          [&](){body_interactQT(k,j,level+1);});

    }

Choosing between parallel_invoke or parallel_task

The choice between selecting to use parallel_invoke or parallel_task is non-trivial as each has a different set of merrits.

 

parallel_invoke

 

Has a very low overhead in enqueuing a task list (of 2 or more tasks).

Has a very low overhead in dequeuing tasks from task list

Has implicit barrier (blocking)

Has limited control over use of optional qtPlacement

(you can only suggest which thread affinities to wakeup)

 

parallel_task

 

Has higher overhead in enqueuing one task

Has higher overhead in dequeuing task

Has no barrier (throw and go)

Has full control over use of optional qtPlacement

(suggest/require affinities, FIFO, quazi-LIFO, completion, etc…)

 

The principal selection chriteria are:

 

Do you need to omit the barrier

Do you need full contol over use of optional qtPlacement

 

parallel_distribute

 

The parallel_distribute template is used to distribute tasks by way of a selection criterion.

 

Static function (non-member function)

 

parallel_distribute(

qtPlacement,

[&qtControl,]

&fn

[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]

);

 

qtPlacement              required placement directive

&qtControl                address of optional QuickThread Control structure

&fn               address of required function or member function name

a3 through a9              optional function arguments.

 

Note, arguments a1 and a2 are missing as they are implicit. Tasks (argument fn) are of the following format:

 

void foo(intptr_t teamMemberNumber, intptr_t membersInTeam[, args]);

 

Lambda function (C++0x)

 

parallel_distribute(

qtPlacement,

[&qtControl,]

      [&](intptr_t teamMemberNumber, intptr_t membersInTeam)

      {

            // ...

      }

);

 

qtPlacement              required placement directive

&qtControl                address of optional QuickThread Control structure

&fn               address of required function or member function name

[&](...){...}          Lambda function taking two args

 

Note, for Lambda function, the same Lambda function is called with differing arguments for each thread.


For each parallel_distribute task:

 

The teamMemberNumber is 0-based and in the range of 0:membersInTeam-1. When the thread issuing the parallel_distribute is a member of the team its teamMemberNumber is 0.

 

The teamMember number is NOT a thread number.

 

The parallel_distribute will use the required qtPlacement as a selection criteria for the members of a thread pool. The size of the thread pool could range from 0 to all threads of class. When qtControl is missing from the argument list, one will be provided by the template and the effect of which is the parallel_distribute blocks until all threads complete. When qtControl is supplied, the parallel_distribute does not block. Consider:

 

parallel_distribute( OneEach_L1$, foo);

 

On a system with HyperThreading, the above would select a team of threads consisting of one of the HT threads from each pair (or sub-group) of threads that share an L1 cache with all L1 caches scheduled. On a 4 core HT system with 8 hardware threads (2 threads per core), a team of 4 threads will be selected, one thread from each core.

 

Assume your needs are to divide a process up across cores, and then within each core:

 

// one task per core

parallel_distribute( OneEach_L1$, OnePerCore);

. . .

 

void OnePerCore(size_t teamMemberNumber, size_t membersInTeam)

{

      for(  size_t row = teamMemberNumber;

row < numberOfRows;

row += membersInTeam)

      {

// only with my HT companion thread(s)

      parallel_for( L1$, DoColumns, 0, numberOfColumns, row);

}

}

 

The above using Lambda function

 

// one task per core

parallel_distribute(

OneEach_L1$,

[&](size_t teamMemberNumber, size_t membersInTeam)

{

      for(  size_t row = teamMemberNumber;

row < numberOfRows;

row += membersInTeam)

            {

// only with my HT companion thread(s)

            parallel_for(L1$,DoColumns,0,numberOfColumns, row);

}

}

);

     

Notes: OneEach_L1$ was used for the first (outer) selection criteria, and L1$ was used for the second (inner) criteria. L1$ means only threads sharing L1 of thread issuing statement. You can distribute across each/any cache level (L1, L2, L3) as well as NUMA node distances (M0, M1, M2, M3). Additionally you can distribute depending on availability of threads (Waiting_L1$)

 

parallel_for( Waiting_L1$, DoColumns, 0, numberOfColumns, row);

 

In the above scenario (OnePerCore) on 4 core HT system the above statement will, depending on availability of other thread sharing L1, will produce either one en-queue for a task to perform half the loop plus a direct call to perform the other half of the loop, or perform a direct call to perform the complete loop.

 


Loop Parallelization

parallel_for

 

 

Static (non-member) function:

 

void parallel_for(

[qtPlacement,]

[&qtControl,]

[iChunk,]

&fn,

iBegin,

iEnd

[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]

);

 

Member function:

 

void parallel_for(

[qtPlacement,]

[&qtControl,]

[iChunk,]

&Obj, // of class or class derived from containing fn

&ClassName::fn,

iBegin,

iEnd

[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]

);

 

Lambda Function (C++0x)

 

void parallel_for(

[qtPlacement,]

[&qtControl,]

[iChunk,]

iBegin,

iEnd,

[&](iBeginT iBegin, iEndT iEnd)

{

      // preamble if required

      for(iBeginT i=iBegin; i<iEnd; ++i)

      {

                  // body of your loop

            }

            // postamble if required

      } // closure of Lambda function

);

 

qtPlacement              optional placement directive (used with thread affinity)

&qtControl                optional QuickThread Control structure

iChunk                        optional chunking value

&Obj              Address of class object when used for member function

&fn               address of required function or member function name

iBegin,iEnd             are required iteration space fields (half-open interval)

a3 through a9               optional function arguments.

[&]                                or [&var1, =var2, etc …]

iBeginT, iEndT        typedef of iterator such as int, long

(generally iBeginT same as iEndT)

 

 

iBegin and iEnd define the half-open interval, expressed in documentation as [iBegin,iEnd). Where iBegin is the first of the interval and iEnd is one after the last of the interval. The typical programming practice for C++ is:

 

      for(int i = iBegin; i < iEnd; ++i)

     

 

fn declares a slice function with two required numeric arguments (integer generally intptr_t), and up to 7 optional arguments (a3:a9) which can be scalars, objects, references or pointers. The specified function is run as one or more tasks (with potentially the first or last task run by way of an inline function call).

 

When issued without a qtControl object, the template uses a template-level default qtControl within the parallel_for function. This type of parallel_for is blocking (all threads complete before return).

 

When issued with a qtControl object, the parallel_for is non-blocking. Meaning it may return from the parallel_for while execution continues on the parallel_for. Use parallel_wait([qtControl]); or qtControl.WaitTillDone(); for synchronization.

 

Special notes for Lambda functions

 

Note 1: Lambda functions that modify objects (values) in the scope of the parallel_for must do so in a thread safe manner:

 

            std::vector<int> intVector;

            // ...

            long total = 0;

            parallel_for(

intVector.begin(), intVector.end(),

                  [&total,&intVector](int iBegin,int iEnd)

                  {

                        long _total = 0;  // local subtotal

                        for(int i=iBegin; i<iEnd; ++i)

                              _total += intVector[i];

                        AtomicAdd(total, _total); // shared total

                  }

            );

Note 2: Lambda functions create a hidden temporary object on the stack of the caller.

 

When you program your parallel_for without a qtControl the parallel_for is blocking therefore the temporary stack object is preserved for the duration of the parallel_for.

 

When you program your parallel_for with a qtControl the parallel_for is non-blocking therefore the preservation of the temporary stack object for the duration of the parallel_for is the responsibility of the programmer. parallel_for with qtControl has to be used with care such that WaitTillDone (either implicitly or explicitly) is called prior to exiting the scope of the parallel_for.

 

std::vector<int> intVectorA;

      std::vector<int> intVectorB;

      qtControl   intVector_qtControl; // (outside of scope of foo)

      long intVector_sum()

      {

            long total = 0;

            // start non-blocking parallel_for

            parallel_for(

                  &intVector_qtControl,

intVectorA.begin(), intVectorA.end(),

                  [&](int iBegin,int iEnd)

                  {

                        long _total = 0;  // local subtotal

                        for(int i=iBegin; i<iEnd; ++i)

                              _total += intVectorA[i];

                        AtomicAdd(total, _total); // shared total

                  }

            );

            // start non-blocking parallel_for

            parallel_for(

                  &intVector_qtControl,

intVectorB.begin(), intVectorB.end(),

                  [&](int iBegin,int iEnd)

                  {

                        long _total = 0;  // local subtotal

                        for(int i=iBegin; i<iEnd; ++i)

                              _total += intVectorB[i];

                        AtomicAdd(total, _total); // shared total

                  }

            );

            // *** CAUTION ***

            // *** Wait for both loops to complete

            // *** protect Lambda functions temporary stack objects

            // *** and wait for total to hold result

            intVector_qtControl.WaitTillDone();

            return total;

      }

           

The placement of intVector_qtControl is outside the scope of the function intVector_sum. There is nothing inherently wrong with doing this, in fact there is a performance advantage in doing so as this eliminates the ctor/dtor of the qtControl. However, this also eliminates the implicit WaitTillDone() when the function exits. It is the programmer’s responsibility to use an explicit WaitTillDone() function call under this circumstance. Failure to do so results in four problems:

 

1)       The value total will be returned prior to it being fully accumulated

2)       The location on the stack where total was stored will continue to be used by the parallel_for tasks and thus get modified after the return from the function

3)       The temporary stack object for the first parallel_for Lambda will get popped off the stack while it is in use by the first parallel_for task list.

4)       The temporary stack object for the second parallel_for Lambda will get popped off the stack while it is in use by the second parallel_for task list.


A more efficient method would be to use the parallel_invoke to launch two tasks (and wait) each task being each of the parallel_for statements.

 

 

std::vector<int> intVectorA;

std::vector<int> intVectorB;

// ...

long intVector_sum()

{

      long total = 0;

      // invoke two tasks

      parallel_invoke(

            // task 1

            [&]()

            {

                  // start non-blocking parallel_for

                  parallel_for(

intVectorA.begin(), intVectorA.end(),

                        [&](int iBegin,int iEnd)

                        {

                              long _total = 0;  // local subtotal

                              for(int i=iBegin; i<iEnd; ++i)

                                    _total += intVectorA[i];

                              AtomicAdd(total, _total); // shared total

                        }

                  );

            },

            // task 2

            [&]()

            {

                  // start non-blocking parallel_for

                  parallel_for(

intVectorB.begin(), intVectorB.end(),

                        [&](int iBegin,int iEnd)

                        {

                              long _total = 0;  // local subtotal

                              for(int i=iBegin; i<iEnd; ++i)

                                    _total += intVectorB[i];

                              AtomicAdd(total, _total); // shared total

                        }

                  );

            });   // end parallel_invoke

      // implicit barrier

      return total;

}


 

 

 

Sample tasks for use with parallel_for are of the format:

 

void ArraySum1D(        // perform sum of partial array

int iBegin,       // beginning index

int iEnd,         // ending index (+1)

double* inA,      // input array A

double* inB,      // input array B

double* outC)     // output Array C

{

      for(int i= iBegin; i < iEnd; ++i)

            outC[i] = inA[i] + inB[i];

}

 

// . . .

      parallel_for(

            // fn, iBegin, iEnd,      inA,    inB,    outC

ArraySum1D, 0, ArraySize, ArrayA, ArrayB, ArrayC);

      // returns when done

 

void ArraySum2D(

int nRows,

int nCols,

double* inA,

double* inB,

double* outD)

{

      qtControl   qtControl;

      for(int row=0; row<nRows; ++row)

      {

            int iBegin = row * nCols;

            int iEnd = iBegin + nCols;

            parallel_for(

                  &qtControl,

                  ArraySum1D, // use 1D function as task

                  iBegin,

                  iEnd,

                  &ArrayA[iBegin],

                  &ArrayB[iBegin],

                  &ArrayC[iBegin]);

            // *** returns while row pending (or may be complete)

      }

      // qtControl dtor implicitly waits untill all rows are done

}

 

The above is not to suggest that the correct place for parallelization of a 2D summation is from the inner loop. The purpose of the above illustration is to show that the inner parallel_for does not wait for completion when using explicit qtControl. Non-blocking parallel_for is a feature of QuickThread for you to exploit.


Simple parallel_for

 

When qtControl, qtPlacement and iChunk are omitted, the default behavior is to divide up the iteration space up into as many pieces as you have compute class threads. Then en-queue number of compute class threads-1 task requests and directly calling the slice function for the last slice.

 

Example:

 

// slice function

void DoSum(intptr_t iFrom, intptr_t iEnd, double* A, double* B, double* C)
{

for( intptr_t i = iFrom; i < iTo; ++i)
      A[i] = B[i] + C[i];

}

 

. . .
      double* A = new double[nSize];

      double* B = new double[nSize];

      double* C = new double[nSize];

. . .

parallel_for(&DoSum, 0, nSize, A, B, C);

. . .

 

When, for example, nSize = 1000, and number of threads = 4, threads will run with arguments

 

  0,  250, A, B, C            (thread a)

250,  500, A, B, C            (thread b)

500,  750, A, B, C            (thread c)

750, 1000, A, B, C            (thread d)  (a != b != c != d)

 

The above assumes all threads were available at the time of the parallel_for.

 

If thread c were unavailable due to long computation task but threads a, b and d were available, and assume thread b finishes its first slice first

 

  0,  250, A, B, C            (thread a)

250,  500, A, B, C            (thread b)

500,  750, A, B, C            (thread d)

750, 1000, A, B, C            (thread b)

 


Simple parallel_for with iChunk

 

 

// same slice function

void DoSum(intptr_t iFrom, intptr_t iEnd, double* A, double* B, double* C)
{

for( intptr_t i = iFrom; i < iTo; ++i)
      A[i] = B[i] + C[i];

}

 

. . .
      double* A = new double[nSize];

      double* B = new double[nSize];

      double* C = new double[nSize];

. . .

intptr_t iChunk = 100;

parallel_for(iChunk, &DoSum, 0, nSize, A, B, C);

 

When, for example, nSize = 1000, iChunk=100, and number of threads = 4, threads will run with arguments

 

  0,  100, A, B, C            (thread a)

100,  200, A, B, C            (thread b)

200,  300, A, B, C            (thread c)

300,  400, A, B, C            (thread d)  (a != b != c != d)

400,  500, A, B, C            (thread first of above to finish)

500,  600, A, B, C            (thread second of above to finish)

600,  700, A, B, C            (thread third of above to finish)

700,  800, A, B, C            (thread forth of above to finish)

800,  900, A, B, C            (thread next available)

900, 1000, A, B, C            (thread next available)

 

Including the iChunk parameter the parallel_for divides the range into iChunk sized pieces (last chunk may be less than specified size). If nSize / iChunk is less than the number of available threads then the lesser number of threads are scheduled. When iChunk <= nSize then the slice function is called directly (bypassing the scheduler).

 

The chunking of a parallel_for is advantageous when the amount of processing is not uniform across the iteration space or when not all of the threads scheduled for the work distribution are immediately available for performing the work, or may get preempted during work (by O/S running other applications).


Simple parallel_for with qtControl

 

Example:

 

// same slice function

void DoSum (int iFrom, int iEnd, double* A, double* B, double* C)
{

for( int i = iFrom; i < iTo; ++i)
      A[i] = B[i] + C[i];

}

 

. . .
      double* A = new double[nSize];

      double* B = new double[nSize];

      double* AB = new double[nSize];

double* C = new double[nSize];

      double* D = new double[nSize];

      double* CD = new double[nSize];

. . .

{

      qtControl   qtControl;

parallel_for(&qtControl, DoSum, 0, nSize, A, B, AB);

parallel_for(&qtControl, DoSum, 0, nSize, C, D, CD);

}

 

The above two parallel_for statements run concurrently. The blocking occurs within the dtor of the qtControl structure called at the close brace. The scoping of the qtControl structure is under the programmer’s control. Alternately, this example could have omitted the use of the local qtControl structure and used the parallel_wait(); however, the current task may have had additional sub-tasks pending and you may not wish to wait for those additional tasks to complete as well.

 

If your compiler supports the C++0x Lambda functions an improved method for the above code

 

. . .

{

      qtControl   qtControl;

      parallel_invoke(

            [&]()

            {

parallel_for(

      &qtControl,

DoSum, 0, nSize, A, B, AB);

                  },

            [&]()

            {

parallel_for(

      &qtControl,

DoSum, 0, nSize, C, D, CD);

                  });

}


Should you wish to block on a qtControl structure without letting it go out of scope you can call the member function WaitTillDone().

 

double* A = new double[nSize];

      double* B = new double[nSize];

      double* AB = new double[nSize];

double* C = new double[nSize];

      double* D = new double[nSize];

      double* CD = new double[nSize];

      double* ABCD = new double[nSize];

. . .

{

      qtControl   qtControl;

parallel_for(&qtControl, &DoSum 0, nSize, A, B, AB);

parallel_for(&qtControl, &DoSum 0, nSize, C, D, CD);

qtControl.WaitTillDone(); // or parallel_wait(&qtControl);

parallel_for(&qtControl, &DoSum, 0, nSize, AB, CD, ABCD);

// something else to do while performing parallel_for

}

 

The reason for coding this way is it eliminates an additional call to the ctor of the qtControl object.


parallel_for with placement

 

You may find it advantageous to specify placement restrictions on the parallel_for.

(see qtPlacement):

 

Example 1:

 

Assume you have a loop that has a relatively small number of iterations and you know that most of the data for the loop is “hot in L2 cache” of the current thread, and further you know that the execution time will benefit using multiple threads only when the other thread(s) sharing the current thread’s L2 cache are waiting to run.

 

parallel_for(Waiting_L2$, fnFoo, 0, nSize, A, B, C);

 

Example 2:

 

Assume you have a loop that has a relatively small number of iterations and you know that most of the data for the loop is not in cache (in RAM) and further you know that the execution time will benefit using multiple threads only when the other thread(s) are available and also share the same L2 cache with each other.

 

parallel_for(Waiting_NotInCache_L2$, fnFoo, 0, nSize, A, B, C);

 

Example 3:

 

You have a relatively long running loop but each instance of the running loop (each slice) will run optimally when all the data resides within one L3 cache

 

parallel_for(NotInCache_L3$, fnFoo, 0, nSize, A, B, C);

 

Assuming your system has two quad core processors each with L3 cache capability. The processor with the most available threads at L3 would be selected and the number of threads scheduled will be the number of HW threads sharing the same L3 (4 in this case).


parallel_for_each

 

Task

 

void fn(intptr_t iPos[, optional args])

{

      // ...

}

 

The function fn is run in parallel one element at a time using optional arguments.

 

Static function

 

void parallel_for_each(

      [qtPlacement,]

      [&qtControl,]

iEnd,       // half open termination point

      &fn,       // void fn(intptr_t i[, ...]);

iBegin      // beginning and varying position

[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]); // optional args

 

 

Member function

 

void parallel_for_each(

      [qtPlacement,]

      [&qtControl,]

iEnd,       // half open termination point

[&Obj,]     // object when member function

      &Object::fn,// void Object::fn(intptr_t i[, ...]);

iBegin      // beginning and varying position

[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]); // optional args

 

Lambda Function (C++0x)

 

void parallel_for_each(

      [qtPlacement,]

      [&qtControl,]

iBegin,     // beginning and varying position

iEnd,       // half open termination point

[&](intptr_t i)

{

      // i=item number

}

);

 

*** Note different iBegin argument placement for Lambda function ***

parallel_for_each schedules or runs a (or multiple) dispatching task(s) that happen to be waiting for work at the time of the parallel_for_each. The dispatching task(s) will run making calls across their slices of the iteration space (iBegin, iEnd) to the supplied function providing the current position within the slice of the iteration space (together with optional args).

 

As the dispatching task runs, it monitors for the availability of additional threads to become idle and are now available to join the team processing the parallel_for_each. Monitoring is at beginning of task, then periodically during dispatching. When an available thread or threads are observed, the remaining iteration space of the observing thread is divided, a portion is held for the current thread, and the remainder is en-queued for the idle thread. The new task is a recursive call of the current distribution task and it runs in parallel with the first task. Only when additional threads become available is the iteration space split.

 

Note, the various dispatching task(s) may run out of iteration space at differing times. When a busy dispatching task notices a new idle thread it will split its remaining iteration space and en-queues a new task.

 

The number of task en-queuing operations for parallel_for_each is generally larger than for parallel_for. However, when the execution time varies across the iteration space, a non-chunking parallel_for might not fully utilize all cores (i.e. some threads finish early while others finish late). A parallel_for with chunking can improve the utilization of all cores but at the expense of incurring a fixed number of additional task en-queue/de-queue. Use of parallel_for_each is thread availability demand related number of additional task en-queue/de-queue operations. As to which method is best, this will depend on your application.

 

The criteria for using parallel_for_each are:

 

a)       The work per object is variable and significant with respect for the cost of intermittent task en-queue/de-queue operations.

b)       Your requirement is to distribute objects using qtPlacement into specified cache systems such that the object sub-tasks can subsequently schedule within that cache.

c)       The availability of threads to schedule is uncertain at the issuance of the parallel_for_each and may change during the execution of the control loop.

 

Example:

 

void  DoObject(intptr_t iObj);      // Function to do per object

 

void  DoObjectsParallel()

{

      // Create 1 or more instances of DoObject task

      // split up across (for each) L2 cache on system

// (one task per L2).

//

      //                qtPlacement, iEnd,     fn,       iBegin

      parallel_for_each(OneEach_L2$, nObjects, DoObject, 0);

}

 

void DoObject(intptr_t iObj)

{

      // Obtain an object from the Object List

      // The Object contains three matricies

      // A and B are to be multiplied and stored in output

      Object& Obj = ObjectList[iObj];

 

      // Perform the matrix multiplication row at a time

      // using a thread pool of threads within my L2

      parallel_for(

L2$,              // only threads sharing our thread's L2

matmultStripe,    // function to perform striped matmul

0, Obj.Height,    // split across number of threads on L2

Obj.inputA, Obj.inputB,

Obj.output,       // results

Obj.Width, Obj.Height);

 }

 

Note, matmultStripe would be optimal for this Object when all cores using the selected L2 would be available. This is not normally the case. As such, not all stripes will finish at the same time. When the array size is relatively large and the system is working hard, then consider using the iChunk form of the parallel_for.

 

Example using iChunk follows:

 

void DoObject(intptr_t iObj)

{

      // Obtain an object from the Object List

      Object& Obj = ObjectList[iObj];

      parallel_for(

L2$,              // only our thread's L2

Obj.Height / 4,   // iChunk = 1/4 of total

&matmultStripe,   // function to perform striped matmul

0, Obj.Height,    // split into iChunk sizes

Obj.inputA, Obj.inputB,