
Threading Toolkit Programmer’s Reference Manual
V1.0.1
Copyright © 2009
QuickThread Programming, LLC
www.quickthreadprogramming.com
QuickThread is a pending trademark of
James G. Dempsey
QuickThread Programming, LLC
The information contained herein is the intellectual property of QuickThread Programming, LLC, all rights reserved.
Additional information can be obtained by email at info@quickthreadprogramming.com or at the website http://www.quickthreadprogramming.com.
Conceptual
programming technique
C++
Programming with QuickThread
Alternate
specialty initialization
Static
function (non-member function)
Choosing
between parallel_invoke or parallel_task
Static
function (non-member function)
Special
notes for Lambda functions
Simple
parallel_for with iChunk
Simple
parallel_for with qtControl
Miscelleaneous
Library Functions
QuickThreadQueueIO(qtControl,
aSub[,args])
QuickThreadQueueDoChunkTemporal
Example
2: OpenMP outer level parallelization.
Example
3: OpenMP inner level parallelization.
Example
4: OpenMP outer and inner level parallelization
Example
5 QuickThread outer and inner level parallelization
QuickThread is a C++ runtime library and programming paradigm for writing multithreaded applications in 32-bit and 64-bit environments using C++, FORTRAN and mixed language programs.
QuickThread is affinity capable supporting thread affinity, data binding affinity and NUMA support.
QuickThread is a tasking system using thread pools and provides exceptional control over task scheduling with respect to cache levels, core placement, and thread availability.
The design goal of QuickThread is to produce a minimal overhead mechanism for distributing work in a multi-socket, multi-core, multi-threaded environment.
NOTICE
The FORTRAN Application Programmers Interface is not officially supported at
this time. These interfaces are residual interfaces leftover from the earliest
development work on QuickThread. Feel free to experiment at your own risk. If
you have a need for FORTRAN as a supported product, please contact QuickThread
Programming, LLC and express your needs. Future versions of QuickThread will
address this issue.
parallel_task Schedule a single task.
parallel_invoke Invoke multiple different tasks (C++0x Lambda functions only)
parallel_distribute Schedule a task team to work on different portions of same task..
parallel_for Schedule a task team to run across an iteration space divided up evenly to team members (or chunked up to team members)..
parallel_for_each Schedule a task team across an iteration space divided upon demand by each team member number.
parallel_reduce Schedule a task team across an iteration space divided upon demand by each team member number while performing reduction operation.
parallel_list Schedule a task team to process a singly linked list of objects.
parallel_pipeline
Schedule a task team to process a
sequence of steps (pipes) contained within a vector (pipeline).

The above figure depicts an idealized system with eight threads (T0-T7), running on two processors, each processor with four cores, three level cache, two memory systems. Two core pairs within each processor sharing one of two L2 cache within the processor, all cores within each processor sharing a processor common L3 cache. And each processor with direct access to a local RAM (M0) and one hop access to RAM local to the other processor (M1 obversely). The above diagram can be expanded to include additional processor packages and memory systems as well as additional memory hop levels (M2, M3).
In the idealized system, each thread has independent data distributed amongst the various cache and memory levels and where the programming goal is to keep as many of the thread’s data (and instruction) accesses as close to its L1 as possible. When the programmer has the means to control the execution of the application in a manner complementary to this idealized system, then the application will experience maximum performance.
In practice, the generally used threading tools do not provide the programmer with the means to control the program execution towards this idealized system. That is until now.
One of the techniques employed by most of the threading tools which provides a limited measure of this control, was the switch in programming practice from:
using a dedicated software thread per task
to
using a pool of threads (typically with one software thread per hardware thread)
Then using a task scheduler within the application that schedules tasks to available threads from the thread pool. This technique exchanged a costly operating system thread context switch with a comparatively low cost task context switch within the application.
Additionally, when using the thread pool tasking technique, the programmer can use thread affinity to pin the software thread to a specific hardware thread (or set of threads). Using thread pinning, and when the operating system interrupts an application thread, or context switches to another application, or system task, then upon resumption of application thread, there is the benefit of a higher probability of some portion of the previously cached application data still being present in its cache system.
The remaining control technique for the programmer to approach the idealized system is the means to choreograph not only the task scheduling but also the task placement, interaction with other tasks, and data placement control. QuickThread offers this level of control.
QuickThread offers the programmer the means to:
With QuickThread the programmer can exert extraordinary level of control by the inclusion of a single placement directive on the parallel_for and other parallel directives.
The conceptual programming technique for QuickThread is a messaging system whereby you to throw objects and arguments at functions (C++) or subroutines (FORTRAN). These throw requests are placed into a queue (one of several queues). The queued subroutines, when run, can throw (en-queue) additional subroutines and arguments or perform work or do both.
The general queuing technique is neither strictly LIFO nor FIFO. QuickThread will en-queue the work requests in a manner that defaults to be hot-in-cache friendly (the programmer optionally can use cache directed en-queuing of work requests as well as FIFO en-queuing).
When thread affinity is not used, the application programmer can select between a compute class queue and an I/O class queue. There is a third queue called the compute class overflow queue which may be used in rare circumstances (i.e. to not block en-queue requests by other threads while the primary compute class queue is being allocated additional free nodes performed by the first thread to cause the overflow).
Affinity, when enabled, will, at programmer’s direction, pin compute class threads to one or more execution cores. Then the programmer has the choice of using affinity directed task en-queuing of compute class tasks, or non-affinity directed task en-queuing of compute class tasks. I/O class threads are not affinity pinned.
Tasks in QuickThread are standard functions (C++), and/or class member functions (C++), ) that have a void return, or subroutines in FORTRAN, and take from 0 to 9 arguments. There is no distinction between a task and function/subroutine as these functions or subroutines can be called directly as well as having task requests thrown at them. For example, the parallel_for template, when creating N tasks will, en-queue N-1 tasks and then directly call the en-queued function (thus saving the overhead of one task en-queuing operation).
From the function’s viewpoint there is no distinction between being called as a task and being called directly from within the application.
There are no special considerations when writing functions callable from the task scheduler other than for the standard multi-threaded programming requirement in making the function thread-safe.
At this time, C++ exception handling is not performed between the en-queued task and the en-queuing task. Future revisions may address this issue.
Tasks begin execution upon call (by QuickThread task manager) and terminate upon return. Tasks do not directly make request for work to do, rather they begin life upon receipt of requests for work to do (with optional list of arguments). A task is not a thread waiting in an idle loop, rather, a task is code waiting for a function call.
Using threading model, as opposed to tasking model, the application would start many threads with each initially entering a wait state (e.g. WaitForSingleEvent). Standard threading models tend to incur a higher degree of interaction with the operating system than does a task pool system such as QuickThread.
The C++ programmer includes the QuickThread header file plus desired template headers.
#include <QuickThread.h> // Required header
#include <parallel_task.h> // Optional template
#include <parallel_invoke.h> // Optional template
#include <parallel_distribute.h> // Optional template
#include <parallel_for.h> // Optional template
#include <parallel_list.h> // Optional template
#include <parallel_reduce.h> // Optional template
#include <parallel_pipeline.h> // Optional template
using namespace
qt; //
Optional namespace
And then link in the appropriate QuickThread.lib file (x32 or x64 for target O/S)
For the C++ programmer, QuickTread uses the namespace “qt” although some of the FORTRAN entry points are visible as QUICKTHREADsomenamehere. The programmer has the choice of using the verbose fully qualified names or shorthand templates. If the programmer prefers, they are free to rename the templates or add there own templates.
NOTICE
These template files are typically a shell file that declares a macro then includes the current version of the template. Applications can include specific or multiple versions of templates. This permits you to retrograde versions when necessary. Or permits you to accommodate template naming conflicts when combining QuickThread with other threading toolkits such as TBB (Threading Building Blocks).
Consider parallel_task.h
// parallel_task.h
#pragma once
// pick the version
#include "parallel_task_v1.h"
// define as default
#define parallel_task
parallel_task_v1
There are several methods to initialize your application for using QuickThread.
// YourApp.cpp
#include <QuickThread.h>
using namespace
qt;
int main(void) // or with arguments
{
// Worker Threads = # hardware threads + 1
I/O thread
qtInit qtInit(-1); // -1 == use defaults
// ... your program here
return YourReturnCode;
}
// YourApp.cpp
#include <QuickThread.h>
using namespace
qt;
int main(void)
{
// Worker Threads = 3 threads + 1 (or
more) I/O threads
qtInit qtInit(3);
// ... your program here
return YourReturnCode;
}
// YourApp.cpp
#include <QuickThread.h>
using namespace
qt;
int main(void)
{
// Worker Threads = 3 threads + 2 I/O
threads
qtInit qtInit(3,2);
// ... your program here
return YourReturnCode;
}
// YourApp.cpp
#include <QuickThread.h>
using namespace
qt;
int main(void)
{
qtInit qtInit; // default ctor
// default ctor initialized qtInit to
default settings
// but does not start the QuickThread
thread pool
// Optional configuration of the
qtInit object here
qtInit.nWorkerThreads
= 4;
qtInit.nIoThreads
= 1;
// the following not available in ctor
qtInit.DefaultSpinWait
= 512;
qtInit.nAffinityGrouping
= 2;
// ...
// Start QuickThread
if(qtInit.StartQT())
return YourThreadingErrorReport();
// ... your program here
// End QuickThread
if(qtInit.EndQT())
return YourThreadingErrorReport();
return YourReturnCode;
}
// YourApp.cpp
#include <QuickThread.h>
// The
Main-level task is the only task that returns a completion code
int MainTask(void* context)
{
// do something with context if desired
DoWork(); // or body of your Main Task
return YourReturnCode;
}
// The C++
program entry point
int _tmain(int argc, _TCHAR* argv[])
{
GetCommandLineArguments(argc,
argv);
// Declart a qtInit object
qt::qtInit qtInit;
// Modify qtInit as desired
qtInit.nWorkerThreads
= 4;
qtInit.nIoThreads
= 1;
// the following not available in ctor
qtInit.DefaultSpinWait
= 512;
qtInit.nAffinityGrouping
= 2;
// ...
// Launch Main Task
// with optional argument context
// return error code
return qtInit.QueueMain(MainTask,
NULL);
}
The choice of technique for
initialization is left for the programmer to decide.
Tasks on QuickThread are any
function returning void
void
foo(a1T a1, a2T a2, ...)
This may be a static function,
or class member function or Lambda functions (for compilers supporting C++0x
features). Functions may have from 0 to 9 arguments.
Functions may be overloaded
(same name, different argument lists). In cases where the compiler is unable to
disambiguate the function signature a cast may be required on the function name
argument.
Internally, tasks on
QuickThread, are evoked by way of a qtControl structure (described later). This qtControl structure is either implicit or explicit at
programmer preference.
Most of your task evocations
will likely be by way of the supplied templates. These templates are provided
in source form and thus have the provision for user extensibility. The major
templates names are:
parallel_task
parallel_distribute
parallel_for
parallel_for_each
parallel_list
parallel_reduce
parallel_pipeline
parallel_wait
The templates will accomidate
either static functions, member functions or Lambda functions (for compilers
supporting C++0x features).
Most templates accept a
variable number of arguments, some optional, some required, and follow this
generalized organization:
parallel_{template_suffix} (
{enqueuing instructions and arguments ouside control of task}
{address of object when enquing member function as task}
{address of task}
{arguments to task}
);
Example:
void Foo(int n, char* msg);
...
Foo(1, “hello world”);
becomes
parallel_task(Foo, 1, “hello world”);
and
class Obj_T
{
...
void Fee( int a, int b, int c);
};
...
Obj->Fee(1,2,3);
becomes
parallel_task(Obj, &Obj_T::Fee,
1, 2, 3);
Generally the left to right
sequence of the tokens is the same and therefore conversion effort in part can
be done with a search and replace.
parallel_task(
[qtPlacement,]
[&qtControl,]
&fn,
[, a1[, a2[, a3[, a4[, a5[,
a6[, a7[, a8[, a9]]]]]]]]]
);
qtPlacement an optional placement and/or selection directive (see qtPlacement)
&qtControl an optional address of QuickThread Control structure (see qtControl)
&fn a required address of Task function name
a1 through a9 are optional function arguments.
parallel_task(
[qtPlacement,]
[&qtControl,]
&Obj,
&ClassName::fn,
[, a1[, a2[, a3[, a4[, a5[,
a6[, a7[, a8[, a9]]]]]]]]]]
);
qtPlacement an optional placement and/or selection directive (see qtPlacement)
&qtControl an optional address of QuickThread Control structure (see qtControl)
&Obj a required address of object
&ClassName::fn a required address of function name
a1 through a9 optional function arguments.
The distinction between
member function and static function is the presense or absence of the address
of the object of the class of the member function.
parallel_task(
[qtPlacement,]
[&qtControl,]
[&]()
{
// ... function body
}
);
qtPlacement an optional placement and/or selection directive (see qtPlacement)
&qtControl an optional address of QuickThread Control structure (see qtControl)
[&]( ){ ... } required Lambda function (may use = as well as & and &x,=y, …)
The en-queued tasks are
standard functions and/or member functions that can be called directly or
enqueue using parallel_task. Lambda functions (C++0x) have no function name and
take their arguments by way of a hidden object containing reference (&) or
value (=) of current scoped variables.
NOTICE
The following parallel_task examples serve as a tutorial relating to use
of qtPlacement and qtControl calling variations.
These variations are applicable to other parallel_... templates.
The
other template descriptions will not repete this information.
Although this section is verbose, the qtPlacement and qtControl are
important features of all templates.
Read this section to learn the nuances
of task enqueuing that will apply to the other parallel_... templates.
Examples:
// static (free)
functions
SimpleStaticFunction(1, msg,
101);
parallel_task(&SimpleStaticFunction,
1, msg, 101);
Note, most C++ compilers will
permit you to drop the & on the static function name. Scoped qualified
member function names require the presense of the &.
parallel_task(SimpleStaticFunction,
1, msg, 101);
When parallel_task is issued witout the address of a qtControl structure
the address of the Task’s current level default control structure is used. The
enqueued task will run asynchronous from the code that issued the parallel_task.
Using explicit qtControl
parallel_task(&qtControl,
SimpleStaticFunction, 1, msg, 101);
Using explicit cache level
directive (to Task’s current level default control structure)
parallel_task(L2$,
SimpleStaticFunction, 1, msg, 101);
Using explicit cache level
directive to explicit qtControl
parallel_task(L2$,
&qtControl, SimpleStaticFunction, 1, msg, 101);
// and static
member functions (no required &Obj)
CBaseClass::StaticMemberFunction(2,
msg, 102);
parallel_task(&CBaseClass::StaticMemberFunction,
2, msg, 102);
// Simple member
function (requires address of object)
a.SimpleMemberFunction( 0,
msg, 100);
parallel_task(&a,
&CBaseClass::SimpleMemberFunction, 0, msg, 100);
p->SimpleMemberFunction(
0, msg, 100);
parallel_task(p,
&CBaseClass::SimpleMemberFunction, 0, msg, 100);
// virtual
member functions
b.SimpleVirtualFunction(4,
msg, 104);
parallel_task(&b,
&CBaseClass::SimpleVirtualFunction, 4, msg, 104);
parallel_task(&c,
&CDerivedClass::TrickyVirtualFunction, 7, msg, 107);
// parallel_task
Lambda function
Total = 0;
parallel_task(
[&]()
{
for(int i=0; i<nVectors; ++i)
Total += intVector[i];
}
);
// *** Caution
Total not complete yet
// returns from
parallel_task while task in progress
Special
Note for Overloaded Functions
At
times you may be required to cast a function when functions are overloaded and
ambiguous.
void Foo(int i);
void Foo(long i);
void Foo(double d);
…
char arg; // When arg is char,
which Foo do you mean?
…
parallel_task( (void(*)(long))Foo, arg); // use
void Foo(long i);
NOTICE
parallel_task is non-blocking. Meaning execution generaly continues
past the parallel_task statement while the en-queued task runs (or is scheduled
to run).
When parallel_task is issued without the qtControl argument, a default thread context qtControl is used. Each thread maintains a task level default qtControl structure. You can use parallel_wait(); (with a null qtControl) to insert a barrier for all sub-tasks spawned by way of
the thread’s task-level default thread context qtControl. There is an implicit
parallel_wait(); at end of task for any pending sub-tasks en-queued
using a task level default qtControl
structure.
void Task1()
{
parallel_task(Task2); void
Task2()
parallel_task(Task3); { void Task3()
DoOtherWork();
parallel_task(Task4); {
parallel_wait(); // for Task2 and Task3 parallel_task(Task5);
parallel_task(Task6);
UseResultsOfTask1andTask2(); } parallel_task(Task7);
} }
Task1 has a default qtControl that is different from the
other tasks default qtControl
structures. In the above, Task2 and Task3 are enqueue via Task1 default qtControl. Task2 and Task3 (may) run in
parallel with Task1, including DoOtherWork(), up until Task1’s
parallel_wait();. And at which point, Task1 blocks until Task2 and Task3
complete.
Task2 enqueues two tasks
(Task 4 and Task5) using Task2’s default qtControl
structure.
Task 3 enqueues Task6 and
Task7.
Task4 and Task5 may run
concurrent with Task2, the remaining part of Task1 (upto parallel_wait in
Task1), Task3, Task6, Task7 and any additional tasks enqueued by those tasks.
Upon exit of Task3 it will
wait for completion of tasks enqueued by way of Task3’s default qtControl structure. In this case,
Task3 waits for completion of Task6 and Task7.
Upon exit of Task2 it will
wait for completion of tasks enqueued by way of Task2’s default qtControl structure. In this case, Task2
waits for completion of Task4 and Task5.
The parallel_wait() in Task1
was inserted to explicitly wait for results from Task2 and Task3 then can make
use of results generated by Task1 and Task2.
The parallel_task can accept an optional qtControl argument to be used in place of the task default qtControl structure.
The optonal qtControl argument provides additional
control over the task enqueuing. A primary function is to provide object and/or
task level placement and thread synchronization. Your task can use multiple qtControl objects for coordinating
multiple task lists. Tasks enqueued via explicit qtControl can live longer than
the parent task that performs the enqueue.
qtControl yourControl;
...
void SomeTask()
{
...
parallel_task(&yourControl, Foo);
...
}
In the above, the exit of
SomeTask() does not implicitly wait for task Foo to complete.
To wait for Foo completion
(anywhere in the program) use:
yourControl.WaitTillDone();
The qtPlacement argument specifies
restrictions on which spawned threads are to be run. When used in conjunction
with the qtControl argument this provides
the programmer considerable control with respect to performance considerations.
Examples:
a) Queue the task to the thread that co-resides with the
issueing thread’s L2 cache.
b) Locate an L3 cache which has the most available
threads and condition the qtControl
structure to enqueue/deque to/from those groups of threads.
c) Condition the qtControl
structure to distribute current en-queued task and subsequent en-queued tasks
to one thread per each L2 cache. Thus providing for each en-queued task to
enqueue a sub-task to threads sharing its L2 cache using (a) above
Using thread private
task-level default qtControl
// primary tasks
first, then secondary tasks
parallel_task( Task_A );
parallel_task( Task_B );
parallel_task( Task_C );
parallel_wait();
parallel_task( Task_A2 );
parallel_task( Task_B2 );
parallel_task( Task_C2 );
parallel_wait();
xxx
|<->|----Task_A------| |(prior tasks)|<->|----Task_A2-----|xxx|
|<->|----Task_B-----------| |<->|----Task_B2---|
|<->|----Task_C----------------| |<->|---Task_C2--|
|<------------------------
latency time to run xxx -------------->|
|<------------------------
total elapse time ------------------------>|
Notes, “|<->|“ depicts possible skew in start of task.”|(prior tasks)|” depicts
potential completion time for prior task enqueued using default qtControl, and “|xxx|” depicts code that follows second parallel_wait.
Using specific qtControl object.
{ // primary tasks
first, then secondary tasks
qtControl qtControl;
parallel_task(&qtControl, Task_A );
parallel_task(&qtControl, Task_B );
parallel_task(&qtControl, Task_C );
parallel_wait(&qtControl); //
or qtControl.WaitTillDone();
parallel_task(&qtControl, Task_A2 );
parallel_task(&qtControl, Task_B2 );
parallel_task(&qtControl, Task_C2 );
}
xxx
|<->|--------Task_A------| |<->|----Task_A2------|xxx|
|<->|--------Task_B-----------| |<->|----Task_B2---|
|<->|--------Task_C----------------|<->|---Task_C2--|
|<------------------
latency time to run xxx ----------->|
|<------------------
total elapse time --------------------->|
|<------------------
prior total elapse time ------------------------>|
The first example has
no consideration for additional pending tasks en-queued by the current task and
where there is no requirement to suspend execution.
In the second example
additional tasks en-queued by the current task may be pending. The programmer
uses the local qtControl to en-queue
and synchronize a local list of tasks. When the qtControl leaves scope, the dtor performs an implicit wait. The
advantage of the second method is Task_A2, Task_B2, Task_C2 do not have to wait for pending task level
sub-tasks en-queued prior to the entry of the scope of the local qtControl structure.
Through use of
multiple qtControl objects you can
handle multiple task queues concurrently and improve your synchronization
appropriately.
{ // primary tasks
first, partial overlap with secondary tasks
qtControl qtControlA, qtControlB, qtControlC;
parallel_task(&qtControlA, Task_A );
parallel_task(&qtControlB, Task_B );
parallel_task(&qtControlC, Task_C );
parallel_wait(&qtControlA); //
or qtControlA.WaitTillDone();
parallel_task(&qtControl, Task_A2 );
parallel_wait(&qtControlB); //
or qtControlB.WaitTillDone();
parallel_task(&qtControl, Task_B2 );
parallel_wait(&qtControlC); //
or qtControlC.WaitTillDone();
parallel_task(&qtControl, Task_C2 );
xxx
}
yyy
|<->|--------Task_A------|<->|----Task_A2------|
|<->|--------Task_B-----------|<->|----Task_B2---|
|<->|--------Task_C----------------|xxx|<->|----Task_C2--|yyy|
|<------------------
total elapse time --------------------->|
|<---
latency time to run xxx ---->|
|<------------
prior latency time to run xxx ----------->|
|<------------
prior total elapse time --------------------->|
|<------------
prior prior total elapse time ------------------------>|
Using completion nodes may
eliminate skew
{
// Using
completion node format
// primary tasks
first, complete overlap with secondary tasks
// inside of scope
qtControl qtControlA, qtControlB, qtControlC;
parallel_task(&qtControlA, Task_A);
parallel_task(OnDone$, &qtControlA, Task_A2);
parallel_task(&qtControlB, Task_B );
parallel_task(OnDone$, &qtControlB, Task_A2);
parallel_task(&qtControlC, Task_C );
parallel_task(OnDone$, &qtControlA, Task_A2);
xxx
}
yyy
|xxx|
|<->|--------Task_A------|----Task_A2------|
|<->|--------Task_B-----------|----Task_B2---|
|<->|--------Task_C----------------|---Task_C2--|yyy|
// completion nodes with controls outside of scope
qtControl qtControlA, qtControlB, qtControlC;
{ // primary tasks
first, complete overlap with secondary tasks
parallel_task(&qtControlA, Task_A);
parallel_task(OnDone$, &qtControlA, Task_A2);
parallel_task(&qtControlB, Task_B );
parallel_task(OnDone$, &qtControlB, Task_A2);
parallel_task(&qtControlC, Task_C );
parallel_task(OnDone$, &qtControlA, Task_A2);
xxx
}
yyy
|xxx|yyy|
|<->|--------Task_A------|----Task_A2------|
|<->|--------Task_B-----------|----Task_B2---|
|<->|--------Task_C----------------|---Task_C2--|
In the last example,
when the qtControl objects are instantiate outside the scope of the task (as
static object or passed in as an argument to the task). Then the en-queued task
can return with pending sub-tasks.
By varying your use of
the qtControl you can significantly alter latencies as well as core
utilization. Control of latencies means control of “Hot in cache”.
When parallel_task is issued with qtPlacement attribute containing OnDone$ the en-queued task is placed
into a FIFO list associated with the qtControl
object. When all pending tasks complete that are/were en-queued via the
specified control object, then the tasks en-queued using the OnDone$ attribute are processed in FIFO
order. The OnDone$ qtPlacement
attribute can be used to serialize tasks whenever that is in your design requirement.
Proper use of the qtControl(s) and the parallel_task with placement attribute
of OnDone$ can improve concurrency.
N.B. The programmer is free to use qtPlacement
or other techniques to schedule the completion task to a thread other than the
one it waits on.
As you can see, you
have considerable control over the sequencing of the tasks. And consequently
must take care in your programming to attain the desired results.
Example code outlines
for qtControl existing
outside the scope of the task:
void foo(qtControl* qtControl)
{
//... other code here
parallel_task(qtControl, Task_A);
parallel_task(qtControl, Task_B);
parallel_task(qtControl, Task_C);
//... other code here
// exit Task foo with
// Task_A, Task_B, Task_C pending,
running, or complete
}
void SomeFunction(someArgs)
{
qtControl qtControl; // SomeFunction scoped control
//... other code here
// use local control as argument to
foo
// Task_A, Task_B, Task_C enqueued
via above qtControl
// use default task level control for
control of Task foo
parallel_task(foo, &qtControl);
// return while foo
pending/running/complete
// Task_A, Task_B, Task_C
pending/running/complete
// ... run other code here
// wait for all pending default task
level control Tasks
// including Task foo to complete
// but do not wait for pending tasks
on qtControl
// i.e. Task_A, Task_B, Task_C may be
pending/running/complete
parallel_wait();
// ... other code here
// wait for Task_A, Task_B, Task_C to
complete
// (as well as additional tasks
enqueued via this qtControl)
parallel_wait(&qtControl); // or
qtControl.WaitTillDone();
// ... other code here
} //
void SomeFunction(someArgs)
void SomeOtherFunction(someArgs)
{
qtControl qtControl;
//... other code here
// use local control as argument to
foo
// Task_A, Task_B, Task_C enqueued
via above qtControl
// use local control for control of
Task foo
parallel_task(&qtControl, foo, &qtControl);
// return while foo
pending/running/complete
// Task_A, Task_B, Task_C
pending/running/complete
// ... other code here
// wait for foo and Task_A, Task_B,
Task_C to complete
// (as well as additional tasks
enqueued via this qtControl)
parallel_wait(&qtControl);
// ... other code here
} //
void SomeOtherFunction(someArgs)
void
OtherSomeOtherFunction(someArgs)
{
// code elsewhere
qtControl qtControl;
//... other code here
// use local control as argument to
foo
// Task_A, Task_B, Task_C enqueued
via above qtControl
// use local control for control of
Task foo
parallel_task(&qtControl, &foo, &qtControl);
// return while foo pending/running/complete
// Task_A, Task_B, Task_C pending/running/complete
// ... other code here
// implicit
parallel_wait(&qtControl);
// on dtor of local qtControl
// waits for foo and Task_A, Task_B,
Task_C to complete
// (as well as additional tasks
enqueued via this qtControl)
} //
void OtherSomeOtherFunction(someArgs)
The parallel_invoke template
is used to invoke two or more Lambda function tasks (C++0x) and/or void
function tasks.
The principal differences
between parallel_task and parallel_invoke are: parallel_invoke is less
overhead, and parallel_invoke always blocks. In places of your code where you
enqueue multiple tasks then immediately wait then consider using
parallel_invoke. e.g. n-way partitioning.
parallel_invoke(
[qtPlacement,]
[*](){ ...Lambda0... }, // or (void)(fn*)(void)
[*](){ ...Lambda1... } // or (void)(fn*)(void)
...); // additional Lambda functions here
Where
qtPlacement Optional
qtPlacement argument
[*] Standard Lambda context ([&],
[=], [], [&arg1,=arg2,…)
() no calling parameters
{ ...Lambda0... } Body of Lambda function 0
{ ...Lambda1... } Body of Lambda function 1
... Additional Lambda
functions follow
A rework of one of the
parallel_task samples yields
{
double a,
b, c;
parallel_invoke(
[&](){ Task_A(a); },
[&](){ Task_B(b); },
[&](){ Task_C(c); });
//
implicit barrier
parallel_invoke(
[&](){ Task_A2(a); },
[&](){ Task_B2(b); },
[&](){ Task_C2(c); });
//
implicit barrier
}
xxx
|<->|--------Task_A------| |<->|----Task_A2------|xxx|
|<->|--------Task_B-----------| |<->|----Task_B2---|
|<->|--------Task_C----------------|<->|---Task_C2--|
|<------------------
latency time to run xxx ----------->|
|<------------------
total elapse time --------------------->|
|<------------------
prior total elapse time ------------------------>|
Real Example:
// call with corners
of rectangle for interacton
void rect_interact(int i0, int i1, int j0, int j1)
{
// if there's
room,
// further subdivide the rectangle into
four smaller ones
int di = i1
- i0; int dj = j1 - j0;
if (di >
grainsize && dj > grainsize)
{
int im = i0 + di/2;
int jm = j0 + dj/2;
// recursively call ourself to...
// invoke two tasks for two non-conflicting rectangles
parallel_invoke(
[&]() {rect_interact(i0, im, j0, jm);},
[&]() {rect_interact(im, i1, jm, j1);});
// return when done
// Then, recursively call ourself to...
// invoke two tasks for other two
non-conflicting rectangles
parallel_invoke(
[&]() {rect_interact(i0, im, jm, j1);},
[&]() {rect_interact(im, i1, j0, jm);});
// return when done
}
else
{
// otherwise
all we have left is a strip
// that can be handled locally
for (int
i = i0; i < i1; ++i)
for (int j = j0; j
< j1; ++j)
addAcc(i, j);
}
}
Second example:
void
body_interactQT(int i, int j, int level)
{
// split the interaction triangle into upper and lower
triangles
// (which can be executed in
parallel) and the adjacent rectangle
// (which will be further
split)
int d = j - i;
if (d > 1)
{
int k = d/2 + i;
if(level == 0)
{
// using qtPlacement of OneEach_L2$
has the effect of
// scheduling current thread
// plus one of a group of L2 caches
that is not in my L2
// first choice will be one of the
other L2 cache threads
// that is waiting. Second
choice is closest other L2
// Replace OneEach_L2$
with OneEach_L3$ or OneEach_M0$
parallel_invoke(
OneEach_L2$,
[&](){body_interactQT(i,k,level+1);},
[&](){body_interactQT(k,j,level+1);});
}
else
{
// for all other levels (1, 2, ...)
// schedule to closest cache level
parallel_invoke(
[&](){body_interactQT(i,k,level+1);},
[&](){body_interactQT(k,j,level+1);});
}
rect_interactQT(i, k, k, j);
}
// if d is 1 or 0 then we
can skip it.
}
Note, if your system has
multiple sockets and multiple L3 caches per socket then expand the code above
such that on level == 0 you use OneEach_M0$, and when level == 1 use
OneEach_L3$, and all other levels make no qualifications as to level.
Or in place of a series of
fixed dispatches your initialization code could create a table of the levels of
cache that contain different sets of threads. (following page)
// in static data
qtPlacement YourTable[4]; // NULL or some qtPlacement
value
// On a 2-socket system
// with each socket
having Siamese twin dies
// each die with L3
cache
// two cores sharing L2
// with HT
// try:
// YourTable[0] = OneEach_M0$ (two-way split –
sockets)
// YourTable[1] = OneEach_L3$ (two-way split –
each L3)
// YourTable[2] = OneEach_L2$ (two-way split –
each L2)
// YourTable[3] = OneEach_L1$ (two-way split –
each L1)
...
if(level < 4 &&
YourTable[level])
{
// OneEach_L1$, OneEach_L2$,
OneEach_L3$, or OneEach_M0$
parallel_invoke(
YourTable[level], // qtPlacement
[&](){body_interactQT(i,k,level+1);},
[&](){body_interactQT(k,j,level+1);});
}
else
{
// for all other levels (or no splits)
// schedule to closest cache level
parallel_invoke(
[&](){body_interactQT(i,k,level+1);},
[&](){body_interactQT(k,j,level+1);});
}
The choice between selecting
to use parallel_invoke or parallel_task is non-trivial as each has a different
set of merrits.
parallel_invoke
Has
a very low overhead in enqueuing a task list (of 2 or more tasks).
Has
a very low overhead in dequeuing tasks from task list
Has
implicit barrier (blocking)
Has
limited control over use of optional qtPlacement
(you
can only suggest which thread affinities to wakeup)
parallel_task
Has
higher overhead in enqueuing one task
Has
higher overhead in dequeuing task
Has
no barrier (throw and go)
Has
full control over use of optional qtPlacement
(suggest/require
affinities, FIFO, quazi-LIFO, completion, etc…)
The principal selection
chriteria are:
Do
you need to omit the barrier
Do
you need full contol over use of optional qtPlacement
The parallel_distribute template is used to
distribute tasks by way of a selection criterion.
parallel_distribute(
qtPlacement,
[&qtControl,]
&fn
[, a3[, a4[, a5[, a6[, a7[,
a8[, a9]]]]]]]
);
qtPlacement required placement directive
&qtControl address of optional QuickThread Control structure
&fn address of required function
or member function name
a3 through a9 optional function arguments.
Note, arguments a1 and
a2 are missing as they are implicit. Tasks (argument fn) are of the following format:
void foo(intptr_t teamMemberNumber, intptr_t membersInTeam[, args]);
parallel_distribute(
qtPlacement,
[&qtControl,]
[&](intptr_t teamMemberNumber, intptr_t membersInTeam)
{
// ...
}
);
qtPlacement required placement directive
&qtControl address of optional QuickThread Control structure
&fn address of required function or member function name
[&](...){...} Lambda function taking two args
Note, for Lambda
function, the same Lambda function is called with differing arguments for each
thread.
For each parallel_distribute task:
The teamMemberNumber is 0-based and in the
range of 0:membersInTeam-1. When the
thread issuing the parallel_distribute
is a member of the team its teamMemberNumber
is 0.
The teamMember number is NOT a thread number.
The parallel_distribute will use the
required qtPlacement as a selection
criteria for the members of a thread pool. The size of the thread pool could
range from 0 to all threads of class. When qtControl
is missing from the argument list, one will be provided by the template and the
effect of which is the parallel_distribute
blocks until all threads complete. When qtControl
is supplied, the parallel_distribute
does not block. Consider:
parallel_distribute( OneEach_L1$, foo);
On a system with
HyperThreading, the above would select a team of threads consisting of one of
the HT threads from each pair (or sub-group) of threads that share an L1 cache
with all L1 caches scheduled. On a 4 core HT system with 8 hardware threads (2
threads per core), a team of 4 threads will be selected, one thread from each
core.
Assume your needs are
to divide a process up across cores, and then within each core:
// one task per core
parallel_distribute( OneEach_L1$, OnePerCore);
. . .
void OnePerCore(size_t teamMemberNumber, size_t
membersInTeam)
{
for( size_t row = teamMemberNumber;
row <
numberOfRows;
row += membersInTeam)
{
// only
with my HT companion thread(s)
parallel_for(
L1$, DoColumns, 0, numberOfColumns, row);
}
}
The above using Lambda
function
// one task per core
parallel_distribute(
OneEach_L1$,
[&](size_t teamMemberNumber, size_t membersInTeam)
{
for( size_t row = teamMemberNumber;
row <
numberOfRows;
row += membersInTeam)
{
// only
with my HT companion thread(s)
parallel_for(L1$,DoColumns,0,numberOfColumns,
row);
}
}
);
Notes: OneEach_L1$ was used for the first (outer) selection
criteria, and L1$ was used for the
second (inner) criteria. L1$ means only threads sharing L1 of thread issuing
statement. You can distribute across each/any cache level (L1, L2, L3) as well
as NUMA node distances (M0, M1, M2, M3). Additionally you can distribute
depending on availability of threads (Waiting_L1$)
parallel_for( Waiting_L1$, DoColumns, 0, numberOfColumns,
row);
In the above scenario
(OnePerCore) on 4 core HT system the above statement will,
depending on availability of other thread sharing L1, will produce either one
en-queue for a task to perform half the loop plus a direct call to perform the
other half of the loop, or perform a direct call to perform the complete loop.
void parallel_for(
[qtPlacement,]
[&qtControl,]
[iChunk,]
&fn,
iBegin,
iEnd
[, a3[, a4[, a5[, a6[, a7[,
a8[, a9]]]]]]]
);
void parallel_for(
[qtPlacement,]
[&qtControl,]
[iChunk,]
&Obj, // of class or class derived from containing fn
&ClassName::fn,
iBegin,
iEnd
[, a3[, a4[, a5[, a6[, a7[,
a8[, a9]]]]]]]
);
void parallel_for(
[qtPlacement,]
[&qtControl,]
[iChunk,]
iBegin,
iEnd,
[&](iBeginT iBegin,
iEndT iEnd)
{
// preamble if required
for(iBeginT i=iBegin; i<iEnd; ++i)
{
//
body of your loop
}
// postamble if required
} // closure of Lambda function
);
qtPlacement optional placement directive (used with thread affinity)
&qtControl optional QuickThread Control structure
iChunk optional chunking value
&Obj Address of class object when used for member function
&fn address
of required function or member function name
iBegin,iEnd are required iteration space fields (half-open interval)
a3 through a9 optional function arguments.
[&] or [&var1, =var2, etc …]
iBeginT, iEndT typedef of iterator such as int, long
(generally iBeginT same as iEndT)
iBegin and iEnd define the half-open interval, expressed in documentation as [iBegin,iEnd). Where iBegin is the first of the interval and iEnd is one after the last of the interval. The typical programming practice for C++ is:
for(int
i = iBegin; i < iEnd; ++i)
fn declares a slice function with two required numeric arguments (integer generally intptr_t), and up to 7 optional arguments (a3:a9) which can be scalars, objects, references or pointers. The specified function is run as one or more tasks (with potentially the first or last task run by way of an inline function call).
When issued without a qtControl object, the template uses a template-level default qtControl within the parallel_for function. This type of parallel_for is blocking (all threads complete before return).
When issued with a qtControl object, the parallel_for is non-blocking. Meaning it may return from the parallel_for while execution continues on the parallel_for. Use parallel_wait([qtControl]); or qtControl.WaitTillDone(); for synchronization.
Note 1: Lambda functions that modify objects (values) in the scope of the parallel_for must do so in a thread safe manner:
std::vector<int>
intVector;
// ...
long total = 0;
parallel_for(
intVector.begin(), intVector.end(),
[&total,&intVector](int iBegin,int iEnd)
{
long _total =
0; // local
subtotal
for(int i=iBegin; i<iEnd; ++i)
_total += intVector[i];
AtomicAdd(total, _total); // shared total
}
);
Note 2: Lambda functions create a hidden temporary object on the stack of the caller.
When you program your parallel_for without a qtControl the parallel_for is blocking therefore the temporary stack object is preserved for the duration of the parallel_for.
When you program your parallel_for with a qtControl the parallel_for is non-blocking therefore the preservation of the temporary stack object for the duration of the parallel_for is the responsibility of the programmer. parallel_for with qtControl has to be used with care such that WaitTillDone (either implicitly or explicitly) is called prior to exiting the scope of the parallel_for.
std::vector<int> intVectorA;
std::vector<int>
intVectorB;
qtControl intVector_qtControl; // (outside of scope of foo)
long intVector_sum()
{
long total = 0;
// start non-blocking parallel_for
parallel_for(
&intVector_qtControl,
intVectorA.begin(), intVectorA.end(),
[&](int iBegin,int iEnd)
{
long _total =
0; // local
subtotal
for(int i=iBegin; i<iEnd; ++i)
_total += intVectorA[i];
AtomicAdd(total, _total); // shared total
}
);
// start non-blocking parallel_for
parallel_for(
&intVector_qtControl,
intVectorB.begin(), intVectorB.end(),
[&](int iBegin,int iEnd)
{
long _total =
0; // local
subtotal
for(int i=iBegin; i<iEnd; ++i)
_total += intVectorB[i];
AtomicAdd(total, _total); // shared total
}
);
// ***
CAUTION ***
// *** Wait
for both loops to complete
// ***
protect Lambda functions temporary stack objects
// *** and
wait for total to hold result
intVector_qtControl.WaitTillDone();
return
total;
}
The placement of intVector_qtControl is outside the scope of the function intVector_sum. There is nothing inherently wrong with doing this, in fact there is a performance advantage in doing so as this eliminates the ctor/dtor of the qtControl. However, this also eliminates the implicit WaitTillDone() when the function exits. It is the programmer’s responsibility to use an explicit WaitTillDone() function call under this circumstance. Failure to do so results in four problems:
1) The value total will be returned prior to it being fully accumulated
2) The location on the stack where total was stored will continue to be used by the parallel_for tasks and thus get modified after the return from the function
3) The temporary stack object for the first parallel_for Lambda will get popped off the stack while it is in use by the first parallel_for task list.
4) The temporary stack object for the second parallel_for Lambda will get popped off the stack while it is in use by the second parallel_for task list.
A more efficient method would be to use the parallel_invoke to launch two tasks (and wait) each task being each of the parallel_for statements.
std::vector<int> intVectorA;
std::vector<int> intVectorB;
// ...
long intVector_sum()
{
long total = 0;
// invoke two tasks
parallel_invoke(
// task 1
[&]()
{
//
start non-blocking parallel_for
parallel_for(
intVectorA.begin(), intVectorA.end(),
[&](int
iBegin,int iEnd)
{
long
_total = 0; //
local subtotal
for(int i=iBegin; i<iEnd; ++i)
_total += intVectorA[i];
AtomicAdd(total, _total); // shared total
}
);
},
// task 2
[&]()
{
//
start non-blocking parallel_for
parallel_for(
intVectorB.begin(), intVectorB.end(),
[&](int
iBegin,int iEnd)
{
long
_total = 0; //
local subtotal
for(int i=iBegin; i<iEnd; ++i)
_total += intVectorB[i];
AtomicAdd(total, _total); // shared total
}
);
}); // end parallel_invoke
// implicit
barrier
return
total;
}
Sample tasks for use with parallel_for are of the format:
void ArraySum1D( // perform sum
of partial array
int iBegin, // beginning index
int iEnd, // ending index (+1)
double* inA, // input array A
double* inB, // input array B
double* outC) // output Array C
{
for(int
i= iBegin; i < iEnd; ++i)
outC[i] = inA[i] + inB[i];
}
// . . .
parallel_for(
// fn, iBegin, iEnd, inA, inB,
outC
ArraySum1D, 0, ArraySize, ArrayA, ArrayB, ArrayC);
// returns when done
void ArraySum2D(
int nRows,
int nCols,
double* inA,
double* inB,
double* outD)
{
qtControl qtControl;
for(int
row=0; row<nRows; ++row)
{
int iBegin = row * nCols;
int iEnd = iBegin +
nCols;
parallel_for(
&qtControl,
ArraySum1D, // use 1D function as task
iBegin,
iEnd,
&ArrayA[iBegin],
&ArrayB[iBegin],
&ArrayC[iBegin]);
// *** returns while row pending
(or may be complete)
}
// qtControl dtor implicitly waits untill all rows are done
}
The above is not to suggest that the correct place for parallelization of a 2D summation is from the inner loop. The purpose of the above illustration is to show that the inner parallel_for does not wait for completion when using explicit qtControl. Non-blocking parallel_for is a feature of QuickThread for you to exploit.
When qtControl, qtPlacement and iChunk are omitted, the default behavior is to divide up the iteration space up into as many pieces as you have compute class threads. Then en-queue number of compute class threads-1 task requests and directly calling the slice function for the last slice.
Example:
// slice function
void
DoSum(intptr_t iFrom, intptr_t iEnd, double* A, double* B, double* C)
{
for(
intptr_t i = iFrom; i < iTo; ++i)
A[i]
= B[i] + C[i];
}
.
. .
double* A = new double[nSize];
double* B = new double[nSize];
double* C = new double[nSize];
.
. .
parallel_for(&DoSum,
0, nSize, A, B, C);
.
. .
When, for example, nSize = 1000, and number of threads = 4, threads will run with arguments
0,
250, A, B, C (thread
a)
250, 500, A, B, C (thread b)
500, 750, A, B, C (thread c)
750, 1000, A, B, C (thread d) (a != b != c != d)
The above assumes all threads were available at the time of the parallel_for.
If thread c were unavailable due to long computation task but threads a, b and d were available, and assume thread b finishes its first slice first
0,
250, A, B, C (thread
a)
250, 500, A, B, C (thread b)
500, 750, A, B, C (thread d)
750, 1000, A, B, C (thread b)
// same slice function
void
DoSum(intptr_t iFrom, intptr_t iEnd, double* A, double* B, double* C)
{
for(
intptr_t i = iFrom; i < iTo; ++i)
A[i]
= B[i] + C[i];
}
.
. .
double* A = new double[nSize];
double* B = new double[nSize];
double* C = new double[nSize];
.
. .
intptr_t
iChunk = 100;
parallel_for(iChunk,
&DoSum, 0, nSize, A, B, C);
When, for example, nSize = 1000, iChunk=100, and number of threads = 4, threads will run with arguments
0, 100,
A, B, C (thread a)
100, 200, A, B, C (thread b)
200, 300, A, B, C (thread c)
300, 400, A, B, C (thread d) (a != b
!= c != d)
400, 500, A, B, C (thread first of above to finish)
500, 600, A, B, C (thread second of above to finish)
600, 700, A, B, C (thread third of above to finish)
700, 800, A, B, C (thread forth of above to finish)
800, 900, A, B, C (thread next available)
900, 1000, A, B, C (thread next available)
Including the iChunk parameter the parallel_for divides the range into iChunk sized pieces (last chunk may be less than specified size). If nSize / iChunk is less than the number of available threads then the lesser number of threads are scheduled. When iChunk <= nSize then the slice function is called directly (bypassing the scheduler).
The chunking of a parallel_for is advantageous when the amount of processing is not uniform across the iteration space or when not all of the threads scheduled for the work distribution are immediately available for performing the work, or may get preempted during work (by O/S running other applications).
Example:
// same slice function
void DoSum (int
iFrom, int iEnd, double* A, double* B, double* C)
{
for(
int i = iFrom; i < iTo; ++i)
A[i]
= B[i] + C[i];
}
.
. .
double* A = new double[nSize];
double* B = new double[nSize];
double* AB = new double[nSize];
double*
C = new double[nSize];
double* D = new double[nSize];
double* CD = new double[nSize];
.
. .
{
qtControl qtControl;
parallel_for(&qtControl, DoSum, 0, nSize,
A, B, AB);
parallel_for(&qtControl, DoSum, 0, nSize,
C, D, CD);
}
The above two parallel_for statements run concurrently. The blocking occurs within the dtor of the qtControl structure called at the close brace. The scoping of the qtControl structure is under the programmer’s control. Alternately, this example could have omitted the use of the local qtControl structure and used the parallel_wait(); however, the current task may have had additional sub-tasks pending and you may not wish to wait for those additional tasks to complete as well.
If your compiler supports the C++0x Lambda functions an improved method for the above code
.
. .
{
qtControl qtControl;
parallel_invoke(
[&]()
{
parallel_for(
&qtControl,
DoSum, 0, nSize, A, B, AB);
},
[&]()
{
parallel_for(
&qtControl,
DoSum, 0, nSize, C, D, CD);
});
}
Should you wish to block on a qtControl structure without letting it go out of scope you can call the member function WaitTillDone().
double*
A = new double[nSize];
double* B = new double[nSize];
double* AB = new double[nSize];
double*
C = new double[nSize];
double* D = new double[nSize];
double* CD = new double[nSize];
double* ABCD = new double[nSize];
.
. .
{
qtControl qtControl;
parallel_for(&qtControl, &DoSum 0, nSize,
A, B, AB);
parallel_for(&qtControl, &DoSum 0, nSize,
C, D, CD);
qtControl.WaitTillDone(); // or parallel_wait(&qtControl);
parallel_for(&qtControl, &DoSum, 0, nSize,
AB, CD, ABCD);
// something else to do while
performing parallel_for
}
The reason for coding this way is it eliminates an additional call to the ctor of the qtControl object.
You may find it advantageous to specify placement restrictions on the parallel_for.
(see qtPlacement):
Example 1:
Assume you have a loop that has a relatively small number of iterations and you know that most of the data for the loop is “hot in L2 cache” of the current thread, and further you know that the execution time will benefit using multiple threads only when the other thread(s) sharing the current thread’s L2 cache are waiting to run.
parallel_for(Waiting_L2$, fnFoo, 0, nSize, A, B, C);
Example 2:
Assume you have a loop that has a relatively small number of iterations and you know that most of the data for the loop is not in cache (in RAM) and further you know that the execution time will benefit using multiple threads only when the other thread(s) are available and also share the same L2 cache with each other.
parallel_for(Waiting_NotInCache_L2$, fnFoo, 0, nSize, A, B, C);
Example 3:
You have a relatively long running loop but each instance of the running loop (each slice) will run optimally when all the data resides within one L3 cache
parallel_for(NotInCache_L3$, fnFoo, 0, nSize, A, B, C);
Assuming your system has two quad core processors each with L3 cache capability. The processor with the most available threads at L3 would be selected and the number of threads scheduled will be the number of HW threads sharing the same L3 (4 in this case).
Task
void fn(intptr_t
iPos[, optional args])
{
// ...
}
The function fn is run in parallel one element at a time using optional arguments.
void parallel_for_each(
[qtPlacement,]
[&qtControl,]
iEnd, // half open termination
point
&fn, // void fn(intptr_t i[,
...]);
iBegin // beginning and
varying position
[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]); //
optional args
void parallel_for_each(
[qtPlacement,]
[&qtControl,]
iEnd, // half open termination
point
[&Obj,] // object when member
function
&Object::fn,// void Object::fn(intptr_t
i[, ...]);
iBegin // beginning and
varying position
[, a3[, a4[, a5[, a6[, a7[, a8[, a9]]]]]]]); //
optional args
void parallel_for_each(
[qtPlacement,]
[&qtControl,]
iBegin, // beginning and varying
position
iEnd, // half open termination
point
[&](intptr_t i)
{
// i=item number
}
);
*** Note different iBegin argument placement for Lambda function ***
parallel_for_each schedules or runs a (or multiple) dispatching task(s) that happen to be waiting for work at the time of the parallel_for_each. The dispatching task(s) will run making calls across their slices of the iteration space (iBegin, iEnd) to the supplied function providing the current position within the slice of the iteration space (together with optional args).
As the dispatching task runs, it monitors for the availability of additional threads to become idle and are now available to join the team processing the parallel_for_each. Monitoring is at beginning of task, then periodically during dispatching. When an available thread or threads are observed, the remaining iteration space of the observing thread is divided, a portion is held for the current thread, and the remainder is en-queued for the idle thread. The new task is a recursive call of the current distribution task and it runs in parallel with the first task. Only when additional threads become available is the iteration space split.
Note, the various dispatching task(s) may run out of iteration space at differing times. When a busy dispatching task notices a new idle thread it will split its remaining iteration space and en-queues a new task.
The number of task en-queuing operations for parallel_for_each is generally larger than for parallel_for. However, when the execution time varies across the iteration space, a non-chunking parallel_for might not fully utilize all cores (i.e. some threads finish early while others finish late). A parallel_for with chunking can improve the utilization of all cores but at the expense of incurring a fixed number of additional task en-queue/de-queue. Use of parallel_for_each is thread availability demand related number of additional task en-queue/de-queue operations. As to which method is best, this will depend on your application.
The criteria for using parallel_for_each are:
a) The work per object is variable and significant with respect for the cost of intermittent task en-queue/de-queue operations.
b) Your requirement is to distribute objects using qtPlacement into specified cache systems such that the object sub-tasks can subsequently schedule within that cache.
c) The availability of threads to schedule is uncertain at the issuance of the parallel_for_each and may change during the execution of the control loop.
Example:
void DoObject(intptr_t iObj); // Function to do
per object
void DoObjectsParallel()
{
// Create 1 or more instances of
DoObject task
// split up across (for each) L2
cache on system
// (one task per L2).
//
// qtPlacement, iEnd, fn,
iBegin
parallel_for_each(OneEach_L2$, nObjects, DoObject, 0);
}
void DoObject(intptr_t iObj)
{
// Obtain an object from the Object List
// The Object contains three matricies
// A and B are to be multiplied and stored
in output
Object& Obj = ObjectList[iObj];
// Perform the matrix multiplication row
at a time
// using a thread pool of threads within
my L2
parallel_for(
L2$, // only threads sharing our thread's L2
matmultStripe, // function to perform striped matmul
0, Obj.Height, //
split across number of threads on L2
Obj.inputA, Obj.inputB,
Obj.output, // results
Obj.Width, Obj.Height);
}
Note, matmultStripe would be optimal for this Object when all cores using
the selected L2 would be available. This is not normally the case. As such, not
all stripes will finish at the same time. When the array size is relatively
large and the system is working hard, then consider using the iChunk form of
the parallel_for.
Example using iChunk follows:
void DoObject(intptr_t iObj)
{
//
Obtain an object from the Object List
Object& Obj =
ObjectList[iObj];
parallel_for(
L2$, // only our thread's L2
Obj.Height / 4, // iChunk = 1/4 of
total
&matmultStripe, // function to
perform striped matmul
0, Obj.Height, // split into
iChunk sizes
Obj.inputA, Obj.inputB,