MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_01CB693E.65133EC0" This document is a Single File Web Page, also known as a Web Archive file. If you are seeing this message, your browser or editor doesn't support Web Archive files. Please download a browser that supports Web Archive, such as Microsoft Internet Explorer. ------=_NextPart_01CB693E.65133EC0 Content-Location: file:///C:/E8693AC9/Superscalarprogramming101(parts1-5).htm Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset="us-ascii"
Copyright &= copy; 2010
by
Jim Dempsey=
QuickThread Programming, LLC
jim@quickthreadprogramming.c=
om
www.quickthreadprogramming.=
com
Superscalar programming 101 (Matrix Multiply) <= o:p>
Introduction<= o:p>
Part 1<= o:p>
Part 2<= o:p>
Part 3<= o:p>
Part 4<= o:p>
Part 5<= o:p>
Appendix (program)<= o:p>
This document is a composition of a 5 part article pre= sented on the Parallel Programming Community of the Intel Software Network (http://software.intel.co= m/en-us/parallel/).
Some minor edits to this article have been made in this composition to aid in readability, in particular: a table of contends was added, the segment bylines have been removed, and an appendix was added to include code samples.
The subject matter of this article is: How to optimall= y tune a well known algorithm. We will take this well known (small) algorithm, a common approach to parallelizing this algorithm, a better approach to parallelizing this algorithm, and then produce a fully cache sensitized approach to parallelizing this algorithm. The intention of this article is = to teach you a methodology of how to interpret the statistics gathered during = test runs and then use those interpretations at improving your parallel code.
This is a multi-part article, where the author believe= s the process in reaching a goal (learning how to assess results and improve upon your parallel programming technique) is as important as the goal (finished application). For without the process you will not know how to attain goal = to the extent possible.
To titillate you into reading the complete set of arti= cles, you will experience how to attain up to 80x parallel programming performance increase on a single processor (4 core with HT) system over a simple serial method on the same system. (don’t zoom the charts)
_files/image002.gif")
_files/image004.gif")
Let us begin…
One of the posters on the Intel Software Network forums
provided a link to an MSDN article titled “How to: Write a parallel_for
The typical C++ serial method to compute the matrix multiplication of a square matrix might look as follows:
=
//
Computes the product of two square matrices.
v=
oid matrix_multiply(
double** m1, <=
span
style=3D'color:blue'>double** m2, double<=
/span>**
result, size_t size)
{
f=
or
(size_t i =3D 0; i < size; i++)
{
for (size_t j =3D 0; j < size; j++)
{
double temp =3D 0;
for (in=
t
k =3D 0; k < size; k++)
{
=
temp +=3D m1[i][k] * m2[k][j];
}
result[i][j] =3D temp;
}
}
}
Using the VS concurrency collection parallel_for the code looks like this:
=
//
Compute the product of two square matrices in parallel.
v=
oid parallel_matrix_multiply(
double** m1, <=
span
style=3D'color:blue'>double** m2, double<=
/span>**
result, size_t size)
{
parallel_for (size_t(0), siz=
e,
[&](size_t i)
{
for (size_t j =3D 0; j < size; j++)
{
double temp =3D 0;
for (int k =3D 0; k < size; k++=
)
{
=
temp +=3D m1[i][k] * m2[k][j];
}
result[i][j] =3D temp;
}
});
}
This makes use of the C++0x Lambda functions and the Concurrency template for the paral= lel_for. This conversion on the surface appears to be elegant (unusual= ly effective and simple) by essentially rewriting two lines of code: The first= for statement and closing brace o= f that for statement.
The MSDN reported performance boost on a 4 processor s= ystem for 750 x 750 matrix multiplication as:
Serial: 3853 (ticks)
Parallel: 1311 (ticks)
Approximately a 2.94x speed-up.
This is not an ideal scaling situation in that it does= not produce a 4x speed-up using 4 processors. But considering that there must be some setup overhead, 2.94x is not too bad on this small of matrix. And you might be inclined to think that there is only 25% room remaining for improvement.
The article did not chart the scaling as a function of= N (one dimension of a square matrix) so it is difficult to tell the shape of = the performance gain trend line as a function of N.
Although this article was written to illustrate the us= e of the parallel_for in the MS Concurrency, a parallel programmer might be miss-lead into assuming that th= is example illustrates how to write a parallel matrix multiplication function. After all, this article describes a parallel method for matrix multiplicati= on and was written in an MSDN article – an authoritative source.
Let’s see how we can do better at parallelizatio= n of the Matrix Multiplication.
First off, I do not have Visual Studio 2010, so I cann= ot use the sample program as-is.
I do have QuickThread (I am the author of this parallel programming toolkit www.= quickthreadprogramming.com ). Therefore, I adapted the code to use QuickThread.
Adaptation is relatively easy. Change the include file= s and minor differences in syntax.
// Compute=
s the
product of two square matrices in parallel.
void=
parallel_matrix_multi=
ply(
double** m1, double** m2,
{
parallel_for(
0,
size,
[&](intptr_t iBegin, intptr_t iEnd)
{
=
for(intptr_t
i =3D iBegin; i < iEnd; ++i)
=
{
=
for
(intptr_t j =3D 0; j < size; j++)
=
{
=
double
temp =3D 0;
=
for
(intptr_t k =3D 0; k < size; k++)
=
{
=
&nb=
sp;
temp +=3D m1[i][k] * m2[k][j];
=
}
=
result[i][j] =3D temp;
=
}
=
}
}
);
}
In the QuickThread dialect, the parallel_for passes the half open range as arguments to the function, as opposed to the parall= el_for of the VS 2010 concurrency collection passing the single index into the bod= y of the function. QuickThread chose to pass the half open range, as opposed to a single index, because knowing the range, the programmer and/or compiler can better optimize the code within the loop. N.B. this loop could have as easi= ly been written using OpenMP or Cilk++, or TBB. The choice parallel tool language/dialect is not important at this point of the article.
Using the QuickThread modified code, and run on Window= s XP x64 and using an Intel Q6600 4 Core processor without Hyper Threading we fi= nd the scaling as:
_files/image006.gif")
Fig 1 ̵= 1; Parallel Scaling
For an average scaling ratio of 3.77x at N values betw= een 128 and 1024. And considering four threads are involved the above chart sho= ws the parallel version of the serial code yielding a scaling factor of 0.94 (scale / number of cores). This is a reasonably good scaling factor.
The processor model was not listed for the Visual Stud= io test so it is hard to tell the reason why QuickThread yields 3.77x improvem= ent on 4 cores, while VS yields 2.94x improvement on 4 processors (cores?). Set= ting aside the issue of which threading toolkit is better, let’s concentra= te on the parallelization of the matrix multiplication.
First thing to do is get an additional (different) set= of sample data. Let’s see what the performance is on a processor support= ing L3 cache plus Hyper Threading (HT).
When running the program on an Intel Core i7 920, 4 co= res with HT (8 hardware threads – but only 4 floating point instruction p= aths) we find a completely different trend line:
_files/image008.gif")
Fig 2
Why the dip when N ranges from 350 to 570?
When the performance recovers, why do we see 4x improv= ement instead of 8x improvement?
Why the jagged line (noise) towards lower N?
The dip is due to cache evictions caused by one thread adversely interacting with other threads.
We get just over 4x performance because the Core i7 92= 0 is a 4 core processor capable of 4 execution streams for floating point (althoug= h it has 8 hardware threads for integer execution streams). On the higher N valu= es it would appear that we are maxing out the capabilities of processor, 4 cor= es =3D=3D 4 x performance boost.
The trend line is jagged due to only one sample run be= ing taken, and due to the short run times when N is low. Also, the variable overhead in starting the thread team (now 8 threads) is more noticeable on = the left hand side of the curve. An average of a larger number of runs would sm= ooth out this line, but this consumes unnecessary time from the developer. Addit= ionally the large spike at about 250 indicates that the Serial version took a large tick count hit at that point due to out of application control circumstances (expansion of page file or other activity on the system). You do not need a precise chart for program analysis purposes.
Is there anything we can we do to improve this perform= ance and/or correct for the dip in performance for N between 350 and 570?
The main computational section of code for both Serial= and Parallel are the same:
for (intptr_t j =3D 0; j < size; j++)
{
double temp =3D 0;
for (intptr_t k =3D 0; k < size; k++)
{
temp +=3D m1[i][k] * m2[k][j];
}
result[i][j] =3D temp;
}
Notice that the product accumulation statement
temp
+=3D m1[i][k] * m2[k][j];
uses k to index the column of array m1 and the row of array m2.=
This
access pattern has two problems: First, it is not cache friendly, and secon=
d,
it is not amenable to vectorization by the compiler. Let’s look at fi=
xing
these problems.
In the Cilk++ sample
programs, which is (or will be) available with the Intel Parallel Studio, we
find a sample program showing a cache friendly implementation of matrix
multiply:
// Multiply double precision
square n x n matrices. A =3D B * C
// Matrices are stored in r=
ow
major order.
void matrix_multiply(double* A=
, double* B, double* C,
unsigned int=
span> n)
{
if
(n < 1) {
return;
}
cilk_for(unsigned int =
i =3D 0; i
< n; ++i) {
// This is the only Cilk++
keyword used in this program
// Note the order of the lo=
ops
and the code motion of the i*n and k*n
// computation. This gives =
a 5-10
performance improvement over
// exchanging the j and k l=
oops.
&=
nbsp; int itn =3D i * n;
for (unsigned=
int k =3D 0; k < n; ++k) {
=
for (un=
signed
int j =3D 0; j < n; ++j) {
=
int ktn =3D k * n;
=
// Compute A[i,j] in the inner loop.
=
A[itn + j] +=3D B[itn + k] * C[ktn + j];
=
}
}
}
return;
}
The itn (i times n)
and ktn (k times n=
span>)
variables take large steps through the arrays. The array pointers point to
single dimensioned arrays that are row packed. This is to say each row is
appended to the prior row into one large single dimension array. Which is t=
hen
indexed by way of i*n and/=
or k*n.
Serial and Parallel function signatures:
void
matrix_multiply(
double** m1, doubl=
e**
m2, double** result, size_t size);
void parallel_matrix_multiply(
double** m1, double** m2,
Note double** on the 2D array pointers.
Cilk++ function signature:
void matrix_multiply(double*
A, double* B, =
double*
C, unsigned in=
t
n);
Note double* on the 2D array pointers.
Row packing principally does one beneficial thing to y= our program. Row packing eliminates a memory fetch of a pointer to the row (as = done with array of row pointers). This uses fewer Virtual Memory Translation Look Aside Buffers. And this improves cache utilization.
temp +=3D m1[i][k] * m2[k][j]; // serial/p=
arallel
Uses: 1 TLB for code =
+ 1 TLB
for stack (could be 0) + 1TLB for m1 row table + 1 TLB for m1 data + 1 TLB =
for
m2 row table + 1 TLB for m2 data =3D 6 (or 5) TLB’s.
Whereas the row packi=
ng
method employed by the Cilk++ sample program:
=
A[itn + j] +=3D B[itn + k] * C[ktn + j];
Uses: 1 TLB for code =
+ 1 TLB
for stack (could be 0) + 1 TLB for m1 data + 1 TLB for m2 data =3D 4 (or 3)
TLB’s.
The reduction in the =
numbers
of TLB’s required should produce an advantage.
And the corrisponding=
charts:
_files/image010.gif")
Fig 3 (abov= e)
Wow!, the Cilk++ tech=
nique on
the Q6600 has a nice peak between 400 and 700 where it attains close to 25x
performance over Serial but dropps off to about 14x at the higher range. Th=
is
represents a 3x to 6x improvement over the parallel version of the standard
matrix multiply code. The elimination of the array of row pointers made a
significant difference.
_files/image011.gif")
Now looking at the Co=
re i7 920
processor we find:
_files/image013.gif")
Fig 4
The Cilk++ method on =
the Core
i7 920 attains almost 50x performance gain over serial method at about N=3D=
800
(12x over parallel), but appears as if it will be dropping off after 1024 (=
as
it did earlier at 700 on the Q6600). Additional data points should be
collected.
What this teaches us =
is:
constructing 2D arrays as an array of pointers to 1D arrays looks good but =
can
cost you dearly. Clearly the more efficient route is to use row packing to
reduce the number of TLBs and corrisponding entries in the cache. But this
means changing
d=
ouble**
A; // referenced as A[iRow][iCol]
into some class
Array2D<double> A;
And then changing the
allocations and references (or fancy template operators).
If you look at this f=
rom a
different perspective the array class object(s) are effectively array
descriptors. Array descriptors are a well used technique by FORTRAN.
WIth an improvement o=
f 12x
over a parallel version of the serial implimentation this shows that paying
attention to cache issues realy pays off. And that you may have to dispense
with the familiar C/C++ programming practice of referencing a multi-dimensi=
oned
array as an array of pointers. A little extra programming effort pays off w=
ith
significant returns in performance.
You would have to ask
yourself: is this the best you can do?
And, is it worth your=
while
to attempt at better performance?
In the previous secti=
ion we had:
_files/image014.gif")
_files/image015.gif")
The above charts, imp=
ressive
as they are, are an “apples and oranges” type of comparison. The
chart is comparing a non-cache sensitive serial technique against a cache
sensitive parallel technique. Good for promotional literature, certainly a =
good
incentive to learn how to program in parallel, but this falls short for
analysis purposes by the seasoned parallel programmer.
The first thing any parallel programmer should know is: improve the serial algorithm first, and then address the problem through parallelization (save intrinsic small vector functions for last).
If you look at the inner most loop of the Serial funct= ion we find:
for (in=
t
k =3D 0; k < size; k++)
{
=
temp +=3D m1[i][k] * m2[k][j];
}
We addressed the issue of the traditional C/C++ progra= mming practice of using an array of pointers to rows, and saw that by dispensing = with this practice that a 6x to 12x improvement in performance. What else can be done?
Set aside the issue of the TLBs and row table for a mo=
ment.
In examining the principal
computational statement we find that while the k
index in m1[i][k] sequentially accesses memory, the k index in m2[k][j]<=
/span> does not. What this means is the comp=
iler
is unable to vector this loop. And potentially worse, stepping down the row=
s at
certain distance intervals are known to cause cache evictions (the dip in c=
urve
observed on Core i7 920). Using vectors on doubles might attain a 2x
improvement – well worth going after. So let’s improve the
vector-ability of the Serial version of this program. (and Parallel variant=
of
the Serial version too)
Note, while=
we
could use the approach done in the Cilk++ implementation (eliminate row tab=
le
of pointers), I will choose an alternate means that will become useful late=
r.
First address the simple Serial and Simple Parallel to see what happens.
In order to=
assist
the compiler in vectorizing the matrix multiplication (and minimize cache
evictions) you can transpose the matrix m2
(to m2t), and then you can transpose the indexes in the inner loop. At first
this may sound absurd. To transpose the matrix m2 you will have to add over=
head
to:
Perform an allocation (actually size+1 allocations with 1 for the row
pointers)
run transposition loops
reading and re-writing one of the arrays (array m2 to m2t)
You might a=
ssume
that this must be slower than the simple matrix multiplication.
Well you wo=
uld be
wrong to assume this. Each cell in m2
is used N times, each cell in m1=
b> is
used N times. We have the potential of trading off the cache misses on 2N**2
reads plus N*N writes against N reads + N/2 writes + (2N**2)/2 reads + N/2 =
writes
.The /2 is due to the inner loop now becomes a DOT function that operates on
two sequential arrays and is an ideal candidate for vectorization by the
compiler. The rewritten Serial loop (and inner loop transformed into an inl=
ine
function) looks like:
// compute=
DOT
product of two vectors, return result as double
inline double
DOT(double* v1, double*
v2, size_t size)
{
double
temp =3D 0.0;
for(size_t
i =3D 0; i < size; i++)
{
temp=
+=3D
v1[i] * v2[i];
}
return
temp;
}
The matrix multiply f=
unction,
with transposition now looks like:
void=
matrix_multiplyTransp=
ose(
double** m1, double** m2,
{
//
create a matrix to hold the transposition of m2
double**
m2t =3D create_matrix(size);
//
perform transposition m2 into m2t
for
(size_t i =3D 0; i < size; i++)
{
for (size_t j =3D 0; j < size; j++)
{
=
m2t[j][i] =3D m2[i][j];
}
}
// Now matrix multiply=
with
the transposed matrix m2t
for
(size_t i =3D 0; i < size; i++)
{
for (size_t j =3D 0; j &l=
t; size;
j++)
{
=
result[i][j] =3D DOT(m1[i], m2t[j], size);
}
}
//
remove the matrix that holds the transposition of m2
destroy_matrix(m2t, si=
ze);
}
Note, in a =
real
implementation you might want to consider preserving the buffers allocate on
the first call. Then on subsequent calls you check to see if the prior
allocation was sufficient for the current call. If so, then bypass the
allocation. If not then delete the buffers, and allocate new buffers. This =
is a
further optimization detail left to the readers.
We will inc=
lude the
allocation/deallocation overhead in our charts. Also note, if the m2 matrix is going to be used many
times (e.g. in a filter system), then you can perform the transposition onc=
e,
and then reused the transposed matrix many times. These are implementation
strategies to keep in mind when taking these suggestions into your own code=
.
Let’s=
see how
the serial with transposition technique (ST) stacks up against the serial
without transposition (S) and the parallel code (P) using the non-transposed
matrix method:
_files/image017.gif")
Fig 5 (abov=
e)
_files/image019.gif")
Fig 6
The revised
technique, which still uses a row table is a significant improvement over t=
he
Serial and Parallel methods. The revised technique does not approach that of
the row table elimination method of the Cilk++ demo program. Is there anyth=
ing
to learn from this?
The revised=
serial
code, using one thread, overwhelms the parallel code using 4 or 8 threads on
the two different processor at higher N values.
Although 4x=
better
parallel performance than before, it is still less than the Cilk++ method, =
is
this information useful in helping produce a better algorithm?
Why the
“turbo charged” boost at 513 for the Serial Transpose method?
Although Pa=
rallel
Transpose (PT) is 4x faster than Parallel (P) it is actually less than the
performance of the Serial Transpose method! Why!?!?
The answer =
to this
is the chart lines are relative to the original (non-transposed) Serial
performance. Let’s see the actual run time for the (non-transposed)
Serial method to see if something shows up:
_files/image021.gif")
_files/image023.gif")
Fig 7
And there i=
t is. At
N =3D 513 we find the serial =
time
experiencing a drastic change in slope. (Q6600 shows earlier slope change at
about 350).
What is sig=
nificant
about N =3D 513 on Core i7 920?. We principally have access to two arrays of
doubles dimension at 513 x 513. The number of bytes is 2 x 513 x 513 x 8 =
=3D 4,210,704.
Hmm….
The Core i7=
920
from specifications gleaned from the internet indicate that this processor =
has
L1 cache 32=
KB
instruction + 32KB data (one for each core)
L2 cache 25=
6KB (one for each core)
L3 cache 8MB
(shared)
So why the
performance drop at ~4MB instead of ~8MB? To be honest, I haven’t
performed an thorough analysis of this, my postulation is this is a TLB iss=
ue,
so I will simply state that it appears that the cache system is being over
taxed at this point. What is important, is the steep rise in the curve at t=
his
point accounts for the superscalar performance gain of a better serial
technique over a non-optimal serial technique.
The importa=
nt piece
of information learned is the Serial Transpose method trashes the performan=
ce
of the parallel technique using the non-transformed technique. (Although the
Cilk++ technique is still 3.nnx faster that either)
Now when th=
e Cilk++
row table elimination method is compared to Serial Transpose method we have=
:
_files/image025.gif")
Fig 8
_files/image027.gif")
Fig 9
The Cilk++
technique shows approximately 3.25x performance over Serial Transpose metho=
d.
More notewo=
rthy,
the Parallel Transpose is not faster than the Serial Transpose method. Why?=
And
more importantly, does it matter why?
As a genera=
l rule,
when you observe your parallel code not
performing better than your serial code, this is a good indication that you
have reached a memory bandwidth problem. The Cilk++ code performance clearly
indicates the memory bandwidth problem is due to poor cache utilization by =
the
Parallel Transpose technique.
Is the Cilk=
++
method the best we can expect to do?
The answer =
to this
is: No.
In the prev=
ious section
we have seen that by reorganizing the loops and with use of temporary array=
we
can observe a performance gain with SSE small vector optimizations (compiler
does this) but a larger gain came from better cache utilization due to the =
layout
change and array access order. The improvements pushed us into a memory
bandwidth limitation whereby the Serial method now outperforms the Parallel
method (of the Serial method).
The memory
bandwidth limitation is an important factor to consider and warrants a chan=
ge
in programming strategy: In order to attain additional performance gains we=
are
going to attack the problem from the perspective of trying to keep the data
access patterns to the closest cache level.
But how are=
we
going to do this?
On a system=
with
Hyper Threading, such as the Core i7 920, the Hyper Threads within a single
core share the 256KB L2 cache and, depending in internal architecture, share
the 32KB L1 data cache. While all cores within the socket share the 8MB L3
cache (6MB or 12MB on other processor models).
The Cilk++ =
method
uses the common practice of “divide and concur” to partition (or
tile) the matrix into smaller working sets that nicely fit into the cache
available to a thread. This is a good starting point. What we need to look =
at
next is how to coordinate the activity between
caches within the processor(s) of the system.
While you c=
an tile
using nested loops, consider this: As you reduce the tile size, you also
increase the number of interactions with the thread scheduler (entering and
exiting the inner most loop). Thread scheduling is not cheap. In the first
chart, the overhead break even occurred at 50 x 50 tile size. Is there a
different way to program such that you can use 2x2 or 2x1 or 1x2 tiles and
attain superior performance?
The answer =
to this
is: Yes
Targeting t=
asks to
specific threads, or grouping of threads is difficult to do effectively usi=
ng
most threading tool kits (e.g. MS Parallel Collections, OpenMP, or Cilk++).
Cache level task grouping is a built-in design feature of QuickThread. Exam=
ple:
// divide the iteration spa=
ce
across each multi-core processor
// L3$ specifies by L3 cach=
e
// .OR.
// within processor (Socket=
) for
processors without L3 cache
parallel_for( OneEach_L3$, intptr_t(0),=
size,
=
[&](intptr_t iBegi=
nL3,
intptr_t iEndL3)
{
// divide our L3 iteration space by L2 within this th=
reads
L3
parallel_for( OneEach_Within_L3$ + L2$, iBeginL3, iEndL3,
=
[&](intptr_t iBeginL2, intptr_t iEndL2)
=
{
=
// Here we are running as the Master thread of a
=
// 2 team member team (or 1 in the event=
of
older
=
// processor)
=
//
=
// Now bring in our other team member(s)=
=
// form team of threads sharing this threads L2 cache
=
parallel_distribute( L2$,
=
[&](intptr_t iTMinL2, intptr_t nTMinL2)
=
{
=
&nb=
sp;
// ... (Do Work)
=
} // [&](intptr_t iTMinL2, intptr_t
nTMinL2)
=
); // parallel_distribute( L2$,
=
} // [&](intptr_t iBeginL2, intptr_t
iEndL2)
); // parallel_for( OneEach_Within_L3$ + L2$, iBeginL3,
iEndL3,
} //
[&](intptr_t iBeginL3, intptr_t iEndL3)
); // parallel_for(
OneEach_L3$, intptr_t(0), size,
The outer l=
oop is a
per socket division, the next deeper loop is a per L2 cache, and the inner
layer is a n-Way split by the threads sharing L2 (2 threads on Core i7 920,=
2
threads on Q6600)
The techniq=
ue is to
use the parallel_distribute as=
the
inner most division, then use the loop partitioning of the outer loops with=
in a
state machine loop placed inside the parallel_distribute
level. Say what?
The paralle=
l_for in
QuickThread performs the task to thread set selection and partitioning of t=
he
iteration space, but specifically does not drive the iteration. Due to this
feature of QuickThread, the iteration domain can be passed deeper into the =
loop
nest levels. The second loop does the same thing (pass iteration space to i=
nner
most parallel_distribute. Now as to why this is important.
This altern=
ate
technique creates the environment for an intelligent swarm of threads
performing a coordinated attack of the problem. These threads know which
threads share the nearest cache, which threads share each of the largest
cache(s), and which threads are not sharing caches with the current thread.
This identification is implicit by the sub range of the outer two loops and=
by
the team member number (iT=
MinL2) within the parallel_distribute.
Adding an
additional outer layer loop per NUMA node could easily be handled using:
parallel_for( OneEach_=
M0$,
intptr_t(0), size,
&=
nbsp; ...
However, fo=
r this
matrix multiplication, this would not provide any additional benefit unless
your system has an L4 cache external to the processor and which is also sha=
red
amongst processors sharing the same NUMA node. NUMA aware issues are best
handled in the memory allocation routines which can be placed within the L3
cache distribution layer of the above loop structure.
The interio=
r state
machine loop will partition (tile) the output array:
Socket (L3 =
cache)
Core Pair/HT
Siblings (L2 cache/L1 cache)
Amongst Core
Pair/HT Siblings (L2 cache/L1 cache)
To accompli=
sh this,
we segment the matrix into 2x2 tiles with each 2x2 tile being serviced by a=
2
thread team. The 2 thread team I will call a “Tag Team” (as in =
Tag
Team Wrestlers). On the Core i7 we will have 4 tag teams, one for each core,
and the two threads for each team being Hyper Thread siblings within the sa=
me
core. On Q6600 the tag team becomes the two threads sharing the same L2 (Q6=
600
has 2 L2 caches, each shared by 2 cores).
If you ever=
watch
TV wrestling, there is a type of wrestling format where each team consists =
of
two team members. Each team is supp=
osed
to have one member in the wrestling ring (square) at any one time, and =
they
must tag their team mate when they wish to let them have a go at their
opponent. I say supposed to in
italics since more often than not, the tag exchange usually ends up with the
team doing the tag cheating by having two wrestlers in the ring at one time.
And in these situations they completely overwhelm their opponent. I am goin=
g to
show you how to do the equivalent of this using cache directed thread
scheduling as available with QuickThread.
The output =
matrix
is divided into 2x2 tiles. The thread teams are derived from the thread pool
into 2 thread teams, each thread sharing the closest cache level possible. =
On
Core i7 920 these are Hyper Thread siblings within each core, on Intel Q6600
these are two cores sharing same L2 cache, on AMD Opteron 270 these are two
cores within same processor on same NUMA node, etc… An optimization
strategy will be employed whereby we will concurrently perform the
transposition and DOT products of some of the cells. Then perform the DOT
products on the remaining cells.
The Hyper T=
hread
siblings within one core have two integer execution paths, but only one
floating point execution path. We code to take advantage of this by having =
one
thread of the tag team, called the Master thread, perform the transpositions
using the integer execution path while the other Hyper Thread sibling, call=
ed
the Slave thread, is performing the DOT product of the lower 1x2 portion of=
the
2x2 tile concurrently with the transposition (shortly following the
transposition of each cache line).
Upon completion of the transposition, the Master thread will then be=
gin
the DOT product of the upper left cell of 2x2 tile, and is joined shortly
thereafter by the Slave thread performing the DOT product on the upper right
cell. For the cells handled in this manner (those performing the
transposition), cost of the DOT product of half these cells is essentially
free, since it occurs during memory latency of the transposition. For the D=
OT
product of the other half of this 2x2 cell, the row and transposed column of
one of these output cells is fully cached (potentially all in L1 cache) with
the other output cell having the transposed column residing in L1 cache and
leaving the row of this output un-cached.
_files/image029.jpg")
Fig 10
Figure 10 s=
hows the
initial Per Socket tile work assignment distribution (color backgrounds, an
iterative per processor thread team (heavy outline), and 2x2 cell tile ligh=
test
outline.
The 2x2 til=
es that
lie on the diagonal require additional work for the transposition of the m2
array into m2t array. The master thread (even number of 2 member team) perf=
orms
the transposition (memory access intensive) using the integer execution uni=
t of
the HT core, while the other team member performs the DOT product of 3 of t=
he
cells in the 2x2 tile using the floating point execution unit of the HT cor=
e.
These DOT products are performed concurrent with the transposition by means=
of
snooping on the progress of the transposition made by the master thread of =
the
tag team. The master thread of the 2 member team then performs the final DOT
product. Some experimentation yet needs to be done to see if the Slave thre=
ad
of the 2 member team ought to perform all 4 DOT products concurrent with th=
e transposition.
See below for the {transpose, 1x2},{1x1, 1x1} method:
_files/image031.jpg")
The first t=
wo
member thread team works on the upper left most 2x2 tile (and transposition=
) of
its designated zone (shown above). Concurrent with this, the second team wo=
rks
on the upper left most tile (and transposition) of its designated zone:
_files/image033.jpg")
And so on t=
o the
last socket, last team working on the upper left most 2x2 tile (and
transposition) of its designated zone:
_files/image035.jpg")
The complet=
ion of
these diagonal tiles are not signified by the end of a parallel construct.
Instead, the completion is signified by writing a completion status into a
mailbox. This completion will be asynchronous with the activities occurring=
by
the other threads. And have the “overhead” induced by a non-Int=
erlocked
write to a shared memory mailbox.
Once the up=
per left
most tile of the diagonal is completed, the two member team consults a flag
indicating if there is a work starved thread (early-on this flag will indic=
ate
no work starved threads. When there are no work starved threads, the two me=
mber
team works on the 2x2 tiles in the column in same column of the diagonal ti=
le
just completed:
_files/image037.jpg")
Signified b=
y the
green column above. However, each thread computes a 1x2 double DOT within t=
he
2x2 tile. When that 2x2 tile is complete, the two member team consults a fl=
ag
indicating if there is a work starved thread, if no work starved thread, the
team continues down the column, if work starved, it progresses down the
diagonal.
The goal of=
working
down the column, just after transposition is the two columns just transposed
are most likely to be residing hot in cache (L1, L2 or L3).
When a 2 me=
mber
thread team completes its diagonal, and the column cells above/below its
diagonals (within its two member team zone), The thread team then consults =
the
transposition completed status of the other thread teams within its L3 cache
(by way of mailbox)
_files/image039.jpg")
The light g=
reen
column, is processed by the first team (0/1 in socket 0), after its work has
complete, and after second team (2/3) has completed the upper left most
diagonal. Which statistically will be done by the time the mailbox is consu=
lted
(after 4 transpose with DOTs, plus 12 double DOTs time delay).
Each thread=
team
does the same. As the thread team progresses over/under the columns of the
diagonals of the teams sharing its L3 cache, if it finds an uncompleted
designated column within its L3, it posts a “work starved” thre=
ad
notification such that its L3 cache sharing teams can interrupt column
processing and advance to next diagonal processing prior to completing its
current column.
Each team, =
with
exception for work starved thread indication, works independent of the other
thread teams within its socket.
This iterat=
ive
process continues until a thread team completes all the designated work wit=
hin
its socket tile:
_files/image041.jpg")
At this poi=
nt, it
now consults the diagonal completion status mailboxes for the diagonals of =
the
other sockets. Being that this is a state machine instead of nested inner
loops, the diagonal completion of the other sockets can occur in any order,=
but
will tend to occur on diagonal 2x2 tiles in ascending order within each 2
member team, in each additional socket. As indicate by separated green bar,=
to
right of Socket 0 team 0/1 zone in Socket 1 zone above completed diagonal f=
or
second team in red zone (socket 1) below.
_files/image043.jpg")
As an addit=
ional
optimization, on systems with NUMA capability, the rows of each array (m1,
transpose m2 and output array) are segregated with NUMA considerations. In =
the
2 socket system, the upper half of the rows (turquoise/green) are in socket=
0
locality, and the lower half (red) are socket 1 locality.
Now letR=
17;s
produce the charts and check the results data:
_files/image044.gif")
_files/image046.gif")
The average=
scale
factors (compared to Serial Transpose) for N =3D 128 : 1024 are:
Q6600 =
4.96x
for Parallel Transpose Tag Team and 3.06x for Cilk++
Core i7 920=
&n=
bsp; 4.69x
for Parallel Transpose Tag Team and 3.20x for Cilk++
The peak va=
lues for
selected section, ~880 of Q6600 shows ~ 7.5x for Parallel Tag Team vs 3x for
Cilk++ method.
The Paralle=
l Tag
Team definitely has the advantage over the Cilk++ method (at least in this N
range).
To inflate =
the
figures, let’s compare back to original Serial method without Transpo=
se
_files/image048.gif")
_files/image049.gif")
We can now =
observe
that at selected points in the chart we have attained a 38x improvement over
Serial on Q6600 and near 80x improvement in scaling over the Serial on Core=
i7
920.
The complete
description of Parallel Transposition Tag Team can be found in the sample c=
ode
for MatrixMultiply.cpp included in the QuickThread demo kit.
www.quickthreadprogramming.com
Can this be
improved upon…
I will have=
to say,
yes it can.
Early on in=
this
blog post I mentioned: ”save intrinsic small vector functions =
for
last”.
Internal to=
the
MatrixMultiply, the inner most loop performs a DOT product. All the program=
ming
variations are using an inline function to perform the DOT product. The
QuickThread Parallel Tag Team method (PTT) uses an additional variation on =
this
DOT function to produce two DOT products at the same time. Each thread of the 2 thread team w=
orking
on a 2x2 tile, produce results for half of this tile (1x2). Performing the =
two
DOT products concurrently within each thread improves cache hit probability=
on
the larger matrix sizes.
Below are t=
he two
DOT functions, the two functions modified to use xmm intrinsics (I am not v=
ery
good at using xmm intrinsics, you may be able to do better), and two select=
or
functions that call the appropriate DOT within the test program.
// compute=
DOT
product of two vectors, returns result as double
// used in
QuickThread variations
double DOT(double v1[], double v2[], in=
tptr_t
size)
{
double
temp =3D 0.0;
for(intptr_t
i =3D 0; i < size; i++)
{
temp=
+=3D
v1[i] * v2[i];
}
return temp;
}
double xmmDOT(double v1[], double<=
/span> v2[],
intptr_t size)
{
// __declspec(align(16))=
not
working reliably for me
double temp[4];
intptr_t alignedTemp =3D (((intptr_t)&temp[0]) &=
amp;
8) >> 3;
__m128d _temp =3D _mm_set_pd(0.0, 0.0);
__m128d *_v1 =3D (=
__m128d *)v1;
__m128d *_v2 =3D (=
__m128d *)v2;
intptr_t half=
Size =3D
size / 2;
for(intptr_t i =3D=
0; i
< halfSize; i++)
{
_temp
=3D _mm_add_pd(_temp, _mm_mul_pd(_v1[i], _v2[i]));
}
// fix code to remove te=
mp[4]
array
_mm_store_pd( &temp[alignedTemp], _temp);
if(size & 1)
temp[alignedTemp] +=3D v1[size-1] *
v2[size-1];
return temp[aligne=
dTemp]
+ temp[alignedTemp+1];
}
// compute=
two
DOT products at once
// effecti=
vely
// r[0] =3D
DOT(v1, v2, size);
// r[1] =3D
DOT(v1, v3, size);
// except
running both results at the same time
void=
DOTDOT(double v1[], double<=
/span>
v2[], double v3[], double
r[2], intptr_t size)
{
double temp[2];
temp[0] =3D 0.0;
temp[1] =3D 0.0;
for(int i=3D0; i < size; ++i)
{
temp[0] +=3D v1[i] * v2[i];
temp[1] +=3D v1[i] * v3[i];
} // for(int i=3D0; i &l=
t; size;
++i)
r[0] =3D temp[0];
r[1] =3D temp[1];
}
// compute=
two
DOT products at once
// effecti=
vely
// r[0] =3D
DOT(v1, v2, size);
// r[1] =3D
DOT(v1, v3, size);
// except
running both results at the same time
void=
xmmDOTDOT(double v1[], double<=
/span>
v2[], double v3[], double
r[2], intptr_t size)
{
// __declspec(align(16))=
not
working reliably for me
double temp[6];
intptr_t alignedTemp =3D (((intptr_t)&temp[0]) &=
amp;
8) >> 3;
__m128d _temp0 =3D _mm_set_pd(0.0, 0.0);
__m128d _temp1 =3D _mm_set_pd(0.0, 0.0);
__m128d *_v1 =3D (=
__m128d *)v1;
__m128d *_v2 =3D (=
__m128d *)v2;
__m128d *_v3 =3D (=
__m128d *)v3;
intptr_t half=
Size =3D
size / 2;
for(intptr_t i =3D=
0; i
< halfSize; i++)
{
_temp0 =3D _mm_add_pd(_temp0, _mm_mul_pd(_v1[i], _v2[i]));
_temp1 =3D _mm_add_pd(_temp1, _mm_mul_pd(_v1[i], _v3[i]));
}
_mm_store_pd( &temp[alignedTemp], _temp0);
_mm_store_pd( &temp[alignedTemp+2], _temp1);
if(size & 1)
{
temp[alignedTemp] +=3D v1[size-1] *
v2[size-1];
temp[alignedTemp+2] +=3D v1[size-1=
] *
v3[size-1];
}
r[0] =3D
temp[alignedTemp] + temp[alignedTemp+1];
r[1] =3D temp[alignedTemp+2]
+ temp[alignedTemp+3];
}
bool=
UseXMM =3D false;
double doDOT(double v1[], double<=
/span>
v2[], intptr_t size)
{
if(UseXMM)
return<=
/span>
xmmDOT(v1, v2, size);
return DOT(v1, v2,
size);
}
void=
doDOTDOT(double v1[], double<=
/span>
v2[], double v3[], double
r[2], intptr_t size)
{
if(UseXMM)
xmmDOTDOT(v1, v2, v3, r, size);
else
DOTDOT(v1, v2, v3, r, size);
}
As stated i=
n the
comments, I’ve experience an alignment issue with the compiler direct=
ive
and had to resolve this with a little tweak of the code as a work around. T=
he
incorporation of the xmm intrinsic functions into these two routines were
relatively easy (for an inexperienced xmm programmer like myself). Let̵=
7;s
look at the results:
_files/image051.gif")
On the Q660=
0 (4
core no HT) the S/PTTx with the xmm intrinsic functions clearly exhibited a
benefit, approximately 30% improvement. The QuickThread Parallel Tag Team
“x” method preponderantly uses the xmmDOTDOT (double DOT functi=
on).
However, the S/STx, using the xmmDOT (single DOT) function performs slightly
worse than the C++ code without the intrinsic functions. This would indicate
the compiler optimizations were better than the hand optimizations of an
inexperienced intrinsic programmer (not unusual).
Looking at =
the Core
i7 920 we see a different picture:
_files/image053.gif")
This proces=
sor,
with HT, experienced a detriment in performance (-12%) in using the hand
written xmm helper functions. Note, the executable was the same for both
systems.
For you as =
a vendor
of a program for use on various systems, this is an important piece of
information. Knowing the platform specific behavior means you can query the
system at program startup and then change the selection of which variant of=
the
code to use. Typically you would involve selecting a functor (function poin=
ter)
in the code as opposed to using a selection function.
An alternate
approach for unknown behavior is to use a heuristics approach. The applicat=
ion
would select each variant of your code on the first few calls and measure t=
he
time of execution for each method, then after enough samples are taken, sel=
ect
the best performing variation of the functions and insert the appropriate
functor into the dispatch pointer.
Can this be
improved upon…
I will have=
to say,
yes it can.
Note the ch=
art
lines are rather “noisy”. Additional tuning can improve the har=
mony
and thus move the trend line upwards. This amounts to an estimated addition=
al
15% over the current Parallel Transpose Tag Team method. (xmmDOTDOT on non-=
HT
systems, DOTDOT on HT systems)
15% is usua=
lly not
worth going after, however, note that both Parallel Tag Team method and the
Cilk++ method appear to be dropping off at 1024. Additional tests should be=
run
with larger matrixes, and more importantly on multi-socket systems. When I =
have
an opportunity to collect such data, I will be in the position to publish
updated information regarding this performance test.
Is there a =
superior
technique that can improve upon this?
History sho=
ws, the
answer to this is yes.
In Part-4 w=
e will
examine issues relating to multi-socket systems.
In the last installment we saw the effects of the QuickThread Parallel Tag Team method of Matrix Multiplication performed on = two single processor systems:
_files/image054.gif")
_files/image055.gif")
Where the Intel Q6600 (4 core – no HT) with two = cores (two threads) sharing L1 and L2 caches attained a 40x to 50x improvement ov= er serial method, and in Intel Core i7 920 (4 core – with HT) and with f= our cores (eight threads) sharing one L3 cache and one core (two threads) shari= ng L1 and L2 caches attained 70x to 80x improvement. Let’s see how this performs using two processors, each similar to Core i7 920.
When run on a Dual Xeon 5570 systems with 2 sockets an= d two L3 caches, each shared by four cores (8 threads). and each processor with f= our L2 and four L1 caches each shared by one core and 2 threads, we find:
_files/image057.jpg")
Fig 17
for a scale to serial of 140x to 150x in the N =3D 700= to 1344 range. The performance is almost twice that of the Core i7 920. This was somewhat expected.
There are some interesting observations to be made abo= ut this performance profile. While the 2x speed-up was expected, the Parallel Transpose method performed as well as the Parallel Tag Team method with N = =3D 700 to 1024, then drops off precipitously. This is about half of the performance peak range of the Parallel Tag Team method (700 to 1344).
Why are the plateaus the same height?
What is the interpretation of the reason for the drop-= off difference?
The plateaus are the same height for the same reason w= e saw in Fig 5 and Fig 6 where the Serial Transpose and the Parallel Transpose performance were essentially the same (yellow and red lines in Fig 17 above= ). The reason being: a resource bandwidth limitation. In Fig 5 and Fig 6 the limiting resource appeared to be memory bandwidth (due to Parallel Tag Team method having ample room to out perform Parallel Transpose). Due to the relative equalities of the plateaus (in N =3D 700 to 1024) some other resou= rce than memory band width appears to be the limiting factor. This leaves cache access overhead or SSE Floating Point bottleneck.
Both of these bottlenecks will tend to clip the height= of the performance curve but not the width. You can observe in the chart above that the two Parallel Tag Team methods managed to double the breadth of the peak performance curve thus permitting larger matrices to be handled effectively by the program. The reason for the increase in the breadth (lar= ger matrixes handled) is principally due to more effective reuse of cached data= due to the solution path through the problem (sequence in which computations are made).
The insight learned from Fig 17 is: When your problem working data set exceeds that of the cache system, you may find some paths = to the solution more efficient than a simple nested loop.
In the 5th article we will explore how we c= an extend the performance curve to handle larger matrixes. Will this involve m= ore cores/CPUs and/or different solution path?
In the part 4 we saw the effects of the QuickThread Pa= rallel Tag Team Transpose method of Matrix Multiplication performed on a Dual Xeon 5570 systems with 2 sockets and two L3 caches, each shared by four cores (8 threads). and each processor with four L2 and four L1 caches each shared by= one core and 2 threads, we find:
_files/image058.jpg")
Fig 18 (17 = on part-4)
The Intel Many Core Testing Laboratory was kind enough= to provide me some time using their systems. http://software.intel.com/en-us/a= rticles/intel-many-core-testing-lab/
Running the same method (sans Cilk++) on a 4 processor= Intel Xeon 7560, each processor with 8 cores plus Hyper Threading (total of 32 co= res, 64 threads) we observe:
_files/image060.jpg")
Fig. 19
In this chart we do not see a plateau in the scaling. = This is due to the problem size at N=3D2048 being fully contained within the sys= tem cache. Caution - keep in mind = that the above chart represents the scaling to the cache insensitive Serial meth= od.
When comparing this to the cache sensitive Serial Tran= spose method we find a different set of results:
_files/image062.jpg")
Fig 20
The sharp change in slope at N=3D1500-1700 is mainly d= ue to the drop in performance of the reference data of the Serial Transpose metho= d, rather than due to improvement in PTTx.
Looking at scaling factors (parallel performance / num= ber of hardware threads) is often used as a decision factor in making a purchase. Let’s look at the scaling factor charts:
_files/image064.jpg")
Fig 21
We find that as the problem size increases we observe = a nice positive slope on the scaling factor. This looks exceptionally good. Too go= od to be believed. It is important to remember that the Serial method is not c= ache sensitive and is not a valid base line for comparison.
When we produce the scaling factor as compared to the = cache friendly Serial Transpose method we find a completely different picture:
_files/image066.jpg")
Fig 22
This chart will really deflate your programmer’s= ego. After all this hard work, we find that the scaling factor to a cache sensit= ive Serial Transpose method does not pay off (factor crosses 1) until N =3D 182= 4.
Comparing the factors of the 2x Intel Xeon 5570, as fa= ctored against the Serial Transpose we find:
_files/image068.jpg")
Fig 23
As expected, both systems can attain super scaling at different problem sizes. This is due to the different amount of cache memor= y on each system.
Although scaling factor provides a good perspective as= to return on investment with respect to purchase of more processors, the scali= ng factor of one processor architecture is not meaningful when compared to a different processor architecture. A company ought to be interested in total return on investment, and this includes a time element.
When looking at the time element, we get a completely different picture. When comparing the fastest method (QuickThread Parallel = Tag Team Transpose with SSE intrinsic functions) we find:
_files/image070.jpg")
Fig 24
When including time, as a determination for cost benef= it, we find that there is a rather drastic transition in the cost benefit ratio as= you cross a particular threshold in the problem size (N=3D1400). The point bein= g made here is to use appropriately sized test cases when making evaluations for purchase decisions. The cost/benefit and performance curves will not always= be suitable for extrapolation.
When we run the fastest method (QuickThread Parallel T= ag Team Transpose with SSE intrinsic functions) to larger matrix sizes we find=
_files/image072.jpg")
Fig 25
Matrixes up to N =3D 3000 to 4096 can be handled with 4 processors (32 cores / 64 threads), larger matrixes may require additional processors and/or a revised method.
Conclusions up to this point:
The fastest method: Parallel Tag Team Transpose with S= SE intrinsic functions, relies on the QuickThread ability to schedule affinity pinned threads by cache level proximity. The ability to coordinate work usi= ng cache sensitivity can pay off big in your optimization strategies.
Larger matrix sizes could be handled in an improved ma= nner with the same number of processors (4x 8 core with HT) when combined with an additional tiling strategy which will include additional overhead. This is typically called the divide and con= cur method, often used by parallel programmers.
Taking the matrix at N =3D 5200, and splitting it in t= wo (both axis) yields a tile of N =3D 2600 and four such tiles. This requires 4 x 2 = =3D 8 iterations using the smaller tiles. The matrix at N =3D 2600 took approxima= tely 0.33 seconds to compute, therefore estimated computation time would be at 0= .33 x 8 =3D 2.64. An estimated 10x improvement over the un-tiled method, but wh= ich may not be fast enough for your purposes.
Would divide and concur be the best strategy to use?= p>
This depends on the interpretation of best.
In terms of relative performance return for effort in = programming, this may be so. However, in looking at Fig 18 (17 on part-4), and comparing= the Cilk++ to QuickThread Parallel Tag Team XMM method, we have demonstrated th= at by paying particular attention to cache locality, specifically, what’= s in L1, L2 and L3 caches, and when it is in those caches, that you can attain an additional 1.4x to 2.5x performance boost in performance.
I will attempt to lay out the strategy which I believe= will make effective use of the system caches. While the sketch below won’t= show the specific method, it will demonstrate the general plan of attack.
The current Parallel Tag Team (transpose) method divid= es the work by L3, then subsequently L2 regions and then takes an L1 friendly path= in producing the results. This strategy works exceptionally well up until the = size of the matrix reaches a point where the execution path begins to evict data from the L3 cache. It is my postulation that by employing a method where you follow the same path of the Parallel Tag Team (transpose) method, but impos= e a clipping technique on the distance from the diagonal, that you can minimize= L3 cache evictions. The chart below illustrates a clipped L2 path through the current L3 workspace.
_files/image074.jpg")
Fig 26
In the above chart, the general execution path follows= the arrow. The colored (red) cells indicate the output cells who’s results have been computed. The white cells indicate those output cells that have y= et to be computed.
In the current Parallel Tag Team (transpose) method al= l of the above cells would have been colored, in the proposed method for large matrixes, a clipping technique limits the distance from the diagonal of the output cells to be computed while processing the diagonal. N.B. The above i= s a simplification of a 1P system.
In processing of a large matrix, had the computations included the empty cells, the computation would first suffer L2 cache evict= ions to L3, then at some size, eviction of L3. This is (postulation) possibly confirmed by Fig 25.
_files/image076.jpg")
Fig 27
In above Fig 27 (Fig 25 with arrows added), the red ar= row depicts L2 evictions and the blue arrow depicts L3 evictions.
Back to Fig 26. Upon completion of output cells in the= Fig 26 we will find:
_files/image078.jpg")
Fig 28
Where the X’s mark the cells in the output L2 zo= ne who’s results have been completed. The blue cells represent columns (stored as rows) in the m2t array that are still residing in L2 cache, and = the red cells represent row cells in the m1 array that are in the L2 cache. Additionally, (not depicted by colorization) some portion of the bottom row= (s) and right most column(s) are still residing in the L1 cache. The remaining un-X’d white may, or may not, be residing in the L3 cache.
The next computation sequence (subject to verification) ought to follow the sequence as depicted by the arrows in Fig 29:
_files/image080.jpg")
Fig 29
The red and blue ends of the output matrix should be processed in an alternating sequence as you progress along the arrows towar= ds the first diagonal.
In the earlier mentioned divide and concur method (til= ing), you would process 4 smaller tiles twice each or 8x the time of a smaller ti= le, presumably of a size found optimal for L2 cache size. The tiling method mig= ht benefit from L2 residual data resulting in a 6x to 8x run time of the small= er matrix as opposed to 8x the run time.
In the proposed method (call it cross diagonal), and f= or the size range depicted above in Figs 28 and 29, and based upon my prior experi= ence with the Parallel Tag Team Transpose technique, it is estimated that it may= be possible to produce the result in 1.5x to 2x the time of the smaller matrix. Potentially besting the divide and concur method (tiling) by a factor of 4x= . It should be stressed that the actual differences may vary from this estimate. Extrapolation, as mentioned earlier, often does not follow the curve established by present data.
I hope you have found my series of articles insightful= . This article cannot convey the detail of the QuickThread Parallel Tag Team Trans= pose XMM method whereas the code can convey this detail. For those interested in obtaining a copy of the code and a demo license for QuickThread feel free to contact me at my email address below. QuickThread runs on Windows and Linux systems. Both x32 and x64 for Windows but only x64 for Linux (Ubuntu and Red Hat).
Jim Dempsey=
The following is the source code use as the test progr= am. Portions of this source code requires the QuickThread parallel programming toolkit. Formatting edits were perform for the purpose of this document (shorter tab spacing, long comments broken to multiple comment lines and lo= ng statements were reformatted with appropriate line breaks).
/*
* MatrixMultiply.cpp
*
*&=
nbsp;
Created on: Jun 11, 2010
* Author: ji=
m
*/
// derived=
from:
http://msdn.microsoft.com/en-us/library/dd728073.aspx
#if<=
span
style=3D'font-family:"Courier New";mso-no-proof:yes'> defined(__linux)
// Linux
dependent headers
// numa.h =
used
by the
#include "numa.h"
#else
// compile=
with:
/EHsc
#include <windows.h>
#endif
// #define
USE_matrix_compare
#include <emmintrin.h>
// Headers=
for
QuickThread
#include <QuickThread.h>
#include <parallel_task.h>
#include <parallel_for.h>
#include <parallel_distribute.h>
using namespace
qt; //
QuickThread uses namespace qt
#include <iostream>
using namespace
std;
#if<=
span
style=3D'font-family:"Courier New";mso-no-proof:yes'> defined(__linux)
#define USE_random
#endif
#if<=
span
style=3D'font-family:"Courier New";mso-no-proof:yes'> defined(USE_random)
#include <random>
#else
#include <boost\random.hpp>
using namespace
boost;
#endif
long=
PageSize =3D 0;  =
; // Virtual Memory page size in bytes
int_Native nThreadsPer=
L2 =3D
0; // se=
t at
init time
double** m1 =3D NULL;
double** m2 =3D NULL;
double** resultSerial =3D NU=
LL;
double** resultSerialt =3D N=
ULL;
double** resultSerialtXMM =
=3D NULL;
double** resultParallel =3D =
NULL;
double** resultParallelt =3D=
NULL;
double** resultParalleltXMM =
=3D
NULL;
double** resultParalleltt =
=3D NULL;
double** resultParallelttXMM=
=3D
NULL;
qt_allocator<double> cacheAligned_doubles(64);
// On
initializaton set to L1 cache line size 64, 128, 256, ...
int_Native CacheLineSize_L1 =3D 0; // (byt=
es)
// may be
CacheLineSize_L1 or 2*CacheLineSize_L1
int_Native CacheFlushLineSize =3D 0; // (bytes)
// Size of=
L1
cache (determined at init time)
int_Native CacheSize_L1 =3D 0; // (bytes)
// Size of=
L2 or
L3 cache, whichever larger/present
int_Native CacheSize_Larger =3D 0; //
(bytes)
// qtContr=
ol
object used to access internal functions
qtControl qtCtrl;
// Calls t=
he
provided work function
// and ret=
urns
the number of milliseconds
// that it=
takes
to call that function.
template <class Function>
uint64_t
time_call(Function&& f)
{
uint64_t begin =3D __rdtsc();;
f();
uint64_t end =3D __rdtsc();;=
r=
eturn
end - begin;
}
// Creates=
a
square matrix with the given number of rows and columns.
double** create_matrix(intpt=
r_t
size);
// Frees t=
he
memory that was allocated for the given square matrix.
void=
destroy_matrix(double** m, intptr_t size);
// Initial=
izes
the given square matrix with values that are generated
// by the =
given
generator function.
template <class Generator>
double** initialize_matrix(<=
span
style=3D'color:blue'>double** m, intptr_t size, Generator& gen);=
// Compute=
s the
product of two square matrices.
// this th=
e a
text book standard matrix multiplication of square matrix
void=
matrix_multiply(double** m1, double<=
/span>**
m2, double** result, intptr_t size)
{
f=
or
(intptr_t i =3D 0; i < size; i++)
{
for (intptr_t j =3D 0; j < size; j++)
{
double temp =3D 0;
for (in=
t
k =3D 0; k < size; k++)
{
=
temp +=3D m1[i][k] * m2[k][j];
}
resul=
t[i][j]
=3D temp;
}
}
}
// compute=
DOT
product of two vectors, returns result as double
// used in
QuickThread variations
double DOT(double v1[], double v2[], in=
tptr_t
size)
{
double
temp =3D 0.0;
for(intptr_t
i =3D 0; i < size; i++)
{
temp=
+=3D
v1[i] * v2[i];
}
return temp;
}
double xmmDOT(double v1[], double<=
/span>
v2[], intptr_t size)
{
double temp[4];
intptr_t alignedTemp =3D (((intptr_t)&temp[0]) &=
amp;
8) >> 3;
__m128d _temp =3D _mm_set_pd(0.0, 0.0);
__m128d *_v1 =3D (=
__m128d *)v1;
__m128d *_v2 =3D (=
__m128d *)v2;
intptr_t half=
Size =3D
size / 2;
for(intptr_t i =3D=
0; i
< halfSize; i++)
{
_temp
=3D _mm_add_pd(_temp, _mm_mul_pd(_v1[i], _v2[i]));
}
_mm_store_pd( &temp[alignedTemp], _temp);
if(size & 1)
temp[alignedTemp] +=3D v1[size-1] *
v2[size-1];
return temp[aligne=
dTemp]
+ temp[alignedTemp+1];
}
// compute=
two
DOT products at once
// effecti=
vely
// r[0] =3D
DOT(v1, v2, size);
// r[1] =3D
DOT(v1, v3, size);
// except
running both results at the same time
void=
DOTDOT(double v1[], double<=
/span>
v2[], double v3[], double
r[2], intptr_t size)
{
double temp[2];
temp[0] =3D 0.0;
temp[1] =3D 0.0;
for(intptr_t i=3D0=
; i <
size; ++i)
{
temp[0] +=3D v1[i] * v2[i];
temp[1] +=3D v1[i] * v3[i];
} // for(intptr_t i=3D0;=
i <
size; ++i)
r[0] =3D temp[0];
r[1] =3D temp[1];
}
// compute=
two
DOT products at once
// effecti=
vely
// r[0] =3D
DOT(v1, v2, size);
// r[1] =3D
DOT(v1, v3, size);
// except
running both results at the same time
void=
xmmDOTDOT(double v1[], double<=
/span>
v2[], double v3[], double
r[2], intptr_t size)
{
double temp[6];
intptr_t alignedTemp =3D (((intptr_t)&temp[0]) &=
amp;
8) >> 3;
__m128d _temp0 =3D _mm_set_pd(0.0, 0.0);
__m128d _temp1 =3D _mm_set_pd(0.0, 0.0);
__m128d *_v1 =3D (=
__m128d *)v1;
__m128d *_v2 =3D (=
__m128d *)v2;
__m128d *_v3 =3D (=
__m128d *)v3;
intptr_t half=
Size =3D
size / 2;
for(intptr_t i =3D=
0; i
< halfSize; i++)
{
_temp0 =3D _mm_add_pd(_temp0, _mm_mul_pd(_v1[i], _v2[i]));
_temp1 =3D _mm_add_pd(_temp1, _mm_mul_pd(_v1[i], _v3[i]));
}
_mm_store_pd( &temp[alignedTemp], _temp0);
_mm_store_pd( &temp[alignedTemp+2], _temp1);
if(size & 1)
{
temp[alignedTemp] +=3D v1[size-1] *
v2[size-1];
temp[alignedTemp+2] +=3D v1[size-1=
] *
v3[size-1];
}
r[0] =3D
temp[alignedTemp] + temp[alignedTemp+1];
r[1] =3D
temp[alignedTemp+2] + temp[alignedTemp+3];
}
bool=
UseXMM =3D false;
double doDOT(double v1[], double<=
/span>
v2[], intptr_t size)
{
if(UseXMM)
return<=
/span>
xmmDOT(v1, v2, size);
return DOT(v1, v2,
size);
}
void=
doDOTDOT(double v1[], double<=
/span>
v2[], double v3[], double
r[2], intptr_t size)
{
if(UseXMM)
xmmDOTDOT(v1, v2, v3, r, size);
else
DOTDOT(v1, v2, v3, r, size);
}
// Compute=
s the
product of two square matrices.
void=
matrix_multiplyTransp=
ose(double** m1, double<=
/span>**
m2, double** result, intptr_t size)
{
// create a matrix to ho=
ld the
transposition of m2
// N.B.
// When m2 will be used
multiple times it would be more efficient
// to move the transposition outside the multiplication
function
// and perform the transposition once.
// Also, when
matrix_multiplyTranspose is called multiple times
// the empty array m2t c=
ould
be saved across calls, thus saving
// reallocation. However=
, this
would also prohibit recursive
// calls of this function.
double** m2t =3D
create_matrix(size);
for (intptr_t i =
=3D 0; i
< size; i++)
{
for (intptr_t j =3D 0; j < size; j++)
{
m2t[j][i] =3D
m2[i][j];
}
}
for (intptr_t i =
=3D 0; i
< size; i++)
{
for (intptr_t j =3D 0; j < size; j++)
{
result[i][j] =3D doDOT(m1[i], m2t[j=
],
size);
}
}
destroy_matrix(m2t, size);
}
#if<=
span
style=3D'font-family:"Courier New";mso-no-proof:yes'> defined(USE_matrix_compare)
void=
matrix_compare(double** m1, double<=
/span>**
m2,intptr_t size)
{
f=
or
(intptr_t i =3D 0; i < size; i++)
{
for (intptr_t j =3D 0; j < size; j++)
{
&=
nbsp;
double delta =3D abs(m1[i][j] - m2=
[i][j]);
&=
nbsp;
double epsilon =3D max(abs(m1[i][j=
]),
abs(m2[i][j]))
/ pow(10.0,12);
if(delta > epsilon)
{
wcout << L"m1[" << i << L"][" << j << L"] !=3D m2["
<< i << L"]["
<< j << L"]" <=
;<
endl;
}
}
}
}
#endif
// Compute=
s the
product of two square matrices in parallel.
void=
parallel_matrix_multi=
ply(
double** m1, double** m2,
{
parallel_for(
intptr_t(0), siz=
e,
[&](intptr_t
iBegin, intptr_t iEnd)
{
for(intptr_t i =3D iBegin; i < iEnd; ++i)
{
for
(intptr_t j =3D 0; j < size; j++)
{
=
double=
span>
temp =3D 0;
=
for
(intptr_t k =3D 0; k < size; k++)
=
{
=
temp
+=3D m1[i][k] * m2[k][j];
=
}
=
result[i][j] =3D temp;
}
}
});
}
// Compute=
s the
product of two square matrices in parallel.
// using
transposition of matrix prior to multiplication
void=
parallel_matrix_multiplyTranspose(double*=
* m1, double** m2, double<=
/span>**
result, intptr_t size)
{
// create a matrix to ho=
ld the
transposition of m2
// N.B.
// When m2 will be used
multiple times
// it would be more effi=
cient
to move the transposition
// outside the multiplic=
ation
function and perform the
// transposition once.
// Also, when
matrix_multiplyTranspose is called multiple times
// the empty array m2t c=
ould
be saved across calls, thus saving
// reallocation. However=
, this
would also prohibit recursive
// calls of this function.
double** m2t =3D
create_matrix(size);
// QuickThread
parallel_for(
AllThreads$,
intptr_t(0), siz=
e,
[&](intptr_t
iBegin, intptr_t iEnd)
{
for(intptr_t i =3D iBegin; i < iEnd; ++i)
&n=
bsp; {
for
(intptr_t j =3D 0; j < size; j++)
{
=
m2t[j][i] =3D m2[i][j];
}
}
});
parallel_for(
AllThreads$,
intptr_t(0), size,
[&](intptr_t iBegin, intptr_t
iEnd)
{
for(intptr_t i =3D iBegin; i < iEnd; ++i)
{
for (intptr_t
j =3D 0; j < size; j++)
{
result[i][j] =3D doDOT(m1[i], m2t[j=
],
size);
}
}
});
destroy_matrix(m2t, size);
}
struct TagTeamsSystem
{
// ctor captured args
double** m1;
double** m2;
double** result;
intptr_t size;
volatile bool TreadWaitingForTransposition;
// allocated transpositi=
on
array for m2
double** m2t;
// create array indicate=
ing
number of cells in column pair
// transposed
// 0 =3D no data transpo=
sed,
size =3D all data transposed
// Transposition perform=
ed 2
columns to 2 rows at a time
// (potential for last
transposition being 1 column to one row)
// m2t_transposed holds
transpostion counts for the pair of rows
volatile intptr_t*
m2t_transposed;
// *** m2t_transposed
currently has junk
bool HaveAllocationError;
bool allocationError() { return HaveAllocationError; }
TagTeamsSystem(
double**
_m1, double** _m2, double**
_result, intptr_t _size)
{
m1 =3D _m1;
m2 =3D _m2;
result =3D _result;
size =3D _size;
TreadWaitingForTransposition =3D <=
span
style=3D'color:blue'>false;
m2t =3D new
double*[size];
m2t_transposed =3D new intptr_t[(size + 1) / 2];
if(!m2t
|| !m2t_transposed)
{
HaveAllocationEr=
ror
=3D true;
return;
}
// ***=
m2t
currently has junk for row pointers
// ***
m2t_transposed currently has junk for counters
=
// tentatively indicate valid allocation
// oth=
er
threads can refute this assertion
HaveAllocationError =3D false;
}
~TagTeamsSystem()
{
if(m2t_transposed)
delete [] m2t_transposed;
if(m2t)
delete [] m2t;
}
};
struct TagTeamsProcessor
{
TagTeamsSystem* tts;
intptr_t iBeginL3;
intptr_t iEndL3;
bool HaveAllocationError;
double** m1;
double** m2;
double** m2t;
double** result;
intptr_t size=
;
volatile intptr_t*
m2t_transposed;
bool allocationError() { return HaveAllocationError; }
TagTeamsProcessor(
TagTeamsSystem* _tts, intptr_t _iBeginL3, intptr_t _iEndL3)
{
tts =3D _tts;
iBeginL3 =3D _iBeginL3;
iEndL3 =3D _iEndL3;
HaveAllocationError =3D false;
// make
local copies of often used variables
m1 =3D tts->m1;
m2 =3D tts->m2;
m2t =3D tts->m2t;
result =3D tts->result;
size =3D tts->size;
m2t_transposed =3D
tts->m2t_transposed;
// fir=
st
null out the entire slice of our rows in m2t
// (in=
the
event other thread has allocation error)
for(intptr_t
iRow =3D iBeginL3; iRow < iEndL3; ++iRow)
m2t[iRow] =3D NU=
LL;
// cle=
ar
counts held in m2t_transposed
for(intptr_t
i =3D iBeginL3; i<iEndL3; i +=3D 2)
m2t_transposed[i=
/
2] =3D 0;
// now
allocate
for(intptr_t
iRow =3D iBeginL3; iRow < iEndL3; ++iRow)
{
// NUMA aware allocation
m2t[iRow] =3D
cacheAligned_doubles.allocate(size);
if(m2t[iRow] =3D=3D NULL)
{
// indicate allocation error
HaveAllocationError
=3D true;
return; <=
/span>// let dtor clean up
}
} //
for(intptr_t iRow =3D iBeginL3; iRow < iEndL3; ++iRow)
}
~TagTeamsProcessor()
{
//
deallocate our rows of m2t
for(intptr_t
iRow =3D iBeginL3; iRow < iEndL3; ++iRow)
{
if(m2t[iRow] =3D=3D NULL)
break;
// NUMA aware deallocation
cacheAligned_dou=
bles.deallocate(
m2t[iRow], size);
}
}
};
struct TagTeamSameCache
{
TagTeamsProcessor*&n=
bsp; ttp;
intptr_t iBeginL2;
intptr_t iEndL2;
intptr_t iBeginL3;
intptr_t iEndL3;
bool HaveAllocationError;
// Row and
// *** may exceed bounds=
of
array ***
volatile intptr_t<=
span
style=3D'mso-tab-count:1'> iRow2x2;
volatile intptr_t<=
span
style=3D'mso-tab-count:1'> iCol2x2;
volatile intptr_t<=
span
style=3D'mso-tab-count:1'> MasterTileSequenceNumber;
volatile intptr_t<=
span
style=3D'mso-tab-count:1'> SlaveTileSequenceNumber;
intptr_t nTeamMembersInL2;
double** m1;
double** m2;
double** m2t;
double** result;
intptr_t size=
;
volatile intptr_t*
m2t_transposed;
// DoDiagonalsState tabl=
e
// 0 =3D no processing
// 1 =3D transposition s=
tarted
or ended for our diagonal
// 2 =3D column process =
started
or ended for our diagonal (L2)
// 3 =3D column process =
started
or ended for our processor (L3)
// 4 =3D column process =
started
or ended for any processor
char* DoDiagonalsState; // =3D new ch=
ar[(size +
1) / 2];
intptr_t iDoD=
iagonalL2index; // iBeg=
inL2 :
iEndL2
&=
nbsp; //
(use DoDiagonals[iDoDiagonalL2index/2])
intptr_t iDoD=
iagonalL3index; // iBeg=
inL3 :
iEndL3
// (use DoDiagonals[iDoDiagonalL3index/2])
intptr_t iDoD=
iagonalSystem; /=
/ 0 :
size
// (use DoDiagonals[iDoDiagonalSystem/2])
intptr_t iDoC=
olumnIndex; //
iBeginL2 : iEndL2
intptr_t iDoC=
olumnColumn; // iBeg=
inL2 :
iEndL2
intptr_t iLas=
tDiagonalDone;
inline bool IsO=
nDiagonal()
{
return<=
/span>
(iRow2x2 =3D=3D iCol2x2);
}
inline bool Is2x2()
{
return<=
/span>((iRow2x2
+ 2 <=3D size) && (iCol2x2 + 2 <=3D size));
}
inline bool Is2x1()
{
return<=
/span>((iRow2x2
+ 2 <=3D size) && (iCol2x2 + 1 =3D=3D size)); } inline bool Is1x2() { return<=
/span>((iRow2x2
+ 1 =3D=3D size) && (iCol2x2 + 2 <=3D size)); } inline bool Is1x1() { return<=
/span>((iRow2x2
+ 1 =3D=3D size) && (iCol2x2 + 1 =3D=3D size)); } bool allocationError() { return HaveAllocationError; } TagTeamSameCache( TagTeamsProcessor* _ttp, intptr_t _iBeginL2, intptr_t _iEndL2) { ttp =3D _ttp; iBeginL2 =3D _iBeginL2; iEndL2 =3D _iEndL2; iBeginL3 =3D ttp->iBeginL3; iEndL3 =3D ttp->iEndL3; HaveAllocationError =3D false; m1 =3D ttp->m1; m2 =3D ttp->m2; m2t =3D ttp->m2t; result =3D ttp->result; size =3D ttp->size; m2t_transposed =3D ttp->m2t_tra=
nsposed; MasterTileSequenceNumber =3D -1; SlaveTileSequenceNumber =3D -1; DoDiagonalsState =3D new char[(siz=
e + 1) /
2]; memset(DoDiagonalsState, 0, (size =
+ 1)
/ 2); iLastDiagonalDone =3D size; // init=
ialize
out of bounds } ~TagTeamSameCache() { if(DoDiagonalsState) delete [] DoDiagonalsState; } inline bool TileSeenBySlave() { return
( (MasterTileSeque=
nceNumber
=3D=3D SlaveTileSequenceNumber) || ((iEndL2 -
iBeginL2) =3D=3D size) || (nTeamMembers=
InL2
=3D=3D 1)); } inline bool TileAdvancedByMaster() { return ( (MasterTileSeque=
nceNumber
> SlaveTileSequenceNumber) || (nTeamMembers=
InL2
=3D=3D 1)); } inline bool PickNextDiagonal() { // any
remaining tiles on our diagonal if(iDoDiagonalL2index
>=3D iEndL2) return false;=
// no // set=
tile
focus to next tile on diagonal iRow2x2 =3D iDoDiagonalL2index; iCol2x2 =3D iRow2x2; // Not=
ify
Slave thread that tile selection is ready ++MasterTileSequenceNumber; // rec=
ord
last diagonal done (likely in L1/L2 cache) iLastDiagonalDone =3D iRow2x2; // ind=
icate
begun transposition DoDiagonalsState[iLastDiagonalDone=
/
2] =3D 1; // and=
bump
iDoDiagonalL2index of 2x2 tile for next time iDoDiagonalL2index +=3D 2; // ind=
icate
valid tile chosen return<=
/span>
true; } inline bool Pic=
kingFirstTile() { // see=
if
first time call if(MasterTileSequenceNumber
>=3D 0) return false;=
// no // yes,
initialize state machine iDoDiagonalL2index =3D iBeginL2; &=
nbsp; &=
nbsp; //
(use DoDiagonals[iDoDiagonalL2index/2]) iDoDiagonalL3index =3D iBeginL3; &=
nbsp; &=
nbsp; //
(use DoDiagonals[iDoDiagonalL3index/2]) iDoDiagonalSystem =3D 0; // 0 : size &=
nbsp; &=
nbsp; //
(use DoDiagonals[iDoDiagonalSystem/2]) iDoColumnIndex =3D iEndL2; // iBeginL2 : iEndL2 iDoColumnColumn =3D iEndL2; // iBeginL2 : iEndL2 // ret=
urn
next (first) tile on our diagonal return<=
/span> PickNextDiagonal(); } // while we are traversi=
ng
down a column of the last // diagonal picked by our
thread, and near the end // of completing the mat=
rix
multiplication, some other // thread may have finis=
hed up
all its diagonals, and // columns on diagonal, =
and
has found no additional // work to do. When this
occures .AND. when this thread // has additional diagon=
als to
perform, then we want to // process or next diago=
nal
prior to following down the // column of the last ti=
le we
picked. inline bool PickingForDiagonalTileForStarvedThread() { if(ttp->tts->TreadWaitingForTransposition) { // attempt to pick our next diagonal if(PickNextDiagonal()) { // got tile, clear starvation flag ttp->tts->TreadWaitingForTransposition
=3D false; return true;<=
o:p> } } //
if(ttp->tts->TreadWaitingForTransposition) return<=
/span>
false; } inline bool PickingTileDownColumnOfOurDiagonal() { // ski=
p over
our diagonal if/when reached if(iDoColumnIndex
=3D=3D iDoColumnColumn) iDoColumnIndex +=
=3D 2; // are=
we
done with this column? if(iDoColumnIndex
>=3D iEndL2) return false;=
// yes,
indicate pick fail // the=
row
becomes the iDoColumnIndex iRow2x2 =3D iDoColumnIndex; iCol2x2 =3D iDoColumnColumn; ++MasterTileSequenceNumber; /=
/ Notify
Slave thread // and=
bump
iDoColumnIndex for next pick iDoColumnIndex +=3D 2; return<=
/span>
true; } inline bool PickingFirstTileDownColumnOfOurLastDiagona=
l() { // ass=
ure
not at end of matrix if(iLastDiagonalDone
>=3D size) return false;=
if(DoDiagonalsState[iLastDiagonalDone
/ 2] =3D=3D 1) DoDiagonalsState=
[iLastDiagonalDone
/ 2] =3D 2; iDoColumnColumn =3D iLastDiagonalD=
one; iDoColumnIndex =3D iBeginL2; iLastDiagonalDone =3D size; // inva=
lidate
for test above // the
following may fail on very small matrix // do =
not
assume it will always succeed // now=
walk
down this column return<=
/span>
PickingTileDownColumnOfOurDiagonal(); } inline bool PickingFirstTileOfColumnOfOurL3() { for( iLastDiagonalDone =3D iDoDiagonalL3=
index; iLastDiagonalDone < iEndL3; iLastDiagonalDone +=3D 2) { // convert to 2x2 index number intptr_t i =3D
iLastDiagonalDone / 2; // see if we have not already processed this column // we have not performed the diagonal transposition // for tiles of other cores of this processor. // Therefore: // DoDiagonalsState[i] =3D=3D 0 for unprocessed tiles=
//  =
; &=
nbsp;
(by our thread) // DoDiagonalsState[i] =3D=3D 1 we did diagonal // &=
nbsp; (but
not entire column of that diagonal) // DoDiagonalsState[i] =3D=3D 2 we did diagonal //  =
; (and entire=
column
of that diagonal) // DoDiagonalsState[i] =3D=3D 3 other thread in our L=
3 // did diagonal an=
d we
did column of that diagonal // find ((DoDiagonalsState[i] =3D=3D 0) // &&
(m2t_transposed[i] =3D=3D size)) if(DoDiagonalsState[i] =3D=3D 0) { // we haven't done this column pair // see if other thread has completed transposion if(m2t_transposed[i] =3D=3D size) { =
// transposition complete =
// indicate we selected this column =
DoDiagonalsState[i]
=3D 3; =
// see if we can advance index for next time =
if(iLastDiagonalDone =3D=3D iDoDiagonalL3index)=
=
iDoDiagonalL3index
+=3D 2; =
if=
(PickingFirstTileDownColumnOfOurLastDiagonal()) =
re=
turn
true; } // if(m2t_transposed[i] =3D=3D size) } // if(DoDiagonals[i]) else { if(iLastDiagonalDone =3D=3D iDoDiagonalL3index)=
=
iDoDiagonalL3index
+=3D 2; } } //
for(iLastDiagonalDone =3D iDoDiagonalL3index; // iLastDiagonalDone < iEndL3; // iLastDiagonalDone +=3D 2) return<=
/span>
false; } bool
PickingFirstTileOfColumnOfAnyProcessor() { for( iLastDiagonalDone =3D iDoDiagonalSy=
stem; iLastDiagonalDone < size; iLastDiagonalDone +=3D 2) { // convert to 2x2 index number intptr_t i =3D
iLastDiagonalDone / 2; // see if we have not already processed this column if(DoDiagonalsState[i] <=3D 1) { // we haven't done this column pair // see if other thread has completed // transposion if(m2t_transposed[i] =3D=3D size) { =
// transposition complete =
// indicate we selected this diagonal =
DoDiagonalsState[i]
=3D 4; =
// see if we can advance index for next time =
if(iLastDiagonalDone =3D=3D iDoDiagonalSystem)<=
o:p> =
iDoDiagonalSystem
+=3D 2; =
if(PickingFirstTileDownColumnOfOurLastDiagonal(=
)) =
return true;<=
o:p> } // if(m2t_transposed[i] =3D=3D size) } // if(DoDiagonalsState[i]) else { if(iLastDiagonalDone =3D=3D iDoDiagonalSystem)<=
o:p> =
iDoDiagonalSystem
+=3D 2; } } // f=
or( iLastDiagonalDone =3D
iDoDiagonalSystem; // &nbs=
p; iLastDiagonalDone
< size; // &nbs=
p; iLastDiagonalDone
+=3D 2) ttp->tts->TreadWaitingForTra=
nsposition
=3D true; return<=
/span>
false; } inline bool MoreTilesToPick() { return<=
/span>
((iDoDiagonalSystem < size) || (iDoDiagonalL3index < iEndL3) || (iDoDiagonalL2index < iEndL2)); } // select a tile, return=
true
if found, false when done // only master thread of=
team
calls this function // Also, this function s=
elects
the next tile along the diagonal bool SelectTile()<=
o:p> { // wait
until slave thread has observed prior // tile selection (passes on 1st time call) while=
span>(!TileSeenBySlave()) _mm_pause(); // qui=
ck
test for first time call if(PickingFirstTile()) { return true;<=
o:p> } // tes=
t for
thread starvation if(PickingForDiagonalTileForStarvedThread()) { return true;<=
o:p> } if(PickingTileDownColumnOfOurDiagonal()) { return true;<=
o:p> } if(PickingFirstTileDownColumnOfOurLastDiagonal()) { return true;<=
o:p> } if(PickingFirstTileOfColumnOfOurL3()) { return true;<=
o:p> } if(PickNextDiagonal()) { return true;<=
o:p> } if(PickingFirstTileOfColumnOfAnyProcessor()) { return true;<=
o:p> } while=
span>(MoreTilesToPick()) { if(PickingFirstTileOfColumnOfOurL3()) { return true;<=
o:p> } if(PickingFirstTileOfColumnOfAnyProcessor()) { return true;<=
o:p> } _mm_pause(); /=
/ or
qtYield() } // set=
focus
out of range such that // Sla=
ve
thread knows of termination condition iRow2x2 =3D size; iCol2x2 =3D size; // not=
ify
slave of termination condition ++MasterTileSequenceNumber; return<=
/span>
false; } // bool SelectTile() }; struct TagTeamMember { TagTeamSameCache* ttc; intptr_t iTeamMemberInL2; intptr_t nTeamMembersInL2; bool HaveAllocationError; double** m1; double** m2; double** m2t; double** result; intptr_t size=
; volatile intptr_t*
m2t_transposed; intptr_t iRow=
2x2; intptr_t iCol=
2x2; size_t <=
/span>sizeMinus1; // =3D size_l=
ocal - 1; size_t <=
/span>sizeTrunc; // =3D =
size_local
& ~1; &=
nbsp; //
(even number of size) // Redefine pointers as
int64_t int64_t** m2_as_int64; // =3D (int64_t**)m2; int64_t** m2t_as_int64; // =3D (int64_t**)m2t; bool allocationError() { return HaveAllocationError; } TagTeamMember( TagTeamSameCache*
_ttc, intptr_t _iTeamMemberInL2, intptr_t _nTeamMembersInL2) { ttc =3D _ttc; iTeamMemberInL2 =3D _iTeamMemberIn=
L2; nTeamMembersInL2 =3D _nTeamMembers=
InL2; ttc->nTeamMembersInL2 =3D
nTeamMembersInL2; HaveAllocationError =3D false; m1 =3D ttc->m1; m2 =3D ttc->m2; m2t =3D ttc->m2t; result =3D ttc->result; size =3D ttc->size; sizeMinus1 =3D size - 1; // ofte=
n used
evaluation sizeTrunc =3D size & ~1; // even=
number
of size m2t_transposed =3D
ttc->m2t_transposed; // Red=
efine
pointers as int64_t m2_as_int64 =3D (int64_t**)m2; m2t_as_int64 =3D (int64_t**)m2t; } ~TagTeamMember() { if((size
=3D=3D 1) && (iTeamMemberInL2 > 0)) { // advance SlaveTileSequenceNumber ++ttc->SlaveT=
ileSequenceNumber; } } bool SelectTile()<=
o:p> { if(IsMaster()) { ttc->SelectTi=
le(); // copy to local values (may be invalid) iRow2x2 =3D
ttc->iRow2x2; iCol2x2 =3D
ttc->iCol2x2; // return true if we have valid tile return (iRow2x2 < size); } // Sla=
ve
thread // loop
until master advances tile while=
span>(!ttc->TileAdvancedByMaster()) _mm_pause(); // ** =
save
prior to advancing SlaveTileSequenceNumber iRow2x2 =3D ttc->iRow2x2; iCol2x2 =3D ttc->iCol2x2; // adv=
ance
SlaveTileSequenceNumber ++ttc->SlaveTileSequenceNumber;=
// ret=
urn
true if not void (have work to do) return<=
/span>
(iRow2x2 < size); } // bool SelectTile() inline bool IsMaster() { re=
turn
(iTeamMemberInL2 =3D=3D 0); } inline bool IsSlave() { ret=
urn
(iTeamMemberInL2 !=3D 0); } inline bool IsOnlyThread() { return
(nTeamMembersInL2 =3D=3D 1); } inline bool IsO=
nDiagonal() { return<=
/span>
(iRow2x2 =3D=3D iCol2x2); } inline bool Is2x2() { return<=
/span>((iRow2x2
+ 2 <=3D size) && (iCol2x2 + 2 <=3D size)); } inline bool Is2x1() { return<=
/span>((iRow2x2
+ 2 <=3D size) && (iCol2x2 + 1 =3D=3D size)); } inline bool Is1x2() { return<=
/span>((iRow2x2
+ 1 =3D=3D size) && (iCol2x2 + 2 <=3D size)); } inline bool Is1x1() { return<=
/span>((iRow2x2
+ 1 =3D=3D size) && (iCol2x2 + 1 =3D=3D size)); } void DoDiagonalAsO=
nlyThread() { // mus=
t be
on processor with diminished capacity // onl=
y 1
thread in this tag team //
Equivilent to DoDiagonalAsMasterThread() without // notification // And
performing double DOT on what would have been done // by Slave intptr_t iRow =3D iRow2x2; /=
/ same
coord as 2x2 // (when master thread) intptr_t iCol =3D iCol2x2; /=
/ same
coord as 2x2 // (when master thread) if(iCol
=3D=3D sizeMinus1) { // last column is single column for(intptr_t i=3D0; i < sizeTrunc; i +=3D 2)=
{ int64_t temp[2]; // compiler should SSE register temp // collect 2x2 tile from next row pair // in tile column pair temp[0]
=3D m2_as_int64[i][iCol]; temp[1]
=3D m2_as_int64[i+1][iCol]; // store the transposed 1x2 tile into the m2t array m2t_as_int64[iCol][i]
=3D temp[0]; m2t_as_int64[iCol][i+1]
=3D temp[1]; } // check for odd size if(size & 1) { m2t_as_int64[iCol][sizeMinus1] =3D m2_as_int64[sizeMinus1][iCol]; } } else { // not last column is single column // N.B. keep sizeTrunc as local variable instead of // outer scope reference // the following for loop will run faster // walk down the column pair of m2 transposing to // row pair in m2t for(intptr_t i=3D0; i < sizeTrunc; i +=3D 2)=
{ int64_t temp[4]; // compiler should SSE register temp // collect 2x2 tile from next row pair // in tile column pair temp[0]
=3D m2_as_int64[i][iCol]; temp[2]
=3D m2_as_int64[i][iCol+1]; temp[1]
=3D m2_as_int64[i+1][iCol]; temp[3]
=3D m2_as_int64[i+1][iCol+1];=
// store the transposed 2x2 tile into the m2t array m2t_as_int64[iCol][i]
=3D temp[0]; m2t_as_int64[iCol][i+1]
=3D temp[1]; m2t_as_int64[iCol+1][i]
=3D temp[2]; m2t_as_int64[iCol+1][i+1]
=3D temp[3]; } // check for odd size if(size & 1) { m2t_as_int64[iCol][sizeMinus1] =3D m2_as_int64[sizeMinus1][iCol]; m2t_as_int64[iCol+1][sizeMinus1] =3D m2_as_int64[sizeMinus1][iCol+1]; } } // ind=
icate
to other threads that we completed transposition m2t_transposed[iCol / 2] =3D size;=
if(Is2x2()) { // master thread finishes up with double DOT // for upper row of 2x2 doDOTDOT( m1[iRow], m2t[iCol], m2t[iCol+1], &result[iRow][iCol], size); // then double DOT for lower row of 2x2 doDOTDOT( m1[iRow+1], m2t[iCol], m2t[iCol+1], &result[iRow+1][iCol], size); } else { // 1x1 result[iRow][iCo=
l] =3D
doDOT(m1[iRow], m2t[iCol], size); } } // void
DoDiagonalAsOnlyThread() void
DoDiagonalAsMasterThread() { // Mas=
ter
thread intptr_t iRow =3D iRow2x2; /=
/ same
coord as 2x2 // (when master thread) intptr_t iCol =3D iCol2x2; /=
/ same
coord as 2x2 // (when master thread) // mas=
ter of
tag team performs transposition // and=
DOT
of its row in the tag team tile // Not=
ify
the other thread(s) of of my team // aft=
er the
cache line(s) is(are) flushed intptr_t notificationInterval =3D CacheFlushLineSize /=
sizeof(double=
); if(iCol
=3D=3D sizeMinus1) { // last column is single column for(intptr_t i=3D0; i < sizeTrunc; i +=3D 2)=
{ int64_t temp[2]; // compiler should SSE register temp // collect 2x2 tile from next row pair in tile column=
pair temp[0]
=3D m2_as_int64[i][iCol]; temp[1]
=3D m2_as_int64[i+1][iCol]; // store the transposed 1x2 tile into the m2t array m2t_as_int64[iCol][i]
=3D temp[0]; m2t_as_int64[iCol][i+1]
=3D temp[1]; // see if we wrote the last pair in cache flush line<=
o:p> if(((i+2)%notificationInterval) =3D=3D 0) { =
// inform other thread(s) of this team that // transposition has occured up through and including // this column =
m2t_transposed[iCol
/ 2] =3D i+2; } } // check for odd size if(size & 1) { m2t_as_int64[iCol][sizeMinus1] =3D m2_as_int64[sizeMinus1][iCol]; // TileColTransposed updated below } } else { // not last column is single column // N.B. keep sizeTrunc as local variable instead of // outer scope reference, the following for loop will run=
// faster // Walk down the column pair of m2 transposing to // row pair in m2t for(intptr_t i=3D0; i < sizeTrunc; i +=3D 2)=
{ int64_t temp[4]; // compiler should SSE register temp // collect 2x2 tile from next row pair in tile column=
pair temp[0]
=3D m2_as_int64[i][iCol]; temp[2]
=3D m2_as_int64[i][iCol+1]; temp[1]
=3D m2_as_int64[i+1][iCol]; temp[3]
=3D m2_as_int64[i+1][iCol+1];=
// store the transposed 2x2 tile into the m2t array m2t_as_int64[iCol][i]
=3D temp[0]; m2t_as_int64[iCol][i+1]
=3D temp[1]; m2t_as_int64[iCol+1][i]
=3D temp[2]; m2t_as_int64[iCol+1][i+1]
=3D temp[3]; // see if we wrote the last pair in cache flush line<=
o:p> //
iBeginL2 : iEndL2 //
iBeginL3 : iEndL3