Partitioned Global Address Space (PGAS)

April 21, 2010

Introduction

While Moore’s Law continues to predict the doubling of transistors on an integrated circuit every 18 months, performance and power considerations have forced chip designers to embrace multi-core processors in place of higher frequency uni-core processors. As desktop and high-performance computing architectures tend towards distributed collections of multi-core nodes, a new parallel programming paradigm is required to fully exploit the complex distributed and shared-memory hierarchies of these evolutionary systems.

Recently, a programming model has been developed that has the potential to exploit the best features of this distributed shared-memory architecture. Not only does this model promise improved runtime performance on distributed clusters of SMPs, its data and execution semantics support increased programmer productivity. This model is called the Partitioned Global Address Space (PGAS) model.

In short, the Partitioned Global Address Space (PGAS) paradigm provides both a data model and an execution model that have the potential to dramatically improve runtime performance and programmer productivity on these distributed shared-memory, multi-core architectures.

Memory Models

There are two models for memory usage:

  1. Shared Memory Model
  2. Distributed Memory Model

Shared Memory Model

The shared-memory programming model typically exploits a shared memory system, where any memory location is directly accessible by any of the computing processes (i.e. there is a single global address space). This programming model is similar in some respects to the sequential single-processor programming model, with the addition of new constructs for synchronizing concurrent access to shared variables and memory locations.
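
For instance, a minimal sketch in C using POSIX threads (the counter, thread count, and names are illustrative, not part of the original text) shows the extra synchronization construct, a mutex, guarding a shared variable:

/* shared_memory_sketch.c -- illustrative only */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                      /* shared: visible to every thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);                /* synchronize access to the shared location */
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", counter);       /* all threads updated the same memory */
    return 0;
}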

Distributed Memory Model

The distributed-memory programming model exploits a distributed-memory system where each processor maintains its own local memory and has no direct knowledge about another processor’s memory (a “share nothing” approach). For data to be shared, it must be passed from one processor to another as a message.
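
For instance, a minimal sketch in C using MPI (the ranks and message tag are illustrative; it assumes at least two processes) shows a value being passed from one process to another as an explicit message:

/* message_sketch.c -- illustrative only */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);        /* the data arrived as a message */
    }

    MPI_Finalize();
    return 0;
}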

Why PGAS?

PGAS offers the best of both worlds: this parallel programming model combines the performance and data-locality (partitioning) features of the distributed-memory model with the programmability and data-referencing simplicity of a shared-memory (global address space) model.
The PGAS programming model aims to achieve these characteristics by providing:

  1. A local-view programming style (which differentiates between local and remote data partitions).
  2. A global address space (which is directly accessible by any process).
  3. Compiler-introduced communication to resolve remote references.
  4. One-sided communication for improved inter-process performance.
  5. Support for distributed data structures.

In this model, variables and arrays can be either shared or local. Each process has private memory for its local data items and shared memory for globally shared data values. Although the shared memory is partitioned among the cooperating processes (each process contributes memory to the shared global memory), a process can directly access any data item within the global address space with a single address.
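
As a rough illustration (this uses UPC notation, one of the PGAS languages introduced below; the variable names are illustrative):

#include <upc_relaxed.h>

shared int grid[THREADS];   /* shared array: one element lives in each process's partition */
shared int flag;            /* shared scalar: lives in thread 0's partition                 */
int scratch;                /* private: each process has its own copy, invisible to others  */

int main() {
    grid[MYTHREAD] = MYTHREAD;          /* write to the locally owned partition */
    if (MYTHREAD == 0) flag = 1;        /* a single global address, no explicit messaging */
    upc_barrier;
    /* the same global addressing also reaches a remote partition */
    scratch = grid[(MYTHREAD + 1) % THREADS] + flag;
    return 0;
}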

Languages of PGAS

Currently, three PGAS programming languages are becoming commonplace on modern computing systems:

  1. Unified Parallel C (UPC)
  2. Co-Array Fortran (CAF)
  3. Titanium

Unified Parallel C (UPC)

UPC is a parallel extension of ANSI C for distributed shared-memory programming. It keeps the common, familiar syntax and semantics of C, adding only simple extensions. UPC also provides standard library functions to move data to and from shared memory, which can be used to move chunks within the shared space or between shared and private spaces.
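
As an example, a minimal sketch of those library routines, using the standard upc_memput and upc_memget functions (the array name and block size here are illustrative):

/* copy_sketch.c -- illustrative only */
#include <upc_relaxed.h>

#define N 16
shared [N] int sdata[N*THREADS];   /* one block of N elements with affinity to each thread */

int main() {
    int local_in[N], local_out[N];
    for (int i = 0; i < N; i++) local_in[i] = MYTHREAD * N + i;

    /* private -> shared: write this thread's block into the shared space */
    upc_memput(&sdata[MYTHREAD * N], local_in, N * sizeof(int));
    upc_barrier;

    /* shared -> private: read the next thread's block back into private memory */
    upc_memget(local_out, &sdata[((MYTHREAD + 1) % THREADS) * N], N * sizeof(int));
    return 0;
}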

UPC Execution Model

A number of threads work independently in SPMD (Single Program, Multiple Data) fashion. MYTHREAD gives the index of the current thread (0..THREADS-1), and THREADS gives the number of threads, which is specified at compile time or at run time.
There is no implicit synchronization among the threads; synchronization happens only where needed. UPC provides four mechanisms (a short sketch follows the list):

  1. Barriers: both blocking and non-blocking (split-phase) variants.
  2. Locks: to protect data against multiple writers.
  3. Memory consistency control: governs the order in which shared operations appear (strict vs. relaxed).
  4. Fence: equivalent to a null strict reference, ensuring that all previously issued shared references have completed.
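
A minimal sketch of the barrier, lock, and fence mechanisms, using the standard UPC lock and barrier routines (the shared counter is illustrative):

/* sync_sketch.c -- illustrative only */
#include <upc_relaxed.h>
#include <stdio.h>

shared int counter;                     /* statically allocated shared data is zero-initialized */

int main() {
    /* collective call: every thread gets a pointer to the same lock */
    upc_lock_t *lock = upc_all_lock_alloc();

    upc_lock(lock);                     /* lock: protect the counter from multiple writers */
    counter++;
    upc_unlock(lock);

    upc_fence;                          /* fence: complete all shared references issued so far */
    upc_barrier;                        /* blocking barrier (upc_notify/upc_wait give the
                                           non-blocking, split-phase form)                      */
    if (MYTHREAD == 0)
        printf("counter = %d on %d threads\n", counter, THREADS);
    return 0;
}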

A Quick Example

//vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
int main() {
    int i;
    /* each thread adds only the elements it has affinity to */
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
            v1plusv2[i] = v1[i] + v2[i];
    return 0;
}

UPC Runtime Model

The figure shows the high-level system diagram for a UPC application compiled using the Berkeley UPC compiler. The generated C code runs on top of the UPC runtime system, which provides platform independence and implements language-specific features such as shared memory allocation and shared pointer manipulation. The runtime system implements remote operations by calling the GASNet communication interface, which provides hardware-independent, lightweight networking primitives.

UPC Memory Model

A shared pointer can reference all locations in the shared space, while a private pointer may reference only addresses in its private space or in its portion of the shared space. Static and dynamic memory allocations are supported for both shared and private memory.
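
A minimal sketch of dynamic allocation in both spaces, using the standard upc_all_alloc, upc_alloc, and upc_free routines (the sizes and names are illustrative):

/* alloc_sketch.c -- illustrative only */
#include <upc_relaxed.h>
#include <stdlib.h>

#define N 64

int main() {
    /* Collective shared allocation: every thread calls it and receives the same
       shared pointer; the THREADS blocks are distributed one per thread. */
    shared int *global_buf = (shared int *) upc_all_alloc(THREADS, N * sizeof(int));

    /* Non-collective shared allocation: space in this thread's own shared partition. */
    shared int *local_shared = (shared int *) upc_alloc(N * sizeof(int));

    /* Private (ordinary C) dynamic allocation still works as usual. */
    int *priv = (int *) malloc(N * sizeof(int));

    upc_barrier;
    if (MYTHREAD == 0) upc_free(global_buf);   /* free the collectively allocated shared memory once */
    upc_free(local_shared);
    free(priv);
    return 0;
}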

UPC Pointers

There are four different ways of declaring pointers in UPC, and each declares a different type of pointer (a short sketch follows the list):

  1. int *p1; This is a private pointer pointing locally. It can be used to access private data or local shared data.
  2. shared int *p2; This is a private pointer pointing into the shared space. It can be used for independent access by each thread to data in the shared space.
  3. int *shared p3; This is a shared pointer pointing locally, which is not recommended.
  4. shared int *shared p4; This is a shared pointer pointing into the shared space. It can be used for common access by all threads to data in the shared space.
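
A minimal sketch putting the four declarations together (the data and variable names are illustrative):

/* pointers_sketch.c -- illustrative only */
#include <upc_relaxed.h>

shared int data[THREADS];   /* one shared element per thread */
int *shared p3;             /* shared pointer to local (private) data: not recommended */
shared int *shared p4;      /* shared pointer into the shared space */

int main() {
    int priv = MYTHREAD;
    int *p1 = &priv;                     /* private pointer to private data            */
    shared int *p2 = &data[MYTHREAD];    /* private pointer into the shared space      */

    if (MYTHREAD == 0) {
        p3 = &data[0] ? &priv : &priv;   /* p3 holds an address that is only meaningful
                                            to thread 0, which is why it is discouraged */
        p4 = &data[0];
    }
    upc_barrier;

    *p1 = 1;                             /* local access through a private pointer      */
    *p2 = 2;                             /* this thread's portion of the shared space   */
    if (MYTHREAD == THREADS - 1)
        *p4 = 3;                         /* any thread may dereference the shared p4    */
    return 0;
}

The discouraged case (p3) stores a private address in shared space; other threads can read that pointer value but cannot meaningfully dereference it, since it refers to another thread's private memory.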

Co-Array Fortran (CAF)

CAF is a simple extension to Fortran 90 that allows programmers to write efficient parallel applications using a Fortran-like syntax. It assumes the SPMD programming model with replicated data objects called co-arrays. Co-array objects are visible to all processors, and each processor can read and write data belonging to any other processor by setting the index of the co-dimension to the appropriate value.
CAF creates multiple images of the same program, where text and data are replicated in each image. It marks some variables with co-dimensions that behave like normal dimensions and express a logical problem decomposition. It also allows one-sided data exchange between co-arrays using a Fortran-like syntax. On the other hand, CAF requires the underlying run-time system to map the logical problem decomposition onto specific hardware.

CAF Syntax

The CAF syntax is a simple parallel extension to normal Fortran syntax: it uses normal round brackets () to reference data in local memory, and square brackets [] to reference data in remote memory.

CAF Execution Model

The number of images is fixed and each image has its own index, retrievable at run-time. Each image executes the same program independently of the others and works on its own local data. An image moves remote data to local data through explicit CAF syntax while an “object” has the same name in each image. The programmer inserts explicit synchronization and branching as needed.

CAF Memory Model

There are four memory models:

  1. One-to-one model.
  2. Many-to-one model.
  3. One-to-many model.
  4. Many-to-many model.

What do co-dimensions mean?

real :: x(n)[p,q,*]

  • Replicate an array of length n, one on each image.
  • Build a map so each image knows how to find the array on any other image.
  • Organize the images in a logical (not physical) three-dimensional grid.
  • The last co-dimension acts like an assumed-size array (*).
  • A specific implementation could choose to represent the memory hierarchy through the co-dimensions.

CAF I/O

There is one file system visible to all images, and an image can open a file alone or as part of a team. The programmer controls access to the file using direct-access I/O and CAF intrinsic functions.


Titanium

Titanium is based on Java, but at compile time it is first translated to C and then compiled to machine code. It has the same SPMD parallelism model as UPC and CAF, but dynamic Java threads are not supported. The Titanium compiler analyzes global synchronization and optimizes pointers, communication, and memory. Titanium's global address space is based on pointers rather than shared variables. There is no distinction between a private and a shared heap for storing objects; any object may be referenced by global or local pointers.

Titanium Features over Java

  • Multi-dimensional arrays: iterators, sub arrays, copying.
  • Immutable “value” classes.
  • Templates.
  • Operator overloading.
  • Scalable SPMD parallelism replaces threads.
  • Global address space with local/global reference distinction.
  • Checked global synchronization.
  • Zone-based memory management (regions).
  • Libraries for collective communication, distributed arrays, bulk I/O, performance profiling.

Titanium Execution Model

Titanium has the same execution model as UPC and CAF. Basic Java programs may be run as Titanium programs, but then all processors do all the work.

E.g., a parallel hello world:
class HelloWorld {
    public static void main(String[] argv) {
        System.out.println("Hello from proc "
            + Ti.thisProc()
            + " out of "
            + Ti.numProcs());
    }
}

Titanium Runtime Model

The latest versions of Titanium include distributed-memory backends that communicate using GASNet, a high-performance communication interface designed especially for SPMD global address-space languages like Titanium (and UPC) that offers better portability and higher-level operations which can leverage hardware-specific features that support a global-address space model.

Titanium also supports Active Messages 2.0 as the standardized networking interface for some of the older cluster-based parallel backends. Active Messages is a low-level, high-performance communication paradigm first proposed by von Eicken et al. that essentially amounts to a very lightweight RPC mechanism, generally implemented as a zero-copy, fully user-level protocol that is highly tuned to the networking hardware. Titanium uses several different AM 2.0 implementations for various backends: Lanai AM, AMMPI, AMUDP, and AMLAPI.

Titanium Memory Model

The globally shared address space is partitioned; pointers are either local or global, and global pointers may point to remote locations.

Conclusion

  • UPC is easy to program in for C programmers, and at times significantly easier than alternative paradigms.
  • UPC exhibits very little overhead when compared with MPI for parallel problems.
  • The CAF syntax gives the programmer more control and flexibility.
  • Co-dimensions in CAF can match any problem decomposition.
  • CAF performance is better than that of library-based models.
  • Titanium has all the benefits of Java, plus all the features that have been added to handle parallel programming.
