Using MPI on SGI Altix Systems
This document provides usage details for the MPI implementations available on the SGI Altix platform.
Please refer to the MPI page at LRZ for the API documentation. Here we only discuss how to handle the specialities of the MPI implementations available on SGI Altix systems, in particular SGI's proprietary MPT (Message Passing Toolkit).
News (in reverse order)
December 2007
- MPT is updated to version 1.17 with various bug fixes
August 2007
With the MPT updates installed for the Phase 2 system, additional functionality has become available:
- Threaded MPI libraries: A separate module mpi.altix/intel_mt is available which provides a multi-threaded MPI library. It is only necessary to re-link the application.
- One-sided calls: MPT 1.16 now provides additional one-sided MPI calls like MPI_Win_create, MPI_Put, etc.
- Intel MPI: Intel MPI can now run cross-partition.
Parallel Environments
A parallel environment which includes the wrapper scripts mpicc, mpif90, mpiCC and provides a startup mechanism for distributed memory programs (mpirun, mpiexec) is automatically set up at login via the default load of the environment module mpi.altix. Other MPI environments, which can be accessed by switching to a different module, are listed in the following table:
Supported MPI environments

Hardware Interface | Supported Compiler | MPI Flavour | Environment Module Name
---|---|---|---
cache-coherent NUMA; within a partition or across partition boundaries | Intel 8.1 and higher | SGI MPT (version 1.16) | mpi.altix
cache-coherent NUMA; within a partition or across partition boundaries | Intel 9.0 and higher | Intel MPI | mpi.intel
cache-coherent NUMA; within a shared-memory partition only | Intel 9.1 | Open MPI | mpi.ompi
cache-coherent NUMA; within a shared-memory partition only | Intel 8.1 and higher | Shared-memory MPICH. Please read the MPI documentation for the Linux Cluster for details on how to use this. On SGI Altix, this variant should not normally be used. | mpi.shmem
Please consult the HLRB-II batch document for information on the partitioning of the Altix 4700.
Compiling and linking programs
After setting up the environment, your program needs to be compiled and linked. Here are examples for the usage of the Fortran 90, C, and C++ wrappers:
mpif90 -o myfprog.exe myfprog.f90
mpicc -o mycprog.exe mycprog.c
mpiCC -o myCCprog.exe myCCprog.cpp
The compilation step can also be performed separately from the linking; please add the -c compiler switch in this case. Further compiler switches (optimization, debugging, checking, etc.) can be added as for the native Intel compiler calls. In addition, the -vtrace switch, specifiable at compilation as well as linkage, is supported; it will instrument your program for MPI trace collection.
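For example, a separate compile and link of the Fortran program shown above might look as follows (the optimization switch is merely illustrative):

mpif90 -c -O2 -vtrace myfprog.f90
mpif90 -o myfprog.exe -vtrace myfprog.o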
Locating the MPI libraries
Some software packages require the location of the MPI libraries to be configured. If you use the wrapper scripts, you should normally be able to leave the corresponding environment variables empty. If you do not wish to use the wrapper scripts, or if you do mixed-language programming, please specify
- -lmpi -lffio -lsma -lpthread for Fortran,
- -lmpi -lsma -lpthread for C, and
- -lmpi -lmpi++abi1002 -lsma -lpthread for C++.
Note: The -lmpi++abi1002 setting applies on SLES10-based systems; the older -lmpi++ should not be used any more.
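As an illustration, a C object file compiled with the native Intel compiler could be linked against SGI MPT like this (file names are illustrative):

icc -c mycprog.c
icc -o mycprog.exe mycprog.o -lmpi -lsma -lpthread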
Using a non-default environment
The following steps need to be observed:
- Switch over to the desired environment, e.g., module switch mpi.altix mpi.intel
- Completely recompile and re-link your application. This also involves rebuilding any libraries which include MPI functionality.
- Do not forget to also switch to the same environment before running your application. The mpiexec commands are different and incompatible for the different parallel environments.
The reason for this procedure is that SGI MPT is neither source nor binary compatible with the alternative packages; also, the mpiexec command used to start MPI programs built with the alternatives to MPT differs from the PBS-provided version.
Execution of parallel programs
Once you have built your parallel application as described above, there are various methods available to start it up.
SGI MPT in interactive mode
In this case, you can use the mpirun command:
mpirun -np 6 ./myprog.exe
will start up 6 MPI tasks. If your program was also compiled with OpenMP and the OMP_NUM_THREADS environment variable is set to a value ≠ 1, additional threads may also be started by each MPI task.
MPMD startup is also supported via the syntax
mpirun -np 6 ./myprog1.exe : -np 4 ./myprog2.exe
Note: if the executable is not located in your current directory, MPI startup will be unsuccessful, since mpirun does not take account of the entries in your PATH variable. Please use the full path name of your executable in this case. If the executable itself can be located via an entry in $PATH, the following command will work for the bash shell:
mpirun -np 6 $(which myprog.exe)
SGI MPT in batch mode
In this case (which also includes PBS interactive shells!), we strongly recommend that you use the MPI-2 style mpiexec command delivered with PBS to start up your program. In particular, multi-partition runs will only work properly if mpiexec is used. As a rule, all necessary setup information is automatically read from the PBS configuration file, hence it is usually sufficient to specify
mpiexec ./myprog.exe
Note that if you wish to specify the number of MPI tasks (which is especially necessary when running MPMD style programs), you need to use the -n switch (instead of -np):
mpiexec -n 8 ./myprog1.exe : -n 12 ./myprog2.exe
will start up 8 tasks of myprog1.exe and 12 tasks of myprog2.exe, which start off using a common MPI_COMM_WORLD with 20 tasks.
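As a rough sketch, a minimal PBS job script for an SGI MPT run could look as follows; the resource request is site-specific and only indicated here, so please take the actual directives from the HLRB-II batch documentation:

#!/bin/bash
#PBS -N my_mpi_job
#PBS -l walltime=01:00:00
# add the HLRB-II specific CPU/memory request directives here
cd $PBS_O_WORKDIR
mpiexec ./myprog.exe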
Intel MPI / OpenMPI
Intel's MPI implementation as well as Open MPI can presently be used only within one shared-memory partition. Switch over to the mpi.intel or mpi.ompi environment to make use of one of these packages. For both alternatives it is recommended to use the provided mpiexec command for starting parallel programs.
Controlling MPI execution for SGI MPT
mpirun options
The following table provides a selection of options which can be set as flags for the mpirun command.
Flag | Explanation
---|---
-f file_name | Pick up the mpirun arguments from the file file_name.
-p prefix_string | Specifies a string to prepend to each line of output from stderr and stdout for each MPI process.
-stats | Prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. Users can combine this option with the -p option to prefix the statistics messages with the MPI rank. For more details, see the MPI_SGI_stat_print(3) man page.
-v | Displays comments on what mpirun is doing when launching the MPI application.
Using memory mapping
Memory mapping is a facility within SGI MPT which provides optimized communication behaviour for some applications by enabling, e.g., single-copy mechanisms. For some MPT features, e.g. one-sided calls, shmem calls or global shared memory, using memory mapping is in fact mandatory. By default, this feature is enabled for SGI MPT.
However, memory mapping also has a downside: it makes extensive use of pinned memory pages, which may considerably and uncontrollably increase the memory usage of your application unless you take steps to prevent this. The following alternatives are available (a short example follows the list):
- Deactivate default single copy by setting MPI_DEFAULT_SINGLE_COPY_OFF to any value. This will keep memory mapping available for those routines for which it is mandatory.
- Increase the value of MPI_BUFFER_MAX. This will suppress using single copy for all messages smaller than the supplied value.
- Deactivate memory mapping altogether by setting MPI_MEMMAP_OFF. Beware that certain functionality for which memory mapping is mandatory will not work in this case.
- Limit mapped memory usage by setting the MPI_MAPPED_HEAP_SIZE and MPI_MAPPED_STACK_SIZE to some value not too much larger than the maximum size of your messages. Since a silent changeover to non-mapped memory may have a performance impact, you will need to re-check performance after adjusting to new values.
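For example, the second and third alternatives above translate into settings like the following (the threshold value is purely illustrative; normally only one of them would be set for a given run):

export MPI_BUFFER_MAX=2097152
export MPI_MEMMAP_OFF=1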
A more detailed description of the aforementioned environment variables is given in the table below. All changes to the default environment may incur performance variations, which in turn can depend on your application's message sizes. Hence you need to be very careful to tune properly for your application and its setup.
Memory usage when using memory mapping
Looking at memory usage with tools like ps or top while memory mapping is enabled may indicate a very large memory overhead. In fact, this overhead is not real: the pinned memory pages are accounted to each process by the Linux kernel even though only one instance of them exists. If you want to obtain a reliable estimate of memory usage, you need to disable memory mapping.
Environment variables
MPI execution can be more finely controlled by setting certain environment variables to suitable values. The exact settings may depend on the application as well as the parallel configuration the application is run on. The mpi.altix module will perform some settings where deviations from the SGI defaults appear reasonable; but of course the user may need to make further changes. Some settings have considerable performance impact!
Name | Function | Remarks
---|---|---
**Controlling task distribution (e.g., for hybrid parallelism)** | |
MPI_DSM_CPULIST | Specifies a list of CPUs (relative to the current CPUset) on which to run an MPI application. | Unset by default. Usually only necessary for complex setups like hybrid and/or MPMD jobs.
MPI_DSM_DISTRIBUTE | Activates NUMA job placement mode. This mode ensures that each MPI process gets a unique CPU and physical memory on the node with which that CPU is associated. The CPUs are chosen by simply starting at relative CPU 0 and incrementing until all MPI processes have been forked. To choose specific CPUs, use the MPI_DSM_CPULIST environment variable. | LRZ/PBS sets this by default.
MPI_DSM_PPM | Sets the number of MPI processes per blade. The value must be less than or equal to the number of cores per blade (or memory channel). | The default is the number of cores per blade.
MPI_OPENMP_INTEROP | Setting this variable modifies the placement of MPI processes to better accommodate the OpenMP threads associated with each process. For this variable to take effect, you must also set MPI_DSM_DISTRIBUTE. | Set to any value to enable.
MPI_OMP_NUM_THREADS | Can be set to a colon-separated list of positive integers, representing the value of the OMP_NUM_THREADS environment variable for each host-program specification on the mpirun command line. | Set to the OMP_NUM_THREADS value by default, or 1 if OMP_NUM_THREADS is unset.
**Controlling task execution** | |
MPI_NAP | Affects the way in which ranks wait for events to occur. | Setting this to a moderate value is useful for master-slave codes where the master shares CPU resources with one of the slaves. Defining MPI_NAP without a value is best if the system is oversubscribed (there are more processes ready to run than there are CPUs). Leaving MPI_NAP undefined is best if sends and matching receives occur nearly simultaneously.
MPI_UNBUFFERED_STDIO | Disables buffering of stdout/stderr. If MPI processes produce very long output lines, the program may crash because the STDIO buffer is exhausted; disabling buffering avoids this. | Set to any value to enable. If enabled, the option -prefix is ignored.
**Memory mapping, remote memory access** | |
MPI_BUFFER_MAX | Specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer. | LRZ sets this to a default of 32768.
MPI_DEFAULT_SINGLE_COPY_OFF | Disables the default single-copy mode. Users of MPI_Send should continue to use the MPI_BUFFER_MAX environment variable to control single copy. | If unset, single-copy mode is enabled; this causes transfers of more than 2000 bytes that use MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce to use the single-copy optimization. Set to any value to disable.
MPI_MAPPED_HEAP_SIZE | Sets the new size (in bytes) for the amount of heap that is memory mapped per MPI process. | Default: the physical memory available per CPU less the static region size. This variable only has an effect if memory mapping is on.
MPI_MAPPED_STACK_SIZE | Sets the new size (in bytes) for the amount of stack that is memory mapped per MPI process. | Default: the stack size limit. If the stack size is set to unlimited, the mapped region is set to the physical memory available per CPU. This variable only has an effect if memory mapping is on.
MPI_MEMMAP_OFF | Turns off the memory mapping feature. The memory mapping feature provides support for single-copy transfers and MPI-2 one-sided communication on Linux for single- and multi-partition jobs. | Unset by default. Set to any value to switch memory mapping off.
**Diagnostics and debugging support** | |
MPI_CHECK_ARGS | Enables run-time checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI. | Useful for debugging. Adds several microseconds to latency!
MPI_COREDUMP | Controls which ranks of an MPI job can dump core on receipt of a core-dumping signal. | The default setting is as if FIRST were specified. Please note that you will need to issue the command ulimit -c unlimited before starting MPI execution to actually obtain core files, since a maximum core size of 0 is the system default. Intel's idb is used to generate the traceback information; use of the -g -traceback compilation switches is recommended to enable source location.
MPI_DSM_VERBOSE | Prints information about process placement unless MPI_DSM_OFF is also set. Output is sent to stderr. | Unset by default. Set to any value to enable.
MPI_MEMMAP_VERBOSE | Displays additional information regarding the memory mapping initialization sequence. Output is sent to stderr. | Unset by default. Set to any value to enable.
MPI_SHARED_VERBOSE | Displays some diagnostic information concerning messaging within a host on stderr. | Off by default.
MPI_SLAVE_DEBUG_ATTACH | Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup describing how to attach to it from another window using the gdb or idb debugger. The message includes the number of seconds you have to attach the debugger to process N. If you fail to attach before the time expires, the process continues. | Off by default.
MPI_STATS | Enables printing of MPI internal statistics. | Off by default. Note: this variable should not be set if the program uses threads.
**MPI-internal limits** | |
MPI_BUFS_PER_HOST | Number of shared message buffers (of size 16 KB) that MPI is to allocate for each host. These buffers are used to send and receive long inter-partition messages. | SGI default is 32. Increase if the default buffering proves insufficient.
MPI_BUFS_PER_PROC | Number of shared message buffers (of size 16 KB) that MPI is to allocate for each MPI task. These buffers are used to send and receive long intra-partition messages. | SGI default is 32. Increase if the default buffering proves insufficient.
MPI_COMM_MAX | Maximum number of communicators that can be used in an MPI program. | Default value is 256.
MPI_GROUP_MAX | Maximum number of groups available for each MPI process. | Default value is 32.
MPI_MSGS_MAX | Controls the total number of message headers (of size 128 kBytes) that can be allocated. This allocation applies to messages exchanged between processes on a single host. If you set this variable, specify the maximum number of message headers. | May improve performance if your application generates many small messages. Default is 512.
MPI_REQUEST_MAX | Determines the maximum number of nonblocking sends and receives that can simultaneously exist for any single MPI process. Use this variable to increase the internal default limit. MPI generates an error message if this limit (or the default, if not set) is exceeded. | The default value is 16384.
MPI_TYPE_DEPTH | Sets the maximum number of nesting levels for derived data types. Limits the maximum depth of derived data types that an application can create. MPI generates an error message if this limit (or the default, if not set) is exceeded. | By default, 8 levels can be used.
MPI_TYPE_MAX | Determines the maximum number of data types that can simultaneously exist for any single MPI process. Use this variable to increase the internal default limit. MPI generates an error message if this limit (or the default, if not set) is exceeded. | 1024 by default.
Controlling MPI execution for Intel MPI
Environment variables
Here some of the environment variables controlling the execution of Intel MPI are described.
Name | Function | Remarks
---|---|---
**SGI Altix xpmem DAPL specific settings** | |
DAPL_XPMEM_MEMMAP_VERBOSE | Can be set (to any value) to issue additional diagnostic messages, especially if your code crashes with DAPL-related error messages. | Unset by default.
DAPL_XPMEM_MAPPED_HEAP_SIZE | Must be set to the size of the statically defined heap memory area (in bytes) if your code has many large statically defined arrays. Not setting this may lead to crashes in the DAPL setup phase ("Assertion `remote_ep' failed"). | The command size myprog.exe will give you an estimate of the upper limit to set.
DAPL_XPMEM_MAPPED_STACK_SIZE | Must be set to the size of the statically defined stack memory area (in bytes) if your code has many automatic arrays which turn out to be large. | It may be a good idea to use the -heap-arrays compilation switch to circumvent usage of this variable.
Extensions to MPI, special topics
The discussion in this section pertains to all available MPI implementations; where this is not the case, an explicit remark is given.
The shmem programming interface
In addition to MPI calls, a SPMD parallel program on Altix systems may also use the efficiently implemented shmem library calls. These make use of the RDMA facilities of the SGI interconnect (NUMAlink); indeed they also work across partition boundaries. Shmem calls are generally similar in semantics to one-sided MPI communication calls, but easier to use: For example,
shmem_double_put(target,source,len,pe)
is a facility for transferring a double precision array source(len) to the location target(len) on the remote process pe. The target object must be remotely accessible (also called symmetric), i.e. typically either a static array or memory dynamically allocated by executing the collective call shpalloc(3F) on a suitably defined Cray-type pointer. Repeatedly executed shmem calls targeting the same process usually require an additional synchronization call (in the above case, shmem_fence()) to enforce memory ordering. Also note that the interface is not generic: for each data type there is a distinct API call (if one is available at all). Here is a list of further functionality available:
- shmem_get: transfer data from remote to local process
- shmem_ptr: return a pointer to a memory location on a remote process
- collective calls for reduction, broadcast, barrier
- administrative calls for starting up and getting process IDs: If shmem is used in conjunction with MPI, please use the standard MPI administrative calls instead!
Due to the cache coherency properties of the Altix systems, the cache management functions - while still available for compatibility - are not actually required. Please consult the documentation referenced below for detailed shmem information. This programming paradigm is only available when using SGI MPT.
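To illustrate the typical call sequence, here is a minimal C sketch of a put between two PEs. It assumes the SGI shmem C interface (header mpp/shmem.h, initialization via start_pes, and the _my_pe/_num_pes query calls); please verify the exact names in shmem_intro(1) for your MPT version.

/* Minimal SHMEM put between two PEs (SGI MPT only); link with -lsma. */
#include <stdio.h>
#include <mpp/shmem.h>

#define LEN 8

/* Static arrays are symmetric, i.e. remotely accessible. */
static double source[LEN], target[LEN];

int main(void)
{
    int i, me, npes;

    start_pes(0);                /* initialize the shmem library */
    me   = _my_pe();
    npes = _num_pes();

    for (i = 0; i < LEN; i++)
        source[i] = 100.0 * me + i;

    if (npes > 1 && me == 0)
        /* copy source(0:LEN-1) into target(0:LEN-1) on PE 1 */
        shmem_double_put(target, source, LEN, 1);

    shmem_barrier_all();         /* completes the put and synchronizes all PEs */

    if (me == 1)
        printf("PE 1: target[0] = %f\n", target[0]);
    return 0;
}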
Global shared memory
The GSM feature provides expanded shared memory capabilities across partitioned Altix systems and additional shared memory placement specifications within a single host configuration. Additional (though non-portable) API calls provide a way to allocate a global shared memory segment with the desired placement options, free that segment, and obtain information about it. For example, calling the subroutine
gsm_alloc(len, placement, flags, comm, base, ierror)
will provide a memory segment of size len bytes at address base (accessed via a Cray-type pointer) for all processes in the MPI communicator comm. Data written to this segment by any process will be visible to all other processes after a synchronization call (usually MPI_Barrier). Please consult the documentation referenced below for detailed GSM information. This programming paradigm is only available when using SGI MPT.
MPI-2 issues
Supported features
The following table gives an overview of the MPI-2 features which are known to be supported by at least one implementation available on the system.
Feature | SGI MPT | Intel MPI (latest release) |
---|---|---|
MPI-IO | yes | yes |
One-sided communication | mostly (alternative: shmem) | mostly |
Language Bindings | yes (Intel compilers) | |
Process Creation/Management | yes | no |
External Interfaces | partially | mostly |
If you find anything missing (especially subfeatures which you urgently need), please drop a mail to LRZ HPC support.
Memory management via MPI_Alloc_mem
In order to safely use allocated memory within, e.g., MPI one-sided communication routines, one should use MPI_Alloc_mem instead of Fortran 90 ALLOCATE or C malloc. Until the Fortran 2003 C interoperability features become available, a non-standard language extension, namely Cray pointers, is needed for calls from Fortran, since MPI_Alloc_mem expects a C pointer argument. The following code fragment shows how to handle this situation with double precision buffers:
integer(kind=MPI_ADDRESS_KIND) size, intptr
integer :: itemsize, length
! Specify largest anticipated value. Will not be actually allocated at this point
pointer (baseptr, farray(200))
double precision :: farray
:
call MPI_Type_extent(MPI_DOUBLE_PRECISION, itemsize, ierror)
! length at most 200
size = length*itemsize
call MPI_Alloc_mem(size, MPI_INFO_NULL, intptr, ierror)
! enable access to allocated storage via farray:
baseptr = intptr
:
! Now perform one-sided calls etc.
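For C codes no Cray pointer workaround is needed, since MPI_Alloc_mem directly returns a usable pointer. A minimal sketch (function and variable names are illustrative):

#include <mpi.h>

/* Allocate a buffer of 'length' doubles suitable for one-sided communication. */
double *alloc_rma_buffer(int length)
{
    double *farray = NULL;
    MPI_Aint size = (MPI_Aint) length * (MPI_Aint) sizeof(double);

    MPI_Alloc_mem(size, MPI_INFO_NULL, &farray);
    return farray;               /* release later with MPI_Free_mem(farray) */
}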
Hybrid parallelism
Using the two distinct programming models, distributed memory parallelism and shared memory parallelism, in a single program is known as hybrid parallelism. Usually this involves using MPI calls together with either OpenMP directives or multi-threaded library calls. Obtaining good performance and scalability for such a program is in many cases a more difficult and complex task than using one parallel paradigm on its own.
Programming considerations
For performance reasons, you might want to execute MPI calls from within threaded regions of your program. In this case, you must replace the initial MPI_Init call in your MPI program by MPI_Init_thread, specifying the level of threading support you need. The routine will then return the threading level actually available for the implementation. Beware that this level may well be lower than what you first requested. Hence provisions must be made within your program for possible limitations of multi-threaded MPI execution. Furthermore, even if multi-threading is fully supported, making use of it may well incur a performance penalty within the MPI substrate.
Further details are described in the man page for the MPI_Init_thread API call.
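As an illustration, a minimal C sketch of this negotiation (error handling omitted) could look as follows:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request the highest level; the library may grant a lower one. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr,
                "Warning: only thread support level %d is provided\n", provided);

    /* ... hybrid MPI/OpenMP work, restricted according to 'provided' ... */

    MPI_Finalize();
    return 0;
}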
Overview of supported thread level parallelism for available MPI implementations

MPI Flavour and Version | provides | Remarks
---|---|---
SGI MPT 1.16 | MPI_THREAD_SINGLE | MPI calls only in the serial region. The default MPT is not thread-safe.
SGI MPT 1.16 | MPI_THREAD_MULTIPLE | Perform module unload mpi.altix; module load mpi.altix/intel_mt to switch to the fully thread-safe libraries.
Intel MPI 2.0 | MPI_THREAD_FUNNELED | Threaded MPI calls only from the master thread.
Intel MPI 3.0 | MPI_THREAD_MULTIPLE | Threaded MPI calls from any thread. Please specify the -mt_mpi or the -threads option. The latter will not only link the thread-safe libraries, but also generally perform thread-safe linkage. The default libraries only support MPI_THREAD_SINGLE.
Open MPI 1.1 | MPI_THREAD_MULTIPLE | Threaded MPI calls from any thread. Only lightly tested.
Setting up a hybrid program run - Case 1: uniform threading
Please proceed as follows:
export OMP_NUM_THREADS=4
export MPI_OPENMP_INTEROP=
mpirun -np 6 ./myprog.exe
This will start up six MPI tasks with 4 threads each. The MPI_OPENMP_INTEROP variable tells MPT to space out the MPI tasks so as to keep all forked threads as near to the master thread as possible. For the other MPI implementations, however, this variable has no effect; you will need to perform your own pinning (e.g., via the LRZ service library). For Intel MPI there is also built-in pinning functionality available, which will be described here once tested.
Setting up a hybrid program run - Case 2: non-uniform threading
You may have two programs which form a MPMD unit, or wish to run your single program with varying thread numbers. Please proceed as follows:
export OMP_NUM_THREADS=1
export MPI_OMP_NUM_THREADS=4:8
export MPI_OPENMP_INTEROP=
mpirun -np 2 ./myprog1.exe : -np 1 ./myprog2.exe
This will start up two 4-threaded tasks of myprog1.exe and one 8-threaded task of myprog2.exe. The OMP_NUM_THREADS variable is ignored unless there are too few entries in MPI_OMP_NUM_THREADS, in which case the OMP_NUM_THREADS value is used for the missing entries.
For non-MPT MPI implementations, external specification of a varying thread number is not supported; you need to use, e.g., the OpenMP call omp_set_num_threads(n) within your program, and the placement/pinning issues mentioned in the previous subsection still need to be dealt with.
Coping with inhomogeneity
Since the HLRB-II has become slightly inhomogeneous with the phase 2 upgrade, it may be necessary to cope with this situation if (very large) jobs are run which need to make use of both high density (4 cores per memory channel) and high bandwidth (2 cores per memory channel) blades. One possibility to avoid workload imbalances is to simply set
export MPI_DSM_PPM=2
which will have no effect on the bandwidth blades, but will only use one socket on the density blades. If hybrid parallelism should be used on the basis of 2 threads per socket, the combination
export MPI_DSM_PPM=1
export OMP_NUM_THREADS=2
export KMP_AFFINITY=compact,0
should do the job.
Documentation
General Information on MPI
Please refer to the MPI page at LRZ for the API documentation and information about the different flavors of MPI available.
Manual pages
- For SGI MPT, please consult the man pages mpi(1) and mpirun(1). Also, each MPI API call has its own man page.
- For the shmem API, consult the man page shmem_intro (1). Again, each shmem routine has its individual man page.
- For the global shared memory API, consult the man page gsm_intro (1), which also contains references to further API calls.
SGI's MPT documentation
Some of the following links lead to a password-protected area. To obtain user name and password for access, please type the command get_manuals_passwd when logged in to the system.
- MPT page on SGI's web site with summary information
- SGI MPT User's Guide, in PDF (200 kByte) and HTML format.
Intel MPI documentation
See the previous subsection for how to access the password-protected area on the LRZ web server.
Version | Available Docs |
---|---|
3.0 | Release Notes (Text file), Getting Started (PDF), and Reference Manual (PDF) |
OpenMPI documentation
Apart from a manual page for mpirun there is a FAQ on the OpenMPI web site.