Using MPI on SGI Altix Systems
This document provides usage details for the MPI implementations available on the SGI Altix platform.
Please refer to the MPI page at LRZ for the API documentation. Here we only discuss how to handle the specialities of the MPI implementations available on SGI Altix systems, in particular SGI's proprietary MPT (Message Passing Toolkit).
News (in reverse order)
December 2007
- MPT is updated to version 1.17 with various bug fixes
August 2007
With the MPT updates installed for the Phase 2 system, additional functionality has become available:
- Threaded MPI libraries: A separate module mpi.altix/intel_mt is available which provides a multi-threaded MPI library. It is only necessary to re-link the application.
- One-sided calls: MPT 1.16 now provides additional one-sided MPI calls like MPI_Win_create, MPI_Put, etc.
- Intel MPI: Intel MPI can now run cross-partition.
Parallel Environments
A parallel environment which includes the wrapper scripts mpicc, mpif90, mpiCC and provides a startup mechanism for distributed memory programs (mpirun, mpiexec) is automatically set up at login via the default load of the environment module mpi.altix. Other MPI environments, which can be accessed by switching to a different module, are listed in the following table:
Supported MPI environments

Hardware Interface | Supported Compiler | MPI Flavour | Environment Module Name
---|---|---|---
cache-coherent NUMA; within a partition or across partition boundaries | Intel 8.1 and higher | SGI MPT (version 1.16) | mpi.altix
cache-coherent NUMA; within a partition or across partition boundaries | Intel 9.0 and higher | Intel MPI | mpi.intel
cache-coherent NUMA; within a shared-memory partition only | Intel 9.1 | Open MPI | mpi.ompi
cache-coherent NUMA; within a shared-memory partition only | Intel 8.1 and higher | Shared-memory MPICH. Please read the MPI documentation for the Linux Cluster for details on how to use this. On SGI Altix, this variant should not normally be used. | mpi.shmem
Please consult the HLRB-II batch document for information on the partitioning of the Altix 4700.
Compiling and linking programs
After setting up the environment, your program needs to be compiled and linked. Here are examples for the usage of the Fortran 90, C, and C++ wrappers:
mpif90 -o myfprog.exe myfprog.f90
mpicc -o mycprog.exe mycprog.c
mpiCC -o myCCprog.exe myCCprog.cpp
The compilation step can also be performed separately from the linking; please add the -c compiler switch in this case. Further compiler switches (optimization, debugging, checking, etc.) can be added as for the native Intel compiler calls. In addition, the -vtrace switch, specifiable at compilation as well as linkage, is supported; it will instrument your program for MPI trace collection.
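For example, a separate compile and link of the Fortran program shown above might look as follows (the optimization switch is merely illustrative):

mpif90 -c -O2 -vtrace myfprog.f90
mpif90 -o myfprog.exe -vtrace myfprog.o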
Locating the MPI libraries
Some software packages require the location of the MPI libraries to be configured. If you use the wrapper scripts, you should normally be able to leave the corresponding environment variables empty. If you do not wish to use the wrapper scripts, or if you do mixed-language programming, please specify
- -lmpi -lffio -lsma -lpthread for Fortran,
- -lmpi -lsma -lpthread for C, and
- -lmpi -lmpi++abi1002 -lsma -lpthread for C++.
Note: The -lmpi++abi1002 setting applies on SLES10-based systems; the older -lmpi++ should not be used any more.
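As an illustration, a C object file compiled with the native Intel compiler could be linked against SGI MPT like this (file names are illustrative):

icc -c mycprog.c
icc -o mycprog.exe mycprog.o -lmpi -lsma -lpthread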
Using a non-default environment
The following steps need to be observed:
- Switch over to the desired environment, e.g., module switch mpi.altix mpi.intel
- Completely recompile and re-link your application. This also involves rebuilding any libraries which include MPI functionality.
- Do not forget to also switch to the same environment before running your application. The mpiexec commands are different and incompatible for the different parallel environments.
The reason for this procedure is that SGI MPT is neither source nor binary compatible with the alternative packages; also, the mpiexec command used to start MPI programs built with the alternatives to MPT differs from the PBS-provided version.
Execution of parallel programs
Once you have built your parallel application as described above, there are various methods available to start it up.
SGI MPT in interactive mode
In this case, you can use the mpirun command:
mpirun -np 6 ./myprog.exe
will start up 6 MPI tasks. If your program was also compiled with OpenMP and the OMP_NUM_THREADS environment variable is set to a value ≠ 1, additional threads may also be started by each MPI task.
MPMD startup is also supported via the syntax
mpirun -np 6 ./myprog1.exe : -np 4 ./myprog2.exe
Note: if the executable is not located in your current directory, MPI startup will be unsuccessful, since mpirun does not take account of the entries in your PATH variable. Please use the full path name of your executable in this case. If the executable itself can be located via an entry in $PATH, the following command will work for the bash shell:
mpirun -np 6 $(which myprog.exe)
SGI MPT in batch mode
In this case (which also includes PBS interactive shells!), we strongly recommend that you use the MPI-2 style mpiexec command delivered with PBS to start up your program. In particular, multi-partition runs will only work properly if mpiexec is used. As a rule, all necessary setup information is automatically read from the PBS configuration file, hence it is usually sufficient to specify
mpiexec ./myprog.exe
Note that if you wish to specify the number of MPI tasks (which is especially necessary when running MPMD style programs), you need to use the -n switch (instead of -np):
mpiexec -n 8 ./myprog1.exe : -n 12 ./myprog2.exe
will start up 8 tasks of myprog1.exe and 12 tasks of myprog2.exe, which start off using a common MPI_COMM_WORLD with 20 tasks.
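As a rough sketch, a minimal PBS job script for an SGI MPT run could look as follows; the resource request is site-specific and only indicated here, so please take the actual directives from the HLRB-II batch documentation:

#!/bin/bash
#PBS -N my_mpi_job
#PBS -l walltime=01:00:00
# add the HLRB-II specific CPU/memory request directives here
cd $PBS_O_WORKDIR
mpiexec ./myprog.exe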
Intel MPI / OpenMPI
Intel's MPI implementation as well as Open MPI can presently be used only within one shared-memory partition. Switch over to the mpi.intel or mpi.ompi environment to make use of one of these packages. For both alternatives it is recommended to use the provided mpiexec command for starting parallel programs.
Controlling MPI execution for SGI MPT
mpirun options
The following table provides a selection of options which can be set as flags for the mpirun command.
Flag | Explanation
---|---
-f file_name | Pick up the mpirun arguments from the file file_name.
-p prefix_string | Specifies a string to prepend to each line of output from stderr and stdout for each MPI process.
-stats | Prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. Users can combine this option with the -p option to prefix the statistics messages with the MPI rank. For more details, see the MPI_SGI_stat_print(3) man page.
-v | Displays comments on what mpirun is doing when launching the MPI application.
Using memory mapping
Memory mapping is a facility within SGI MPT which provides optimized communication behaviour for some applications by enabling, e.g., single-copy mechanisms. For some MPT features, e.g. one-sided calls, shmem calls or global shared memory, using memory mapping is in fact mandatory. By default, this feature is enabled for SGI MPT.
However, memory mapping also has a downside: it makes extensive use of pinned memory pages, which may considerably and uncontrollably increase the memory usage of your application unless you take steps to prevent this. The following alternatives are available (a short example follows the list):
- Deactivate default single copy by setting MPI_DEFAULT_SINGLE_COPY_OFF to any value. This will keep memory mapping available for those routines for which it is mandatory.
- Increase the value of MPI_BUFFER_MAX. This will suppress using single copy for all messages smaller than the supplied value.
- Deactivate memory mapping altogether by setting MPI_MEMMAP_OFF. Beware that certain functionality for which memory mapping is mandatory will not work in this case.
- Limit mapped memory usage by setting the MPI_MAPPED_HEAP_SIZE and MPI_MAPPED_STACK_SIZE to some value not too much larger than the maximum size of your messages. Since a silent changeover to non-mapped memory may have a performance impact, you will need to re-check performance after adjusting to new values.
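For example, the second and third alternatives above translate into settings like the following (the threshold value is purely illustrative; normally only one of them would be set for a given run):

export MPI_BUFFER_MAX=2097152
export MPI_MEMMAP_OFF=1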
A more detailed description of the aforementioned environment variables is given in the table below. All changes to the default environment may incur performance variations, which in turn can depend on your application's message sizes. Hence you need to be very careful to tune properly for your application and its setup.
Memory usage when using memory mapping
Looking at memory usage with tools like ps or top while memory mapping is enabled may indicate a very large memory overhead. In fact, this overhead is not real: the pinned memory pages are accounted to each process by the Linux kernel even though only one instance of them exists. If you want to obtain a reliable estimate of memory usage, you need to disable memory mapping.
Environment variables
MPI execution can be more finely controlled by setting certain environment variables to suitable values. The exact settings may depend on the application as well as the parallel configuration the application is run on. The mpi.altix module will perform some settings where deviations from the SGI defaults appear reasonable; but of course the user may need to make further changes. Some settings have considerable performance impact!
Name | Function | Remarks
---|---|---
**Controlling task distribution (e.g., for hybrid parallelism)** | |
MPI_DSM_CPULIST | Specifies a list of CPUs (relative to the current CPUset) on which to run an MPI application. | Unset by default. Usually only necessary for complex setups like hybrid and/or MPMD jobs.
MPI_DSM_DISTRIBUTE | Activates NUMA job placement mode. This mode ensures that each MPI process gets a unique CPU and physical memory on the node with which that CPU is associated. The CPUs are chosen by simply starting at relative CPU 0 and incrementing until all MPI processes have been forked. To choose specific CPUs, use the MPI_DSM_CPULIST environment variable. | LRZ/PBS sets this by default.
MPI_DSM_PPM | Sets the number of MPI processes per blade. The value must be less than or equal to the number of cores per blade (or memory channel). | The default is the number of cores per blade.
MPI_OPENMP_INTEROP | Setting this variable modifies the placement of MPI processes to better accommodate the OpenMP threads associated with each process. For this variable to take effect, you must also set MPI_DSM_DISTRIBUTE. | Set to any value to enable.
MPI_OMP_NUM_THREADS | Can be set to a colon-separated list of positive integers, representing the value of the OMP_NUM_THREADS environment variable for each host-program specification on the mpirun command line. | Set to the OMP_NUM_THREADS value by default, or 1 if OMP_NUM_THREADS is unset.
**Controlling task execution** | |
MPI_NAP | Affects the way in which ranks wait for events to occur. | Setting this to a moderate value is useful for master-slave codes where the master shares CPU resources with one of the slaves. Defining MPI_NAP without a value is best if the system is oversubscribed (there are more processes ready to run than there are CPUs). Leaving MPI_NAP undefined is best if sends and matching receives occur nearly simultaneously.
MPI_UNBUFFERED_STDIO | Disables buffering of stdout/stderr. If MPI processes produce very long output lines, the program may crash because the STDIO buffer is exhausted; disabling buffering avoids this. | Set to any value to enable. If enabled, the option -prefix is ignored.
**Memory mapping, remote memory access** | |
MPI_BUFFER_MAX | Specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer. | LRZ sets this to a default of 32768.
MPI_DEFAULT_SINGLE_COPY_OFF | Disables the default single-copy mode. Users of MPI_Send should continue to use the MPI_BUFFER_MAX environment variable to control single copy. | If unset, single-copy mode is enabled; this causes transfers of more than 2000 bytes that use MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce to use the single-copy optimization. Set to any value to disable.
MPI_MAPPED_HEAP_SIZE | Sets the new size (in bytes) for the amount of heap that is memory mapped per MPI process. | Default: the physical memory available per CPU less the static region size. This variable only has an effect if memory mapping is on.
MPI_MAPPED_STACK_SIZE | Sets the new size (in bytes) for the amount of stack that is memory mapped per MPI process. | Default: the stack size limit. If the stack size is set to unlimited, the mapped region is set to the physical memory available per CPU. This variable only has an effect if memory mapping is on.
MPI_MEMMAP_OFF | Turns off the memory mapping feature. The memory mapping feature provides support for single-copy transfers and MPI-2 one-sided communication on Linux for single- and multi-partition jobs. | Unset by default. Set to any value to switch memory mapping off.
**Diagnostics and debugging support** | |
MPI_CHECK_ARGS | Enables run-time checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI. | Useful for debugging. Adds several microseconds to latency!
MPI_COREDUMP | Controls which ranks of an MPI job can dump core on receipt of a core-dumping signal. | The default setting is as if FIRST were specified. Please note that you will need to issue the command ulimit -c unlimited before starting MPI execution to actually obtain core files, since a maximum core size of 0 is the system default. Intel's idb is used to generate the traceback information; use of the -g -traceback compilation switches is recommended to enable source location.
MPI_DSM_VERBOSE | Prints information about process placement unless MPI_DSM_OFF is also set. Output is sent to stderr. | Unset by default. Set to any value to enable.
MPI_MEMMAP_VERBOSE | Displays additional information regarding the memory mapping initialization sequence. Output is sent to stderr. | Unset by default. Set to any value to enable.
MPI_SHARED_VERBOSE | Displays some diagnostic information concerning messaging within a host on stderr. | Off by default.
MPI_SLAVE_DEBUG_ATTACH | Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup describing how to attach to it from another window using the gdb or idb debugger. The message includes the number of seconds you have to attach the debugger to process N. If you fail to attach before the time expires, the process continues. | Off by default.
MPI_STATS | Enables printing of MPI internal statistics. | Off by default. Note: this variable should not be set if the program uses threads.
**MPI-internal limits** | |
MPI_BUFS_PER_HOST | Number of shared message buffers (of size 16 KB) that MPI is to allocate for each host. These buffers are used to send and receive long inter-partition messages. | SGI default is 32. Increase if the default buffering proves insufficient.
MPI_BUFS_PER_PROC | Number of shared message buffers (of size 16 KB) that MPI is to allocate for each MPI task. These buffers are used to send and receive long intra-partition messages. | SGI default is 32. Increase if the default buffering proves insufficient.
MPI_COMM_MAX | Maximum number of communicators that can be used in an MPI program. | Default value is 256.
MPI_GROUP_MAX | Maximum number of groups available for each MPI process. | Default value is 32.
MPI_MSGS_MAX | Controls the total number of message headers (of size 128 kBytes) that can be allocated. This allocation applies to messages exchanged between processes on a single host. If you set this variable, specify the maximum number of message headers. | May improve performance if your application generates many small messages. Default is 512.
MPI_REQUEST_MAX | Determines the maximum number of nonblocking sends and receives that can simultaneously exist for any single MPI process. Use this variable to increase the internal default limit. MPI generates an error message if this limit (or the default, if not set) is exceeded. | The default value is 16384.
MPI_TYPE_DEPTH | Sets the maximum number of nesting levels for derived data types. Limits the maximum depth of derived data types that an application can create. MPI generates an error message if this limit (or the default, if not set) is exceeded. | By default, 8 levels can be used.
MPI_TYPE_MAX | Determines the maximum number of data types that can simultaneously exist for any single MPI process. Use this variable to increase the internal default limit. MPI generates an error message if this limit (or the default, if not set) is exceeded. | 1024 by default.
Controlling MPI execution for Intel MPI
Environment variables
Here some of the environment variables controlling the execution of Intel MPI are described.
Name | Function | Remarks
---|---|---
**SGI Altix xpmem DAPL specific settings** | |
DAPL_XPMEM_MEMMAP_VERBOSE | Can be set (to any value) to issue additional diagnostic messages, especially if your code crashes with DAPL-related error messages. | Unset by default.
DAPL_XPMEM_MAPPED_HEAP_SIZE | Must be set to the size of the statically defined heap memory area (in bytes) if your code has many large statically defined arrays. Not setting this may lead to crashes in the DAPL setup phase ("Assertion `remote_ep' failed"). | The command size myprog.exe will give you an estimate of the upper limit to set.
DAPL_XPMEM_MAPPED_STACK_SIZE | Must be set to the size of the statically defined stack memory area (in bytes) if your code has many automatic arrays which turn out to be large. | It may be a good idea to use the -heap-arrays compilation switch to circumvent usage of this variable.
Extensions to MPI, special topics
The discussion in this section pertains to all available MPI implementations; where this is not the case, an explicit remark is given.
The shmem programming interface
In addition to MPI calls, a SPMD parallel program on Altix systems may also use the efficiently implemented shmem library calls. These make use of the RDMA facilities of the SGI interconnect (NUMAlink); indeed they also work across partition boundaries. Shmem calls are generally similar in semantics to one-sided MPI communication calls, but easier to use: For example,
shmem_double_put(target,source,len,pe)
is a facility for transferring a double precision array source(len) to the location target(len) on the remote process pe. The target object must be remotely accessible (also called symmetric), i.e. typically either a static array or memory dynamically allocated by executing the collective call shpalloc(3F) on a suitably defined Cray-type pointer. Repeatedly executed shmem calls targeting the same process usually require an additional synchronization call (in the above case, shmem_fence()) to enforce memory ordering. Also note that the interface is not generic: for each data type there is a distinct API call (if one is available at all). Here is a list of further functionality available:
- shmem_get: transfer data from remote to local process
- shmem_ptr: return a pointer to a memory location on a remote process
- collective calls for reduction, broadcast, barrier
- administrative calls for starting up and getting process IDs: If shmem is used in conjunction with MPI, please use the standard MPI administrative calls instead!
Due to the cache coherency properties of the Altix systems, the cache management functions - while still available for compatibility - are not actually required. Please consult the documentation referenced below for detailed shmem information. This programming paradigm is only available when using SGI MPT.
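To illustrate the typical call sequence, here is a minimal C sketch of a put between two PEs. It assumes the SGI shmem C interface (header mpp/shmem.h, initialization via start_pes, and the _my_pe/_num_pes query calls); please verify the exact names in shmem_intro(1) for your MPT version.

/* Minimal SHMEM put between two PEs (SGI MPT only); link with -lsma. */
#include <stdio.h>
#include <mpp/shmem.h>

#define LEN 8

/* Static arrays are symmetric, i.e. remotely accessible. */
static double source[LEN], target[LEN];

int main(void)
{
    int i, me, npes;

    start_pes(0);                /* initialize the shmem library */
    me   = _my_pe();
    npes = _num_pes();

    for (i = 0; i < LEN; i++)
        source[i] = 100.0 * me + i;

    if (npes > 1 && me == 0)
        /* copy source(0:LEN-1) into target(0:LEN-1) on PE 1 */
        shmem_double_put(target, source, LEN, 1);

    shmem_barrier_all();         /* completes the put and synchronizes all PEs */

    if (me == 1)
        printf("PE 1: target[0] = %f\n", target[0]);
    return 0;
}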
Global shared memory
The GSM feature provides expanded shared memory capabilities across partitioned Altix systems and additional shared memory placement specifications within a single host configuration. Additional (though non-portable) API calls provide a way to allocate a global shared memory segment with the desired placement options, free that segment, and obtain information about it. For example, calling the subroutine
gsm_alloc(len, placement, flags, comm, base, ierror)
will provide a memory segment of size len bytes at address base (accessed via a Cray-type pointer) for all processes in the MPI communicator comm. Data written to this segment by any process will be visible to all other processes after a synchronization call (usually MPI_Barrier). Please consult the documentation referenced below for detailed GSM information. This programming paradigm is only available when using SGI MPT.
MPI-2 issues
Supported features
The following table gives an overview of the MPI-2 features which are known to be supported by at least one implementation available on the system.
Feature | SGI MPT | Intel MPI (latest release) |
---|---|---|
MPI-IO | yes | yes |
One-sided communication | mostly (alternative: shmem) | mostly |
Language Bindings | yes (Intel compilers) | |
Process Creation/Management | yes | no |
External Interfaces | partially | mostly |
If you find anything missing (especially subfeatures which you urgently need), please drop a mail to LRZ HPC support.
Memory management via MPI_Alloc_mem
In order to safely use allocated memory within, e.g., MPI one-sided communication routines, one should use MPI_Alloc_mem instead of Fortran 90 ALLOCATE or C malloc. Until the Fortran 2003 C interoperability features become available, a non-standard language extension, namely Cray pointers, is needed for calls from Fortran, since MPI_Alloc_mem expects a C pointer argument. The following code fragment shows how to handle this situation with double precision buffers:
integer(kind=MPI_ADDRESS_KIND) size, intptr
integer :: itemsize, length
! Specify largest anticipated value. Will not be actually allocated at this point
pointer (baseptr, farray(200))
double precision :: farray
:
call MPI_Type_extent(MPI_DOUBLE_PRECISION, itemsize, ierror)
! length at most 200
size = length*itemsize
call MPI_Alloc_mem(size, MPI_INFO_NULL, intptr, ierror)
! enable access to allocated storage via farray:
baseptr = intptr
:
! Now perform one-sided calls etc.
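For C codes no Cray pointer workaround is needed, since MPI_Alloc_mem directly returns a usable pointer. A minimal sketch (function and variable names are illustrative):

#include <mpi.h>

/* Allocate a buffer of 'length' doubles suitable for one-sided communication. */
double *alloc_rma_buffer(int length)
{
    double *farray = NULL;
    MPI_Aint size = (MPI_Aint) length * (MPI_Aint) sizeof(double);

    MPI_Alloc_mem(size, MPI_INFO_NULL, &farray);
    return farray;               /* release later with MPI_Free_mem(farray) */
}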
Hybrid parallelism
Using the two distinct programming models, distributed memory parallelism and shared memory parallelism, in a single program is known as hybrid parallelism. Usually this involves using MPI calls together with either OpenMP directives or multi-threaded library calls. Obtaining good performance and scalability for such a program is in many cases a more difficult and complex task than using one parallel paradigm on its own.
Programming considerations
For performance reasons, you might want to execute MPI calls from within threaded regions of your program. In this case, you must replace the initial MPI_Init call in your MPI program by MPI_Init_thread, specifying the level of threading support you need. The routine will then return the threading level actually available for the implementation. Beware that this level may well be lower than what you first requested. Hence provisions must be made within your program for possible limitations of multi-threaded MPI execution. Furthermore, even if multi-threading is fully supported, making use of it may well incur a performance penalty within the MPI substrate.
Further details are described in the man page for the MPI_Init_thread API call.
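As an illustration, a minimal C sketch of this negotiation (error handling omitted) could look as follows:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request the highest level; the library may grant a lower one. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr,
                "Warning: only thread support level %d is provided\n", provided);

    /* ... hybrid MPI/OpenMP work, restricted according to 'provided' ... */

    MPI_Finalize();
    return 0;
}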
Overview of supported thread level parallelism for available MPI implementations

MPI Flavour and Version | provides | Remarks
---|---|---
SGI MPT 1.16 | MPI_THREAD_SINGLE | MPI calls only in the serial region. The default MPT is not thread-safe.
SGI MPT 1.16 | MPI_THREAD_MULTIPLE | Perform module unload mpi.altix; module load mpi.altix/intel_mt to switch to the fully thread-safe libraries.
Intel MPI 2.0 | MPI_THREAD_FUNNELED | Threaded MPI calls only from the master thread.
Intel MPI 3.0 | MPI_THREAD_MULTIPLE | Threaded MPI calls from any thread. Please specify the -mt_mpi or the -threads option. The latter will not only link the thread-safe libraries, but also generally perform thread-safe linkage. The default libraries only support MPI_THREAD_SINGLE.
Open MPI 1.1 | MPI_THREAD_MULTIPLE | Threaded MPI calls from any thread. Only lightly tested.
Setting up a hybrid program run - Case 1: uniform threading
Please proceed as follows:
export OMP_NUM_THREADS=4
export MPI_OPENMP_INTEROP=
mpirun -np 6 ./myprog.exe
This will start up six MPI tasks with 4 threads each. The MPI_OPENMP_INTEROP variable tells MPT to space out the MPI tasks so as to keep all forked threads as near to the master thread as possible. For the other MPI implementations, however, this variable has no effect; you will need to perform your own pinning (e.g., via the LRZ service library). For Intel MPI there is also built-in pinning functionality available, which will be described here once tested.
Setting up a hybrid program run - Case 2: non-uniform threading
You may have two programs which form a MPMD unit, or wish to run your single program with varying thread numbers. Please proceed as follows:
export OMP_NUM_THREADS=1
export MPI_OMP_NUM_THREADS=4:8
export MPI_OPENMP_INTEROP=
mpirun -np 2 ./myprog1.exe : -np 1 ./myprog2.exe
This will start up two 4-threaded tasks of myprog1.exe and one 8-threaded task of myprog2.exe. The OMP_NUM_THREADS variable is ignored unless there are too few entries in MPI_OMP_NUM_THREADS, in which case the OMP_NUM_THREADS value is used for the missing entries.
For non-MPT MPI implementations, external specification of a varying thread number is not supported; you need to use, e.g., the OpenMP call omp_set_num_threads(n) within your program, and the placement/pinning issues mentioned in the previous subsection still need to be dealt with.
Coping with inhomogeneity
Since the HLRB-II has become slightly inhomogeneous with the phase 2 upgrade, it may be necessary to cope with this situation if (very large) jobs are run which need to make use of both high density (4 cores per memory channel) and high bandwidth (2 cores per memory channel) blades. One possibility to avoid workload imbalances is to simply set
export MPI_DSM_PPM=2
which will have no effect on the bandwidth blades, but will only use one socket on the density blades. If hybrid parallelism should be used on the basis of 2 threads per socket, the combination
export MPI_DSM_PPM=1
export OMP_NUM_THREADS=2
export KMP_AFFINITY=compact,0
should do the job.
Documentation
General Information on MPI
Please refer to the MPI page at LRZ for the API documentation and information about the different flavors of MPI available.
Manual pages
- For SGI MPT, please consult the man pages mpi(1) and mpirun(1). Also, each MPI API call has its own man page.
- For the shmem API, consult the man page shmem_intro (1). Again, each shmem routine has its individual man page.
- For the global shared memory API, consult the man page gsm_intro (1), which also contains references to further API calls.
SGI's MPT documentation
Some of the following links lead to a password-protected area. To obtain user name and password for access, please type the command get_manuals_passwd when logged in to the system.
- MPT page on SGI's web site with summary information
- SGI MPT User's Guide, in PDF (200 kByte) and HTML format.
Intel MPI documentation
See the previous subsection for how to access the password-protected area on the LRZ web server.
Version | Available Docs |
---|---|
3.0 | Release Notes (Text file), Getting Started (PDF), and Reference Manual (PDF) |
OpenMPI documentation
Apart from a manual page for mpirun there is a FAQ on the OpenMPI web site.