MIC QuickStart Guide
Aesyle is our testbed for MIC (Many Integrated Core) computing. The node is equipped with two (2x) Sandy Bridge Xeon E5-2630L processors at 2.0 GHz, two (2x) Knights Corner Xeon Phi 5110P coprocessors, 64 GB of memory, and one 500 GB hard drive.
It is instructive to compare the double precision peak performance of Xeon E5-2630L vs. that of Xeon Phi 5110P:
- Xeon E5-2630L: 96 GFLOPS = 2.0 (GHz) x 6 (cores) x 256/64 (AVX vector width) x 2 (separate add and multiply units per cycle)
- Xeon Phi 5110P: 1.01 TFLOPS = 1.053 (GHz) x 60 (cores) x 512/64 (512-bit vector width) x 2 (FMA)
The node has a public IP address as well as a few private ones. The public hostname is aesyle.ucsc.edu (IP: 128.114.126.227). As long as you have a valid account on Hyades, you can SSH to Aesyle, using the same username, password or SSH key as those for Hyades:
$ ssh -l username aesyle.ucsc.edu
If you are already on Hyades, you can log onto Aesyle simply with:
$ ssh aesyle
In order to enable connection to and from the Xeon Phi coprocessors by using SSH without a password, run the following commands the very first time you log onto Aesyle:
[aesyle]$ ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa -v
[aesyle]$ echo -n 'from="10.*" ' >> $HOME/.ssh/authorized_keys
[aesyle]$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
[aesyle]$ chmod 600 $HOME/.ssh/authorized_keys
Aesyle's environment is almost identical to that of other nodes in the Hyades cluster. The NFS shares are served from the FreeBSD file server via the private 10GbE network and mounted at /home and /trove, respectively. The Lustre file system is mounted at /pfs. Three modules are loaded by default on Aesyle, providing a basic environment for MIC computing:
[aesyle]$ module list
Currently Loaded Modulefiles:
  1) intel_mpi/4.1.3          2) intel_compilers/14.0.1   3) intel_mic
Each Xeon Phi coprocessor runs an embedded Linux OS in memory. When the coprocessor starts up, its boot-loader loads a root file system and Linux kernel that are stored on the host system. I've customized the root file system such that /home and /pfs are mounted on the coprocessors too, in order to provide a consistent user environment on both the host and the coprocessors.
The two Xeon Phi coprocessors are named mic0 and mic1, respectively. Once you are on Aesyle, you can log onto them using SSH:
[aesyle]$ ssh mic0
or
[aesyle]$ ssh mic1
Once you are in, you'll see a mostly familiar Linux environment. Feel free to explore its every nook and cranny. NOTE the embedded Linux utilizes BusyBox to provide a number of UNIX tools. The usage of these tools may differ slightly when compared to the usage of similar tools on the host Linux.
It is worth noting that the embedded Linux sees a total of 240 cores/processors in each Xeon Phi 5110P, although there are only 60 physical cores in each coprocessor:
[mic0]$ grep -c ^processor /proc/cpuinfo 240
The Xeon Phi coprocessors use hardware multithreading on each physical core (4 threads per core) as a key to mask the latencies inherent in an in-order microarchitecture. This should not be confused with Hyper-Threading on Intel Xeon processors, which exists primarily to keep a dynamic execution engine fully fed. For Xeon Phi, the number of threads per core to utilize is a tunable parameter in an application and should be set based on experience running the application.
There are three execution modes for running applications on the Xeon Phi coprocessors: native, symmetric, and offload[1].
In native mode, applications run directly on Xeon Phi coprocessors[2]. Here are some examples on how to build a native application that runs directly on an Intel Xeon Phi coprocessor and its embedded Linux operating system.
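For reference, here is a minimal sketch of what a serial hello.c might look like (the actual source in /pfs/dong may differ):

```c
/* hello.c - serial "Hello world" (a sketch; the actual file may differ) */
#include <stdio.h>

int main(void)
{
    printf("Hello, world!\n");
    return 0;
}
```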
To cross-compile hello.c, a serial "Hello world" program written in C, using the Intel C compiler, run:
[aesyle]$ cd /pfs/dong
[aesyle]$ icc -mmic hello.c -o hello.icc.k1om
NOTE
- -mmic enables cross-compiling of applications for the MIC Knights Corner microarchitecture.
- The default optimization level for the Intel compilers is -O2.
- The binary, hello.icc.k1om, can run only on Xeon Phi coprocessors.
To run the MIC binary on a coprocessor directly from the host:
[aesyle]$ ssh mic0 /pfs/dong/hello.icc.k1om
Hello, world!
or you can first log onto the coprocessor, then run the application:
[aesyle]$ ssh mic1
[mic1]$ /pfs/dong/hello.icc.k1om
Hello, world!
[mic1]$ exit
If you prefer GCC, Intel MPSS comes with a few customized GCC utilities for building native MIC applications on the x86_64 host, located in /usr/linux-k1om-4.7/bin/. For example, to cross-compile hello.c using gcc, run:
[aesyle]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gcc hello.c -o hello.gcc.k1om
NOTE
- -march=k1om is the default option.
- When running /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-readelf -h against a MIC ELF binary, it shows the machine type (instruction set architecture) as Intel K1OM (0xB5). For a list of legal values for e_machine (architecture), check /opt/mpss/3.4.1/sysroots/k1om-mpss-linux/usr/include/elf.h.
- By comparison, the machine type of an x86-64 ELF binary is Advanced Micro Devices X86-64 (0x3E).
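The next example is omp_hello.c, an OpenMP "Hello world" program. A minimal sketch of what it might contain (the actual source may differ):

```c
/* omp_hello.c - OpenMP "Hello world" (a sketch; the actual file may differ) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("Hello, world! I am thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```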
To cross-compile omp_hello.c, an OpenMP "Hello world" program written in C, using the Intel C compiler, run:
[aesyle]$ icc -mmic -openmp omp_hello.c -o omp_hello.k1om
where
- -mmic enables cross-compiling of applications for the MIC Knights Corner microarchitecture.
- -openmp enables the parallelizer to generate multi-threaded code based on OpenMP* directives.
To run the OpenMP program on a coprocessor from the host:
[aesyle]$ ssh mic0 /pfs/dong/omp_hello.k1om
or you can first log onto the coprocessor, then run the OpenMP program:
[aesyle]$ ssh mic1
[mic1]$ /pfs/dong/omp_hello.k1om
[mic1]$ exit
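The next example is mpi_hello.c, an MPI "Hello world" program. A minimal sketch of what it might contain (the actual source may differ):

```c
/* mpi_hello.c - MPI "Hello world" (a sketch; the actual file may differ) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world! I am rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```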
To cross-compile mpi_hello.c, an MPI "Hello world" program written in C, using the Intel C compiler and Intel MPI, run:
[aesyle]$ mpiicc -mmic mpi_hello.c -o mpi_hello.k1om
You can run an MPI session of 60 processes on the coprocessor with:
[aesyle]$ ssh mic0 mpirun -n 60 /pfs/dong/mpi_hello.k1om
or you can first log onto the coprocessor, then run the MPI program:
[aesyle]$ ssh mic1
[mic1]$ mpirun -n 60 /pfs/dong/mpi_hello.k1om
[mic1]$ exit
In symmetric mode, applications run on both the host processors and the coprocessors at the same time.
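The example below uses mpi_hostname.c, in which each MPI rank reports the hostname of the node (or coprocessor) it runs on. A minimal sketch of such a program (the actual source may differ):

```c
/* mpi_hostname.c - each rank reports its hostname (a sketch; the actual file may differ) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Rank %d of %d runs on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
```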
Compile the MPI code (mpi_hostname.c) for both the x86-64 architecture and the MIC Knights Corner architecture:
[aesyle]$ cd /pfs/dong
[aesyle]$ mpiicc mpi_hostname.c -o mpi_hostname.x86-64
[aesyle]$ mpiicc -mmic mpi_hostname.c -o mpi_hostname.k1om
Run the MPI program on all the processor and coprocessor cores[3]:
[aesyle]$ mpirun -n 12 -host aesyle /pfs/dong/mpi_hostname.x86-64 : \
          -n 60 -host mic0 /pfs/dong/mpi_hostname.k1om : \
          -n 60 -host mic1 /pfs/dong/mpi_hostname.k1om
NOTE
- Here we start a total of 132 MPI ranks (processes), with 12 on the host, 60 on mic0 and 60 on mic1.
- The host runs the x64 executable mpi_hostname.x86-64 and the coprocessors run the MIC executable mpi_hostname.k1om.
- By default, Intel MPI uses InfiniBand if it is available. We can also select the TCP network fabric with -genv I_MPI_FABRICS shm:tcp.
- It is unnecessary to manually copy the executable and its dependencies (Intel libraries, MPI tools, etc.) to the coprocessors. The environment has been properly configured on Aesyle to ensure everything just works!
A simpler alternative is to give the two executables a common base name. Create a symbolic link and set the I_MPI_MIC_POSTFIX environment variable:
[aesyle]$ cd /pfs/dong/
[aesyle]$ ln -s mpi_hostname.x86-64 mpi_hostname
[aesyle]$ export I_MPI_MIC_POSTFIX=.k1om
Create a machine file (/pfs/dong/machines), with simple host:ranks pairs on separate lines:
aesyle:12
mic0:60
mic1:60
Now we can run the MPI application in symmetric mode, with a far simpler command:
[aesyle]$ mpirun -machinefile machines /pfs/dong/mpi_hostname
NOTE
- The host runs /pfs/dong/mpi_hostname, which is a symbolic link to the x64 executable /pfs/dong/mpi_hostname.x86-64.
- The I_MPI_MIC_POSTFIX environment variable instructs mpirun to add the .k1om postfix to the executable name when running on the coprocessors; so the coprocessors run the MIC executable /pfs/dong/mpi_hostname.k1om.
- On Aesyle this shorthand may not save much effort over the explicit command above, but it is the preferred way of running applications in symmetric mode in a cluster environment with many Xeon Phi coprocessors.
In offload mode, an application starts execution on the host; as execution proceeds, it offloads part or all of the computation from its processes or threads to the coprocessors. This is the common execution model in other coprocessor/accelerator operating environments, such as CUDA, OpenCL and OpenACC.
OpenMP 4.0[4], released in July 2013, adds support for accelerators by introducing the target family of directives. Intel compilers support some, but not all, of the new features in OpenMP 4.0[5]. I've written a sample "Hello world" program (omp4_hello.c) to demonstrate some of the OpenMP 4.0 features.
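A minimal sketch of what such an OpenMP 4.0 offload program might look like (the actual omp4_hello.c may differ; the sketch assumes the runtime supports omp_get_num_devices() and the target directive):

```c
/* omp4_hello.c - OpenMP 4.0 target offload "Hello world"
 * (a sketch only; the actual file may differ) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int ndev, d;

    /* greetings from the host */
    #pragma omp parallel
    printf("Hello, world! I am thread %d on host\n", omp_get_thread_num());

    ndev = omp_get_num_devices();   /* number of attached coprocessors */
    printf("Number of devices = %d\n", ndev);

    for (d = 0; d < ndev; d++) {
        /* offload a parallel region to coprocessor d */
        #pragma omp target device(d)
        #pragma omp parallel
        printf("Hello, world! I am thread %d on mic%d\n",
               omp_get_thread_num(), d);
    }
    return 0;
}
```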
Compile the code with Intel C compiler:
[aesyle]$ icc -openmp omp4_hello.c -o omp4_hello.x
By default, the program will utilize all host processor cores and coprocessor hardware threads, i.e., it will start 6 threads on each Xeon E5-2630L processor and 236 threads on each Xeon Phi 5110P coprocessor (not 240 threads, because the last core is reserved for running the offload daemon coi_daemon). For brevity, let's start only 2 threads on the host and 2 threads on each coprocessor.
[aesyle]$ export OMP_NUM_THREADS=2
[aesyle]$ export MIC_OMP_NUM_THREADS=2
[aesyle]$ ./omp4_hello.x
Hello, world! I am thread 0 on host
Hello, world! I am thread 1 on host
Number of threads = 2 on host
Number of devices = 2
Hello, world! I am thread 0 on mic0
Hello, world! I am thread 1 on mic0
Number of threads = 2 on mic0
Hello, world! I am thread 0 on mic1
Hello, world! I am thread 1 on mic1
Number of threads = 2 on mic1
Intel compilers provide several proprietary pragmas (offload and others prefixed with offload_) to explicitly direct data movement and code execution[6]. I've written a sample "Hello world" program (offload_hello.c) that is equivalent to the OpenMP 4.0 program above (omp4_hello.c) but uses the Intel-specific pragmas.
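A minimal sketch of the equivalent program using the Intel-specific offload pragma (the actual offload_hello.c may differ; _Offload_number_of_devices() from offload.h is used here to query the coprocessors):

```c
/* offload_hello.c - "Hello world" with Intel-specific offload pragmas
 * (a sketch only; the actual file may differ) */
#include <stdio.h>
#include <omp.h>
#include <offload.h>   /* _Offload_number_of_devices() */

int main(void)
{
    int ndev, d;

    /* greetings from the host */
    #pragma omp parallel
    printf("Hello, world! I am thread %d on host\n", omp_get_thread_num());

    ndev = _Offload_number_of_devices();
    printf("Number of devices = %d\n", ndev);

    for (d = 0; d < ndev; d++) {
        /* explicitly offload the following block to coprocessor d */
        #pragma offload target(mic:d)
        #pragma omp parallel
        printf("Hello, world! I am thread %d on mic%d\n",
               omp_get_thread_num(), d);
    }
    return 0;
}
```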
Compile the code with Intel C compiler:
[aesyle]$ icc -openmp offload_hello.c -o offload_hello.x
By default, the program will utilize all host processor cores and coprocessor hardware threads. For brevity, let's start only 2 threads on the host and 2 threads on each coprocessor.
[aesyle]$ export OMP_NUM_THREADS=2
[aesyle]$ export MIC_OMP_NUM_THREADS=2
[aesyle]$ ./offload_hello.x
Hello, world! I am thread 0 on host
Hello, world! I am thread 1 on host
Number of threads = 2 on host
Number of devices = 2
Hello, world! I am thread 0 on mic1
Hello, world! I am thread 1 on mic1
Number of threads = 2 on mic1
Hello, world! I am thread 0 on mic0
Hello, world! I am thread 1 on mic0
Number of threads = 2 on mic0
NOTE
- For the time being, OpenMP 4.0 support in the Intel compilers is still limited; the Intel-specific pragmas are currently the more reliable way to program offload mode.
- Here we only touch upon the Explicit Offload model (non-shared memory model), where we explicitly direct data movement and code execution.
- We can also use the Implicit Offload model (virtual-shared memory model), which is suitable when the data exchanged between the CPU and the coprocessor is more complex than scalars, arrays, and structs that can be copied from one variable to another using a simple memcpy[7].
- Sample codes using the explicit memory copy model can be found on Aesyle at:
  - Fortran: /opt/intel/composer_xe_2013_sp1.1.106/Samples/en_US/Fortran/mic_samples/
  - C++: /opt/intel/composer_xe_2013_sp1.1.106/Samples/en_US/C++/mic_samples/
Intel Math Kernel Library includes a unique Automatic Offload (AO) feature that enables computationally intensive MKL functions called in user code to benefit from attached Xeon Phi coprocessors automatically and transparently[8]. This feature allows us to leverage the additional computational resources provided by the coprocessor without changes to Intel MKL calls in our codes. The data transfer and the execution management is completely automatic and transparent for the user.
Because the coprocessor(s) are connected to the host system via Peripheral Component Interconnect Express (PCIe), AO support is provided only for functions that involve sufficiently large problems and have high ratios of computation to data access. As of Intel MKL 11.0, only the following functions are enabled for automatic offload:
- Level-3 BLAS functions
- ?GEMM (for M,N > 2048, k > 256)
- ?TRSM (for M,N > 3072)
- ?TRMM (for M,N > 3072)
- ?SYMM (for M,N > 2048)
- LAPACK functions
- LU (M,N > 8192)
- QR
- Cholesky
To enable automatic offload, either call the function mkl_mic_enable() within the source code or set the environment variable MKL_MIC_ENABLE=1. If no Xeon Phi coprocessor is detected, the application runs on the host without penalty.
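For illustration, here is a hedged sketch of a program whose DGEMM call is large enough to qualify for automatic offload (the matrix size, data, and structure here are only an example; it assumes mkl_mic_enable() is exposed through mkl.h in this MKL version):

```c
/* file.c - DGEMM large enough (M,N > 2048, K > 256) to qualify for MKL Automatic Offload
 * (a sketch; not a verbatim copy of any code on Aesyle) */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    int n = 4096;                 /* above the ?GEMM thresholds for AO */
    size_t i, nn = (size_t)n * n;
    double *A = (double *)mkl_malloc(nn * sizeof(double), 64);
    double *B = (double *)mkl_malloc(nn * sizeof(double), 64);
    double *C = (double *)mkl_malloc(nn * sizeof(double), 64);

    for (i = 0; i < nn; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    mkl_mic_enable();             /* or: export MKL_MIC_ENABLE=1 */

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}
```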
Build a program for automatic offload the same way as building code for the Xeon host:
[aesyle]$ icc -O3 -mkl file.c -o file
By default, the MKL library decides when to offload and also tries to determine the optimal work division between the host and the targets (MKL can take advantage of multiple coprocessors). For the BLAS routines, the user can specify the work division between the host and the coprocessor by calling the routine mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)
or by setting the environment variable:
[aesyle]$ export MKL_MIC_0_WORKDIVISION=0.5
Both examples specify that 50% of the computation should be offloaded only to the first coprocessor (mic0).
- Intel Xeon Phi Coprocessor
- Intel Xeon Phi Coprocessor: Software Developers Guide
- Intel C++ Compiler 14.0 - Intel MIC Architecture
- Intel Fortran Compiler 14.0 - Intel MIC Architecture
- Programming and Compiling for Intel Many Integrated Core Architecture
- Debugging Intel Xeon Phi Applications on Linux Host
- Intel Xeon Phi Coprocessor High-Performance Programming by James L. Jeffers and James Reinders. The ebook is available at UCSC library.
- Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers by Rezaur Rahman. The Kindle edition is freely available at Amazon too.
- ^ Intel Xeon Phi Programming Environment
- ^ Building a Native Application for Intel Xeon Phi Coprocessor
- ^ Using the Intel MPI Library on Intel Xeon Phi Coprocessor Systems
- ^ OpenMP 4.0 Application Program Interface
- ^ OpenMP 4.0 Features in Intel Fortran Composer XE 2013
- ^ Fortran vs. C offload directives and functions
- ^ Intel C++ Compiler 14.0 - Using Shared Virtual Memory
- ^ Using Intel® MKL Automatic Offload on Intel Xeon Phi Coprocessors