RCS examples are provided to assist you in learning the software and the development of your applications on the Shared Computing Cluster (SCC). The instructions provided along with the code assume that the underlying OS is Linux. If these examples are run on a different architecture, you might need to make some changes to the code and/or the way the program is built and executed.
RCS Examples for Intel Xeon Phi Knights Landing (KNL)
Directory Structure
examples - This directory presents KNL example programs.
Usage notes
Knights Landing (KNL) is an Intel Xeon Phi many-integrated-core (MIC) processor.
Currently there are two KNL nodes on SCC.
The host names are scc-ib1 and scc-ib2.
There are 68 physical cores on each KNL node.
Each core supports 4 computing threads by the hyper-threading technique.
Each node therefore supports a total of 272 threads (68 cores x 4 hyper-threads per core) for multi-threaded programs. The optimal number of threads for any given program must be determined by testing.
Unlike the previous generation of Xeon Phi, Knights Corner (KNC), KNL is self-hosted:
the operating system runs directly on the KNL architecture, whereas the KNC architecture required an additional Xeon CPU to host the OS. The SCC KNL nodes run CentOS 7. Please note that this is a newer version than the CentOS 6 running on the SCC login and compute nodes.
Intel Omni-path is installed to support data communication between the two KNL nodes.
Note that a single KNL core is much slower than a regular Xeon CPU core.
Running a serial program on KNL is not recommended, as the run time will be much longer than on a regular SCC compute node.
To benefit from KNL, programs must be parallelized, for example with MPI, OpenMP, or hybrid MPI-OpenMP.
C or Fortran codes can be compiled and run on KNL. If they are parallelized and optimized appropriately, they can be accelerated considerably by the KNL architecture. Intel Math Kernel Library (MKL) functions are automatically accelerated on KNL. For Python programmers, if the numpy or scipy libraries are built with Intel MKL, their functions are automatically accelerated on KNL as well. The SCC provides several Intel-supplied versions of Python, available via the module system, that are built with MKL.
Please refer to the following instructions to compile and run C or Fortran programs on the KNL nodes.
Compile programs
It is recommended to compile programs directly on the KNL nodes, so that they are optimized for the architecture.
Executables built for the KNL nodes will not run on regular SCC nodes, because they use CPU instructions that are only available on KNL.
To compile, request a single KNL core first,
qrsh -l knl
Then load the Intel compiler,
module load intel/2017
[Note: the Intel compiler generates the best-optimized code for KNL, possibly by a large margin.]
To compile an OpenMP C code,
icc -O3 -xmic-avx512 -qopenmp name.c -o executable
[Note: The option -xmic-avx512 makes use of the 512-bit vector (AVX-512) instructions on KNL, which can accelerate the program significantly. Level-3 optimization (-O3) usually (but not always) yields better performance than level-2 optimization (-O2).]
To compile MPI programs, load an MPI implementation first,
module use /share/module/knl
module load openmpi/3.0.0_intel-2017_knl
[Note: openmpi/3.0.0_intel-2017_knl is an MPI implementation built for KNL on the SCC.]
Then compile an MPI C code,
mpicc -O3 -xmic-avx512 name.c -o executable
or an MPI Fortran code,
mpifort -O3 -xmic-avx512 name.f90 -o executable
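For reference, a minimal MPI C program matching the mpicc command above might look like the following (a sketch; it requires the MPI module loaded as shown, and must be launched with mpirun as described below).

```c
/* name.c -- minimal MPI example: each task reports its rank (a sketch). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of tasks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```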
Adding the following compiler options may yield better performance in general:
-fma : to generate fused multiply-add (FMA) instructions.
-finline-functions : to enable function inlining.
Run OpenMP programs
First request one KNL node by qrsh,
qrsh -l knl -pe omp 68
[Note: the "-pe omp 68" means 68 pysical cores are requested. The queue system is only aware of the physical cores on the KNL nodes. Users can specifiy OMP_NUM_THREADS=272 to make use of the total 272 threads.]
Then run the OpenMP program,
export OMP_NUM_THREADS=272
/path/to/executable
Run MPI programs
First request the KNL nodes by qrsh. For example, to request one KNL node,
qrsh -l knl -pe mpi_68_tasks_per_node 68
or two KNL nodes,
qrsh -l knl -pe mpi_68_tasks_per_node 136
Then load the MPI implementation,
module use /share/module/knl
module load openmpi/3.0.0_intel-2017_knl
Then run the MPI program on one KNL node,
mpirun -np 68 /path/to/executable
or on two KNL nodes,
mpirun -np 136 /path/to/executable
[Note: It is recommended to run at most 68 MPI tasks per node (rather than 272). If the program is a hybrid MPI-OpenMP code, each task can then be given up to 4 threads by setting: export OMP_NUM_THREADS=4.]
Run programs in background
Users can run batch jobs in the background using qsub,
qsub script
Please refer to example scripts for OpenMP or MPI programs in the example directory.
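As an illustration, a batch script for an OpenMP job might look like the following. This is a sketch, not an authoritative script: the #$ directives mirror the qrsh options shown above (-l knl, -pe omp 68), the h_rt limit reflects the 24-hour maximum noted below, and the executable path is a placeholder. Check the scripts in the examples directory for the exact versions.

```shell
#!/bin/bash
# Sample batch script for an OpenMP program on one KNL node (a sketch).
#$ -l knl                  # request a KNL node
#$ -pe omp 68              # request all 68 physical cores
#$ -l h_rt=24:00:00        # wall-clock limit (24 hours is the KNL maximum)

module load intel/2017

# Use all 272 hardware threads (4 per physical core).
export OMP_NUM_THREADS=272
/path/to/executable
```

The script is submitted with "qsub script" as shown above.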
[Note: Currently the maximum runtime on the SCC KNL nodes is 24 hours.]
Contact Information
help@scc.bu.edu
Operating System Requirements
The examples presented in this directory were written in C or Fortran.
- C or Fortran compilers available
Note: Research Computing Services (RCS) example programs are provided
"as is" without any warranty of any kind. The user assumes the entire risk of
quality, performance, and repair of any defects. You are encouraged to copy
and modify any of the given examples for your own use.