HPC

From Consultancy.EdVoncken.NET

HPC, or High Performance Computing, is a specialized form of computing with a growing number of users. It is mostly associated with compute-intensive workloads.

With current CPU clock speeds stabilizing around 3 GHz, the trend has shifted towards integrating multiple CPU cores on a single chip. These multi-core CPUs need suitable software to unlock their potential by executing tasks in parallel instead of sequentially.

Most current software was not written with parallel execution in mind. Parallel programming is still a science in itself.
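As a rough illustration of parallel versus sequential execution, the sketch below splits one CPU-bound calculation into chunks and runs each chunk in its own worker process, using only the Python standard library. The function names (split_range, sum_of_squares, parallel_sum_of_squares) and the problem size are made up for this example.

```python
import math
import os
from concurrent.futures import ProcessPoolExecutor


def split_range(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) chunks."""
    step = max(1, math.ceil(n / parts))
    return [(start, min(start + step, n)) for start in range(0, n, step)]


def sum_of_squares(bounds):
    """CPU-bound work for one chunk: sum i*i for start <= i < stop."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n, workers):
    """Run the chunks in separate worker processes and combine the partial sums."""
    chunks = split_range(n, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    n = 200_000
    workers = os.cpu_count() or 1
    sequential = sum_of_squares((0, n))
    parallel = parallel_sum_of_squares(n, workers)
    print("results match:", sequential == parallel)
```

Both versions compute the same answer. The parallel one only pays off when each chunk carries enough work to outweigh the cost of starting processes and moving data between them.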

HPC News

HPC Cluster toolkits

In large-scale deployments, you should make your life easier by using a toolkit that helps you build, manage and use your HPC cluster. Some examples:

Message Passing Interface (MPI)

The Message Passing Interface, or MPI, provides communication between processes that may run on different cluster nodes.
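A minimal sketch of what this looks like in practice, assuming the third-party mpi4py binding and an MPI runtime are installed (neither is mentioned on this page). Each process computes a partial sum and the results are combined on rank 0. The helper rank_slice and the round-robin work split are illustrative choices, not part of MPI itself.

```python
def rank_slice(rank, size, n):
    """Numbers in 0..n-1 assigned to this rank (round-robin across ranks)."""
    return range(rank, n, size)


def main(n=1000):
    try:
        from mpi4py import MPI  # third-party binding; assumed to be installed
    except ImportError:
        # Fallback so the script still runs without MPI: behave as a single rank.
        print("mpi4py not found, single-process total:", sum(rank_slice(0, 1, n)))
        return

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # id of this process, 0..size-1
    size = comm.Get_size()  # total number of processes in the job

    partial = sum(rank_slice(rank, size, n))
    # Every rank contributes its partial sum; only rank 0 receives the total.
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("total:", total)


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as sum_mpi.py, this would typically be started with something like "mpiexec -n 4 python sum_mpi.py". The processes may then be placed on different cluster nodes, depending on how the MPI runtime is configured.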

Schedulers

Platform Lava

Platform Lava is an open source entry-level workload scheduler designed to meet a wide range of workload scheduling needs for clusters of up to 512 nodes. Lava is available via the HPC Community site and is also included as a component of Platform Cluster Manager (PCM). Read the PCM FAQ for more information.

SLURM

SLURM stands for Simple Linux Utility for Resource Management. It is an open source resource manager designed for Linux clusters of all sizes.
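Work is usually handed to SLURM as a batch script whose "#SBATCH" comment lines request resources. The sketch below generates such a script from Python. The helper render_sbatch and the job values are invented for illustration, and a real site may need additional options such as a partition or account.

```python
def render_sbatch(job_name, nodes, tasks_per_node, time_limit, command):
    """Return the text of a simple SLURM batch script that runs `command`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",  # wall-clock limit, HH:MM:SS
        "",
        f"srun {command}",  # srun launches the tasks inside the allocation
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "./my_app" is a placeholder for the program to run.
    print(render_sbatch("demo", 2, 4, "00:10:00", "./my_app"), end="")
```

The generated text can be written to a file and submitted with the sbatch command.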

Torque

Torque Resource Manager is an open source resource manager, derived from the original OpenPBS scheduler. OpenPBS, in turn, was derived from the commercial product PBS (Portable Batch System).

Condor

Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
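To make the terms "job queue" and "priority scheme" concrete, here is a toy sketch of a priority-ordered queue. It is not how Condor is implemented. The class JobQueue and its methods are hypothetical and ignore everything a real batch system handles, such as resource matching, fair share and preemption.

```python
import heapq
import itertools


class JobQueue:
    """Toy batch queue: higher priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # submission order, used as tie-breaker

    def submit(self, name, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def dispatch(self, free_slots):
        """Remove up to `free_slots` jobs in scheduling order and return their names."""
        started = []
        while self._heap and len(started) < free_slots:
            _, _, name = heapq.heappop(self._heap)
            started.append(name)
        return started


if __name__ == "__main__":
    queue = JobQueue()
    queue.submit("render", priority=1)
    queue.submit("simulate", priority=5)
    print(queue.dispatch(free_slots=1))
```

A real scheduler repeats the dispatch step whenever resources become free and recomputes priorities as its policy dictates.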

Oracle Grid Engine

Formerly known as Sun Grid Engine (SGE), Oracle Grid Engine is still available.

The future of OGE / SGE was uncertain, as with most open source projects formerly hosted by Sun. In late 2010, the project was forked and now continues on SourceForge as "Open Grid Scheduler".

Storage

Network storage comes in two flavors: NAS and SAN. NAS provides file-based storage, while SAN provides block-based storage. NAS file-sharing protocols such as NFS and CIFS let multiple nodes share the same filesystem. SAN storage cannot easily be shared without a cluster file system to coordinate access.

NFS

Each node in the cluster needs access to the data. For small to medium-sized environments, NFS servers or appliances like a NetApp Filer may very well suit your needs. Parallel NFS (pNFS), introduced with NFS v4.1, adds parallel I/O and should improve throughput considerably.

Large clusters usually employ some kind of cluster file system instead of NFS or other traditional storage.

GFS

Commercial HPC vendors

Platform Computing

Platform LSF (Load Sharing Facility) is a very popular workload manager in enterprise environments.

Applications

References