HPC
From Consultancy.EdVoncken.NET
HPC, or High Performance Computing, is a specialized form of computing with a growing number of users. It is mostly associated with compute-intensive workloads.
With CPU clock speeds stabilizing around 3 GHz, the trend has shifted towards integrating multiple CPU cores on a single chip. These multi-core CPUs only reach their potential when software executes tasks in parallel instead of sequentially.
Most current software was not written with parallel execution in mind. Parallel programming is still a science in itself.
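As a rough illustration of the difference between sequential and parallel execution, the sketch below splits a sum of squares into independent chunks and hands each chunk to a separate worker process. The function names and the choice of workload are made up for this example; only the standard library `concurrent.futures` module is used.

```python
from concurrent.futures import ProcessPoolExecutor


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    bounds = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks get one additional element.
        stop = start + step + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_sum(bounds):
    """Sum of squares over one chunk; chunks do not depend on each other."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def sum_squares_sequential(n):
    """Process the whole range on a single core."""
    return partial_sum((0, n))


def sum_squares_parallel(n, workers=4):
    """Process the chunks in separate worker processes, then combine."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, split_range(n, workers)))


if __name__ == "__main__":
    n = 1_000_000
    assert sum_squares_parallel(n) == sum_squares_sequential(n)
    print("parallel and sequential results match")
```

The parallel version only pays off when each chunk carries enough work to outweigh the cost of starting processes and moving data between them.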
HPC News
HPC Cluster toolkits
In large-scale deployments, you should make your life easier by using a toolkit that helps you build, manage and use your HPC cluster. Some examples:
- Project KUSU
- OSCAR - Open Source Cluster Application Resources
- Rocks
- Red Hat HPC Solution - Open Source, Kusu + Lava with commercial support
Message Passing Interface (MPI)
The Message Passing Interface, or MPI, provides communication between processes that may run on different cluster nodes.
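A minimal sketch of message passing, assuming the `mpi4py` Python bindings and an MPI runtime are installed (neither is mentioned above, so treat both as assumptions). Each process sends its rank to its right-hand neighbour on a ring and receives a value from its left-hand neighbour. The helper `ring_neighbours` is a name invented for this example.

```python
def ring_neighbours(rank, size):
    """Return the (left, right) neighbour ranks on a ring of `size` processes."""
    return (rank - 1) % size, (rank + 1) % size


def main():
    try:
        # Assumption: mpi4py and an MPI implementation are available.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not available; skipping the MPI demo")
        return

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id within the job
    size = comm.Get_size()   # total number of processes in the job
    left, right = ring_neighbours(rank, size)

    # Send our rank to the right and receive from the left in one call,
    # so no process blocks waiting for a send that never starts.
    received = comm.sendrecv(rank, dest=right, source=left)
    print(f"rank {rank} of {size} received {received} from rank {left}")


if __name__ == "__main__":
    main()
```

Saved as a file (say `ring.py`, a hypothetical name), this would typically be started with something like `mpiexec -n 4 python ring.py`, and the processes may be placed on different cluster nodes by the MPI launcher.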
Schedulers
Platform Lava
Platform Lava is an open source entry-level workload scheduler designed to meet a wide range of workload scheduling needs for clusters of up to 512 nodes. Lava is available via the HPC Community site and is also included as a component of Platform Cluster Manager (PCM). Read the PCM FAQ for more information.
SLURM
SLURM stands for Simple Linux Utility for Resource Management. SLURM is an Open Source resource manager designed for Linux clusters of all sizes.
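To give an idea of how resources are requested from SLURM, the sketch below renders a small batch script. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`) and the `srun` launcher are standard SLURM, but the function name, job name and values are invented for illustration, and a real site will usually need additional options such as a partition or account.

```python
def render_sbatch(job_name, nodes, tasks_per_node, walltime, command):
    """Build the text of a SLURM batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",                    # number of nodes to allocate
        f"#SBATCH --ntasks-per-node={tasks_per_node}",  # processes started per node
        f"#SBATCH --time={walltime}",                  # wall-clock limit, HH:MM:SS
        "",
        f"srun {command}",                             # run the command on the allocation
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch("hello", 2, 4, "00:10:00", "hostname"), end="")
```

The resulting text would be saved to a file and submitted with `sbatch`; SLURM then queues the job until the requested nodes are free.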
Torque
Torque Resource Manager is an open source resource manager derived from OpenPBS. OpenPBS, in turn, is the open source release of PBS (Portable Batch System), which was originally developed for NASA and is also sold commercially as PBS Professional.
Condor
Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
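The interplay of a job queue, a priority scheme and resource tracking can be sketched with a toy scheduler. This is not how Condor (or any scheduler listed here) works internally; the class, its methods and the node names are invented purely to illustrate the concepts.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy batch scheduler: priority queue of jobs plus idle cores per node."""

    def __init__(self, free_cores):
        self.free_cores = dict(free_cores)  # node name -> idle cores
        self._queue = []
        self._counter = itertools.count()   # submission order, breaks priority ties

    def submit(self, name, cores, priority=0):
        # heapq is a min-heap, so the priority is negated: higher runs first.
        heapq.heappush(self._queue, (-priority, next(self._counter), name, cores))

    def dispatch(self):
        """Start queued jobs in priority order wherever enough cores are idle."""
        started, waiting = [], []
        while self._queue:
            entry = heapq.heappop(self._queue)
            _, _, name, cores = entry
            node = next(
                (n for n, free in self.free_cores.items() if free >= cores), None
            )
            if node is None:
                waiting.append(entry)  # no node can hold this job right now
                continue
            self.free_cores[node] -= cores
            started.append((name, node))
        for entry in waiting:          # keep unplaced jobs queued
            heapq.heappush(self._queue, entry)
        return started


if __name__ == "__main__":
    sched = ToyScheduler({"node1": 4, "node2": 8})
    sched.submit("small", cores=2, priority=1)
    sched.submit("big", cores=8, priority=5)
    sched.submit("huge", cores=16, priority=9)
    print(sched.dispatch())
```

Note that this toy policy lets smaller, lower-priority jobs start while a large job waits for resources. Real schedulers make that trade-off configurable and also release cores when jobs finish, which is omitted here.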
Oracle Grid Engine
Formerly known as Sun Grid Engine (SGE), Oracle Grid Engine is still available.
The future of OGE / SGE was uncertain, as with most open source projects formerly hosted by Sun. In late 2010, the project was forked and continues on SourceForge as "Open Grid Scheduler".
Storage
Network storage comes in two flavors: NAS and SAN. NAS provides file-based storage, while SAN provides block-based storage. NAS protocols like NFS and CIFS let multiple nodes share the same filesystem. SAN storage cannot easily be shared: several nodes can only mount the same block device safely through a cluster file system.
NFS
Each node in the cluster needs access to the data. For small to medium sized environments, NFS servers or appliances like a NetApp Filer may very well suit your needs. Parallel NFS (pNFS), introduced with NFSv4.1, lets clients access data on several storage servers in parallel, which should improve throughput considerably.
Large clusters usually employ some kind of Cluster File System instead of NFS or other traditional storage.
- The Red Hat Cluster Suite NFS Cookbook - Setting up a Load-Balanced NFS Cluster with Failover Capabilities
GFS
Commercial HPC vendors
Platform Computing
Platform LSF (Load Sharing Facility) is a very popular workload manager in enterprise environments.
- Platform Computing - known for Platform LSF and Symphony
- Platform buys HP MPI - The Register
Applications
References
- HPC Community
- So you want to build a cluster: Five things to consider before you start
- Project KUSU - Build, manage and use Linux HPC clusters
- Platform Lava - Workload scheduler
- HPC at Dell