K2 is finally up, after a chiller replacement and miscellaneous troubles.
SIESTA (version siesta-3.0-b) is now centrally available on K2, and a separate queue, 'siesta.q', has been configured. Please contact us for access to it.
Please use your directory in the /mnt/oss partition to store jobs and job-related data. Please don't store excess data in your home directory in /export.
About High Performance Computing Laboratory
The High-Performance Computing Facility at IUAC has been set up with a grant from the Department of Science and Technology, to provide supercomputing access to university users across the country, and also to boost the ion-solid, nuclear physics, and atomic physics simulation programs at IUAC. The facility is targeted at computational chemists, physicists, and biologists in the university system, working in the areas of materials science, atomic and molecular physics and chemistry, radiation biology, and nuclear physics.
The facility has been operational since 2010 and welcomes users from universities, colleges, and institutes across the country.
In its first phase, a state-of-the-art data center with chilled water-cooled racks, an SMP system, and a distributed memory cluster were set up. In 2013, a second distributed memory cluster was added. The systems now available to users are:
K2: A 62 teraflop MPI cluster consisting of 200 compute nodes, 3200 compute cores, a 40 Gbps Infiniband interconnect, 4 GB of RAM per compute core, and a 55 TB Lustre parallel file system. The software platform consists of the Rocks cluster manager, the CentOS 6.3 operating system on the nodes, Intel compilers, the Intel MPI library, and the Sun Grid Engine resource manager.
Kalki: A 9 teraflop MPI cluster consisting of 96 compute nodes, 768 compute cores, a 20 Gbps Infiniband interconnect, 2 GB of RAM per compute core, and an 8-node 6 TB PVFS2 file system, with Rocks, CentOS 5.3, GNU compilers, OpenMPI, and SGE.
Distributed memory CPU-intensive jobs parallelized using MPI should run well on both clusters. Remote access to both clusters is possible through an SSH shell.
If you wish to use either of the systems for computations in nuclear physics, atomic and molecular physics, materials science, or radiation biology, please e-mail a request to sumit[at]iuac[dot]res[dot]in, including a short (~ 1 page) description of the proposed work, the software you require, and the resources you need for a typical run (number of cores, amount of RAM, disk space, time). If your request is approved, we will get back to you with details of how you can access the system.
Instructions for Users
- Kindly do not run any job on the main/head node.
- Please put all your data files in the directory /mnt/oss/your_user_name, and run your programs from there.
- Please take regular backups of your data. If you need help with this, please e-mail Ipsita Satpathy. We do not have the resources to take backups for all user data, and accidents do happen.
- Use a qsub script to run every job.
- For jobs that are in "qw" state and need only a change of number of cores and/or queue, use qalter instead of killing the job and resubmitting it.
- Please use ONLY multiples of 16 cores for all your MPI jobs. This isolates the nodes you are using from nodes other users are using and protects you from nodes that crash or hang because of errors in running programs. It also protects you when the admin starts debugging or removing other users' jobs; accidents can happen.
- All parallel jobs requiring fewer than 48 cores and a wall time of less than 4 days should be put on the all.q queue.
- All serial jobs (single-core jobs) should be run on serial.q.
- If you need access to more than 48 cores per job, or more than 4 days of wall clock time to run a job, please e-mail Sumit Mookerjee to get access to largejob.q.
- Please use test.q to run test jobs, to check they run as you expect before you place them on the production queues.
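The multiples-of-16 rule above follows from K2 having 16 cores per node. As a minimal sketch, the helper below rounds a requested core count up to the next whole node. The name round_up_cores is hypothetical and is not a tool installed on the cluster.

```shell
#!/bin/bash
# Round a requested core count up to the next multiple of 16,
# the number of cores on each K2 compute node.
# round_up_cores is a hypothetical helper name, not a cluster tool.
round_up_cores() {
    local requested=$1
    local per_node=16
    # Integer arithmetic: add (per_node - 1) before dividing to round up.
    echo $(( (requested + per_node - 1) / per_node * per_node ))
}
```

For a job still in the "qw" state, the rounded value could then be passed to qalter, for example `qalter -pe mpich "$(round_up_cores 40)" JOB_ID`. Here JOB_ID is a placeholder, and the parallel environment name mpich is taken from the Kalki sample script, so it is an assumption for K2.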
NOTE that jobs under the following conditions will be TERMINATED WITHOUT WARNING:
- Any job on largejob.q with fewer than 48 cores.
- Any job on all.q with more than 48 cores.
- Any serial jobs on all.q or largejob.q.
- Any job running on the head node.
- Jobs not under the control of SGE.
- all.q is the queue for general jobs requiring up to 48 cores.
- A time limit of 96 hours (4 days) applies to all all.q jobs.
- Any job submitted to all.q is automatically terminated after 96 hours of running.
- largejob.q is the queue for jobs that need a large number of cores and a long run time. In our context, large means 48 or more cores.
- There is no time limit on largejob.q jobs.
- test.q is the queue for checking that your MPI program works: when running it for the first time, after code modifications, or when running benchmarks to check run times.
- Time limit for test.q jobs: 6 hours. Any job in this queue running longer than 6 hours is automatically terminated.
- Total cores allocated to test.q: 128. A total of 8 nodes (128 cores) are allocated, so you should be able to do a full test, including IB.
- Please keep your test runs as short as possible, and keep the resources used to the minimum you need.
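The termination conditions above can be checked before submitting. The function below is a rough sketch that encodes only the core-count rules stated on this page. The name check_job is hypothetical, and the function does not check the multiples-of-16 rule or whether a job runs under SGE.

```shell
#!/bin/bash
# check_job QUEUE CORES
# Prints "ok" and returns 0 if the request avoids the core-count
# termination conditions listed above; otherwise prints the reason
# and returns 1. Hypothetical helper, not a cluster tool.
check_job() {
    local queue=$1 cores=$2
    # Serial (single-core) jobs must not go on all.q or largejob.q.
    if [ "$cores" -eq 1 ] && { [ "$queue" = "all.q" ] || [ "$queue" = "largejob.q" ]; }; then
        echo "serial jobs belong on serial.q"
        return 1
    fi
    # largejob.q jobs with fewer than 48 cores are terminated.
    if [ "$queue" = "largejob.q" ] && [ "$cores" -lt 48 ]; then
        echo "largejob.q needs at least 48 cores"
        return 1
    fi
    # all.q jobs with more than 48 cores are terminated.
    if [ "$queue" = "all.q" ] && [ "$cores" -gt 48 ]; then
        echo "all.q allows at most 48 cores"
        return 1
    fi
    echo "ok"
}
```

A call such as `check_job all.q 32` could be placed in a wrapper around qsub so that a non-conforming request is caught before it reaches the scheduler.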
Running Job Policies
- All jobs must be run through the batch schedulers on the cluster.
- No jobs should be fired directly on the head node. It is reserved exclusively for login and interaction with the schedulers.
- To log in to either cluster, you will need to use SSH. See the FAQ entry on remote access for details.
- All accounts on cluster machines are to be 'owned' by the research advisor or head of the group. If for some reason you require individual accounts for group members, please contact Sumit Mookerjee.
Storage and Disk Space Usage
- Users may not store data in their home directory beyond a set limit, which may vary depending on requirements and availability.
- If you need more than 3 GB of space, please use the /mnt/oss partition on K2, or /mnt/pvfs2 on Kalki.
- Users are not allowed to gain access to compute nodes directly via ssh and work from there.
- All jobs that are to be run on the cluster must be submitted via the job schedulers. Please see the FAQ and the queue policies.
- You may not run your programs in the background, or run ANY programs on the head node.
- Jobs not conforming to these rules will be terminated and the user account may be locked.
- Sun Grid Engine www.wikis.sun.com
- Sun Grid Engine Wikipedia www.en.wikipedia.org
- Simple Job Array Howto www.wiki.gridengine.info
- Sun Grid Engine Quick Start www.web.njit.edu
- LONI Grid Computing Notes www.loni.ucla.edu
Frequently Asked Questions
What is the procedure to get access to the supercomputing facility at IUAC?
To access the supercomputing facility at IUAC, you have to request an account, including a short (~ 1 page) description of the proposed work, the software you require, and the resources you need (number of cores, amount of RAM, disk space, time) for a typical run.
You may send this information to sumit[at]iuac[dot]res[dot]in. If your request is approved, we will get back to you with details of how you can access the system.
What are KALKI and K2?
KALKI and K2 are the two systems that currently comprise the IUAC supercomputing facility.
Both systems are MPI clusters. Kalki has 96 compute nodes, 768 compute cores, and an 8-node 6 TB PVFS2 file system, with a Linpack Rmax score of 6.5 teraflops. K2 has 200 compute nodes, 3200 compute cores, and a 55 TB Lustre parallel file system, with a Linpack sustained rating of 62 teraflops. Distributed memory CPU-intensive jobs parallelized using MPI should run well on both clusters.
How do I connect to KALKI and K2?
Kindly contact the system administrators for connection to the HPC facility. For new accounts, contact Sumit Mookerjee; for access issues for existing accounts, contact Ipsita Satpathy.
What is the hardware and OS configuration of the supercomputing facility at IUAC?
KALKI has 96 compute nodes with dual quad-core Xeon CPUs (total 768 cores at 3.0 GHz, 16 GB RAM and 500 GB disk per node) with a 20 Gbps Infiniband interconnect, and 6 TB of additional storage on a PVFS2 cluster. The KALKI head node is also a dual quad-core Xeon system with 32 GB of RAM. The cluster is built using Rocks 5.1 and CentOS. The system supports both the GNU suite (gcc, GSL and OpenMPI) and the Intel suite (icc/ifort, MKL and IMPI).
K2 has 200 compute nodes with dual octa-core Xeon CPUs (total 3200 cores at 2.4 GHz, 64 GB RAM and 500 GB disk per node) with a 40 Gbps Infiniband interconnect, and 55 TB of additional storage on a Lustre parallel file system. The K2 head node is also a dual octa-core Xeon system with 128 GB of RAM. The cluster is built using Rocks 6.1 and CentOS 6.3, with Intel compilers, Intel MKL, and Intel MPI.
How do I submit jobs on KALKI or K2?
For submitting jobs on KALKI/K2, use only the qsub command. Your job will be submitted through the Sun Grid Engine resource manager. The sample qsub script for KALKI should work on K2 as well for most applications. For example, if your script name is "sample_script.txt" you should give the following command to submit it:
$ qsub sample_script.txt
For a sample qsub script, look here.
How do I change the queue and parallel environment in the KALKI qsub script?
To decide which queue you should use, check out the queue management policy here. Then edit/insert into the qsub script something like these lines:
- #$ -q all.q
- #$ -cwd
- #$ -pe mpich 16
- The first line specifies the use of the "all.q" queue for calculations.
- The second line tells the scheduler to run the job in the current working directory (cwd), and the third line tells it to use the MPICH parallel environment with 16 cores.
- On Kalki, the other parallel environment available is the OpenMPI ORTE (-pe orte 8). Since the KALKI compute nodes have 8 cores each, this also means using one whole node for the calculations. On K2, please use multiples of 16 as the core count.
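Putting the three directives together, a minimal job script might look like the sketch below, which writes the script to sample_script.txt. The program name ./my_mpi_program is hypothetical, and the mpirun line is an assumption: the exact launch command depends on the MPI library and on how the parallel environment is configured on each cluster.

```shell
#!/bin/bash
# Write a minimal Grid Engine job script to sample_script.txt.
# The quoted 'EOF' keeps $NSLOTS literal so it is expanded at run time.
cat > sample_script.txt <<'EOF'
#!/bin/bash
#$ -q all.q
#$ -cwd
#$ -pe mpich 16
# NSLOTS is set by Grid Engine to the number of slots granted to the job.
# ./my_mpi_program is a placeholder for your own executable.
mpirun -np "$NSLOTS" ./my_mpi_program
EOF
# Submit it from the login shell with: qsub sample_script.txt
```

Using $NSLOTS rather than a hard-coded number keeps the launch line consistent with the -pe request if the core count is later changed.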
How do I find out how many cores are free on KALKI and K2?
The following command tells you how many cores are available on the various queues:
qstat -g c
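To reduce that output to one line per queue, a small filter can help. The sketch below assumes the usual Grid Engine layout for this command: a header line beginning with "CLUSTER QUEUE" that contains an AVAIL column, followed by a line of dashes. The function name free_cores is hypothetical, and the column layout may differ between Grid Engine versions.

```shell
#!/bin/bash
# free_cores: read `qstat -g c` output on stdin and print
# "<queue> <available slots>" for each cluster queue.
# Assumes the header contains an AVAIL column. The header label
# "CLUSTER QUEUE" is two words, so data columns sit one field
# to the left of their header position.
free_cores() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "AVAIL") col = i - 1; next }
        /^-+$/  { next }                       # skip the separator line
        col && NF { printf "%s %s\n", $1, $col }
    '
}
# Usage on the cluster: qstat -g c | free_cores
```

If the header has no AVAIL column, the function prints nothing, which is a signal to check the raw output by hand.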
What are the limits on number of cores and time for various queues?
For K2, please see this page.
For Kalki, the time and core limits for various queues are tabulated below:
|Queue||Core Limit||Time Limit||Usage|
|all.q||8 to 40||96 hours (4 days)||all except WIEN2K jobs|
|largejob.q||48 to 192||336 hours (14 days)||all except WIEN2K jobs|
|wien.q||8 to 64||No time limit||WIEN2K jobs only|
|test.q||1 to 32||6 hours||Non-WIEN2K jobs, for testing purposes only|
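The Kalki time limits in the table can be written in the HH:MM:SS form that Grid Engine uses for its h_rt (hard run time) resource. The lookup below is a minimal sketch: kalki_h_rt is a hypothetical name, and whether an explicit h_rt request is needed or honored on Kalki is an assumption, since the queues enforce their own limits.

```shell
#!/bin/bash
# kalki_h_rt QUEUE
# Print the Kalki time limit for QUEUE in HH:MM:SS form, or "none"
# for a queue without a limit. Values come from the table above.
# Hypothetical helper, not a cluster tool.
kalki_h_rt() {
    case "$1" in
        all.q)      echo "96:00:00" ;;    # 96 hours (4 days)
        largejob.q) echo "336:00:00" ;;   # 336 hours (14 days)
        test.q)     echo "6:00:00" ;;     # 6 hours
        wien.q)     echo "none" ;;        # no time limit
        *)          echo "unknown queue: $1" >&2; return 1 ;;
    esac
}
```

A job script could then carry a line such as `#$ -l h_rt=96:00:00` for all.q, if the administrators confirm that explicit run-time requests are accepted.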
New account requests, remote login issues, new software requests, policy suggestions:
Dr. Sumit Mookerjee
Principal Investigator, IUAC HPC Project.
Software, application, and scientific library requests and issues:
Ms. Ipsita Satpathy
System related queries, hardware and queue system issues, comments about this web site:
Ms. Ipsita Satpathy