Niagara, Canada's Supercomputer
Updated: Apr 14
Niagara is a Canadian supercomputer hosted and owned by the University of Toronto and operated by SciNet. It can be used by all Canadian researchers via their active Compute Canada account. So if you're a Canadian researcher, go ahead and get your access, because it's truly amazing! Last year I had to do a massive amount of computational experiments for my most recent paper (on handling stochasticity in telecommunication networks; why not have a look at it?!), and my resources seemed minuscule in comparison. So I finally started dipping my toes in Niagara (oh the pun!). Niagara's Quickstart documentation is quite nice and can get you started immediately. Here I will simplify that guide for first-time users, and expand it with some additional steps that might come in handy for us ORMS researchers.
Apply For A Compute Canada Database (CCDB) Account
Go to this link and follow the instructions to get your account. If you are a PI, then you're good to go. Otherwise, you need to ask your PI to get their CCRI number, which you need for creating your account. After activating your Compute Canada account, you have to head on to https://ccdb.computecanada.ca and request access to Niagara. After 1 or 2 days your access will be granted. The following links will be useful (constantly!):
Status of the servers and scheduler: There will be outages, planned maintenance periods, cooling system's pump seal explosions (yes it happened), and reservations. These are all very rare, but if you see that for some reason you cannot connect to the server, with this link you can check whether the server or scheduler is down.
SciNet portal: This is your portal on SciNet and provides updated information on your usage and the history of your submitted jobs (even the scripts you have submitted to the nodes).
Login to Niagara
There are two types of nodes on Niagara: “login” and “compute” nodes. When you login, you are directed to a login node. On a login node, you can upload/download your files, test them, and finally submit your jobs to the compute nodes on the cluster.
To login, as usual, you run a ssh command:
Every user has two main directories with predefined names: $HOME and $SCRATCH. For a researcher with username "tiger" working with a PI with username "daniel" (I do have a toddler!), the paths look like this:
$HOME=/home/d/daniel/tiger -> read-only for the compute nodes. It’s mostly useful for keeping backups or installing softwares.
$SCRATCH=/scratch/d/daniel/tiger -> 25Tb of storage that expires after 2 months. Compute nodes can read and write on $SCRATCH, so it’s suitable for working on the projects and submitting the jobs.
So, "tiger" logs into Niagara with the following ssh command and enters the password when asked.
Assume that Niagara assigns "tiger" a login node named "nia-login02". In the following, you see the initial commands that "tiger" ran on Niagara, and their outcomes:
Data Management on Niagara
I usually use CloudMounter for file management on servers. Its integration with macOS is seamless. There are other free softwares as well, but I didn’t find them quite as stable. You don’t need a software for file management though.
Using scp for under 10Gb files
You can use the scp command as you always do on Linux. For uploading source to path on Niagara, "tiger" uses the following command:
and for download to destination:
Through the web app
(The following instructions for Web app might have slightly changed since last year, but in essence it’s the same.)
Install Globus Connect Personal on your machine first (instructions below) -> this gives you a name for your machine
https://globus.computecanada.ca -> web app of Compute Canada for data management
Login with Compute Canada id
In Transfer Files, on one window, type the name you got from installing Globus Connect Personal on your machine
On another window type computecanada#niagara, and on the path field, write your home (or scratch) path (you can bookmark for future use)
Go to your desired directory, select the file you want to transfer, and press on the blue arrow to transfer the files
Some common softwares and modules are already installed on Niagara. A list of available softwares can be found via:
Before running any program that uses such modules, we have to load them. So if you want to run a java program, you need to run the following command:
If a software is not installed, you can either ask the support to install it for you or your research group, or it can be installed on your own space. Let us install our two wonderful MIP solvers, Cplex and Gurobi.
Download the Cplex bin file (linux x86_64)
Upload the bin file to your Niagara space (preferably your $HOME, because $SCRATCH expires every 2 months. Then in the scripts you can use $HOME as a predefined path to refer to the address where Cplex is installed)
Use the following command to change its permission to read and write:
Run the installer:
For installation, it needs an absolute path, e.g., /home/d/daniel/tiger/ILOG/CPLEX_Studio1210
If you want to use CPLEX or CP Optimizer engines through their Python APIs, you need to tell Python where to find them. To do so, enter the following command into the terminal:
Install Anaconda and Gurobi
I honestly cannot add anything else to Gurobi's installations instructions. You can find them here.
For testing purposes, you can use the login nodes. This should be small instances that take only a couple of minutes to finish, and use up to 2GB memory. The commands are as usual with all Linux servers. Just remember, we have to first load the softwares that we need (the ones installed on Niagara). For instance, "tiger" wants to run a Java program named myTest (in folder tests) that uses Cplex as its solver:
Let us analyze the second command:
By using the keyword nohup, your program continues to execute, even after you log out. It means "no hang up".
java invokes the java interpreter.
-Djava.library.path=$HOME/ILOG/CPLEX_Studio1210/cplex/bin/x86-64_linux/ adds Cplex to Java library path
-classpath $HOME/ILOG/CPLEX_Studio1210/cplex/lib/cplex.jar:bin” adds cplex.jar and bin folder of myTest to the classpath. These two are separated by ":"
tests/myTest is the Java class "tiger" is running
10 2 are the arguments that are passed to the main() function of the Java program. They are separated with a space.
> ./logs/logMyTest_a10_b2.out reroutes the console output to a log file with the given name logMyTest_a10_b2.out in folder logs
2>&1 tells the shell to route the standard error output (2, the file descriptor for standard error, stderr) to the standard output (file descriptor 1). Using &1, you're telling the shell that this is not a file named 1, it's a file descriptor. Roughly speaking, in the previous step you told the shell to change the standard output to your log file. Now you're adding that you want the errors to be written there as well.
& is for running the program in the background, in a subshell. So your main shell does not wait for the program to be finished and you can continue executing other commands. You can check the performance of the task with the command "top”.
To test your script for job submission though, you can reserve up to 4 compute nodes for an hour. For example:
In the above command, “1” means that only one node is requested for debugging. By using this mode, you can make sure that the job submission is using the node to the full capacity. To run a batch file named myfile.sh on a compute node while testing, use the following commands (more on batch files in the next section):
Some useful tips:
When you have multiple .sh files, for running the next ones you should use
instead of $ ./FILENAME.sh.
For running the file in the background, you don't need to (shouldn't!) use the nohup command. When a compute node is reserved for debugging, after logging out every job gets killed since the resources are released. However, when jobs are submitted to a compute node by the scheduler, you are not logged into them anyway, and your job will continue running there even when you log out of your login node. So, to sum up, for executing a batch file in the background on a compute node in the debugging mode:
Finally, let us kill all at once, every job "tiger" has submitted:
Of course only "tiger" can do that!
In most cases, you will want to submit from your $SCRATCH directory, so that the output of your compute job can be saved in a log file (as mentioned above, $HOME is read-only on the compute nodes). See the default resource capacity here. When you submit your jobs, a Scheduler assigns a compute node from the cluster to your jobs and they are run on that particular compute node. Every compute node would be reserved for just a single user, so they would have access to the full capacity of the node (40 core and 200GB) and no one else can use that node at the same time.
To submit your job to the Scheduler, put your project in your $SCRATCH folder. Then run your batch file myfile.sh with the following command:
Example of a batch file for a single job on a single compute node
If "tiger" wants to submit the Java program to the Scheduler and only needs one compute node, myfile.sh should look like this:
In the above batch file, “#!/bin/bash” is telling Linux that this is a batch file. Jobs are placed in a queue and run accordingly. The lines that start with #SBATCH are commands for the Scheduler that decides on the order and priorities of the jobs in the queue. In this example, we are requesting 1 node, and since we have only one single job (task), we are telling the Scheduler to assign 40 cores to this single job. Then we give it a name and an upper bound on the running time (which can be up to 24hrs). If we have a single job and output, using “--output=myTest_output_%j.txt”, outputs can be saved in the log file myTest_output_%j.txt, where %j would be job ID assigned by the Scheduler. With “--mail-type=FAIL”, Niagara will send you an email if your job fails for any reason. Then we have the commands we want to run. So for example, here "tiger" moves to the directory of the project myProject with the cd command (because the current working directory is $SCRATCH), then loads the Java module, and finally runs the Java program with the usual command. As was mentioned before, there is no need for “2>&1 &” because the job is running in the background by default.
I have found the following commands very useful:
Check the statues of your jobs on the queue: $squeue -u <username>
An estimate of when a certain job is going to start: $ squeue --start -j <jobid>
Check a summary of the status of recent jobs: $ sacct
Cancel a job: $scancel <jobid>
Memory and cpu usage of a job: $jobperf <jobid>
* As a side note, having the line “#SBATCH --account def-daniel” is not necessary, but I found out if you have it in your batch file, you can still submit your jobs even if the Scheduler is down. "def-daniel" is the default allocation for your research group which is under the supervision of the account holder "daniel". When the Scheduler is down, you get the following error after submitting your job:
Also, Scheduler will be red on the status page.
Update: On Niagara you can nicely parallelize your jobs through preinstalled modules. You can head out to this post for job parallelization.