Distributed Jobs on the Vision Cluster
Matlab
An easy way to get started writing distributed software is the MatlabMPI toolbox. It implements a subset of the Message Passing Interface (MPI) in Matlab and passes messages between compute nodes through files on the shared file system. Setup takes about a minute, and the basic functionality can be learned in under an hour.
Installation instructions for the vision cluster:
1. Check out the latest MatlabMPI version using Subversion:
svn co svn+ssh://<username>@vision401/common/welinder/svn-public/projects/MatlabMPI
2. If you don't already have one, create a startup.m file in your /home/<username>/matlab/ directory. Matlab runs this file at startup to apply user-defined options [1]. Every Matlab instance looks for startup.m in this directory by default, so settings placed here reach all instances, which they need in order to communicate.
3. Add MatlabMPI/src and MatlabMPI/queue to the Matlab path by placing the following lines in startup.m:
addpath(genpath('/your_root_path/MatlabMPI/src'));
addpath(genpath('/your_root_path/MatlabMPI/queue'));
4. In MatlabMPI/src/MatMPI_Comm_settings.m, change 'rsh' to 'ssh' (if this has not already been done).
5. For Windows: add 'addpath .\MatMPI' to startup.m, which can be opened with:
edit([matlabroot '\toolbox\local\startup.m'])
6. For Windows: in MatlabMPI/src/MatMPI_Commands.m, remove the '/nodesktop' option from the Windows launch command (otherwise Matlab will crash on Windows).
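With the toolbox on the path, a first program might look like the sketch below, modelled on the basic send/receive example that ships with MatlabMPI. The script name hello_mpi.m is hypothetical, and the version modified for the vision cluster may behave slightly differently, so treat this as an illustration rather than a tested recipe.

```matlab
% hello_mpi.m (hypothetical script name): rank 1 sends a vector to rank 0.
MPI_Init;                          % initialise MatlabMPI
comm      = MPI_COMM_WORLD;        % communicator covering all launched processes
comm_size = MPI_Comm_size(comm);   % number of processes
my_rank   = MPI_Comm_rank(comm);   % rank of this process, 0 .. comm_size-1

source = 1;   % rank that sends
dest   = 0;   % rank that receives
tag    = 1;   % label identifying this message

if comm_size >= 2
    if my_rank == source
        data = 1:10;
        MPI_Send(dest, tag, comm, data);      % written to a file on the shared file system
    elseif my_rank == dest
        data = MPI_Recv(source, tag, comm);   % blocks until the message file appears
        disp(data);
    end
end

MPI_Finalize;
if my_rank ~= MatMPI_Host_rank(comm)
    exit;   % close every Matlab instance except the one that launched the job
end
```

To launch it on two processes, run eval( MPI_Run('hello_mpi', 2, {}) ); from the Matlab prompt in the directory containing the script. An empty machine list runs both processes on the local machine; a cell array of host names spreads them across nodes. MatMPI_Delete_all clears the files left behind by a previous run.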
MatlabMPI has some synchronization issues on the vision cluster (see the Synchronization issues section below): some nodes can hang for several minutes waiting for the NFS cache to be updated. Peter Welinder and Piotr Dollar worked around this with some hacks that force NFS to flush its cache. The code was also modified to write stderr as well as stdout to the log files.
Synchronization issues
The NFS cache settings can cause files written by one computer to be invisible on other computers for some time.
More info can be found here: [2], [3]
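To illustrate the kind of problem involved, the sketch below polls for a file written by another node and re-lists the parent directory on every attempt. This is not the workaround by Welinder and Dollar mentioned above. The helper name wait_for_file is hypothetical, and the idea that a fresh directory listing nudges the NFS client into refreshing its cached entries is an assumption that depends on the mount options in use.

```matlab
function ok = wait_for_file(fname, timeout_s)
% WAIT_FOR_FILE  Hypothetical helper, not part of MatlabMPI.
%   Polls until fname is visible on this node or timeout_s seconds elapse.
%   Returns true if the file appeared, false on timeout.
    poll_s = 0.5;                 % delay between attempts, in seconds
    folder = fileparts(fname);
    if isempty(folder)
        folder = '.';             % file name without a path: use current directory
    end

    ok = false;
    t0 = tic;
    while toc(t0) < timeout_s
        % Assumption: re-listing the directory may prompt the NFS client
        % to revalidate its cached directory entries.
        listing = dir(folder);    %#ok<NASGU>
        if exist(fname, 'file') == 2
            ok = true;
            return;
        end
        pause(poll_s);
    end
end
```

A caller could use it as ok = wait_for_file('/shared/results/part1.mat', 300); (the path is made up) and report an error if ok is false, rather than hanging indefinitely.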
Archive
We previously used the MonsterVision Cluster queuing system, but it has been taken out of service.