How to Use OpenPBS  

2009-06-01 18:34:07 | Category: MPI


Basic setup

0.) Set up the PSR system
First you need to run the psr_setup command. This will encrypt your AFS password with the PSR public key and store the encrypted password in the proper place in your home directory. When a batch job runs, it decrypts this file with the private key and uses it to obtain AFS tokens for you.
In a perfect world, you would only do this once. In reality, you will probably have to do this once in a great while. If we think the private key has been compromised, then we will change it. Let's hope this happens very rarely.
Note: the psr_setup command does not check whether you entered your AFS password correctly. If the password is wrong, the batch system will not be able to obtain AFS tokens for you, and your job may not run properly. If you think you made a typo, run the psr_setup command again.
1.) Create a batch script to run your code
There are some simple examples below. A script is not strictly necessary if your job is simple enough, but it helps. Standard output and standard error are captured by the batch system and returned at the end of your job, so if your job simply prints output to the screen, you don't need to redirect that output to a file; PBS will take care of it for you.
2.) Submit your job with the qsub command requesting the proper resources
There are a few resources that you must specify: pmem (physical memory, in megabytes) and cput (CPU time, in hours). You can run qsub with the -I parameter to get an interactive session; this is best used for debugging.
To submit jobs to the 30-node cluster, use gridtest1 as the cluster master. To submit jobs to the mini 3-node cluster, use bumble as the cluster master. Example invocations are shown below.
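For reference, submissions might look like the following sketch. The resource values are copied from the example scripts later on this page, and the "@server" destination form is standard PBS syntax for sending a job to a particular server's default queue; whether you need it here, rather than simply running qsub on gridtest1 or bumble, depends on how the clusters are configured.

# Interactive session for debugging
qsub -I -l nodes=1,pmem=1800mb,mem=1800mb,ncpus=1,cput=1:00:00

# Normal batch submission, run from the cluster master
qsub -l nodes=1,pmem=1800mb,mem=1800mb,ncpus=1,cput=75:00:00 example1.pbs

# Directing a job to a specific server with a destination argument
qsub -q @bumble -l nodes=1,pmem=1800mb,mem=1800mb,ncpus=1,cput=1:00:00 example1.pbs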

Example Batch Scripts

Example 1, simple command that sends output to stdout
# example1.pbs
# Invoke with:
# qsub -l nodes=1,pmem=1800mb,mem=1800mb,ncpus=1,cput=75:00:00 example1.pbs

# uncomment the following to see the values of PBS Environment variables
# echo "qsub host is"
# echo $PBS_O_HOST
# echo "original queue is"
# echo $PBS_O_QUEUE
# echo "qsub working directory absolute is"
# echo $PBS_O_WORKDIR
# echo "pbs environment is"
# echo $PBS_ENVIRONMENT
# echo "pbs batch id"
# echo $PBS_JOBID
# echo "pbs job name from me is"
# echo $PBS_JOBNAME
# echo "Name of file containing nodes is"
# echo $PBS_NODEFILE
# echo "contents of nodefile is"
# cat $PBS_NODEFILE
# echo "Name of queue to which job went is"
# echo $PBS_QUEUE

# make sure we are in the right directory in case we are writing files
cd $PBS_O_WORKDIR

# Run the actual command here
/usr/bin/openssl speed
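Once the job is submitted, the usual PBS client commands can be used to watch or cancel it. A minimal sketch (the job id shown is made up; qsub prints the real one when you submit):

# Submit and note the job id printed by qsub, e.g. 1234.gridtest1
qsub -l nodes=1,pmem=1800mb,mem=1800mb,ncpus=1,cput=75:00:00 example1.pbs

# List your jobs and their states
qstat

# Show full details for one job
qstat -f 1234.gridtest1

# Remove a job you no longer want
qdel 1234.gridtest1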
Example 2, LAM/MPI multi-machine job
# example2.pbs
# invoke with
# qsub -l nodes=2:ppn=1,pmem=1800mb,mem=1800mb,ncpus=2,cput=5:00:00 example2.pbs
# note that we are using one cpu per node. That :ppn=1 is critical.

# comment these out if you wish
echo "qsub host is"
echo $PBS_O_HOST
echo "original queue is"
echo $PBS_O_QUEUE
echo "qsub working directory absolute is"
echo $PBS_O_WORKDIR
echo "pbs environment is"
echo $PBS_ENVIRONMENT
echo "pbs batch id"
echo $PBS_JOBID
echo "pbs job name from me is"
echo $PBS_JOBNAME
echo "Name of file containing nodes is"
echo $PBS_NODEFILE
echo "contents of nodefile is"
cat $PBS_NODEFILE
echo "Name of queue to which job went is"
echo $PBS_QUEUE
# make sure we are in the right directory in case we are writing files
cd $PBS_O_WORKDIR

# Setup the LAM/MPI topology
lamboot -v $PBS_NODEFILE
# Run the mpi job
# Note: the LAM/MPI mpirun is in /usr/bin; use the full path

/usr/bin/mpirun -np 2 -wd $PBS_O_WORKDIR ./dft_mpi
# Clean up the LAM/MPI topology (kill the lamd daemons on the nodes)
wipe -v $PBS_NODEFILE
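For reference, with -l nodes=2:ppn=1 the $PBS_NODEFILE that lamboot reads contains one line per allocated processor, which here means one hostname per node (the hostnames below are made up):

# cat $PBS_NODEFILE would print something like:
node01
node02
# With :ppn=2, each hostname would appear twice, once per processor.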
Example 3, MPICH multi-machine job
# example3.pbs
# invoke with
# qsub -l nodes=2:ppn=1,pmem=1800mb,mem=1800mb,ncpus=2,cput=5:00:00 example3.pbs
# note that we are using one cpu per node. That :ppn=1 is critical.

# comment these out if you wish
echo "qsub host is"
echo $PBS_O_HOST
echo "original queue is"
echo $PBS_O_QUEUE
echo "qsub working directory absolute is"
echo $PBS_O_WORKDIR
echo "pbs environment is"
echo $PBS_ENVIRONMENT
echo "pbs batch id"
echo $PBS_JOBID
echo "pbs job name from me is"
echo $PBS_JOBNAME
echo "Name of file containing nodes is"
echo $PBS_NODEFILE
echo "contents of nodefile is"
cat $PBS_NODEFILE
echo "Name of queue to which job went is"
echo $PBS_QUEUE

# make sure we are in the right directory in case we are writing files
cd $PBS_O_WORKDIR

# Run the mpi job
# Note: the MPICH mpirun command is in /usr/local/bin
# use the full path
/usr/local/bin/mpirun -np 2 -machinefile $PBS_NODEFILE ./dft_mpi
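The program must be built against the same MPI implementation whose mpirun runs it. The sketch below assumes the mpicc compiler wrappers sit alongside the mpirun binaries mentioned above and that the source file is called dft_mpi.c; both are assumptions, so check your installation:

# LAM/MPI build, to go with example2
/usr/bin/mpicc -o dft_mpi dft_mpi.c

# MPICH build, to go with example3
/usr/local/bin/mpicc -o dft_mpi dft_mpi.c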

In both example2 and example3, if you were running a threaded MPI job, the ppn=1 would become ppn=2 for the cluster nodes. Note that np in the mpirun command would not increase to 4, but ncpus in the qsub command would. Alternatively, you could use 4 CPUs with either ppn=1 or ppn=2 for a non-threaded MPI job; in that case both qsub ncpus=4 and mpirun -np 4 would be specified. Concrete invocations are sketched below.
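The memory and time values below are simply reused from the earlier invocations; adjust them for your own job.

# Threaded MPI job: 2 ranks with 2 threads each; mpirun -np 2 stays the
# same inside the script
qsub -l nodes=2:ppn=2,pmem=1800mb,mem=1800mb,ncpus=4,cput=5:00:00 example2.pbs

# Non-threaded MPI job with 4 ranks (change the script to mpirun -np 4),
# either spread over 4 nodes ...
qsub -l nodes=4:ppn=1,pmem=1800mb,mem=1800mb,ncpus=4,cput=5:00:00 example2.pbs
# ... or packed two per node
qsub -l nodes=2:ppn=2,pmem=1800mb,mem=1800mb,ncpus=4,cput=5:00:00 example2.pbs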

Example 4

There are lots of comments in the example. This uses tcsh/csh for the scripting.

# example4.pbs
# Invoke with (the -S option makes sure the script is interpreted by tcsh
# even if your login shell is something else):
# qsub -S /bin/tcsh -l nodes=1,pmem=80mb,mem=80mb,ncpus=1,cput=1:00:00 example4.pbs


# Assumptions: You start from a directory in your home AFS space. You
# have a program called calculate_data that reads an initial state
# from data.in and generates a data.out file. Then periodically
# thereafter, a new data.out file is generated until the calculation
# finishes. Only the last data.out is valuable, but you want to keep 2
# in AFS space just in case. There may be other files generated during
# the calculation, but these are temporary working files that you
# don't worry about being backed up.


# make sure we are in the right directory in AFS to start
cd $PBS_O_WORKDIR

# Setup
# Create a unique working directory in /scratch; replace rbr with your
# username
mkdir -p /scratch/rbr/test_$PBS_JOBID
# Copy the program and the input file to the new directory
cp calculate_data /scratch/rbr/test_$PBS_JOBID
cp data.in /scratch/rbr/test_$PBS_JOBID
# Go to the new directory in /scratch
cd /scratch/rbr/test_$PBS_JOBID

# Run the actual calculation. Run it in the background. Collect stdout and
# stderr in the file out.out
./calculate_data >& out.out &

# Wait for data.out to exist
while (1)
    if ( -f data.out ) then
        break
    else
        sleep 60
    endif
end

# Wait 10 seconds and then get the modification time of data.out
sleep 10
@ oldtime = -C data.out
set i=1
# copy data.out to AFS space with iteration number
cp data.out $PBS_O_WORKDIR/data.out.$i
while (1)
    # get the modification time of data.out
    @ newtime = -C data.out
    if ( $newtime > $oldtime ) then
        # Newer data.out exists. increment counter, copy file to AFS
        @ i = $i + 1
        cp data.out $PBS_O_WORKDIR/data.out.$i
        @ j = $i - 2
        # remove old backup file
        if ( -f $PBS_O_WORKDIR/data.out.$j ) then
            rm $PBS_O_WORKDIR/data.out.$j
        endif
        # update modification time
        @ oldtime = $newtime
    endif
    # wait a minute before looping again
    sleep 60
    set running = `ps | grep -v grep | grep -c calculate_data`
    # Check to see if the program is still running
    if ( $running == 0 ) then
        # if done, copy data.out one more time just in case
        cp data.out $PBS_O_WORKDIR
        cp out.out $PBS_O_WORKDIR/out.$PBS_JOBID
        break
    endif
end
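One thing the example never does is remove its /scratch working directory once the results are safely back in AFS. If you want that, a couple of lines like these could go at the very end of the script (same hypothetical rbr username as above):

# Optional cleanup: leave /scratch before removing the working directory
cd $PBS_O_WORKDIR
rm -rf /scratch/rbr/test_$PBS_JOBID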

Useful OpenPBS Man Pages
