.. unix-course exercises file

.. include:: <isonum.txt>

**These exercises assume you have an account on** *gyra.ualg.pt*

The first time you log in to your account, ``ssh`` will create keys for you - do not enter a password - just keep pressing Enter until you get a prompt.

.. toctree::
   :maxdepth: 2

--------------------

When you use a command you have not seen before, look at the manual pages first: ``$ man <command>``

=========
Exercises
=========

1. Manipulating large numbers of files
======================================

Log in to gyra.ualg.pt with your account name (here I use "test1"):

::

    [cymon@spiro ~]$ ssh gyra.ualg.pt -l test1
    test1@gyra.ualg.pt's password:
    Last login: Sun Apr 17 13:49:36 2011 from 10.10.40.121
    Rocks 5.3 (Rolled Tacos)
    Profile built 16:35 28-Oct-2010

    Kickstarted 17:57 28-Oct-2010
    [test1@gyra ~]$

Open an `interactive session `_ on one of the compute nodes of the cluster by issuing the command:

::

    [test1@gyra ~]$ qrsh

If you want to see where on the cluster you are working:

::

    [test1@gyra ~]$ qstat

Make a new directory and change to it:

::

    [test1@gyra ~]$ mkdir exercise1
    [test1@gyra ~]$ cd exercise1
    [test1@gyra exercise1]$

``+++++++++++ Do not execute the next two commands +++++++++++++++++``

Note that you can string commands together with a ``;`` (semi-colon):

::

    [test1@gyra ~]$ mkdir exercise1; cd exercise1
    [test1@gyra exercise1]$

Even better is to use the ``&&`` operator, which means *execute the following command only if the previous command finished without error*:

::

    [test1@gyra ~]$ mkdir exercise1 && cd exercise1
    [test1@gyra exercise1]$

``++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++``

Next we are going to download a compressed archive - called a *tarball* - of some data. A tarball often has the extension ``.tar.gz`` or ``.tgz`` if it is a ``gzip`` compressed archive, or ``.tar.bz2`` or ``.tbz`` if it is a ``bzip2`` compressed archive.
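
To get a feel for what a tarball is before downloading one, the following is a minimal sketch that builds, lists and unpacks a small ``gzip`` compressed archive in a throwaway directory. The directory and file names (``demo``, ``a.txt``, ``b.txt``, ``unpacked``) are made up for illustration and are not part of the course data.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Make a small directory tree to archive (hypothetical names).
mkdir demo
echo "first"  > demo/a.txt
echo "second" > demo/b.txt

# c = create, z = gzip-compress, f = write to the named archive file.
tar -czf demo.tar.gz demo

# t = list the archive's contents without extracting anything.
listing=$(tar -tzf demo.tar.gz)
echo "$listing"

# x = extract; -C unpacks into another directory instead of the current one.
mkdir unpacked
tar -xzf demo.tar.gz -C unpacked
restored=$(cat unpacked/demo/b.txt)
echo "$restored"
```

Listing an archive with ``t`` before extracting it is a cheap way to check that it will unpack into its own directory rather than scattering files into the current one.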
We are going to download the archive with the programme ``wget`` (web-get), which fetches pages or files using the http protocol. Right-click this link, and copy and paste the link location into your terminal: `analysisA tarball <_static/analysisA.tar.gz>`_

::

    [test1@gyra exercise1]$ wget http://gyra.ualg.pt/unix-course/_static/analysisA.tar.gz
    --2011-04-17 14:57:55-- http://gyra.ualg.pt/unix-course/_static/analysisA.tar.gz
    Resolving gyra.ualg.pt... 193.136.227.166
    Connecting to gyra.ualg.pt|193.136.227.166|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 3681427 (3.5M) [application/x-gzip]
    Saving to: `analysisA.tar.gz'

    100%[=================================================] 3,681,427 --.-K/s in 0.01s

    2011-04-17 14:57:55 (327 MB/s) - `analysisA.tar.gz' saved [3681427/3681427]

    [test1@gyra exercise1]$ ls -l
    total 3600
    -rw-r--r-- 1 test1 biouser 3681427 Apr 17 14:46 analysisA.tar.gz

To expand the archive issue the following command; if the command completes successfully we are also going to remove the compressed archive:

::

    [test1@gyra exercise1]$ tar zxvf analysisA.tar.gz && rm analysisA.tar.gz
    [test1@gyra exercise1]$ ls
    analysisA

Change to the directory and list the contents:

::

    [test1@gyra exercise1]$ cd analysisA
    [test1@gyra analysisA]$ ls
    accd_incl_2.seq psaa_incl_2.seq rpl19_inc

OK, so that's a fair number of files - but how many exactly? Here we list the files, one per line, and count the number of lines by piping the standard-out stream to ``wc`` (word count) with the ``-l`` option, which means *count the lines*:

::

    [test1@gyra analysisA]$ ls -1 | wc -l
    234

Some of these files are NEXUS formatted data matrices, e.g. ``accd_incl_2.seq``.
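
The list-and-count idiom used above can be tried safely away from the course data. This is a minimal sketch in a scratch directory. The file names are invented, chosen only so that some of them match a ``*_incl_2.seq`` pattern and one does not.

```shell
# Create a scratch directory holding a few empty files (hypothetical names).
scratch=$(mktemp -d)
cd "$scratch"
touch gene1_incl_2.seq gene2_incl_2.seq gene3_incl_2.seq run.log

# ls -1 writes one name per line; wc -l counts those lines.
total=$(ls -1 | wc -l)

# The shell expands the glob first, so only the matching names are listed.
matrices=$(ls -1 *_incl_2.seq | wc -l)

echo "all files: $total"
echo "matrices:  $matrices"
```

The pipe (``|``) connects the standard-out of the command on its left to the standard-in of the command on its right, so ``wc`` never sees the directory itself, only the lines that ``ls`` prints.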
Here we print just the first 6 lines of ``accd_incl_2.seq`` to standard-out using the command ``head``:

::

    [test1@gyra analysisA]$ head -n 6 accd_incl_2.seq
    #NEXUS
    begin data;
    dimensions ntax=34 nChar=726;
    format datatype=dna gap=- missing=?;
    matrix
    [test1@gyra analysisA]$

In fact there are a whole bunch of NEXUS data matrices, all with file names ending in ``_incl_2.seq``. How many exactly?

::

    [test1@gyra analysisA]$ ls -1 *_incl_2.seq | wc -l
    76
    [test1@gyra analysisA]$

One of the files is a log file called ``log.out``. Take a look at it using ``cat`` (which prints the file to standard-out):

::

    [test1@gyra analysisA]$ cat log.out

So that wasn't very useful - everything just flew past and off the screen. It's quite a big file:

::

    [test1@gyra analysisA]$ wc -l log.out
    7156 log.out
    [test1@gyra analysisA]$ wc -c log.out
    247246 log.out

That is over 7,000 lines and nearly a quarter of a million characters. Here we use a pager to view the file a page at a time. Press the space-bar to advance the pager and ``q`` or CTRL-C to stop the pager when you get bored:

::

    [test1@gyra analysisA]$ more log.out
    Reading accd_incl_2.seq...
    Reading atpa_incl_2.seq...
    Reading atpb_incl_2.seq...
    Reading atpe_incl_2.seq...
    Reading atpf_incl_2.seq...
    Reading atph_incl_2.seq...
    Reading atpi_incl_2.seq...
    Reading ccsa_incl_2.seq...
    Reading cema_incl_2.seq...

Let's say we want to find out the fate of the matrix "rbcl_incl_2.seq" in the log file. We can use the programme ``grep`` to search the file for a string (a sequence of characters). Any lines containing the search string will be printed; in addition, the ``-n`` option prints the line number.

::

    [test1@gyra analysisA]$ grep -n rbcl_incl_2.seq log.out
    45:Reading rbcl_incl_2.seq...
    7153:set006.seq (1 genes): ['rbcl_incl_2.seq']

Rename all the files with a ``.model`` extension, removing the ``set`` prefix and replacing it with ``exercise1-``:

::

    [test1@gyra analysisA]$ rename set exercise1- *.model
    [test1@gyra analysisA]$ ls -l *.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-000001.seq.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 04:10 exercise1-000002.seq.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 04:10 exercise1-000003.model
    -rw-r--r-- 1 test1 biouser 40 Mar 19 04:10 exercise1-000004.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 07:21 exercise1-000005.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 04:10 exercise1-000006.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001002.seq.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001003.seq.model
    -rw-r--r-- 1 test1 biouser 40 Mar 19 08:47 exercise1-001004.seq.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001005.seq.model
    -rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001006.seq.model
    -rw-r--r-- 1 test1 biouser 41 Mar 18 23:19 exercise1-002003.model
    -rw-r--r-- 1 test1 biouser 39 Mar 18 16:36 exercise1-002004.model
    -rw-r--r-- 1 test1 biouser 41 Mar 19 07:21 exercise1-002005.model
    -rw-r--r-- 1 test1 biouser 41 Mar 18 16:36 exercise1-002006.model
    -rw-r--r-- 1 test1 biouser 39 Mar 18 23:19 exercise1-003004.model
    -rw-r--r-- 1 test1 biouser 41 Mar 19 07:21 exercise1-003005.model
    -rw-r--r-- 1 test1 biouser 41 Mar 18 23:19 exercise1-003006.model
    -rw-r--r-- 1 test1 biouser 39 Mar 19 07:21 exercise1-004005.model
    -rw-r--r-- 1 test1 biouser 39 Mar 18 11:09 exercise1-004006.model
    -rw-r--r-- 1 test1 biouser 41 Mar 19 07:21 exercise1-005006.model
    [test1@gyra analysisA]$

Here we decide to discard all these analyses but to keep all the ``.model`` files in an archive just in case:

::

    [test1@gyra analysisA]$ tar -zcvf mymodels.tar.gz ./exercise1-*
    ./exercise1-000001.seq.model
    ./exercise1-000002.seq.model
    ./exercise1-000003.model
    ./exercise1-000004.model
    ./exercise1-000005.model
    ./exercise1-000006.model
    ./exercise1-001002.seq.model
    ./exercise1-001003.seq.model
    ./exercise1-001004.seq.model
    ./exercise1-001005.seq.model
    ./exercise1-001006.seq.model
    ./exercise1-002003.model
    ./exercise1-002004.model
    ./exercise1-002005.model
    ./exercise1-002006.model
    ./exercise1-003004.model
    ./exercise1-003005.model
    ./exercise1-003006.model
    ./exercise1-004005.model
    ./exercise1-004006.model
    ./exercise1-005006.model
    [test1@gyra analysisA]$ file mymodels.tar.gz
    mymodels.tar.gz: gzip compressed data, from Unix, last modified: Mon Apr 18 11:53:43 2011
    [test1@gyra analysisA]$ mv mymodels.tar.gz .. && cd ..
    [test1@gyra exercise1]$ rm -rf analysisA
    [test1@gyra exercise1]$ ls
    mymodels.tar.gz
    [test1@gyra exercise1]$

``+++++++++++ rm -rf is very dangerous! ++++++++++++++++++++++++++++``

Be **very** careful with the command ``rm -rf``: it means remove with recursion (``-r``, i.e. sub-directories as well) and with force (``-f``, i.e. without asking for confirmation). Consider these three commands:

- ``$ rm -rf ./temp`` |rarr| delete temp
- ``$ rm -rf ./ temp`` |rarr| delete everything in current dir, and temp
- ``$ rm -rf . / temp`` |rarr| delete everything in current dir, **and the entire filesystem!**, and temp

Unless you are the system administrator you won't be able to delete the entire filesystem, but you could easily delete everything in your /home dir and files belonging to other users who are in your *group* - they may not be best pleased if you did this.

``++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++``

Delete ``exercise1``:

::

    [test1@gyra ~]$ rm -rf exercise1

2. Installing software in your home directory
=============================================

Ordinarily, if you needed to use a particular piece of software that wasn't installed, you would ask the administrator to install it system-wide. But there is nothing stopping you from installing software in your /home directory and using it on the cluster.
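
Before asking for, or building, a piece of software it is worth checking whether the shell can already find it. The following is a minimal sketch using the shell built-in ``command -v``. The function name ``have_program`` is a hypothetical helper chosen for this example, not a standard command.

```shell
# Succeed (exit status 0) if the named program can be found on $PATH.
# have_program is a hypothetical helper name used only in this sketch.
have_program() {
    command -v "$1" >/dev/null 2>&1
}

# Check one program that is always present and one that may not be.
for prog in ls clustalw2; do
    if have_program "$prog"; then
        echo "$prog: found at $(command -v "$prog")"
    else
        echo "$prog: not on PATH"
    fi
done
```

A programme reported as "not on PATH" is not necessarily missing from the machine; it may simply live in a directory that the shell does not search.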
Here we are going to build from source, install, and use ClustalW (a sequence alignment programme) in your /home directory. ClustalW is normally available system-wide but has been temporarily removed from your $PATH for this exercise:

::

    [test1@gyra ~]$ clustalw2
    -bash: clustalw2: command not found

Create 2 directories, ``src`` and ``work``:

::

    [test1@gyra ~]$ mkdir src work
    [test1@gyra ~]$

Change to the ``src`` directory and download the `Clustalw source tarball `_. Untar the tarball, and change to the source directory.

::

    [test1@gyra clustalw-2.0.12]$ pwd
    /home/test1/src/clustalw-2.0.12
    [test1@gyra clustalw-2.0.12]$ ls
    aclocal.m4 clustalw_help config.guess config.sub configure configure.ac depcomp install-sh LICENSE m4 Makefile.am Makefile.in missing README src

The ClustalW sources are in the typical `GNU build system format `_, also referred to as the "configure;make;make install" system. We want to install the software in your home directory. By default ``configure`` installs to ``/usr/local``; we need to change this setting with the ``--prefix`` option:

::

    [test1@gyra clustalw-2.0.12]$ ./configure -h
    `configure' configures clustalw 2.0.12 to adapt to many kinds of systems.

    Usage: ./configure [OPTION]... [VAR=VALUE]...

    To assign environment variables (e.g., CC, CFLAGS...), specify them as
    VAR=VALUE. See below for descriptions of some of the useful variables.

    Installation directories:
      --prefix=PREFIX         install architecture-independent files in PREFIX
                              [/usr/local]
      --exec-prefix=EPREFIX   install architecture-dependent files in EPREFIX
                              [PREFIX]

So we need to use the ``--prefix`` argument of the ``configure`` script - here we are going to build and install the software...

CHANGE ``--prefix=/home/test1`` to your home directory ``--prefix=/home/<your-username>``

Note - this may take 60 seconds or so to complete...
::

    [test1@gyra clustalw-2.0.12]$ ./configure --prefix=/home/test1 && make && make install
    make[1]: Leaving directory `/home/test1/src/clustalw-2.0.12'
    [test1@gyra clustalw-2.0.12]$

There was no error - all seemed to go well. The installation script created a ``bin`` directory (where binaries go) in your home directory and installed the clustalw programme there:

::

    [test1@gyra clustalw-2.0.12]$ ls ~/bin
    clustalw2

(It also created a ``~/share/aclocal`` directory with nothing in it - this appears to be a bug in the clustalw installation process - delete it.)

Change to your ``work`` directory:

::

    [test1@gyra work]$ pwd
    /home/test1/work

If we issue the command ``clustalw2`` from the ``work`` directory the command is not found:

::

    [test1@gyra work]$ clustalw2
    -bash: clustalw2: command not found

You could always refer to the programme by its path:

::

    [test1@gyra work]$ ../bin/clustalw2

    **************************************************************
    ******** CLUSTAL 2.0.12 Multiple Sequence Alignments ********
    **************************************************************

But it is better to add it to your PATH permanently.

``+++++++++++ Hidden files +++++++++++++++++++++++++++++++++++++++++++``

Some files are *hidden* - their names start with a period ``.`` and they do not show up with a normal ``ls`` command:

::

    [test1@gyra ~]$ ls
    bin src work

But they do with the ``-a`` (*all*) flag:

::

    [test1@gyra ~]$ ls -al
    total 72
    drwxr-x--- 8 test1 biouser 4096 Apr 18 13:55 .
    drwxr-xr-x 5 root root 0 Apr 18 13:38 ..
    -rw------- 1 test1 biouser 4716 Apr 18 13:55 .bash_history
    -rw-r--r-- 1 test1 biouser 33 Apr 17 14:36 .bash_logout
    -rw-r--r-- 1 test1 biouser 142 Apr 18 13:55 .bash_profile
    -rw-r--r-- 1 test1 biouser 210 Apr 17 14:36 .bashrc
    drwxr-xr-x 2 test1 biouser 4096 Apr 18 13:37 bin
    -rw-r--r-- 1 test1 biouser 515 Apr 17 14:36 .emacs
    -rw------- 1 test1 biouser 35 Apr 17 16:21 .lesshst
    drwxr-xr-x 4 test1 biouser 4096 Apr 17 14:36 .mozilla
    -rw-r--r-- 1 test1 biouser 157 Apr 17 14:37 .ncbirc
    drwxr-xr-x 3 test1 biouser 4096 Apr 18 13:19 src
    drwx------ 2 test1 biouser 4096 Apr 17 15:53 .ssh
    drwxr-xr-x 3 test1 biouser 4096 Apr 17 14:37 .t_coffee
    -rw------- 1 test1 biouser 5297 Apr 18 13:55 .viminfo
    drwxr-xr-x 2 test1 biouser 4096 Apr 18 13:48 work
    -rw------- 1 test1 biouser 232 Apr 18 13:55 .Xauthority
    [test1@gyra ~]$

Most of these files are configuration files written by the system. The file we are interested in is called ``.bash_profile``.

``++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++``

The commands in ``.bash_profile`` are executed when you log in to your account:

::

    [test1@gyra ~]$ cat .bash_profile
    # .bash_profile

    # Get the aliases and functions
    if [ -f ~/.bashrc ]; then
        . ~/.bashrc
    fi

    # User specific environment and startup programs
    [test1@gyra ~]$

We are going to edit this file and adjust your $PATH environment variable to include the ``~/bin`` directory:

::

    [test1@gyra ~]$ nano .bash_profile

Alter the file to read:

::

    # .bash_profile

    # Get the aliases and functions
    if [ -f ~/.bashrc ]; then
        . ~/.bashrc
    fi

    # User specific environment and startup programs
    PATH=~/bin:$PATH

Press ``^X`` (CTRL-X), then ``y``, and then Enter. Confirm that you edited the file correctly.

::

    [test1@gyra ~]$ cat .bash_profile
    # .bash_profile

    # Get the aliases and functions
    if [ -f ~/.bashrc ]; then
        . ~/.bashrc
    fi

    # User specific environment and startup programs
    PATH=~/bin:$PATH
    [test1@gyra ~]$

Issue the command:

::

    [test1@gyra ~]$ source ~/.bash_profile

This executes the commands in ``~/.bash_profile`` just as if you had newly logged in. You have now added ``~/bin`` to your PATH. Note that it has been added to the front of your path, so the shell will always look here first for a programme and, if found, use it rather than another programme of the same name elsewhere on your path...

If you want to see your full $PATH:

::

    [test1@gyra ~]$ echo $PATH

Change directory to ``~/work`` and download this `data file `_ - it's a fasta formatted file of unaligned *atpB* proteins from some cyanobacteria and plant chloroplasts. Take a look at the file contents using ``nano``.

Options available with ``clustalw2``:

::

    [test1@gyra work]$ clustalw2 -help

Here we align the data, outputting a NEXUS formatted alignment, writing additional statistics to ``stats.text`` and directing the standard-out to a file called ``log`` (try it without ``> log`` if you like).

::

    [test1@gyra work]$ clustalw2 -INFILE=atpf_data.fasta -OUTPUT=NEXUS -STATS=stats.text > log
    [test1@gyra work]$ ls
    atpf_data.dnd atpf_data.fasta atpf_data.nxs log stats.text

Inspect the output files using ``cat``, ``head``, ``nano`` or whatever you feel is appropriate.

**Exit your interactive session**::

    [test1@compute-0-3 ~]$ exit
    logout
    Connection to compute-0-3.local closed.
    [test1@gyra ~]$

3. Submitting the analyses to the cluster queue
===============================================

Previously, you installed and executed the ClustalW software while running an `interactive session `_ on the cluster - i.e. you were logged in to one of the compute nodes, where you manually entered commands. Here you are going to submit the ClustalW analysis directly to the cluster queuing system - a `batch submission `_.
(*Please read*)

Navigate to your ``~/work`` directory, **delete the contents except atpf_data.fasta**, and download this `submission file `_. Edit the submission file in ``nano`` and replace the ``XXXXXXXX`` in the line that starts with ``#$ -M`` with your email address, e.g.::

    #$ -M joe.bloggs@googlemail.com

Save the file and submit it to the queue::

    [test1@gyra work]$ qsub submit_to_cluster.sh
    Your job 5258 ("align-atpB") has been submitted
    [test1@gyra work]$ qstat
    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    all.q@compute-0-0.local        BP    0/3/8          3.00     lx26-amd64
            hl:mem_free=5.308G
       5214 0.55500 l34iCV26   cymon        r     04/11/2011 14:17:20     1
       5215 0.55500 l34iCV24   cymon        r     04/11/2011 14:28:35     1
       5216 0.55500 l34iCV22   cymon        r     04/11/2011 14:29:20     1
    ---------------------------------------------------------------------------------
    all.q@compute-0-1.local        BP    0/1/8          1.00     lx26-amd64
            hl:mem_free=6.453G
       5218 0.55500 lpe34free  cymon        r     04/11/2011 14:40:20     1
    ---------------------------------------------------------------------------------
    all.q@compute-0-2.local        BP    0/4/32         4.00     lx26-amd64
            hl:mem_free=60.590G
       3934 0.55500 q-run1     cymon        r     03/13/2011 16:39:22     1
       3935 0.55500 q-run2     cymon        r     03/13/2011 16:55:37     1
       3936 0.55500 q-run3     cymon        r     03/13/2011 16:55:52     1
       3937 0.55500 q-run4     cymon        r     03/13/2011 16:55:52     1
    ---------------------------------------------------------------------------------
    all.q@compute-0-3.local        BIP   0/1/16         0.12     lx26-amd64
            hl:mem_free=27.532G
       5256 0.55500 QRLOGIN    tiagom       r     04/18/2011 12:37:22     1
    ---------------------------------------------------------------------------------
    all.q@compute-0-4.local        BIP   0/1/32         1.00     lx26-amd64
            hl:mem_free=61.376G
       5176 0.55500 l34iCV28   cymon        r     04/05/2011 16:57:37     1
    ---------------------------------------------------------------------------------
    all.q@gyra.local               BP    0/1/16         0.03     lx26-amd64
            hl:mem_free=29.692G
       5258 0.55500 align-atpB test1        r     04/18/2011 15:16:53     1

In this case the
queuing system has started the job (# 5258) on gyra.local, the head node (last line above). When it is finished the queue will send you an email and you should have the same result files as before::

    [test1@gyra work]$ ls -l
    total 68
    -rw-r--r-- 1 test1 biouser 1294 Apr 18 15:17 atpf_data.dnd
    -rw-r--r-- 1 test1 biouser 8008 Apr 18 13:45 atpf_data.fasta
    -rw-r--r-- 1 test1 biouser 11707 Apr 18 15:17 atpf_data.nxs
    -rw-r--r-- 1 test1 biouser 35736 Apr 18 15:17 log
    -rw-r--r-- 1 test1 biouser 486 Apr 18 15:17 stats.text
    -rw-r--r-- 1 test1 biouser 553 Apr 18 15:08 submit_to_cluster.sh
    [test1@gyra work]$

Inspect the files::

    [test1@gyra work]$ head -10 log
    I'm going to sleep for 30 seconds... and I'm back.
    CLUSTAL 2.0.12 Multiple Sequence Alignments
    Sequence format is Pearson
    Sequence 1: Chlorobium 176 aa
    [test1@gyra work]$

*Log out of gyra*::

    [test1@gyra work]$ exit

... there is no more.