These exercises assume you have an account on gyra.ualg.pt

The first time you log in to your account, ssh will create keys for you - do not enter a password - just keep pressing <RETURN> until you get a prompt.


When you use a command you have not seen before, look at the manual pages first:

$ man <command>

Exercises

1. Manipulating large numbers of files

Log in to gyra.ualg.pt with your account name (here I use “test1”):

[cymon@spiro ~]$ ssh gyra.ualg.pt -l test1
test1@gyra.ualg.pt's password:

Last login: Sun Apr 17 13:49:36 2011 from 10.10.40.121
Rocks 5.3 (Rolled Tacos)
Profile built 16:35 28-Oct-2010

Kickstarted 17:57 28-Oct-2010
[test1@gyra ~]$

Open an interactive session on one of the compute nodes of the cluster by issuing the command:

[test1@gyra ~]$ qrsh

If you want to see where on the cluster you are working:

[test1@gyra ~]$ qstat

Make a new directory and change to it:

[test1@gyra ~]$ mkdir exercise1
[test1@gyra ~]$ cd exercise1
[test1@gyra exercise1]$

+++++++++++ Do not execute the next two commands +++++++++++++++++

Note that you can string commands together with a ; (semi-colon):

[test1@gyra ~]$ mkdir exercise1; cd exercise1
[test1@gyra exercise1]$

Even better is to use the && directive, which means execute the following only if the previous command finished without error:

[test1@gyra ~]$ mkdir exercise1 && cd exercise1
[test1@gyra exercise1]$

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
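
The difference between ; and && only shows up when the first command fails. Here is a minimal sketch you can run safely: it works in a scratch directory made with mktemp, and the names blocker, semicolon.txt and andand.txt are made up for illustration.

```shell
# Keep going after a failing command (this is how an interactive shell behaves).
set +e

# Work in a throwaway directory so nothing in your home directory is touched.
scratch=$(mktemp -d)
cd "$scratch"

# 'blocker' is a regular file, so 'mkdir blocker' below will fail.
touch blocker

# With ';' the second command runs even though mkdir failed.
mkdir blocker 2>/dev/null ; echo "ran after ;" > semicolon.txt

# With '&&' the second command is skipped because mkdir failed.
mkdir blocker 2>/dev/null && echo "ran after &&" > andand.txt

# Only blocker and semicolon.txt exist afterwards.
ls
```

This is why && is the safer choice for mkdir followed by cd: if the directory cannot be created you stay where you are.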

Next we are going to download a compressed archive - called a tarball - of some data. A tarball usually has the extension .tar.gz or .tgz if it is a gzip compressed archive, or .tar.bz2 or .tbz if it is a bzip2 compressed archive. We are going to download the archive with the programme wget (web-get), which fetches pages or files over HTTP.

Right-click this link, and copy and paste the link location into your terminal:

[test1@gyra exercise1]$ wget http://gyra.ualg.pt/unix-course/_static/analysisA.tar.gz
--2011-04-17 14:57:55--  http://gyra.ualg.pt/unix-course/_static/analysisA.tar.gz
Resolving gyra.ualg.pt... 193.136.227.166
Connecting to gyra.ualg.pt|193.136.227.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3681427 (3.5M) [application/x-gzip]
Saving to: `analysisA.tar.gz'

100%[=================================================] 3,681,427   --.-K/s   in 0.01s

2011-04-17 14:57:55 (327 MB/s) - `analysisA.tar.gz' saved [3681427/3681427]

[test1@gyra exercise1]$ ls -l
total 3600
-rw-r--r-- 1 test1 biouser 3681427 Apr 17 14:46 analysisA.tar.gz

To expand the archive, issue the following command. If tar completes successfully, we also remove the compressed archive:

[test1@gyra exercise1]$ tar zxvf analysisA.tar.gz && rm analysisA.tar.gz
[test1@gyra exercise1]$ ls
analysisA
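
If you are unsure what a tarball contains, you can list its contents without extracting anything. The sketch below builds a tiny archive of its own in a scratch directory (demo, a.txt and b.txt are made-up names), so it does not depend on analysisA.tar.gz.

```shell
# Build a tiny example archive in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir demo
echo "one" > demo/a.txt
echo "two" > demo/b.txt
tar -czf demo.tar.gz demo

# -t lists the contents, -z handles gzip, -f names the archive file.
listing=$(tar -tzf demo.tar.gz)
echo "$listing"

# Count the files in the archive (directory entries end in '/').
nfiles=$(echo "$listing" | grep -vc '/$')
echo "$nfiles files in archive"
```

The same -t option works on any .tar.gz file, so it is a cheap check before you expand an archive or delete the files it was made from.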

Change to the directory and list the contents:

[test1@gyra exercise1]$ cd analysisA
[test1@gyra analysisA]$ ls
accd_incl_2.seq  psaa_incl_2.seq   rpl19_inc <etc>

OK, so that’s a fair number of files - but how many exactly? Here we are going to list the files, one per line, and count the number of lines by piping the standard-out stream to wc (word count) with an -l option that means count the lines:

[test1@gyra analysisA]$ ls -1 | wc -l
234

Some of these files are NEXUS formatted data matrices, e.g. accd_incl_2.seq. Here we print just the first 6 lines of the file to standard-out using the command head:

[test1@gyra analysisA]$ head -n 6 accd_incl_2.seq
#NEXUS

begin data;
  dimensions ntax=34 nChar=726;
  format datatype=dna gap=- missing=?;
  matrix
[test1@gyra analysisA]$

In fact there are a whole bunch of NEXUS data matrices, all with file names ending in _incl_2.seq. How many exactly?

[test1@gyra analysisA]$ ls -1 *_incl_2.seq | wc -l
76
[test1@gyra analysisA]$
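
It helps to know that the shell, not ls, expands *_incl_2.seq into the list of matching names. As a small sketch, with a handful of made-up files in a scratch directory, you can count the matches either way and get the same answer:

```shell
# Create a few empty example files in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
touch accd_incl_2.seq atpa_incl_2.seq atpb_incl_2.seq log.out

# Count the matches the same way as in the exercise.
by_ls=$(ls -1 *_incl_2.seq | wc -l)

# Alternative: let the shell expand the pattern into its argument list
# and ask how many arguments there are.
set -- *_incl_2.seq
by_glob=$#

echo "ls | wc: $by_ls   glob: $by_glob"
```

One caveat: if nothing matches, the pattern is left unexpanded, so the second method reports 1 rather than 0.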

One of the files is a log file called log.out. Take a look at it using cat (which prints the file to standard-out):

[test1@gyra analysisA]$ cat log.out

So that wasn't very useful - everything just flew past and off the screen. It's quite a big file:

[test1@gyra analysisA]$ wc -l log.out
7156 log.out
[test1@gyra analysisA]$ wc -c log.out
247246 log.out

Over 7,000 lines and nearly a quarter of a million characters. Here we use a pager to view the file a page at a time. Press the space-bar to advance the pager and q or CTRL-C to stop the pager when you get bored:

[test1@gyra analysisA]$ more log.out
Reading accd_incl_2.seq...
Reading atpa_incl_2.seq...
Reading atpb_incl_2.seq...
Reading atpe_incl_2.seq...
Reading atpf_incl_2.seq...
Reading atph_incl_2.seq...
Reading atpi_incl_2.seq...
Reading ccsa_incl_2.seq...
Reading cema_incl_2.seq...
<etc>

Let's say we want to find out the fate of the matrix “rbcl_incl_2.seq” in the log file. We can use the programme grep to search the file for a string (sequence of characters). Any lines containing the search string will be printed; in addition, the -n option prints the line number.

[test1@gyra analysisA]$ grep -n rbcl_incl_2.seq log.out
45:Reading rbcl_incl_2.seq...
7153:set006.seq (1 genes): ['rbcl_incl_2.seq']
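
Strictly speaking grep treats its search string as a regular expression, in which an unescaped . matches any single character. That makes no difference for this log file, but the sketch below shows the effect on a made-up two-line file (sample.log) and how the -F option switches to a literal, fixed-string search.

```shell
# Make a small example log in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' 'Reading rbcl_incl_2.seq...' 'Reading rbcl_incl_2xseq...' > sample.log

# As a regular expression, '.' matches any character, so both lines match.
regex_hits=$(grep -c 'rbcl_incl_2.seq' sample.log)

# With -F the pattern is a fixed string, so only the literal name matches.
fixed_hits=$(grep -cF 'rbcl_incl_2.seq' sample.log)

echo "regex: $regex_hits   fixed: $fixed_hits"
```

When you are searching for file names, grep -F (or escaping the dots) avoids surprising extra matches.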

Rename all the files with a .model extension, removing the set prefix and replacing it with exercise1-:

[test1@gyra analysisA]$ rename set exercise1- *.model
[test1@gyra analysisA]$ ls -l *.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-000001.seq.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 04:10 exercise1-000002.seq.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 04:10 exercise1-000003.model
-rw-r--r-- 1 test1 biouser 40 Mar 19 04:10 exercise1-000004.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 07:21 exercise1-000005.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 04:10 exercise1-000006.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001002.seq.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001003.seq.model
-rw-r--r-- 1 test1 biouser 40 Mar 19 08:47 exercise1-001004.seq.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001005.seq.model
-rw-r--r-- 1 test1 biouser 42 Mar 19 08:47 exercise1-001006.seq.model
-rw-r--r-- 1 test1 biouser 41 Mar 18 23:19 exercise1-002003.model
-rw-r--r-- 1 test1 biouser 39 Mar 18 16:36 exercise1-002004.model
-rw-r--r-- 1 test1 biouser 41 Mar 19 07:21 exercise1-002005.model
-rw-r--r-- 1 test1 biouser 41 Mar 18 16:36 exercise1-002006.model
-rw-r--r-- 1 test1 biouser 39 Mar 18 23:19 exercise1-003004.model
-rw-r--r-- 1 test1 biouser 41 Mar 19 07:21 exercise1-003005.model
-rw-r--r-- 1 test1 biouser 41 Mar 18 23:19 exercise1-003006.model
-rw-r--r-- 1 test1 biouser 39 Mar 19 07:21 exercise1-004005.model
-rw-r--r-- 1 test1 biouser 39 Mar 18 11:09 exercise1-004006.model
-rw-r--r-- 1 test1 biouser 41 Mar 19 07:21 exercise1-005006.model
[test1@gyra analysisA]$
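
The rename command used above takes the form rename <from> <to> <files>. Be aware that on some Linux distributions rename is a different programme with a different syntax. A plain loop with mv works everywhere; the sketch below assumes two made-up file names in a scratch directory.

```shell
# Create two example files in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
touch set000001.seq.model set000003.model

for f in set*.model; do
    # ${f#set} is the file name with the leading "set" stripped off.
    mv -- "$f" "exercise1-${f#set}"
done

ls
```

If you want to check what the loop will do before it does it, put echo in front of mv on a first run.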

Here we decide to discard all these analyses but to keep all the .model files in an archive just in case:

[test1@gyra analysisA]$ tar -zcvf mymodels.tar.gz ./exercise1-*
./exercise1-000001.seq.model
./exercise1-000002.seq.model
./exercise1-000003.model
./exercise1-000004.model
./exercise1-000005.model
./exercise1-000006.model
./exercise1-001002.seq.model
./exercise1-001003.seq.model
./exercise1-001004.seq.model
./exercise1-001005.seq.model
./exercise1-001006.seq.model
./exercise1-002003.model
./exercise1-002004.model
./exercise1-002005.model
./exercise1-002006.model
./exercise1-003004.model
./exercise1-003005.model
./exercise1-003006.model
./exercise1-004005.model
./exercise1-004006.model
./exercise1-005006.model
[test1@gyra analysisA]$ file mymodels.tar.gz
mymodels.tar.gz: gzip compressed data, from Unix, last modified: Mon Apr 18 11:53:43 2011
[test1@gyra analysisA]$ mv mymodels.tar.gz .. && cd ..
[test1@gyra exercise1]$ rm -rf analysisA
[test1@gyra exercise1]$ ls
mymodels.tar.gz
[test1@gyra exercise1]$

+++++++++++ rm -rf is very dangerous! ++++++++++++++++++++++++++++

Be very careful with the command rm -rf. It means remove with recursion (-r, i.e. sub-directories as well) and with force (-f). Consider these three commands:

  • $ rm -rf ./temp → delete temp
  • $ rm -rf ./ temp → delete everything in current dir, and temp
  • $ rm -rf . / temp → delete everything in current dir, and the entire filesystem!, and temp

Unless you are the system administrator you won't be able to delete the entire filesystem, but you could easily delete everything in your /home directory and files belonging to other users who are in your group - they may not be best pleased if you did this.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
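
One habit that guards against the stray-space mistake is to put the path in a variable, look at what it points to, and only then delete it. This is a sketch of that habit, not a guarantee of safety; the directories temp and keep are made up and live in a scratch directory.

```shell
# Set up a throwaway directory with something to delete and something to keep.
scratch=$(mktemp -d)
cd "$scratch"
mkdir temp keep
touch temp/junk.txt keep/important.txt

target=./temp

# Preview exactly what the path refers to before deleting anything.
ls -d -- "$target"

# The quotes keep the path as one argument; '--' marks the end of the options.
rm -rf -- "$target"

# Only 'keep' is left.
ls
```

Because "$target" is quoted, it reaches rm as a single argument even if the path contains a space.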

Delete exercise1:

[test1@gyra ~]$ rm -rf exercise1

2. Installing software in your home directory

Ordinarily, if you needed to use a particular piece of software that wasn’t installed, you would ask the administrator to install it system-wide. But there is nothing stopping you installing software in your /home directory and using it on the cluster.

Here we are going to build from source, install, and use ClustalW (sequence alignment software) in your /home directory. ClustalW is normally available system-wide but has been temporarily removed from your $PATH for this exercise:

[test1@gyra ~]$ clustalw2
-bash: clustalw2: command not found

Create two directories, src and work:

[test1@gyra ~]$ mkdir src work
[test1@gyra ~]$

Change to the src directory and download the ClustalW source tarball. Untar the tarball, and change to the source directory:

[test1@gyra clustalw-2.0.12]$ pwd
/home/test1/src/clustalw-2.0.12
[test1@gyra clustalw-2.0.12]$ ls
aclocal.m4  clustalw_help  config.guess  config.sub  configure  configure.ac  depcomp  install-sh  LICENSE  m4  Makefile.am  Makefile.in  missing  README  src

The ClustalW sources are in the typical GNU build system format, also referred to as the “configure; make; make install” system.

We want to install the software in your home directory. By default configure installs to /usr/local, so we need to change this setting with the --prefix option:

[test1@gyra clustalw-2.0.12]$ ./configure -h
`configure' configures clustalw 2.0.12 to adapt to many kinds of systems.

Usage: ./configure [OPTION]... [VAR=VALUE]...

To assign environment variables (e.g., CC, CFLAGS...), specify them as
VAR=VALUE.  See below for descriptions of some of the useful variables.

<SKIP STUFF>


Installation directories:
--prefix=PREFIX         install architecture-independent files in PREFIX
                        [/usr/local]
--exec-prefix=EPREFIX   install architecture-dependent files in EPREFIX
                        [PREFIX]

<SKIP STUFF>

So we need to use the --prefix argument of the configure script - here we are going to build and install the software...

CHANGE --prefix=/home/test1 to your home directory --prefix=/home/<your account name>

Note - this may take 60 seconds or so to complete...

[test1@gyra clustalw-2.0.12]$ ./configure --prefix=/home/test1 && make && make install

<LOTS OF OUTPUT AS THE PROGRAMME IS COMPILED>

make[1]: Leaving directory `/home/test1/src/clustalw-2.0.12'
[test1@gyra clustalw-2.0.12]$

There was no error - all seemed to go well. The installation script created a bin directory (where binaries go) in your home directory and installed the clustalw2 programme there:

[test1@gyra clustalw-2.0.12]$ ls ~/bin
clustalw2

(It also created a ~/share/aclocal directory with nothing in it - this appears to be a bug in the clustalw installation process - delete it.)

Change to your work directory:

[test1@gyra work]$ pwd
/home/test1/work

If we issue the command clustalw2 from the work directory the command is not found:

[test1@gyra work]$ clustalw2
-bash: clustalw2: command not found

You could always refer to the programme by its path:

[test1@gyra work]$ ../bin/clustalw2



 **************************************************************
 ******** CLUSTAL 2.0.12 Multiple Sequence Alignments  ********
 **************************************************************

 <etc>

But it is better to add it to your PATH permanently.

+++++++++++ Hidden files +++++++++++++++++++++++++++++++++++++++++++

Some files are hidden - they start with a period . and do not show up with a normal ls command:

[test1@gyra ~]$ ls
bin  src  work

But they do with the -a (all) flag:

[test1@gyra ~]$ ls -al
total 72
drwxr-x--- 8 test1 biouser 4096 Apr 18 13:55 .
drwxr-xr-x 5 root  root       0 Apr 18 13:38 ..
-rw------- 1 test1 biouser 4716 Apr 18 13:55 .bash_history
-rw-r--r-- 1 test1 biouser   33 Apr 17 14:36 .bash_logout
-rw-r--r-- 1 test1 biouser  142 Apr 18 13:55 .bash_profile
-rw-r--r-- 1 test1 biouser  210 Apr 17 14:36 .bashrc
drwxr-xr-x 2 test1 biouser 4096 Apr 18 13:37 bin
-rw-r--r-- 1 test1 biouser  515 Apr 17 14:36 .emacs
-rw------- 1 test1 biouser   35 Apr 17 16:21 .lesshst
drwxr-xr-x 4 test1 biouser 4096 Apr 17 14:36 .mozilla
-rw-r--r-- 1 test1 biouser  157 Apr 17 14:37 .ncbirc
drwxr-xr-x 3 test1 biouser 4096 Apr 18 13:19 src
drwx------ 2 test1 biouser 4096 Apr 17 15:53 .ssh
drwxr-xr-x 3 test1 biouser 4096 Apr 17 14:37 .t_coffee
-rw------- 1 test1 biouser 5297 Apr 18 13:55 .viminfo
drwxr-xr-x 2 test1 biouser 4096 Apr 18 13:48 work
-rw------- 1 test1 biouser  232 Apr 18 13:55 .Xauthority
[test1@gyra ~]$

Most of these files are written by the system and are configuration files.

The file we are interested in is called .bash_profile

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
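
If the . and .. entries get in the way, ls -A lists hidden files without them. A quick sketch in a scratch directory, with two made-up files:

```shell
# One hidden file and one ordinary file in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
touch .bash_profile notes.txt

# Plain ls skips names that start with a period.
plain=$(ls | wc -l)

# -A includes them, but leaves out . and ..
with_hidden=$(ls -A | wc -l)

echo "ls: $plain   ls -A: $with_hidden"
```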

The commands in .bash_profile are executed when you log in to your account:

[test1@gyra ~]$ cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

[test1@gyra ~]$

We are going to edit this file and adjust your $PATH environment variable to include the ~/bin directory:

[test1@gyra ~]$ nano .bash_profile

Alter the file to read:

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
PATH=~/bin:$PATH

Press ^X (CTRL-X), then y, and then <RETURN>. Confirm that you edited the file correctly:

[test1@gyra ~]$ cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

PATH=~/bin:$PATH
[test1@gyra ~]$

Issue the command:

[test1@gyra ~]$ source ~/.bash_profile

This executes the commands in ~/.bash_profile just as if you had newly logged in.

You have now added ~/bin to your PATH. Note that it has been added to the front of your path, so the shell will always look here first for a programme and, if found, use it rather than another programme of the same name elsewhere on your path.

If you want to see your full $PATH:

[test1@gyra ~]$ echo $PATH
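
To see the search order at work, here is a small sketch that does not touch your real ~/bin. It puts a made-up programme called hello-demo in a scratch directory, prepends that directory to PATH, and asks the shell which file it would run.

```shell
# A throwaway bin directory standing in for ~/bin.
scratch=$(mktemp -d)
mkdir "$scratch/bin"

# A hypothetical programme called 'hello-demo'.
printf '#!/bin/sh\necho "hello from the scratch bin"\n' > "$scratch/bin/hello-demo"
chmod +x "$scratch/bin/hello-demo"

# Prepend the directory, just as PATH=~/bin:$PATH does for ~/bin.
PATH="$scratch/bin:$PATH"

# 'command -v' reports which file the shell would run for a given name.
found=$(command -v hello-demo)
echo "$found"

# Print the search path one directory per line, in the order it is searched.
echo "$PATH" | tr ':' '\n'
```

After sourcing your edited .bash_profile, command -v clustalw2 should report the copy in your ~/bin in the same way.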

Change directory to ~/work and download this data file - it’s a FASTA formatted file of unaligned atpB proteins from some cyanobacteria and plant chloroplasts. Take a look at the file contents using nano.

Options available with clustalw2:

[test1@gyra work]$ clustalw2 -help
<STUFF>

Here we align the data, outputting a NEXUS formatted alignment, writing additional statistics to stats.text and directing the standard out to a file called log (try it without > log if you like):

[test1@gyra work]$ clustalw2 -INFILE=atpf_data.fasta -OUTPUT=NEXUS -STATS=stats.text > log
[test1@gyra work]$ ls
atpf_data.dnd  atpf_data.fasta  atpf_data.nxs  log  stats.text

Inspect the output files using cat, head, nano or whatever you feel is appropriate.
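
Note that > log captures standard out only; anything a programme writes to standard error still appears on your screen. The sketch below uses a made-up shell function, noisy, as a stand-in for a programme that writes to both streams, and shows two common ways of dealing with the second one.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A stand-in for a programme that writes to both streams.
noisy() {
    echo "progress message"        # goes to standard out
    echo "warning message" >&2     # goes to standard error
}

# '> log' captures standard out; '2> errors.txt' captures standard error separately.
noisy > log 2> errors.txt

# '2>&1' sends standard error to the same place as standard out.
noisy > both.log 2>&1

wc -l log errors.txt both.log
```

Whether clustalw2 writes anything to standard error for this data set is not something this exercise shows, so treat the second form as a general-purpose habit rather than a requirement here.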

Exit your interactive session:

[test1@compute-0-3 ~]$ exit
logout

Connection to compute-0-3.local closed.
[test1@gyra ~]$

3. Submitting the analyses to the cluster queue

Previously, you installed and executed the ClustalW software while running an interactive session on the cluster - i.e. you were logged in to one of the compute nodes where you manually entered commands.

Here you are going to submit the ClustalW analysis directly to the cluster queuing system - a batch submission. (Please read)

Navigate to your ~/work directory, delete the contents except atpf_data.fasta, and download this submission file.

Edit the submission file in nano and replace the XXXXXXXX in the line that starts with #$ -M with your email address:

e.g.:

#$ -M joe.bloggs@googlemail.com
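
The submission file itself is not reproduced on this page. As a rough sketch only, a Grid Engine submission script for this job might look something like the one written out below. Every line of it is an assumption for illustration, not the contents of the course's submit_to_cluster.sh, and the file name submit_sketch.sh is made up. Lines beginning with #$ are ordinary comments to the shell but are read by qsub as options.

```shell
# Write the hypothetical script into a throwaway directory so you can inspect it.
scratch=$(mktemp -d)
cd "$scratch"

cat > submit_sketch.sh <<'EOF'
#!/bin/bash
# Job name, as shown by qstat.
#$ -N align-atpB
# Run the job from the directory it was submitted from.
#$ -cwd
# Shell used to interpret this script.
#$ -S /bin/bash
# Address to send mail to, and when: 'e' means at the end of the job.
#$ -M joe.bloggs@googlemail.com
#$ -m e

clustalw2 -INFILE=atpf_data.fasta -OUTPUT=NEXUS -STATS=stats.text > log
EOF

# Show just the lines that qsub would treat as options.
grep -n '^#\$' submit_sketch.sh
```

Compare this with the file you downloaded: the options it actually uses are the ones that matter for the exercise.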

Save file and submit to the queue:

[test1@gyra work]$ qsub submit_to_cluster.sh
Your job 5258 ("align-atpB") has been submitted
[test1@gyra work]$ qstat
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BP    0/3/8          3.00     lx26-amd64
    hl:mem_free=5.308G
   5214 0.55500 l34iCV26   cymon        r     04/11/2011 14:17:20     1
   5215 0.55500 l34iCV24   cymon        r     04/11/2011 14:28:35     1
   5216 0.55500 l34iCV22   cymon        r     04/11/2011 14:29:20     1
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BP    0/1/8          1.00     lx26-amd64
    hl:mem_free=6.453G
   5218 0.55500 lpe34free  cymon        r     04/11/2011 14:40:20     1
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BP    0/4/32         4.00     lx26-amd64
    hl:mem_free=60.590G
   3934 0.55500 q-run1     cymon        r     03/13/2011 16:39:22     1
   3935 0.55500 q-run2     cymon        r     03/13/2011 16:55:37     1
   3936 0.55500 q-run3     cymon        r     03/13/2011 16:55:52     1
   3937 0.55500 q-run4     cymon        r     03/13/2011 16:55:52     1
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/1/16         0.12     lx26-amd64
    hl:mem_free=27.532G
   5256 0.55500 QRLOGIN    tiagom       r     04/18/2011 12:37:22     1
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/1/32         1.00     lx26-amd64
    hl:mem_free=61.376G
   5176 0.55500 l34iCV28   cymon        r     04/05/2011 16:57:37     1
---------------------------------------------------------------------------------
all.q@gyra.local               BP    0/1/16         0.03     lx26-amd64
    hl:mem_free=29.692G
   5258 0.55500 align-atpB test1        r     04/18/2011 15:16:53     1

In this case the queuing system has started the job (# 5258) on gyra.local, the head node (last line above).

When it is finished the queue will send you an email, and you should have the same result files as before:

[test1@gyra work]$ ls -l
total 68
-rw-r--r-- 1 test1 biouser  1294 Apr 18 15:17 atpf_data.dnd
-rw-r--r-- 1 test1 biouser  8008 Apr 18 13:45 atpf_data.fasta
-rw-r--r-- 1 test1 biouser 11707 Apr 18 15:17 atpf_data.nxs
-rw-r--r-- 1 test1 biouser 35736 Apr 18 15:17 log
-rw-r--r-- 1 test1 biouser   486 Apr 18 15:17 stats.text
-rw-r--r-- 1 test1 biouser   553 Apr 18 15:08 submit_to_cluster.sh
[test1@gyra work]$

Inspect files:

[test1@gyra work]$ head -10 log
I'm going to sleep for 30 seconds...
and I'm back.



 CLUSTAL 2.0.12 Multiple Sequence Alignments


Sequence format is Pearson
Sequence 1: Chlorobium   176 aa
[test1@gyra work]$

Log out of gyra:

[test1@gyra work]$ exit

... there is no more.