So, You’ve Joined SIPBS-CompBiol as a Project Student…

7 minute read

Firstly, welcome to the group!

The SIPBS-CompBiol group is a purely computational biology and bioinformatics research group, in the Strathclyde Institute for Pharmacy and Biomedical Sciences. If you’re reading this, you’ve probably asked for (or been assigned) an undergraduate, Master’s, or internship project with us. We’re a friendly group and excited about computational biology so, even if this is your first time working in this area, we want you to do well and are all happy to help and answer questions.

Please start by reading the current group member bios, and our philosophy, responsibilities, and code of conduct.

What do you need to do?

Exactly what you need to do will depend on your project, and we’ll talk about that in project meetings. But some things are important for every project:

  • Ask questions: you are here to learn
    • Every day is a school day. We all started off knowing nothing, and are learning all the time. We don’t expect you to know everything so, if anything is confusing, unclear, or unfamiliar please talk to us.
    • Even if you do know the material well, we can help with advice on best practice and continual improvement - especially for coding, data management and analysis.
  • Be kind, collegiate, and helpful
    • Our projects are our own individual responsibilities, but that doesn’t mean we work in isolation. Your peers and the rest of the group are an excellent source of knowledge, support, and encouragement. We all benefit from each other’s insight and help.
  • Meet your course and examination requirements
    • please also submit your draft and final reports in time that they can be read and you can get good feedback
    • not sure how to get good feedback? we have some advice about that.
  • Attend and present at project meetings, group meetings, and journal clubs
  • Sign your learning agreement.

What skills can you develop?

Computational biology and bioinformatics are practical skills, like lab skills. Just like lab skills, we all start out a wee bit clumsy but get much better with practice.

Unfortunately, our degree curriculum doesn’t yet prepare you with computational skills in the same way we train you in laboratory techniques. How far you get in your project will depend on the skills you start with, and how much you are willing to learn independently. The good news is that there are many ways to practice these skills and learn new skills, yourself. We hope that the resources below will be helpful

Remember, you can ask anyone in the group - and your peers - for advice on these topics. You are not alone. We were all complete novices, once.

Your project will be short: six weeks practical work for an undergraduate project; eight weeks for a summer internship or Master’s project. You cannot learn everything below in that time. The links are provided to help you learn independently now, but also as a resource you can return to if you want to develop your skills further in future.

Programming

Computers are powerful but dumb. They do simple things very fast. To the extent computers appear intelligent, that intelligence is because some person told the computer to do a lot of simple things quickly in a specific order.

When you use a program you are working within a set of constraints someone else devised. But when you write a program, you can build and do something new.

A lot of bioinformatics can be done without programming, and your project may well use a single analysis package, or Galaxy to build a workflow. But a little programming ability can get you a very long way, even if only automating things like file-renaming, or running the same tool on a number of different inputs.

Shell scripting

Shell scripting is programming, and usually means a very specific kind of programming using a language like bash. This is a powerful language, and often used to accomplish tiresome manual tasks quickly, e.g.

# move all results from September into their own directory
mv my_project/results/2022-09* my_project/results/September

Python

Python is a high-level programming language, widely used in bioinformatics and computational biology. Python is the lingua franca of the group and we encourage students to code in Python to enable remixing and modular reuse of their work.

If you are just starting out with Python, we recommend using the cross-platform Anaconda distribution. This provides a powerful package manager that will make your life much easier. The Bioconda packages contain nearly all the bioinformatics tools you could need and work within Anaconda.

R

R is a programming language designed specifically for statistics and data analysis. It is very widely-used in bioinformatics and computational biology, with many specialised packages providing analytical and visualisation tools that can’t be found elsewhere. R is also extremely good for producing high-quality and attractive visualisations.

Editing code

This is a subjective topic. There is no one right way to work, and you should generally work the way that is productive and suits you best. That said, my (Leighton’s) experience is that Visual Studio Code (aka VS Code) is a great choice to start working in. It works on several operating systems, can handle Python, R, and shell code, give you a terminal window, and integrate with tools like git (see below). If you haven’t already decided on a programming workflow, there are many worse places to start.

Project and Data Management

It doesn’t matter how many programming skills you have if you can’t find your data, or where your code is. It takes a lot of practice, and trial-and-error (mostly error) to become very fluent in data management, but we can help shortcut this process with some recommendations.

Version control: git and GitHub

Version control is an important part of project and code management. The most popular tool - and the one used in the group - for managing datasets, code, and documentation (including writing reports and papers), is git - and like many bioinformaticians we keep our git projects on GitHub. This is a bit of an advanced topic, as git/GitHub can do far more than help you manage your project (this website is hosted on GitHub, for instance), but following the basic principles of version control with git will make your work on your project so much easier to manage.

Computing Resources

We do not have a laboratory, and we do not provide computing hardware for your project. However, computing and IT facilities for students are available at the university.

Most projects can be completed using your own laptop, either by accessing online resources, or installing and using bioinformatics tools on your own machine.

Operating Systems

Much of bioinformatics runs on - and assumes you can use - Linux and Linux-like operating systems, and will require that you use the terminal. If your computer already runs Linux or MacOS, then you can get started straight away.

If you are using Windows, you should install Windows Subsystem Linux, if you have not done so already.

Other Computing Resources

In some cases, you may need to use remote computing resources to complete your project. These will either be services such as Galaxy, which can be used via your browser, or will likely run some form of Linux.

We also make a small server available for undergraduate projects on the MRC CLIMB service. This is intended to make it easier to run lengthy jobs that might disrupt work on your own computer. To obtain an account for this server, please contact Leighton.

  • MRC CLIMB
  • CLIMB server specification
    • OS: Ubuntu
    • CPU cores: 4
    • RAM: 32GB
    • Storage: 200GB onboard, 1TB mounted

The university may provide access to the ARCHIE-West cluster for student projects. There are some restrictions and recommendations governing its use. Please see Leighton for further details.

tmux

When using remote computing resources, logging out of the machine or breaking connection can halt your job. To avoid this (and also gain other useful abilities), the tmux tool can be used to work in a persistent, long-running shell that is not stopped when you log out. It is highly recommended to use tmux or similar tools for remote work.