Getting started in bioinformatics can feel overwhelming. Several bioinformaticians at 10x Genomics started at the bench and have added bioinformatics to our scientific toolkits as we progressed through our training and careers. Just as lab tools like western blotting and PCR can be used to answer scientific questions, bioinformatics software tools also help us to answer scientific questions. The level of expertise needed in the bioinformatics tools themselves is up to the user, depending on what they need out of them. We empathize greatly with everyone who is picking up this new skill. Below you will find some tips and tricks that have helped us on this journey.
When starting at the command line it is common to be afraid of failed runs. This is understandable since a failed run at the bench means a loss of precious samples and reagents. In a computational environment there is less need to worry: a failed run costs you some time and some compute resources, but your data remain usable over and over again. Our general advice is to try things and experiment with commands and scripts. It is OK to get errors! As long as you keep a copy of your original data you can always fix typos, re-run, or start over. This is one of the great things about bioinformatics. Additionally, there is usually something to be learned from an error.
However, there are important exceptions to this advice, so please be careful with the rm command. The commands rm <file>, to delete a file, and rm -r, to delete a directory, cannot be undone. If you will be working with raw data or in a shared space where others also have data stored, make sure to have a backup copy available in another location in case something happens to the disk where you are working. With other people's files, you usually do not have permission to delete or overwrite them, and vice versa.
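One way to soften the risk of an irreversible delete is to move things into a holding directory first and only remove them for good once you are sure. The following is a minimal Python sketch of that idea, not a replacement for real backups. The function name move_to_holding and the holding directory layout are our own illustrative choices.

```python
import shutil
import time
from pathlib import Path


def move_to_holding(target, holding_dir):
    """Move a file or directory into a holding directory instead of deleting it.

    Returns the new location so the move can be reversed by hand if needed.
    (Hypothetical helper for illustration only.)
    """
    target = Path(target)
    holding_dir = Path(holding_dir)
    if not target.exists():
        raise FileNotFoundError(f"Nothing to move: {target}")
    holding_dir.mkdir(parents=True, exist_ok=True)

    # Prefix with a timestamp, and add a counter if that name is already taken,
    # so an earlier item with the same name is never overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = holding_dir / f"{stamp}_{target.name}"
    counter = 1
    while destination.exists():
        destination = holding_dir / f"{stamp}_{counter}_{target.name}"
        counter += 1

    shutil.move(str(target), str(destination))
    return destination
```

For example, move_to_holding("old_results", "holding") would move the directory old_results into holding under a timestamped name. The holding directory still takes up disk space, so it needs to be emptied deliberately later on.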
Otherwise, you should not be afraid to try commands. If you are not sure where to start with a software tool, running the command with --help is a good place to start.
Just like you would document your experiments and observations in a written lab notebook, computational analyses should also be documented in an electronic notebook, including the commands tried and the errors encountered. For an example, see the electronic lab notebook by Carl Boettiger. We recommend against keeping bioinformatics notes in word processing programs because they can introduce hidden or substituted characters (for example, curly quotes in place of straight quotes) that end up being copied to the command line and cause errors.
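As a rough illustration of keeping such a record, the sketch below runs a command and appends the command, its exit code, and any error text to a plain-text file. It also refuses to run a command that contains non-ASCII characters, which is one simple way to catch curly quotes or long dashes pasted from a word processor. The names find_non_ascii, run_and_log, and notebook.txt are hypothetical, and the ASCII-only rule is an assumption that will be too strict if your file names legitimately contain other characters.

```python
import datetime
import shlex
import subprocess


def find_non_ascii(text):
    """Return (position, character) pairs for every non-ASCII character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


def run_and_log(command, notebook_path="notebook.txt"):
    """Run a command and append what happened to a plain-text notebook.

    Returns the exit code, or None if the program could not be found.
    (Hypothetical helper for illustration only.)
    """
    suspicious = find_non_ascii(command)
    if suspicious:
        # Stop before running anything; pasted "smart" punctuation is a common culprit.
        raise ValueError(f"Non-ASCII characters in command: {suspicious}")

    try:
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        exit_code, error_text = result.returncode, result.stderr
    except FileNotFoundError as error:
        # A mistyped program name is worth recording too.
        exit_code, error_text = None, str(error)

    timestamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(notebook_path, "a", encoding="utf-8") as notebook:
        notebook.write(f"[{timestamp}] $ {command}\n")
        notebook.write(f"exit code: {exit_code}\n")
        if error_text.strip():
            notebook.write(f"stderr:\n{error_text.rstrip()}\n")
        notebook.write("\n")
    return exit_code
```

Because the notebook is opened in append mode, repeated calls build up a chronological record that can be read in any plain text editor.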
Sanity check your data. Set up experiments just like you would in the lab to evaluate your results. Getting code or a command running is just the first step. Getting it running does not mean it is running correctly. Always carefully evaluate the results you get and ask questions of your data. Did what you expect to happen actually happen? Do the results make sense biologically? Sometimes it might make sense to try changing the computational parameters to see how this impacts the results and then assess those results to determine if they make biological sense.
Chances are, if you run into an error, someone else has also seen it and written about it somewhere. Searching the internet for the error message may lead you to message boards or other resources with suggestions for overcoming the error.
Places like StackOverflow and GitHub’s issue tracking on specific open source software packages are great for finding helpful answers to problems if you cannot find relevant information elsewhere. Don’t be afraid to post on public forums. This is also a great way to interact with the authors of open source software tools and could potentially lead to future collaborations.
Avoid the rich text formatting applied by most commonly used word processing programs. Word processing programs are good for creating documents but not for writing code or keeping track of commands. Be practical when choosing text editors and programming languages to learn. Identify the people who you think are going to help you the most and find out what they use. Some people use emacs as a text editor not because it is the best, but because that is what the person at the next desk over in the lab was using. Some people use more than one text editor: one for the desktop with a friendly user interface like BBEdit, TextWrangler, VS Code, Sublime Text, or Atom, and another for command line text editing, such as
Python and R are both open-source programming languages. Python is more of a general purpose language whereas R was developed specifically for statistical computing. Many bioinformaticians in single cell data analysis fields use R and/or Python because many popular third-party tools are written in these languages. Getting started in the R environment does not necessarily require an in-depth understanding of the R language. You can get started by following along with the vignettes for a specific tool and copying and pasting the commands. For example, see the Orchestrating Single Cell Analysis with Bioconductor vignette by Amezquita, Lun, Hicks, and Gottardo or the Getting Started with Seurat vignette by the Rahul Satija lab.
You can save a lot of time and frustration by taking advantage of the people around you. Having someone you can lean over to and ask questions is invaluable. This may be someone in your lab or in your department who has the skills you are working on mastering. This can also help prevent developing bad habits that are hard to break, such as using a word processing tool. You can pay it forward later.
At your home institution or in the area nearby, there may be a community of people working with similar data. There also might be an online community that is helpful to you. Reach out to this community and ask for help. In exchange, you may be able to provide advice on other matters.
Troubleshooting issues with large datasets can be time-consuming. Testing pipeline commands on a subset of the data before running them on your full dataset will almost always save time. Some tools have a built-in tiny test set you can use. For example, cellranger testrun --id=tiny will run cellranger count with a very small test (mostly to check the installation). If the tool doesn't come with a small test set, you can make your own.
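As one example of making your own small test set, the sketch below copies the first few thousand reads from a gzipped FASTQ file into a new file. It assumes the standard four-line FASTQ record layout, and the function name take_first_reads and the default of 10,000 reads are our own choices. Note that the first reads in a file are not a random sample, so this is meant for checking that a command runs, not for drawing biological conclusions.

```python
import gzip
from itertools import islice


def take_first_reads(fastq_in, fastq_out, n_reads=10000):
    """Copy the first n_reads records from a gzipped FASTQ into a new gzipped file.

    Assumes four lines per record. Returns the number of records written.
    (Hypothetical helper for illustration only.)
    """
    lines_written = 0
    with gzip.open(fastq_in, "rt") as source, gzip.open(fastq_out, "wt") as subset:
        # islice stops after 4 * n_reads lines, or earlier if the file is shorter.
        for line in islice(source, 4 * n_reads):
            subset.write(line)
            lines_written += 1
    return lines_written // 4
```

For paired-end data, running the function with the same n_reads on each file of a pair keeps the reads matched, provided the files list reads in the same order.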
Creating a copy of your data is essential to preventing accidental loss. Backups can be kept in a separate directory, on an external drive, or in long-term storage, for example AWS Glacier, NCBI-SRA, EMBL-ENA, Dryad, figshare, Open Science Framework, or Zenodo.
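For the simplest case, a copy in a separate directory or on an external drive, it can be worth confirming that the copy is byte-for-byte identical before relying on it. The sketch below is one way to do that with a SHA-256 checksum. The names sha256_of and backup_file are hypothetical, and the choice to refuse to overwrite an existing backup is our own assumption.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy source into backup_dir and verify the copy against the original.

    (Hypothetical helper for illustration only.)
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    destination = backup_dir / source.name
    if destination.exists():
        # Never silently replace an existing backup.
        raise FileExistsError(f"Backup already exists: {destination}")
    shutil.copy2(source, destination)  # copy2 also preserves timestamps where possible
    if sha256_of(source) != sha256_of(destination):
        raise OSError(f"Checksum mismatch after copying {source}")
    return destination
```

Keeping the checksum alongside the backup also makes it possible to check later that neither copy has changed.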
Many journals now require authors to ensure that their analyses are reproducible, which means that pipelines, scripts and datasets are frequently published as supplemental material along with the main manuscript, or archived in a separate database. Running, editing, and troubleshooting someone else’s code and data is a great way to test the bioinformatics skills you have learned so far. Assess if the published results are truly reproducible, and if not, determine why. Likewise, when you publish your own results, make sure that you have provided enough detail so that other bioinformaticians can reproduce them exactly. This can help increase the citation index and broader impact of your work.
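Part of providing enough detail is recording which software versions produced a result. As a minimal sketch for analyses run in Python, the code below writes the Python version, the platform, and the installed versions of a list of packages to a JSON file that can be archived with the scripts. The function names and the output file name are our own, and R users would typically reach for sessionInfo() instead.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect the Python version, platform, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # recorded as null in the JSON: not installed
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def write_environment(path, packages):
    """Save the environment description as JSON next to the analysis outputs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(describe_environment(packages), handle, indent=2, sort_keys=True)
```

A call such as write_environment("environment.json", ["numpy", "scanpy"]) would record those two packages. The package names here are only examples and should be replaced with whatever your analysis actually imports.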
These are a few things that have helped us along our bioinformatics journey. We are not the only ones with advice to offer in this space. Below you can find links to other resources to help you get started. Best of luck in your continued journey as a scientist.