Depending on the type of project and the analysis you are doing, you may end up generating or otherwise dealing with a number of different data formats.
Understanding what type of data you generate, and how it can be formatted, is a key part of experimental design. Before you perform an experiment, you should already have a good idea of what kind of data it will generate, how you will record it, and how it will be formatted and stored. This is all part of a good data management strategy1. The type of data generated depends entirely on the experiments being performed, and those data can be formatted in a variety of ways.
Suppose that, in your project, you analyse the size of DNA fragments using gel electrophoresis, and take an image of the resulting gel using a gel doc. This produces an image, which could be saved as an image file, and/or printed out and pasted in your lab notebook, before being interpreted.
The raw data generated by this experiment is an image. This will likely be stored in an image file, such as a .tif
, .jpg
, .bmp
, or .png
file. The analysis might also generate a list of DNA fragment sizes, obtained by comparing the sizes of your DNA fragments to standards of known size (the DNA ladder). This list of numbers is also data, and it might be stored in one of a number of different formats. On the computer this might be in plain text as a .txt
, or .csv
file, or in a proprietary form as a .xlsx
or .docx
file. You might even keep the data as a handwritten list in your lab notebook. Your choice might be based on what is customary and convenient for the project, what is most compatible with the downstream analyses that will be performed on these data, and/or what is most compatible with FAIR principles.
GEO
2, you might download a set of reads in the .fastq.gz
format - an open standard for storing sequence data (.fastq
), compressed to save space (.gz
). You might instead download normalised transcript level data as a table in .txt.gz
format (plain text tabular format, compressed to save space). When you analyse the data locally, you might convert the data to a comma-separated values .csv
format set of gene names and transcript levels, save plots you produce as .png
(open, non-proprietary) files for sharing, or .pdf
(open, proprietary) files for publication, and the scripts you generate as .R
or .py
files containing code as plain-text.
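As a rough illustration of the conversion described above, the following Python sketch reads a gzip-compressed, tab-separated text table and saves it as an uncompressed .csv file. The file names, column names, and values are made up for the example, and real downloaded tables may need extra handling (comment lines, different delimiters), so treat this as a starting point rather than a recipe.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_gzipped_table(path, delimiter="\t"):
    """Read a gzip-compressed, delimited text table into a list of dicts."""
    # Mode "rt" decompresses on the fly and gives us text, not bytes
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def write_csv(rows, path):
    """Write a list of dicts to an uncompressed comma-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Build a tiny, invented .txt.gz table so the sketch runs without a download
workdir = Path(tempfile.mkdtemp())
gz_path = workdir / "example_counts.txt.gz"
with gzip.open(gz_path, "wt", newline="", encoding="utf-8") as handle:
    handle.write("gene\tlevel\ngeneA\t12.5\ngeneB\t0.0\n")

rows = read_gzipped_table(gz_path)
write_csv(rows, workdir / "example_counts.csv")
print(rows)
```

Note that the compressed file is left untouched: the conversion writes a new file rather than replacing the download.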
Regardless of any downstream analyses or data formatting/reformatting decisions, you must always save an unmodified original copy of the raw data - this is a fundamental principle of good scientific practice.
You must always be able to return to the original data in the form it was originally collected/recorded.
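One possible way to put this into practice is to keep a read-only archive copy of each raw file, together with a checksum that lets you confirm later that the copy has not changed. The sketch below uses only the Python standard library; the function names, folder layout, and file name are hypothetical, and your group or institution may already have its own procedure.

```python
import hashlib
import os
import shutil
import stat
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_raw_copy(source, archive_dir):
    """Copy a raw file into archive_dir, make the copy read-only,
    and return (path to the copy, checksum of the copy)."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / Path(source).name
    shutil.copy2(source, target)      # copy2 also preserves timestamps
    os.chmod(target, stat.S_IREAD)    # discourage accidental edits
    return target, sha256_of(target)


# Demonstration with an invented file standing in for real raw data
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "gel_image.tif"
raw_file.write_bytes(b"pretend these bytes are an image")

archived, checksum = archive_raw_copy(raw_file, workdir / "raw")
unchanged = sha256_of(archived) == checksum
print(archived.name, checksum, unchanged)
```

Recording the checksum in your lab notebook or a text file alongside the data means that anyone can later recompute it and check that the archived copy is still the original.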
Data are often associated with metadata - data that provide information about the data.
Metadata may be in a different format to the data itself (e.g. a text file describing the sample characteristics and other metadata, which might be paired with the sequencing reads obtained from that sample; or the date and GPS coordinates at which a photograph was taken, and the identity of the photographer). Without metadata, data may lack informative context and may even be uninterpretable or unusable. It is therefore important that your plan for data management includes a plan for how you will accurately record, format, and store the metadata for your experiments.
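A simple way to keep metadata with a data file is a plain-text "sidecar" file stored next to it. The sketch below writes metadata as JSON; the field names (sample_id, collected_on, operator), the ".meta.json" naming scheme, and the values are all invented for illustration, and your field or repository may prescribe a specific metadata standard instead.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(data_path):
    """Return the path of the metadata file that sits next to a data file."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".meta.json")


def write_metadata(data_path, metadata):
    """Save metadata as human-readable JSON next to the data file."""
    path = sidecar_path(data_path)
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path


def read_metadata(data_path):
    """Load the metadata previously saved for a data file."""
    return json.loads(sidecar_path(data_path).read_text(encoding="utf-8"))


# Demonstration with invented values
workdir = Path(tempfile.mkdtemp())
reads_file = workdir / "sample01.fastq.gz"
reads_file.write_bytes(b"")  # empty placeholder for the real data file

example_metadata = {
    "sample_id": "sample01",
    "collected_on": "2024-01-31",
    "operator": "A. Student",
}
meta_file = write_metadata(reads_file, example_metadata)
print(meta_file.name)
print(read_metadata(reads_file))
```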
In this workshop we will cover some of the data formats you are likely to encounter when doing your honours project. You may also meet other, specialised data formats that are outwith the scope of this workshop, but the general rules for good data management still apply.
Proprietary data formats are defined and/or controlled by an individual or organisation, often to support their own software. Proprietary formats may even only be readable or writable by that provider’s software. Examples of proprietary formats include: Applied BioSystems’ .ab1
genetic analysis data file format3; Adobe’s .psd
files; Nikon’s .nef
files; and the .mp3
audio format. The key feature of a proprietary format is that it is - or was - not intended to be publicly known, or to be used without a licence.
Proprietary formats may be closed proprietary, in the sense that their specification and definition are a “trade secret” (like Adobe’s .psd
files), or open proprietary, where the specification is published but maintained by a private organisation, like the .mp3
file format.
Open data formats are defined by published, and public, specifications, often under control of a public community or standards organisation. Open formats include HTML
, .png
image files, plain text formats, and the .odt
OpenDocument format (an alternative to Microsoft’s .docx
files).
Open formats are independent of any particular software tool or operating system, and are machine-readable, but may or may not be human-readable.
We saved a simple project README
file - intended to introduce the repository containing these course materials - in four formats, to demonstrate some of the practical differences between proprietary and open formats. In each case, we’re looking at the first few lines of the output file, to compare readability and data content.
Markdown (open)
The first file is written in Markdown, an open format with several competing standards. It is plain text, human-readable, and can describe both metadata and data content. It is intended primarily to represent the content of a document, with some limited guidance about its presentation.
$ head README.md
---
output:
  word_document: default
  html_document: default
  pdf_document: default
---
# BM432
Welcome to the BM432 computational biology repository!
HTML (open)
HTML (HyperText Markup Language) is in some ways the “base language” of the web. It is plain text, human-readable, and can describe both metadata and data content. It contains elements intended to be machine-readable, and to guide browser presentation.
$ head README.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" />
PDF (proprietary/open)
The PDF format has multiple versions, some of which are open and some of which are proprietary. It can be considered a hybrid open/proprietary format. Its main disadvantages are that it was designed to be read by specific software, not humans, and that it is often stored in a compressed form, making it more vulnerable to errors in file-copying that can render the file unusable.
$ head README.pdf
%PDF-1.5
%????
15 0 obj
<<
/Length 1915
/Filter /FlateDecode
>>
stream
xڽX]w?6}???I?(~???i7M?ԻY?m?>P$,aMZ?????0E2??8??3??{/@?m@??/??????XL?8???}?? c??(
??????u???w???IP?"?R??9#9ǡ???h%?J??:??h?+?ܹ?
??+u????-???Q?#6?8?C??T'{??3&??B??"?pѻ???~???j??ח?u??e/?,???6v??}?{?y\??C?S7Ft??&R??ƨ1?߽??$@??$J????]|3
*|?E
???ﴲ?GT???-?b/ЧK?m?
?߰??B??_??t?(&i?t??v??eX?9v?{~??o???Ґ??
?0lYኧ???Ԡ+?ҫK?T???P??3P?????o[~?]?Y??D??k?????AvC??ή??O?We#??????@???????_hl?]???
Word .docx
(proprietary/open)
The Microsoft Word .docx
format is also a mixture of proprietary and open formats. The specification has been made public, but the format again is not human-readable, prioritises commercial software and is largely under the control of Microsoft. It is also stored in a compressed form, vulnerable to copy errors.
$ head README.docx
u*Se
???I?8???2??}.^?6??L{!?^R?Ft???-Bb???k?t?+??+?w??=?? ???`??3?,0z???F??,zj?*?? ?
%R??U~ȍ??؋?%??{?Z?p{?x?(?? ?
?oB?Z??-,??b I??#?6ϕ???R$??B >]$?????R/?S??}???.?`a???@??h??}?4??9),IJ??_???tA???.pR??T??A?t?ފK????a?*???e@?/?rCs?z3?/`?I?y??[??;h+Z(?oDZ]uŁ???h,?=Z????{?ou*S?=?^??
_rels/.rels???N1
E?|E?}?S
?f?A??C?|@?xj??B?{B@Q??2?????b??81h?V5(
?O4L?rx妠֯?NaǶ,?G???Ȉ_?B6ܑhx???}?????u?ΩC{???M?~???S?y(?&?Q???Jo=?qF
??4.??5?>??.K??}d????8?7u*Sa?)?J&word/document.xml?Z[W?8~?_??3&΅[???圲?Sh?????Zd?G????wF?)?&?dӗ8??????7??ۯ?$3n??jt???p?t"?t|??
???????͊?+G`????(H?ˇ??e)Ϩ??9W?l?MF?4??\?$7?qkA^&;?(:?dT????#FO&??w???6B?K?.????T䶑?GAaf?m?ąLg?JJ}if̞?1?d3nލ??53?:"+K
Much of the data we work with can be represented in tabular format (see the What is a Dataset? notebook). This kind of data can be handled intuitively in a spreadsheet application, like Microsoft Excel, Apple’s Numbers application, or Google Sheets. However, spreadsheets enable some bad data practices that are best avoided (see below).
In general, so long as care is taken it is reasonable to collect, explore and examine data using spreadsheets, but the data itself is best stored as platform-independent plain text formats, such as CSV
(Comma-Separated Values) or TSV
(Tab-Separated Values) files, rather than the proprietary file formats of a spreadsheet program. This maximises shareability and reuse, and has an element of future-proofing against changes to the file specification of commercial tools, like Excel.
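To show how little is involved in these plain text formats, here is a small Python sketch that writes the same table as CSV and as TSV and reads it back, using the standard csv module. The column names and values (a lane number and a fragment size in base pairs) are invented for the example.

```python
import csv
import io


def to_delimited(rows, fieldnames, delimiter=","):
    """Serialise a list of dicts as delimited plain text (CSV by default)."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_delimited(text, delimiter=","):
    """Parse delimited plain text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter))


# Invented example data: every value is kept as text
fragments = [
    {"lane": "1", "fragment_bp": "500"},
    {"lane": "2", "fragment_bp": "1200"},
]

csv_text = to_delimited(fragments, ["lane", "fragment_bp"])
tsv_text = to_delimited(fragments, ["lane", "fragment_bp"], delimiter="\t")
print(csv_text)
print(tsv_text)
```

The csv module reads every value back as a string, so nothing is silently converted to a number or a date; any conversion is a step you write, and can record, yourself.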
Spreadsheets are powerful and useful tools and, when handled with care, can add value to your work quickly.
However, they can enable and encourage bad practice and, sometimes, can make your results invalid without you noticing.
It is possible, and unfortunately quite common, for people to read their raw data into a spreadsheet, modify the data in-place (data cleaning), and analyse the data in exactly the same worksheet. This goes against good data management principles (see the Data Analysis notebook):
It is convenient to be able to click on a cell and change its format, or its value. It’s convenient to be able to move data around by clicking, and dragging the mouse pointer. But these are bad data practices.
By default, spreadsheet software will save your data as a proprietary format for quick reading, writing, and preservation of graphs. These formats are typically tied to the application and can vary between application releases, making them brittle against version changes, or unreadable across different operating systems. They are not reliable for long-term storage. Although these formats may preserve colour formatting, graphs, and other annotations, those features are not usually able to be extracted easily by other software tools, limiting the data’s reusability.
Many biological experiments involve collecting image data (e.g. photographs of samples; gel pictures; micrographs captured by light or electron microscopy).
Common image data formats include: .tiff/.tif
, .gif
, .jpg
, .png
, .bmp
, and so on. There are also a large number of proprietary image data formats (e.g. those associated with a particular camera or software).
There are very stringent rules governing the acceptable practices for handling image data.5
Manipulating an image (even seemingly harmless, “artistic” adjustments to scale the brightness or contrast) changes the data and can affect how these data are perceived and interpreted.
Therefore, you must always:
- Save a copy of the original, unedited image.
- Record any adjustments that were made, and how they were made (some software will do this automatically).
- Simple manipulations (e.g., cropping to remove irrelevant parts of the image, careful adjustments of brightness and contrast applied to the entire image) are usually acceptable.
- Manipulations specific to one part of the image, or that duplicate part of an image, are questionable at best - and usually completely unethical.
- Be sure to compare the original and processed image, to ensure that the manipulations do not alter the data.
See the Nature journals’ Image integrity and standards policy for an example of the rules that journals set out to ensure that image data are acquired and processed appropriately. If you do not follow such guidelines, your data will be unpublishable.
Elisabeth Bik’s Peter Wildy prize lecture6 is an interesting introduction to image manipulation and scientific integrity in the biological literature.
Image files can be very large - and some experiments (e.g., time-lapse microscopy) involve the acquisition of hundreds of thousands of images. Compression algorithms can reduce the size of these files; they generally fall into two categories, lossless and lossy. Lossy compression, as the name implies, discards some of the original data, and often produces a smaller image file than lossless compression, which preserves all of the image data.7
For example, micrograph files are commonly captured as .tiff/.tif
files. It may be tempting to compress these, for example, to .jpg
files in order to save space. However, this is an example of lossy compression and therefore the temptation should be avoided! Always retain the original .tiff/.tif
files from your experiments. (Your data management plan should take into account the storage requirements for acquiring and preserving such large amounts of data.)
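The difference between the two kinds of compression can be sketched in a few lines of Python. Here zlib stands in for a lossless scheme, and a crude rounding of pixel intensities stands in for a lossy one. This is not how JPEG works internally; it is only a toy demonstration, with invented pixel values, that lossless data can be recovered exactly while lossy data cannot.

```python
import zlib


def lossless_round_trip(pixels: bytes) -> bytes:
    """Compress and then decompress with zlib (a lossless algorithm)."""
    return zlib.decompress(zlib.compress(pixels))


def lossy_quantise(pixels: bytes, step: int = 16) -> bytes:
    """Toy stand-in for lossy compression: round every intensity
    down to a multiple of `step`, discarding the fine detail."""
    return bytes((value // step) * step for value in pixels)


# Invented 8-bit "image": one pixel for each possible intensity 0..255
pixels = bytes(range(256))

restored = lossless_round_trip(pixels)
degraded = lossy_quantise(pixels)

print("lossless copy identical:", restored == pixels)
print("lossy copy identical:   ", degraded == pixels)
print("distinct intensities before/after:", len(set(pixels)), len(set(degraded)))
```

Once the fine detail has been discarded there is no way to reconstruct it from the degraded copy, which is why the original files must be kept.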
There are a number of specialised data formats that you may encounter during the course of your project or your further studies. The following is not an exhaustive account of all possible file formats used in biological experiments, but is intended as a guide to some of the common formats that you may encounter.
FASTA (.fasta, .fa, .fas, .faa, .fna, .ffn, etc.)
The de facto standard for sequence data files is FASTA (.fasta
). FASTA files are plain text files, with few formatting rules. There is a simple header - announcing the start of a new sequence - which begins with a right angle bracket, or greater-than sign (>
) followed immediately by a sequence identifier string, some whitespace, and a free-form description string:
>SEQID_0000001 The rest of this line is a description of the sequence in some way; GN=gene0001; EXPN=False; ORG=E. coli
atcgatcgatctagctgcggagcgactacgacgactagctagcta
atcgatgctagtcgtacggctagctagctgatgctgcgactgcat
atcgatgctag
>SEQID_0000002 This is a different sequence, and the description format does not need to follow the same rules to be valid
gctagtcgagcatgcatgtcgatgctagctgatcgatgtgctacg
cgatcatcgacgac
The format rules are identical for nucleotide and protein sequences, but it is not usual to mix the two types in the same file. The format is extremely flexible, and you will likely see many variations. Some services are strict about the formatting of the accession and the description string8.
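The header and sequence rules described above are simple enough to parse by hand. The following Python sketch is a minimal, permissive reader written for illustration (the function names are our own); for real work, an established library such as Biopython is likely to be a better choice because it handles many more edge cases.

```python
def split_header(header):
    """Split a FASTA header (without the leading '>') into
    (identifier, description) at the first run of whitespace."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                records.append(split_header(header) + ("".join(chunks),))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append(split_header(header) + ("".join(chunks),))
    return records


example = """>SEQID_0000001 first example; ORG=E. coli
atcgatcgatctagctgcggagcgactacgacgactagctagcta
atcgatgctag
>SEQID_0000002 a different sequence
gctagtcgagcatgcatgtcgatgctagctgatcgatgtgctacg
cgatcatcgacgac
"""

records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, "|", description, "|", len(sequence), "bases")
```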
By convention, the contents of a FASTA file may be indicated by the file extension. It is common to see .faa
for amino acids, .fna
for genome sequences, and .ffn
for nucleotide sequences describing gene features. Other extensions, like .fasta
, .fa
, .fas
are generic and considered uninformative about the content.
It is possible to associate some file extensions with particular programs (e.g. .docx
files are likely word processor documents, readable by Microsoft Word), but it is possible to change a file’s extension without changing its format. There is no rule that says a file extension must correspond to the file format and, in particular, you cannot change a file’s format just by changing its extension.
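Because the extension is only a label, software that needs to know a file's real format often inspects the first few bytes instead; many formats begin with a fixed signature (sometimes called a "magic number"). The Python sketch below checks a handful of well-known signatures. The list is deliberately short and the function names are our own; it is an illustration, not a complete file-type detector.

```python
# (signature bytes, description) pairs for a few common formats
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (used by .docx, .xlsx, .odt)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"II*\x00", "TIFF image (little-endian)"),
    (b"MM\x00*", "TIFF image (big-endian)"),
    (b"\x1f\x8b", "gzip-compressed data"),
]


def sniff_format(first_bytes: bytes) -> str:
    """Guess a file's format from its leading bytes, ignoring its name."""
    for signature, description in SIGNATURES:
        if first_bytes.startswith(signature):
            return description
    return "unknown (possibly plain text)"


def sniff_file(path) -> str:
    """Read the start of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


# A PDF is still a PDF, whatever the file happens to be called
print(sniff_format(b"%PDF-1.5\n"))
print(sniff_format(b"PK\x03\x04" + b"\x00" * 12))
print(sniff_format(b"# BM432\n"))
```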
GenBank (.gbk, .gbff, .genbank, etc.)
GenBank files were created to store annotation and metadata alongside sequence information, in a standardised format that is both human and machine-readable. GenBank files contain much more detailed information (annotation, references, remarks, etc.) than FASTA files can hold.
Like many bioinformatics data files, information is stored in a “key:value” format. For instance, a GenBank header may read:
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
There are four key:value pairs. The key states what the information type is, and the value records the information itself. For instance, in the example above, the key JOURNAL
states that the information is a journal reference, and the value is the reference itself: Yeast 10 (11), 1503-1509 (1994)
.
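The key:value idea can be made concrete with a short Python sketch that splits header lines like those above into pairs. It assumes the fixed-column layout used in GenBank flat files, where the key occupies the first 12 columns and a line with a blank key area continues the previous value. The function name is our own, and a dedicated parser (for example, the one in Biopython) is a safer choice for real files.

```python
def parse_genbank_header(text, key_width=12):
    """Split GenBank-style header lines into (key, value) pairs.

    Assumes keys sit in the first `key_width` columns; lines whose key
    area is blank are treated as continuations of the previous value.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key = line[:key_width].strip()
        value = line[key_width:].strip()
        if key:
            pairs.append([key, value])
        elif pairs:
            pairs[-1][1] += " " + value   # continuation line
    return [(key, value) for key, value in pairs]


example = (
    "REFERENCE   1  (bases 1 to 5028)\n"
    "  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.\n"
    "  TITLE     Cloning and sequence of REV7, a gene whose function is required for\n"
    "            DNA damage-induced mutagenesis in Saccharomyces cerevisiae\n"
    "  JOURNAL   Yeast 10 (11), 1503-1509 (1994)\n"
)

header_pairs = parse_genbank_header(example)
for key, value in header_pairs:
    print(f"{key}: {value}")
```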
The actual GenBank format becomes more complicated as it describes annotations. For instance, a coding sequence and the gene that contains it may be described as follows:
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
There are two keys: CDS
(indicating a coding sequence), and gene
(indicating the corresponding gene) with values that identify the location of that feature on the genome. Each of the features may contain further key:value pairs, indicating feature-specific values; these keys begin with a forward slash /
and describe some aspect of the feature. GenBank format files can hold a rich variety of information9, but a full explanation is beyond the scope of this workshop.
The current standard format for biological macromolecule structure files is the PDBx/mmCIF format (before 2012, it was the PDB format; older structures may still be in this format). Documentation and FAQ are available.
If you are doing experiments that involve DNA sequencing, your data will be generated in different formats depending on the sequencing technology you use. Two commonly used formats are FASTQ and ABI files.
As the name suggests, FASTQ files are similar to FASTA files; however, FASTQ files include quality scores that reflect the confidence of each base call. Quality scores are encoded as a string of characters, one character per base in the read, written after the read sequence itself and a separator line beginning with +.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
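The quality string in the record above can be decoded with a few lines of Python. This sketch assumes the widely used "Phred+33" encoding, in which each character's ASCII code minus 33 gives the quality score Q, and the estimated probability that the base call is wrong is 10^(-Q/10). Older data may use a different offset, so check the documentation for your sequencing platform; the function names are our own.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores.

    Assumes Phred+33 encoding unless a different offset is given.
    """
    return [ord(character) - offset for character in quality_string]


def error_probability(score):
    """Estimated probability that a base with this Phred score is wrong."""
    return 10 ** (-score / 10)


sequence = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

scores = phred_scores(quality)

# One score per base: pair them up and show the first few
for base, score in list(zip(sequence, scores))[:5]:
    print(base, score, error_probability(score))
```

In this record the first base has a score of 0 (no confidence at all), while the bases near the end have much higher scores, which is easy to miss when looking at the raw characters.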
If you sequence any samples using Sanger sequencing, the raw sequence data is usually in the ABI file format: this contains the chromatogram showing the peaks generated during the sequencing run.
You will likely also receive a FASTA file containing the bases called during the sequencing run - however, you should always examine the raw data (the chromatogram) carefully.
In the figure below, you can see an example of heterozygosity - the sequencing read is relatively clean and most peaks are well-defined and correctly called, but in some cases there are mixed peaks (for example, peak 358 with A/G peaks). Peak-calling software is not well-equipped to deal with mixed peaks, and so it is important that you examine the sequence chromatograms manually.
There is no hard-and-fast rule for the most appropriate format(s) for your research data, though there are a number of considerations that can guide your choice (see below).
For openness, transparency, and longevity, it is best to prioritise:
Each academic field tends to reach some kind of consensus on which formats are most commonly or widely used. This may be, for example, because the data format expresses the data well, or because one or other software tool (and its associated file format) is very common.
These conventions are often codified, for example by journal or database policies specifying which file formats are acceptable for submission. If you do not pay attention to these policies, your data will not be correctly formatted for publication.
To maximise your ability to exchange data, you should choose data formats that are appropriate and common in your own field of research.
A critical question is: how long do you need your data to be readable and exchangeable? If you never need to share your data, and it exists for a very short time, the format choice is governed entirely by the ability to store all the data you need efficiently. But the longer you need to keep your data, and the greater the need for exchangeability, the more you will want to consider open, standardised and well-documented file formats.
Open, standardised formats are preferred for persistent storage because proprietary and closed formats may change specification without documentation, and so can become obsolete. It happens quite frequently that later versions of tools introduce features that cannot be read in earlier versions, or discontinue support for earlier features (e.g. Microsoft Office compatibility changes).
Talk with your supervisor about making a data management plan
https://projects.nfstc.org/workshops/resources/articles/ABIF_File_Format.pdf
Excel is known to change gene names without warning
Cromey D. W. (2013). Digital images are data: and should be treated as such. Methods in molecular biology (Clifton, N.J.), 931, 1–27. https://doi.org/10.1007/978-1-62703-056-4_1
https://microbiologysociety.org/blog/peter-wildy-prize-lecture-2021-dr-elisabeth-bik.html
Note that some image file formats, like .jpg, use lossy compression and discard information, and .gif is limited to a 256-colour palette; for recording original scientific images, use lossless formats like .tif or .png instead.
NCBI’s FASTA format specification: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/