• Data formats and file formats determine how data are stored and read
  • Community or other international standards exist for some kinds of data
  • Data and file formats may be closed or open; either format may be proprietary
    • It is best practice to use open and non-proprietary data formats for exchange and storage of data
  • When requesting data, ask for it to be provided in an open format

1 Introduction

Depending on the type of project and the analysis you are doing, you may end up generating or otherwise dealing with a number of different data formats.

Understanding what type of data you generate, and how it can be formatted, is a key part of experimental design. Before you perform an experiment, you should already have a good idea of what kind of data it will generate, how you will record it, and how it will be formatted and stored. This is all part of a good data management strategy[1]. The type of data generated depends entirely on the experiments being performed, and the same data can often be formatted in a variety of ways.

Example: how data can be generated and stored in different formats

Suppose that, in your project, you analyse the size of DNA fragments using gel electrophoresis, and take an image of the resulting gel using a gel doc. This produces an image, which could be saved as an image file, and/or printed out and pasted in your lab notebook, before being interpreted.

The raw data generated by this experiment is an image. This will likely be stored in an image file, such as a .tif, .jpg, .bmp, or .png file. The analysis might also generate a list of DNA fragment sizes, obtained by comparing the sizes of your DNA fragments to standards of known size (the DNA ladder). This list of numbers is also data, and it might be stored in one of a number of different formats. On the computer this might be in plain text as a .txt or .csv file, or in a proprietary form as a .xlsx or .docx file. You might even keep the data as a handwritten list in your lab notebook. Your choice might be based on what is customary and convenient for the project, what is most compatible with the downstream analyses that will be performed on these data, and/or what is most compatible with FAIR principles.

If your project involves analysis of data in a public repository, such as transcriptome data stored at GEO[2], you might download a set of reads in the .fastq.gz format - an open standard for storing sequence data (.fastq), compressed to save space (.gz). You might instead download normalised transcript level data as a table in .txt.gz format (plain text tabular format, compressed to save space). When you analyse the data locally, you might convert the data to a comma-separated values (.csv) table of gene names and transcript levels, save the plots you produce as .png (open, non-proprietary) files for sharing or as .pdf (open, proprietary) files for publication, and save the scripts you write as .R or .py files containing code as plain text.
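As a minimal sketch of working with compressed plain-text data, the Python example below reads a gzip-compressed, tab-separated table and rewrites it as CSV using only the standard library. The gene names and values are invented for illustration, and the compressed data is built in memory rather than downloaded from GEO.

```python
import csv
import gzip
import io

# Stand-in for a downloaded .txt.gz file: a small tab-separated table,
# gzip-compressed in memory (gene names and values are invented)
table_text = "gene\tlevel\ngeneA\t12.5\ngeneB\t3.1\n"
compressed = gzip.compress(table_text.encode("utf-8"))

# gzip.open decompresses on the fly, so the file never needs unpacking on disk
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as handle:
    rows = list(csv.reader(handle, delimiter="\t"))

# Re-write the same rows as comma-separated values
csv_buffer = io.StringIO()
csv.writer(csv_buffer, lineterminator="\n").writerows(rows)
csv_text = csv_buffer.getvalue()

print(csv_text)
```

With a real download you would pass the file path to `gzip.open` instead of the in-memory buffer, and write the CSV to a new file so that the original download is left untouched.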

Regardless of any downstream analyses or data formatting/reformatting decisions, you must always save an unmodified original copy of the raw data - this is a fundamental principle of good scientific practice.

You must always be able to return to the original data in the form it was originally collected/recorded.
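One way to check that a raw data file is still in its original form is to record a checksum when the file is collected, and to recompute it later. The sketch below is only an illustration of that idea: the function name `sha256_of_file` and the file name are hypothetical, and a throwaway temporary file stands in for real raw data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmpdir:
    raw_file = Path(tmpdir) / "gel_image_raw.tif"  # hypothetical file name
    raw_file.write_bytes(b"pretend these bytes are the raw image")
    checksum_at_collection = sha256_of_file(raw_file)

    # Later: recompute and compare to confirm the file is unchanged
    unchanged = sha256_of_file(raw_file) == checksum_at_collection

    # Any modification, however small, changes the checksum
    raw_file.write_bytes(b"pretend these bytes are the raw image!")
    modified = sha256_of_file(raw_file) != checksum_at_collection

print(unchanged, modified)
```

Storing the checksum in a plain-text file next to the raw data gives you a simple record you can use to confirm that later processing worked on a copy, not on the original.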

Data are often associated with metadata - data that provide information about the data.

Metadata may be in a different format to the data itself (e.g. a text file describing the sample characteristics and other metadata, which might be paired with the sequencing reads obtained from that sample; or the date and GPS coordinates at which a photograph was taken, and the identity of the photographer). Without metadata, data may lack informative context and may even be uninterpretable or unusable. It is therefore important that your plan for data management includes a plan for how you will accurately record, format, and store the metadata for your experiments.

In this workshop we will cover some of the data formats you are likely to encounter when doing your honours project. You may also meet other, specialised data formats that are outwith the scope of this workshop, but the general rules for good data management still apply.

2 Proprietary and Open Formats

2.1 Proprietary data formats

Proprietary data formats are defined and/or controlled by an individual or organisation, often to support their own software. Proprietary formats may even only be readable or writable by that provider’s software. Examples of proprietary formats include: Applied BioSystems’ .ab1 genetic analysis data file format[3]; Adobe’s .psd files; Nikon’s .nef files; and the .mp3 audio format. The key feature of a proprietary format is that it is - or was - not intended to be publicly known, or to be used without a licence.

Proprietary formats may be closed proprietary, in the sense that their specification and definition are a “trade secret” (like Adobe’s .psd files), or open proprietary, where the specification is published but maintained by a private organisation, like the .mp3 file format.

2.2 Open data formats

Open data formats are defined by published, public specifications, often under the control of a public community or standards organisation. Open formats include HTML, .png image files, plain text formats, and the .odt OpenDocument Text format (an alternative to Microsoft’s .docx files).

Open formats are independent of any particular software tool or operating system, and are machine-readable, but may or may not be human-readable.

Example: a comparison of proprietary vs open data formats

We saved a simple project README file - intended to introduce the repository containing these course materials - in four formats, to demonstrate some of the practical differences between proprietary and open formats. In each case, we’re looking at the first few lines of the output file, to compare readability and data content.

Markdown (open)

The first file is written in Markdown, an open format with multiple different standards. It is plain-text, human readable, and can describe both metadata and data content. It is intended primarily to represent the content of a document, with some limited guidance about its presentation.

$ head README.md
---
output:
  word_document: default
  html_document: default
  pdf_document: default
---
# BM432

Welcome to the BM432 computational biology repository!

HTML (open)

HTML (HyperText Markup Language) is in some ways the “base language” of the web. It is plain text, human-readable, and can describe both metadata and data content. It contains elements intended to be machine-readable, and to guide browser presentation.

$ head README.html
<!DOCTYPE html>

<html>

<head>

<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" />

PDF (proprietary/open)

The PDF format has multiple versions, some of which are open, and some are proprietary. It can be considered as a hybrid open/proprietary format, and its main disadvantages are that it was designed to be read by specific software, not humans, and is often stored in a compressed form, making it more vulnerable to errors in file-copying that can render the file unusable.

$ head README.pdf
%PDF-1.5
%????
15 0 obj
<<
/Length 1915      
/Filter /FlateDecode
>>
stream
xڽX]w?6}???I?(~???i7M?ԻY?m?>P$,aMZ?????0E2??8??3??{/@?m@??/??????XL?8???}?? c??(
                                                                                ??????u???w???IP?"?R??9#9ǡ???h%?J??:??h?+?ܹ?
 ??+u????-???Q?#6?8?C??T'{??3&??B??"?pѻ???~???j??ח?u??e/?,???6v??}?{?y\??C?S7Ft??&R??ƨ1?߽??$@??$J????]|3
                                                                                                       *|?E
                                                                                                           ???ﴲ?GT???-?b/ЧK?m?
?߰??B??_??t?(&i?t??v??eX?9v?{~??o???Ґ??
                                      ?0lYኧ???Ԡ+?ҫK?T???P??3P?????o[~?]?Y??D??k?????AvC??ή??O?We#??????@???????_hl?]???

Word .docx (proprietary/open)

The Microsoft Word .docx format is also a mixture of proprietary and open formats. The specification has been made public, but the format again is not human-readable, prioritises commercial software and is largely under the control of Microsoft. It is also stored in a compressed form, vulnerable to copy errors.

$ head README.docx 
u*Se݋
???I?8???2??}.^?6??L{!?^R?Ft???-Bb???k?t?+??+?w??=?? ???`??3?,0z???F??,zj?*??   ?
%R??U~ȍ??؋?%??{?Z?p{?x?(?? ?
                            ?oB?Z??-,??b    I??#?6ϕ???R$??B >]$?????R/?S??}???.?`a???@??h??}?4??9),IJ??_???tA???.pR??T??A?t?ފK????a?*???e@?/?rCs?z3?/`?I?y??[??;h+Z(?oDZ]uŁ???h,?=Z????{?ou*S?=?^??
                                                                            _rels/.rels???N1
                                                                                            E?|E?}?S
?f?A??C?|@?xj??B?{B@Q??2?????b??81h?V5(
?O4L?rx妠֯?NaǶ,?G???Ȉ_?B6ܑhx???}?????u?ΩC{???M?~???S?y(?&?Q???Jo=?qF
                                                                  ??4.??5?>??.K??}d????8?7u*Sa?)?J&word/document.xml?Z[W?8~?_??3&΅[???圲?Sh?????Zd?G????wF?)?&?dӗ8??????7??ۯ?$3n??jt???p?t"?t|??
                                                                      ???????͊?+G`????(H?ˇ??e)Ϩ??9W?l?MF?4??\?$7?qkA^&;?(:?dT????#FO&??w???6B?K?.????T䶑?GAa԰f?m?ąLg?JJ}if̞?1?d3nލ֐??53?:"+K

3 Tabular Data

Much of the data we work with can be represented in tabular format (see the What is a Dataset? notebook). This kind of data can be handled intuitively in a spreadsheet application, like Microsoft Excel, Apple’s Numbers application, or Google Sheets. However, spreadsheets enable some bad data practices that are best avoided (see below).

In general, so long as care is taken it is reasonable to collect, explore and examine data using spreadsheets, but the data itself is best stored in platform-independent plain text formats, such as CSV (Comma-Separated Values) or TSV (Tab-Separated Values) files, rather than the proprietary file formats of a spreadsheet program. This maximises shareability and reuse, and has an element of future-proofing against changes to the file specification of commercial tools, like Excel.
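As an illustration, the following Python sketch writes a small table as CSV and as TSV with the standard `csv` module, then reads the CSV back. The column names and values are made up, and the helper `to_delimited` is a hypothetical name used only for this example.

```python
import csv
import io

# Invented example data: DNA fragment sizes read from a gel
rows = [
    {"fragment": "band_1", "size_bp": 1500},
    {"fragment": "band_2", "size_bp": 700},
]


def to_delimited(rows, delimiter):
    """Serialise rows as delimited plain text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["fragment", "size_bp"],
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(rows, ",")
tsv_text = to_delimited(rows, "\t")

# Reading the text back: every value comes back as a string
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
print(read_back)
```

Note that plain-text tables carry no type information, so numbers are read back as strings and must be converted explicitly. This is one reason to record column meanings and units in your metadata.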

3.1 Spreadsheet Bad Practices

Spreadsheets are powerful and useful tools and, when handled with care, can add value to your work quickly.

However, they can enable and encourage bad practice and, sometimes, can make your results invalid without you noticing.

3.1.1 No separation of raw, cleaned and analysed data

It is possible, and unfortunately quite common, for people to read their raw data into a spreadsheet, modify the data in-place (data cleaning), and analyse the data in exactly the same worksheet. This goes against good data management principles (see the Data Analysis notebook):

  • keep “raw” data separate from “cooked” (cleaned/analysed) data: being able to readily distinguish raw from processed data makes your project workflow more transparent; combining the two in the same worksheet makes analyses harder to follow; modifying data in-place destroys the original data
  • keep data logically separate from the analysis “code”: if your analysis takes a data file as input, and produces an output file, it is clear what the workflow is doing, and there is the potential for reusability of the analysis with a new or modified dataset. Spreadsheets combine the dataset with the analysis, making it harder to substitute in a new dataset and rerun the analysis, and sometimes even hard to understand the flow of the analysis itself.
  • Some spreadsheet software will allow the use of multiple worksheets within a workbook, but require saving those files in a proprietary format that is tied to the specific spreadsheet package, limiting exchange and possibly restricting future use.
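The separation of raw data, cleaned data, and analysis code can be sketched in Python as follows. This is only one possible arrangement: the function `clean_table`, the directory names, the column names, and the cleaning rule (dropping rows with no size recorded) are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def clean_table(raw_path, cleaned_path):
    """Copy rows with a non-empty size_bp value from raw_path to cleaned_path."""
    raw_path, cleaned_path = Path(raw_path), Path(cleaned_path)
    if raw_path.resolve() == cleaned_path.resolve():
        raise ValueError("cleaned output must not overwrite the raw data")
    kept = 0
    with open(raw_path, newline="") as src, open(cleaned_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["size_bp"].strip():
                writer.writerow(row)
                kept += 1
    return kept


# Invented raw data; band_2 has no size recorded
RAW_BYTES = b"fragment,size_bp\nband_1,1500\nband_2,\nband_3,700\n"

with tempfile.TemporaryDirectory() as tmpdir:
    raw_file = Path(tmpdir) / "raw" / "fragments.csv"
    cleaned_file = Path(tmpdir) / "cleaned" / "fragments_cleaned.csv"
    raw_file.parent.mkdir()
    cleaned_file.parent.mkdir()
    raw_file.write_bytes(RAW_BYTES)

    rows_kept = clean_table(raw_file, cleaned_file)
    raw_untouched = raw_file.read_bytes() == RAW_BYTES
    cleaned_lines = cleaned_file.read_text().splitlines()

print(rows_kept, raw_untouched, cleaned_lines)
```

Because the cleaning step is a function that takes an input file and writes a different output file, the same step can be rerun on a new dataset, and the raw file is never edited in place.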

3.1.2 Point-and-click interface

It is convenient to be able to click on a cell and change its format, or its value. It’s convenient to be able to move data around by clicking, and dragging the mouse pointer. But these are bad data practices.

  • raw data should not be modified: the spreadsheet interface makes it easy to modify raw data deliberately without any record, and possibly even to modify it accidentally - again without any record. Most disturbingly, some spreadsheet software, like Excel, can change your data silently, making your analyses invalid[4] (http://ziemann-lab.net/public/gene_name_errors/Report_2021-08.html).
  • annotation and metadata should be explicit and transparent: it’s tempting to use spreadsheet colours, or fonts, to “annotate” your data - to indicate high, low, or “faulty” values, for instance, or to categorise groups. Unless there is a clear record of the meanings of those colours, the annotations may not be understandable to others. Some colour choices may be indistinguishable to a reader, perhaps because of colour-blindness. If the file is saved in any way other than the proprietary spreadsheet format, it may be impossible to indicate those colours, potentially locking the data into a short-lived, proprietary format that is hard to share.
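A well-known instance of silent modification is a spreadsheet converting gene symbols such as SEPT2 or MARCH1 into dates like 2-Sep or 1-Mar. The Python sketch below flags values in a gene-name column that look like such dates. The regular expression is an assumption that covers only one common display style (day number, hyphen, three-letter month), so treat it as a starting point, not a complete check.

```python
import re

# Matches strings such as "2-Sep" or "1-Mar" (one display style only; an assumption)
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def flag_date_like(values):
    """Return the values that look like spreadsheet-converted dates."""
    return [value for value in values if DATE_LIKE.match(value.strip())]


# Invented column of gene names, two of which have been converted to dates
gene_column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]
suspicious = flag_date_like(gene_column)
print(suspicious)
```

A check like this can only warn you that damage has happened; recovering the original names still depends on having kept an unmodified copy of the raw data.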

3.1.3 Closed, proprietary, compressed file formats

By default, spreadsheet software will save your data in a proprietary format for quick reading, writing, and preservation of graphs. These formats are typically tied to the application and can vary between application releases, making them brittle against version changes, or unreadable across different operating systems. They are not reliable for long-term storage. Although these formats may preserve colour formatting, graphs, and other annotations, other software tools usually cannot extract those features easily, limiting the data’s reusability.

4 Image Data

Many biological experiments involve collecting image data (e.g. photographs of samples; gel pictures; micrographs captured by light or electron microscopy).

Common image data formats include: .tiff/.tif, .gif, .jpg, .png, .bmp, and so on. There are also a large number of proprietary image data formats (e.g. associated with a particular camera or software.)

There are very stringent rules governing the acceptable practices for handling image data[5].

Manipulating an image (even seemingly harmless, “artistic” adjustments to scale the brightness or contrast) changes the data and can affect how these data are perceived and interpreted.

Therefore, you must always:

  • Save a copy of the original, unedited image
  • Record any adjustments that were made, and how they were made (some software will do this automatically).
  • Simple manipulations (e.g., cropping to remove irrelevant parts of the image, careful adjustments of brightness and contrast applied to the entire image) are usually acceptable.
  • Manipulations specific to one part of the image, or that duplicate part of an image, are questionable at best - and usually completely unethical.
  • Be sure to compare the original and processed image, to ensure that the manipulations do not alter the data.
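To make these points concrete, here is a toy Python sketch of a brightness adjustment applied to an entire image, with the adjustment recorded in a log and the original kept unchanged. The image is an invented 2×2 grid of 8-bit grey values, and the function and log field names are hypothetical; real image software works on far larger arrays, but the same reasoning applies.

```python
def adjust_brightness(pixels, offset):
    """Return a new image with offset added to every pixel, clipped to 0-255."""
    return [[min(255, max(0, value + offset)) for value in row] for row in pixels]


original = [[10, 200], [250, 0]]  # invented 8-bit grey values
adjustment_log = []

# Apply the same adjustment to the whole image, and record what was done
adjusted = adjust_brightness(original, 10)
adjustment_log.append({"operation": "brightness", "offset": 10, "region": "entire image"})

# Compare original and processed data: how many pixels hit the 255 ceiling?
clipped = sum(value + 10 > 255 for row in original for value in row)

print(adjusted, clipped, adjustment_log)
```

In this example one pixel is clipped at the maximum value, so even a uniform adjustment has changed the relationship between pixel values. This is why the original must be kept and the processed image compared against it.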

See the Nature journals policy “Image integrity and standards” for an example of the rules that journals set out to ensure that image data is acquired and processed appropriately. If you do not follow such guidelines, your data may be unpublishable.

Elisabeth Bik’s Peter Wildy prize lecture[6] is an interesting introduction to image manipulation and scientific integrity in the biological literature.

4.1 Image Compression

Image files can be very large - and some experiments (e.g., time-lapse microscopy) involve the acquisition of hundreds of thousands of images. There are algorithms which can compress these files into smaller files - these generally fall into two categories, lossless and lossy. Lossless compression preserves all of the image data. Lossy compression, as the name implies, discards some of the original data, and often produces a smaller file than lossless compression does[7].

For example, micrograph files are commonly captured as .tiff/.tif files. It may be tempting to compress these, for example, to .jpg files in order to save space. However, this is an example of lossy compression and therefore the temptation should be avoided! Always retain the original .tiff/.tif files from your experiments. (Your data management plan should take into account the storage requirements for acquiring and preserving such large amounts of data.)
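The difference between the two categories can be illustrated with a short Python sketch. Lossless compression is shown with the standard `zlib` module. The "lossy" step is a deliberately crude stand-in (rounding each pixel value down to a multiple of 16) and is not how .jpg compression actually works; it is included only to show that discarded information cannot be recovered. The pixel values are invented.

```python
import zlib

# Invented stand-in for 8-bit greyscale pixel data
pixels = bytes([12, 13, 12, 200, 201, 199, 57, 58] * 32)

# Lossless: decompressing returns exactly the original bytes
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)
lossless_round_trip = restored == pixels

# Toy lossy step: round every value down to a multiple of 16 before storing
quantised = bytes((value // 16) * 16 for value in pixels)
information_lost = quantised != pixels

print(len(pixels), len(compressed), lossless_round_trip, information_lost)
```

After the lossy step, the values 12 and 13 are both stored as 0, so there is no way to tell afterwards which pixels originally differed.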

5 Specialised Data Formats

There are a number of specialised data formats that you may encounter during the course of your project or your further studies. The following is not an exhaustive account of all possible file formats used in biological experiments, but is intended as a guide to some of the common formats that you may encounter.

5.1 Nucleotide and amino acid sequence files

5.1.1 FASTA (.fasta, .fa, .fas, .faa, .fna, .ffn, etc.)

The de facto standard for sequence data files is FASTA (.fasta). FASTA files are plain text files, with few formatting rules. There is a simple header line - announcing the start of a new sequence - which begins with a greater-than symbol (>) followed immediately by a sequence identifier string, some whitespace, and a free-form description string. The sequence itself follows on one or more lines:

>SEQID_0000001 The rest of this line is a description of the sequence in some way; GN=gene0001; EXPN=False; ORG=E. coli
atcgatcgatctagctgcggagcgactacgacgactagctagcta
atcgatgctagtcgtacggctagctagctgatgctgcgactgcat
atcgatgctag
>SEQID_0000002 This is a different sequence, and the description format does not need to follow the same rules to be valid
gctagtcgagcatgcatgtcgatgctagctgatcgatgtgctacg
cgatcatcgacgac

The format rules are identical for nucleotide and protein sequences, but it is not usual to mix the two types in the same file. The format is extremely flexible, and you will likely see many variations. Some services are strict about the formatting of the accession and the description string[8].
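Because the rules are so simple, a FASTA file can be read with a few lines of code. The Python function below is a minimal sketch, assuming well-formed input like the example above; the name `read_fasta` is hypothetical, and for real work an established library is likely to be more robust to the many variations you will meet.

```python
import io


def read_fasta(handle):
    """Yield (identifier, description, sequence) for each record in a FASTA file."""
    identifier, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


example = io.StringIO(
    ">SEQID_0000001 GN=gene0001; EXPN=False; ORG=E. coli\n"
    "atcgatcgatctagctgcggagcgactacgacgactagctagcta\n"
    "atcgatgctagtcgtacggctagctagctgatgctgcgactgcat\n"
    "atcgatgctag\n"
    ">SEQID_0000002 This is a different sequence\n"
    "gctagtcgagcatgcatgtcgatgctagctgatcgatgtgctacg\n"
    "cgatcatcgacgac\n"
)
records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```

The sequence lines of each record are joined into a single string, since line breaks within a sequence carry no meaning in FASTA.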

By convention, the contents of a FASTA file may be indicated by the file extension. It is common to see .faa for amino acids, .fna for genome sequences, and .ffn for nucleotide sequences describing gene features. Other extensions, like .fasta, .fa, .fas are generic and considered uninformative about the content.

It is possible to associate some file extensions with particular programs (e.g. .docx files are likely word processor documents, readable by Microsoft Word), but it is possible to change a file’s extension without changing its format. There is no rule that says a file extension must correspond to the file format and, in particular, you cannot change a file’s format just by changing its extension.
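Many binary formats can be recognised from their first few bytes, whatever the file is called. The Python sketch below checks a handful of well-known signatures; the function name `sniff_format` and the short signature list are choices made for this example, and the list is far from complete.

```python
# A few well-known file signatures ("magic bytes"); this list is not exhaustive
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (e.g. .docx, .xlsx)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\x1f\x8b", "gzip-compressed data"),
]


def sniff_format(first_bytes):
    """Guess a file's format from its leading bytes, ignoring its name."""
    for signature, name in SIGNATURES:
        if first_bytes.startswith(signature):
            return name
    return "unknown (possibly plain text)"


# Invented examples: the file name plays no part in the result
examples = {
    "README.txt": b"%PDF-1.5\n",              # a PDF renamed to .txt is still a PDF
    "README.docx": b"PK\x03\x04\x14\x00",
    "sequences.fasta": b">SEQID_0000001 description\n",
}
guesses = {name: sniff_format(content) for name, content in examples.items()}
print(guesses)
```

To apply this to a real file you would read its first few bytes in binary mode, for example with `open(path, "rb").read(8)`.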

5.1.2 GenBank format (.gbk, .gbff, .genbank, etc.)

GenBank files were created to store annotation and metadata alongside sequence information, in a standardised format that is both human and machine-readable. GenBank files contain much more detailed information (annotation, references, remarks, etc.) than FASTA files can hold.

Like many bioinformatics data files, information is stored in a “key:value” format. For instance, a GenBank header may read:

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)

There are four key:value pairs. The key states what the information type is, and the value records the information itself. For instance, in the example above, the key JOURNAL states that the information is a journal reference, and the value is the reference itself: Yeast 10 (11), 1503-1509 (1994).
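As a rough sketch of how such key:value lines can be read by a program, the Python function below splits each line of the header above into a key and a value. It assumes the fixed-column layout seen in the example, where the key sits in the first 12 columns and a line with no key continues the previous value. The function name is hypothetical, and this is not a full GenBank parser.

```python
def parse_genbank_header(text):
    """Split GenBank-style header lines into (key, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            pairs.append([key, value])
        elif pairs:
            # No key in the first 12 columns: continuation of the previous value
            pairs[-1][1] += " " + value
    return [(key, value) for key, value in pairs]


header_text = "\n".join([
    "REFERENCE   1  (bases 1 to 5028)",
    "  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.",
    "  TITLE     Cloning and sequence of REV7, a gene whose function is required for",
    "            DNA damage-induced mutagenesis in Saccharomyces cerevisiae",
    "  JOURNAL   Yeast 10 (11), 1503-1509 (1994)",
])
pairs = parse_genbank_header(header_text)
for key, value in pairs:
    print(key, "->", value)
```

The two lines of the TITLE entry are joined into one value, which gives the four key:value pairs described above.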

The actual GenBank format becomes more complicated as it describes annotations. For instance, both a coding sequence and the gene to which it belongs may be described as:

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"

There are two keys: CDS (indicating a coding sequence), and gene (indicating the corresponding gene) with values that identify the location of that feature on the genome. Each of the features may contain further key:value pairs, indicating feature-specific values; these keys begin with a forward slash / and describe some aspect of the feature. GenBank format files can hold a rich variety of information[9], but a full explanation is beyond the scope of this workshop.

5.2 Structure files

The current standard format for biological macromolecule structure files is the PDBx/mmCIF format (before 2012, it was the PDB format; older structures may still be in this format.) Documentation and FAQ are available.

5.3 Raw sequencing reads

If you are doing experiments that involve DNA sequencing, your data will be generated in different formats depending on the sequencing technology you use. Two commonly used formats are FASTQ and ABI files.

5.3.1 FASTQ files

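As the name suggests, FASTQ files are similar to FASTA files; however, FASTQ files include quality scores that reflect the quality of each base call. Quality scores are written as a string of characters, one character per base, on the line below a + separator that follows the read sequence. Each record therefore occupies four lines: an identifier line beginning with @, the read sequence, the + separator, and the quality string.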

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
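The Python sketch below reads the record above and converts each quality character into a numerical score. It assumes the widely used "Phred+33" encoding, in which the score is the character's ASCII code minus 33; older data may use a different offset, so check how your own files were produced. The function name `parse_fastq` is hypothetical, and the parser assumes exactly four lines per record.

```python
def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records into (identifier, sequence, scores) tuples."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for start in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("need exactly one quality character per base")
        # Phred+33 (an assumption): score = ASCII code of the character - offset
        scores = [ord(character) - offset for character in quality]
        records.append((header[1:], sequence, scores))
    return records


fastq_text = "\n".join([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
identifier, sequence, scores = parse_fastq(fastq_text)[0]
print(identifier, len(sequence), min(scores), max(scores))
```

Under this encoding the first base of the example read has a score of 0 (the `!` character), the lowest possible value, while the bases marked `F` have the highest score in this read.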

5.3.2 ABI files

If you sequence any samples using Sanger sequencing, the raw sequence data is usually in the ABI file format: this contains the chromatogram showing the peaks generated during the sequencing run.

You will likely also receive a FASTA file containing the bases called during the sequencing run - however, you should always examine the raw data (the chromatogram) carefully.

In the figure below, you can see an example of heterozygosity - the sequencing read is relatively clean and most peaks are well-defined and correctly called, but in some cases there are mixed peaks (for example, peak 358 with A/G peaks). Peak-calling software is not well-equipped to deal with mixed peaks, and so it is important that you examine the sequence chromatograms manually.

Figure 5.1: An example of a DNA chromatogram showing heterozygosity (Figure from Mallet 2019, 10.11646/zootaxa.4679.3.11)

6 How Should I Choose Formats?

There is no hard-and-fast rule for the most appropriate format(s) for your research data, though there are a number of considerations that can guide your choice (see below).

For openness, transparency, and longevity, it is best to prioritise:

  • formats that are common and widely-used in your field of work
  • open, standardised data formats
  • plain-text human readable formats (depending on data type)
  • non-proprietary formats
  • unencrypted data formats
  • uncompressed data formats (though compression may be essential for large files)
  • formats that strongly associate data with corresponding metadata
  • formats that can be stored and accessed conveniently
  • formats that are compatible with downstream analyses
  • formats that are acceptable to scientific publishers (e.g., for journal deposition requirements)

6.1 Choose data and file formats common in your field

Each academic field tends to reach some kind of consensus on which formats are most commonly or widely used. This may be, for example, because the data format expresses the data well, or because one or other software tool (and its associated file format) is very common.

These conventions are often codified, for example by journal or database policies specifying which file formats are acceptable for submission. If you do not pay attention to these policies, your data may not be correctly formatted for publication.

To maximise your ability to exchange data, you should choose data formats that are appropriate and common in your own field of research.

6.2 Choose a data and file format that will preserve your data for an appropriate time

A critical question is: how long do you need your data to be readable and exchangeable? If you never need to share your data, and it exists only for a very short time, the format choice is governed entirely by your ability to store all the data you need efficiently. But the longer you need to keep your data, and the greater the need for exchangeability, the more you will want to consider open, standardised and well-documented file formats.

Open, standardised formats are preferred for persistent storage because proprietary and closed formats may change specification without documentation, and so can become obsolete. It happens quite frequently that later versions of tools introduce features that cannot be read in earlier versions, or discontinue support for earlier features (e.g. Microsoft Office compatibility changes).


  1. Talk with your supervisor about making a data management plan

  2. The Gene Expression Omnibus

  3. https://projects.nfstc.org/workshops/resources/articles/ABIF_File_Format.pdf

  4. Excel is known to change gene names without warning

  5. Cromey D. W. (2013). Digital images are data: and should be treated as such. Methods in molecular biology (Clifton, N.J.), 931, 1–27. https://doi.org/10.1007/978-1-62703-056-4_1

  6. https://microbiologysociety.org/blog/peter-wildy-prize-lecture-2021-dr-elisabeth-bik.html

  7. Note that .jpg compression is lossy and discards information, and .gif is limited to a palette of 256 colours, so converting an image to .gif can also discard information; .png compression is lossless. For recording original scientific images, use lossless formats like .tif.

  8. NCBI’s FASTA format specification: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/

  9. GenBank Feature Table Definition