During Analysis

During this crucial period, the organisation and maintenance of the dataset are particularly important, because team members are working on the data and making changes, additions and deletions. It is therefore good practice to maintain one read-only master dataset to which changes are tightly restricted. Any changes to this version should be carefully checked, verified, named and documented. The final dataset reflecting published analyses is the version to be deposited at the archive. Issues to consider include version control, storage, file structures, back-ups, security and formats.
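One way to keep a read-only version of the dataset is to remove the write permission from the file once it has been checked. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and file permissions are only one safeguard (they do not replace access controls on shared storage).

```python
import os
import stat

# All three write bits: owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def make_read_only(path):
    """Clear every write permission bit on the file, leaving other bits as they are."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~WRITE_BITS)

def is_read_only(path):
    """Return True if no write permission bit is set on the file."""
    return not (os.stat(path).st_mode & WRITE_BITS)
```

A team member would call make_read_only on the checked file and work on a separate copy. On Windows, os.chmod can only toggle the file's read-only flag, so behaviour there is coarser.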

Version Control

It is advisable to store various discrete versions of the data file (with individual version numbers). It is important to ensure that different copies of files, materials held in different formats, and information that is cross-referenced between files are all subject to version control. Checks and procedures must be put in place to ensure that if information in one file is altered, the corresponding information in other files is also altered. Perhaps also consider:

Storage

Consider whether the data are best stored as raw data files or as statistical system files. The Digital Preservation Coalition (DPC) provides details of good practice regarding data storage.

File Structures

Decisions about file structures (flat or hierarchical) need to be made and documented.
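To illustrate the difference between the two structures, the sketch below holds the same records hierarchically (households containing persons) and then flattens them to one row per person. The household and person fields are invented for the example and do not come from any particular study.

```python
# Hypothetical hierarchical structure: each household holds a list of persons.
households = [
    {"hh_id": 1, "region": "North", "persons": [
        {"person_id": 1, "age": 34},
        {"person_id": 2, "age": 36},
    ]},
    {"hh_id": 2, "region": "South", "persons": [
        {"person_id": 1, "age": 71},
    ]},
]

def flatten(households):
    """Return one flat row per person, repeating the household-level fields."""
    rows = []
    for hh in households:
        for person in hh["persons"]:
            rows.append({
                "hh_id": hh["hh_id"],
                "region": hh["region"],
                "person_id": person["person_id"],
                "age": person["age"],
            })
    return rows

flat_rows = flatten(households)
```

The flat form is easier to load into most statistical packages, at the cost of repeating household-level values on every person row. Whichever structure is chosen, the documentation should record it.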

Back-ups

A good back-up procedure will protect against a range of mishaps such as accidental changes to data, accidental deletion of data, loss of data due to media or software faults, virus infections, hackers, or catastrophic events such as fire or flood. Issues that should be given specific attention are:

Security

As digital data can be copied, altered or deleted very easily, it is important to demonstrate the authenticity of, and prevent unauthorised access to, data for ethical, legal and quality reasons. Keeping a master file, that is, a formalised and checked master copy of the data (and other materials), is critical. Moreover, copies may be preserved at certain stages of development; these are distinct from temporary working versions of data and other files.
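A common way to show that a master copy has not been altered is to record a checksum when the file is finalised and to recompute it later. The sketch below uses SHA-256 from the Python standard library; the function names are illustrative, and the choice of SHA-256 is an assumption rather than a requirement stated in these guidelines.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_master(path, recorded_checksum):
    """Return True if the file's current checksum equals the recorded one."""
    return sha256_of_file(path) == recorded_checksum
```

The recorded checksum should be stored separately from the file it describes, for example in the project documentation, so that both cannot be changed together by accident.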

Format Conversion

Statistical, spreadsheet and database packages all have slightly different data handling limits, which may also differ from the limits imposed by computer-assisted interviewing (CAI) software such as Blaise.

A summary of the limits of the three most widely used statistical packages (SPSS, Stata and SAS) can be found on UCLA's Stata pages. These differing limits mean that some data or internal metadata (missing value definitions, variable labels, etc.) will inevitably be lost upon translation. Data should be exported by data managers or other team members familiar with the data, who should check for errors or inadvertent changes that may occur in the export process. Contact the individual CESSDA data archive with queries regarding format conversion.
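Checking an export for inadvertent changes can be partly automated when both the original and the converted data are available as delimited text. The sketch below compares two such files row by row and lists every mismatch. It assumes UTF-8 files, and the function names are illustrative. It compares values only, so metadata such as variable labels still has to be checked by other means.

```python
import csv

def read_rows(path, delimiter=","):
    """Read a delimited text file into a list of rows (lists of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))

def compare_exports(original_path, exported_path,
                    original_delimiter=",", exported_delimiter=","):
    """Return (row_number, original_row, exported_row) for every mismatch.

    Row numbers start at 1. A row missing from one file is reported as None.
    """
    original = read_rows(original_path, original_delimiter)
    exported = read_rows(exported_path, exported_delimiter)
    differences = []
    for i in range(max(len(original), len(exported))):
        a = original[i] if i < len(original) else None
        b = exported[i] if i < len(exported) else None
        if a != b:
            differences.append((i + 1, a, b))
    return differences
```

An empty result means the two files hold the same values in the same order. A typical finding would be a missing value code such as -9 that has become an empty field in the converted file.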

Whatever method of conversion is chosen, whether it be an 'export' function of the package or third party data translation software such as StatTransfer, the results should be extensively tested, and the same tested method should be used throughout the data conversion process, since new or different methods may introduce errors. Back-ups of master copies should always be in formats that are suitable for long-term digital preservation by the data centre or archive that will subsequently preserve the data. This typically means open as opposed to proprietary formats. Data formats preferred by data archives and centres for data management, submission of data and preservation are summarised in the chart below.

All text should be encoded as ASCII or Unicode. When data may contain non-ASCII characters (for example, accented or non-Latin characters) it should always be encoded as Unicode. Newer versions of software are likely to use Unicode by default. Note that XML requires the use of Unicode.

All recommended formats are subject to change over time as new archival and interchange formats are developed. In particular, an XML schema for statistical datasets (tabular data with extensive metadata) would be extremely useful. The closest development so far is the Triple-S data model, though it is still not able to store all the internal metadata and variable format information of a typical SPSS, SAS or Stata file. Recent versions of SAS, SPSS and Stata have their own XML data models, which may become useful intermediate formats for conversion (via XSLT) to a new common XML standard.
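The encoding advice above can be applied mechanically when writing text files. The sketch below tests whether a piece of text is pure ASCII and otherwise writes it as UTF-8. UTF-8 is assumed here as the Unicode encoding, since the guidelines do not name one, and the function names are illustrative.

```python
def is_ascii(text):
    """Return True if every character in the text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

def choose_encoding(text):
    """Use ASCII when it is sufficient, otherwise fall back to UTF-8."""
    return "ascii" if is_ascii(text) else "utf-8"

def write_text(path, text):
    """Write the text with the chosen encoding and return that encoding's name."""
    encoding = choose_encoding(text)
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)
    return encoding
```

Because ASCII text is also valid UTF-8, simply writing everything as UTF-8 is an equally reasonable choice. The encoding actually used should be recorded in the documentation.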

Preferred Data Formats

Type of data: Tabular data with minimal metadata (i.e. a matrix of data with or without column headings/variable names, but no other metadata)
Preferred format for deposit:
Delimited text of given character set, with SQL data definition statements where appropriate
Acceptable formats for deposit:
Delimited text of given character set, with SQL data definition statements where appropriate
Delimiters such as commas or tabs (or pipe) are most commonly used, and most widely recognised by import 'wizards'
Only characters not present in the data should be used as delimiters, or if unavoidable, data should be surrounded by inverted commas to distinguish between delimiters and characters in the data
Widely-used proprietary formats e.g. Excel, Access, dBase, are acceptable but offer less long-term security

Type of data: Tabular data with extensive metadata (e.g. a survey dataset with variable labels, code labels, and missing values, in addition to the matrix of data)
Preferred format for deposit:
Delimited text and command file (SPSS, Stata, SAS, etc.)
Other structured text/markup file containing metadata information, e.g. DDI XML file
Acceptable formats for deposit:
SPSS portable (.por) or delimited text and command file (SPSS, Stata, SAS, etc.) containing metadata information
Binary formats of statistical packages (SPSS, Stata, SAS, etc.) are acceptable, but offer less long-term security
It may not be possible to accept very old formats created with certain versions of software

Type of data: GIS and CAD data (e.g. vector and raster)
Preferred format for deposit:
ArcInfo export format (.e00) for vector data
MapInfo Interchange Format (MIF) for vector data
TIFF (version 6) for raster data
DXF or SVG for CAD data
GIS attribute data as per 'tabular data with minimal metadata'
Acceptable formats for deposit:
ArcInfo export format (.e00) for vector data
MapInfo Interchange Format (MIF) for vector data
TIFF for raster data
Adobe Illustrator, DXF or SVG for CAD data
GIS attribute data as per 'tabular data with minimal metadata'
Binary formats of GIS and CAD packages may be acceptable

Type of data: Qualitative (textual) data
Preferred format for deposit:
XML marked-up text according to an appropriate DTD or schema
RTF
Acceptable formats for deposit:
Plain text
RTF or HTML
Software specific formats such as NUD*IST, NVivo and ATLAS.ti may be acceptable, but offer less long-term security

Type of data: Digital audio data
Preferred format for deposit:
Microsoft Waveform
Audio Interchange File Format (.aiff)
Acceptable formats for deposit:
Microsoft Waveform
MPEG-1 Audio Layer 3 (MP3)

Type of data: Digital video data
Preferred format for deposit:
MPEG-2
JPEG 2000
Acceptable formats for deposit:
MPEG-2
JPEG 2000

Type of data: Digital image data
Preferred format for deposit:
TIFF (version 6)
Acceptable formats for deposit:
TIFF (most formats, though CCITT Group 4 is generally considered to be the most straightforward)
PDF or PDF/A

Type of data: Documentation
Preferred format for deposit:
Plain text
PDF, RTF, HTML
XML marked-up text according to an appropriate DTD or schema e.g. XHTML 1.0
Acceptable formats for deposit:
PDF, RTF, HTML

(Source: ESDS Data Formats and Software web page, www.esds.ac.uk/aandp/create/data.asp [17.9.2008])
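The delimiter guidance in the table above (use a character that does not occur in the data, or otherwise surround values with inverted commas) can be sketched as follows. The candidate list of comma, tab and pipe follows the table. The function names and the fallback of comma with every field quoted are assumptions made for the example.

```python
import csv
import io

CANDIDATE_DELIMITERS = [",", "\t", "|"]

def pick_delimiter(rows, candidates=CANDIDATE_DELIMITERS):
    """Return the first candidate that appears in no data value, else None."""
    for delimiter in candidates:
        if not any(delimiter in str(value) for row in rows for value in row):
            return delimiter
    return None

def to_delimited_text(rows):
    """Return (delimiter, text) for the rows, quoting only when unavoidable."""
    delimiter = pick_delimiter(rows)
    if delimiter is None:
        # Every candidate occurs in the data: fall back to commas
        # and surround every field with double quotes.
        delimiter, quoting = ",", csv.QUOTE_ALL
    else:
        quoting = csv.QUOTE_MINIMAL
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, quoting=quoting,
                        lineterminator="\n")
    writer.writerows(rows)
    return delimiter, buffer.getvalue()
```

For data containing an address such as "12 High St, Leeds", the comma is rejected and the tab is chosen, so no quoting is needed. The delimiter chosen should be stated in the documentation that accompanies the file.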