During Analysis
During this crucial period, the organisation and maintenance of the dataset are particularly important, because team members are working on the data and making changes, additions and deletions. It is therefore good practice to maintain one read-only dataset to which changes are tightly restricted. Any changes to this version should be carefully checked, verified, named and documented. The final dataset reflecting published analyses is the version to be deposited at the archive. Issues to consider include version control, storage, file structures, back-ups, security and formats.
Version Control
It is advisable to store discrete versions of the data file, each with its own version number. It is important to ensure that different copies of files, materials held in different formats, and information that is cross-referenced between files are all subject to version control. Checks and procedures must be put in place to ensure that if information in one file is altered, the corresponding information in other files is also altered. Also consider:
- Unique identification of files, preferably using a Uniform Resource Name (URN) convention.
- Recording version and status e.g. draft, interim, final, internal.
- Maintaining single master files in a suitable format to remove version control problems associated with multiple working versions being developed in parallel.
- Recording relationships between items: in many cases the information contained in a single file is supported by information held in other files, e.g. between code and the data file it is run against, or between the data file and documentation or metadata that relates to it.
- Tracking the location of all items if stored in a variety of locations.
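As an illustration of recording version and status in a file name, the following is a minimal sketch. The naming pattern `<base>_v<major>.<minor>_<status>.<ext>` and the function names are assumptions made for this example, not a convention prescribed by any archive; only the status labels come from the list above.

```python
import re
from typing import NamedTuple

# Status labels from the list above; the file name pattern itself is hypothetical.
STATUSES = ("draft", "interim", "final", "internal")


class VersionedName(NamedTuple):
    base: str
    major: int
    minor: int
    status: str
    ext: str


_PATTERN = re.compile(
    r"^(?P<base>[A-Za-z0-9-]+)_v(?P<major>\d+)\.(?P<minor>\d+)"
    r"_(?P<status>[a-z]+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_name(base, major, minor, status, ext):
    """Compose a file name that carries its version number and status."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return f"{base}_v{major}.{minor}_{status}.{ext}"


def parse_name(filename):
    """Split a file name built by build_name back into its parts."""
    match = _PATTERN.match(filename)
    if match is None or match["status"] not in STATUSES:
        raise ValueError(f"not a versioned file name: {filename!r}")
    return VersionedName(
        match["base"], int(match["major"]), int(match["minor"]),
        match["status"], match["ext"],
    )


def latest(filenames):
    """Return the name with the highest (major, minor) version number."""
    return max(filenames, key=lambda name: parse_name(name)[1:3])
```

Version numbers are compared as integers rather than as text, so that version 1.10 sorts after version 1.9.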
Storage
Consider whether data are best stored as raw data files or as statistical system files. The Digital Preservation Coalition (DPC) provides details of good practice regarding data storage.
File Structures
Decide whether file structures will be flat or hierarchical, and document the decision.
Back-ups
A good back-up procedure will protect against a range of mishaps such as accidental changes to data, accidental deletion of data, loss of data due to media or software faults, virus infections, hackers, or catastrophic events such as fire or flood. Issues that should be given specific attention are:
- Frequency of back-up - back up at appropriate intervals.
- Rolling back-up copies - do not automatically overwrite old back-ups with new.
- Offsite back-up copies - ideally at least one back-up copy should be stored off site.
- Institutional back-up policy - be aware of what this is, and maintain an independent back-up of critical files.
- Validation of back-up copies, including checksums, and storing the checksum results.
- Choosing suitably robust and reliable back-up media - typically tape or CD-R.
- Refreshing back-up media - regularly replacing old media.
- Storage conditions for physical media - follow manufacturer's recommendations.
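The validation step in the list above can be sketched as follows. This is one possible approach, assuming SHA-256 as the checksum algorithm and a JSON file as the place where checksum results are stored; the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(files, manifest_path):
    """Store the checksum of every file so that back-ups can be validated later."""
    manifest = {str(path): sha256_of(path) for path in files}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


def verify_manifest(manifest_path):
    """Return the files that are missing or whose checksum no longer matches."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    failed = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            failed.append(name)
    return failed
```

Running `verify_manifest` against a back-up copy after each back-up, and again before relying on an old copy, gives an empty list when every file is intact.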
Security
As digital data can be copied, altered or deleted very easily, it is important to demonstrate the authenticity of data, and to prevent unauthorised access to them, for ethical, legal and quality reasons. Keeping a master file, that is, a formalised and checked master copy of the data (and other materials), is critical. Copies may also be preserved at certain stages of development; these are distinct from temporary working versions of data and other files. To protect master files:
- Assign responsibility for master files to individual members of the project team.
- Restrict write access to master versions to specific members of the project team.
- Create a formal procedure for destruction of master files.
- Record changes to master files.
- Maintain old master files (in case later ones contain errors).
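Two of the points above, restricting write access and recording changes, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and the tab-separated log layout are hypothetical, and clearing permission bits only guards against accidental edits. Real access control depends on the operating system or file server in use.

```python
import os
import stat
from datetime import datetime, timezone


def make_read_only(path):
    """Clear every write permission bit on a master file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def record_change(log_path, master_file, author, description):
    """Append one tab-separated line describing a change to a master file."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{timestamp}\t{master_file}\t{author}\t{description}\n")
```

The log is opened in append mode so that earlier entries are never overwritten.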
Format Conversion
Statistical, spreadsheet and database packages all have slightly different data handling limits, which may also differ from the limits imposed by computer-assisted interviewing (CAI) software such as Blaise.
A summary of the limits of the three most widely used statistical packages (SPSS, Stata and SAS) can be found on UCLA's Stata pages. Because these limits differ, some data or internal metadata (missing value definitions, variable labels, etc.) will inevitably be lost upon translation. Data should be exported by data managers or other team members familiar with the data, who should check for errors or inadvertent changes that may occur in the export process. Contact the individual CESSDA data archive with queries regarding format conversion.
Whatever method of conversion is chosen, whether the 'export' function of the package or third-party data translation software such as Stat/Transfer, the results should be extensively tested. The same tested method should then be used throughout the data conversion process, since new or different methods may introduce errors. Back-ups of master copies should always be in formats that are suitable for long-term digital preservation by the data centre or archive that will subsequently preserve the data. This typically means open as opposed to proprietary formats. Data formats preferred by data archives and centres for data management, submission of data and preservation are summarised in the table below.
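One simple way to test a conversion is a round trip: export the data, read the export back, and list every value that did not survive. The sketch below uses in-memory rows and Python's `csv` module as a stand-in for a real package export; the function names are hypothetical and no statistical package API is implied.

```python
import csv
import io


def export_rows(rows, fieldnames, delimiter=","):
    """Write rows (a list of dicts) to delimited text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def import_rows(text, delimiter=","):
    """Read delimited text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter))


def round_trip_differences(rows, fieldnames, delimiter=","):
    """Return (row_index, field, original, recovered) for every changed value."""
    recovered = import_rows(export_rows(rows, fieldnames, delimiter), delimiter)
    differences = []
    if len(recovered) != len(rows):
        differences.append((None, "row count", len(rows), len(recovered)))
    for index, (before, after) in enumerate(zip(rows, recovered)):
        for field in fieldnames:
            # Delimited text holds only strings, so compare string forms.
            if str(before[field]) != after[field]:
                differences.append((index, field, before[field], after[field]))
    return differences
```

In this sketch a missing value stored as `None` comes back as an empty string and is reported as a difference, which is the kind of metadata loss the export check is meant to catch.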
All text should be encoded as ASCII or Unicode. When data may contain non-ASCII characters (accented letters as well as non-Latin scripts), they should always be encoded as Unicode. Newer versions of software are likely to use Unicode by default. Note that XML requires the use of Unicode. All recommended formats are subject to change over time as new archival and interchange formats are developed. In particular, an XML schema for statistical datasets (tabular data with extensive metadata) would be extremely useful. The closest development so far is the Triple-S data model, though it is still not able to store all the internal metadata and variable format information of a typical SPSS, SAS or Stata file. Recent versions of SAS, SPSS and Stata have their own XML data models, which may become useful intermediate formats for conversion (via XSLT) to a new common XML standard.
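The encoding rule can be checked mechanically. The sketch below assumes UTF-8 as the Unicode encoding, which is one common choice rather than a requirement stated here; the function names are hypothetical.

```python
def non_ascii_characters(text):
    """Return the distinct characters outside the ASCII range, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})


def choose_encoding(text):
    """Pick 'ascii' when every character fits, otherwise 'utf-8' (assumed here)."""
    return "ascii" if text.isascii() else "utf-8"


def write_text(path, text):
    """Write text with the chosen encoding and return the encoding used."""
    encoding = choose_encoding(text)
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)
    return encoding
```

Recording the returned encoding alongside the file documents the "given character set" that the format table asks for.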
Preferred Data Formats
| Type of data | Preferred format for deposit | Acceptable formats for deposit |
|---|---|---|
| Tabular data with minimal metadata i.e. a matrix of data with or without column headings/variable names, but no other metadata | Delimited text of given character set, with SQL data definition statements where appropriate | Delimited text of given character set, with SQL data definition statements where appropriate. Delimiters such as commas or tabs (or pipe) are most commonly used, and most widely recognised by import 'wizards'. Only characters not present in the data should be used as delimiters, or if unavoidable, data should be surrounded by inverted commas to distinguish between delimiters and characters in the data. Widely-used proprietary formats e.g. Excel, Access, dBase, are acceptable but offer less long-term security. |
| Tabular data with extensive metadata e.g. a survey dataset with variable labels, code labels, and missing values, in addition to the matrix of data | Delimited text and command file (SPSS, Stata, SAS, etc.); other structured text/markup file containing metadata information, e.g. DDI XML file | SPSS portable (.por) or delimited text and command file (SPSS, Stata, SAS, etc.) containing metadata information. Binary formats of statistical packages (SPSS, Stata, SAS, etc.) are acceptable, but offer less long-term security. It may not be possible to accept very old formats created with certain versions of software. |
| GIS and CAD data e.g. vector and raster | Arcinfo export format (.e00) for vector data; MapInfo Interchange Format (MIF) for vector data; TIFF (version 6) for raster data; DXF or SVG for CAD data; GIS attribute data as per 'tabular data with minimal metadata' | Arcinfo export format (.e00) for vector data; MapInfo Interchange Format (MIF) for vector data; TIFF for raster data; Adobe Illustrator, DXF or SVG for CAD data; GIS attribute data as per 'tabular data with minimal metadata'. Binary formats of GIS and CAD packages may be acceptable. |
| Qualitative (textual) data | XML marked-up text according to an appropriate DTD or schema; RTF | Plain text; RTF or HTML. Software-specific formats such as NUD*IST, NVivo and ATLAS.ti may be acceptable, but offer less long-term security. |
| Digital audio data | MS Waveform; Audio Interchange File Format (.aiff) | Microsoft Waveform; MPEG-1 Audio Layer 3 (MP3) |
| Digital video data | MPEG-2; JPEG 2000 | MPEG-2; JPEG 2000 |
| Digital image data | TIFF (version 6) | TIFF (most formats, though CCITT Group 4 is generally considered to be the most straightforward); PDF or PDF/A |
| Documentation | Plain text; PDF, RTF, HTML; XML marked-up text according to an appropriate DTD or schema, e.g. XHTML 1.0 | PDF, RTF, HTML |
(Source: ESDS Data Formats and Software web page, www.esds.ac.uk/aandp/create/data.asp [17.9.2008])
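The delimiter guidance in the table (use a character that does not occur in the data, or otherwise surround values with inverted commas) can be sketched as follows. The candidate list mirrors the delimiters named in the table; the function names and the fallback to a quoted comma-separated file are assumptions made for this example.

```python
import csv
import io

# Comma, tab and pipe, the delimiters named in the table above.
CANDIDATES = (",", "\t", "|")


def pick_delimiter(rows, candidates=CANDIDATES):
    """Return the first candidate that appears in no value, or None."""
    values = [str(value) for row in rows for value in row]
    for delimiter in candidates:
        if not any(delimiter in value for value in values):
            return delimiter
    return None


def to_delimited(rows):
    """Write rows as delimited text, quoting every value if no delimiter is free."""
    delimiter = pick_delimiter(rows)
    buffer = io.StringIO(newline="")
    if delimiter is None:
        # Unavoidable clash: fall back to commas and quote all values.
        writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_ALL)
    else:
        writer = csv.writer(buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buffer.getvalue()
```

Whichever delimiter is chosen should be recorded in the documentation deposited with the data.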