Creating a Dataset
Below is a summary of best practices for creating data files. Following these suggestions from the very beginning will help keep your dataset clean and easy to use, and will make analysis, documentation, and later archiving easier.
Error Detection and Correction
It is important to identify and correct the errors that inevitably creep in during the initial stages of dataset creation. Suggestions for dealing with them:
- Use a data-entry programme which catches typing errors
- Use a double-entry system
- Introduce random quality control checks
- Separate the coding and data-entry tasks
- Centralise particularly complex tasks, such as occupation coding, with trained staff
- Use computer coding when possible
- Check wild codes and out-of-range values
- Check for consistency across variables
- Undertake record matches and counts
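Several of the checks above can be automated. The following is a minimal sketch in Python, assuming records are held as dictionaries. The field names (`id`, `age`, `sex`, `marital`), the valid code sets, the age range, and the consistency rule are all hypothetical and would come from your own codebook.

```python
# Hypothetical codebook: valid codes and ranges for each variable (assumed for illustration).
VALID_CODES = {"sex": {1, 2, 9}, "marital": {1, 2, 3, 4, 9}}
VALID_RANGES = {"age": (0, 120)}


def compare_double_entry(first_pass, second_pass):
    """Return (record id, variable) pairs where two independent keyings disagree."""
    second_by_id = {rec["id"]: rec for rec in second_pass}
    mismatches = []
    for rec in first_pass:
        other = second_by_id.get(rec["id"])
        if other is None:
            mismatches.append((rec["id"], "<missing in second pass>"))
            continue
        for var, value in rec.items():
            if other.get(var) != value:
                mismatches.append((rec["id"], var))
    return mismatches


def find_wild_codes(records):
    """Return (record id, variable, value) for wild codes and out-of-range values."""
    problems = []
    for rec in records:
        for var, codes in VALID_CODES.items():
            if rec[var] not in codes:
                problems.append((rec["id"], var, rec[var]))
        for var, (low, high) in VALID_RANGES.items():
            if not low <= rec[var] <= high:
                problems.append((rec["id"], var, rec[var]))
    return problems


def find_inconsistencies(records):
    """Assumed cross-variable rule: flag respondents under 15 coded as married (1)."""
    return [rec["id"] for rec in records if rec["age"] < 15 and rec["marital"] == 1]


# Two independent keyings of the same three questionnaires (made-up data).
first = [
    {"id": 1, "age": 34, "sex": 1, "marital": 1},
    {"id": 2, "age": 12, "sex": 2, "marital": 1},
    {"id": 3, "age": 150, "sex": 3, "marital": 2},
]
second = [
    {"id": 1, "age": 43, "sex": 1, "marital": 1},
    {"id": 2, "age": 12, "sex": 2, "marital": 1},
    {"id": 3, "age": 150, "sex": 3, "marital": 2},
]

double_entry_mismatches = compare_double_entry(first, second)
wild_codes = find_wild_codes(first)
inconsistent_ids = find_inconsistencies(first)
record_counts_match = len(first) == len(second)
```

Flagged records should be checked against the original questionnaires rather than corrected automatically. Here both keyings agree on the impossible age of 150, so only the range check catches it, which is why the checks complement one another.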
Variable Names
- Devise standard variable names, and decide on a consistent method of constructing them and on a maximum length.
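A naming convention is easier to keep when it is checked mechanically. The sketch below assumes one possible convention (a lowercase letter first, then lowercase letters, digits, or underscores, and at most 8 characters). Both the pattern and the length limit are illustrative choices, not a requirement of any particular package.

```python
import re

MAX_LENGTH = 8  # assumed limit, for illustration only
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")  # assumed convention


def check_variable_name(name):
    """Return a list of convention violations for one variable name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("too long")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("invalid characters")
    return problems


# Hypothetical variable names to audit.
names = ["q12_age", "Q12Income", "hh_size", "2nd_job"]
violations = {n: check_variable_name(n) for n in names if check_variable_name(n)}
```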
Variable Labels
- Devise labels that include the item or question number, an indication of the variable's content, and whether the variable was constructed or derived from other items.
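One way to keep labels uniform is to build them from their parts. This is a minimal sketch: the helper `build_label`, the label layout, and the variable names are hypothetical.

```python
def build_label(question, content, derived_from=None):
    """Compose a label from question number, content, and provenance if derived."""
    label = f"{question}: {content}"
    if derived_from:
        label += " (derived from " + ", ".join(derived_from) + ")"
    return label


# Hypothetical variables: one asked directly, one derived from it.
labels = {
    "q12_age": build_label("Q12", "Age at last birthday"),
    "agegrp": build_label("Q12", "Age group, 10-year bands", derived_from=["q12_age"]),
}
```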
Variable Groups
- Variable groups and corresponding variable group lists in the codebook are an effective way of organising a dataset and are especially recommended if a collection contains a large number of variables.
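As a rough illustration, variable groups can be kept as a simple mapping from group name to member variables, which also makes it easy to spot variables that have not yet been assigned to any group. The group names and variable names below are hypothetical.

```python
# Hypothetical variable groups as they might appear in a codebook.
variable_groups = {
    "Demographics": ["q12_age", "q13_sex"],
    "Employment": ["q20_occ", "q21_hours"],
}

# Hypothetical list of all variables present in the data file.
dataset_variables = ["q12_age", "q13_sex", "q20_occ", "q21_hours", "q30_inc"]

# Collect every grouped variable, then list the ones left without a group.
grouped = {var for members in variable_groups.values() for var in members}
ungrouped = [var for var in dataset_variables if var not in grouped]
```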
Codes and Coding
- Codes should ensure that all statistical software packages can handle the data, and should promote greater measurement comparability. Principles for most coding situations are available in various guides (see, for example, the ICPSR Guide, pp. 8-10).
Missing Data
- Plan how missing data will be identified and handled at the very start, so that imputation and other missing-data handling remain possible during analysis.
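One common approach is to reserve distinct codes for each reason a value is missing, so that they can be excluded from statistics yet still be counted by reason. The sketch below assumes a hypothetical scheme (-7 refused, -8 don't know, -9 not applicable). The codes, the `income` values, and the helper `split_missing` are illustrative only.

```python
# Assumed missing-value codes for a hypothetical numeric variable.
MISSING_CODES = {-7: "refused", -8: "don't know", -9: "not applicable"}


def split_missing(values):
    """Separate valid values from missing codes and count missing values by reason."""
    valid = [v for v in values if v not in MISSING_CODES]
    missing_counts = {}
    for v in values:
        if v in MISSING_CODES:
            reason = MISSING_CODES[v]
            missing_counts[reason] = missing_counts.get(reason, 0) + 1
    return valid, missing_counts


income = [2500, -7, 3100, -9, -7, 1800]  # made-up data
valid_income, missing_summary = split_missing(income)

# The mean is computed over valid values only; treating -7 or -9 as real
# incomes would distort it badly.
mean_income = sum(valid_income) / len(valid_income)
```

Keeping the reason for missingness, rather than collapsing everything to a blank, preserves information that later imputation may depend on.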