Documentation
Preparing high-quality technical documentation, or a codebook, can be a time-consuming task. The ideal elements of clear and comprehensive technical documentation are outlined below. Details are available in the UK Data Archive web pages Documenting your data or the ICPSR Guide (pp. 13-16). Other data archives have published similar guides in their national languages.
Document Using Structured Formats
Many CESSDA archives now encourage users to generate documentation that is "marked up" according to the Data Documentation Initiative (DDI) metadata specification, an emerging international standard for the content, presentation, transport, and preservation of documentation about datasets in the social and behavioural sciences. This specification allows the mark-up of documentation elements and facilitates the creation of coherent and comprehensive technical documentation of a dataset. Using this system means that all the information the analyst needs is available in one document, which can be used to produce other products (such as set-up files). The files are amenable to web display and navigation, and because the documentation is prepared from the outset, deposit in the archive or data centre is possible immediately after data collection is complete.
More information on DDI and a list of tools and other XML resources is available at the DDI pages, at the UKDA metadata page and at your local or national archive.
If a project cannot produce documentation in DDI format, the best alternative is a uniform, structured format with integrated question text, as this enables the archive to convert the files to XML format easily. In most cases, CESSDA archives produce the DDI records from information provided by the data depositor.
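As an illustration of what a uniform, structured format with integrated question text might look like, the following minimal Python sketch writes one variable description to XML. The element names (codeBook, dataDscr, var, labl, qstn, qstnLit, catgry, catValu) are loosely modelled on the DDI Codebook vocabulary, but the output is not validated against any DDI schema, and the variable, question wording and category labels are hypothetical. Your archive can advise on the exact format it prefers.

```python
import xml.etree.ElementTree as ET


def build_variable_record(name, label, question_text, categories):
    """Return an XML element describing one variable.

    The structure only imitates DDI-style mark-up; it is an
    illustrative assumption, not a schema-valid DDI record.
    """
    var = ET.Element("var", attrib={"name": name})
    ET.SubElement(var, "labl").text = label
    # Keep the literal question text next to the variable it produced.
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question_text
    # One <catgry> per code, holding the stored value and its label.
    for value, value_label in categories:
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(value)
        ET.SubElement(catgry, "labl").text = value_label
    return var


def build_codebook(variables):
    """Wrap a list of variable descriptions in a single document root."""
    root = ET.Element("codeBook")
    data_dscr = ET.SubElement(root, "dataDscr")
    for variable in variables:
        data_dscr.append(build_variable_record(**variable))
    return root


# Hypothetical example variable, for illustration only.
variables = [
    {
        "name": "sex",
        "label": "Sex of respondent",
        "question_text": "Are you male or female?",
        "categories": [(1, "Male"), (2, "Female")],
    },
]

codebook = build_codebook(variables)
xml_text = ET.tostring(codebook, encoding="unicode")
print(xml_text)
```

Keeping the variable name, label, question text and codes together in one record is the point of the exercise: the same source can then be used to produce a printed codebook, a web display or set-up files.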
Minimum Level of Information
Good documentation is an essential part of any dataset, and a minimum level of information is required to make the data suitable for sharing with other researchers. Three types of information must be provided: explanatory, contextual and cataloguing information.
Explanatory Material
Explanatory material is essential for informed use of a dataset. Much of this information is likely to be available from reports, working papers and other publications.
- Information about the data collection methods. This should describe the data collection process, whether a survey, collection of administrative information, transcription of a document source, experimental model etc., as well as the data collection method. Details on how the methods were developed are also useful. If applicable, details of the sampling design, sampling frame and sampling methods should be included. It is also extremely useful to include information on any pilot research or monitoring processes undertaken during the main data collection exercise, and details of any other quality controls used during the data creation stage. Details of the geographic and temporal coverage of the dataset, together with a record of the software and operating system on which the files were generated, the medium on which the data are stored and a complete list of all data files that make up the dataset should also be included.
- Information about the structure of the dataset. This should include information about the relationships between individual files and records within a dataset. It should also include the number of cases and variables in each numeric data file and the number of files comprising the dataset:
- For relational databases, a diagram showing the structure and the relations between the records and elements of the dataset should be constructed (i.e. an entity-relationship diagram).
- For qualitative data the relationship between interviews and observations etc., should be made clear.
- For mixed methods data comprising e.g. a survey and in-depth interviews, any linkages should be documented.
- Technical information.
- Variables and values, coding and classification schemes. For numeric data, this should comprise a complete variable list that describes all the variables (or fields) in the dataset and full details of all coding and classifications used. It is helpful to identify variables to which standard coding classifications (e.g. SIC and SOC codes) apply and to record the version of the classification scheme used, preferably with a bibliographic reference to that code. For qualitative data, a data list should be compiled that details background information for each interview/observation and so on.
- Information about derived variables. Deriving new variables from original data may range from simple operations, such as grouping interval age data into bands of years, to more complex derivations using algorithms. Whatever the method employed, it is important that the logic of each derivation is made clear. For simple groupings, e.g. grouped age, variable and value labels can be used to explain them. For complex derivations, the best approach is to provide the logical statements that created the derived variables, e.g. the SPSS or Stata command files. If the command language is not likely to be understood by end users, each derivation should also be described using simple logical statements, and these should be sent in addition to the actual command files.
- Weighting and grossing. Weighting and grossing variables must be fully documented, explaining the construction of the variables with a clear indication of the circumstances in which they should be used.
- Data source. Details of the source from which the data are derived should be well documented. If data have been summarised from records or archived sources, the provenance of the materials should be made clear.
- Confidentiality and anonymisation. It is important to record if the data contain any confidential information concerning individuals, households, organisations or institutions. Such information should normally be removed or anonymised prior to submission of the dataset. Where secondary analysis requires confidential or otherwise sensitive information to remain in the dataset, agreements about any special access conditions to end users should be discussed with the archive or data service. See the Rights & Confidentiality section.
- Validation and other checks. This should comprise details of any known data errors and any data checking and cleaning performed as part of the data collection checks. For qualitative data, it is useful to know whether transcriptions have been proofread or quality controlled. For data gathered by scientific instruments, the terms validation and verification have slightly different meanings. Validation refers to checking for equipment as well as transcription errors, while verification is the checking of the truth of the record by an expert or by taking multiple samples.
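The advice above on derived variables can be illustrated with a short Python sketch, in which a simple grouped-age variable is derived and the same band definitions are used to generate the plain-language description of the derivation. The variable names (age, agegrp), the age bands and the missing-value code are all hypothetical, and ages are assumed to be recorded in whole years.

```python
# Hypothetical age bands: (code, lower bound, upper bound, value label).
# An upper bound of None means the band is open-ended.
AGE_BANDS = [
    (1, 16, 24, "16-24"),
    (2, 25, 44, "25-44"),
    (3, 45, 64, "45-64"),
    (4, 65, None, "65 and over"),
]
MISSING_CODE = -9  # assumed code for ages that are missing or out of range


def derive_age_group(age):
    """Return the band code for an age in whole years, or MISSING_CODE."""
    if age is None:
        return MISSING_CODE
    for code, low, high, _label in AGE_BANDS:
        if age >= low and (high is None or age <= high):
            return code
    return MISSING_CODE


def describe_derivation():
    """Return simple logical statements documenting how agegrp is derived."""
    lines = ["agegrp: derived from age (age in whole years)"]
    for code, low, high, label in AGE_BANDS:
        if high is None:
            rule = f"age >= {low}"
        else:
            rule = f"{low} <= age <= {high}"
        lines.append(f"  {code} = '{label}' if {rule}")
    lines.append(f"  {MISSING_CODE} = 'Not available' otherwise")
    return "\n".join(lines)


print(describe_derivation())
```

Because the description is generated from the same table that drives the derivation, the documentation cannot drift out of step with the code. The same idea applies when the derivation is written in SPSS or Stata syntax: deposit the command file together with an equivalent plain-language statement of each rule.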
Contextual Information
Information about the context in which the data were collected and information about the uses to which the data were put:
- Description of the originating project. This should comprise details of the history of the project or process that gave rise to the dataset, in terms of the intellectual, financial and organisational origins and developments over time including, for example, the aims and objectives of the undertaking, publications arising, policy developments to which it contributed and any other relevant contextual information.
- Provenance of the dataset
- Serial and time series datasets, new editions. For ongoing projects such as repeated cross-sectional surveys, panel or time series datasets, additional information describing any changes in the variable content, question text, variable labelling or sampling procedures, is enormously helpful.
Cataloguing Information
Cataloguing information allows the archive to create a formal catalogue record, or study description, for the study. The study description serves two purposes: first, it is a bibliographic record of the dataset for proper acknowledgment and citation, and second, it is the principal instrument used for resource discovery. Most archives create a formal catalogue record for every archived study.
The catalogue record includes information such as the title of the dataset, principal investigator, sponsors, data collectors, dates of data collection, temporal and geographic coverage, methods of data collection, and sampling design and frames. This information is often gathered by the archive at the time of deposit through the use of a special form prepared for this purpose. For a sample form, see the ESDS Deposit Data web page. For further discussion of the elements of information required for a structured catalogue record, or study description, see the UKDA metadata page.
In addition, you should provide:
- A copy of the data collection instruments or questionnaire if available, annotated with variable names and including any derived variables, a copy of any interviewer's instructions, and a flowchart of the data collection instrument. For complex questionnaires it is sometimes useful to produce a graphic guide to the data, showing which respondents were asked which questions and how various items link to each other. This is particularly useful when no hardcopy questionnaire is available.
- Index or table of contents. For large datasets, this is essential. An alphabetised list of variables, with page references to the detailed information on each variable, is also extremely helpful.
- List of abbreviations and other conventions. Both variable names and variable labels will contain abbreviations. Ideally, these should be standardised.
- Recode logic. It is important to provide an audit trail of the steps involved in creating recoded variables. This information is sometimes provided in a separate document.
- Coding instrument. Rules and definitions used for coding the data are helpful to data analysts.
- Any reports or publications that provide additional information.
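The alphabetised variable list mentioned above can be produced mechanically if the variable descriptions are already held in a structured form. The following Python sketch assumes each variable is a dictionary with hypothetical keys name, label and page; the example entries are invented for illustration.

```python
def build_variable_index(variables):
    """Return index lines sorted alphabetically by variable name.

    Sorting ignores case so that 'Age' and 'age' sort together.
    """
    ordered = sorted(variables, key=lambda v: v["name"].lower())
    # Pad names to a common width so labels line up in plain text.
    width = max((len(v["name"]) for v in ordered), default=0)
    return [
        f"{v['name']:<{width}}  {v['label']}  (p. {v['page']})"
        for v in ordered
    ]


# Hypothetical variable descriptions, for illustration only.
variables = [
    {"name": "wt1", "label": "Cross-sectional weight", "page": 12},
    {"name": "Age", "label": "Age in whole years", "page": 3},
    {"name": "region", "label": "Region of residence", "page": 7},
]

index_lines = build_variable_index(variables)
print("\n".join(index_lines))
```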
Qualitative Data Documentation
Documentation requirements for qualitative data are essentially similar to those outlined above, but particulars will vary. The essential items of information are noted below. For more information, see the ESDS Qualidata web page or contact your local archive.
Documentation of qualitative files provides a context for the data and the research investigation and, if detailed enough, can make the raw qualitative data more usable by secondary analysts who were not directly involved in the data collection.
Common examples of documentation include:
- Original grant application, or initial outline of data collection plans
- For academic projects, an end of award report
- Description of methodology
- Interview schedule(s)/topic guide
- Questionnaire
- Observation checklist
- Interviewer instructions/prompt cards
- Communication with informants relating to confidentiality
- Written consent forms
- Matrices
- Tree diagrams
- Information on equipment used (e.g. recording equipment)
- Other background information
- Details of missing information
- Correspondence
- Speaker markers in text, typically associated with internal metadata; question or thematic markers in text; cross-reference of text to audio material
- References to publications and reports based on the study
In most cases, depositors are asked to provide as much documentation and metadata as possible, although it is recognised that this can vary widely from data collection to data collection. The data archive or service will use this information to provide a user guide or standardised documentation to assist the secondary user.