Research Data Management

Depositing research data in a data archive or data library ensures the future availability of the data for secondary analysis, both to the principal investigator(s) and to other user communities.

Why archive research data

Original research data represent a valuable resource in terms of both human and monetary investment, which ought to be preserved and, as appropriate, reutilized: for validation of research through replication, for secondary analysis testing other hypotheses or using statistical techniques not available at the time of the original research, and for teaching purposes. Deposit of data is especially important when time-sensitive data are at issue: the opinions collected from respondents in the past can never be replicated once lost or destroyed. (Laine Ruus, Data Librarian Emeritus, University of Toronto)

The Social Sciences and Humanities Research Council of Canada (SSHRC) Research Data Archiving Policy has been in place, with minor variations, since about 1981. It reads:

The purpose of this policy is to facilitate the advancement of knowledge in the social sciences and humanities by encouraging researchers to share research data. Sharing data strengthens our collective capacity to meet academic standards of openness by providing opportunities to further analyze, replicate, verify and refine research findings. Such opportunities enhance progress within fields of research as well as support the expansion of inter-disciplinary research. In addition, greater availability of research data will contribute to improved training for graduate and undergraduate students, and, through the secondary analysis of existing data, make possible significant economies of scale. Finally, researchers whose work is publicly funded have a special obligation to openness and accountability.

All research data collected with the use of SSHRC funds must be preserved and made available for use by others within a reasonable period of time. SSHRC considers "a reasonable period" to be within two years of the completion of the research project for which the data was collected. Costs associated with preparing research data for deposit are considered eligible expenses in SSHRC research grant programs. Research data includes quantitative social, political and economic data sets; qualitative information in digital format; experimental research data; still and moving image and sound data bases; and other digital objects used for analytical purposes.

Conditions of deposit

Depositors are asked to sign a release agreement, specifying:

  • a definitive, unique title by which the data file will be known;
  • the date after which the data file may be made available to other users;
  • the categories of users who may access the data file;
  • the categories of users or institutions or organizations to whom copies of the data file may be disseminated;
  • whether or not the data files may in turn be redisseminated by third party recipients;
  • any special access, dissemination or redissemination, or other conditions.

The Map & Data Library in return accepts responsibility for:

  • adhering to the terms and conditions of access, dissemination, and redissemination as defined in the release agreement signed by the principal investigator(s);
  • ensuring that all data files available for secondary analysis are appropriately anonymized and have appropriate metadata;
  • maintaining copies of the data file and all relevant metadata, consistent with good archiving standards and practices;
  • providing user services to other researchers using the data file;
  • publicizing, as appropriate, the availability of the data file for secondary analysis;
  • disseminating copies of the data file to researchers elsewhere, as permitted in the release agreement and consistent with good archiving standards and practices.

In addition, the Map & Data Library may undertake to clean the data file(s) if necessary, and compile standardized computer-readable metadata, or assist the principal investigator(s) in compiling such documentation.

Data file and documentation guidelines for social science data

The following should be regarded as guidelines for the formatting and documentation of computer-readable data files, rather than as absolute and rigid standards.  Staff at the Map & Data Library can provide advice and assistance as necessary.

Metadata should:

  • consist of a codebook or user guide which must match the data it purports to describe;
  • include a study description, with the following elements:
    • the title of the data file, which should be brief, distinctive, and descriptive of the data file, but which should not consist of a number or date (nor the filename);
    • a statement of the purpose or objectives of the study, or the hypothesis being tested;
    • acknowledgement of the persons and/or institutions responsible for the intellectual content, the design and execution of the project, the data collection, coding, and cleaning, the documentation, and the financial support;
    • a detailed description of the universe and sampling frame, as appropriate;
    • a description of the data collection methodology;
    • fieldwork observations, if applicable;
    • completion or response rates, if applicable;
    • weighting procedures, as applicable;
    • a description of post-processing, i.e. cleaning, recoding, reformatting, anonymization, etc. done in the course of data preparation;
  • include instructions to interviewers or other data collection staff, as appropriate;
  • include a copy of the original questionnaire, or data collection instrument (e.g. a text file containing the CATI/CAPI script, with skip patterns, etc.), as appropriate;
  • in the case of numeric data, include for each variable:
    • the position, size, and format of each variable, i.e. record and column locations, number of decimals, presence or absence of a sign, and whether numeric, alphabetic, packed decimal, binary, etc.;
    • a detailed description of the variable, including the actual question text (including interviewer instructions, skip patterns, etc.);
    • a specification of the source variables and algorithm used to generate derived variables, where appropriate;
    • variable specific notes as to sources of the data, if taken from other sources (published sources, data files, etc.);
    • a list and description of all code categories, where applicable;
    • variable-specific notes regarding recoding, reformatting, etc.;
  • include a bibliography of printed sources, software, etc., on which the data file is based (as applicable), e.g. the published text edition used in the creation of a text file, or the sources of standard indices;
  • in the case of text files, include a description of all special coding to denote parts of speech, breathing, structure of the original text, non-Latin characters, etc.

Data files should:

  • be anonymized consistent with accepted contemporary standards, to preserve respondent privacy and confidentiality;
  • consist of 'raw' unaggregated data, as close to the unit of observation used for data collection as possible, i.e. survey data should consist of microdata rather than aggregated tabular cell counts, and text files should consist of the full text, rather than (or in addition to) the output products of a concordance program, word frequency list, etc.;
  • consist of flat, character ASCII data, rather than files which depend for their use on a particular software or statistical package, i.e. preferably a flat ASCII data file with SAS or SPSS control commands, rather than a SAS for Windows system file. Data Library Service staff can perform conversions from some common statistical software formats, where necessary;
  • not be multipunched;
  • be in character format, rather than packed decimal, column binary, or other special format, where feasible;
  • not contain pluses ('+'), minuses ('-'), or blanks as valid codes;
  • where appropriate, use numeric rather than alphabetic codes;
  • have the sign of signed interval scale variables in the left-most position of the field;
  • have all multi-column numeric variables right-justified in the column range;
  • have all ordinal and interval variables (e.g. Likert scales) consistently coded in the same direction;
  • have consistent missing data codes throughout;
  • have a unique respondent number assigned as part of each physical record;
  • in the case of multiple physical records per case (logical record), have a sequential record number coded on each record.
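
Several of the guidelines above can be checked mechanically before deposit. The sketch below, in Python, tests a few of them on a flat fixed-column file: ASCII-only characters, no blank fields, right-justified numeric fields, a sign (if any) in the left-most position of the field, and a unique respondent number. It is an illustration under stated assumptions, not a tool used by the Map & Data Library: the layout, the field names, and the sample records are hypothetical, every field is assumed to be numeric, and each case is assumed to occupy one physical record.

```python
import re

# Hypothetical layout: variable name -> (first column counted from 1, width).
LAYOUT = {"RESPID": (1, 4), "Q1": (5, 1), "BALANCE": (6, 5)}

# A valid numeric field is either a sign in the left-most position followed
# only by digits, or digits right-justified with optional leading blanks.
NUMERIC = re.compile(r"(?:[+-]\d+| *\d+)")


def check_records(lines, layout, id_field="RESPID"):
    """Return a list of problems found in fixed-column records."""
    problems = []
    seen_ids = set()
    record_length = max(start + width - 1 for start, width in layout.values())
    for lineno, line in enumerate(lines, start=1):
        record = line.rstrip("\n")
        if not record.isascii():
            problems.append(f"record {lineno}: non-ASCII character")
        if len(record) < record_length:
            problems.append(f"record {lineno}: shorter than {record_length} columns")
            continue
        for name, (start, width) in layout.items():
            value = record[start - 1:start - 1 + width]
            if not NUMERIC.fullmatch(value):
                problems.append(
                    f"record {lineno}: {name}={value!r} is blank, "
                    "not right-justified, or has a misplaced sign"
                )
        start, width = layout[id_field]
        respondent = record[start - 1:start - 1 + width]
        if respondent in seen_ids:
            problems.append(f"record {lineno}: duplicate {id_field} {respondent!r}")
        seen_ids.add(respondent)
    return problems


good = ["00011-0042", "00025 0310"]
bad = ["00011-0042", "0001 42   "]  # blank Q1, left-justified BALANCE, repeated RESPID

print(check_records(good, LAYOUT))
for problem in check_records(bad, LAYOUT):
    print(problem)
```

Guidelines that depend on meaning rather than layout, such as consistent missing data codes or scales coded in the same direction, need the codebook and a human reviewer.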

For additional information on data management practices and guidelines, see also:

Create and Manage Data from the UK Data Archive

ICPSR's guide for Data Management

JISC's Managing Research Data

MIT's Data Management and Publishing guide

Please contact the Map & Data Library for consultation on research data management practices.