Skip to Main Content

Formatting Guidelines for data sent to YCAS

How to Prepare Data for Submission to the Yale Center for Analytical Sciences (YCAS)

I. De-identified data, if indicated

To ensure compliance with HIPAA’s Privacy Rule, data you provide to YCAS should be de-identified; i.e., it should not include protected health information (PHI) unless ycas staff is listed on the protocol or the protocol generically states that statisticians or study personnel will be able to see the data. Per HIPAA at Yale:

Information is de-identified if none of the following 18 types of identifiers is contained in the information and if no one accessing the information has actual knowledge that the information could be used – alone or in combination with other information – to identify any individual who is the subject of the information. Note that this does not prohibit coding records so that they may later be re-identified, so long as the code does not contain information about the subject of the information (for example, the code may not be a derivative of the individual’s Social Security Number) and is not used or disclosed for any other purpose, and so long as the re-linking mechanism (e.g., the subject log or coding algorithm) is not disclosed to any persons or organizations outside the [institution].

  1. Names
  2. All geographic subdivisions smaller than a State, including: - street address - city - county - precinct - zip codes and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly-available data from the Bureau of the Census: (1) the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people, and (2) the initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
  3. Telephone numbers
  4. Fax numbers
  5. E-mail addresses
  6. Social Security numbers
  7. Medical record numbers
  8. Health plan beneficiary numbers
  9. Account numbers
  10. All elements of dates (except year) for dates related to an individual, including: - birth date - admission date - discharge date - date of death - all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plate numbers
  13. Device identifiers and serial numbers
  14. Web Universal Resource Locators (URLs)
  15. Internet Protocol (IP) address numbers
  16. Biometric identifiers, including finger and voice prints
  17. Full face photographic images and any comparable images
  18. Any other unique identifying numbers, characteristics, or codes

5039EXA.doc (

If you have questions about whether a data element might constitute PHI, contact before sending your data to YCAS.

II. Sufficient documentation to understand the variables and values

Please provide a data dictionary which lists the variables in your data set. For each variable in the data dictionary, include the:

  • variable name that you used in the dataset (e.g., age),
  • variable label (e.g., “Age in years”), v
  • variable type (e.g., numeric, date, text etc.)
    • For categorial and binary variables, include the value for each category/level of the variable: e.g., 0=never smoke, 1=current smoker, 2=past smoker, 3=passive smoke); 0=female, 1=not female.
    • For missing data, include the specific code you used (e.g. “-9999”).

For additional guidance, see “Variable Naming Conventions,” below.

A copy of the research protocol may help us understand how the data were collected.

III. An acceptable file type

The following file types are acceptable for transmission to YCAS:

  • Excel/csv
  • SAS
  • R
  • Stata
  • SPSS

The following file types are unacceptable for transmission to YCAS:

  • Word
  • pdf

If you have a question about whether YCAS will accept a file type, please contact us.

IV. A dataset in an acceptable format

A. Non-Longitudinal (one observation per participant)

A typical dataset might look like this:

patid age female enrolldt deathdt dthtm painbl smoke ... wgtlb
1001 51 1 12/22/2020 12/31/2021 0 375 3 191
1002 31 1 01/14/2021 04/04/2021 1 81 2 205
1003 29 0 01/21/2021 06/22/2021 1 153 5 126
1004 1 01/29/2021 12/31/2021 0 337 6 177
1005 59 0 02/02/2021 07/11/2021 1 160 1 105
1150 1 03/30/2021 12/31/2021 0 277 5 183

Each record (row) generally corresponds to data for a single patient. Each variable (column) corresponds to a patient characteristic (e.g., age, weight, gender, etc.).

The first row of the data file should contain the names of the variables unless data is extracted from a database together with a SAS or R code to read it in.

The first column of the data file often includes the patient/record identifier/id. In this example, we denote the first variable, “patid” as a patient identifier.

  • The ID variable should be unique. For a patient, the ID variable should allow the data programmer or analyst to uniquely link a patient’s information from different data sources. If patients’ data are organized in different worksheets (e.g., demographics, blood samples, events) the ID variable will be used to match merge each patient’s records into a single worksheet.
  • The preferred format for the ID variable is alphanumeric. If you use a numeric id, begin with a number larger than your anticipated sample size (e.g., 10001) to ensure that the ID remains unique if the format is accidentally changed between numeric and character formats.

B. Longitudinal (Repeated Measures) Data Formats

For situations where the sampling units (e.g., patients, clinics) can have multiple records (e.g., daily pain scores for a week), it is imperative that there is (a) a sampling unit identifier (e.g., patient ID) AND an index variable to uniquely identify the repeated measurements (e.g., time).

A longitudinal dataset can be structured in long format or wide format.

  • Long format – in this example, each patient has between 1-3 records or repeated measurements. The variable “time” is the index variable that identifies the repeated measures for a given patient.
patid time age female painbl pain smoke ... wgtlb
1001 1 51 1 3 6 0 191
1001 2 51 1 3 4 0 191
1002 1 31 1 2 5 0 205
1003 1 29 0 5 2 1 126
1003 2 29 0 5 4 1 126
1004 1 44 1 6 7 0 177
1004 2 44 1 6 4 0 177
1005 1 59 0 1 8 1 105
1005 2 59 0 1 6 1 105
1005 3 59 0 1 7 1 105

  • Wide format – the same dataset as before, but instead of formatting the repeated measures for time in a vertical (long) format, we format the data horizontally:
patid pain1 pain2 pain3 female age smoke painbl
1001 6 4 1 51 0 3
1002 5 1 31 0 2
1003 2 4 0 29 1 5
1004 7 4 1 44 0 6
1005 8 6 7 0 59 1 1

Each patient has a single record (row). The repeated measurements are captured by adding a suffix (an index) to the end of the variables that vary over time; e.g., the suffix allows us to distinguish the three timepoints at which pain was measured.

C. Variable Naming Conventions

  • Restrict the length of variable names to 25 characters or less.
  • Use mnemonic names (age for “Age in years”, mme for milligrams of morphine equivalents”, smoke for “Smoking status”, etc.)
  • Use lower case letters
  • Always start names with letters.
  • Use either numeric (1,2) or character values (yes, no) for a given variable; do not use numeric and character values for a given variable (e.g. do not use 0 and no). Most dataset sets code yes/no as 1/0. If your dataset includes character variables, these are often converted to numeric at the analysis stage.
  • Never start names with numbers or special characters (e.g., $, ^, <, +, >, etc.)
  • Date variables: use the format dd/mm/yyyy. Leave missing date records blank.
  • Name (multiple) date variables mnemonically (e.g., deathdt, enrolldt, etc.).

D. Additional Data Formatting Guidance

  • If you use copy/paste to merge different spreadsheet, then crosscheck the spreadsheets to make sure the pasted values remain same. If you copy a cell that is calculated by a formula, paste the value rather than the formula.
  • Do not:
    • Use color coding or different fonts as an indicator of a variable. Instead, create a column with a variable indicating the grouping variable.
    • Use separate worksheets to distinguish between groups. Instead, create a column with a variable indicating the grouping variable. You can use different worksheets for data collected at different time points.
    • Include summary information (e.g. means, standard deviations etc.) in the data file
    • Include notes rather than data in data fields. All fields should conform to the variable format type.
  • If the data includes many variables, one might create separate data files for example: “demographics” for baseline characteristics, “imaging” for ultrasound, MRI, CT, “diet” for dietary variables. If you provide separate data files, it is imperative that the same record ID variable be present in each.