Formatting Guidelines for data sent to YCAS
How to Prepare Data for Submission to the Yale Center for Analytical Sciences (YCAS)
I. De-identified data, if indicated
To ensure compliance with HIPAA’s Privacy Rule, data you provide to YCAS should be de-identified; i.e., it should not include protected health information (PHI) unless ycas staff is listed on the protocol or the protocol generically states that statisticians or study personnel will be able to see the data. Per HIPAA at Yale:
Information is de-identified if none of the following 18 types of identifiers is contained in the information and if no one accessing the information has actual knowledge that the information could be used – alone or in combination with other information – to identify any individual who is the subject of the information. Note that this does not prohibit coding records so that they may later be re-identified, so long as the code does not contain information about the subject of the information (for example, the code may not be a derivative of the individual’s Social Security Number) and is not used or disclosed for any other purpose, and so long as the re-linking mechanism (e.g., the subject log or coding algorithm) is not disclosed to any persons or organizations outside the [institution].
- Names
- All geographic subdivisions smaller than a State, including: - street address - city - county - precinct - zip codes and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly-available data from the Bureau of the Census: (1) the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people, and (2) the initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
- Telephone numbers
- Fax numbers
- E-mail addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- All elements of dates (except year) for dates related to an individual, including: - birth date - admission date - discharge date - date of death - all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
- Certificate/license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web Universal Resource Locators (URLs)
- Internet Protocol (IP) address numbers
- Biometric identifiers, including finger and voice prints
- Full face photographic images and any comparable images
- Any other unique identifying numbers, characteristics, or codes
If you have questions about whether a data element might constitute PHI, contact hipaa@yale.edu before sending your data to YCAS.
II. Sufficient documentation to understand the variables and values
Please provide a data dictionary which lists the variables in your data set. For each variable in the data dictionary, include the:
- variable name that you used in the dataset (e.g., age),
- variable label (e.g., “Age in years”), v
- variable type (e.g., numeric, date, text etc.)
- For categorial and binary variables, include the value for each category/level of the variable: e.g., 0=never smoke, 1=current smoker, 2=past smoker, 3=passive smoke); 0=female, 1=not female.
- For missing data, include the specific code you used (e.g. “-9999”).
For additional guidance, see “Variable Naming Conventions,” below.
A copy of the research protocol may help us understand how the data were collected.
III. An acceptable file type
The following file types are acceptable for transmission to YCAS:
- Excel/csv
- SAS
- R
- Stata
- SPSS
The following file types are unacceptable for transmission to YCAS:
- Word
If you have a question about whether YCAS will accept a file type, please contact us.
IV. A dataset in an acceptable format
A. Non-Longitudinal (one observation per participant)
A typical dataset might look like this:
patid | age | female | enrolldt | deathdt | dthtm | painbl | smoke | ... | wgtlb |
---|---|---|---|---|---|---|---|---|---|
1001 | 51 | 1 | 12/22/2020 | 12/31/2021 | 0 | 375 | 3 | 191 | |
1002 | 31 | 1 | 01/14/2021 | 04/04/2021 | 1 | 81 | 2 | 205 | |
1003 | 29 | 0 | 01/21/2021 | 06/22/2021 | 1 | 153 | 5 | 126 | |
1004 | 1 | 01/29/2021 | 12/31/2021 | 0 | 337 | 6 | 177 | ||
1005 | 59 | 0 | 02/02/2021 | 07/11/2021 | 1 | 160 | 1 | 105 | |
1150 | 1 | 03/30/2021 | 12/31/2021 | 0 | 277 | 5 | 183 |
Each record (row) generally corresponds to data for a single patient. Each variable (column) corresponds to a patient characteristic (e.g., age, weight, gender, etc.).
The first row of the data file should contain the names of the variables unless data is extracted from a database together with a SAS or R code to read it in.
The first column of the data file often includes the patient/record identifier/id. In this example, we denote the first variable, “patid” as a patient identifier.
- The ID variable should be unique. For a patient, the ID variable should allow the data programmer or analyst to uniquely link a patient’s information from different data sources. If patients’ data are organized in different worksheets (e.g., demographics, blood samples, events) the ID variable will be used to match merge each patient’s records into a single worksheet.
- The preferred format for the ID variable is alphanumeric. If you use a numeric id, begin with a number larger than your anticipated sample size (e.g., 10001) to ensure that the ID remains unique if the format is accidentally changed between numeric and character formats.
B. Longitudinal (Repeated Measures) Data Formats
For situations where the sampling units (e.g., patients, clinics) can have multiple records (e.g., daily pain scores for a week), it is imperative that there is (a) a sampling unit identifier (e.g., patient ID) AND an index variable to uniquely identify the repeated measurements (e.g., time).
A longitudinal dataset can be structured in long format or wide format.
- Long format – in this example, each patient has between 1-3 records or repeated measurements. The variable “time” is the index variable that identifies the repeated measures for a given patient.
patid | time | age | female | painbl | pain | smoke | ... | wgtlb |
---|---|---|---|---|---|---|---|---|
1001 | 1 | 51 | 1 | 3 | 6 | 0 | 191 | |
1001 | 2 | 51 | 1 | 3 | 4 | 0 | 191 | |
1002 | 1 | 31 | 1 | 2 | 5 | 0 | 205 | |
1003 | 1 | 29 | 0 | 5 | 2 | 1 | 126 | |
1003 | 2 | 29 | 0 | 5 | 4 | 1 | 126 | |
1004 | 1 | 44 | 1 | 6 | 7 | 0 | 177 | |
1004 | 2 | 44 | 1 | 6 | 4 | 0 | 177 | |
1005 | 1 | 59 | 0 | 1 | 8 | 1 | 105 | |
1005 | 2 | 59 | 0 | 1 | 6 | 1 | 105 | |
1005 | 3 | 59 | 0 | 1 | 7 | 1 | 105 |
-
Wide format – the same dataset as before, but instead of formatting the repeated measures for time in a vertical (long) format, we format the data horizontally:
patid | pain1 | pain2 | pain3 | female | age | smoke | painbl |
---|---|---|---|---|---|---|---|
1001 | 6 | 4 | 1 | 51 | 0 | 3 | |
1002 | 5 | 1 | 31 | 0 | 2 | ||
1003 | 2 | 4 | 0 | 29 | 1 | 5 | |
1004 | 7 | 4 | 1 | 44 | 0 | 6 | |
1005 | 8 | 6 | 7 | 0 | 59 | 1 | 1 |
Each patient has a single record (row). The repeated measurements are captured by adding a suffix (an index) to the end of the variables that vary over time; e.g., the suffix allows us to distinguish the three timepoints at which pain was measured.
C. Variable Naming Conventions
- Restrict the length of variable names to 25 characters or less.
- Use mnemonic names (age for “Age in years”, mme for milligrams of morphine equivalents”, smoke for “Smoking status”, etc.)
- Use lower case letters
- Always start names with letters.
- Use either numeric (1,2) or character values (yes, no) for a given variable; do not use numeric and character values for a given variable (e.g. do not use 0 and no). Most dataset sets code yes/no as 1/0. If your dataset includes character variables, these are often converted to numeric at the analysis stage.
- Never start names with numbers or special characters (e.g., $, ^, <, +, >, etc.)
- Date variables: use the format dd/mm/yyyy. Leave missing date records blank.
- Name (multiple) date variables mnemonically (e.g., deathdt, enrolldt, etc.).
D. Additional Data Formatting Guidance
- If you use copy/paste to merge different spreadsheet, then crosscheck the spreadsheets to make sure the pasted values remain same. If you copy a cell that is calculated by a formula, paste the value rather than the formula.
-
Do not:
- Use color coding or different fonts as an indicator of a variable. Instead, create a column with a variable indicating the grouping variable.
- Use separate worksheets to distinguish between groups. Instead, create a column with a variable indicating the grouping variable. You can use different worksheets for data collected at different time points.
- Include summary information (e.g. means, standard deviations etc.) in the data file
- Include notes rather than data in data fields. All fields should conform to the variable format type.
- If the data includes many variables, one might create separate data files for example: “demographics” for baseline characteristics, “imaging” for ultrasound, MRI, CT, “diet” for dietary variables. If you provide separate data files, it is imperative that the same record ID variable be present in each.