Rules to format a data set

From Wikipedia

To ensure a quick turnaround on your analysis, please follow these guidelines:

General spreadsheet disposition

  • Place the variable names in the first row, one per column.
  • Enter all the information about each patient in a single row. Do not split the information about a single patient into multiple rows, and do not leave blank rows between patients.
  • Do not enter names of variables in intermediate rows
  • Do not split the variable names into two rows.
  • Always give variables that share a common characteristic a common prefix. For example, if a series of variables (score1, score2, etc.) all come from the preop period, label them with the appropriate prefix (e.g., preop_score1, preop_score2).
  • As you enter more patients, the first row will scroll out of view and you will have difficulty identifying which column corresponds to which variable. To make data entry easier, freeze the first row by selecting it and then selecting window/freeze panes.
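
The layout rules above can be checked automatically before a file is sent for analysis. The following is a minimal sketch, assuming the spreadsheet has been exported to CSV and has a patient identifier column; the function name `check_layout` and the column name `patient_id` are illustrative and not part of these guidelines.

```python
import csv
import io


def check_layout(csv_text, id_column):
    """Return a list of layout problems in a one-row-per-patient sheet."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    id_index = header.index(id_column)
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(body, start=2):  # the header is line 1
        # A row with no content at all is a blank row between patients.
        if not any(cell.strip() for cell in row):
            problems.append(f"line {line_no}: blank row")
            continue
        # Variable names must appear only in the first row.
        if row == header:
            problems.append(f"line {line_no}: variable names repeated")
            continue
        # Each patient should occupy exactly one row.
        patient = row[id_index].strip() if id_index < len(row) else ""
        if patient in seen_ids:
            problems.append(
                f"line {line_no}: patient {patient} appears in more than one row"
            )
        seen_ids.add(patient)
    return problems


# Hypothetical sample data, for illustration only.
sample = """patient_id,preop_score1,preop_score2
1,10,12

2,11,9
patient_id,preop_score1,preop_score2
2,14,
"""

for problem in check_layout(sample, "patient_id"):
    print(problem)
```

On this sample the check reports a blank row on line 3, repeated variable names on line 5, and a second row for patient 2 on line 6.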

Variable names

  • Names should be meaningful so that the statistical programmer can immediately understand what each variable represents. Meaningful names will also help researchers themselves remember the definition of each variable long after the study has been conducted. For example, "pr-sg" is a bad variable name, while "previous_surgery" is a good one.
  • Variable names should not contain blank spaces (replace them with "_") or any of the following characters: #, /, ", $, (, ), +, -, &, `, ', <, >, ?, @, =
  • Do not start variable names with numbers. For example, instead of "7_day_fu" use "_7_day_fu"
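
The character and leading-number rules above are mechanical, so they can be screened with a short script. This is a sketch only: the helper names `name_problems` and `suggest_name` are hypothetical, and the forbidden set is simply the list given above. Whether a name is meaningful still has to be judged by a person.

```python
# Characters listed above as not allowed in variable names (plus the blank space).
FORBIDDEN = set(' #/"$()+-&`\'<>?@=')


def name_problems(name):
    """Return the naming rules that a variable name breaks."""
    problems = []
    bad = sorted({ch for ch in name if ch in FORBIDDEN})
    if bad:
        problems.append("forbidden characters: " + " ".join(repr(ch) for ch in bad))
    if name[:1].isdigit():
        problems.append("starts with a number")
    return problems


def suggest_name(name):
    """Replace forbidden characters with "_" and prefix "_" to a leading digit."""
    cleaned = "".join("_" if ch in FORBIDDEN else ch for ch in name.strip())
    if cleaned[:1].isdigit():
        cleaned = "_" + cleaned
    return cleaned


for raw in ["previous surgery", "7 day fu", "pr-sg"]:
    print(raw, "->", suggest_name(raw), name_problems(raw))
```

Note that "pr-sg" becomes "pr_sg", which passes the character rules but is still not a meaningful name.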

Variable coding

  • Do not flag patients with color or any other visual marks, since statistical software cannot read them. If you need to add information, add a new variable. For example, instead of marking patients who had an infection in red, create a new variable (column) called "infection" and then code all patients with an infection as "1" and all without an infection as "0".
  • Do not transform continuous variables (e.g., age, weight, time to recovery) into categorical variables (e.g., age groups such as 0-5 or 6-18). Categorization will be performed by the programmers if necessary.
  • If you don't have data for a variable, leave the cell blank. Do not enter a value of zero, since zero means that you have the information and that its value is zero, not that it is missing.
  • Do not insert multiple values in a single cell. Instead, keep information about different variables in different cells. For example, instead of having a single variable with a list of all comorbidities in a single cell (diabetes, hypertension, renal failure; see picture), create one variable per comorbidity and then code it as 1/0.
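
If a sheet already contains a multi-value cell, it can be converted to one 1/0 variable per comorbidity along the following lines. This is a sketch under stated assumptions: the function name `split_comorbidities` is hypothetical, values are assumed to be separated by commas, and an empty cell is treated as missing (left blank) rather than coded as 0, in keeping with the rule on missing data above.

```python
def split_comorbidities(cell, comorbidities):
    """Turn one multi-value cell into one 1/0 variable per comorbidity."""
    # Assumption: an empty cell means the information is missing, so every
    # indicator is left blank instead of being coded as 0.
    if not cell.strip():
        return {name.replace(" ", "_"): "" for name in comorbidities}
    present = {item.strip().lower() for item in cell.split(",")}
    return {
        name.replace(" ", "_"): (1 if name in present else 0)
        for name in comorbidities
    }


KNOWN = ["diabetes", "hypertension", "renal failure"]

print(split_comorbidities("diabetes, hypertension", KNOWN))
print(split_comorbidities("", KNOWN))
```

The first call yields diabetes = 1, hypertension = 1, renal_failure = 0; the second leaves all three variables blank.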

Consistency between variables and research question

  • Make sure that all variables necessary to answer your research question are present in the database by writing a mock abstract (see example). For example, if your mock abstract talks about the degree of agreement among raters across different image modalities for knee evaluation, your data set should have variables for each of the raters as well as for each of the image modalities.
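
Once the mock abstract is written, the variables it implies can be compared against the spreadsheet's first row. The sketch below assumes one variable per rater and modality; the function name `missing_variables` and the variable names (rater1_mri, rater2_xray, and so on) are purely illustrative.

```python
def missing_variables(header, required):
    """Return the required variable names that are absent from the header."""
    present = set(header)
    return [name for name in required if name not in present]


# Hypothetical first row of a data set for the rater agreement example.
header = ["patient_id", "rater1_mri", "rater2_mri", "rater1_xray"]

# One variable per rater and image modality, as implied by the mock abstract.
required = [f"rater{r}_{m}" for r in (1, 2) for m in ("mri", "xray")]

print(missing_variables(header, required))
```

Here the check shows that the variable for rater 2 on the second modality has not been entered yet.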