||Same code used consistently throughout the database to indicate missing / unknown values.
||Blank cells should not be used as a substitute, as these can be misleading.
||Suggestions for missing value codes: “NA” or “999” etc.
||An accompanying data codebook should be provided.
||This is a record of all variables within the database, with an explanation for what each of them means, the range of possible values, and the column type (e.g., integer, numeric, categorical etc.)
||For a variable named ‘ConsumSatisf’, an explanatory entry would be provided in the codebook to say that this variable represents a measure of consumer satisfaction with product ‘X’, and that possible values range from 0 to 9 (with column type as integer).
||In case multiple datasets are shipped to the analyst, clear indications should be given for whether there is a linking key between these, and what this is.
||If two datasets are sent, one containing GP referral data, and the other containing specialist care medical data, then the two datasets should contain a common key for patient IDs, which would share the same meaning between dataset (i.e., patient ‘1234567’ should refer to the same person across the two datasets).
||Removing duplicated rows, i.e., rows that are perfectly identical within the database (from the first, through to the final column).
||In more sensitive cases, it may not be enough to remove just perfect duplicates, but special care should also be taken when two rows are identical - with only the exception of a few columns.
||In a database with patient records, a pair of rows is found that share the same patient ID, the same assessment date, the same prescription, and yet one row suggest the patient was discharged, whereas the other suggests the patient was not discharged. Wherever possible, such situations should be avoided before ever shipping the data, as they are misleading.
||Removing trailing whitespaces.
||There should be no spaces padding the actual values within cells.
||A database cell should only contain values such as “3”, rather than “3 ” or “ 3 ” etc.
||There should only be one ‘atomic’ value per cell.
||That is, a value coding only one specific characteristic.
||A single ‘Name’ variable should be replaced by two atomic variables: ‘Surname’ and ‘Forename’. Similarly, a variable called ‘HotelPreferences’ (with a value such as ‘Hilton;Marriot;Ibis’) should be divided into as many separate columns as necessary, e.g., ‘HotelPreference1’ (value = ‘Hilton’), ‘HotelPreference2’ (value = ‘Marriot’), and so on.
||The same unit of measuremement should be used consistently throughout the same column. If this is not possible, alternatively, the units of measurement can also be mixed, however this can be allowed only if one variable contains the numeric values, and another vartiable specifies the associated unit of measurement.
||For medication dosages per patient, a variable could be called ‘DailyIbuprofenDosageMg’, and the values in that column should be restricted to only numeric values, e.g., ‘0.5’, ‘1.5’ - which would be known to be in milligrams. Alternatively, if mixing units of measurement, this format is also possible: for a variable called, ‘DailyIbuprofenDosage’, with values ‘0.5’, ‘1.5’ etc, another variable should be created, e.g., ‘DailyIbuprofenUnit’, with values: ‘mg’, ‘mcg’ etc. Of course, the units of measurement would need to be typed in consistentently (e.g., using either ‘mcg’ or ‘micrograms’, but never both interchangeably).
||Related variables should follow the same naming scheme.
||Reusing the hotel preferences example above, variables should be named consistently, e.g., ‘HotelPref1’, ‘HotelPref2’ etc, rather than ‘HotelPreference1’, ‘HotelPref2’, ‘Pref3’ etc.
||Variable names should not contain spaces.
||A variable name such as ‘Date of birth’ should be replaced with ‘DateOfBirth’ or ‘date_of_birth’ etc.
||For dates, the same convention should be used consistently and mentioned in the data codebook.
||If the format ‘dd/mm/yyyy’ is preferred, this should be maintainted everywhere within the date variable, without cases such as ‘mm/dd/yyyy’, or ‘dd-mm-yyyy’ ever occurring.
||Use spelling and capitalisation consistently within columns.
||In a ‘Gender’ variable, values could be coded as either 0 and 1 (with the explanation provided in the database codebook for which is female and which is male), or if the genders are typed in as words, then ‘male’ should not get mixed with ‘Male’ etc.