Data Cleaning

Data Cleaning Tasks

The suggestions below are adapted from our longer guide to data management, developed by Michael L. Berbaum, Ph.D., of the CCTS Biostatistics Core. For more information on this topic, including advice about how to apply it in your research, consider scheduling a consultation with a biostatistician. Please contact us with any other comments or corrections.

Clean the Data

Once you have collected your data and read it into your preferred analysis software, the next task is cleaning. First, run the frequencies command (or your software’s equivalent), requesting detailed output (e.g., a selection of Min, Max, Mean, Median, Quartiles, Standard Deviation, and Inter-Quartile Range or IQR). Look for Weirdness. Look for the Impossible. Take notes on everything you notice.
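As a rough illustration of that first pass, the sketch below computes the summary statistics named above for a single variable using only Python's standard library. The `describe` function and the `ages` values are hypothetical, made up for this example; your own software's frequencies or descriptives command will give the same kind of output.

```python
import statistics

def describe(values):
    """Return basic summary statistics for one numeric variable."""
    # Quartiles computed with the "inclusive" method; other software may
    # use slightly different quartile definitions.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "sd": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,
    }

# Hypothetical ages from a sample of children; 43 is the kind of
# impossible value this first pass is meant to surface.
ages = [7, 9, 10, 8, 43, 11, 9]

summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the Max of 43 stands out immediately against a Median of 9, which is the sort of thing to write down in your notes.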

  1. Look at the Extremes of each variable. With a sample of children, there should not be any aged 43; with a sample of adults, there should not be any 5-year-olds. If you have coded your rating scales 1-to-5, there should not be any 0s, 6s, or 7s. If a lot of cases pile up (a spike) at the Min or Max, that may indicate a floor or ceiling effect (the allowed range of response was too narrow, or respondents adopted a response style of going to one extreme or the other).

  2. Watch out for miscoding or mishandling of missing data. Users of some statistical programs (e.g., SPSS) have conventionally used -9 for missing data, or -97, -98, and -99 for different kinds of missing data (e.g., could not contact respondent, respondent refused to answer, etc.). If you don’t declare that these codes signify missingness, the program will assume they are valid data and use them with all the rest of the data in the analysis. The result is not good: a handful of -99s can badly distort means, standard deviations, and correlations. Some statistical programs (e.g., R) have special functions for examining the patterns of missing data, and these are quite helpful.
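The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names, the `MISSING_CODES` set, and the `ratings` data are all assumptions made for the example, and Python's `None` stands in for whatever missing value your software uses.

```python
from collections import Counter

# Assumed sentinel codes for missing data; use whatever your codebook defines.
MISSING_CODES = {-9, -97, -98, -99}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace sentinel codes with None so they are not treated as valid data."""
    return [None if v in missing_codes else v for v in values]

def out_of_range(values, low, high):
    """Return (index, value) pairs for non-missing values outside [low, high]."""
    return [(i, v) for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

def share_at_extremes(values, low, high):
    """Return the fractions of non-missing values sitting at low and at high."""
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    return counts[low] / len(valid), counts[high] / len(valid)

# Hypothetical 1-to-5 ratings containing missing codes and impossible values.
ratings = [1, 5, 5, 3, -9, 5, 0, 6, 5, -98, 4, 5]

naive_mean = sum(ratings) / len(ratings)   # wrongly treats -9 and -98 as data
clean = recode_missing(ratings)
valid = [v for v in clean if v is not None]
clean_mean = sum(valid) / len(valid)

print("naive mean:", naive_mean)
print("mean after recoding missing codes:", clean_mean)
print("out of range:", out_of_range(clean, 1, 5))
print("share at min, max:", share_at_extremes(clean, 1, 5))
```

In this made-up example the undeclared missing codes drag the naive mean below zero on a 1-to-5 scale, the range check flags the 0 and the 6, and half of the valid responses sit at the Max, which would be worth a note about a possible ceiling effect.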

Repair the Data

What if you find a mistake? The fix is usually to replace (selected portions of) the bad data with the correct values, if you know them. If you don’t know what the respondent actually said or what the correct measurement was, the replacement will likely have to be the missing data code (e.g., in R it’s NA, in SPSS it might be -9, in SAS it’s a period .).
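One possible way to carry out such repairs is sketched below, assuming records are keyed by an ID and that you keep a short list of verified corrections alongside a list of values that cannot be recovered. The `repair` function, the record layout, and all values are hypothetical; `None` again stands in for the missing data code.

```python
def repair(records, corrections, unrecoverable):
    """Return a repaired copy of records, leaving the originals untouched.

    corrections:   {(record_id, field): verified correct value}
    unrecoverable: set of (record_id, field) pairs to set to missing (None)
    """
    repaired = {rid: dict(fields) for rid, fields in records.items()}
    for (rid, field), value in corrections.items():
        repaired[rid][field] = value
    for rid, field in unrecoverable:
        repaired[rid][field] = None
    return repaired

# Hypothetical raw data: an impossible age and an out-of-range rating.
records = {
    101: {"age": 43, "rating": 4},
    102: {"age": 9, "rating": 6},
}

# Assume the source form shows that record 101's age was really 13,
# while the true rating for record 102 cannot be determined.
corrections = {(101, "age"): 13}
unrecoverable = {(102, "rating")}

fixed = repair(records, corrections, unrecoverable)
print(fixed)
```

Writing the fixes down as data, rather than editing values by hand, is a design choice in this sketch: it leaves the raw file unchanged and doubles as a record of what was altered.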

At the end of this task you should have a data file that is correct, or nearly so.

Resources

The resource list below includes publications and tools that our consultants have found useful. While we hope this bibliography serves as a helpful starting point for other researchers, we provide no guarantee of its comprehensiveness or of the accuracy or reliability of the works cited. If you have concerns or suggestions to improve this page, please contact us.

Assessment Capacities Project (ACAPS) (2016). Data Cleaning. 19 pp. PDF, many links. https://web.archive.org/web/20230628185233/https://www.acaps.org/fileadmin/user_upload/acaps_technical_brief_data_cleaning_april_2016_0.pdf.

Benini, Aldo (2013). How to Approach a Dataset, Part 1: Data Preparation. https://web.archive.org/web/20160318211302/https://aldo-benini.org/Level2/HumanitData/Acaps_How_to_approach_a_dataset_Part_1_Data_preparation.pdf.

Bonner, Anne (2019). The complete beginner’s guide to data cleaning and preprocessing. Python, simple. https://towardsdatascience.com/the-complete-beginners-guide-to-data-cleaning-and-preprocessing-2070b7d4c6d.

de Jonge, Edwin and Mark van der Loo (2013). An introduction to data cleaning with R. PDF, extensive. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf.

Elgabry, Omar (2019). The ultimate guide to data cleaning. https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4.

Kantarci, Atakan (2020). Data cleaning: What it is, why it matters, best practices & tools. Business case, overview, colorful charts. https://research.aimultiple.com/data-cleaning.

Pomerantseva, Vera (2009). Clinical data cleaning and validation steps. Not language-specific. https://pharmaceuticalprocessingworld.com/clinical-data-cleaning-and-validation-steps/.

Regional Educational Laboratory Central (2021). Common Sources of Data Errors and Error-Checking Techniques. Many resources for data management. https://ies.ed.gov/ncee/rel/Products/Region/central/Resource/100644/26.

Sciforce (2019). Data cleaning and processing for beginners. Python, simple. https://medium.com/sciforce/data-cleaning-and-processing-for-beginners-25748ee00743.

Society for Clinical Data Management (SCDM) (2013). Good Clinical Data Management Practices (GCDMP). PDF, 524 pages. https://scdm.org/wp-content/uploads/2019/10/21117-Full-GCDMP-Oct-2013.pdf.

Van den Broeck, Jan, Solveig Argeseanu Cunningham, Roger Eeckels, et al. (2005). “Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities”. In: PLoS Medicine 2.10, p. e267. DOI: 10.1371/journal.pmed.0020267. https://doi.org/10.1371/journal.pmed.0020267.

Willems, Karlijn (2017). An introduction to cleaning data in R. R, simple. https://www.datacamp.com/community/blog/an-introduction-to-cleaning-data-in-r.
