Missing Data — Definition & Types

Time to read: 5 Minutes

Even well-designed and well-controlled studies include missing data. Missing values reduce a study’s statistical power and can lead to biased estimates and inaccurate findings. This article discusses the problems caused by missing data, the types of missing data, and ways of dealing with them.

Missing Data – In a Nutshell

  • All variables that could help explain why data are missing must be recorded, even if they are not part of the planned analysis.
  • Missing data weaken a study’s statistical power and the validity of its conclusions.
  • The best approach to missing values is prevention: maximize data collection through careful design of the study protocol and of the data collection process.
  • Complex statistical methods should be applied only after every reasonable design and prevention measure has been taken to minimize missing values.

Definition: Missing data

Missing data or missing values arise when you do not have data stored for particular variables or participants. Data can be lost for numerous causes, including incomplete data entry, device failure, and misplaced files.

Types of missing data

Missing data are errors because they do not represent the actual values of what was intended to be measured.

Consideration of the reason for the missing values is essential, as it enables you to establish the type of missing values and the necessary course of action.

Missing values fall into three categories:

MCAR

Data are missing completely at random (MCAR) when the probability that a value is missing is unrelated both to the value that would have been obtained and to the observed responses. MCAR is an ideal but often unrealistic assumption for many anesthesia studies. Data are regarded as MCAR when they are missing by design, because of equipment failure, or because samples are lost in transit or are technically unsatisfactory.

When data are MCAR, the analysis remains unbiased. The study may lose power, but the missing values do not bias the estimated parameters.

MAR

Missing at random (MAR) is a more realistic assumption for anesthesia studies. Data are MAR when the probability that a response is missing depends on the observed responses but not on the values that would have been obtained.

Because randomness is usually assumed not to produce bias, it is tempting to think that MAR is not a concern. However, MAR does not mean that the missing data can be ignored. If a dropout variable is MAR, the probability of dropout at each occasion is conditionally independent of the current and future values of the variable, given the history of the variable observed before that occasion.

MNAR

If the data meet neither the MCAR nor the MAR conditions, they are missing not at random (MNAR).

MNAR data are problematic. The only way to obtain unbiased parameter estimates is to model the missing data. The model is then used to estimate the missing values.1
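The practical difference between the three categories can be illustrated with a small simulation. The sketch below is not taken from the cited sources: the group labels, score distributions, and missingness probabilities are all invented for illustration, and only the Python standard library is used. It removes values from the same simulated scores under each mechanism and compares the mean of what remains with the true mean.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: a score that depends on an always-observed group label.
n = 20_000
groups = [random.choice("AB") for _ in range(n)]
scores = [random.gauss(40 if g == "A" else 60, 10) for g in groups]

def drop(values, prob_missing):
    """Replace each value with None according to its own missingness probability."""
    return [None if random.random() < p else v for v, p in zip(values, prob_missing)]

def observed_mean(values):
    """Mean of the values that are not missing."""
    return mean(v for v in values if v is not None)

# MCAR: every value has the same 30% chance of being missing.
mcar = drop(scores, [0.3] * n)
# MAR: missingness depends only on the observed group, not on the score itself.
mar = drop(scores, [0.6 if g == "B" else 0.1 for g in groups])
# MNAR: missingness depends on the (unobserved) score itself.
mnar = drop(scores, [0.6 if s > 50 else 0.1 for s in scores])

true_mean = mean(scores)
mcar_mean = observed_mean(mcar)
mar_mean = observed_mean(mar)
mnar_mean = observed_mean(mnar)

# Under MAR, conditioning on the observed group recovers an unbiased group mean.
true_mean_b = mean(s for s, g in zip(scores, groups) if g == "B")
mar_mean_b = mean(v for v, g in zip(mar, groups) if g == "B" and v is not None)

print(f"true {true_mean:.1f} | MCAR {mcar_mean:.1f} | MAR {mar_mean:.1f} | MNAR {mnar_mean:.1f}")
print(f"group B: true {true_mean_b:.1f} | observed under MAR {mar_mean_b:.1f}")
```

In this setup the MCAR mean stays close to the true mean, while the naive MAR and MNAR means are pulled downward. The MAR bias disappears once the analysis conditions on the observed group, which is why the variables that explain missingness need to be recorded.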

How to prevent missing data

Common causes of missing values include attrition, non-response, and poorly designed study procedures. When planning a study, make it as easy as possible for participants to contribute data.

Here are tips to minimize missing values:

  • Limit follow-ups.
  • Minimize the amount of data collected.
  • Make forms user-friendly.
  • Incorporate methods of data validation.
  • Give rewards.
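As one illustration of data validation at the point of entry, the sketch below checks a submitted form for empty required fields and out-of-range ratings before it is stored. The field names and the 1 to 5 rating range are assumptions made for this example, not part of any particular survey tool.

```python
REQUIRED_FIELDS = ("age", "gender", "rating")  # hypothetical form fields

def validate_entry(entry):
    """Return a list of problems found in one submitted form (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        # Treat absent keys, None, and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"{field}: missing")
    rating = entry.get("rating")
    # Assumed scale: ratings are integers from 1 to 5.
    if isinstance(rating, int) and not 1 <= rating <= 5:
        problems.append("rating: must be between 1 and 5")
    return problems

print(validate_entry({"age": 30, "gender": "", "rating": 7}))
# ['gender: missing', 'rating: must be between 1 and 5']
```

Flagging problems while the participant is still filling in the form gives them a chance to correct the entry, rather than leaving a gap to handle later.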

How to deal with missing data

Typically, you have the choice of accepting, eliminating, or reconstructing missing values to organize your data.

Determine how to handle each instance of missing values depending on your evaluation of the missing value’s cause:

  • Are these missing data due to random or non-random causes?
  • Are missing data zero or null?
  • Was the query or measurement ill-conceived?

If your data are MCAR or MAR, the missing values can often be accepted and left unchanged. MNAR data, however, may call for a more involved approach.

Missing data: Acceptance

Accepting missing data is often the most prudent course: leave the missing values as they are rather than deleting or replacing them. This works best for MCAR or MAR values. When you have a small sample, keep as much data as possible to maintain statistical power.

To keep your dataset consistent, mark every missing value with a single code such as “N/A.” This lets you preserve as much of your research data as possible without altering it.
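A minimal sketch of such recoding is shown below. The set of markers treated as missing is an assumption and should be adapted to your own data. Note that a genuine zero is kept as a value rather than treated as missing.

```python
# Assumed set of ad hoc markers that mean "no answer"; adjust to your data.
MISSING_MARKERS = {"", "na", "n/a", "none", "null", "-"}

def recode_missing(value, code="N/A"):
    """Return `code` for anything that looks like a missing entry, else the value unchanged."""
    if value is None:
        return code
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return code
    return value

# Hypothetical raw survey column with inconsistent missing-value markers.
raw = ["4", "", None, "n/a", "5", " - ", 0]
cleaned = [recode_missing(v) for v in raw]
print(cleaned)  # ['4', 'N/A', 'N/A', 'N/A', '5', 'N/A', 0]
```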

Missing data: Deletion

Listwise or pairwise deletion can be used to eliminate missing values from analyses.

Listwise deletion

Listwise deletion removes every case (participant) that has a missing value for any variable, so each remaining participant has complete data. This strategy may result in a smaller, biased sample: participants who provide data for every variable or measurement may differ systematically from those who do not.

Your sample may not be representative of the population, making it biased.2

Example of listwise deletion:

You opt to eliminate all survey respondents with missing data from your dataset. This decreases your sample from 114 to 77 people.

You notice that most of the participants with missing data skipped a question about their opinions. Many of them were women, so your sample now consists mostly of men.
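The effect described in this example can be sketched in a few lines. The six rows below are invented for illustration and are far smaller than the survey in the example. The point is only to show how dropping incomplete cases can shift the composition of the sample.

```python
from collections import Counter

# Hypothetical survey rows; None marks a skipped opinion question.
rows = [
    {"id": 1, "gender": "F", "opinion": None},
    {"id": 2, "gender": "F", "opinion": None},
    {"id": 3, "gender": "F", "opinion": 4},
    {"id": 4, "gender": "M", "opinion": 5},
    {"id": 5, "gender": "M", "opinion": 3},
    {"id": 6, "gender": "M", "opinion": 2},
]

def listwise_delete(records):
    """Keep only records that have no missing value in any field."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = listwise_delete(rows)

# Compare the gender mix before and after deletion to spot a possible bias.
before = Counter(r["gender"] for r in rows)
after = Counter(r["gender"] for r in complete)

print([r["id"] for r in complete])  # [3, 4, 5, 6]
print(dict(before), "->", dict(after))  # {'F': 3, 'M': 3} -> {'F': 1, 'M': 3}
```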

Pairwise deletion

Pairwise deletion excludes a case only from the analyses that need a data point the case is missing. The case’s existing values are still used in every other analysis. Pairwise deletion therefore retains more information than listwise deletion, which drops every case with any missing value.

Pairwise deletion is less biased for MCAR or MAR data, provided the variables that explain the missingness are included as covariates. If many observations are missing, however, the analysis will still be deficient.3

Example of pairwise deletion:

You exclude missing values only from the analyses that need them and keep each participant’s other data. No participant is removed from the dataset entirely.

  • 12 people didn’t answer the gender question, lowering the sample size for that variable from 114 to 102.
  • 3 people didn’t answer the age question, lowering the sample size for that variable from 114 to 111.

This retains more values, but the sample size varies among variables.
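The varying sample sizes can be reproduced with a small sketch. The dataset below is synthetic and built only to match the counts in the example. The assumption that the 12 and 3 non-respondents are different people, and the extra "rating" variable, are additions for illustration.

```python
# Synthetic dataset: 114 respondents, 12 without a gender answer and
# 3 other respondents without an age answer (assumed not to overlap).
rows = []
for i in range(114):
    rows.append({
        "gender": None if i < 12 else ("F" if i % 2 else "M"),
        "age": None if 12 <= i < 15 else 20 + i % 40,
        "rating": 1 + i % 5,  # hypothetical variable with no missing values
    })

def available_n(records, *variables):
    """Count records in which every listed variable is present."""
    return sum(all(r[v] is not None for v in variables) for r in records)

n_gender = available_n(rows, "gender")              # 102
n_age = available_n(rows, "age")                    # 111
n_age_gender = available_n(rows, "age", "gender")   # 99
n_rating = available_n(rows, "rating")              # 114

print(n_gender, n_age, n_age_gender, n_rating)  # 102 111 99 114
```

An analysis that uses both age and gender has fewer usable cases than an analysis of either variable alone, so results from different analyses rest on different subsets of participants.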

Missing data: Imputation

Imputation replaces each missing value with an estimate derived from other data, so that the dataset is complete.

You have numerous imputation options. The simplest method is to replace missing values with the mean or median of the variable.
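Mean and median imputation can be sketched with the standard library alone. The ages below are invented, and the helper name `impute` is hypothetical.

```python
from statistics import mean, median

ages = [23, 31, None, 45, 29, None, 88]  # hypothetical values; None marks missing

def impute(values, strategy=mean):
    """Fill missing entries with a summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

mean_filled = impute(ages, mean)
median_filled = impute(ages, median)

print(mean_filled)    # [23, 31, 43.2, 45, 29, 43.2, 88]
print(median_filled)  # [23, 31, 31, 45, 29, 31, 88]
```

The outlier of 88 pulls the mean above most observed values, which is one reason the median is sometimes preferred for skewed variables. Either way, every imputed case receives the same value, so the variable's spread is understated.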

Hot-deck imputation

Hot-deck imputation replaces missing values with values from similar cases or participants in the same dataset. For each case with a missing value, a “donor” is chosen based on how closely its values on other variables match.

Example of hot-deck imputation:

In a poll, you ask participants to rate a new shopping app from 1 to 5 on several questions. You notice that two people skipped Question 3, so these cells are empty.

You sort the data by the other variables and look for participants who answered the other questions similarly to the participants with missing values. You then use each donor’s answer to Question 3 to fill in the corresponding empty cell.
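A minimal sketch of this procedure is given below. It assumes a simple similarity rule, the sum of absolute differences on the other questions, with ties going to the first donor found. Real hot-deck procedures vary, and the responses are invented.

```python
# Hypothetical app-rating survey; None marks a skipped Question 3.
responses = [
    {"id": "p1", "q1": 5, "q2": 4, "q3": 5},
    {"id": "p2", "q1": 2, "q2": 1, "q3": 2},
    {"id": "p3", "q1": 5, "q2": 5, "q3": None},
    {"id": "p4", "q1": 1, "q2": 2, "q3": None},
    {"id": "p5", "q1": 3, "q2": 3, "q3": 3},
]

def distance(a, b, keys):
    """Assumed similarity rule: sum of absolute differences on the matching questions."""
    return sum(abs(a[k] - b[k]) for k in keys)

def hot_deck(records, target, match_on):
    """Fill missing `target` values from the most similar complete case in the same sample."""
    donors = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: distance(r, d, match_on))
            r = {**r, target: donor[target]}  # copy, so the original row is untouched
        filled.append(r)
    return filled

completed = hot_deck(responses, "q3", ["q1", "q2"])
print([r["q3"] for r in completed])  # [5, 2, 5, 2, 3]
```

Here p3 borrows from p1 and p4 borrows from p2, because those donors gave the closest answers to Questions 1 and 2.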

Cold-deck imputation

In cold-deck imputation, missing values are substituted with existing values from similar cases in other datasets. The new values are derived from an independent sample.

Example of cold-deck imputation:

Instead of filling in the missing values with responses from participants in the same sample, you obtain a separate dataset from a colleague who ran an identical poll with a different sample.

You look for participants whose responses to other questions were comparable to those of your participants with missing values. You use the answer to Question 3 from the second dataset to fill the empty cell for each missing value.
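The only change from the hot-deck case is where the donors come from. In the sketch below, the colleague's dataset, the similarity rule (sum of absolute differences), and the `imputed_from` bookkeeping field are all assumptions made for illustration.

```python
# Your own sample; None marks a skipped Question 3.
own_sample = [
    {"id": "p1", "q1": 4, "q2": 4, "q3": 4},
    {"id": "p2", "q1": 1, "q2": 2, "q3": None},
    {"id": "p3", "q1": 5, "q2": 4, "q3": None},
]

# Hypothetical external dataset from an identical poll with different participants.
external_sample = [
    {"id": "c1", "q1": 5, "q2": 5, "q3": 5},
    {"id": "c2", "q1": 2, "q2": 2, "q3": 1},
    {"id": "c3", "q1": 3, "q2": 3, "q3": 3},
]

def cold_deck(records, donor_pool, target, match_on):
    """Fill missing `target` values from the most similar case in an external dataset."""
    filled = []
    for r in records:
        if r[target] is None:
            donor = min(
                donor_pool,
                key=lambda d: sum(abs(r[k] - d[k]) for k in match_on),
            )
            # Record the donor so that imputed values stay traceable.
            r = {**r, target: donor[target], "imputed_from": donor["id"]}
        filled.append(r)
    return filled

completed = cold_deck(own_sample, external_sample, "q3", ["q1", "q2"])
print([r["q3"] for r in completed])        # [4, 1, 5]
print(completed[1].get("imputed_from"))    # c2
```

Donors are drawn only from the external sample, even when a participant in your own sample, such as p1, would be a close match.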

FAQs

Missing values arise when you do not have data stored for particular variables or participants.

Missing data matter because, depending on the type of missingness, they can distort your results. If the remaining sample is unrepresentative, your results may not be generalizable.

Typically, you have the choice of accepting, eliminating or reconstructing missing data to organize it.

Sources

1 Kang, Hyun. “The prevention and handling of the missing data.” National Library of Medicine. May 24, 2013. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/.

2 Ogunbiyi, Ibrahim Abayomi. “How to Handle Missing Data in a Dataset.” FreeCodeCamp. June 24, 2022. https://www.freecodecamp.org/news/how-to-handle-missing-data-in-a-dataset/.

3 MastersInDataScience. “How to Deal with Missing Data.” Accessed December 02, 2022. https://www.mastersindatascience.org/learning/how-to-deal-with-missing-data/.