Abstract
Background: The NASA Metrics Data Program (MDP) data sets have been heavily used in software defect prediction research.
Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context.
Method: A thorough exploration of all 13 original NASA data sets, followed by experiments demonstrating the potential impact of duplicate data points on data mining results.
Conclusions:
One: Researchers need to analyse the data that forms the basis of their findings in the context of how it will be used.
Two: The bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings. This is mainly due to repeated/duplicate data points potentially causing substantial amounts of training and testing data to be identical.
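The overlap described in the second conclusion can be illustrated with a minimal sketch. It is not the authors' experimental procedure: the function names (`holdout_split`, `leaked_fraction`, `deduplicate`) and the toy rows are assumptions introduced here for illustration only, and each row is assumed to be a tuple of metric values plus a class label.

```python
import random


def holdout_split(rows, test_size, seed=0):
    """Shuffle the rows and hold out `test_size` of them for testing.

    Hypothetical helper; a plain random hold-out split is assumed.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)


def leaked_fraction(train, test):
    """Fraction of test rows that are identical to some training row."""
    if not test:
        return 0.0
    seen = set(train)
    return sum(1 for row in test if row in seen) / len(test)


def deduplicate(rows):
    """Keep the first occurrence of each distinct row, preserving order."""
    return list(dict.fromkeys(rows))


# Toy data (an assumption, not taken from the NASA MDP sets):
# two distinct modules, each repeated ten times.
toy_rows = [(10, 2, 0)] * 10 + [(55, 9, 1)] * 10

train, test = holdout_split(toy_rows, test_size=6)
overlap_with_duplicates = leaked_fraction(train, test)

unique_rows = deduplicate(toy_rows)
train_u, test_u = holdout_split(unique_rows, test_size=1)
overlap_without_duplicates = leaked_fraction(train_u, test_u)
```

In this toy setting every test row also occurs in the training set before deduplication, and none does afterwards. A classifier evaluated on the first split would be scored partly on points it has already seen, which is the kind of effect the abstract warns about.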
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 549-558 |
Number of pages | 10 |
Journal | IET Software |
Volume | 6 |
Issue number | 6 |
DOIs | |
Publication status | Published - 2012 |