Reflections on the NASA MDP data sets

David Gray, Yi Sun, N. Davey, B. Christianson, David Bowes

Research output: Contribution to journal (article, peer-reviewed)

17 Citations (Scopus)

Abstract

Background: The NASA Metrics Data Program (MDP) data sets have been heavily used in software defect prediction research.
Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context.
Method: A thorough exploration of all 13 original NASA data sets, followed by various experiments demonstrating the potential impact of duplicate data points when data mining.
Conclusions:
One: Researchers need to analyse the data that forms the basis of their findings in the context of how it will be used.
Two: The bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings. This is mainly due to repeated/duplicate data points potentially causing substantial amounts of training and testing data to be identical.
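
The concern in the second conclusion can be illustrated with a minimal sketch. It does not reproduce the paper's experiments. It assumes each module is represented as a tuple of metric values plus a defect label, and the toy rows and the helper names (`overlap_fraction`, `deduplicate`, `split`) are hypothetical, introduced only for illustration. The sketch measures how many test rows also occur verbatim in the training rows, then repeats the measurement after dropping repeated rows, which is one possible mitigation rather than a procedure quoted from the abstract.

```python
import random


def overlap_fraction(train, test):
    """Fraction of test rows that also appear verbatim in the training rows."""
    if not test:
        return 0.0
    seen = set(train)
    return sum(1 for row in test if row in seen) / len(test)


def deduplicate(rows):
    """Drop repeated rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


def split(rows, test_size, seed):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Made-up data: (lines_of_code, cyclomatic_complexity, defective)
rows = [
    (10, 2, 0), (10, 2, 0), (10, 2, 0),
    (25, 5, 1), (25, 5, 1),
    (40, 7, 0),
    (8, 1, 0), (8, 1, 0),
]

# With repeated rows, some test rows may be identical to training rows.
train, test = split(rows, test_size=3, seed=0)
print("overlap with duplicates:", overlap_fraction(train, test))

# After removing repeats, no test row can also be in the training set.
unique = deduplicate(rows)
train_u, test_u = split(unique, test_size=2, seed=0)
print("overlap after deduplication:", overlap_fraction(train_u, test_u))  # 0.0
```

A non-zero overlap means a classifier could be scored on rows it has already seen during training, which is the situation the abstract warns may inflate reported performance.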
Original language: English
Pages (from-to): 549-558
Number of pages: 10
Journal: IET Software
Volume: 6
Issue number: 6
Publication status: Published - 2012
