Strategy: creating artificially missing values from a complete dataset

Estimation Models

Regression techniques

A significant proportion of research on software estimation has focused on linear regression analysis; however, this is not the only technique that can be used to develop estimation models. An integrated review of these estimation techniques has been published by Gray et MacDonell (1997), who examined each category of models in detail.
The least squares method is the most commonly used method for developing software estimation models: it generates a regression model that minimizes the sum of squared errors to determine the best estimates for the coefficients (de Barcelos Tronto, da Silva et Sant’Anna, 2007; Mendes et al., 2005).
(Gray et MacDonell, 1997): “Linear least squares regression operates by estimating the coefficients in order to minimize the residuals between the observed data and the model’s prediction for the observation. Thus all observations are taken into account, each exercising the same extent of influence on the regression equation, even the outliers”.
Linear least squares regression gets its name from the way the estimates of the unknown parameters are computed. The history of the least squares technique used to obtain parameter estimates, including its independent development by several mathematicians, is discussed in (Stigler, 1988), (Harter, 1983) and (Stigler, 1978).
Linear regression is a popular method for expressing an association as a linear formula, but this does not mean that the determined formula will fit the data very well. Regression is based on a scatter plot, where each pair of attributes (xi, yi) corresponds to one data point when looking at a relationship between two variables. The line of best fit among the points is determined by the regression. It is called the least-squares regression line and is characterized by having the smallest sum of squared vertical distances between the data points and the line (Fenton et Pfleeger, 1998).
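The least-squares line described above can be illustrated with a minimal sketch. It uses the closed-form solution for a single explanatory variable; the function names and the size/effort figures are hypothetical and are not taken from the ISBSG repository.

```python
from statistics import mean


def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points: functional size and effort in hours.
sizes = [100, 200, 300, 400, 500]
efforts = [900, 2100, 2900, 4100, 5000]

intercept, slope = fit_least_squares(sizes, efforts)
best_sse = sum_squared_errors(sizes, efforts, intercept, slope)
# Any other line, here one with a slightly different slope, has a larger sum.
other_sse = sum_squared_errors(sizes, efforts, intercept, slope + 0.5)
print(f"effort = {intercept:.1f} + {slope:.2f} * size")
print(best_sse < other_sse)
```

Every observation contributes to the two sums, which is why a single outlier can noticeably shift the fitted line.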

Estimation models: evaluation criteria

There are a number of criteria to evaluate the predictability of the estimation model (Conte, Dunsmore et Shen, 1986b):
1- Magnitude of Relative Error (MRE) = | Estimated value – Actual value | / Actual value.
The MRE is computed for each project i in the data set for which effort is estimated, while the mean magnitude of relative error (MMRE) averages these values over the n projects in the data set.
2- Mean Magnitude of Relative Error for n projects (MMRE) = (1/n) * Σ MREi, where i = 1…n.
The MMRE measures the average of the absolute relative errors over the n projects in the data set. Because the mean takes into account every estimated and actual value in the data set, it may give a biased assessment of the imputation's predictive power when there are several projects with large MREs.
3- Measure of prediction, Pred(q): the proportion of projects for which the estimate is within q% of the actual value. Pred(q) = k/n, where n is the total number of project observations and k is the number of observations whose MRE is less than or equal to q/100. An estimation model is generally considered good when Pred(25) ≥ 75%, that is, when at least 75% of the projects are predicted with an MRE less than or equal to 0.25 (Conte, Dunsmore et Shen, 1986b).
The evaluation criterion most widely used to assess the performance of software prediction models is the Mean Magnitude of Relative Error (MMRE). The MMRE is computed from the relative error (RE), which is the relative size of the difference between the actual and estimated values. A small MMRE indicates that the estimates are, on average, close to the actual values. The purpose of using the MMRE is to assist in selecting the best model (Conte, Dunsmore et Shen, 1986b).
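The three criteria can be sketched in a few lines. This is an illustrative example only: the function names are hypothetical, the threshold is expressed as a fraction (0.25 for 25%), and the four projects are invented so that one of them has a large MRE.

```python
def mre(actual, estimate):
    """Magnitude of relative error for one project."""
    if actual <= 0:
        raise ValueError("actual value must be positive")
    return abs(estimate - actual) / actual


def mmre(actuals, estimates):
    """Mean of the MRE values over all projects."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return sum(errors) / len(errors)


def pred(actuals, estimates, q=0.25):
    """Proportion of projects whose MRE is less than or equal to q."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    k = sum(1 for err in errors if err <= q)
    return k / len(errors)


# Hypothetical actual and estimated effort (hours) for four projects.
actual_effort = [1000, 2000, 4000, 500]
estimated_effort = [1100, 1500, 4400, 1000]

print("MRE per project:", [mre(a, e) for a, e in zip(actual_effort, estimated_effort)])
print("MMRE:", round(mmre(actual_effort, estimated_effort), 4))
print("Pred(25):", pred(actual_effort, estimated_effort, q=0.25))
```

In this invented sample, three of the four projects are within 25% of the actual value, yet the single project with an MRE of 1.0 pulls the MMRE above 0.25, which illustrates the sensitivity of the mean to large individual errors.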

Summary

The data repository of the International Software Benchmarking Standards Group (ISBSG, 2013) is a publicly available multi-company data set which contains software project data collected from various organizations around the world from 1989 to 2013. This data set has been used in many studies focusing on software effort estimation, in spite of the diversity of its data elements.
The ISBSG has made available to the public a questionnaire to collect data about projects, including software functional size measured with any of the measurement standards recognized by the ISO (e.g. COSMIC functional size, ISO 19761). However, data is collected and analyzed according to the ISBSG Standard, which defines the type of data to be collected (attributes of the project or application) and how the data is to be collected, validated, stored and published. The ISBSG recognizes the imperative of guaranteeing the anonymity of the organizations that submit data to its repositories.
The ISBSG assembles this data in a repository and provides a sample of the data fields to practitioners and researchers in an Excel file, referred to hereafter as the ISBSG MS-Excel data extract.
However, this repository contains a large number of missing values, which often considerably reduces the number of data points available for building productivity models and estimation models, for instance. A few techniques exist to handle missing values, but they must be applied in an appropriate manner; otherwise, the resulting inferences may be biased and misleading.
Data analysis with the ISBSG repository should have a clearly stated and justified rationale, taking into account software engineering domain knowledge as well as indicators of statistical importance. There are some weaknesses in this dataset: for instance, questions over data quality and completeness have meant that much of the data potentially available may not actually have been used in the analyses performed.
Missing data are a part of almost all research and a common problem in software engineering datasets used for the development of estimation models. The most popular and simplest technique for handling missing values is to ignore either the projects or the attributes with missing observations. This technique causes the loss of valuable information and therefore may lead to inaccurate estimation models. Techniques for handling missing data include listwise deletion, pairwise deletion, hot-deck imputation, cold-deck imputation, mean imputation, and regression imputation.
Therefore, this empirical study selects the Multiple Imputation technique, the most attractive method for general-purpose handling of missing data in multivariate analysis, which can be used by researchers on many analytic levels. Many research studies have used multiple imputation, and good general reviews of multiple imputation have been published.
In addition, several studies have introduced techniques to deal with the problem of outliers in a dataset. Outliers may well be the most interesting observations in themselves, because they can give hints about certain structures in the data or about special events during the sampling period. Appropriate methods for the detection of outliers are therefore needed: identifying outliers is an important step in verifying the relevance of the input data values.
This chapter has presented the evaluation criteria most widely used to assess the performance of software prediction models: the Mean Magnitude of Relative Error (MMRE), computed from the relative error (RE).

RESEARCH ISSUES AND RESEARCH OBJECTIVES

Research issues

Chapter 1 has presented a review of related works on the use of the ISBSG repository by researchers and how they have tackled, or not, these issues of outliers, missing values and data quality.
In summary, the ISBSG repository is not exempt from the issues that have been identified in other repositories (i.e. outliers, missing values and data quality). For instance, the ISBSG repository contains a large number of missing values for a significant number of variables, as not all the fields are required at the time of data collection.
The ISBSG repository also contains a number of outliers in some of the numerical data fields, thus making its use rather challenging for research purposes when attempting to analyze concurrently a large subset of data fields as parameters in statistical analyses.
Therefore, researchers using this multi-organizational repository in multi-variable statistical analyses face a number of challenges, including:
• there are often statistical outliers in the numerical fields;
• the data are contributed voluntarily: therefore, the quality of the data collected may vary and should be taken into account prior to statistical analysis;
• only a handful of the more than 100 data fields are mandatory in the ISBSG data collection process: therefore, there is a very large number of missing values in the non-mandatory fields.
Often, missing values are just ignored for reasons of convenience, which might be acceptable when working with a large dataset and a relatively small amount of missing data. However, this simple treatment can yield biased findings if the percentage of missing data is relatively large, resulting in lost information on the incomplete cases. Moreover, when dealing with relatively small datasets, it becomes impractical to just ignore missing values or to delete incomplete observations from the dataset. In these situations, more reliable imputation methods must be pursued in order to perform meaningful analyses.
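The trade-off between ignoring incomplete cases and imputing them can be illustrated with a minimal sketch. The field names and values below are hypothetical and do not come from the ISBSG data extract; mean imputation is shown only as the simplest imputation method, not as the technique retained in this research, and the sketch assumes every field has at least one observed value.

```python
from statistics import mean

# Hypothetical project records; None marks a missing value.
projects = [
    {"size": 120, "effort_plan": 80, "effort_build": 900},
    {"size": 300, "effort_plan": None, "effort_build": 2400},
    {"size": 210, "effort_plan": 150, "effort_build": None},
    {"size": 450, "effort_plan": 310, "effort_build": 3600},
    {"size": 90, "effort_plan": None, "effort_build": 700},
]


def listwise_delete(rows):
    """Keep only the records in which every field is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values of that field."""
    fields = list(rows[0].keys())
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in fields}
        for r in rows
    ]


complete_cases = listwise_delete(projects)
imputed = mean_impute(projects)

print("records after listwise deletion:", len(complete_cases), "of", len(projects))
print("records after mean imputation:", len(imputed), "of", len(projects))
```

With three of the five invented records incomplete, listwise deletion leaves only two usable records, whereas imputation keeps all five at the cost of filling gaps with values that were never observed.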
This research focuses on the issues of missing values and outliers in the ISBSG repository, and proposes a number of empirical techniques for pre-processing the input data in order to increase their quality for detailed statistical analysis.

Research motivation

Until recently, most empirical studies on the performance of estimation models used very small samples (fewer than 20 projects), while only a few researchers used larger samples (between 60 and 90 projects). With the set-up of the repository of software projects by the International Software Benchmarking Standards Group (ISBSG), a much larger data repository is now available for building estimation models, thereby providing a sounder basis for statistical studies. Researchers from around the world have started to use this repository (see Appendix XXIX on the CD attached to this thesis), but they have encountered new challenges. For instance, there are many outliers as well as missing values for a significant number of variables for each project (e.g. only 10% of the data fields are mandatory at data collection time), making its use rather challenging for research purposes.
Furthermore, several problems arise in identifying and justifying the pre-processing of the ISBSG data repository, including clustering groups of projects that share similar value characteristics, discarding and retaining data, identifying outliers in a systematic manner, and investigating the causes of such outliers’ behavior.

Table of contents

INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 ISBSG data repository
1.2 ISBSG data collection
1.2.1 The ISBSG data collection process
1.2.2 Anonymity of the data collected
1.2.3 Extract data from the ISBSG data repository
1.3 Literature Review of ISBSG-based studies
1.4 Methods for treating missing values
1.4.1 Deletion Methods for treatment of missing values
1.4.2 Imputation methods
1.5 Techniques to deal with outliers
1.6 Estimation Models
1.6.1 Regression techniques
1.6.2 Estimation models: evaluation criteria
1.7 Summary
CHAPTER 2 RESEARCH ISSUES AND RESEARCH OBJECTIVES
2.1 Research issues
2.2 Research motivation
2.3 Research goal and objectives
2.4 Research scope
CHAPTER 3 RESEARCH METHODOLOGY
3.1 Research methodology
3.2 Detailed methodology for phase I: Collection and synthesis of lessons learned
3.3 Detailed methodology for phase II: Data preparation and identification of outliers
3.4 Detailed methodology for phase III: Multiple Imputation technique to deal with missing values
3.5 Multiple imputation Overviews
3.6 Detailed methodology for phase IV: Handling Missing values in effort estimation with and without Outliers
3.7 Detailed methodology for phase V: Verification of the contribution of the MI technique on effort estimation
CHAPTER 4 DATA PREPARATION AND IDENTIFICATION OF OUTLIERS
4.1 Data preparation
4.2 Data preparation for ISBSG repository
4.2.1 Data preparation effort by project phases
4.3 Technique to deal with outliers in ISBSG data repository
4.4 Summary
CHAPTER 5 MULTIPLE IMPUTATION TECHNIQUE TO DEAL WITH MISSING VALUES IN ISBSG REPOSITORY
5.1 Multiple imputation method in SAS software
5.2 Implement the (MI) technique for effort by project phases with missing values
5.2.1 Step 1 Creating the imputed data sets (Imputation)
5.3 Step 2 analyzing the completed data sets
5.3.1 Analysis strategy
5.3.2 Implement effort estimation model (using the 62 imputed Implement values)
5.3.3 Plan effort estimation models (built using the 3 imputed Plan values)
5.4 Step 3 Combining the analysis results (combination of results)
5.4.1 Strategy and statistical tests to be used
5.4.2 The strategy for combining results
5.4.3 Average parameter estimates for MI of the full imputed dataset (N= 106 and N=103)
5.5 Summary
CHAPTER 6 VERIFICATION OF THE CONTRIBUTION OF THE MI TECHNIQUE ON EFFORT ESTIMATION
6.1 Introduction
6.2 Strategy: creating artificially missing values from a complete dataset
6.2.1 Strategy steps
6.2.2 Impact on parameter estimates with outliers – N=41 and 21 projects with values deleted
6.2.3 Impact on parameter estimates without outliers – N = 40 and 20 projects with values deleted
6.2.4 Analysis of the variance of the estimates
6.2.5 Additional investigation of effort estimation for N=20 projects with imputed values for the Effort Implement phase
6.3 Sensitivity analysis of relative imputation of effort estimation for N=40 projects without outliers for the Effort Implement phase
6.4 Comparing the estimation performance of MI with respect to a simpler imputation technique based on an average
6.4.1 The (Déry et Abran, 2005) study
6.4.2 Imputation based on an absolute average, %average, and MI with (absolute seeds and relative seeds Min & Max)
6.4.3 Estimation model from Imputation based on an average.
6.4.4 Comparisons between MI and imputation on averages (Absolute and relative seeds excluding outliers)
6.5 Summary
CONCLUSION
LIST OF BIBLIOGRAPHICAL REFERENCES