2015, Computational Statistics & Data Analysis
Missing data is an important issue in almost all fields of quantitative research. A nonparametric procedure that has been shown to be useful is the nearest neighbor (NN) imputation method. We suggest a weighted nearest neighbor imputation method based on Lq-distances. The weighted method is shown to have smaller imputation error than available NN estimates. In addition, we consider weighted neighbor imputation methods that use only selected distances. The careful selection of distances that carry information on the missing values yields an imputation tool that clearly outperforms competing nearest neighbor methods. Simulation studies show that the suggested weighted imputation with selection of distances provides the smallest imputation error, in particular when the number of predictors is large. In addition, the selected procedure is applied to real data from different fields.
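The distance-weighted NN idea in this abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' estimator: the inverse-distance weights and the name wnn_impute are assumptions, and the paper's kernel weighting and distance-selection step are omitted. For the q=2 case, scikit-learn's KNNImputer(weights='distance') offers a related built-in variant.

```python
import numpy as np

def wnn_impute(X, q=2.0, k=5):
    """Fill NaNs row by row from the k nearest donor rows under the
    Lq (Minkowski) distance, weighting donors by inverse distance."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        dists = np.full(len(X), np.inf)
        for j, other in enumerate(X):
            if j == i or np.isnan(other[miss]).any():
                continue  # a donor must observe the coordinates we need
            shared = obs & ~np.isnan(other)
            if shared.any():
                diff = np.abs(row[shared] - other[shared])
                dists[j] = (diff ** q).sum() ** (1.0 / q)
        order = np.argsort(dists)
        donors = order[np.isfinite(dists[order])][:k]
        if donors.size == 0:
            continue  # no usable donor: leave the cells missing
        w = 1.0 / (dists[donors] + 1e-8)  # inverse-distance weights
        vals = X[np.ix_(donors, np.where(miss)[0])]
        out[i, miss] = (w[:, None] * vals).sum(axis=0) / w.sum()
    return out
```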
Information Sciences, 2005
Imputation of missing data is of interest in many areas, such as survey data editing, maintenance of medical documentation, and DNA microarray data analysis. This paper is devoted to an experimental analysis of a set of imputation methods developed within the so-called least-squares approximation approach, a nonparametric, computationally efficient multidimensional technique. First, we review global methods for least-squares data imputation. Then we propose extensions of these algorithms based on the nearest neighbours approach. An experimental study of the algorithms on generated data sets is conducted. It appears that straight algorithms may work rather well on data of simple structure and/or with a small number of missing entries. However, in more complex cases, the only winner within the least-squares approximation approach is INI, a method proposed in this paper that combines global and local imputation algorithms.
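The least-squares approximation idea can be illustrated with a generic iterative low-rank fill: initialize the missing cells, fit a rank-r least-squares approximation, refill from the reconstruction, and repeat. This is a standard alternating sketch, not the paper's INI algorithm; the rank, tolerance, and function name are assumptions.

```python
import numpy as np

def ls_impute(X, rank=1, n_iter=50, tol=1e-6):
    """Fill NaNs by alternating a rank-r SVD fit with refilling."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)  # column-mean start
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-r LS fit
        delta = np.max(np.abs(filled[miss] - approx[miss])) if miss.any() else 0.0
        filled[miss] = approx[miss]
        if delta < tol:  # stop once the imputed cells stabilise
            break
    return filled
```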
2021
Missing values, sometimes referred to as missing data, are an unavoidable issue when collecting data. They are uncontrollable and occur in almost any research field. Hence, this study focuses on identifying the current publication trend in missing data imputation techniques (1991-2021), specifically in classification problems, using bibliometric analysis. Most importantly, this research aims to uncover potential missing data imputation methods. Two software tools were used: VOSViewer and Harzing Publish or Perish. Based on the Scopus database extracted in June 2021, the findings indicate an emerging trend in missing data imputation research to date, with two imputation methods receiving the most attention: the random forest and nearest neighbor methods.
Turkiye Klinikleri Journal of Biostatistics
In research, it is undesirable for the dataset to contain missing values, and researchers try to cope with this situation. The main purpose of this research is to develop new user-friendly web-based software that uses various techniques to handle missing values. Material and Methods: In this study, to assess the performance of the software, various scenarios were tested: 5 normally distributed variables, different sample sizes (n=1000, 1500, 2000, and 2500), high (r < -0.70 or r > 0.70) and low (-0.30 < r < 0.30) correlations between variables, and different proportions of missing values in the variables (5%, 10%, and 20% missing data). The missing values were imputed by the developed web software and the results were compared. Thus, the performance of the software under different conditions was evaluated. Shiny, an open-source R package, was used to develop the web tool. In the developed software, linear regression (LR), random forest (RF), classification and regression trees (CART), and predictive mean matching (PMM) methods were used to impute missing values. In order to achieve more unbiased and reliable results, the 'number of repetitions' and 'number of multiple imputations' sections were used in the software. The normalized root mean squared error (NRMSE) metric was used to assess the performance of the imputation techniques. The developed web-based application can be accessed free of charge at http://biostatapps.inonu.edu.tr/KDAY/. Results: According to the outputs of the developed web-based application, better results were obtained with the LR and PMM models for missing value imputation in datasets with high correlation. For missing value imputation in low-correlated datasets, the models showed similar imputation performance. Conclusion: For the datasets used in this study, when the correlation between the variables is high, the best imputation performance is obtained with the LR and PMM models, regardless of the size of the dataset and the percentage of missing values.
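For reference, one common formulation of NRMSE (used, e.g., by missForest) is sketched below: the RMSE over the imputed cells, normalised by the variance of the true values, so 0 is perfect and values near 1 are no better than mean imputation. Whether the web tool normalises in exactly this way is an assumption.

```python
import numpy as np

def nrmse(X_true, X_imputed, mask):
    """mask is a boolean array marking the cells that were missing."""
    diff = X_true[mask] - X_imputed[mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(X_true[mask]))
```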
Data & Knowledge Engineering, 2013
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. Imputation algorithms have traditionally been compared in terms of the similarity between imputed and original values. However, this traditional approach, sometimes referred to as prediction ability, does not allow inferring the influence of imputed values on the ultimate modeling tasks (e.g., classification). Based on extensive experimental work, we study the influence of five nearest-neighbor based imputation algorithms (KNNImpute, SKNN, IKNNImpute, KMI and EACImpute) and two simple algorithms widely used in practice (Mean Imputation and Majority Method) on classification problems. In order to experimentally assess these algorithms, missing values were simulated on six datasets by means of two missingness mechanisms: Missing Completely at Random (MCAR) and Missing at Random (MAR). The latter allows the probabilities of missingness to depend on observed data but not on missing data, whereas the former occurs when the distribution of missingness does not depend on the observed data either. The quality of the imputed values is assessed by two measures: prediction ability and classification bias. Experimental results show that IKNNImpute outperforms the other algorithms under the MCAR mechanism, while KNNImpute, SKNN, and EACImpute, in turn, provided the best results under the MAR mechanism. Finally, our experiments also show that the best prediction results (in terms of mean squared errors) do not necessarily lead to less classification bias.
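As a reading aid for the two mechanisms, here is a minimal way to simulate them. Under MCAR every cell of the target column is masked with the same probability; under MAR the masking probability depends on an always-observed covariate. The column indices, probabilities, and function names are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_mcar(X, col, p=0.2):
    """MCAR: every row of column `col` is masked with probability p."""
    X = X.copy()
    X[rng.random(len(X)) < p, col] = np.nan
    return X

def mask_mar(X, col, driver, p_low=0.05, p_high=0.4):
    """MAR: the masking probability of `col` depends on observed column `driver`."""
    X = X.copy()
    high = X[:, driver] > np.median(X[:, driver])
    p = np.where(high, p_high, p_low)
    X[rng.random(len(X)) < p, col] = np.nan
    return X
```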
Proceedings of The International Conference on Data Science and Official Statistics, 2022
Missing values can cause bias and make the dataset unrepresentative of the actual situation. The selection of methods for handling missing values is important because it affects the estimates generated. Therefore, this study aims to compare three imputation methods for handling missing values: Hot-Deck Imputation, K-Nearest Neighbor Imputation (KNNI), and Predictive Mean Matching (PMM). Because the three methods work differently, their estimation results differ. The criteria used to compare the three methods are the Root Mean Squared Error (RMSE), Unsupervised Classification Error (UCE), Supervised Classification Error (SCE), and the time needed to run the algorithm. This study uses two analyses: a comparison analysis and a scoring analysis. The comparative analysis applies a simulation that accounts for the missing-value mechanism. The missing-value mechanisms used in the simulation are Missing Completely at Random (MCAR) and Missing at Random (MAR).
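Of the three methods compared, hot-deck imputation is the simplest to illustrate. Below is a minimal random hot-deck sketch under the assumption of a fully observed donor pool; production implementations usually restrict donors to matched adjustment classes, a refinement omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def hot_deck(X):
    """Each incomplete row copies its missing values from a random
    fully observed row (the "donor")."""
    X = np.asarray(X, dtype=float).copy()
    complete = X[~np.isnan(X).any(axis=1)]  # donor pool
    for row in X:
        miss = np.isnan(row)
        if miss.any() and len(complete):
            donor = complete[rng.integers(len(complete))]
            row[miss] = donor[miss]
    return X
```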
Computational Statistics & Data Analysis, 2006
Methods for imputation of missing data within the so-called least-squares approximation approach, a nonparametric, computationally efficient multidimensional technique, are experimentally compared. Contributions are made to each of the three components of the experimental setting: (a) the algorithms to be compared, (b) data generation, and (c) patterns of missing data. Specifically, "global" methods for least-squares data imputation are reviewed and extensions to them are proposed based on the nearest neighbours (NN) approach. A conventional generator of mixtures of Gaussian distributions is theoretically analysed and then modified to scale clusters differently. Patterns of missing data are defined in terms of rows and columns according to three different mechanisms, referred to as Random missings, Restricted random missings, and Merged database. It appears that NN-based versions almost always outperform their global counterparts. With the Random missings pattern, the winner is always the authors' two-stage method INI, which combines global and local imputation algorithms.
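The modified generator described here (a Gaussian mixture whose clusters are scaled differently) can be sketched as follows; all constants and the function name are illustrative, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mixture(n=300, p=5, k=3, scales=(0.5, 1.0, 2.0)):
    """Draw n points in p dimensions from k Gaussian clusters,
    each cluster with its own spread."""
    centers = rng.uniform(-5, 5, size=(k, p))
    labels = rng.integers(k, size=n)
    X = centers[labels] + rng.standard_normal((n, p)) * np.array(scales)[labels, None]
    return X, labels
```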
Argentine Symposium on Artificial Intelligence, 2001
University of Sao Paulo (USP), Institute of Mathematics and Computer Sciences (ICMC), Department of Computer Science and Statistics (SCE), Laboratory of Computational Intelligence (LABIC), PO Box 668, 13560-970, Sao Carlos, SP, Brazil. {gbatista, mcmonard}@icmc.sc.usp.br
Missing data are often encountered in many areas of research, and complete case analysis and the indicator method can lead to serious bias. One of the common remedies is the implementation of imputation methods. The main purpose of this paper is to review the agreement of imputation methods, the most widely used approach for filling in missing observations. Single and multiple imputation each have certain criteria that must be satisfied before adoption. Single imputation methods work well for short gaps of missing data; applying single imputation to long gaps of missing data causes systematic error, since the uncertainty of the imputed values is not reflected. Multiple imputation is recognized as the superior method for missing-at-random (MAR) data sets. Although the dominance of multiple imputation is known, adopting it requires a thorough understanding of the algorithms, especially in designing a suitable method to perform the imputations. Reviews assessing available imputation software are also presented to compare the practicality of the software.
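The multiple-imputation workflow the review describes (impute m times, analyse each completed dataset, pool the estimates) can be sketched with scikit-learn's IterativeImputer, a MICE-style chained-equations imputer. The pooling shown is only the averaged point estimate; full Rubin's rules would also combine within- and between-imputation variance.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(X, m=5):
    """Return m completed copies of X, each imputed with a different seed."""
    completed = []
    for seed in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        completed.append(imp.fit_transform(X))
    return completed

# Pool a column mean across the m completed datasets (point estimate only):
# estimates = [Xc[:, 0].mean() for Xc in multiple_impute(X, m=5)]
# pooled = np.mean(estimates)
```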
Journal of Hunan University Natural Sciences, 2022
Copious amounts of data are collected and stored each day, and that data can be used to extract interesting patterns. However, the data we collect is ordinarily incomplete, and using it to extract information may give misleading results, so we pre-process the data to remove the abnormalities. With a low rate of missing values, the affected instances can be ignored, but with large amounts, ignoring them will not give the desired results. Many missing fields in a dataset are a big problem for analysts because they can lead to numerous issues in quantitative analyses. So, before performing any data mining procedure to extract useful information from a dataset, some pre-processing of the data can be done to avoid such pitfalls and thereby improve the quality of the data. For handling such missing values, many techniques have been proposed since 1980. The simplest procedure is to discard the records containing missing values. Another approach is imputation, which replaces the missing fields with estimates obtained by certain computations; this increases the quality of the data and improves prediction results. This paper reviews methods for handling missing data such as median imputation (MDI), hot (cold) deck imputation, regression imputation, expectation maximization (EM), support vector machine imputation (SVMI), multivariate imputation by chained equations (MICE), the SICE technique, reinforcement programming, nonparametric iterative imputation algorithms (NIIA), and multilayer perceptrons. It also explores good options for methods to estimate missing values that other researchers in this field can use, and aims to help them figure out which methods are commonly used now. The overview may also provide insight into each method and its advantages and limitations, to be considered in future research in this field. It can serve as a baseline for answering the questions of which techniques have been used and which is the most popular.
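As a concrete instance of the simplest technique on this list, median imputation (MDI) is a one-liner with scikit-learn's SimpleImputer; the toy matrix below is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
# Each column's missing cells are filled with that column's median.
X_filled = SimpleImputer(strategy="median").fit_transform(X)
# Column medians 4.0 and 2.5 fill the two missing cells.
```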