PREPROCESSING – Where do you go if you don’t have the data?!

multivariate imputation, univariate imputation, preprocessing

Everyone hopes to start from a complete data frame whose features all cover the same depth of data, so that a Machine Learning model can be built on a solid foundation. Unfortunately, in the real world the opposite tends to happen, and it is routine to work with datasets that contain missing values.

The first solution at hand is usually to remove the rows that contain blank or NaN values. However, this choice is often counterproductive, because it can discard valuable information, especially when the dataset already has little historical depth. Given how often the problem occurs and how much it weighs on model performance, it seems appropriate to highlight some ideas worth considering in the preprocessing phase. In particular, an effective alternative to classical approaches such as PCA (Principal Component Analysis) and Feature Selection is the imputation of missing data, that is, estimating the missing values from the known ones.

“Univariate” imputation fills in the missing values of a given feature using only the non-missing values of that same feature. In Python, this is done with the SimpleImputer class from the sklearn.impute module, which by default replaces missing values with the mean of the known data. Alternatively, the strategy can be set to median, most_frequent, or constant; the last two also work with string data.
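As a minimal sketch of the univariate strategies described above, the snippet below applies SimpleImputer to a small array. The array X and its values are made up purely for illustration; only the class and its strategy and fill_value parameters come from scikit-learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (invented for illustration): two numeric features with gaps.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 20.0],
])

# Default strategy is "mean": each NaN is replaced by the mean of the
# known values in its own column.
mean_imputer = SimpleImputer()  # equivalent to SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# "median" and "most_frequent" are selected the same way.
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

# "constant" fills every gap with the value passed as fill_value.
X_const = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

print(X_mean)
```

Each column is treated independently here, which is what makes the approach univariate: the gap in the first column is filled from the first column alone, regardless of what the second column contains.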

A further approach to filling the dataset, of the “multivariate” type, models each feature with missing values as a function of the other features, cycling through the features in round-robin fashion. In Python, this approach is available as IterativeImputer, again from the sklearn.impute module.
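The multivariate idea can be sketched as follows. The toy array is invented for illustration, with the second column roughly twice the first so that one feature is informative about the other. Note that scikit-learn still marks IterativeImputer as experimental, so it has to be enabled with an explicit import first; max_iter=10 and random_state=0 are arbitrary choices made for this example.

```python
import numpy as np
# IterativeImputer is flagged experimental in scikit-learn; this import
# must come before importing it from sklearn.impute.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (invented for illustration): column 1 is about 2 * column 0.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each feature with missing values is regressed on the other features,
# and the features are revisited in turn for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Unlike the univariate case, the value estimated for a gap depends on the other features in the same row, so rows with different observed values generally receive different estimates.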

The multivariate approach is more sophisticated than the univariate one. Either way, both SimpleImputer and IterativeImputer can be used in a Pipeline to build a composite estimator that supports imputation.
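A composite estimator of this kind might look like the sketch below. The data, the median strategy, and the choice of LinearRegression as the final step are all assumptions made for illustration; any scikit-learn estimator could take its place.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy training data (invented for illustration) with missing entries.
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [3.0, np.nan],
    [4.0, 5.0],
    [5.0, 6.0],
])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# The imputer is fitted on the training data only, then the regressor is
# fitted on the imputed output.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X, y)

# New rows with gaps are imputed with the statistics learned during fit
# before being passed to the regressor.
X_new = np.array([[2.0, np.nan]])
pred = model.predict(X_new)

print(pred)
```

Keeping the imputer inside the pipeline means the fill values are learned from the training data alone, which avoids leaking information from validation or test rows during cross-validation. An IterativeImputer can be used as the first step in the same way.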

Author: Francesca Giannella | Senior Data Scientist DMBI
Photo by Marisa Morton on Unsplash

