As part of my effort to complete the Machine Learning A-Z Udemy course, this series of posts, starting with this one, will contain the notes I gathered from it.
Dependent vs independent variables
- Dependent — variable being tested and measured — predicted result
- Independent — variable being changed or controlled — features
Libraries used (Python):
- numpy — library containing mathematical tools
- matplotlib.pyplot — plotting library
- pandas — importing datasets
- sklearn.preprocessing — library for preprocessing data
Importing dataset with pandas:
import pandas as pd
pd.read_csv(FILE_NAME)
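A minimal sketch of this step. The column names and values are hypothetical (in practice you would call `pd.read_csv` on your CSV file); the convention assumed here is that the dependent variable sits in the last column, with the independent variables (features) before it:

```python
import pandas as pd

# Hypothetical dataset built in memory; with a real file you would use
# df = pd.read_csv(FILE_NAME) instead.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

# Independent variables (features): every column except the last.
X = df.iloc[:, :-1].values
# Dependent variable (predicted result): the last column.
y = df.iloc[:, -1].values
```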
Missing data
Option 1:
- remove rows with missing data
- dangerous because we might be losing valuable information
Option 2:
- set missing values to mean of that feature
Library used:
sklearn.preprocessing.Imputer
Categorical data
Labels need to be converted into numbers — Euclidean distance can’t be calculated on labels
Library:
sklearn.preprocessing.LabelEncoder
Problem with LabelEncoder: converting labels into numbers can be misleading, because numbers have an order while the labels do not necessarily have one.
Solution: create one binary feature per label
Library:
sklearn.preprocessing.OneHotEncoder
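A sketch of both encoders on a hypothetical country column. `LabelEncoder` maps each label to an integer (introducing a spurious order), while `OneHotEncoder` produces one binary column per label:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Spain", "Germany", "France"])

# LabelEncoder: labels -> integers, assigned in alphabetical order,
# so the numbers imply an ordering the labels do not have.
labels = LabelEncoder().fit_transform(countries)

# OneHotEncoder: one binary feature per label, no implied order.
# fit_transform returns a sparse matrix by default; toarray() densifies it.
onehot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()
```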
Splitting data
To create a model, the data needs to be split into two sets: train and test. The train set is the one we use to build the model, and the test set is the one we use to evaluate how well that model performs.
Library:
sklearn.model_selection.train_test_split
Usual ratio: 70–80% of the data for the train set
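A minimal sketch of the split, using toy data. `test_size=0.2` keeps 80% of the samples for training, matching the usual ratio above; `random_state` fixes the shuffle so the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10)

# 80% train / 20% test; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```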
Feature scaling
A feature with large values can dominate features with smaller values, which is why all features should be brought to the same scale.
Option 1, standardization:
Subtract the feature's mean from each value, then divide by its standard deviation.
Option 2, normalization:
Subtract the minimum value of x from each x, then divide by the difference between the maximum and minimum of x.
Library:
sklearn.preprocessing.StandardScaler
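A sketch of both options on a toy single-feature matrix: `StandardScaler` applies the standardization formula, and min-max normalization is computed by hand to show the formula directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Option 1, standardization: (x - mean) / std, per feature.
X_std = StandardScaler().fit_transform(X)

# Option 2, normalization (min-max): (x - min) / (max - min), per feature.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

After standardization each feature has mean 0 and unit variance; after normalization all values fall in [0, 1].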