As part of my effort to complete the Machine Learning A-Z Udemy course, this series of posts, starting with this one, will contain the notes I gathered from it.
Dependent vs independent variables
- Dependent — variable being tested and measured — predicted result
- Independent — variable being changed or controlled — features
Libraries used (Python):
- numpy — library containing mathematical tools
- matplotlib.pyplot — plotting library
- pandas — importing datasets
- sklearn.preprocessing — library for preprocessing data
Importing dataset with pandas:
import pandas as pd
pd.read_csv(FILE_NAME)
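A minimal sketch of this step. The column names and values are hypothetical (in practice you would call `pd.read_csv` on your CSV file); the convention assumed here is that the dependent variable sits in the last column, with the independent variables (features) before it:

```python
import pandas as pd

# Hypothetical dataset built in memory; with a real file you would use
# df = pd.read_csv(FILE_NAME) instead.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

# Independent variables (features): every column except the last.
X = df.iloc[:, :-1].values
# Dependent variable (predicted result): the last column.
y = df.iloc[:, -1].values
```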
Missing data
Option 1:
- remove rows with missing data
- dangerous because we might be losing valuable information
Option 2:
- set missing values to mean of that feature
Library used:
sklearn.preprocessing.Imputer
Categorical data
Labels need to be converted into numbers — Euclidean distance can’t be calculated on labels
Library:
sklearn.preprocessing.LabelEncoder
Problem with LabelEncoder: converting labels into numbers can be misleading, because numbers have an order while the labels do not necessarily have one.
Solution: create one binary feature per label
Library:
sklearn.preprocessing.OneHotEncoder
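A sketch of both encoders on a hypothetical country column. `LabelEncoder` maps each label to an integer (introducing a spurious order), while `OneHotEncoder` produces one binary column per label:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Spain", "Germany", "France"])

# LabelEncoder: labels -> integers, assigned in alphabetical order,
# so the numbers imply an ordering the labels do not have.
labels = LabelEncoder().fit_transform(countries)

# OneHotEncoder: one binary feature per label, no implied order.
# fit_transform returns a sparse matrix by default; toarray() densifies it.
onehot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()
```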
Splitting data
To create a model, the data needs to be split into two sets: train and test. The train set is the one we use to build the model, and the test set is the one we use to evaluate how well that model performs.
Library:
sklearn.model_selection.train_test_split
Usual ratio: 70–80% of the data for the train set
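A minimal sketch of the split, using toy data. `test_size=0.2` keeps 80% of the samples for training, matching the usual ratio above; `random_state` fixes the shuffle so the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10)

# 80% train / 20% test; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```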
Feature scaling
A feature with large values can dominate features with smaller values, which is why all features should be brought to the same scale.
Option 1, standardization:
Subtract the feature's mean from each value, then divide by its standard deviation.
Option 2, normalization:
Subtract the minimum value of x from each x, then divide by the difference between the maximum and minimum of x.
Library:
sklearn.preprocessing.StandardScaler
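A sketch of both options on a toy single-feature matrix: `StandardScaler` applies the standardization formula, and min-max normalization is computed by hand to show the formula directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Option 1, standardization: (x - mean) / std, per feature.
X_std = StandardScaler().fit_transform(X)

# Option 2, normalization (min-max): (x - min) / (max - min), per feature.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

After standardization each feature has mean 0 and unit variance; after normalization all values fall in [0, 1].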