Label encoding maps each category in a dataset to a number, which can lead a model to compare the categories as if they were ordered. To avoid that, we use one hot encoding.
About the video "How to implement One Hot Encoding on Categorical Data | Dummy Encoding":
A simple approach is to use integer (label) encoding, but when the categorical variables are nominal, label encoding can be problematic because it implies an order that the categories do not have. One hot encoding is the technique that helps in this situation. In this tutorial, we use the pandas get_dummies method to create dummy variables, which performs one hot encoding on a given dataset. Alternatively, we can use OneHotEncoder from sklearn.preprocessing to create the dummy variables.
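To make the difference concrete, here is a minimal sketch on a made-up column (the `toy` DataFrame and its values are assumptions for illustration, not data from the video). It contrasts integer codes, which imply an order, with the indicator columns that get_dummies produces.

```python
import pandas as pd

# Hypothetical toy data: a nominal column with no natural order.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Label (integer) encoding: categories are sorted, then numbered C=0, Q=1, S=2.
# The numbers suggest S > Q > C, an order the data does not have.
label_codes = toy["Embarked"].astype("category").cat.codes

# One hot encoding: one 0/1 indicator column per category, no implied order.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Each row of `one_hot` has exactly one 1, so no category is numerically "larger" than another.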
In this video we will discuss how to convert categorical variables into numeric (0/1) dummy variables.
At the end, we will also see how to save the encoder object to a file using the joblib library in Python and reuse it.
Code for this video:
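import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Load the data and keep the two categorical columns
data = pd.read_csv('titanic.csv')
data_cat = data[['Sex','Embarked']]
# Option 1: pandas dummy variables (dummy_na=True adds a column for missing values)
pd.get_dummies(data_cat, dummy_na=True, drop_first=True)
# Option 2: scikit-learn OneHotEncoder, fitted after filling missing values with 'Missing'
df_2 = data_cat
ohe = OneHotEncoder(categories='auto', drop='first')
ohe.fit(df_2.fillna('Missing'))
df_3 = ohe.transform(df_2.fillna('Missing')).toarray()
# get_feature_names was removed in scikit-learn 1.2; on older versions use ohe.get_feature_names()
pd.DataFrame(df_3, columns=ohe.get_feature_names_out())
# Use 'Gender' instead of 'Sex' as the prefix in the output column names
df_3 = pd.DataFrame(df_3, columns=[c.replace('Sex_', 'Gender_') for c in ohe.get_feature_names_out()])
# Save the fitted encoder to a file and load it back for reuse
joblib.dump(ohe, filename='ohe.pkl')
saved_ohe = joblib.load('ohe.pkl')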
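As a sketch of what reusing the saved encoder might look like, the example below fits an encoder on a few made-up rows, saves and reloads it with joblib, and applies the loaded object to new rows. The `train` and `new_rows` DataFrames and the temporary file path are assumptions for illustration; the video uses titanic.csv and 'ohe.pkl'.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows standing in for the Titanic columns.
train = pd.DataFrame({"Sex": ["male", "female", "female"],
                      "Embarked": ["S", "C", None]}).fillna("Missing")

ohe = OneHotEncoder(categories="auto", drop="first")
ohe.fit(train)

# Save the fitted encoder to a temporary file and load it back.
path = os.path.join(tempfile.mkdtemp(), "ohe.pkl")
joblib.dump(ohe, path)
loaded = joblib.load(path)

# New rows must be prepared the same way: same columns, same fillna value.
new_rows = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": [None, "S"]}).fillna("Missing")
encoded = loaded.transform(new_rows).toarray()
print(encoded)
```

The loaded encoder keeps the categories learned during fitting, so the new rows are encoded into the same columns, in the same order, as the training data.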
