Categorical variables represent groups that are qualitative in nature. Some examples are colors, such as blue, red, or black, or country of birth, such as Ireland, England, or the USA. While these are easily understood by a human, you will need to encode categorical features as numeric values to use them in your machine learning models.
As an example, here is a table of the country of residence of different respondents in the Stack Overflow survey. To get from qualitative inputs to quantitative features, one might naively assume that assigning every category in a column a number would suffice, for example India could be 1, USA 2, and so on. But these categories are unordered, so imposing an order like this can seriously harm your model's performance.
Thus, you cannot allocate arbitrary numbers to each category, as doing so would imply some form of ordering between the categories.
Instead, values can be encoded by creating additional binary features corresponding to whether each value was picked or not as shown in the table on the right.
In doing so your model can leverage the information of what country is given, without inferring any order between the different options.
There are two main approaches to representing categorical columns in this way: one-hot encoding and dummy encoding. These are very similar and often confused. In fact, by default, pandas performs one-hot encoding when you use the get_dummies() function.
One-hot encoding converts n categories into n features, as shown here. You can use the get_dummies() function to one-hot encode columns. The function takes a DataFrame and a list of the categorical columns you want converted into one-hot encoded columns, and returns an updated DataFrame with these columns included.
Specifying a prefix with the prefix argument can improve readability, like the letter C for country used here.
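As a minimal sketch, assuming a DataFrame named so_survey_df with a Country column (the name and values here are illustrative, not the actual survey data):

    import pandas as pd

    # Illustrative stand-in for the survey data
    so_survey_df = pd.DataFrame({'Country': ['India', 'USA', 'France', 'USA']})

    # One-hot encoding: one binary column per category,
    # prefixed with C for readability (C_France, C_India, C_USA)
    one_hot_df = pd.get_dummies(so_survey_df, columns=['Country'], prefix='C')
    print(one_hot_df)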
On the other hand, dummy encoding creates n-1 features for n categories, omitting the first category. Notice that this time there is no feature for France, the first category. In dummy encoding, the base value, France in this case, is encoded by the absence of all other countries, as you can see in the last row here, and its value is captured by the intercept. For dummy encoding, you can use the same get_dummies() function with an additional argument, drop_first, set to True as shown here.
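With the same illustrative DataFrame, dummy encoding only needs the extra argument:

    # Dummy encoding: n-1 binary columns; the first category (France) is dropped
    dummy_df = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='C')
    print(dummy_df)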
Both of these methods have different advantages. One-hot encoding generally creates more explainable features, as each country will have its own weight that can be observed after training. But be aware that one-hot encoding may create features that are entirely collinear, due to the same information being represented multiple times.
Take, for example, a simpler categorical column recording the sex of the survey takers. Once a male column records a 1 for male, whether the person is female is already known: it is exactly when the male column is 0. This double representation can lead to instability in your models, so dummy encoding would be more appropriate.
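A sketch with a made-up two-value column makes the redundancy concrete:

    # Hypothetical column with two categories
    survey = pd.DataFrame({'Sex': ['Male', 'Female', 'Female', 'Male']})

    # One-hot: Sex_Female and Sex_Male always sum to 1, so they are perfectly collinear
    print(pd.get_dummies(survey, columns=['Sex']))

    # Dummy encoding keeps a single column; Female is implied when Sex_Male is 0
    print(pd.get_dummies(survey, columns=['Sex'], drop_first=True))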
However, both one-hot encoding and dummy encoding may result in a huge number of columns being created if a column has too many different categories. In these cases, you may want to only create columns for the most common values. You can check the number of occurrences of the different values in a column using the value_counts() method on that column.
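For instance, continuing with the illustrative DataFrame from above:

    # Count how often each country occurs, most frequent first
    counts = so_survey_df['Country'].value_counts()
    print(counts)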
Once you have these counts of occurrences, you can use them to limit which values you keep by first creating a mask of the values that occur fewer than n times. A mask is a list of booleans indicating which values in a column should be affected. First, find the categories that occur fewer than n times using the index attribute, and wrap this inside the isin() method.
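A sketch of the mask, with the threshold n chosen arbitrarily for illustration:

    # True for rows whose Country occurs fewer than n times
    n = 10
    mask = so_survey_df['Country'].isin(counts[counts < n].index)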
After you create the mask, you can use it to replace the categories that occur fewer than n times with a value of your choice, as shown here.
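For example, the rare countries could be collapsed into a catch-all label (the label 'Other' is an arbitrary choice for illustration):

    # Replace rare categories in place using the boolean mask
    so_survey_df.loc[mask, 'Country'] = 'Other'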
Let's put what has been learned into practice and work with some categorical variables.