Tag Archives for "Feature Encoding Techniques"

Categorical Feature Encoding in SAS (Bayesian Encoders)

What is Bayesian Encoding?

Bayesian Encoding replaces each category of a categorical variable with a statistic computed from the target variable, so it takes into account both intra-category variation and the target mean. It is a form of target-based encoding that comes with several advantages. For example, each of the encoders below produces a single numeric column per variable, no matter how many categories the variable has.

In this blog post, we talk about the different Bayesian encoding techniques and how they work.

1. Target/Mean Encoding

Target or Mean Encoding is one of the most commonly used encoding techniques in Kaggle competitions.

Target encoding replaces each class value of the categorical variable with the mean of the target variable for that class, computed on the training dataset.

Hence, we have to specify the target variable in the SAS Mean Encoding Macro, as shown in the code below.

See this link for more information about categorical variable encoding.

SAS Macro for Target/Mean Encoding

%macro mean_encoding(dataset,var,target);
   proc sql;
     /* Mean of the target for each level of the variable, rounded to one decimal place */
     create table mean_table as
     select &var as gr, round(mean(&target),0.1) as mean_encode
     from &dataset
     group by &var;

     /* Attach the encoded value to every row of the input data */
     create table new as
     select d.*, m.mean_encode
     from &dataset as d
     left join mean_table as m
       on d.&var=m.gr;
   quit;
%mend;

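
As a quick usage sketch, the macro can be tried on a small made-up table. The dataset WORK.LOANS and its columns city and defaulted are hypothetical names chosen for illustration; only the macro name and its three parameters come from the code above.

```sas
/* Hypothetical toy data: a categorical predictor and a 0/1 target */
data work.loans;
  length city $ 5;
  input city $ defaulted;
  datalines;
Paris 1
Paris 0
Rome 1
Rome 1
Oslo 0
Oslo 0
;
run;

/* Creates WORK.MEAN_TABLE (one row per city) and WORK.NEW (input rows plus mean_encode) */
%mean_encoding(work.loans, city, defaulted);

proc print data=work.new noobs;
run;
```

With this data the per-city target means are 0.5 (Paris), 1 (Rome) and 0 (Oslo), so those are the values to expect in mean_encode.
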
2. Weight of Evidence Encoding

Weight of Evidence (WoE) is a measure of the "strength" of a grouping technique that is used to separate good outcomes from bad ones. This method was developed primarily to build predictive models that evaluate the risk of loan default in the credit and financial industry.

WoE is 0 when P(Goods) / P(Bads) = 1, that is, when the outcome is random for that group. If P(Bads) > P(Goods), the odds are less than 1 and the WoE is less than 0. If, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.

WoE is well suited for Logistic Regression because the logit transformation is simply the log of the odds, i.e. ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded to the same scale, and the parameters in the linear logistic regression equation can be directly compared.
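
To make the sign rule concrete, here is a minimal sketch that computes WoE for three groups. The dataset name WORK.WOE_DEMO, the group labels, and the P(Goods) values are all invented for illustration.

```sas
/* Hypothetical P(Goods) values for three groups */
data work.woe_demo;
  input grp $ p_good;
  p_bad = 1 - p_good;
  woe   = log(p_good / p_bad);   /* LOG is the natural logarithm in SAS */
  datalines;
A 0.5
B 0.2
C 0.8
;
run;

proc print data=work.woe_demo noobs;
run;
```

Group A (P(Goods) = 0.5) gets WoE = 0, group B (0.2) gets ln(0.25), roughly -1.386, and group C (0.8) gets ln(4), roughly 1.386.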

SAS Macro for Weight of Evidence Encoding

%macro woe_encoding(dataset,var,target);
   proc sql noprint;
     /* P(Goods) per level: mean of a 0/1 target, rounded to one decimal place */
     create table stats as
     select &var as gr, round(mean(&target),0.1) as mean_encode
     from &dataset
     group by &var;
   quit;

   data stats;
     set stats;
     bad_prob=1-mean_encode;
     /* Floor both probabilities so the ratio and its log are always defined */
     if bad_prob=0 then bad_prob=0.0001;
     if mean_encode=0 then mean_encode=0.0001;
     me_by_bp=mean_encode/bad_prob;
     woe_encode=log(me_by_bp);
   run;

   proc sql noprint;
     /* Attach the WoE value to every row of the input data */
     create table new as
     select d.*, s.woe_encode
     from &dataset as d
     left join stats as s
       on d.&var=s.gr;
   quit;
%mend;

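
A short usage sketch for the macro above follows. The dataset WORK.ACCOUNTS and its columns segment and good are hypothetical; the data is arranged so that the three segments have target means of 0.5, 0.2 and 0.8.

```sas
/* Hypothetical toy data: good = 1 marks a good outcome */
data work.accounts;
  input segment $ good;
  datalines;
A 1
A 0
B 1
B 0
B 0
B 0
B 0
C 1
C 1
C 1
C 1
C 0
;
run;

/* Creates WORK.STATS (one row per segment) and WORK.NEW (input rows plus woe_encode) */
%woe_encoding(work.accounts, segment, good);

proc print data=work.new noobs;
run;
```

Under these assumptions woe_encode should come out at 0 for segment A, about -1.386 for B and about 1.386 for C.
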
3. Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence. The only difference is that the ratio of good and bad probabilities is used directly, without taking the logarithm. For each label, we calculate the mean of target=1, that is, the probability of being 1 ( P(1) ), and also the probability of target=0 ( P(0) ). Then, we calculate the ratio P(1)/P(0) and replace the labels with that ratio.

We need to substitute a small value for P(0) to avoid division by zero when a particular category has no rows with target=0. Check out this link for more information.
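
The arithmetic, including the divide-by-zero guard, can be sketched in a single data step. The dataset WORK.RATIO_DEMO, the level names, and the P(1) values are hypothetical; the 0.0001 floor is the same value the macro in this section uses.

```sas
/* Hypothetical P(1) values for three levels */
data work.ratio_demo;
  input level $ p1;
  p0 = 1 - p1;
  if p0 = 0 then p0 = 0.0001;   /* floor P(0) so the ratio is always defined */
  ratio = p1 / p0;
  datalines;
X 0.5
Y 0.8
Z 1
;
run;

proc print data=work.ratio_demo noobs;
run;
```

Level X gets a ratio of 1, level Y a ratio of about 4, and level Z, which has no target=0 rows, gets 1 / 0.0001 = 10000 instead of a division error.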

SAS Macro for Probability Ratio Encoding

%macro probability_encoding(dataset,var,target);
   proc sql noprint;
     /* P(1) per level: mean of a 0/1 target, rounded to one decimal place */
     create table stats as
     select &var as gr, round(mean(&target),0.1) as mean_encode
     from &dataset
     group by &var;
   quit;

   data stats;
     set stats;
     bad_prob=1-mean_encode;
     /* Floor P(0) so the ratio is always defined */
     if bad_prob=0 then bad_prob=0.0001;
     prob_encode=mean_encode/bad_prob;
   run;

   proc sql noprint;
     /* Attach the ratio to every row of the input data */
     create table new as
     select d.*, s.prob_encode
     from &dataset as d
     left join stats as s
       on d.&var=s.gr;
   quit;
%mend;

Wrapping Up

Categorical Feature Encoding is an important part of cleaning up data for machine learning models. However, each method suits different circumstances, so it is important to know the different techniques that fall under the Bayesian category.

If you want to take a look at how the coding operates in a SAS environment, you can find all the SAS Macro Definition code on my GitHub page here.

blog-categorical-feature-encoding-bayesian by Selerity

SAS Macro examples for the Blog Post “Categorical Feature Encoding in SAS (Bayesian Encoders)”

5 Categorical Feature Encoding Techniques in SAS

What is Categorical Feature Encoding?

Categorical variables are usually represented as strings drawn from a limited set of values. Categorical feature encoding is the process of converting such variables into a numeric format that machine learning models can understand.

The performance of machine learning models depends on several factors. One of them is the method used to process data and feed it to the model. As such, encoding is a crucial step because it converts categorical variables into a form understandable by machine learning models. Encoding data elevates model quality and helps in feature engineering.

In this blog, we explore the different classic encoding methods along with a snapshot of how each encoding method works in SAS Macro.

1. Label Encoding

Label Encoding assigns an integer from 1 to N to each class of a categorical feature. For instance, if there is a variable "Hair color" with values of Black, Brown, and Red, Label Encoding will replace these values with 1, 2, and 3. However, one problem with Label Encoding is that the integers imply an order between class levels that may not exist. Machine learning algorithms can still treat the codes as ordered quantities, which may lead to inaccurate results.

SAS Macro for Label Encoding

Here is an example macro to perform Label Encoding in SAS:

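%macro label_encode(dataset,var);
   proc sql noprint;
     /* One macro variable per distinct level: val1, val2, ... */
     select distinct &var
       into :val1-
     from &dataset;
     /* Number of distinct levels */
     select count(distinct &var) into :mx from &dataset;
   quit;

   data new;
     set &dataset;
     /* Generate one IF statement per level; the i-th level gets code i */
     %do i=1 %to &mx;
       if &var="&&val&i" then new=&i;
     %end;
   run;
%mend;
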
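
Here is a small usage sketch built on the "Hair color" example. The dataset WORK.PEOPLE and the column name hair are hypothetical.

```sas
/* Hypothetical data matching the Hair color example */
data work.people;
  length hair $ 5;
  input hair $;
  datalines;
Black
Brown
Red
Brown
;
run;

/* Creates WORK.NEW: the input rows plus a numeric column named NEW */
%label_encode(work.people, hair);

proc print data=work.new noobs;
run;
```

PROC SQL typically returns distinct values in sorted order, so Black, Brown and Red would be coded 1, 2 and 3 here, but that ordering is not guaranteed and should not be relied on.
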
2. Binary Encoding

Binary Encoding converts class values into numeric values, like Label Encoding does. However, Binary Encoding takes it a step further and converts the numeric values into binary numbers, where each digit has its own separate column.

"If there are n unique categories, then binary encoding results in only log₂(n) features". In practice, the count is rounded up to a whole number of columns.

For more information, visit here.
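
The column count can be checked with a short data step. This is an illustrative sketch only: the dataset name WORK.BINARY_WIDTH and its variables are hypothetical. It also shows a detail worth knowing: ceil(log2(n)) assumes codes running from 0 to n-1, whereas label codes that run from 1 to n, as in the Label Encoding section, need floor(log2(n)) + 1 digits.

```sas
/* Number of binary digits needed for n categories */
data work.binary_width;
  do n = 2, 3, 8, 9, 100;
    cols_zero_based = ceil(log2(n));        /* codes 0 .. n-1 */
    cols_one_based  = floor(log2(n)) + 1;   /* codes 1 .. n   */
    output;
  end;
run;

proc print data=work.binary_width noobs;
run;
```

For n = 8, for example, the two counts are 3 and 4, because the code 8 is written as 1000 in binary.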

SAS Macro for Binary Encoding

Here is an example macro for Binary Encoding in SAS:

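%macro binary_encoding(dataset,var);
   proc sql noprint;
     /* One macro variable per distinct level: val1, val2, ... */
     select distinct &var
       into :val1-
     from &dataset;
     /* Number of distinct levels */
     select count(distinct &var) into :mx from &dataset;
   quit;

   data new;
     set &dataset;
     /* Label-encode first: the i-th level gets code i */
     %do i=1 %to &mx;
       if &var="&&val&i" then new=&i;
     %end;
     /* Display the numeric code as binary digits */
     format new binary.;
   run;
%mend;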

This macro creates a single variable with a binary formatted value. To split those values into multiple columns, you could create a Split Column Macro.

SAS Macro for Splitting Column

Here is an example macro for splitting columns in SAS:

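%macro split_column(data,var);
   /* Write the numeric value out as a string of binary digits */
   data try;
     set &data;
     cha=put(&var, binary.);
   run;

   proc sql noprint;
     /* Number of digits, which is the number of columns to create */
     select max(length(cha)) into :ln from try;
   quit;

   /* One character column per digit: c_1, c_2, ... */
   data &data;
     set try;
     %do i=1 %to &ln;
       c_&i=substr(cha,&i,1);
     %end;
   run;
%mend;
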
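
The two macros can be chained as in the sketch below. The dataset WORK.PEOPLE and the column hair are hypothetical; the second call assumes that binary_encoding has written its result to WORK.NEW with the code stored in the column NEW, as in the macro above.

```sas
/* Hypothetical input data */
data work.people;
  length hair $ 5;
  input hair $;
  datalines;
Black
Brown
Red
Brown
;
run;

/* Step 1: WORK.NEW gains the numeric column NEW, displayed in binary */
%binary_encoding(work.people, hair);

/* Step 2: WORK.NEW gains one character column per binary digit (c_1, c_2, ...) */
%split_column(work.new, new);

proc print data=work.new noobs;
run;
```

Because the BINARY. format is used without an explicit width, the number of digit columns follows the format's default width rather than the minimum needed for the data, so most of the leading columns will contain only zeros.
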
3. One-Hot Encoding

One-Hot Encoding creates one new column per category. Each column holds 1 for rows that belong to that category and 0 otherwise. These indicator columns can then be fed into machine learning, deep learning, and statistical algorithms that require numeric inputs.

SAS Macro for One-Hot Encoding

Here is an example macro to do One-Hot encoding in SAS:

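%macro hot_encoding(data,var);
   proc sql noprint;
     /* One macro variable per distinct level: val1, val2, ... */
     select distinct &var
       into :val1-
     from &data;
     /* Number of distinct levels */
     select count(distinct &var) into :len from &data;
   quit;

   data encoded_data;
     set &data;
     /* One 0/1 column per level; COMPRESS strips characters that are not valid in a column name */
     %do i=1 %to &len;
       if &var="&&val&i" then %sysfunc(compress(&&val&i,'$ - /'))=1;
       else %sysfunc(compress(&&val&i,'$ - /'))=0;
     %end;
   run;
%mend;
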
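
A brief usage sketch follows, again with a hypothetical dataset WORK.PEOPLE and column hair. The level values are chosen so that each one is already a valid SAS column name.

```sas
/* Hypothetical input data */
data work.people;
  length hair $ 5;
  input hair $;
  datalines;
Black
Brown
Red
Brown
;
run;

/* Creates WORK.ENCODED_DATA with one indicator column per hair color */
%hot_encoding(work.people, hair);

proc print data=work.encoded_data noobs;
run;
```

WORK.ENCODED_DATA should contain three new 0/1 columns named Black, Brown and Red. Levels that start with a digit or contain other special characters would need extra cleaning before they can serve as column names.
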
4. Count/Frequency Encoding

As the name suggests, Frequency Encoding counts the occurrences of each class value, then divides each count by the total number of values. This encoding technique lets the model assign weight in direct or inverse proportion to how common a category is.

SAS Macro for Count/Frequency Encoding

Here is an example of a macro for Frequency Encoding in SAS:

%macro frequency_encoding(dataset, var);
   proc sql noprint;
     /* Row count for each level of the variable */
     create table freq as
     select &var as gr, count(&var) as number
     from &dataset
     group by &var;

     /* Relative frequency = level count / total count.
        COUNT has no GROUP BY here, so PROC SQL remerges the overall count onto every row. */
     create table new as
     select d.*, round(f.number/count(d.&var),0.01) as freq_encode
     from &dataset as d
     left join freq as f
       on d.&var=f.gr;
   quit;
%mend;

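
The macro can be exercised with the sketch below. The dataset WORK.PEOPLE and the column hair are hypothetical names used only for illustration.

```sas
/* Hypothetical input data: Brown appears twice, the other colors once */
data work.people;
  length hair $ 5;
  input hair $;
  datalines;
Black
Brown
Red
Brown
;
run;

/* Creates WORK.NEW: the input rows plus freq_encode */
%frequency_encoding(work.people, hair);

proc print data=work.new noobs;
run;
```

With four rows in total, Brown should be encoded as 0.5 and both Black and Red as 0.25.
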
5. Effect/Sum/Deviation Encoding

This technique goes by several names: some analysts call it Effect Encoding, some Deviation Encoding, and some Sum Encoding, but the meaning and the definition are the same. Deviation Encoding is the same as One-Hot Encoding with one difference: one category is chosen as the reference, its column is dropped, and the rows that would otherwise hold 0 in all remaining columns are set to -1 instead. For example:

(Figures: example tables for One Hot Encode and for Effect/Sum/Deviation encoding.)
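
The difference between the two schemes can be sketched in one data step. The dataset WORK.EFFECT_DEMO, the column names, and the choice of Red as the reference level are all assumptions made for illustration.

```sas
/* Side-by-side one-hot and effect coding for three hypothetical levels */
data work.effect_demo;
  length hair $ 5;
  input hair $;
  /* One-hot columns: a comparison evaluates to 1 when true and 0 when false */
  oh_black = (hair = "Black");
  oh_brown = (hair = "Brown");
  oh_red   = (hair = "Red");
  /* Effect coding: no column for Red; Red rows get -1 in the remaining columns */
  eff_black = oh_black - oh_red;
  eff_brown = oh_brown - oh_red;
  datalines;
Black
Brown
Red
;
run;

proc print data=work.effect_demo noobs;
run;
```

The Black row is coded (1, 0), the Brown row (0, 1), and the Red row (-1, -1) in the two effect columns.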

SAS Macro for Effect/Sum/Deviation Encoding

Here is an example macro for Deviation Encoding in SAS:

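%macro sum_encoding(data,var);
   proc sql noprint;
     /* One macro variable per distinct level: val1, val2, ... */
     select distinct &var
       into :val1-
     from &data;
     /* Number of distinct levels */
     select count(distinct &var) into :len from &data;
   quit;
   /* Strip the leading blanks PROC SQL stores with the count, so the last level can be looked up by number */
   %let len=&len;

   /* Step 1: one-hot encode every level */
   data encoded_data;
     set &data;
     %do i=1 %to &len;
       if &var="&&val&i" then %sysfunc(compress(&&val&i,'$ - /'))=1;
       else %sysfunc(compress(&&val&i,'$ - /'))=0;
     %end;
   run;

   /* Step 2: the last level is the reference. Its rows get -1 in every other column and its own column is dropped. */
   data sum_encode;
     set encoded_data;
     if %sysfunc(compress(&&val&len,'$ - /'))=1 then do;
       %do x=1 %to %eval(&len-1);
         %sysfunc(compress(&&val&x,'$ - /'))=-1;
       %end;
     end;
     drop %sysfunc(compress(&&val&len,'$ - /'));
   run;
%mend;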

Wrapping Up

Data scientists are often said to spend 70-80% of their time cleaning and preparing data, so encoding or converting categorical data is a crucial part of their work. Selecting the right encoding technique matters for data quality, which is why it is important to understand the different encoding methods.

If you are looking for more information, specifically the SAS Macro Definition code, you can check it out here.