Bayesian Encoding is a type of encoding that takes into account intra-category variation and the target mean when encoding categorical variables. It is a form of target encoding and comes with several advantages. For example, Bayesian Encoding takes little effort to implement compared to other encoding methods.
In this blog post, we talk about the different Bayesian encoding techniques and how they work.
Target or Mean Encoding is one of the most commonly used encoding techniques in Kaggle competitions.
Target encoding replaces each class value of the categorical variable with the mean value of the target variable for that class, computed on the training dataset.
Hence, we have to specify the target variable in the SAS Mean Encoding Macro, as shown in the code below.
Check out this link for more information about categorical variable encoding.
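%macro mean_encoding(dataset,var,target);
/* dataset: input table; var: categorical column; target: numeric target column */
proc sql;
/* Mean of the target per class, rounded to one decimal place */
create table mean_table as
select distinct(&var) as gr, round(mean(&target),0.1) as mean_encode
from &dataset
group by gr;
/* Attach each row's class mean; the result is written to the table NEW */
create table new as
select d.* , m.mean_encode
from &dataset as d
left join mean_table as m
on &var=m.gr;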
quit;
%mend;
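To make the arithmetic concrete, here is a minimal Python sketch of the same idea as the macro above. It is only an illustration and not part of the SAS code: the function name mean_encode and the toy city/target data are hypothetical, and unlike the macro it does not round the means.

```python
from collections import defaultdict

def mean_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Per-category mean of the target (plays the role of mean_table).
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    # Look the mean up for every row (plays the role of the left join).
    return [means[cat] for cat in categories], means

# Hypothetical toy data: a categorical column and a binary target.
city = ["A", "A", "B", "B", "B", "C"]
target = [1, 0, 1, 1, 0, 0]

encoded, means = mean_encode(city, target)
print(means)
print(encoded)
```

Class A has the targets 1 and 0, so every A row is encoded as 0.5; class C has a single 0, so it is encoded as 0.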
Weight of Evidence (WoE) is a measure of the “strength” of a grouping technique that is used to separate good and bad outcomes. This method was developed primarily to build predictive models that evaluate the risk of loan default in the credit and financial industry.
WoE will be 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the ratio will be < 1, and the WoE will be < 0. If, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.
WoE is well suited for Logistic Regression because the logit transformation is simply the log of the odds, i.e. ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded to the same scale, and the parameters in the linear logistic regression equation can be directly compared.
%macro woe_encoding(dataset,var,target);
proc sql noprint;
create table stats as
select distinct(&var) as gr, round(mean(&target),00.1) as mean_encode
from &dataset
group by gr;
quit;
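data stats;
set stats;
/* mean_encode is P(target=1) for the class; keep both probabilities away from 0 */
bad_prob=1-mean_encode;
if bad_prob=0 then bad_prob=0.0001;
if mean_encode=0 then mean_encode=0.0001; /* avoids log(0), which returns a missing value */
me_by_bp=mean_encode/bad_prob;
woe_encode=log(me_by_bp);
run;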
proc sql noprint;
create table new as
select d.* , s.woe_encode
from &dataset as d
left join stats as s
on &var=s.gr;
quit;
%mend;
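The sign rules described earlier can be checked with a small Python sketch. This is an illustration under assumptions, not the SAS macro itself: it treats target=1 as “good”, computes the log of P(Goods)/P(Bads) within each class, and the names woe_encode, floor, and the toy grade data are hypothetical.

```python
import math

def woe_encode(categories, targets, floor=1e-4):
    """Log of P(good)/P(bad) per category, where target=1 marks a 'good'."""
    totals, goods = {}, {}
    for cat, y in zip(categories, targets):
        totals[cat] = totals.get(cat, 0) + 1
        goods[cat] = goods.get(cat, 0) + y
    woe = {}
    for cat in totals:
        p_good = goods[cat] / totals[cat]
        p_bad = 1 - p_good
        # Clamp both probabilities away from 0 so the ratio and the log are defined.
        p_good = max(p_good, floor)
        p_bad = max(p_bad, floor)
        woe[cat] = math.log(p_good / p_bad)
    return woe

# Hypothetical toy data: x is balanced, y is mostly good, z is mostly bad.
grade = ["x", "x", "y", "y", "y", "y", "z", "z", "z", "z"]
target = [1, 0, 1, 1, 1, 0, 0, 0, 0, 1]

woe = woe_encode(grade, target)
print(woe)
```

Class x has as many goods as bads, so its WoE is 0; y is positive and z is negative. Note that some references define WoE from each class's share of all goods and all bads instead; that version differs from the per-class log odds used here only by a constant shift.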
“Probability Ratio Encoding” is similar to Weight of Evidence; the only difference is that the ratio of good and bad probabilities is used directly, without taking the log. For each label, we calculate the mean of the target, that is, the probability of the target being 1 ( P(1) ), and also the probability of the target being 0 ( P(0) ). Then, we calculate the ratio P(1)/P(0) and replace the labels with that ratio.
We need to substitute a small positive value for P(0) when it is zero, to avoid divide-by-zero scenarios where a particular category has no rows with target=0. Check out this link for more information.
%macro probability_encoding(dataset,var,target);
proc sql noprint;
create table stats as
select distinct(&var) as gr, round(mean(&target),00.1) as mean_encode
from &dataset
group by gr;
quit;
data stats;
set stats;
bad_prob=1-mean_encode;
if bad_prob=0 then bad_prob=0.0001;
prob_encode=mean_encode/bad_prob;
run;
proc sql noprint;
create table new as
select d.* , s.prob_encode
from &dataset as d
left join stats as s
on &var=s.gr;
quit;
%mend;
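As a rough Python illustration of the ratio and the divide-by-zero guard (the function name probability_ratio_encode, the floor value, and the toy data are hypothetical, and this is a sketch of the idea, not the macro itself):

```python
def probability_ratio_encode(categories, targets, floor=1e-4):
    """Encode each category as P(target=1) / P(target=0)."""
    totals, ones = {}, {}
    for cat, y in zip(categories, targets):
        totals[cat] = totals.get(cat, 0) + 1
        ones[cat] = ones.get(cat, 0) + y
    ratios = {}
    for cat in totals:
        p1 = ones[cat] / totals[cat]
        p0 = 1 - p1
        # If a category has no target=0 rows, use a small floor instead of 0.
        if p0 == 0:
            p0 = floor
        ratios[cat] = p1 / p0
    return ratios

# Hypothetical toy data: b has only target=1 rows, c has only target=0 rows.
segment = ["a", "a", "a", "b", "b", "c", "c"]
target = [1, 1, 0, 1, 1, 0, 0]

ratios = probability_ratio_encode(segment, target)
print(ratios)
```

Category a has two 1s and one 0, so its ratio is about 2. Category b would divide by zero without the floor and instead receives a very large value, which is worth keeping in mind when such categories are rare.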
Categorical Feature Encoding is an important part of preparing data for machine learning models. However, each method suits different circumstances, so it is important to know the different techniques that fall under the Bayesian category.
If you want to take a look at how the coding operates in a SAS environment, you can find all the SAS Macro Definition code on my GitHub page here.
blog-categorical-feature-encoding-bayesian by Selerity
SAS Macro examples for the Blog Post “Categorical Feature Encoding in SAS (Bayesian Encoders)”
Categorical variables are usually represented as strings with a limited number of distinct values, while categorical feature encoding is the process of converting that data into a format understandable by machine learning models.
The performance of machine learning models depends on several factors. One of them is the method used to process data and feed it to the model. As such, encoding is a crucial step because it converts categorical variables into a numeric form that machine learning models can use. Good encoding improves model quality and helps in feature engineering.
In this blog, we explore the different classic encoding methods along with a snapshot of how each encoding method works in SAS Macro.
Label Encoding assigns a value from 1 to N to each class of a categorical feature. For instance, if there is a variable “Hair color” with values of Black, Brown, and Red, Label Encoding will replace these values with 1, 2, and 3. However, one problem with Label Encoding is that the numbers imply an order that the classes may not have. Machine learning algorithms can treat Red (3) as greater than Black (1) even though no such relationship exists, which may lead to inaccurate results.
Here is an example macro to perform Label Encoding in SAS:
%macro label_encode(dataset,var);
proc sql noprint;
select distinct(&var)
into:val1-
from &dataset;
select count(distinct(&var)) into:mx from &dataset;
quit;
data new;
set &dataset;
%do i=1 %to &mx;
if &var="&&val&i" then new=&i;
%end;
run;
%mend;
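The “Hair color” example can be reproduced with a few lines of Python. This is a sketch for illustration only: the function name label_encode is hypothetical, and numbering the classes in sorted order is an assumption made here so the result is reproducible.

```python
def label_encode(values):
    """Map each distinct value to an integer from 1 to N."""
    # Assumption: classes are numbered in sorted order.
    levels = sorted(set(values))
    mapping = {level: i for i, level in enumerate(levels, start=1)}
    return [mapping[v] for v in values], mapping

hair_color = ["Brown", "Black", "Red", "Black"]
encoded, mapping = label_encode(hair_color)
print(mapping)
print(encoded)
```

Black, Brown, and Red become 1, 2, and 3. The codes are just identifiers, yet a model that reads them as numbers will see Red as three times Black.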
Binary Encoding converts class values into numeric values, like Label Encoding does. However, Binary Encoding takes it a step further and converts the numeric values into binary numbers, where each digit has its own separate column.
“If there are n unique categories, then binary encoding results in only log₂(n) features”.
For more information, visit here.
Here is an example macro for Binary Encoding in SAS:
%macro binary_encoding(dataset,var);
proc sql noprint;
select distinct(&var)
into:val1-
from &dataset;
select count(distinct(&var)) into:mx from &dataset;
quit;
data new;
set &dataset;
%do i=1 %to &mx;
if &var="&&val&i" then new=&i;
%end;
format new binary.;
run;
%mend;
This macro creates a single variable with a binary formatted value. To split those values into multiple columns, you could create a Split Column Macro.
Here is an example macro for splitting columns in SAS:
%macro split_column(data,var);
data try;
set &data;
cha=put(&var, binary.);
run;
proc sql noprint;
select max(length(cha)) into :ln from try ;
quit;
data &data;
set try;
%do i=1 %to &ln;
c_&i=substr(cha,&i,1);
%end;
run;
%mend;
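The two steps above (number the classes, then split the binary digits into columns) can be sketched in Python. The names here are hypothetical, and one difference from the SAS version is an assumption of this sketch: it uses only as many digits as the largest code needs, instead of a fixed-width binary format.

```python
def binary_encode(values):
    """Label-encode the values, then split each code into binary digit columns."""
    levels = sorted(set(values))
    codes = {level: i for i, level in enumerate(levels, start=1)}
    # Number of binary digits needed for the largest code.
    width = max(codes.values()).bit_length()
    rows = []
    for v in values:
        bits = format(codes[v], f"0{width}b")  # zero-padded binary string
        rows.append([int(b) for b in bits])    # one column per digit
    return rows, codes

hair_color = ["Red", "Black", "Grey", "Brown", "Blonde"]
rows, codes = binary_encode(hair_color)
for value, row in zip(hair_color, rows):
    print(value, row)
```

Five classes fit into three columns here, where One-Hot Encoding would need five.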
One-Hot Encoding converts a categorical variable into a set of columns of 1’s and 0’s, one column per class: a row has 1 in the column for its class and 0 everywhere else. These indicator columns are fed into machine learning, deep learning, and statistical algorithms to make better predictions or improve the efficiency of the ML/DL/statistical models.
Here is an example macro to do One-Hot encoding in SAS:
%macro hot_encoding(data,var);
proc sql noprint;
select distinct &var
into:val1-
from &data;
select count(distinct(&var)) into:len from &data;
quit;
data encoded_data;
set &data;
%do i=1 %to &len;
if &var="&&val&i" then %sysfunc(compress(&&val&i,'$ - /'))=1 ;
else %sysfunc(compress(&&val&i,'$ - /'))=0;
%end;
run;
%mend;
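For comparison, here is the same transformation as a short Python sketch (illustrative only; the function name one_hot_encode and the sample values are hypothetical):

```python
def one_hot_encode(values):
    """Return one dict per row with a 0/1 indicator for every distinct class."""
    levels = sorted(set(values))
    return [{level: int(v == level) for level in levels} for v in values]

hair_color = ["Red", "Black", "Brown"]
for value, row in zip(hair_color, one_hot_encode(hair_color)):
    print(value, row)
```

Each row has exactly one 1, in the column that matches its class.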
As the name suggests, Frequency Encoding counts how often each class value occurs, then divides that count by the total number of rows. This lets the model weight a class in direct or inverse proportion to how common it is.
Here is an example of a macro for Frequency Encoding in SAS:
%macro frequency_encoding(dataset, var);
proc sql noprint;
create table freq as
select distinct(&var) as values, count(&var) as number
from &dataset
group by Values ;
create table new as select *, round(freq.number/count(&var),00.01) As freq_encode from &dataset left join freq on &var=freq.values;
quit;
data new(drop=values number);
set new;
run;
%mend;
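A minimal Python sketch of the same calculation, for illustration (the function name frequency_encode and the sample data are hypothetical, and no rounding is applied here):

```python
from collections import Counter

def frequency_encode(values):
    """Replace each value with its share of all rows."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

plan = ["A", "A", "B", "A"]
print(frequency_encode(plan))
```

Class A appears in three of the four rows, so each A row is encoded as 0.75, and the single B row as 0.25.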
Deviation Encoding goes by several names: some analysts call it Effect Encoding and others Sum Encoding, but the definition is the same. It works like One-Hot Encoding with one class held back as a reference: the reference class gets no column of its own, and rows belonging to it are coded -1 in every remaining column instead of 0. For example:
[Tables: the same data coded with One-Hot Encoding and with Effect/Sum/Deviation Encoding]
Here is an example macro for Deviation Encoding in SAS:
%macro sum_encoding(data,var);
proc sql noprint;
select distinct &var
into:val1-
from &data;
select count(distinct(&var)) into:len from &data;
quit;
%let len=&len; /* strip the leading blanks that INTO can leave on a numeric value */
data encoded_data;
set &data;
%do i=1 %to &len;
if &var="&&val&i" then %sysfunc(compress(&&val&i,'$ - /'))=1 ;
else %sysfunc(compress(&&val&i,'$ - /'))=0;
%end;
run;
/* Rows in the last class get -1 in every remaining column; its own column is dropped */
data sum_encode;
set encoded_data;
if %sysfunc(compress(&&val&len,'$ - /'))=1 then do;
%do x=1 %to %eval(&len-1);
%sysfunc(compress(&&val&x,'$ - /'))=-1;
%end;
end;
drop %sysfunc(compress(&&val&len,'$ - /'));
run;
%mend;
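The difference from One-Hot Encoding is easiest to see on a tiny example. The Python sketch below is an illustration under assumptions: the function name deviation_encode is hypothetical, and it takes the last class in sorted order as the reference, mirroring how the macro uses the last distinct value.

```python
def deviation_encode(values):
    """One-hot encode, drop the reference class, and code its rows as -1."""
    levels = sorted(set(values))
    reference = levels[-1]   # assumption: the last class is the reference
    kept = levels[:-1]       # the reference gets no column of its own
    rows = []
    for v in values:
        if v == reference:
            rows.append({level: -1 for level in kept})
        else:
            rows.append({level: int(v == level) for level in kept})
    return rows

hair_color = ["Black", "Brown", "Red"]
for value, row in zip(hair_color, deviation_encode(hair_color)):
    print(value, row)
```

Black and Brown keep their one-hot rows, while Red, the reference class, is coded -1 in both remaining columns.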
Data scientists are often said to spend 70-80% of their time cleaning and preparing data, and encoding or converting categorical data is a crucial part of that work. Choosing the right encoding technique matters for data quality, which is why it is important to understand the different encoding methods.
If you are looking specifically for the SAS Macro Definition code, you can check it out here.
blog-categorical-feature-encoding-classical by Selerity
SAS Macro examples for the Blog Post “5 Categorical Encoding Techniques in SAS”