5 Categorical Feature Encoding Techniques in SAS

What is Categorical Feature Encoding?

Categorical variables are usually represented as strings drawn from a limited set of values (classes). Categorical feature encoding is the process of converting those values into a numeric format that machine learning models can use.

The performance of machine learning models depends on several factors. One of them is the method used to process data before feeding it to the model. Encoding is a crucial step because it converts categorical variables into a form that machine learning models can work with. A well-chosen encoding can improve model quality and is a core part of feature engineering.

In this blog, we explore five classic encoding methods, along with a SAS macro that shows how each one works.

1. Label Encoding

Label Encoding assigns an integer from 1 to N to each of the N classes of a categorical feature. For instance, if there is a variable “Hair color” with values of Black, Brown, and Red, Label Encoding will replace these values with 1, 2, and 3. However, the integers are assigned without regard to any real order or relationship between the class levels. Many machine learning algorithms will still treat the codes as ordered magnitudes (as if Red > Brown > Black), which may lead to misleading results.
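To make the mapping concrete, here is a small language-neutral sketch in Python (not SAS) of what Label Encoding produces for the hair color example. The helper name `label_encode` is hypothetical, and sorting the levels alphabetically is an assumption made so that the codes are reproducible.

```python
# Illustrative sketch only (Python, not SAS).
# `label_encode` is a hypothetical helper, not a library function.

def label_encode(values):
    """Map each distinct value to an integer code 1..N (levels sorted)."""
    levels = sorted(set(values))
    codes = {level: i for i, level in enumerate(levels, start=1)}
    return [codes[v] for v in values], codes


hair = ["Brown", "Black", "Red", "Black"]
hair_encoded, hair_mapping = label_encode(hair)
print(hair_mapping)   # {'Black': 1, 'Brown': 2, 'Red': 3}
print(hair_encoded)   # [2, 1, 3, 1]
```

The codes carry no meaning beyond identity: nothing about hair color makes Red "three times" Black, which is the ordering problem described above.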

SAS Macro for Label Encoding

Here is an example macro to perform Label Encoding in SAS:

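%macro label_encode(dataset, var);
  %local i mx;
  proc sql noprint;
    /* Store each distinct level in macro variables val1, val2, ... */
    select distinct &var
      into :val1-
      from &dataset;
    /* Number of distinct levels */
    select count(distinct &var) into :mx trimmed from &dataset;
  quit;
  data new;
    set &dataset;
    /* Assign code i to the i-th level; &&val&i resolves to &val1, &val2, ... */
    %do i = 1 %to &mx;
      if &var = "&&val&i" then new = &i;
    %end;
  run;
%mend;
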
2. Binary Encoding

Binary Encoding converts class values into numeric values, as Label Encoding does. However, Binary Encoding goes a step further and converts each numeric value into a binary number, with each binary digit stored in its own column.

“If there are n unique categories, then binary encoding results in only log₂(n) features.”

For more information, visit here.
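As a rough illustration of the column count, the following Python sketch (not SAS) encodes the hair color example. The helper name `binary_encode` is hypothetical. It assumes codes start at 1, as in Label Encoding above, so the number of digit columns is the bit length of N, which is log₂(n) rounded up to a whole number of digits.

```python
# Illustrative sketch only (Python, not SAS).
# `binary_encode` is a hypothetical helper, not a library function.

def binary_encode(values):
    """Label-encode values as 1..N, then split each code into binary digits."""
    levels = sorted(set(values))
    codes = {level: i for i, level in enumerate(levels, start=1)}
    width = len(levels).bit_length()  # digits needed for the largest code, N
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


binary_rows = binary_encode(["Brown", "Black", "Red", "Black"])
print(binary_rows)  # [[1, 0], [0, 1], [1, 1], [0, 1]]
```

Three classes fit in two columns here, where One-Hot Encoding would need three.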

SAS Macro for Binary Encoding

Here is an example macro for Binary Encoding in SAS:

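%macro binary_encoding(dataset, var);
  %local i mx;
  proc sql noprint;
    /* Store each distinct level in macro variables val1, val2, ... */
    select distinct &var
      into :val1-
      from &dataset;
    /* Number of distinct levels */
    select count(distinct &var) into :mx trimmed from &dataset;
  quit;
  data new;
    set &dataset;
    /* Label-encode first: code i for the i-th level */
    %do i = 1 %to &mx;
      if &var = "&&val&i" then new = &i;
    %end;
    /* Display the numeric code as binary digits */
    format new binary.;
  run;
%mend;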

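This macro creates a single variable displayed with a binary format. Because the BINARY. format defaults to a width of 8, the values appear zero-padded to eight digits. To split those digits into multiple columns, you can use a column-splitting macro.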

SAS Macro for Splitting Column

Here is an example macro for splitting columns in SAS:

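%macro split_column(data, var);
  %local i ln;
  /* Render the numeric code as a string of binary digits */
  data try;
    set &data;
    cha = put(&var, binary.);
  run;
  /* The longest binary string determines how many columns are needed */
  proc sql noprint;
    select max(length(cha)) into :ln trimmed from try;
  quit;
  /* One column per binary digit: c_1, c_2, ... */
  data &data;
    set try;
    %do i = 1 %to &ln;
      c_&i = substr(cha, &i, 1);
    %end;
  run;
%mend;
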
3. One-Hot Encoding

One-Hot Encoding converts a categorical variable into a set of indicator columns, one per class. Each column holds 1 when the observation belongs to that class and 0 otherwise. These indicators can be fed into machine learning, deep learning, and statistical algorithms without implying any order between the classes.
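The following Python sketch (not SAS) shows the indicator columns for the hair color example. The helper name `one_hot_encode` is hypothetical, and the alphabetical column order is an assumption for reproducibility.

```python
# Illustrative sketch only (Python, not SAS).
# `one_hot_encode` is a hypothetical helper, not a library function.

def one_hot_encode(values):
    """Return the sorted levels and one 0/1 indicator row per input value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


onehot_levels, onehot_rows = one_hot_encode(["Brown", "Black", "Red"])
print(onehot_levels)  # ['Black', 'Brown', 'Red']
print(onehot_rows)    # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Every row contains exactly one 1, in the column of its own class.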

SAS Macro for One-Hot Encoding

Here is an example macro to do One-Hot encoding in SAS:

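%macro hot_encoding(data, var);
  %local i len;
  proc sql noprint;
    /* Store each distinct level in macro variables val1, val2, ... */
    select distinct &var
      into :val1-
      from &data;
    /* Number of distinct levels */
    select count(distinct &var) into :len trimmed from &data;
  quit;
  data encoded_data;
    set &data;
    /* One 0/1 indicator column per level. The column name is the level with
       $, spaces, hyphens, slashes, and quotes removed; this assumes the
       result is a valid SAS variable name. */
    %do i = 1 %to &len;
      if &var = "&&val&i" then %sysfunc(compress(&&val&i, '$ - /')) = 1;
      else %sysfunc(compress(&&val&i, '$ - /')) = 0;
    %end;
  run;
%mend;
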
4. Count/Frequency Encoding

As the name suggests, Frequency Encoding counts how often each class value occurs, then divides that count by the total number of values. The resulting proportion replaces the class value, which lets the model weight a class in direct or inverse relation to how common it is.
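As a quick worked example, this Python sketch (not SAS) computes the encoded values for a four-row column. The helper name `frequency_encode` is hypothetical, and rounding to two decimals mirrors the rounding used in the SAS macro.

```python
# Illustrative sketch only (Python, not SAS).
# `frequency_encode` is a hypothetical helper, not a library function.
from collections import Counter


def frequency_encode(values):
    """Replace each value with its share of the total number of values."""
    counts = Counter(values)
    total = len(values)
    return [round(counts[v] / total, 2) for v in values]


freq_encoded = frequency_encode(["Black", "Black", "Brown", "Red"])
print(freq_encoded)  # [0.5, 0.5, 0.25, 0.25]
```

Note that Brown and Red receive the same value here: classes with equal frequency become indistinguishable after this encoding.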

SAS Macro for Count/Frequency Encoding

Here is an example of a macro for Frequency Encoding in SAS:

%macro frequency_encoding(dataset, var);
  proc sql noprint;
    /* Count the rows in each level */
    create table freq as
      select &var as level_value, count(&var) as level_count
      from &dataset
      group by &var;
    /* Attach the relative frequency: level count / total non-missing count */
    create table new as
      select a.*,
             round(f.level_count / (select count(&var) from &dataset), 0.01)
               as freq_encode
      from &dataset as a
      left join freq as f
        on a.&var = f.level_value;
  quit;
%mend;

5. Effect/Sum/Deviation Encoding

This technique goes by several names: Effect Encoding, Deviation Encoding, and Sum Encoding all refer to the same method. It works like One-Hot Encoding with one difference: one class is chosen as the reference level, its indicator column is dropped, and rows belonging to that class, which would otherwise contain 0 in all remaining columns, are set to -1 in every column instead. For example, with hair colors Black, Brown, and Red and Red as the reference level, a Red row is encoded as Black = -1 and Brown = -1.

[Figure: One-Hot Encoding example]

[Figure: Effect/Sum/Deviation Encoding example]
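The difference between the two encodings can be sketched in Python (not SAS) as follows. The helper name `effect_encode` is hypothetical, and taking the last level in alphabetical order as the reference is an assumption chosen to match the macro below, which drops the last level.

```python
# Illustrative sketch only (Python, not SAS).
# `effect_encode` is a hypothetical helper, not a library function.

def effect_encode(values):
    """One-hot encode, drop the last level, and mark its rows with -1."""
    levels = sorted(set(values))
    kept, reference = levels[:-1], levels[-1]
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(kept))  # reference level: -1 everywhere
        else:
            rows.append([1 if v == level else 0 for level in kept])
    return kept, rows


effect_columns, effect_rows = effect_encode(["Black", "Brown", "Red"])
print(effect_columns)  # ['Black', 'Brown']
print(effect_rows)     # [[1, 0], [0, 1], [-1, -1]]
```

With One-Hot Encoding the same three rows would be [1, 0, 0], [0, 1, 0], and [0, 0, 1]; here the Red column is gone and the Red row is all -1.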

SAS Macro for Effect/Sum/Deviation Encoding

Here is an example macro for Deviation Encoding in SAS:

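%macro sum_encoding(data, var);
  %local i x len;
  proc sql noprint;
    /* Store each distinct level in macro variables val1, val2, ... */
    select distinct &var
      into :val1-
      from &data;
    /* Number of distinct levels */
    select count(distinct &var) into :len trimmed from &data;
  quit;
  /* Step 1: one-hot encode every level */
  data encoded_data;
    set &data;
    %do i = 1 %to &len;
      if &var = "&&val&i" then %sysfunc(compress(&&val&i, '$ - /')) = 1;
      else %sysfunc(compress(&&val&i, '$ - /')) = 0;
    %end;
  run;
  /* Step 2: treat the last level as the reference. Its rows get -1 in
     every other column, and its own indicator column is dropped. */
  data sum_encode;
    set encoded_data;
    if %sysfunc(compress(&&val&len, '$ - /')) = 1 then do;
      %do x = 1 %to %eval(&len - 1);
        %sysfunc(compress(&&val&x, '$ - /')) = -1;
      %end;
    end;
    drop %sysfunc(compress(&&val&len, '$ - /'));
  run;
%mend;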

Wrapping Up

Data scientists are often said to spend 70-80% of their time cleaning and preparing data, so encoding categorical data is a significant part of their work. Choosing the right encoding technique matters for data quality, which is why it is important to understand the different encoding methods.

If you are looking for more information on SAS macro definition code, you can check it out here.

Suraj Saini
