How SAS Custom Macros Make Feature Engineering Easier

1. One Hot Encoding

You will find many definitions of one-hot encoding on the internet. In general terms, it is the process of converting a categorical variable into binary form (1s and 0s), with one new 0/1 variable per category. These variables can be fed into machine learning, deep learning, and statistical algorithms, which can lead to better predictions or more efficient ML/DL/statistical models.

A. hot_encode SAS Custom Macro Explanation

%hot_encode (Dataset_name, Variable_name)
1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.
2. Variable_name= Name of the categorical variable that you want to one-hot encode.

B. What does hot_encode () do behind the scenes?

It creates a new dataset “encoded_data” in the work library with new encoded (binary form) variables that you can use to train your ML models. Remember that tables in the work library are temporary. You can either save this table to a permanent library or merge the new variables into your existing dataset, whichever suits you.
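To make the idea concrete, here is a minimal sketch of the encoding step in plain Python. It is not the SAS macro itself, only an illustration of the logic described above. The function name mirrors the macro, but the column naming scheme (`Origin_Asia` and so on) and the three sample rows are assumptions made for the example.

```python
def hot_encode(rows, variable):
    """Return copies of `rows` with one 0/1 column per distinct value of `variable`."""
    # Collect the distinct categories in a stable (sorted) order.
    categories = sorted({row[variable] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for category in categories:
            # Assumed naming scheme: <variable>_<category>
            new_row[f"{variable}_{category}"] = 1 if row[variable] == category else 0
        encoded.append(new_row)
    return encoded


# Illustrative rows only, loosely modeled on sashelp.cars
cars = [
    {"Make": "Acura", "Origin": "Asia"},
    {"Make": "Audi", "Origin": "Europe"},
    {"Make": "Buick", "Origin": "USA"},
]
encoded_data = hot_encode(cars, "Origin")
for row in encoded_data:
    print(row)
```

Each output row keeps its original columns and gains one indicator column per category, exactly one of which is 1.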

C. Examples of hot_encode ()

D. SAS Macro Definition Code for One Hot Encoding

2. Outlier Detection Method

There are many methods and algorithms for finding outliers and extreme values in a dataset. The custom SAS macro I built first tests for normality and then decides whether to use standard deviations or percentiles to find the extreme values.
If a variable is normally distributed, the macro uses the standard deviation method by default to find outliers. Otherwise, it uses the percentile method.

A. Outliers SAS Custom Macro Explanation

%Outliers (Dataset_name, Variable_name)
1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.
2. Variable_name= Name of a variable in which you want to find outliers.

B. What does Outliers () do behind the scenes?

It creates a new dataset called “Outliers” in the work library containing the observations considered extreme. It also runs normality tests and writes the test statistics to the “Test” table, which you can find in the work library right after the macro executes. If your variable is normally distributed, the macro treats as outliers all observations that fall more than 3 standard deviations above or below the mean.

If your variable is not normally distributed, the macro uses the percentile method instead. The outliers are all observations above the 99th percentile or below the 1st percentile (a “Range” table is created in the work library if you want to see the values of the mean and percentiles). You can change these thresholds in the SAS macro definition to suit your requirements.
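The decision rule above can be sketched in plain Python as follows. This is an illustration, not the macro's code. The source does not say which normality test the macro runs, so the Jarque-Bera test is used here purely as a stand-in because it needs only the standard library. The function names, the 0.05 significance level, and the sample data are all assumptions.

```python
import math
import statistics


def looks_normal(values, alpha=0.05):
    """Stand-in normality check (Jarque-Bera): True when normality is not rejected."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return False  # constant data: treat as non-normal
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # chi-square upper tail, 2 degrees of freedom
    return p_value > alpha


def find_outliers(values):
    """Return (method, low, high, outliers) using the rule described above."""
    if looks_normal(values):
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        method, low, high = "std", mean - 3 * sd, mean + 3 * sd
    else:
        cuts = statistics.quantiles(values, n=100)  # 99 cut points
        method, low, high = "percentile", cuts[0], cuts[98]  # 1st and 99th
    outliers = [v for v in values if v < low or v > high]
    return method, low, high, outliers


# Made-up, heavily skewed data: normality is rejected, so percentiles are used.
prices = [float(p) for p in range(1, 200)] + [10000.0]
method, low, high, outliers = find_outliers(prices)
print(method, low, high, outliers)
```

Note that the percentile branch always flags roughly 2% of the observations, which is one reason you may want to adjust the thresholds for your own data.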

C. Examples of Outliers ()

D. SAS Macro Definition Code for Outlier Detection

3. Lag Features

Lag features are normally derived from a time-series dataset. Each lag feature holds the value of a variable from a previous time step. “Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.”
https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/

A. Lag_n SAS Custom Macro Explanation

%Lag_n (Dataset_name, Variable_name, start_time_period, end_time_period)

1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.
2. Variable_name= Name of the variable from which you want to derive lag features.
3. Start_time_period= Specify a non-negative integer value (0, 1, 2, 3, etc.). This is the first lag to create: if you give a value of 2, the macro starts deriving lag features at lag_2.
4. End_time_period= Specify a non-negative integer value (0, 1, 2, 3, etc.). This is the last lag to create: if you give a value of 3, the macro stops deriving lag features at lag_3.

B. What does Lag_n () do behind the scenes?

It creates a new temporary dataset named “WithLag” containing n lag variables. You control n through the lower and upper limits of the time period, which you specify in the start_time_period and end_time_period arguments of the Lag_n macro, respectively.
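As a rough illustration of what the start and end arguments control, here is a small Python sketch of the same idea. It is not the macro's code. The `lag_<k>` column names, the use of `None` where SAS would show a missing value, and the sample sales data are assumptions made for the example.

```python
def lag_n(rows, variable, start_time_period, end_time_period):
    """Add lag_<k> columns for k = start..end; None where no earlier row exists."""
    with_lag = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy, so the input rows stay untouched
        for k in range(start_time_period, end_time_period + 1):
            # Value of `variable` k rows earlier, if that row exists
            new_row[f"lag_{k}"] = rows[i - k][variable] if i - k >= 0 else None
        with_lag.append(new_row)
    return with_lag


# Made-up monthly series, already sorted by time
sales = [{"month": m, "units": u} for m, u in enumerate([10, 12, 15, 11, 18], start=1)]
with_lag = lag_n(sales, "units", 1, 2)
for row in with_lag:
    print(row)
```

With a start of 1 and an end of 2, every row gains `lag_1` and `lag_2`. The first rows have no earlier observations to draw from, so their lag values are empty. The rows must already be in time order, since a lag is simply "the value k rows back".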

C. Examples of Lag_n ()

D. SAS Macro Definition Code for Lag Features

4. Describe Table

The describe_table SAS custom macro gives you a lot more information than the Python pandas describe() function (https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm). It reports the information about a dataset that a data analyst or data scientist typically needs to know: variable metadata, summary statistics, and frequency tables.

A. describe_table SAS Custom Macro Explanation

%describe_table (Dataset_name)
1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.

B. What does describe_table () do behind the scenes?

It creates a report in three parts. First, it lists the metadata for every variable, such as type, format, length, and label. Second, it gives you a summary statistics table for all the numerical features. Finally, it creates frequency tables and plots for all the categorical features.
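The three parts of that report can be approximated in plain Python as shown below. This is only a sketch of the idea, not the macro's code. It covers variable types, numeric summaries, and frequency tables, and leaves out formats, labels, and plots. The function's return structure and the four sample rows are assumptions made for the example.

```python
import statistics
from collections import Counter


def describe_table(rows):
    """Return (variable types, numeric summaries, categorical frequency tables)."""
    types, summary, frequencies = {}, {}, {}
    for col in rows[0]:
        # Ignore missing values (None) when profiling a column
        values = [row[col] for row in rows if row[col] is not None]
        if values and all(isinstance(v, (int, float)) for v in values):
            types[col] = "numeric"
            summary[col] = {
                "n": len(values),
                "mean": statistics.fmean(values),
                "std": statistics.stdev(values) if len(values) > 1 else None,
                "min": min(values),
                "max": max(values),
            }
        else:
            types[col] = "character"
            frequencies[col] = dict(Counter(values))
    return types, summary, frequencies


# Illustrative rows only, not actual sashelp.cars records
cars = [
    {"Make": "Acura", "Origin": "Asia", "Cylinders": 6},
    {"Make": "Audi", "Origin": "Europe", "Cylinders": 4},
    {"Make": "Buick", "Origin": "USA", "Cylinders": 6},
    {"Make": "BMW", "Origin": "Europe", "Cylinders": 6},
]
types, summary, frequencies = describe_table(cars)
print(types)
print(summary["Cylinders"])
print(frequencies["Origin"])
```

Splitting the output by variable type mirrors the report described above: numeric columns get summary statistics, while categorical columns get counts per level.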

C. Examples of describe_table ()

D. SAS Custom Macro Definition Code for Table Description

Suraj Saini
