There are many different definitions of one-hot encoding on the internet. In general terms, it is the process of converting a categorical variable into binary form (1s and 0s) that can be fed into machine learning, deep learning, and statistical algorithms to improve predictions or the efficiency of the ML/DL/statistical models.

%hot_encode (Dataset_name, Variable_name)

1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.

2. Variable_name= Name of the categorical variable that you want to one-hot encode.

It creates a new dataset, “encoded_data”, in the work library with the new encoded (binary) variables that you can use to train your ML models. Remember, though, that tables in the work library are temporary: you can either save this table to a permanent library or merge the new variables into your existing dataset, whichever you prefer.
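The macro definition itself is not shown here, but its logic can be sketched in Base SAS. In this hypothetical sketch, the macro name, arguments, and the work.encoded_data output come from the description above; the implementation details (PROC SQL to collect levels, one indicator variable per level) are my assumptions, not the author's code, and the sketch assumes the category values form valid variable-name suffixes.

```sas
/* Hypothetical sketch of %hot_encode; internals are assumed. */
%macro hot_encode(Dataset_name, Variable_name);
    /* Collect the distinct levels of the categorical variable */
    proc sql noprint;
        select distinct &Variable_name.
            into :levels separated by '|'
            from &Dataset_name.;
        select count(distinct &Variable_name.)
            into :n_levels
            from &Dataset_name.;
    quit;

    /* Create one 0/1 indicator variable per level */
    data work.encoded_data;
        set &Dataset_name.;
        %do i = 1 %to &n_levels.;
            %let level = %scan(&levels., &i., |);
            &Variable_name._&level. = (&Variable_name. = "&level.");
        %end;
    run;
%mend hot_encode;

/* Example call: encodes the Origin column of sashelp.cars */
%hot_encode(sashelp.cars, Origin);
```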

There is a plethora of methods and algorithms for finding outliers and extreme values in a dataset. The custom SAS macro I built first tests for normality and then decides whether to use standard deviations or percentiles to identify the extreme values.

If the variable is normally distributed, by default it uses the standard deviation method to find outliers; otherwise, it uses the percentile method.

%Outliers (Dataset_name, Variable_name)

1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.

2. Variable_name= Name of a variable in which you want to find outliers.

It creates a new dataset called “Outliers” in the work library containing the observations considered extreme. It also runs normality tests and writes the test statistics to a “Test” table, which you will find in the work library right after the macro executes. If your variable is normally distributed, it flags as outliers all observations that fall more than 3 standard deviations above or below the mean.

If your variable is not normally distributed, it uses the percentile method instead: the outliers are all observations above the 99th percentile or below the 1st percentile (a “Range” table is also created in the work library if you want to see the mean and percentile values). You can change these thresholds in the SAS macro definition as per your requirements.
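The decision logic described above can be sketched as follows. This is an illustrative reconstruction, not the author's exact macro: the macro name and the work.Outliers, work.Test, and work.Range tables follow the description, while the choice of the Shapiro-Wilk test, the 0.05 significance cutoff, and the specific PROC steps are my assumptions (note also that SAS only computes Shapiro-Wilk for up to 2000 observations).

```sas
/* Illustrative sketch of the %Outliers logic; internals are assumed. */
%macro Outliers(Dataset_name, Variable_name);
    /* Normality tests -> work.Test */
    proc univariate data=&Dataset_name. normal;
        var &Variable_name.;
        ods output TestsForNormality=work.Test;
    run;

    /* Mean, std dev, and 1st/99th percentiles -> work.Range */
    proc means data=&Dataset_name. noprint;
        var &Variable_name.;
        output out=work.Range mean=mean std=std p1=p1 p99=p99;
    run;

    data _null_;
        set work.Range;
        call symputx('mean', mean);
        call symputx('std',  std);
        call symputx('p1',   p1);
        call symputx('p99',  p99);
    run;

    /* Pull the Shapiro-Wilk p-value to decide which method to use */
    data _null_;
        set work.Test;
        if Test = 'Shapiro-Wilk' then call symputx('p_value', pValue);
    run;

    data work.Outliers;
        set &Dataset_name.;
        %if %sysevalf(&p_value. > 0.05) %then %do;
            /* Normal: keep values beyond 3 standard deviations */
            if &Variable_name. < &mean. - 3*&std.
               or &Variable_name. > &mean. + 3*&std.;
        %end;
        %else %do;
            /* Non-normal: keep values outside the 1st-99th percentiles */
            if &Variable_name. < &p1. or &Variable_name. > &p99.;
        %end;
    run;
%mend Outliers;

%Outliers(sashelp.cars, MSRP);
```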

Lag features are typically derived from a time-series dataset; each one contains the value of a variable at a previous time step. “Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.”

https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/

%Lag_n (Dataset_name, Variable_name, start_time_period, end_time_period)

1. Dataset_name= Specify a dataset with its SAS library name, for instance: sashelp.cars.

2. Variable_name= Name of the variable from which you want to derive lag features.

3. Start_time_period= Specify a non-negative integer (0, 1, 2, 3, etc.). This value sets the first lag to derive: if you pass 2, the macro starts deriving lag features at lag_2.

4. End_time_period= Specify a non-negative integer (0, 1, 2, 3, etc.). This value sets the last lag to derive: if you pass 3, the macro stops deriving lag features at lag_3.

It creates a new temporary dataset named “WithLag” consisting of n lag variables. You control n through the lower and upper limits of the time period, which you specify in the start_time_period and end_time_period arguments of the Lag_n macro respectively.
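A macro like this maps naturally onto the DATA step LAGn function family. The sketch below is a hypothetical reconstruction based on the description: the macro name, arguments, and the work.WithLag output come from the article, while the loop over LAGn calls (and the special case for lag 0, since there is no LAG0 function) is my assumption.

```sas
/* Hypothetical sketch of %Lag_n; internals are assumed. */
%macro Lag_n(Dataset_name, Variable_name, start_time_period, end_time_period);
    data work.WithLag;
        set &Dataset_name.;
        %do i = &start_time_period. %to &end_time_period.;
            %if &i. = 0 %then %do;
                lag_0 = &Variable_name.;  /* lag 0 is the value itself */
            %end;
            %else %do;
                lag_&i. = lag&i.(&Variable_name.);
            %end;
        %end;
    run;
%mend Lag_n;

/* Derives lag_2 and lag_3 of MSRP from sashelp.cars */
%Lag_n(sashelp.cars, MSRP, 2, 3);
```

Note that LAGn returns missing values for the first n observations of each variable, so the first rows of work.WithLag will have missing lag features.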

The Describe Table custom SAS macro gives you considerably more information than the Python pandas describe() function (https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm). It provides the summary information about a dataset that a data analyst or data scientist typically needs.
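For comparison, a rough Base SAS equivalent of pandas describe() can be produced with PROC MEANS; the author's Describe Table macro reportedly goes beyond these statistics, so this is only a baseline sketch, not the macro itself.

```sas
/* Approximate pandas describe() output for the numeric columns */
proc means data=sashelp.cars n nmiss mean std min p25 median p75 max;
run;
```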
