
Techniques of Feature Scaling with SAS Custom Macro


What is Feature Scaling?

Feature scaling is a process used to normalize data and is one of the most important steps in data pre-processing. It is performed before data is fed into machine learning, deep learning and statistical algorithms/models. In most cases, model performance improves when features are scaled, especially for models based on Euclidean distance. Normalization and Standardization are the two main techniques of feature scaling. Below, I define the most common techniques and explain how to implement them in SAS Studio or Base SAS using the SAS Macro facility.

What is Normalization?

Normalization is a feature scaling technique in which data values are rescaled to lie between two bounds, most commonly (0, 1) or (–1, 1). Min_MaxScaler and Mean_Normalization are common examples of normalization.

1. Min_MaxScaler

It rescales the data values so that they fall between 0 and 1. The mathematical formula is:

x_scaled = (x - min(x)) / (max(x) - min(x))

1.1 How can you use Min_MaxScaler in SAS?

1.2 Min_MaxScaler SAS Custom Macro Definition

1.3 What does Min_MaxScaler SAS Macro do behind the scenes?

Min_MaxScaler takes the variable that you want to scale and creates a new variable “MMVariableName” with scaled values. It also creates a univariate report where you can see the histograms of both the Actual Variable and the new Scaled Variable.
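The original macro definition is not reproduced here, so the following is only a minimal sketch of a macro that matches the behaviour described above; the parameter names, the PROC SQL/DATA step approach and the example dataset are assumptions rather than the author's original code.

%macro Min_MaxScaler(ds, var);
    /* Get the minimum and maximum of the variable (assumed approach) */
    proc sql noprint;
        select min(&var), max(&var)
            into :min_val, :max_val
            from &ds;
    quit;

    /* Add the scaled variable MM<VariableName> to the dataset (overwrites it in place) */
    data &ds;
        set &ds;
        MM&var = (&var - &min_val) / (&max_val - &min_val);
    run;

    /* Univariate report with histograms of the actual and scaled variables */
    proc univariate data=&ds;
        var &var MM&var;
        histogram &var MM&var;
    run;
%mend Min_MaxScaler;

/* Example call on a WORK copy of a sample dataset (hypothetical) */
data work.class;
    set sashelp.class;
run;

%Min_MaxScaler(work.class, Weight);

As a cross-check, PROC STDIZE in SAS/STAT with METHOD=RANGE performs the same 0-to-1 rescaling and can be used to verify the macro's output.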

2. Mean_Normalization

It rescales the data values to lie between –1 and 1. The mathematical formula is:

x_scaled = (x - mean(x)) / (max(x) - min(x))

2.1 How can you use Mean_Normalization in SAS?

2.2 Mean_Normalization SAS Custom Macro Definition

2.3 What does Mean_Normalization SAS Macro do behind the scenes?

Mean_Normalization takes the variable that you want to scale and creates a new variable “MNVariableName” with scaled values. It also creates a univariate report where you can see the histograms of both the Actual Variable and the new Scaled Variable.
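Again, the macro definition itself is not shown above, so here is only a hedged sketch of one way such a macro might be written; the names and implementation details are assumptions.

%macro Mean_Normalization(ds, var);
    /* Get the mean, minimum and maximum of the variable */
    proc sql noprint;
        select mean(&var), min(&var), max(&var)
            into :mean_val, :min_val, :max_val
            from &ds;
    quit;

    /* Add the scaled variable MN<VariableName> to the dataset (overwrites it in place) */
    data &ds;
        set &ds;
        MN&var = (&var - &mean_val) / (&max_val - &min_val);
    run;

    /* Univariate report with histograms of the actual and scaled variables */
    proc univariate data=&ds;
        var &var MN&var;
        histogram &var MN&var;
    run;
%mend Mean_Normalization;

/* Example call (assumes a modifiable WORK dataset, e.g. the WORK.CLASS copy above) */
%Mean_Normalization(work.class, Weight);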

What is Standardization?

Standardization is a feature scaling technique in which data values are centered around the mean and scaled to one standard deviation, which means that after standardization the data will have a mean of zero and a variance of 1.

“Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well-behaved mean and standard deviation. You can still standardize your data if this expectation is not met, but you may not get reliable results.”

https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/

3. Standard_Scaler

It rescales the distribution of data values so that the mean of the observed values is 0 and the standard deviation is 1. The mathematical formula is:

x_scaled = (x - mean(x)) / std(x)

3.1 Standard_Scaler in SAS

3.2 Standard_Scaler SAS Custom Macro Definition

3.3 What does Standard_Scaler SAS Custom Macro do behind the scenes?

Standard_Scaler takes the variable that you want to scale and creates a new variable “SDVariableName” with scaled values. It also creates a univariate report where you can see the histograms of both the Actual Variable and the new Scaled Variable.
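The definition is not reproduced above either, so the sketch below is one plausible implementation of the behaviour described; the names and approach are assumptions.

%macro Standard_Scaler(ds, var);
    /* Get the mean and standard deviation of the variable */
    proc sql noprint;
        select mean(&var), std(&var)
            into :mean_val, :std_val
            from &ds;
    quit;

    /* Add the scaled variable SD<VariableName> to the dataset (overwrites it in place) */
    data &ds;
        set &ds;
        SD&var = (&var - &mean_val) / &std_val;
    run;

    /* Univariate report with histograms of the actual and scaled variables */
    proc univariate data=&ds;
        var &var SD&var;
        histogram &var SD&var;
    run;
%mend Standard_Scaler;

/* Example call (assumes a modifiable WORK dataset) */
%Standard_Scaler(work.class, Weight);

Built-in alternatives such as PROC STANDARD (MEAN=0 STD=1) or PROC STDIZE with METHOD=STD perform the same transformation and can be used to verify the macro's output.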

4. Robust_Scaler

A Robust_Scaler transforms the data values by first subtracting the median and then dividing by the IQR, the interquartile range (Quartile 3 – Quartile 1). This centers the median at zero and makes the method very robust to outliers. The mathematical formula is:

x_scaled = (x - median(x)) / (Q3(x) - Q1(x))

4.1 How can you use Robust_Scaler in SAS?

4.2 Robust_Scaler SAS Custom Macro Definition

4.3 What does Robust_Scaler SAS Custom Macro do behind the scenes?

Robust_Scaler takes the variable that you want to scale and creates a new variable “RSVariableName” with scaled values. In the work library, it will create a STAT table where you can find the Median, Quantile 1 and Quantile 3 values to verify your results. It also creates a univariate report where you can see the histograms of both the Actual Variable and the new Scaled Variable.
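The macro definition is again not shown above; the sketch below is one plausible way to implement the behaviour described, with PROC MEANS writing the WORK.STAT table mentioned in the text. The names and implementation details are assumptions.

%macro Robust_Scaler(ds, var);
    /* Write the median, Q1 and Q3 to WORK.STAT so the results can be verified */
    proc means data=&ds noprint;
        var &var;
        output out=work.stat median=median_val q1=q1_val q3=q3_val;
    run;

    /* Load the statistics into macro variables */
    data _null_;
        set work.stat;
        call symputx('median_val', median_val);
        call symputx('q1_val', q1_val);
        call symputx('q3_val', q3_val);
    run;

    /* Add the scaled variable RS<VariableName> to the dataset (overwrites it in place) */
    data &ds;
        set &ds;
        RS&var = (&var - &median_val) / (&q3_val - &q1_val);
    run;

    /* Univariate report with histograms of the actual and scaled variables */
    proc univariate data=&ds;
        var &var RS&var;
        histogram &var RS&var;
    run;
%mend Robust_Scaler;

/* Example call (assumes a modifiable WORK dataset) */
%Robust_Scaler(work.class, Weight);

If you want a built-in comparison, PROC STDIZE with METHOD=IQR applies the same median-and-IQR scaling.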

How data science trends are transforming analytics and data processing


Data science trends are transforming data processing and analytics. New technologies are paving the way for trends that will empower professional data scientists to complete more work in less time, or at least keep up with the growing volume of data. In 2018, over 59% of companies adopted big data analytics, up from 17% in 2015. In this blog post, we discuss the trends that are transforming analytics and how they affect professionals who work directly with data.

Developing data science trends

Before diving into the implications of transformative trends, it is important to explain what these trends are.

Automation

The effects of automation are felt across multiple industries, and automation is a huge part of the data science trends discussed today. It will see the development of smart tools that can perform basic procedures usually done by data scientists, because the technology can execute several different processes, like feature selection, model selection and even basic code generation, without human intervention. On a positive note, this frees up a data scientist’s time for more advanced work that generates real value, while smart tools handle routine tasks like preparing and analysing data. Automation will also transform analytics and data processing by lowering the barrier to entry, making advanced analytics platforms like business intelligence more accessible, and by reducing the need for a large team of data engineers to operate the platform. In fact, it is possible that analytics and data processing will be done autonomously in the near future.

Converging data sources

Data science trends are not just about technology; they are also about changing practices. One of the biggest changes to data analytics is the convergence of disruptive technologies. Data silos are coming to an end as data is merged from different sources to get the most insight into organisational performance. For example, IoT data analytics can capture and analyse data from many different devices, which matters because 20.4 billion IoT devices are expected to be in circulation by 2020. This will change data processing and analytics because these devices will generate five times more data than data scientists can process by themselves. That means more detailed findings across the board, which will pave the way for superior operational efficiency and higher profit margins.

Trustworthy Artificial Intelligence

It is hard to talk about data science trends without talking about data regulation practices. The GDPR may be an EU invention, but do not be surprised if its essential principles make their way across the world. Data is a valuable commodity, but organisations have to be responsible and respectful in the way they use it. With AI taking over certain responsibilities, like data curation and processing, the question is: can AI be ethical and trustworthy?

According to the GDPR, ethical AI must have two essential components: one, the ability to respect fundamental rights, principles, values and regulatory practices; two, technical robustness and reliability, since even systems built with good intentions can cause unintended harm. It should be noted that we are still in the early days of ethical and trustworthy AI, so answers are more vague than specific. However, new techniques, like reinforcement learning, are opening up possibilities that could make ethical AI a reality. Naturally, this would have a tremendously positive impact on data analytics, because an ethical AI tool can help data scientists automate data processing without the risk of breaking laws.

New types of analytics platforms

Data science and analytics are connected, so it makes sense that one of the more significant data science trends taking place across the board is the development of new analytics solutions. We have discussed the potential of predictive analytics in the past, but its potency will only grow as new technologies, like machine learning and NLP, mature. Alongside more powerful predictive analytics, organisations will adopt IoT data analytics and DataOps analytics. These new analytics platforms promise to produce results at a faster rate. For example, DataOps is an agile methodology that brings data engineers and DevOps teams together to build faster, more efficient data pipelines and deliver insights sooner.

The future of data science

The past few years have seen hundreds of companies across the world adopt data analytics and big data. However, the future will see data production and analysis increase at rates not seen before and current data science trends are trying to make this a reality. For example, 73% of companies surveyed said they had plans to adopt DataOps analytics to address growing challenges in maintaining the data pipeline. We can expect to see this technology transform data science significantly in the future because organisations will produce even more data at an even faster rate.