Predict the Sound Pressure of Airfoil Self-Generated Noise
Let’s use the NASA Airfoil Self-Noise Dataset to predict the scaled sound pressure level
Table of Contents
- Introduction
- About the Dataset
- Pre-processing
- Feature Engineering
- Linear Regression With Cross Validation
- Lasso Regression With Cross Validation
- Ridge Regression With Cross Validation
- Evaluating Different Regression Models
- Conclusion
Introduction
Hi guys! In this article, I will walk through how to pre-process the dataset and how to apply feature engineering to it. We will then predict the target variable using different regression models, and as the last step, we will evaluate those models.
Prerequisites
Since we have limited space, I will only briefly explain some of the broad theories we apply to the dataset. So, if you already have some understanding of machine learning concepts such as pre-processing, feature engineering, and regression models, this article will suit you better.
I will provide code snippets where necessary. At the end of this article, you will find the GitHub Repository with the Colab Notebook including all the steps we discuss here. So, you can open that in a new tab, and follow the rest of the article.
About the Dataset
We will use the NASA Airfoil Self-Noise Dataset to explain each and every step. In the end, we will predict the scaled sound pressure level using the other features of the dataset.
As Figure 1 suggests, this NASA dataset comprises NACA 0012 airfoils of different sizes, tested at different wind tunnel speeds and angles of attack. The scaled sound pressure level is the primary measurement of the noise generated by the airfoil.
These features can be considered as inputs:
- Frequency (Hz)
- The angle of attack (degrees)
- Chord length (meters)
- Free-stream velocity (meters per second)
- Suction side displacement thickness (meters)
Our target variable will be:
- Scaled sound pressure level (dB)
Let’s start by pre-processing the dataset.
Pre-processing
There are certain steps we can follow to improve the accuracy of the model we will apply. We will follow these steps in order:
- Handle Missing Values
- Handle Outliers
- Feature Transformations
- Feature Coding
- Feature Scaling
- Feature Discretization
1. Handle Missing Values
First, we need to identify null or missing values if there are any.
As Figure 2 suggests, checking for null values returns ‘False’ for every column, which indicates that the dataset has no null or missing values.
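Here’s a minimal sketch of that check, assuming the data is loaded into a pandas DataFrame called `df` (the file path, separator, and column names below are placeholders, not necessarily the ones used in the notebook):

```python
import pandas as pd

# Load the dataset (path, separator, and column names are assumptions)
cols = ["frequency", "angle_of_attack", "chord_length",
        "free_stream_velocity", "ssd_thickness", "sound_pressure_level"]
df = pd.read_csv("airfoil_self_noise.dat", sep="\t", names=cols)

# One True/False per column: does the column contain any null values?
print(df.isnull().any())
```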
2. Handle Outliers
Outliers are unusual data points that differ significantly from the rest of the records. The first step of outlier handling is to check if there are any outliers. For that, we can use boxplots and percentiles. Let’s draw the box plot for Frequency.
We can clearly see that there are outliers. We can either remove all of them, or we can replace them using the Inter-Quartile Range (IQR). In this case, removing the outliers is not a good option since they make up a high percentage of the overall records. Instead, let’s cap them at the upper limit value.
Similarly, I identified and capped outliers in the Angle of Attack and Displacement Thickness variables. Since this dataset was collected experimentally by varying the free-stream velocity and chord length, we won’t be removing outliers in those two variables.
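A rough sketch of the capping step, reusing the hypothetical `df` and placeholder column names from the earlier snippet:

```python
def cap_outliers_upper(data, column):
    """Replace values above the IQR upper fence (Q3 + 1.5 * IQR) with the fence itself."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    upper_limit = q3 + 1.5 * (q3 - q1)
    data[column] = data[column].clip(upper=upper_limit)
    return data

# Cap the variables with a high share of outliers (column names are assumptions)
for col in ["frequency", "angle_of_attack", "ssd_thickness"]:
    df = cap_outliers_upper(df, col)
```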
If you want to learn more about outlier handling in datasets, refer to this blog post.
3. Feature Transformations
When applying machine learning models such as Linear Regression, we prefer independent features that follow a normal distribution. If they don’t, we try to transform them into one during the pre-processing stage.
Let’s use histograms and Q-Q plots to see if our variables follow a normal distribution.
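A short sketch of how such plots can be produced with matplotlib and SciPy (the column name is again a placeholder):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

def plot_distribution(data, column):
    """Histogram on the left, Q-Q plot against a normal distribution on the right."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    data[column].hist(bins=30, ax=ax1)
    ax1.set_title(f"Histogram of {column}")
    stats.probplot(data[column], dist="norm", plot=ax2)
    plt.show()

plot_distribution(df, "frequency")  # repeat for each variable of interest
```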
Similarly, we need to identify skewness in every variable. You can refer to the Colab notebook for the complete code. We can identify that the following features have skewed distributions.
- Suction Side Displacement(SSD) Thickness — Right Skewed
- Frequency — Right Skewed
- Angle of Attack — Right Skewed
Since all these variables are right-skewed, we have two options for transforming the data: a logarithmic transformation or a square root transformation. The SSD Thickness and Frequency variables contain only positive values, so we can use a logarithmic transformation. But since the Angle of Attack variable contains zero values, where the logarithm is undefined, we apply a square root transformation instead.
Before applying the transformations, let’s split our dataset into training and testing sets. We need to split the dataset now because, when we test our models, we should test on real-world data instead of transformed values. We will divide the dataset randomly into training and testing sets containing 80% and 20% of the instances, respectively.
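A minimal sketch of the split, assuming the target column is named `sound_pressure_level` as in the earlier snippets:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["sound_pressure_level"])
y = df["sound_pressure_level"]

# 80% training, 20% testing, split at random
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```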
Now let’s apply logarithmic transformations to our training dataset.
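A sketch of the transformations, with the same placeholder column names as before:

```python
import numpy as np

# Right-skewed and strictly positive -> logarithmic transformation
X_train["frequency"] = np.log(X_train["frequency"])
X_train["ssd_thickness"] = np.log(X_train["ssd_thickness"])

# Right-skewed but contains zeros -> square root transformation
X_train["angle_of_attack"] = np.sqrt(X_train["angle_of_attack"])
```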
Let’s check our Frequency (Hz), and Suction Side Displacement Thickness (delta) columns, after applying logarithmic transformations.
4. Feature Coding
If we have categorical data in our dataset, we need to convert them into numeric data. To do that, we can use encoding techniques such as one-hot encoding, integer (label) encoding, and ordered label encoding.
Since all the variables in our original dataset are numerical, we don’t need to apply feature coding to our dataset. If we apply coding to our variables, the meanings of those numerical values will be lost.
5. Feature Scaling
We can apply feature scaling to normalize the values in our independent variables to a similar range. This helps to control feature magnitude. Especially since the scale of a variable directly influences the regression coefficient, this step is necessary.
Since we do not have any categorical variables, we can apply feature scaling to the whole dataset. Let’s scale each feature to its minimum and maximum (min-max scaling). Note that I did not use standard scaling, because our dataset should not contain negative values.
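A minimal sketch using scikit-learn’s MinMaxScaler:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales every feature into the [0, 1] range
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test set
```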
Let’s plot the variables after scaling the dataset.
We can observe that all dataset features have been scaled into similar values.
6. Feature Discretization
Variable discretization is the process of transforming continuous variables into discrete ones. Since our variables have small ranges and few unique values, applying feature discretization is unnecessary.
However, I did apply feature discretization later to find out whether it could improve the performance of the model. I applied discretization to Frequency (Hz). The model accuracy decreased after applying discretization, so I dropped that step to keep the pre-processing simple.
In case you want to apply discretization in the future, here’s how you can do it. Since we know the target variable, we can find the optimal number of bins using a decision tree discretizer.
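One possible sketch using plain scikit-learn (the notebook may use a dedicated discretiser library instead): fit a shallow decision tree on the single feature, let grid search pick its depth, and use the tree’s leaf predictions as the discretized values.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Grid search over tree depth; each depth corresponds to a number of bins (leaves)
tree_search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                           param_grid={"max_depth": [1, 2, 3, 4, 5]},
                           cv=5, scoring="neg_mean_squared_error")
tree_search.fit(X_train[["frequency"]], y_train)
print("Best depth:", tree_search.best_params_)

# Replace the continuous feature with the tree's leaf outputs (one value per bin)
X_train["frequency_binned"] = tree_search.predict(X_train[["frequency"]])
```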
That was the final pre-processing step. Let’s move on to Feature Engineering.
Feature Engineering
In the Feature Engineering stage, we will mainly try to do a few things.
- Filtering features based on their significance in determining the output.
- Compressing the dataset by identifying and removing redundant features.
Since many features contribute to the required result with different coefficients and to different degrees, this step is essential in every machine learning process. It also helps to reduce the noise in our dataset.
Feature Extraction
1. Analyze Correlation
As the first step, let’s analyze the correlation between independent features. We can draw a correlation matrix for every independent feature.
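For example, with pandas and seaborn, reusing the training features from the earlier sketches:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the independent features only
corr = X_train.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```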
The correlation matrix simply shows how much one variable changes when another variable changes. If the correlation is near +1 or near -1, there is a strong positive or negative correlation, respectively. If the correlation between two features is near 0, we can assume that those two variables are independent.
In this dataset, we can see that ‘Suction Side Displacement Thickness (delta)’ and ‘Angle of Attack (o)’ have a somewhat high correlation of 0.84. Since it is below 0.9, we can assume that no two independent features have a strong positive or negative correlation. Therefore we keep all the features, since they are sufficiently independent of one another.
2. Analyze Significant Features
Let’s use a new correlation matrix to analyze significant independent features by comparing those features with our dependent feature, the Scaled Sound Pressure Level (dB).
If we analyze only the top row of the correlation matrix, we can clearly observe that every independent feature has some degree of correlation with the dependent feature, Scaled Sound Pressure Level (dB). Therefore we can identify all the features, Frequency, Angle of Attack, Chord Length, Free Stream Velocity, and Suction Side Displacement Thickness, as significant features.
Dimensionality Reduction
In this step, we try to compress our data points into a lower-dimensional space. To do that, let’s apply PCA (Principal Component Analysis) to our dataset.
To learn more about applying PCA, and the concept behind PCA, you can refer to this article.
We can use sklearn pipeline, and GridSearchCV model selection method to find the optimal number of components.
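A rough sketch of that search, pairing PCA with a plain linear regressor inside a pipeline (the candidate component counts and cross-validation settings are assumptions):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("pca", PCA()), ("regressor", LinearRegression())])

# Try every candidate number of components and keep the best-scoring one
search = GridSearchCV(pipe, param_grid={"pca__n_components": [1, 2, 3, 4, 5]}, cv=5)
search.fit(X_train_scaled, y_train)

print("Best number of components:", search.best_params_["pca__n_components"])
print("Explained variance ratios:",
      search.best_estimator_.named_steps["pca"].explained_variance_ratio_)
```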
We can see that more than 95% of the variance can be explained with 4 components. We don’t need to take all the features to increase the accuracy; if we take all of them in scenarios like this, the model may overfit when applied to real-world data. If we reduce the number of components further, less variance is explained, and the model can become under-fitted.
Let’s define PCA with 4 components and reduce our dataset into 4 dimensions.
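Sketched with scikit-learn, again on the scaled features from earlier:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
print("Total explained variance:", pca.explained_variance_ratio_.sum())
```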
Now since we have performed feature engineering on our dataset, let’s predict our target variable using different regression models.
Polynomial Features
Polynomial features are features created by raising existing features to an exponent. Polynomial features do have the potential of improving the accuracy of machine learning algorithms.
Our machine learning algorithms performed drastically better after applying polynomial features. Here I have used the PolynomialFeatures class provided by the scikit-learn Python machine learning library.
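A minimal sketch of that step (the degree used here is an assumption, not necessarily the one from the notebook):

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 terms: the original columns, their squares, and all pairwise products
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_pca)
X_test_poly = poly.transform(X_test_pca)
print(X_train_poly.shape)
```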
Now that we have applied feature engineering to our dataset, let’s train different models with our dataset.
Linear Regression With Cross Validation
Since we have already split our dataset into training and testing sets, let’s first train our Linear Regression model on the training set.
Here we have used the sklearn.linear_model module. Note that we have more than 5 coefficients because we increased the number of dimensions in the Feature Engineering step.
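A minimal sketch, reusing the polynomial training features from the previous step:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train_poly, y_train)

print("Coefficients:", lin_reg.coef_)
print("Intercept:", lin_reg.intercept_)
```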
K-Fold Cross-Validation
Now let’s use the K-Fold Cross-Validation technique to measure our model performance. K represents the number of groups that the data sample gets divided into. Let’s use k=10 for our evaluation.
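Sketched with scikit-learn’s cross_val_score; note that for a regressor the default score per fold is R², and the shuffling seed below is an assumption:

```python
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(lin_reg, X_train_poly, y_train, cv=kfold)  # R^2 per fold by default
print("Mean 10-fold score:", scores.mean())
```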
To learn more about K-Fold Cross-Validation, please refer to this article.
We can clearly see that our linear regression model has an accuracy of 92.349% under 10-fold cross-validation.
Lasso Regression With Cross Validation
Let’s build a Lasso Regression model with our training dataset. You can learn about the theories behind Lasso Regression using this article.
Let’s validate our model with K-Fold Cross-Validation.
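A rough sketch of both steps (the regularisation strength alpha is an assumed value, not the one from the notebook):

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

lasso_reg = Lasso(alpha=0.01)  # alpha is an assumed value
lasso_reg.fit(X_train_poly, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
print(cross_val_score(lasso_reg, X_train_poly, y_train, cv=kfold).mean())
```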
When tested with cross-validation, our lasso regression model produced an accuracy of 91.130%.
Ridge Regression With Cross Validation
Let’s fit a Ridge Regression model to our training dataset.
Let’s figure out the accuracy of our model using K-Fold cross-validation.
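Sketched the same way as the Lasso model (alpha is again an assumption):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

ridge_reg = Ridge(alpha=1.0)  # alpha is an assumed value
ridge_reg.fit(X_train_poly, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
print(cross_val_score(ridge_reg, X_train_poly, y_train, cv=kfold).mean())
```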
When tested with cross-validation, our ridge regression model produced an accuracy of 87.069%.
Evaluating Different Regression Models
Now we come to our final step. Let’s evaluate our models by calculating the model errors for all three models we have trained so far. We will mainly use Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
Mean Squared Error measures the average squared difference between the predicted values and the observed values in the dataset. These are the calculated errors and accuracies for our models.
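A sketch of how those errors can be computed on the test set, assuming the test features have gone through the same scaling, PCA, and polynomial steps as the training features:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

for name, model in [("Linear", lin_reg), ("Lasso", lasso_reg), ("Ridge", ridge_reg)]:
    y_pred = model.predict(X_test_poly)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name}: MSE={mse:.3f}, RMSE={np.sqrt(mse):.3f}, "
          f"MAE={mean_absolute_error(y_test, y_pred):.3f}")
```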
Out of the three models, we have observed that Linear Regression has the best accuracy and the lowest error rates. Especially since our target variable, Scaled Sound Pressure Level (dB), follows a normal distribution, a linear model is the most applicable.
Conclusion
In this blog post, we have gone through a comprehensive walkthrough: from pre-processing the dataset to performing feature engineering and predicting the target variable using different machine learning models such as Linear Regression, Lasso Regression, and Ridge Regression.
We didn’t stop there: we also evaluated the different models using several evaluation metrics. You can find the Python Notebook in the following GitHub Repository.
- Python Jupyter Notebook — https://github.com/SahanAmarsha/airfoil-self-noise-prediction
- NASA Airfoil Self-Noise Dataset — https://www.kaggle.com/fedesoriano/airfoil-selfnoise-dataset
I hope you found this tutorial helpful. Feel free to ask any questions, in the comments section below. Happy Coding!💫