What it is:
SMOTE (Synthetic Minority Over-sampling Technique) is a data science technique that mitigates the effects of class imbalance in machine learning training data, such as misleadingly high accuracy rates.
What it does:
SMOTE synthesizes new instances of the under-represented class in a data set to reduce the imbalance.
Why it matters:
Imbalances in training data can make artificial intelligence unreliable, increasing risk and undermining trust.
What to do about it:
Data scientists can use SMOTE tools and techniques when they are concerned about potential weaknesses in data sets. As an ML decision maker, you can ensure that your data scientists have all the necessary tools to create reliable models, with SMOTE as an element of the toolset.
Training a machine learning model is problematic when one class in the training set dominates the other. For example, 96 percent of the records in a quality assurance data set may have passed, whereas only four percent failed. Without modification, a machine learning application might produce a model that looks highly accurate, one whose predictions are correct 96 percent of the time, simply by “computing” that all of the products are defect-free. The problem is that it would miss 100 percent of the defective products.
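The arithmetic behind this example can be checked with a short Python sketch. The labels below are hypothetical data built to match the 96/4 split in the text, and the "model" is simply a rule that always predicts the majority class; the names are illustrative only.

```python
# Hypothetical quality assurance data: 96 passed, 4 failed (the 96/4 split from the text).
labels = ["pass"] * 96 + ["fail"] * 4

# A stand-in "model" that ignores its input and always predicts the majority class.
predictions = ["pass"] * len(labels)

# Accuracy: share of all predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the defective class: share of real failures the model caught.
preds_for_failures = [p for p, y in zip(predictions, labels) if y == "fail"]
fail_recall = sum(p == "fail" for p in preds_for_failures) / len(preds_for_failures)

print(f"accuracy: {accuracy:.2f}")
print(f"recall on failed products: {fail_recall:.2f}")
```

Accuracy alone hides the failure; a per-class measure such as recall on the defective class exposes it.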
As discussed in the GigaBrief “What Is Explainable AI?”, data science is under increasing pressure to prove algorithms are working without bias or preconception. An imbalanced data set (for example, a set of face images containing only certain types of faces) can exacerbate the potential for bias.
SMOTE synthesizes new data points within the minority set. It does this by interpolating between an existing minority instance and one of its nearest minority-class neighbors. In the quality control example, these points represent defective products. These new instances help to balance the data set and address the problems associated with an imbalanced data set.
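A minimal sketch of the idea in plain Python follows. It is an illustration under stated assumptions, not a reference implementation: the function name smote_sketch, its parameters, and the sample points are all hypothetical, and real projects would more likely rely on a maintained library such as imbalanced-learn.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration; assumes at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbors among the other minority points.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points, e.g. two measurements per defective product.
stars = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.2)]
new_stars = smote_sketch(stars, n_new=4)
print(new_stars)
```

Because each synthetic point lies on a line segment between two real minority points, the new instances stay inside the region the minority class already occupies rather than being exact copies of existing records.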
For example, in the following data set there are many more triangles than there are stars.
After identifying the minority class, SMOTE synthesizes the new instances:
The new data set now has four new star instances and comes closer to balance than the original star-triangle set. The star class is less under-represented, and a model trained on this data set should predict the minority class more reliably.
“Machine-Learning Approach to Optimize SMOTE Ratio in Class Imbalance Dataset for Intrusion Detection,” Jae-Hyun Seo and Yong-Hyuk Kim.