Types of Predictive Models (& How They Work)
Predictive analytics is an umbrella term for the use of data to predict future outcomes. The basic idea is that predictive analytics models can analyze historical data and find patterns. Patterns discovered by these models can predict future behavior. Some predictive analytics models are easier to understand than others, but they all have their place in business and data science.
In this article, we’ll outline the major types of predictive analytics models so you know what kind of model could help your organization achieve its goals.
A predictive analytics model is essentially a set of algorithms that discovers patterns in data and uses those patterns to make useful predictions. Predictive analytics models produce two main kinds of output: a category (e.g., will this customer churn?) or a numerical value (e.g., how much will this customer spend?).
Data scientists commonly employ five main types of predictive analytics models: classification, regression, time series, clustering, and anomaly detection.
It’s important to note that predictive analytics models will not work well in every situation. Several common problems can arise when using predictive analytics, and understanding them will help you ensure your data is adequately prepared, and your model is working as effectively as possible.
One of the most common problems with predictive models is that they require thorough, accurate data preparation. Achieving that can be difficult, especially when you have to account for multiple algorithms, each with its own data preparation requirements and assumptions.
For example, at least ten viable and commonly used algorithms can predict customer churn. Each of these algorithms has unique data preparation requirements and assumptions, which must be known and addressed before building the model.
Another problem is that it is hard to know in advance which algorithm will work best. Even experienced practitioners cannot tell whether an algorithm will be effective without trial and error.
The difficulty in selecting the correct algorithm means that companies should be ready to experiment with different types and variations. They should also be prepared to change their approach if one algorithm doesn’t work or if it seems like another would yield better results.
The biggest problem with predictive analytics models is that failure is common. It’s not always clear why a model fails, and it can be frustrating when a model you expected to perform well simply doesn’t.
It can also be hard to tell in advance how good a model is; you have to run it on real data to know whether it works well, which means waiting until outcomes are known before you can judge whether the model performed correctly.
Let’s briefly review each model type in more detail to better understand how it works and its key use cases.
A classification model is a predictive analytics model that predicts which category a sample belongs to: for example, whether a customer will default on their loan or pay it off. It is about finding knowledge that can be applied to new examples. Classification can be either binary or multinomial: binary means there are two possible categories; multinomial means there are several categories to which a sample may belong.
Decision trees are a common way of representing a classification model, often shown as a flowchart with branches that lead to different outcomes depending on the value of one or more inputs (features).
A bank might train a model to predict which prospective customers will repay a new loan in full or which might default by leveraging historical data from previously approved loans.
The model assigns a probability for each prospective customer, “classifying” them as likely to pay or not. For every person who applies for a loan, the bank can use predictive models to assess risk before granting the loan.
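As a rough illustration, here is a minimal sketch of that loan example in Python with scikit-learn. The feature names (income, debt ratio, late payments), the synthetic data, and the threshold are made up for illustration; a real model would be trained on the bank’s historical loans.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50_000, 15_000, n)   # hypothetical: annual income
debt_ratio = rng.uniform(0, 1, n)        # hypothetical: debt-to-income ratio
late_payments = rng.poisson(1, n)        # hypothetical: past late payments
X = np.column_stack([income, debt_ratio, late_payments])

# Synthetic label: higher debt ratio and more late payments -> more likely to default.
y = (0.5 * debt_ratio + 0.2 * late_payments + rng.normal(0, 0.2, n) > 0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# predict_proba gives each applicant's probability of default, which the bank
# can use to assess risk before granting the loan.
default_probability = model.predict_proba(X_test)[:, 1]
print(default_probability[:5])
```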
Another example is a model trained to classify emails into several categories (e.g., spam, promotions, social, and business). When you receive an email, it will automatically assign it to one of the categories, making it easier for you to handle your inbox.
Classification models predict the probability of each outcome. A business can then identify which profiles (age, location, etc.) to target with marketing campaigns to attract the most reliable customers.
Another use of classification models is in financial services: identifying who should receive a loan or be denied based on their credit score or other financial metrics.
Classification can also help us understand our customers better by separating them into categories based on patterns in their behavior. A deeper understanding can, in turn, help us adjust our business model to better suit each category.
| Method | Advantages | Drawbacks |
| --- | --- | --- |
| Generalized linear models (logistic regression) | Good probabilistic interpretation; avoids overfitting with regularization; easily updated with new data | Underperforms when there are multiple or non-linear decision boundaries |
| Support vector machines | Can model non-linear decision boundaries; robust against overfitting in high-dimensional spaces | Memory intensive; tricky to tune |
| Ensemble tree-based models (XGBoost, LightGBM, RandomForest) | Robust to outliers; scalable; naturally model non-linear decision boundaries; great for numeric and categorical data | Can be prone to overfitting |
| Neural networks & deep learning | Very good at classifying audio, text, and image data | Require very large amounts of data to train |
Regression is a predictive analytics model that uses past examples to predict numerical values, such as how much money an individual might earn in a given year or their likely retirement age. It is similar to classification, but regression applies when the target is numerical or continuous, e.g., prices or quantities.
The most common approach to regression modeling is linear regression, which uses historical data points to draw a line that predicts future results based on past trends.
Using past customers’ information and their Customer Lifetime Value (CLV), you can train a model to predict CLV for new or existing customers.
Once the training is complete, you can use the model to make predictions about the CLV of new customers or for existing ones who don’t have enough data in the company’s customer relationship management system (CRM) to determine CLV properly.
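Here is a minimal sketch of that idea with scikit-learn, assuming a few hypothetical CRM features (tenure, monthly spend, support tickets) and synthetic data in place of real customer records:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
tenure = rng.uniform(1, 60, n)           # hypothetical: months as a customer
monthly_spend = rng.uniform(10, 200, n)  # hypothetical: average monthly spend
tickets = rng.poisson(2, n)              # hypothetical: support tickets filed
X = np.column_stack([tenure, monthly_spend, tickets])

# Synthetic CLV: roughly spend * tenure, minus a penalty per ticket, plus noise.
clv = monthly_spend * tenure - 50 * tickets + rng.normal(0, 200, n)

model = LinearRegression().fit(X, clv)

# Predict CLV for a new customer with 12 months of history,
# $80 average monthly spend, and one support ticket.
new_customer = [[12, 80, 1]]
print(model.predict(new_customer))
```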
Other examples include predicting how many students will graduate from a university, or predicting a house’s price from its features (e.g., size, location).
Regression models can inform various business decisions, including which customers will likely churn and how much they’re worth.
For example, if your company is trying to maximize CLV across customers, you may want to know which variables are most important in determining CLV.
You could then use this information to help guide your marketing strategy by understanding which characteristics are most likely to lead to higher CLVs or better retention rates among existing customers.
Additionally, suppose you’re limited on resources and need help prioritizing which customers should receive the most attention. In that case, regression models can be used to determine which characteristics are most predictive of high customer value.
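One common way to see which characteristics matter most is permutation importance on a fitted regression model. This short sketch continues the hypothetical CLV model and feature names from the example above:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the model's score drops;
# larger drops indicate features the model relies on more heavily.
result = permutation_importance(model, X, clv, n_repeats=10, random_state=0)
for name, score in zip(["tenure_months", "monthly_spend", "support_tickets"],
                       result.importances_mean):
    print(f"{name}: {score:.3f}")
```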
A time series model, also called a forecasting model, is a predictive analytics model that uses historical data to predict future outcomes. Forecasting models are most often used to predict sales, revenue, or financial results, but they can help with many other forecasts as well.
Forecasting is best understood as a subset of prediction. We consider a prediction to be forecasting when the model is used to estimate future values based on the past values of the time series.
Forecasting models typically contain three main components: a base model, which predicts historic values; a time series adjustment layer, which adjusts the base model’s predictions based on time-sensitive factors; and a forecast layer, which provides the final prediction.
Forecasting models are helpful in retail sales. For example, a retailer could use a forecast model to predict weekly sales for individual items at each retail location.
The retailer might start by collecting historical data on all past sales, including items sold, amount sold, and purchase location. The retailer could then use this data to build a model that predicts how many units of each item each store will sell next week.
A machine learning algorithm would identify patterns in the data that are likely to continue. The algorithm can then use these patterns to predict future sales for each item and store combination.
For example, if one pattern shows that sales tend to go up at certain times of the year and down at others, this could help predict what will happen next year based on prior years’ data.
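As a rough sketch of that idea, the baseline below uses a simple seasonal-naive rule (next week’s sales are assumed to equal the same week last year) on synthetic weekly data for one item at one store. A production forecasting model would typically replace this with something more sophisticated and would run per item and store combination.

```python
import pandas as pd

# Two years of weekly sales for one item at one store (synthetic numbers).
weeks = pd.date_range("2022-01-03", periods=104, freq="W-MON")
sales = pd.Series(range(100, 204), index=weeks, name="units_sold")

# Seasonal-naive forecast: next week's sales = sales in the same week last year.
next_week = weeks[-1] + pd.Timedelta(weeks=1)
forecast = sales.loc[next_week - pd.Timedelta(weeks=52)]
print(next_week.date(), forecast)
```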
Forecasting models help companies in a variety of ways. They can help avoid operational problems such as out-of-stock situations that hamper sales, and keep operating costs reasonable by limiting overstocking.
Forecasting models can also predict the likely volume of units of a product sold over a given time. An accurate item forecast enables retailers to decide how much stock they need to carry at any given time, ensuring that there will always be enough products available for customers.
Another example is a businessman who used a forecasting model to predict how many people would be shopping on a given day. With this knowledge, he knew whether he needed extra help in the store and could therefore allocate his resources more effectively.
Cluster analysis, or clustering, is the grouping of a set of objects into groups (clusters) so that objects within one group are more similar to each other than to objects in other groups.
Clustering can be helpful for market segmentation: understanding our target audience better so we can offer specific products to specific people depending on their segment. For example, a mobile service provider can offer several different packages to suit the needs of different customer groups. Clustering has many other uses, such as in medical imaging (e.g., grouping patients by the most suitable treatment), crime analysis (e.g., narrowing suspects by personality type), and biology (e.g., separating different species for easier inspection and analysis).
Understanding our customers better enables us to provide more value for them. Understanding how to target them depending on which group they belong to means we can maximize profits while providing them with the most benefit.
Anomaly detection is about finding novel or surprising knowledge. It is about finding points that stand out from the usual patterns we expect and recognize.
When looking into the data, we may find a sizeable demographic group that could be profitable but that we haven’t focused on yet. Another example is recognizing fraud or suspicious activity in time to act and deter it.
The value of anomaly detection is that it can either increase profits by uncovering meaningful signals or protect the business by revealing potential threats and challenges.
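A minimal sketch of anomaly detection with scikit-learn’s IsolationForest, flagging transaction amounts that stand out from the usual pattern; the data is synthetic and the single “amount” feature is just an illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(50, 10, size=(500, 1))   # typical transaction amounts
suspicious = np.array([[400.0], [520.0]])    # a few unusually large ones
amounts = np.vstack([normal, suspicious])

detector = IsolationForest(contamination=0.01, random_state=0).fit(amounts)
labels = detector.predict(amounts)           # -1 = anomaly, 1 = normal
print(amounts[labels == -1].ravel())
```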
An algorithm is a set of rules the computer system will follow to create analytics models. A machine learning engineer will choose the most suitable algorithm depending on the problem we are trying to solve, our data, and our work environment. Below is a description of some of the most commonly used algorithms and their most common use cases.
GLM refers to a broad family of models that includes both linear and logistic regression. The main idea is to keep the simplicity and interpretability of a linear model while extending it to different kinds of outcomes (e.g., binary outcomes via logistic regression).
Random forests, or random decision forests, are ensemble learning methods for classification and regression. They work by constructing a multitude of decision trees at training time, which helps correct for an individual decision tree’s habit of overfitting to its training set.
Random forests are most suitable when working with large amounts of data. The downside is that such large forests can become difficult to interpret.
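A short sketch of a random forest classifier with scikit-learn, using one of its built-in datasets purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 200 decision trees trained on bootstrap samples; predictions are averaged
# across the forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```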
The gradient boosting algorithm builds trees one at a time, with each new tree helping to correct the errors made by the previously trained trees.
A gradient boosted model is a good choice when model performance is the priority. Like random forests, it works well with large datasets but is not recommended for small ones.
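A sketch of gradient boosting with scikit-learn, using the same illustrative dataset as above; `staged_predict` shows how accuracy changes as trees are added one at a time:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit to the errors of the ensemble built so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X_train, y_train)

# Test accuracy after 10, 50, and 100 trees.
staged = list(gbm.staged_predict(X_test))
for n in (10, 50, 100):
    print(n, accuracy_score(y_test, staged[n - 1]))
```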
K-means is an iterative, unsupervised algorithm. It repeatedly assigns each data point to the nearest cluster center and updates the centers until the assignments stop changing. It is unsupervised because we do not know the correct answer in advance: we don’t know how many clusters the data should have or where each point belongs. Due to its simplicity, it is one of the most popular clustering algorithms, and it is valuable whenever we aren’t sure how we want to segment our data.
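A minimal sketch of customer segmentation with k-means in scikit-learn. The two features (annual spend, visits per month), the synthetic data, and the choice of three clusters are made up; in practice the number of clusters is itself something to experiment with:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic customers drawn from three loose groups (spend, visits per month).
customers = np.vstack([
    rng.normal([200, 1], [30, 0.5], size=(100, 2)),   # occasional shoppers
    rng.normal([800, 4], [100, 1], size=(100, 2)),    # regulars
    rng.normal([2000, 10], [300, 2], size=(100, 2)),  # heavy users
])

X = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(segments))   # number of customers assigned to each segment
```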
Prophet is a time-series modeling algorithm provided by Facebook, popularly used for forecasting models. One of the things to consider when working with time series data is seasonal variation, or more commonly, seasonality. Seasonality refers to repeating patterns within a time period, like months in a year. Prophet works best with data with a strong seasonality influence, especially when we have several seasons of historical data.
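A minimal sketch of Prophet on a synthetic daily series with yearly seasonality; Prophet expects a dataframe with columns `ds` (dates) and `y` (values):

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Three years of synthetic daily data: a gentle trend plus a yearly cycle.
dates = pd.date_range("2020-01-01", periods=3 * 365, freq="D")
y = (0.05 * np.arange(len(dates))
     + 10 * np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365))
history = pd.DataFrame({"ds": dates, "y": y})

model = Prophet(yearly_seasonality=True)
model.fit(history)

future = model.make_future_dataframe(periods=90)   # forecast 90 days ahead
forecast = model.predict(future)
print(forecast[["ds", "yhat"]].tail())
```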
Data exploration is finding potentially valuable data that is not immediately apparent. It is a critical step in the process of making models, and it is often an overlooked one.
We accelerate your predictive analytics process by exploring all available algorithms to find the ones that might be able to give the most valuable results.
We do this automatically, without needing any experience with the models. We also do this in a distributed way, which allows the exploration of far more models than would otherwise be possible.
We handle all of the individual requirements for each algorithm to ensure that each model is created following best practices “under the hood,” guaranteeing a model that gives the best results possible.
Get started today by scheduling a demo.