In this section, I want to give a high-level overview of the models we'll build on the CAS server: logistic regression, support vector machines, decision trees, random forests, gradient boosting, and neural networks.

Let's start with a simple data set consisting of two interval inputs, X1 and X2, along with a binary target, blue or yellow. Although most real data will have more than two inputs, it is useful to visualize how our models work in two dimensions. The intuition we build in this two-dimensional case can be extended to higher dimensions when working with real data. Each point in the plot represents a single observation from our data set, with its coordinates corresponding to the values of the input variables and the color corresponding to the value of the target variable.
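To make this concrete, here is a minimal Python sketch (my own illustration, not the course's CAS code) of a data set like the one described: two interval inputs and a binary blue/yellow target. The class centers, spread, and sample size are arbitrary assumptions chosen just so the plot looks similar.

```python
# A toy two-input, binary-target data set and scatter plot (assumed values).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

n = 200
x_blue = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n, 2))    # non-events
x_yellow = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n, 2))    # events

X = np.vstack([x_blue, x_yellow])
y = np.array([0] * n + [1] * n)   # 0 = blue, 1 = yellow (the event of interest)

plt.scatter(X[y == 0, 0], X[y == 0, 1], c="blue", label="blue")
plt.scatter(X[y == 1, 0], X[y == 1, 1], c="gold", label="yellow")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()
```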

Each of our predictive modeling algorithms will create a rule that is applied to this plot to distinguish the blue dots from the yellow dots.

The simplest model we will consider is logistic regression, which uses a linear combination of the input variables to predict the target. With our two input variables, we get three model parameters: an intercept and a weight for each input. Instead of predicting an unbounded number like linear regression, we want to predict a probability, so we use the logit link function: applying its inverse, the logistic function, converts the unbounded linear combination into a probability between zero and 1. The model estimates p-hat, the probability that an observation is an event. In this case, p-hat tells us the probability of an observation being yellow (the event of interest) rather than blue.
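As a rough sketch of that calculation, with made-up parameter values and plain numpy rather than any CAS action, the linear combination of the inputs is pushed through the inverse of the logit link (the logistic, or sigmoid, function) to get p-hat:

```python
import numpy as np

def predict_p_hat(x1, x2, b0, b1, b2):
    """Probability of the event (yellow) for one observation.

    The linear combination b0 + b1*x1 + b2*x2 is unbounded; the inverse
    of the logit link (the logistic function) maps it into (0, 1).
    """
    eta = b0 + b1 * x1 + b2 * x2          # unbounded linear predictor
    return 1.0 / (1.0 + np.exp(-eta))     # p-hat, between 0 and 1

# Illustrative (made-up) parameter estimates:
print(predict_p_hat(x1=0.5, x2=1.2, b0=-0.3, b1=1.1, b2=0.8))
```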

In the context of the two-dimensional plot, the logistic regression model draws lines on the plot corresponding to different probabilities of an observation being yellow; each line represents a different probability. After we choose a cutoff probability (for example, 0.50), we would say every observation above the line for that probability is yellow and every observation below that line is blue. The angle and location of the lines in the plot are determined by the parameter estimates, which are found by maximizing the log-likelihood function.
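Here is a minimal scikit-learn sketch of that fitting step, assuming the toy X and y arrays from the earlier snippet. Note that scikit-learn maximizes a lightly L2-penalized log-likelihood by default, and that the 0.50 line falls exactly where the linear predictor equals zero:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()   # maximizes a (lightly penalized) log-likelihood
model.fit(X, y)

b0 = model.intercept_[0]
b1, b2 = model.coef_[0]

# p-hat = 0.50 exactly where the linear predictor is 0:
#   b0 + b1*x1 + b2*x2 = 0   =>   x2 = -(b0 + b1*x1) / b2
# which is the straight line separating "predicted yellow" from "predicted blue".
print(f"0.50 boundary: x2 = {-b0 / b2:.3f} + {-b1 / b2:.3f} * x1")
```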

The limitation of logistic regression is that it can draw only straight lines in this plot, even if the data would be better separated by a more complicated geometry. In higher dimensions, this problem persists, although lines are replaced by planes (in three dimensions) or hyperplanes (in higher dimensions).

One way to overcome the straight-line limitation and to introduce curved decision boundaries is to include higher-order polynomial terms in the model. A second-order polynomial regression model would include all quadratic terms, so in our example, we would have a parameter for X1 squared, a parameter for X2 squared, and a parameter for X1 times X2. Adding these terms to the model equation enables the model to draw quadratic decision boundaries rather than just linear boundaries. Adding third-order terms (like X1 cubed) enables the model to draw cubic decision boundaries, and so on, for higher-order terms.
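A sketch of how those quadratic terms might be added in scikit-learn, again assuming the toy X and y arrays from above; PolynomialFeatures generates the squared and cross-product columns before the logistic fit:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree=2 keeps x1 and x2 and adds x1**2, x2**2, and x1*x2,
# so the fitted decision boundary can be a quadratic curve rather than a line.
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
)
quadratic_model.fit(X, y)
```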

These extra terms increase the model's flexibility, possibly leading to overfitting. If we know that the inputs are nonlinearly related to the target, it can be useful to add polynomial terms, but it is still important to evaluate performance on validation data to ensure that we're not overfitting.
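One way to run that check, sketched here with a simple holdout split on the toy arrays (not the course data), is to fit both the linear and the quadratic model on training data only and compare their accuracy on the held-out validation set:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=1
)

models = {
    "linear": LogisticRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                               LogisticRegression()),
}
for name, m in models.items():
    m.fit(X_train, y_train)                        # fit on training data only
    print(name, "validation accuracy:", m.score(X_valid, y_valid))
```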