The historical development of neural networks was motivated by observations of biological neurons, and early work conceived of them as representations and approximations of human cognitive processing. Although this is an interesting historical perspective, neural networks in the modern era are tools for representing nonlinear functions. A universal approximation theorem states that a sufficiently complex neural network can model any nonlinear function. This is a powerful result because it means we can, in principle, model any input-to-target relationship with a neural network. The caveat is that the theorem does not tell us how to model a specific function, or even how complex the network must be to do so. The difficulty with neural networks is therefore determining the architecture and training algorithm needed for the network to learn the function we want to model.

The ability to model nonlinear relationships makes neural networks a powerful modeling algorithm, but it comes with a significant drawback in interpretability. Neural network models cannot be interpreted as easily as the logistic regression or decision tree models we saw earlier. Logistic regression parameters can be interpreted as odds ratios, and a decision tree produces a list of rules that humans can read directly. Neural networks also have parameters, but each input is used in many places in the model and so corresponds to many parameters; the easy interpretation of individual parameters is lost.

At first glance, the model equations look very similar to those for logistic regression: a logit link function on the left side, and a linear combination of weights multiplying hidden units on the right. The difference, of course, is that the hidden units are not the input variables themselves but nonlinear transformations of the input variables.
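The similarity between the two model equations can be sketched in code. This is a minimal illustration, assuming a single hidden layer with tanh units; the function and weight names are hypothetical, not from the text:

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(x, w, b):
    """A linear combination of the raw inputs, passed through the link."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def one_layer_network(x, hidden_w, hidden_b, out_w, out_b):
    """The same output equation, but the quantities it combines are
    nonlinear transformations (here tanh) of the original inputs."""
    h = [math.tanh(bj + sum(wi * xi for wi, xi in zip(wj, x)))
         for wj, bj in zip(hidden_w, hidden_b)]
    return sigmoid(out_b + sum(wj * hj for wj, hj in zip(out_w, h)))
```

Structurally, the output layer of the network is a logistic regression; it just operates on learned hidden units rather than on the inputs directly.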
The nonlinearity in the hidden units is what allows neural networks to model nonlinear functions. Here we use the hyperbolic tangent as the activation function, but other nonlinear functions can be used. The choice of activation function usually does not affect model performance significantly; when in doubt, choose the function that performs best on validation data. This neural network has one hidden layer with three hidden units. With two inputs, this architecture has 13 weight parameters, already far more than the 3 parameters we would have in logistic regression. The network is trained by maximum likelihood estimation: a backpropagation algorithm moves backward through the network, updating the weight values to reduce error on the training data. Once the weights have been trained, the model classifies new data points by simply plugging their inputs into the model equations, producing a probability that can be used to assign a class.

Unlike our other models, the neural network generates decision boundaries that are fundamentally nonlinear; the model has, in effect, drawn circles and hyperbolas onto the two-dimensional space to separate the yellow dots from the blue dots. The ability to model arbitrary nonlinear functions allows much more complicated decision boundaries, but this flexibility comes at a cost: because neural networks are so flexible, it can be hard to choose an appropriate network architecture or weight-training procedure. Poorly designed or trained neural networks often perform badly, so developing good neural network models requires practice.
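The training procedure described above can be sketched end to end. This is a minimal, loop-based illustration, not a production implementation: a one-hidden-layer tanh network fit by stochastic gradient descent on the cross-entropy loss (minimizing it is maximum likelihood estimation), with hand-derived backpropagation updates. The XOR-style data, the random seed, and all hyperparameters are illustrative assumptions, not from the text:

```python
import math
import random

random.seed(0)

# Two inputs, one hidden layer with three tanh units:
# 3 * (2 + 1) hidden weights/biases + 3 + 1 output weights/bias = 13 parameters.
n_in, n_hid = 2, 3
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [random.uniform(-1, 1) for _ in range(n_hid)]
b2 = 0.0

def forward(x):
    """Forward pass: tanh hidden units feeding a logistic output unit."""
    h = [math.tanh(b1[j] + sum(W1[j][i] * x[i] for i in range(n_in)))
         for j in range(n_hid)]
    logit = b2 + sum(W2[j] * h[j] for j in range(n_hid))
    return h, 1.0 / (1.0 + math.exp(-logit))

def neg_log_likelihood(points):
    """Cross-entropy loss; minimizing it is maximum likelihood estimation."""
    return -sum(y * math.log(forward(x)[1]) +
                (1 - y) * math.log(1 - forward(x)[1])
                for x, y in points)

# Illustrative XOR-style data: no straight line separates the two classes,
# so a nonlinear decision boundary is required.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
initial_loss = neg_log_likelihood(data)

lr = 0.5
for _ in range(2000):
    for x, y in data:
        h, p = forward(x)
        d_logit = p - y  # dL/dlogit for sigmoid output + cross-entropy loss
        for j in range(n_hid):
            # Backpropagate through the output weight and the tanh unit:
            # tanh'(z) = 1 - tanh(z)^2
            d_pre = d_logit * W2[j] * (1 - h[j] ** 2)
            W2[j] -= lr * d_logit * h[j]
            for i in range(n_in):
                W1[j][i] -= lr * d_pre * x[i]
            b1[j] -= lr * d_pre
        b2 -= lr * d_logit

final_loss = neg_log_likelihood(data)
# Classify a new point by thresholding the predicted probability at 0.5.
predictions = [int(forward(x)[1] > 0.5) for x, _ in data]
```

The per-example update is the "move backward through the network" step: the error signal at the output (`p - y`) is propagated back through the output weights and the tanh derivative to update every weight in the hidden layer.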