Simple perceptrons compute linear functions

Simple perceptrons are feed-forward (neural) networks with no hidden layers. They have only an input layer and an output layer.

The input layer is the set of input nodes, each of which provides the value for one coordinate of the input vector. The output layer is the set of output nodes, each of which gives out the value for one coordinate of the output vector. There are weighted connections between the input layer nodes and the output layer nodes.

Say the input vector is N-dimensional and the output vector is M-dimensional. Then there are N input layer nodes, M output layer nodes, and up to NM weighted connections between the two layers (exactly NM connections for a fully-connected network).

For a simple perceptron, the output O_i from the ith output node is given by:

O_i = g\left(\sum\limits_{k=1}^{N}w_{ik}\xi_k\right),

where w_{ik} is the weight on the connection from the kth input layer node to the ith output layer node, \xi_k is the value given by the kth input layer node, and g is the activation function used by the network.
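This input-output relation can be sketched in NumPy. The weight matrix, input vector, and the choice of tanh as the activation below are illustrative assumptions, not anything fixed by the perceptron model:

```python
import numpy as np

def perceptron_output(W, xi, g=np.tanh):
    """Output of a simple perceptron: O_i = g(sum_k W[i, k] * xi[k]).

    W  : (M, N) weight matrix; W[i, k] is the weight from input node k
         to output node i
    xi : (N,) input vector
    g  : activation function, applied elementwise
    """
    return g(W @ xi)

# Illustrative values: M = 2 output nodes, N = 3 input nodes
W = np.array([[0.5, -1.0, 0.25],
              [1.0,  0.0, -0.5]])
xi = np.array([1.0, 2.0, -1.0])
O = perceptron_output(W, xi)   # 2-dimensional output vector
print(O)
```

The matrix product `W @ xi` computes all M weighted sums at once, one per output node.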

A simple perceptron is used to compute two kinds of functions: classification functions and regression functions.

A classification function is one that takes an input vector and categorizes it into one of several classes. Thus, with an N-dimensional input vector and an M-dimensional output vector, its input is an element of R^N, while its output is an element of A^M, where A is a finite set. As a simple example, each output node may produce only either -1 or 1 as output, in which case A = \{-1, 1\}.

A regression function is one that takes an input vector and produces a real-valued vector. Again, with an N-dimensional input vector and an M-dimensional output vector, its input is an element of R^N, while its output is an element of R^M.

Let us first consider the use of perceptrons for computing classification functions. For simplicity, we assume that each output node gives out only -1 or 1. Further, we assume the output layer has only one output node; equivalently, we analyze each output node in turn. For this case, the input-output equation becomes:

O = sgn(\overrightarrow{w}.\overrightarrow{\xi})

where \overrightarrow{w} = \langle w_1, w_2,\cdots,w_N\rangle is the weight vector, \overrightarrow{\xi} = \langle\xi_1, \xi_2,\cdots,\xi_N\rangle is the input vector, \overrightarrow{w}.\overrightarrow{\xi} = \sum\limits_{k=1}^{N}w_k\xi_k is the dot-product of \overrightarrow{w} and \overrightarrow{\xi}, and sgn(x) is the signum function defined as sgn(x) = \begin{cases} 1 & \mbox{ if }x \ge 0 \\ -1 & \mbox{ if }x < 0\end{cases}. Note that we dropped the subscript i in the above equation because we are considering only one output node. Other variations of the equation are also possible. In particular, instead of just looking at the sign of \overrightarrow{w}.\overrightarrow{\xi}, we might require that |\overrightarrow{w}.\overrightarrow{\xi}| exceed a threshold, to prevent ambiguity in classifying inputs for which \overrightarrow{w}.\overrightarrow{\xi} is close to 0.
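The classification rule above, including the thresholded variation, can be sketched as follows; the weight vector and inputs are hypothetical:

```python
import numpy as np

def sgn(x):
    """Signum as defined above: 1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1

def classify(w, xi, margin=0.0):
    """Classify input xi using weight vector w.

    With margin > 0, inputs whose |w . xi| falls below the margin are
    reported as ambiguous (None) instead of being forced into a class.
    """
    s = np.dot(w, xi)
    if abs(s) < margin:
        return None   # too close to the separating hyperplane
    return sgn(s)

w = np.array([1.0, -2.0])
print(classify(w, np.array([3.0, 1.0])))         # w . xi = 1.0  -> class 1
print(classify(w, np.array([1.0, 0.45]), 0.5))   # |w . xi| = 0.1 -> ambiguous
```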

In a classification problem, the setting is that p input vectors are given, and for each input vector its correct class, 1 or -1, is also given. For example, we may have \{\overrightarrow{\xi^1},\overrightarrow{\xi^2},\cdots,\overrightarrow{\xi^l}\} in class 1 and \{\overrightarrow{\xi^{l+1}},\overrightarrow{\xi^{l+2}},\cdots,\overrightarrow{\xi^p}\} in class -1. The question at hand is whether a simple perceptron can correctly classify the given input vectors (or, in a machine learning scenario, new vectors that may come as input in the future) into the 1 and -1 classes. In other words, can a weight vector \overrightarrow{w} be found such that \overrightarrow{w}.\overrightarrow{\xi} \ge 0 for each \overrightarrow{\xi} = \overrightarrow{\xi^1},\overrightarrow{\xi^2},\cdots,\overrightarrow{\xi^l}, and \overrightarrow{w}.\overrightarrow{\xi} < 0 for each \overrightarrow{\xi} = \overrightarrow{\xi^{l+1}},\overrightarrow{\xi^{l+2}},\cdots,\overrightarrow{\xi^p}?

To answer this, observe that \overrightarrow{w}.\overrightarrow{\xi} = 0 represents a hyperplane in the \overrightarrow{\xi}-space that is orthogonal to \overrightarrow{w} and that passes through the origin. (The hyperplane would be a straight line if \overrightarrow{\xi} is 2-dimensional, and a plane if \overrightarrow{\xi} is 3-dimensional.) Further, observe that all points \overrightarrow{\xi} such that \overrightarrow{w}.\overrightarrow{\xi} > 0 lie on one side of the hyperplane, while all points \overrightarrow{\xi} such that \overrightarrow{w}.\overrightarrow{\xi} < 0 lie on the other side. (Lying on one or the other side of the hyperplane makes visual sense only if \overrightarrow{\xi} is 2-dimensional or 3-dimensional. In higher dimensions, the two sides of the hyperplane are defined purely algebraically as \overrightarrow{w}.\overrightarrow{\xi} > 0 and \overrightarrow{w}.\overrightarrow{\xi} < 0.) Thus, the question of classification boils down to whether a weight vector \overrightarrow{w} can be found such that the hyperplane orthogonal to it and passing through the origin separates the points \{\overrightarrow{\xi^1},\overrightarrow{\xi^2},\cdots,\overrightarrow{\xi^l}\} to one side of it and the points \{\overrightarrow{\xi^{l+1}},\overrightarrow{\xi^{l+2}},\cdots,\overrightarrow{\xi^{p}}\} to the other side. In conclusion, a simple perceptron is capable of classifying a given set of input points if and only if the points are separable by a hyperplane – that is, (since a hyperplane is represented by a linear equation) if and only if the points are linearly separable.
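The separability criterion can be checked directly for a candidate weight vector; the 2-dimensional points below are made up for illustration:

```python
import numpy as np

def separates(w, class_pos, class_neg):
    """Check whether the hyperplane w . xi = 0 puts every point of
    class_pos on the side w . xi >= 0 and every point of class_neg
    on the side w . xi < 0."""
    return (all(np.dot(w, xi) >= 0 for xi in class_pos)
            and all(np.dot(w, xi) < 0 for xi in class_neg))

# Hypothetical 2-D points; the line orthogonal to w = (1, -1) separates them.
class_pos = [np.array([2.0, 1.0]), np.array([3.0, -1.0])]
class_neg = [np.array([1.0, 2.0]), np.array([-2.0, 1.0])]

print(separates(np.array([1.0, -1.0]), class_pos, class_neg))  # this w works
print(separates(np.array([0.0, 1.0]),  class_pos, class_neg))  # this w does not
```

Finding such a w automatically (rather than checking a given one) is what perceptron training algorithms do; here we only verify the geometric criterion.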

Next, let us consider the use of simple perceptrons for computing regression functions. Again, to simplify, we assume that the output layer has only one node. For our first case, we assume that the activation function is a linear function. For this case, the input-output equation becomes:

O = \overrightarrow{w}.\overrightarrow{\xi} = \sum\limits_{k=1}^{N}w_k\xi_k.

Observe that since our activation function is linear (of the form y = mx), the constant m can be absorbed into the weights, so we may as well assume the activation is just the identity function.

In the regression problem scenario, we are again given a collection of input vectors \{\overrightarrow{\xi^1},\overrightarrow{\xi^2},\cdots,\overrightarrow{\xi^p}\}. For each input vector \overrightarrow{\xi^i}, the corresponding output real value \zeta_i is given. The question is whether a weight vector \overrightarrow{w} = \langle w_1,w_2,\cdots,w_N\rangle exists such that \overrightarrow{w}.\overrightarrow{\xi^i} = \zeta_i for each i = 1, 2,\cdots, p – that is, the question is just whether there is a solution to the system of p linear equations. Thus, a simple perceptron with a linear activation function can solve a regression problem (for given input vectors) if and only if weights w_1, w_2,\cdots,w_N exist that solve the system of linear equations capturing the input vector-output value relationships.
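Whether such a weight vector exists can be checked numerically by solving the linear system in the least-squares sense and testing whether the fit is exact; the data below is hypothetical:

```python
import numpy as np

# p = 3 input vectors stacked as rows of X, with target outputs zeta.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
zeta = np.array([2.0, -1.0, 1.0])   # consistent: w = (2, -1) solves all rows

# Least squares finds the best w; if X @ w reproduces zeta exactly, the
# system is solvable, i.e. a simple perceptron with linear activation can
# fit these points exactly.
w, _, _, _ = np.linalg.lstsq(X, zeta, rcond=None)
exact = np.allclose(X @ w, zeta)
print(w, exact)
```

Changing the last target to, say, 5.0 would make the system inconsistent: least squares would still return a best-fit w, but `exact` would be False.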

For the second case, we consider non-linear activation functions. In this case, the input-output equation becomes:

O = g\left(\overrightarrow{w}.\overrightarrow{\xi}\right)

Here, g is a non-linear function such as the sigmoid function. It turns out that if g is an invertible function, then this question is equivalent to the previous case, because we can form an equivalent input-output equation:

g^{-1}(O) = \overrightarrow{w}.\overrightarrow{\xi}

Then we analyze this case just like the case of a linear activation function. Thus, simple perceptrons with non-linear (invertible) activation functions have the same capability for solving regression problems as simple perceptrons with linear activation functions.
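A minimal sketch of this reduction, assuming the sigmoid activation g(x) = 1/(1+e^{-x}), whose inverse is the logit function g^{-1}(y) = \log(y/(1-y)), with hypothetical data:

```python
import numpy as np

def sigmoid(x):
    """g(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(y):
    """Inverse of the sigmoid: g^{-1}(y) = log(y / (1 - y))."""
    return np.log(y / (1.0 - y))

# Hypothetical data: targets zeta in (0, 1) to be matched by sigmoid(w . xi).
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
zeta = np.array([0.8, 0.3])

# Reduce to the linear case: solve X w = g^{-1}(zeta), then verify the fit.
w = np.linalg.solve(X, logit(zeta))
print(np.allclose(sigmoid(X @ w), zeta))
```

Applying g^{-1} to the targets turns the non-linear regression problem into exactly the linear system of the previous case.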

In the above discussion, we took a somewhat roundabout route to analyze when a simple perceptron can map a given collection of input vectors to their corresponding outputs, and when it cannot. We found that whether a simple perceptron can solve the problem for a given set of input vectors and their corresponding output values comes down to a linear criterion. This is not at all surprising, because simple perceptrons are able to compute only linear functions. Given an input vector \overrightarrow{\xi} = \langle \xi_1,\xi_2,\cdots,\xi_N\rangle, a simple perceptron constructed using weights w_1, w_2, \cdots, w_N by definition produces output w_1\xi_1+w_2\xi_2+\cdots+w_N\xi_N, which is linear in \overrightarrow{\xi}.
