# A Simple Example

Let’s begin with a very simple dataset: two curves on a plane. The network will learn to classify points as belonging to one or the other.
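As a minimal sketch of this setup, we can generate a moons-style pair of curves (a hypothetical stand-in for the dataset pictured in the original) and train a tiny one-hidden-layer tanh network on it with plain gradient descent. The architecture and hyperparameters here are illustrative choices, not the ones used in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two interleaved curves in the plane (a hypothetical stand-in dataset).
t = np.linspace(0, np.pi, 200)
curve_a = np.stack([np.cos(t), np.sin(t)], axis=1)
curve_b = np.stack([1 - np.cos(t), 0.5 - np.sin(t)], axis=1)
X = np.concatenate([curve_a, curve_b])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Tiny network: one tanh hidden layer, logistic output.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                    # hidden representation
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))        # class probability
    return h, p.ravel()

for _ in range(3000):
    h, p = forward(X)
    grad_logits = (p - y)[:, None] / len(X)     # d(cross-entropy)/d(logits)
    gW2 = h.T @ grad_logits; gb2 = grad_logits.sum(0)
    grad_h = grad_logits @ W2.T * (1 - h**2)    # backprop through tanh
    gW1 = X.T @ grad_h; gb1 = grad_h.sum(0)
    for param, g in ((W2, gW2), (b2, gb2), (W1, gW1), (b1, gb1)):
        param -= 1.0 * g                        # gradient descent step

_, p = forward(X)
accuracy = ((p > 0.5) == y).mean()
```

Since the two curves are not linearly separable, the hidden tanh layer has to warp the plane before the final linear output can split them, which is exactly the behavior the visualizations in this article explore.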

# Continuous Visualization of Layers

In the approach outlined in the previous section, we learn to understand networks by looking at the representation corresponding to each layer. This gives us a discrete list of representations. A tanh layer, tanh(Wx + b), consists of:

1. A linear transformation by the “weight” matrix W
2. A translation by the vector b
3. Point-wise application of tanh.
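The three steps above can be made concrete in a few lines of NumPy, using a hypothetical 2×2 weight matrix and bias purely for illustration:

```python
import numpy as np

# Hypothetical weights and input, chosen only to make the steps concrete.
W = np.array([[1.0, 0.5],
              [0.0, 1.0]])
b = np.array([0.5, -0.25])
x = np.array([1.0, 2.0])

step1 = W @ x            # 1. linear transformation by the weight matrix W
step2 = step1 + b        # 2. translation by the vector b
step3 = np.tanh(step2)   # 3. point-wise application of tanh

# Equivalent to applying the whole layer at once:
layer = np.tanh(W @ x + b)
```

Viewing the layer as this composition of three simpler maps is what lets us animate each piece of the transformation separately.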

# Topology of tanh Layers

Each layer stretches and squishes space, but it never cuts, breaks, or folds it. Intuitively, we can see that it preserves topological properties. For example, a set will be connected afterwards if it was before (and vice versa).

1. Let’s assume W has a non-zero determinant. Then it is a bijective linear function with a linear inverse, and linear functions are continuous. So multiplying by W is a homeomorphism.
2. Translations are homeomorphisms.
3. tanh (and sigmoid and softplus, but not ReLU) are continuous functions with continuous inverses. They are bijections if we are careful about the domain and range we consider. Applying them pointwise is a homeomorphism.
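Each of these claims can be checked numerically. The sketch below verifies that a random W with non-zero determinant can be inverted, that arctanh undoes tanh, and that ReLU, by contrast, is not injective:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))
assert abs(np.linalg.det(W)) > 1e-8   # non-zero determinant => invertible

x = rng.normal(size=3)

# Multiplying by W is undone by solving the linear system (applying W^-1).
x_back = np.linalg.solve(W, W @ x)

# tanh is a bijection from R onto (-1, 1); arctanh is its continuous inverse.
y = np.tanh(x)
x_from_tanh = np.arctanh(y)

# ReLU is not injective: distinct negative inputs collapse to the same output,
# so it cannot be a homeomorphism.
relu = lambda v: np.maximum(v, 0.0)
assert relu(-1.0) == relu(-2.0) == 0.0
```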

# The Manifold Hypothesis

Is this relevant to real world data sets, like image data? If you take the manifold hypothesis really seriously, I think it bears consideration.

# Links And Homotopy

Another interesting dataset to consider is two linked tori, A and B.

1. The hardest part is the linear transformation. In order for this to be possible, we need W to have a positive determinant. Our premise is that the determinant isn’t zero, and if it is negative we can flip its sign by switching two of the hidden neurons, so we can guarantee the determinant is positive. The space of positive-determinant matrices is path-connected, so there exists p: [0,1] → GLn(R) such that p(0) = Id and p(1) = W. We can continually transition from the identity function to the W transformation with the function x → p(t)x, multiplying x at each point in time t by the continuously transitioning matrix p(t).
2. We can continually transition from the identity function to the b translation with the function x→x+tb.
3. We can continually transition from the identity function to the pointwise use of σ with the function: x→(1−t)x+tσ(x). ∎
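The three interpolations in this proof can be written down directly. The sketch below uses a hypothetical diagonal W with positive determinant, for which the straight-line path (1−t)I + tW happens to stay invertible for all t; for a general positive-determinant W one would need a path chosen inside the path-connected space GLn+(R), as the proof describes:

```python
import numpy as np

rng = np.random.default_rng(2)
b = np.array([1.0, -2.0])
x = rng.normal(size=2)

def translate_path(x, t):
    # Continuous transition from the identity (t=0) to translation by b (t=1).
    return x + t * b

def activation_path(x, t):
    # Continuous transition from the identity (t=0) to pointwise tanh (t=1).
    return (1 - t) * x + t * np.tanh(x)

# Hypothetical positive-determinant W; here the straight-line path through
# matrix space stays invertible, so it serves as p(t).
W = np.array([[2.0, 0.0],
              [0.0, 3.0]])

def linear_path(x, t):
    # p(t) = (1-t)I + tW, with p(0) = Id and p(1) = W.
    return ((1 - t) * np.eye(2) + t * W) @ x
```

At t = 0 each path is the identity and at t = 1 it is the corresponding piece of the layer, which is what the homotopy argument requires.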

# The Easy Way Out

The natural thing for a neural net to do, the very easy route, is to try and pull the manifolds apart naively and stretch the parts that are tangled as thin as possible. While this won’t be anywhere close to a genuine solution, it can achieve relatively high classification accuracy and be a tempting local minimum.

# Better Layers for Manipulating Manifolds?

The more I think about standard neural network layers — that is, with an affine transformation followed by a point-wise activation function — the more disenchanted I feel. It’s hard to imagine that these are really very good for manipulating manifolds.

# K-Nearest Neighbor Layers

I’ve also begun to think that linear separability may be a huge, and possibly unreasonable, amount to demand of a neural network. In some ways, it feels like the natural thing to do would be to use k-nearest neighbors (k-NN). However, k-NN’s success is greatly dependent on the representation it classifies data from, so one needs a good representation before k-NN can work well.
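To make the dependence on representation concrete, here is a minimal k-NN classifier sketch (majority vote over Euclidean nearest neighbors); the function name and interface are illustrative, not from any particular library:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest
    training points, using Euclidean distance."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)       # distance to each training point
        nearest = y_train[np.argsort(dists)[:k]]          # labels of the k closest
        preds.append(np.bincount(nearest.astype(int)).argmax())  # majority vote
    return np.array(preds)
```

If X_train holds a learned representation in which same-class points cluster together, this works well; on raw tangled manifolds the nearest neighbors can easily belong to the wrong class, which is exactly why a good representation has to come first.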

# Conclusion

Topological properties of data, such as links, may make it impossible to linearly separate classes using low-dimensional networks, regardless of depth. Even in cases where it is technically possible, such as spirals, it can be very challenging to do so.
