Demo 2: Neural Network Interpretability

We created a very small neural network with a GELU-GELU-Linear architecture (49 → 3 → 3 → 10) for MNIST digit classification, with a Softmax post-processing step to obtain class probabilities. The network is small enough that we can build its full polytope representation explicitly and analyze it with linear programming.
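
A minimal sketch of such a model in PyTorch, assuming only the layer sizes above (the class name TinyMLP and all other details are ours, not the demo's code):

```python
import torch
import torch.nn as nn

# Minimal sketch of the 49 -> 3 -> 3 -> 10 GELU-GELU-Linear network
# (illustrative only; class and attribute names are not the demo's code).
class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(49, 3),   # a1 = W1 x0 + b1
            nn.GELU(),          # z1 = GELU(a1)
            nn.Linear(3, 3),    # a2 = W2 z1 + b2
            nn.GELU(),          # z2 = GELU(a2)
            nn.Linear(3, 10),   # a3 = W3 z2 + b3 (logits)
        )

    def forward(self, x):
        return self.net(x)      # logits; the polytope reasons on these

model = TinyMLP()
x = torch.rand(1, 49)                         # 7x7 image, flattened, scaled to [0, 1]
probs = torch.softmax(model(x), dim=-1)       # Softmax only as post-processing
```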

Mathematical Formulation

Affine Transformations

aℓ = Wℓ · zℓ-1 + bℓ

Where:
• Wℓ is the weight matrix for layer ℓ
• bℓ is the bias vector for layer ℓ
• zℓ-1 is the output of the previous layer

GELU Activation Function

GELU(x) = x · Φ(x)

Where Φ(x) is the CDF of the standard normal distribution.

Approximation used:
GELU(x) ≈ 0.5x(1 + tanh(√(2/π) · (x + 0.044715x³)))
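
As a quick sanity check, the exact GELU and the tanh approximation can be compared numerically (a small sketch using SciPy's normal CDF; not part of the demo code):

```python
import numpy as np
from scipy.stats import norm

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return x * norm.cdf(x)

def gelu_tanh(x):
    # 0.5 x (1 + tanh(sqrt(2/pi) (x + 0.044715 x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-5, 5, 1001)
print(np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs))))  # gap stays well below 1e-2
```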

Polytope Encoding of GELU

For each neuron with pre-activation a and post-activation z:
• Lower envelope: z ≥ αL · a + βL
• Upper envelope: z ≤ αU · a + βU

Where αL, βL, αU, βU are computed using:
• Interval bounds [L, U] for a, obtained via interval bound propagation (IBP)
• Tight linear envelopes that bound GELU over [L, U]

This replaces the nonlinear GELU with linear constraints!
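
The envelope coefficients can be obtained in several ways. The sketch below takes the chord slope over [L, U] and shifts the intercepts using a dense grid, which yields sound bounds up to grid resolution; the demo may compute tighter analytic envelopes, so treat this as illustrative only:

```python
import numpy as np
from scipy.stats import norm

def gelu(x):
    return x * norm.cdf(x)

def gelu_envelope(L, U, grid=2048):
    """Return (alpha_L, beta_L, alpha_U, beta_U) with
    alpha_L*a + beta_L <= GELU(a) <= alpha_U*a + beta_U for a in [L, U].
    Sketch only: chord slope, intercepts shifted via a dense grid (assumes U > L)."""
    xs = np.linspace(L, U, grid)
    slope = (gelu(U) - gelu(L)) / (U - L)     # common slope for both envelopes
    resid = gelu(xs) - slope * xs
    # Shifting the chord down/up by the extreme residuals gives lower/upper envelopes.
    return slope, resid.min(), slope, resid.max()

alpha_L, beta_L, alpha_U, beta_U = gelu_envelope(-1.5, 2.0)
```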

Softmax (Post-processing)

Softmax converts logits to probabilities:

pi = exp(a3[i]) / Σj exp(a3[j])

Where:
• a3[i] is the logit for class i
• pi is the probability for class i
• Σi pi = 1

Note: The polytope operates on logits (a3), not probabilities. Softmax is only used for final classification and visualization.
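
Softmax preserves the ordering of the logits (exp is increasing and all entries share the same denominator), so argmax(Softmax(a3)) = argmax(a3); this is why the polytope can stay on the logit side. A two-line check:

```python
import numpy as np

logits = np.array([1.2, -0.3, 4.0, 0.5])
probs = np.exp(logits - logits.max())          # subtract max for numerical stability
probs /= probs.sum()
assert np.argmax(probs) == np.argmax(logits)   # Softmax preserves the argmax
```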

Network Forward Pass

x0 = input (49-dimensional, 7×7 flattened)
a1 = W1 · x0 + b1 (shape: 3)
z1 = GELU(a1)
a2 = W2 · z1 + b2 (shape: 3)
z2 = GELU(a2)
a3 = W3 · z2 + b3 (shape: 10, output logits)
Prediction = argmax(a3) [or argmax(Softmax(a3))]
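
The same forward pass spelled out in NumPy; the weights below are random placeholders with the demo's shapes, standing in for the trained parameters:

```python
import numpy as np
from scipy.stats import norm

def gelu(x):
    return x * norm.cdf(x)

rng = np.random.default_rng(0)
# Placeholder parameters with the demo's shapes (the real ones are learned).
W1, b1 = rng.normal(size=(3, 49)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)),  rng.normal(size=3)
W3, b3 = rng.normal(size=(10, 3)), rng.normal(size=10)

x0 = rng.uniform(0, 1, size=49)   # 7x7 input, flattened
a1 = W1 @ x0 + b1                 # shape (3,)
z1 = gelu(a1)
a2 = W2 @ z1 + b2                 # shape (3,)
z2 = gelu(a2)
a3 = W3 @ z2 + b3                 # shape (10,), output logits
pred = int(np.argmax(a3))         # same as argmax of Softmax(a3)
```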

What is the Polytope?

A polytope is a geometric region defined by linear inequalities. For neural network verification, we construct a polytope that over-approximates all possible network behaviors for inputs in a given region.

Variables in our polytope:

  1. Input: x₀ (49-dimensional flattened image)
  2. Pre-activations: a₁ (3), a₂ (3), a₃ (10, the output logits)
  3. Post-activations: z₁ (3), z₂ (3)

Constraints in our polytope:

  1. Input box: x₀[i] ∈ [x̂₀[i] - ε, x̂₀[i] + ε] ∩ [0, 1] for all i, where x̂₀ is the unperturbed input
  2. Affine relations: aℓ = Wℓ · zℓ-1 + bℓ (equality constraints, one per layer)
  3. GELU envelopes: Linear lower/upper bounds on z = GELU(a)

Why it's useful: any output the network can produce for an input in the ε-ball is guaranteed to lie in our polytope. This approach, inspired by Singh et al.'s DeepPoly, lets us analyze network behavior through linear programming. Beyond verification, the representation also supports interpretability: we can probe how individual neurons contribute to predictions by pairing the polytope with suitable objective functions and solving the resulting LPs.

Current demo uses ε = 0.01, giving 49 input-box constraints plus 6 GELU envelope constraints per hidden layer (2 per neuron × 3 neurons), i.e. 12 envelope constraints in total.
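
A sketch of how the whole pipeline (IBP bounds, GELU envelopes, and the LP over the polytope) could be assembled, here using cvxpy and a vectorized version of the grid-based envelopes above. The helper names, the grid-based envelope construction, and the use of cvxpy are our assumptions, not necessarily how the demo implements it:

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

def gelu(x):
    return x * norm.cdf(x)

def affine_bounds(W, b, lo, hi):
    # Interval bound propagation (IBP) through a = W x + b.
    Wp, Wm = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wm @ hi + b, Wp @ hi + Wm @ lo + b

def gelu_interval(lo, hi, grid=512):
    # Interval for z = GELU(a), a in [lo, hi]; sound up to grid resolution.
    ys = gelu(np.linspace(lo, hi, grid))       # shape (grid, n)
    return ys.min(axis=0), ys.max(axis=0)

def gelu_envelopes(lo, hi, grid=512):
    # Vectorized chord-based envelopes: aL*a + bL <= GELU(a) <= aU*a + bU on [lo, hi].
    xs = np.linspace(lo, hi, grid)
    slope = (gelu(hi) - gelu(lo)) / np.maximum(hi - lo, 1e-9)
    resid = gelu(xs) - slope * xs
    return slope, resid.min(axis=0), slope, resid.max(axis=0)

def certify(x_nom, true_cls, eps, params):
    """True if the LP proves class `true_cls` wins for every input in the eps-ball."""
    W1, b1, W2, b2, W3, b3 = params
    lo0, hi0 = np.clip(x_nom - eps, 0, 1), np.clip(x_nom + eps, 0, 1)

    # IBP bounds for the pre-activations, then per-neuron GELU envelopes.
    l1, u1 = affine_bounds(W1, b1, lo0, hi0)
    aL1, bL1, aU1, bU1 = gelu_envelopes(l1, u1)
    zl1, zu1 = gelu_interval(l1, u1)
    l2, u2 = affine_bounds(W2, b2, zl1, zu1)
    aL2, bL2, aU2, bU2 = gelu_envelopes(l2, u2)

    x0, a1, z1 = cp.Variable(49), cp.Variable(3), cp.Variable(3)
    a2, z2, a3 = cp.Variable(3), cp.Variable(3), cp.Variable(10)
    cons = [
        x0 >= lo0, x0 <= hi0,                                        # input box
        a1 == W1 @ x0 + b1, a2 == W2 @ z1 + b2, a3 == W3 @ z2 + b3,  # affine relations
        z1 >= cp.multiply(aL1, a1) + bL1, z1 <= cp.multiply(aU1, a1) + bU1,
        z2 >= cp.multiply(aL2, a2) + bL2, z2 <= cp.multiply(aU2, a2) + bU2,
    ]
    # Robust iff the true logit beats every other logit over the whole polytope.
    worst = [cp.Problem(cp.Minimize(a3[true_cls] - a3[j]), cons).solve()
             for j in range(10) if j != true_cls]
    return min(worst) > 0
```

Swapping the objective (e.g. maximizing a single hidden activation or a logit difference) over the same constraint set is what turns the polytope into an interpretability probe rather than just a verifier.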

Interactive Network Visualization

[Interactive visualization: node color encodes activation level (low → high); edge color indicates weight sign (positive vs. negative). Hover over elements for details; click a hidden neuron to see the pattern it detects for the currently selected digit.]

Mechanistic Interpretability

By stepping through the NN, we can understand how the network composes features to make predictions. Each hidden neuron learns interpretable patterns that combine to form digit classifiers. We formally verify these properties with the polytope.

Click on the hidden neurons in the visualization above to see what patterns they detect. The 6 hidden neurons (3 in each layer) learn distinct visual features:

Example: How the network recognizes Digit 0

Digit 0 ∝ (++ Frame) + (− Spine) + (− Belt)
• Frame (strong positive weight): must act like a container
• Spine (negative weight): must have an empty center
• Belt (negative weight): must have an empty middle

[Figure: mechanistic trace for digit 0]

The dashboard shows how Layer 1 neurons detect basic patterns (Frame, Spine, Belt), and Layer 2 neurons combine them with learned weights to produce the final digit 0 logit. The network learns that digit 0 should strongly activate the "Frame" detector while avoiding activation of "Spine" and "Belt" detectors (which would indicate filled regions).
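
A hedged sketch of the kind of probe behind such a trace: decomposing the digit-0 logit into per-neuron contributions. The weights and activations below are random placeholders with the demo's shapes; in the demo they come from the trained network and a specific digit-0 input:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder weights/activations with the demo's shapes (illustrative only).
W2, W3 = rng.normal(size=(3, 3)), rng.normal(size=(10, 3))
b3 = rng.normal(size=10)
z1, z2 = rng.uniform(0, 1, size=3), rng.uniform(0, 1, size=3)

# Contribution of each layer-2 neuron to the digit-0 logit:
#   a3[0] = sum_k W3[0, k] * z2[k] + b3[0]
contrib_l2 = W3[0] * z2
print("digit-0 logit terms:", contrib_l2, "+ bias", b3[0])

# One step further back: how the layer-1 detectors (Frame, Spine, Belt in the
# demo's labeling) feed each layer-2 pre-activation; row k = terms of a2[k].
contrib_l1 = W2 * z1
print("layer-1 -> layer-2 terms:\n", contrib_l1)
```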

Robustness Analysis

The plot below shows robustness rates (the percentage of test samples for which the LP yields the correct prediction for every input in the ε-ball) across different perturbation sizes, computed over 600 test samples:

[Figure: Robustness vs. Perturbation Size]

Key findings: The LP maintains high accuracy for small perturbations (ε ≤ 0.02), with sensitivity varying across digit classes: digit 1 remains highly robust even at ε = 0.02, while digit 4 degrades more quickly. More interestingly, the LP also appears to be a good MNIST classifier in its own right.
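
For completeness, a sketch of how such robustness rates could be tabulated, assuming an LP-based `certify(x, label, eps, params)` routine like the one sketched earlier (the function names and loop structure are ours, not the demo's code):

```python
import numpy as np
from collections import defaultdict

def robustness_rates(X, y, eps_values, params, certify):
    """Fraction of samples certified robust, per class and per epsilon.

    X: (N, 49) test images in [0, 1]; y: (N,) integer labels.
    `certify` is an LP-based verifier like the sketch above.
    """
    rates = {}
    for eps in eps_values:
        hits, totals = defaultdict(int), defaultdict(int)
        for x, label in zip(X, y):
            totals[label] += 1
            if certify(x, label, eps, params):
                hits[label] += 1
        rates[eps] = {c: hits[c] / totals[c] for c in totals}
    return rates

# e.g. rates = robustness_rates(X_test[:600], y_test[:600],
#                               [0.005, 0.01, 0.02], params, certify)
```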