A Modest Proposal: Statistical Token Prediction Is No Replacement for Syntactic Construction

by Stephen Crowley

October 25, 2024

1. Current Generative Pre-trained Transformer (GPT) Architecture

Given a vocabulary \(V\) with \(|V| = v\), current models map each token sequence to a matrix of embeddings:

\(\displaystyle (t_1, \ldots, t_n) \mapsto X \in \mathbb{R}^{n \times d}\)

These pass through layers of attention transformations of the form:

\(\displaystyle \text{softmax} (QK^T / \sqrt{d}) V\)

where \(Q = XW_Q\), \(K = XW_K\), \(V = XW_V\) (here \(V\) denotes the value matrix, not the vocabulary).
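
As a concrete reading of this map, here is a minimal numpy sketch of a single attention head; the shapes, the random weights, and the absence of masking and multiple heads are simplifying assumptions, not a faithful GPT implementation.

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """softmax(Q K^T / sqrt(d)) V with Q = X W_Q, K = X W_K, V = X W_V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) token-pair scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (n, d) mixture of values

# Illustrative shapes: n = 4 tokens, d = 8 embedding dimensions.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(attention(X, W_Q, W_K, W_V).shape)  # (4, 8)
```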

Training optimizes the next-token log-likelihood:

\(\displaystyle \max_{\theta} \sum_n \log P (t_{n + 1} \mid t_1, \ldots, t_n ; \theta)\)
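
The same objective in code, assuming only that `logits_fn` (a hypothetical stand-in for the transformer stack) maps a token prefix to a length-\(v\) vector of logits:

```python
import numpy as np

def log_likelihood(tokens, logits_fn):
    """Sum over positions of log P(t_{n+1} | t_1, ..., t_n; theta)."""
    total = 0.0
    for n in range(1, len(tokens)):
        logits = logits_fn(tokens[:n])       # one score per vocabulary item
        z = logits - logits.max()            # stable log-softmax
        log_probs = z - np.log(np.exp(z).sum())
        total += log_probs[tokens[n]]        # credit for the observed token
    return total  # training adjusts theta to maximize this
```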

2. Required Reformulation

Instead, construct Abstract Syntax Trees (ASTs) in which each node \(\eta\) carries a syntactic category:

\(\displaystyle \eta \in \{ \text{Noun}, \text{Verb}, \text{Adjective}, \text{Conjunction}, \ldots\}\)

With composition rules \(R\) such that for nodes \(\eta_1, \eta_2\):

\(\displaystyle R (\eta_1, \eta_2) = \left\{ \begin{array}{ll} \text{valid\_subtree} & \text{if grammatically valid}\\ \emptyset & \text{otherwise} \end{array} \right.\)

And logical constraints \(L\) such that for any subtree \(T\):

\(\displaystyle L (T) = \left\{ \begin{array}{ll} T & \text{if logically consistent}\\ \emptyset & \text{if contradictory} \end{array} \right.\)
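
A minimal sketch of \(R\) and \(L\) as partial functions, with `None` standing in for \(\emptyset\); the node categories, the composition table, and the consistency predicate are illustrative assumptions, not a real grammar:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Node:
    category: str          # "Noun", "Verb", "Adjective", "Sentence", ...
    children: tuple = ()

# Toy fragment of the grammar: which category pairs compose, and into what.
COMPOSABLE = {
    ("Adjective", "Noun"): "Noun",   # e.g. "red ball" stays nominal
    ("Noun", "Verb"): "Sentence",
}

def R(n1: Node, n2: Node) -> Optional[Node]:
    """Composition rule: a valid subtree if grammatically valid, else None."""
    cat = COMPOSABLE.get((n1.category, n2.category))
    return Node(cat, (n1, n2)) if cat else None

def L(T: Optional[Node], consistent: Callable[[Node], bool]) -> Optional[Node]:
    """Logical constraint: pass T through only if the (assumed external)
    consistency predicate accepts it; otherwise reject with None."""
    return T if T is not None and consistent(T) else None
```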

3. Parsing and Generation

Input text \(s\) maps to a valid AST \(T\) or to an error \(E\):

\(\displaystyle \text{parse} (s) = \left\{ \begin{array}{ll} T & \text{if a valid AST exists}\\ E (\text{closest\_valid}, \text{violation}) & \text{otherwise} \end{array} \right.\)
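
Reusing `Node` and `R` from the sketch above, parsing could look like the following; the whitespace tokenizer, the `word_category` lookup, and the greedy left-to-right reduction are stand-in assumptions for a real parser:

```python
class ParseError(Exception):
    """E(closest_valid, violation): report the nearest valid fragment and
    the rule that failed instead of emitting an invalid tree."""
    def __init__(self, closest_valid, violation):
        super().__init__(violation)
        self.closest_valid = closest_valid
        self.violation = violation

def parse(s: str, word_category) -> Node:
    """Return a valid AST for s or raise ParseError with diagnostics.
    word_category is an assumed token -> category lookup."""
    stack = [Node(word_category(w)) for w in s.split()]
    while len(stack) > 1:
        combined = R(stack[0], stack[1])   # greedily compose from the left
        if combined is None:
            raise ParseError(
                closest_valid=tuple(stack),
                violation=f"no rule composes {stack[0].category} + {stack[1].category}",
            )
        stack[:2] = [combined]             # reduce the pair to one subtree
    return stack[0]
```

With a lookup sending "red" to Adjective, "ball" to Noun, and "bounces" to Verb, `parse("red ball bounces", lookup)` reduces to a single Sentence node, while an uncomposable word order raises `ParseError` rather than producing output.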

Generation must traverse only valid AST constructions, writing \(R(T) \neq \emptyset\) to mean that every composition inside \(T\) is licensed by \(R\):

\(\displaystyle \text{generate} (c) = \{T \mid R (T) \neq \emptyset \wedge L (T) \neq \emptyset\}\)

where \(c\) is the context/prompt.
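
A sketch of generation as filtered enumeration, again reusing the toy definitions above; `propose_trees`, the source of candidate ASTs for a context \(c\), is an assumed stand-in for whatever search procedure explores the grammar:

```python
def grammatical(T: Node) -> bool:
    """R(T) != empty set, read as: every internal composition is licensed."""
    if not T.children:
        return True
    left, right = T.children
    composed = R(left, right)
    return (composed is not None and composed.category == T.category
            and grammatical(left) and grammatical(right))

def generate(c, propose_trees, consistent):
    """Yield only trees with R(T) != empty set and L(T) != empty set;
    invalid candidates are filtered out rather than ever emitted."""
    for T in propose_trees(c):
        if grammatical(T) and L(T, consistent) is not None:
            yield T
```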

4. Why Current GPT Fails

The statistical model:

\(\displaystyle \text{softmax} (QK^T / \sqrt{d}) V\)

has no inherent conception of:

  • Syntactic validity

  • Logical consistency

  • Conceptual preservation

It merely maximizes:

\(\displaystyle P (t_{n + 1} \mid t_1, \ldots, t_n)\)

based on training patterns, with no guaranteed constraints on the factorized sequence probability (see the sampling sketch after the list below):

\(\displaystyle \prod_{i = 1}^n P (t_i \mid t_1, \ldots, t_{i - 1})\)

This allows generation of:

  • Grammatically invalid sequences

  • Logically contradictory statements

  • Conceptually inconsistent responses
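
A sketch of why: ancestral sampling from the factorization above consults only the model's probabilities, never a grammar rule \(R\) or a constraint \(L\), so any token the model assigns mass is reachable at every step (`logits_fn` as in the earlier sketch):

```python
import numpy as np

def sample_unconstrained(logits_fn, bos, steps, rng):
    """Draw t_i ~ P(t_i | t_1, ..., t_{i-1}) one token at a time.
    No rule R or constraint L appears anywhere in this loop."""
    tokens = [bos]
    for _ in range(steps):
        logits = logits_fn(tokens)
        z = np.exp(logits - logits.max())
        p = z / z.sum()                          # softmax over the vocabulary
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens
```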

5. Conclusion

The fundamental flaw is attempting to learn syntax and logic from data rather than building them into the architecture. An AST-based approach with formal grammar rules and logical constraints must replace unconstrained statistical token prediction.
