For folks who aren’t sure how to interpret this, what we’re looking at here is early work establishing an upper bound on the complexity of a problem that a model can handle based on its size. Research like this is absolutely essential for determining whether these absurdly large models are actually going to achieve the results people have already ascribed to them on any sort of consistent basis. Previous work on monosemanticity and superposition are relevant here, particularly with regards to unpacking where and when these errors will occur.
I'm not sure this work accomplishes that. Sure, it builds up on previous work that showed that a transformer can be simulated by a TC^0^ family. However, the limits of this fact are not clear. The paper even admits as such
Our result on the limitations of T-LLMs as general learners comes from Proposition 1 and Theorem 2. On the one hand, T-LLMs are within the TC^0^ complexity family; on the other hand, general learners require at least as hard as P/ poly-complete. In the field of circuit theory, it is known that TC^0^ is a subset of P/ poly and commonly believed that TC^0^ is a strict subset of P/ poly, though the strictness is still an open problem to be proved.
I believe this is one of the weakest points of the paper, as it bases all of its reasoning on an unproven theorem. And you can implement many things with a TC^0^ algorithm, addition, multiplication, basic logic, heck you can even make transformers.
There still is something that bothers me. Why did it define general learning as being at least a universal circuit for the set of all circuits within a polynomial size ? Why this restriction ? I tried googling general learner and universal circuit and only came up with this paper.
While searching, I found that this paper was rejected, you can find the reviews here : https://openreview.net/forum?id=e5lR6tySR7
If you are searching for a paper on the limits of T-LLMs the paper What Algorithms can Transformers Learn? A Study in Length Generalization may prove more informative. https://arxiv.org/pdf/2310.16028.pdf It explains why transformers are so bad at addition.
Here is the key part of their abstract :
Specifically, we leverage RASP (Weiss et al., 2021)— a programming language designed for the computational model of a Transformer— and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths.
There are two issues with large prompts. One is linked to the current language technology, were the computation time and memory usage scale badly with prompt size. This is being solved by projects such as RWKV or mamba, but these remain unproven at large sizes (more than 100 billion parameters). Somebody will have to spend some millions to train one.
The other issue will probably be harder to solve. There is less high quality long context training data. Most datasets were created for small context models.