Interesting read, since the author brought up future work on large corpora, allow me to share our experience.
Our work "A partition cover approach to tokenization" https://openreview.net/forum?id=prGyR9id7X, explored the tokenization problem as a vertex cover problem with an Integer Program formulation, which can be relaxed further to a Weighted Max Cover problem with an IP formulation and polynomial-time greedy algorithm GreedTOK.
From the comparisons in our work using algorithmic tokenizers (on millions of unique word/token candidates), as the token vocabulary size increases (arguably necessary for decent compression of large corpus), the compression gains eventually becomes marginal and 'weaker' tokenizer algos catches up (smaller delta compared to larger delta at lower token vocab size). I will be pleasantly surprised if LP solvers can maintain a non-decreasing advantage as token vocab size increases.
Also, instead of comparing solely on compression, another relevant consideration is on the quality of the token set. For example, an LP solution could produce a token set of superior compression but selects mostly whole words, which in turn sacrifices the tokenization of rare/unseen word vocabulary. This latter comparison is much harder/subjective to quantify.
As we had the greedy algo, we did not look at the LP angle in-depth as we thought that (hundreds of) millions of variables and constraints was crazy!
This is an interesting approach with integer programming and then using an explicit solver. It’s probably very slow, but you only have to run this once and it produces the mathematically perfect result.
In the past, I got good results with trying to reduce the variance in entropy in-between tokens, which you can implement very easily by starting with each single character as its own token and then doing a greedy merge of the most numerous outlier token pairs in a loop until you reach your desired token count. https://arxiv.org/abs/2206.12693
Interesting read, since the author brought up future work on large corpora, allow me to share our experience.
Our work "A partition cover approach to tokenization" https://openreview.net/forum?id=prGyR9id7X, explored the tokenization problem as a vertex cover problem with an Integer Program formulation, which can be relaxed further to a Weighted Max Cover problem with an IP formulation and polynomial-time greedy algorithm GreedTOK.
From the comparisons in our work using algorithmic tokenizers (on millions of unique word/token candidates), as the token vocabulary size increases (arguably necessary for decent compression of large corpus), the compression gains eventually becomes marginal and 'weaker' tokenizer algos catches up (smaller delta compared to larger delta at lower token vocab size). I will be pleasantly surprised if LP solvers can maintain a non-decreasing advantage as token vocab size increases.
Also, instead of comparing solely on compression, another relevant consideration is on the quality of the token set. For example, an LP solution could produce a token set of superior compression but selects mostly whole words, which in turn sacrifices the tokenization of rare/unseen word vocabulary. This latter comparison is much harder/subjective to quantify.
As we had the greedy algo, we did not look at the LP angle in-depth as we thought that (hundreds of) millions of variables and constraints was crazy!
This is an interesting approach with integer programming and then using an explicit solver. It’s probably very slow, but you only have to run this once and it produces the mathematically perfect result.
In the past, I got good results with trying to reduce the variance in entropy in-between tokens, which you can implement very easily by starting with each single character as its own token and then doing a greedy merge of the most numerous outlier token pairs in a loop until you reach your desired token count. https://arxiv.org/abs/2206.12693
[dead]