"The code is semi-vibe-coded with whatever LLM I had with VS-Code and PI (OpenRouter models)."
I appreciate the honesty, but now there's no journey, and that's what I'm interested in. I can ask an LLM myself.
I get what you are saying, but at the same time, I was bored on a Saturday and 'vibe coded' a small VR game. Nothing special, but I had the LLM throw down a structure, and then I walked through it, thinking about why the code was placed where it was and how different things were handled. It was basically exactly like my job: jump into some okay-working legacy app with code I have never seen, try to get my brain around it, then tweak things until the app performs the way I want.
> These samples have very good scores overall, but they are useless. I am guessing it's not English text... I counted a few hundred examples mostly from LOC-PD and other few hundred in the OTA datasets. Imagine if I feed that crap to my LLM, what will it learn?
I'm pretty sure it's real text in Welsh. There might be typos from OCR, but that's what the language really looks like. I don't speak it, but it's easy to recognize.
Yeah, that seems like an important distinction
It looks like ROT13 text to me, I hope it's not Welsh. Don't want to offend anyone if that's their actual language :)
It's actually Welsh, and the funny thing is that one of the sentences in the example "gibberish" text (although with some further OCR errors) means:
"It will be easy for the knowledgeable to fix the few errors that remain [in the text]." ("Bydd yn rwydd iawn i'r cyfarwydd ddiwygio'r ychydig.")
Which is exactly what the OP is doing.
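For what it's worth, the ROT13 guess is easy to rule out mechanically. Below is a minimal sketch, assuming Python's standard `codecs` module; the helper name `rot13` is made up for illustration, and the sample is the Welsh sentence quoted above. ROT13 is its own inverse, so if the text were ROT13-encoded English, applying it once should give readable English.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; all other characters pass through."""
    return codecs.encode(text, "rot13")


# Welsh sentence quoted in the thread (hypothetical stand-in for a dataset sample).
sample = "Bydd yn rwydd iawn i'r cyfarwydd ddiwygio'r ychydig."

decoded = rot13(sample)
print(decoded)

# Applying ROT13 twice returns the original text.
print(rot13(decoded) == sample)
```

If the first printed line is still unreadable, the sample was not ROT13-encoded English. This only tests one hypothesis; it does not identify the language.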
I am creating my tiny Llama 340M base model from scratch. If you're curious about the steps, challenges and cost, read on. I am still working on the instruct model.
I feel like this is the true frontier: making smaller models that can do more than their predecessors. If we can crack this space to where you can get reasonable outputs from "mediocre hardware", it would be worthwhile, even if it's somewhat inferior to frontier models. We can't forget that not long ago, frontier models were nowhere near as good as they are today, and tomorrow's models will likely be even better.
That's exactly what I had in mind. When I started this, I kept coming back to one thought: "Can this model size actually generate logical English text?" I played with a few different models of the same size and was really depressed to see how bad they were. But then I discovered more and more tiny models, and LaMini-125M, LaMini-256M, nanowhale-100m, and SmolLM2-135M-Instruct are very decent. So I decided to give it a try.
Qwen seems to be going in a good direction, with hundreds of experts in their MoE models: extremely low active-weight counts while still performing quite admirably. I look forward to models with many, many more experts, to the point where anyone with enough random access can generate hundreds or thousands of tokens per second. Because right now, 80–120 t/s is pretty slow.
There are certain things you can only truly learn by doing. I remember doing Linux From Scratch over a weekend, and I still understand Linux at that depth to this day.
Thanks for the writeup. A more granular followup would be cool too.
"A more granular followup would be cool too"
Do you mind expanding on this? More granular in what way? What would you like to know that is missing from the post?
You may not build your daily system that way afterwards, but the mental model sticks
Except in this case he vibe-coded it
Nice project. I’m curious to see how it writes after instruct.
Instead of always trying to make models more current and general, there may be value in making them deliberately narrow, historically constrained, and weird in a well-defined way.