Look at the vocab.json of nearly any model. In CLIP it is organized at the end: the last ~2200 tokens of the SD1 CLIP vocabulary, which is used by all diffusion models. The bulk of the code sits in that last block, but in total there are 2337 tokens carrying code in a brainfuck-style language written in the extended Latin character set. Anyone who looks at this will spot that these are not part of any language's words or fragments, and that there are obvious functions present. If you have ComfyUI, the file is at ./comfy/sd1_tokenizer/vocab.json.
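If you would rather filter it with a script than scroll the raw file, something like the sketch below works. This is only a minimal illustration, assuming vocab.json is a flat token-to-id mapping (which is how the SD1 CLIP tokenizer ships it); adjust the path to your install.

    import json

    # Minimal sketch: list the vocabulary entries that contain non-ASCII
    # (extended Latin) characters. Assumes a flat {token_string: token_id}
    # mapping; adjust the path for your ComfyUI install.
    with open("./comfy/sd1_tokenizer/vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)

    # Sort by token id so the final block of the vocabulary comes last.
    by_id = sorted(vocab.items(), key=lambda kv: kv[1])

    # Keep only tokens containing characters outside plain ASCII.
    non_ascii = [(tok, idx) for tok, idx in by_id if any(ord(c) > 127 for c in tok)]

    print(f"{len(non_ascii)} non-ASCII tokens out of {len(by_id)} total")
    for tok, idx in non_ascii[-40:]:      # peek at the tail end of the vocabulary
        print(idx, repr(tok))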
The most difficult are models like T5xxl, which have this code precompiled and embedded in the vocabulary.
I am playing with Qwen 3 right now, which has the much larger world model of OpenAI QKV hidden-layer alignment. Unlike CLIP, Qwen's vocabulary starts as 2.8 MB of compressed JSON, and the extended Latin is intermixed throughout. In ComfyUI the file is at ./comfy/text_encoders/qwen25_tokenizer/vocab.json. You will need the jq package (or some other tool) to make the JSON readable. If you have jq and ripgrep (rg) on your system, try:

    cat ./vocab.json | jq | LC_ALL=C rg '[[:^ascii:]]'

This one is much less organized than CLIP, but the model has 103,378 lines of code using the same character set. I have reverse engineered most of the tokens used to create this code, and I modify the vocabulary with scripts to alter behavior. Explaining the heuristics of how I figured out the complexity of this structure is fucking hard in itself, never mind actually sorting out any meaning. It really sucks when people are assholes about that.
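To give a concrete idea of what I mean by modifying the vocabulary with scripts, here is a simplified illustration, not my actual tooling. The file paths, the selection rule, and the placeholder naming are all stand-ins for the example.

    import json

    # Simplified illustration: list the extended-character entries in the Qwen
    # vocab and write a modified copy with a few of them neutralized.
    # SRC/DST paths, the selection rule, and the placeholder naming are
    # stand-ins. Note the tokenizer also relies on merges.txt, so editing
    # vocab.json alone can break tokenization for the entries you touch.
    SRC = "./comfy/text_encoders/qwen25_tokenizer/vocab.json"
    DST = "./vocab.modified.json"

    with open(SRC, encoding="utf-8") as f:
        vocab = json.load(f)             # flat {token_string: token_id} mapping

    non_ascii = [tok for tok in vocab if any(ord(c) > 127 for c in tok)]
    print(f"{len(non_ascii)} of {len(vocab)} entries contain non-ASCII characters")

    # Example edit: rename a handful of those entries to inert placeholders
    # while keeping their ids stable, leaving the rest of the vocab untouched.
    for tok in non_ascii[:10]:           # stand-in selection rule
        idx = vocab.pop(tok)
        vocab[f"<unused_{idx}>"] = idx

    with open(DST, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)

Keep a backup of the original before pointing the tokenizer at a modified copy.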