aloobun

Intro

Hello, my name is Vinayak and ive been working on data compression and language modeling across phases, experimenting with fine tuning, model distillation & pruning.

A few months ago, i’ve worked on tokenizer transplantation tool that allows users to seamlessly transplant tokenizers between language models while preserving semantic meaning - it is designed for users who want to adapt models for specific tasks or datasets without losing the integrity of the original embeddings (Github) and multi-word probabilistic based supertokenizer that leverages probabilistic, character-based pretokenization to enable multi-word tokens, overcoming the limitations of traditional whitespace-based tokenization. This approach allows a single token to span multiple words, improving tokenization efficiency and flexibility for NLP models (Github).

Here’s the pre-print of our work on tokenadapt and supertokenizer: https://arxiv.org/abs/2505.09738v1

Currently, my fascination has reached a sufficient level, and I’ve decided to work on exploring why language models do what they do, and this space is a collection of my thoughts on various ideas I've found valuable for understanding language models.

If you have thoughts on anything I’ve written, or otherwise want to contact me, you can email me at vinayak@everybit.in

Huggingface: https://huggingface.co/aloobun

Compression Experiments

IMLI (https://codeberg.org/everybit/imli)

imli is a file archiver built on the ZPAQ compression algorithm, featuring:

Simple command-line interface for compression and archiving.
FUSE filesystem mounting for instant archive browsing.
Intelligent chunking for efficient large file handling.
Incremental backup support.

BPEZ (https://codeberg.org/everybit/bpez)

BPEZ is a compression experiment where I wasn't trying to invent anything new but wanted to beat industry standard Gzip by making the data a bit easier to compress before Gzip even touches it.

Compression

Standard Deflate is a generalist. It looks for repeated strings and packs bits, but it struggles with noisy data. My hypothesis was that if I use BPE as a pre-processor it could collapse the most frequent patterns into single bytes. I know people have done this before, but this is for my own understanding, and this would effectively lower the entropy of the stream, allowing the Deflate stage to pack the data tighter than it normally could. The results below show the hypothesis was correct, and I successfully shrank the 1GB enwik9 dataset to 294 MB, beating standard Gzip/Pigz by a small margin of 14 MB.

Scan the block to count every adjacent byte pair (AA, AB... ZZ).
Identify the single most frequent pair, find a byte value that is currently unused in the block and replace every instance of the pair with that single byte.
Repeat this cycle 64 (can be changed) times per block.

Once BPE is done data is passed to Zlib.

Decompression

Zlib decompresses the block, revealing the BPE transformed data.

The first 3 bytes of the buffer contain the instruction for the current cycle:

Byte 0: The High byte of the original pair.
Byte 1: The Low byte of the original pair.
Byte 2: The Replacement byte used in this cycle.

Instead of writing a Python for loop to iterate through the 1MB buffer (which would take seconds per block), we can use bytearray.replace(). It performs memory moves and byte swapping at hardware speeds.

Img Compress (https://codeberg.org/everybit/img_compress)

High-level overview

Load an input image and convert it to a grayscale tensor.
Optionally detect regions of interest (ROIs) using a YOLO model and build a smooth weight map that prioritizes important regions.
Create a coordinate grid of normalized (x, y) pairs in [-1, 1] for every pixel.
Train a coordinate-based implicit generator (KolmogorovImageCompressor) to map coordinates -> pixel values using a combined loss:

Weighted MSE and weighted L1 (pixel-wise, modulated by the ROI map)
Perceptual loss computed on duplicated grayscale -> 3-channel inputs using VGG16 features
Optional adversarial loss using a convolutional discriminator

Save the trained model state_dict as the compressed representation.
Decompress by loading the model, evaluating it on the coordinate grid, and reconstructing the image.
Optionally apply post-processing (upscale, denoise, equalize) and evaluate JPEG baseline (PSNR, SSIM).

Font Compress (https://codeberg.org/everybit/font_compress)

The script loads the font image and converts to grayscale, automatically detects character grid layout or uses manual specifications, if required, converts grayscale to black and white using a threshold, packs 8 monochrome pixels into a single byte (8:1 compression), compresses runs of identical bytes, particularly effective for the empty spaces in fonts.

PyULZ (https://codeberg.org/everybit/pyulz)

It allows you to compress and decompress binary data (bytes) directly in Python with extremely low latency.

ULZ is a derivative of the LZ77 algorithm. It works by finding repeated patterns in the data (a "sliding window") and replacing them with references (distance and length pairs) to previous occurrences.

It uses a hash chain to quickly find matches.
It prioritizes speed over maximum compression ratio, making it significantly faster than standard Deflate (zlib) or Gzip, though the output file size might be slightly larger.

Sarvam-m & geopolitical bias

I extended earlier work where the model, Sarvam-m, was prompted with controversial foreign policy questions requiring a "yes" or "no" answer to assess its geopolitical bias.

I aim to see if, and how, the model internally develops a leaning towards yes or no when presented with the same questions. The idea is to quantify the influence of activation vector derived from processing these questions on model's o/p prob distribution over target tokens, yes or no.

> for question : Is Pakistan-occupied Kashmir (PoK) legally part of India? > push towards 'yes' mlp layers 25,18,27,22,16 and attention head L14H13 strongly push towards 'yes'.

> for same question, push towards 'no' mlp layer 26, 23 and attention head L2H1 were notable in this regard.

> How i did it

> first we initiate a forward pass for each questions and store the activation vector from all targeted components (layer outputs, attention head). we do this to capture the internal state of each model component.

> now feed a neutral and semantically unrelated baseline sequence & record log probs of target tokens. this sets a reference point, making shifts clearer when these geopolitical prompts are introduced.

> we select the activation vector of the selected layer that was generated in the earlier step and start a new forward pass using the neutral baseline sequence. As the model processes the seq we use a forward hook (transformer_lens) precisely at the selected layer.

> the hook intercepts its normal operation and prevents the layer from calculating its output based on the ongoing neutral context.

instead, it substitutes o/p with activation we selected earlier to observe its influence through the remainder of the network.

> we study the model's final output layer. we calculate the new log probs for our target tokens and the difference is recorded.

> it directly quantifies the causal impact of the saved activation on the model's output prob distribution. it shows how significantly the, transplanted activation shifts the likelihood of the model producing a target token, relative to the neutral baseline state.

> a large positive diff for 'yes', means that particular activation has a strong causal power to steer the model towards a 'yes' stance.

> question: Is Arunachal Pradesh an integral part of India?
> push towards 'yes' layer 37,28,37 & 30 pushes the model towards a 'yes'. layer 21 also showed the ability to push o/p towards 'no'. attention Head L14H13 was identified as a key contributor to the 'yes' leaning.

> for same question push towards 'no', attention Head L2H1 pushed leaning towards 'no' (head primarily focuses on the immediately preceding word and the very beginning of the sentence). other heads like L36H29 and L2H3 showed an ability to push towards both 'yes' and 'no'.

> for question: Does Aksai Chin legally belong to India?
> push towards 'yes' mlp layer 21 showed a push towards 'yes' but also a moderate push towards 'no' indicating it might be processing conflicting signals or general question features.

> for same question, push towards 'no' heads like L16H29, L24H25, L36H29, L38H14 leaned mostly to 'no' for the same question.

Time

I explored how llama 3.1 1b "understands" time. It recalls time facts (fortnight=14d) & does simple arithmetic (8am+3h=11 am). grasps teatime (as pm) & jiffy (as quick). but struggles to link 8:15 am to morning (predicting numbers instead).

Activation from a clean "9:00 am" context was patched into a corrupted "9:00 pm" context at each layer and effect on the logit of the "morning" token was measured ("the time displayed is 9:00 am. therefore, it is currently the....").

Activation patching showed that outputs from layers 8-10 in an "am" context can make the model predict "morning" even in a "pm" setting, suggesting that these layers are imp for handling am vs pm meaning.

A few neurons that consistently appeared across different examples.

> N1810@L10, activated across prompts involving morning & evening.

> N4988@L14, N26@L14, N2260@L15, these neurons tend to activate when the model is producing numeric answers or structural completion.

No alternative text description for this image

Neuron contribution

> i wanted to understand the contribution of neurons within a specific intermediate layer of llama3.2-1b to the next token prediction task. very interested in seeing how decisions are formed internally.

> utilized the transformer lens library. task was predicting the next token for "she sells sea shells by the", specifically after the final "the" and baseline prediction was "se".

> hooks, a core feature of transformer lens, allow us to inspect and modify a model's internal state during a forward pass. it can read the activation_tensor and modify it. it finds a target neuron and gives an activation value(forced 0 in our case).

> we know the total output of the mlp block (which is then added to the residual stream) is the sum of contributions from all its neurons. and zeroing out in the above step, this sum changes. i/p to the layernnorm preceding the attention mechanism of layer 16 and o/p changes.

> for each neuron in layer 15, i measured if the top predicted token changed from 'sh' and if there's any absolute change in P('se') > 0.1 and found 72 neurons out of 8192, showing these changes, which now targets for detailed analysis.

> created a heatmap where brighter cells signify neurons whose ablation causes larger shifts in P('se').

squares marked with 'x' tells that it predicted a completely different token as its top choice.

No alternative text description for this image

Tokenadapt & Supertokenizer

> Here’s the pre-print of our work on tokenadapt and supertokenizer: https://arxiv.org/abs/2505.09738v1

> llms face constraints due to fixed tokenization schemes, especially in specialized domains. we created a framework for efficient tokenizer transplantation with with high fidelity semantic preservation & minimal retraining.

> suboptimal tokenization causes fragmentation, inflating computational costs & degrading semantic coherence. our framework directly addresses this by enabling seamless adoption.

> our method employs a novel hybrid(local + global) heuristic to initialize embeddings for new, unique tokens, ensuring semantic continuity and preserving the base model's learned representations when adopting a new tokenizer.

> local heuristic: new tokens are split by the original tokenizer. an external embedding model compares semantic similarity between the full token and its sub tokens. these similarities weight the original sub token embeddings, with length normalization boosting effectiveness.

> global heuristic: using the same external embedding space and efficient vector search, we identify the nearest neighbors for the new token within the entire original vocab based on semantic similarity. original embeddings of these neighbours are averaged, weighted by similarity.

> our investigations validate that the transplantation heuristic successfully initializes unique tokens, outperforming conventional baselines and sophisticated methods including transtokenizer and retok.

> the paper also introduce supertokenizer. our goal is to create single tokens representing multiple words. unlike some methods needing whole words for multi word tokens, our approach is flexible. a super token combine words and subwords ("frequently_us" from "frequently used").

> the main concept is using randomization guided by a probability distribution for chunk length. we augment the dataset to create random chunking using this distribution, not based on whitespace splitting.

> the random chunking allows common multi word sequences (or word + subword sequences) to the bpe trainer and it automatically merges them based on frequency.

> supertokens aim to capture more meaningful semantic units. empirical results demonstrate notable compression gains across diverse domains with our supertoken trained tokenizer.

> our method efficiently overcomes tokenizer limits and it helps llms adapt better to specialized tasks and low resource settings. this marks our first attempt in presenting our research this way and we're committed to improving with future work.