entropizer is an anti-LLM text generator: it uses a real LLM tokenizer vocabulary, then chooses tokens uniformly at random with no prompt, context, logits, grammar, corpus, or most-likely-next-token step.
This project uses GitHub's MIT-licensed bpe-openai Rust crate from github/rust-gems. GitHub announced it on 2024-12-12 as a faster, more flexible byte-pair tokenizer. The default vocabulary is o200k_base.
That means the pieces are OpenAI-family BPE token chunks, but the sequence is not model output. Every token draw is independent and indifferent.
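The idea of independent, uniform draws can be sketched in a few lines of Python. This is only an illustration: `TOY_VOCAB` and `draw_tokens` are hypothetical names, the six-entry vocabulary stands in for the real o200k_base table, and Python's seeded generator will not reproduce the sequences that entropizer's `--seed` produces.

```python
import random

# Toy stand-in for a tokenizer vocabulary (hypothetical; the real tool loads
# o200k_base through the bpe-openai crate).
TOY_VOCAB = [b"the", b" cat", b"ing", b" 42", b"\n", b" quantum"]


def draw_tokens(vocab, count, seed=None):
    """Draw `count` token IDs uniformly and independently of each other."""
    rng = random.Random(seed)  # seed=None falls back to OS randomness
    # No prompt, no context, no logits: every ID is equally likely each time.
    return [rng.randrange(len(vocab)) for _ in range(count)]


ids = draw_tokens(TOY_VOCAB, 8, seed=42)
text = b"".join(TOY_VOCAB[i] for i in ids).decode("utf-8")
print(ids)
print(repr(text))
```

Because each draw ignores the previous ones, the same seed always gives the same ID list, and changing the token count only extends or truncates it.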
cargo run -- --tokens 200
cargo run -- --tokens 80 --seed 42
cargo run -- --tokenizer cl100k_base --tokens 120 --mode utf8
cargo run -- --tokens 80 --language japanese --seed 42
cargo run -- --tokens 500 --mode bytes > random-token-bytes.bin
cargo run -- --tokens 40 --show-ids
cargo run -- --tokens 200 --delay-ms 30

After building:
cargo build --release
./target/release/entropizer --tokens 100 --seed 123

--tokens <n>: number of random tokens to draw. Default: 100.
--seed <u64>: deterministic seed. If omitted, OS randomness creates a seed.
--tokenizer <name>: o200k_base, cl100k_base, or voyage3_base. Default: o200k_base.
--mode <mode>:
  utf8: default. Samples uniformly from standalone valid UTF-8 tokens and prints text.
  bytes: samples uniformly from the full token ID range and writes raw bytes.
  lossy: samples uniformly from the full token ID range and prints UTF-8-lossy text.
--language <filter>: approximate token filter. Supported values: any, english/ascii, latin, russian/cyrillic, greek, arabic, hebrew, hindi/devanagari, chinese/han, japanese, korean/hangul, thai.
--show-ids: streams metadata and sampled token IDs to stderr while stdout streams generated output.
--separator <s>: inserts a separator between decoded token pieces in utf8/lossy modes.
--delay-ms <n>: sleeps after each emitted token so very fast generations visibly stream instead of appearing as one burst.
For context-aware but deliberately wrong generation, scripts/uncanny_rank.py uses a small open model (HuggingFaceTB/SmolLM2-135M by default) and always picks the Nth most likely next token instead of sampling near the top.
Install optional Python deps, preferably with Python 3.10-3.12:
python3 -m pip install -r requirements-uncanny.txt

Run it:
./scripts/uncanny_rank.py --tokens 200 --rank 50 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --range 1,10 --seed 42 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --sequence 1,1,1,1,1,10 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --rank 50 --show-ids 2> ids.txt
./scripts/uncanny_rank.py --model HuggingFaceTB/SmolLM2-360M --rank 50 --tokens 200

--range START,END picks a fresh random rank from the inclusive range for every generated token. Use --seed to make that rank sequence reproducible.
--sequence 1,1,1,1,1,10 repeats a fixed rank pattern, so this example chooses the top token five times, then the 10th-ranked token once, then loops.
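As a rough sketch of how a fixed rank, a random range, and a repeating sequence can share one code path, the snippet below uses plain Python lists instead of real model logits. `rank_schedule` and `pick_by_rank` are hypothetical helpers, not functions from scripts/uncanny_rank.py, and clamping ranks that exceed the vocabulary size is an assumption made here for the toy example.

```python
import itertools
import random


def rank_schedule(rank=1, rank_range=None, sequence=None, seed=None):
    """Yield the 1-based rank to use for each generated token, forever."""
    if sequence is not None:
        yield from itertools.cycle(sequence)  # repeat the fixed pattern
    elif rank_range is not None:
        rng = random.Random(seed)
        start, end = rank_range
        while True:
            yield rng.randint(start, end)  # inclusive on both ends
    else:
        while True:
            yield rank


def pick_by_rank(scores, rank):
    """Return the index of the rank-th highest score (rank 1 = most likely)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumption: ranks past the end of the vocabulary clamp to the last entry.
    return order[min(rank, len(order)) - 1]


scores = [0.1, 2.5, -1.0, 0.7, 1.9]  # stand-in for next-token logits
ranks = list(itertools.islice(rank_schedule(sequence=[1, 1, 10]), 6))
print(ranks)                                     # [1, 1, 10, 1, 1, 10]
print([pick_by_rank(scores, r) for r in ranks])  # [1, 1, 2, 1, 1, 2]
```

In a real run the scores would be recomputed from the model after every chosen token, so a single off-rank pick changes everything that follows.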
The script streams decoded token pieces as they are selected. The first run may pause while the model downloads and loads.
utf8 mode filters the vocabulary to tokens whose individual byte payload is valid UTF-8. Use bytes mode for strict uniform sampling over the complete tokenizer vocabulary.
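The filter described above amounts to a decode check on each token's bytes taken in isolation. A minimal sketch, assuming a toy vocabulary and a hypothetical helper name `is_standalone_utf8` (the Rust implementation may differ in detail):

```python
def is_standalone_utf8(token_bytes):
    """True if the token's bytes decode as UTF-8 with no neighbouring tokens."""
    try:
        token_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Toy vocabulary: BPE tokens can end in the middle of a multi-byte character.
TOY_VOCAB = [
    b"hello",
    b" world",
    b"\xe3\x81",      # first two bytes of a three-byte character: invalid alone
    b"\xe3\x81\x82",  # the complete character U+3042
    b"\xff",          # never valid in UTF-8
]

utf8_ids = [i for i, tok in enumerate(TOY_VOCAB) if is_standalone_utf8(tok)]
print(utf8_ids)  # [0, 1, 3]
```

Sampling uniformly from `utf8_ids` is uniform over the filtered subset only, which is why bytes mode is the one to use when uniformity over the whole vocabulary matters.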
Language filtering is script-based, not dictionary-based. Tokenizers do not label tokens by natural language, so --language japanese means “tokens made of Han/Hiragana/Katakana plus common punctuation,” --language english means ASCII alphabetic token pieces, and so on. Any non-any language filter also requires standalone valid UTF-8 token bytes, even in bytes/lossy modes.
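A script-based filter of this kind can be approximated with code-point range checks. The sketch below is illustrative only: the function names, the exact Unicode ranges, the punctuation set, and the decision to tolerate spaces around English pieces are all assumptions, not the rules entropizer actually applies.

```python
JAPANESE_PUNCT = "、。「」！？"  # assumed punctuation set, for illustration


def is_japanese_char(ch):
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F     # Hiragana
        or 0x30A0 <= cp <= 0x30FF  # Katakana
        or 0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs (Han)
        or ch in JAPANESE_PUNCT
    )


def matches_language(token_bytes, language):
    """Approximate script test on one token's bytes."""
    if language == "any":
        return True  # no UTF-8 requirement when no filter is active
    try:
        text = token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False  # any real filter needs standalone valid UTF-8
    if language == "english":
        core = text.strip(" ")  # assumption: surrounding spaces are tolerated
        return core.isascii() and core.isalpha()
    if language == "japanese":
        return len(text) > 0 and all(is_japanese_char(ch) for ch in text)
    raise ValueError(f"unsupported language filter: {language}")


print(matches_language("ひらがな".encode("utf-8"), "japanese"))  # True
print(matches_language(b" the", "english"))                      # True
print(matches_language(b"\xff", "english"))                      # False
```

Note that a Han-only token passes both a Chinese and a Japanese script test, which is one reason the filter is approximate.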
Output streams token-by-token and flushes after each token, so long runs do not wait for the full generated string before printing. There is still an initial startup pause while bpe-openai loads/deserializes its tokenizer dictionary; after that, generation is usually fast enough to look like one burst unless you use --delay-ms.
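The write-then-flush pattern described above looks roughly like this in Python. `stream_pieces` is a hypothetical helper for illustration, not code from the project, and its `delay_ms` parameter only mimics what `--delay-ms` is described as doing.

```python
import sys
import time


def stream_pieces(pieces, delay_ms=0, out=sys.stdout):
    """Write each piece and flush immediately so it appears without buffering."""
    for piece in pieces:
        out.write(piece)
        out.flush()  # without this, output may arrive in one buffered burst
        if delay_ms > 0:
            time.sleep(delay_ms / 1000)  # optional pause to make streaming visible


stream_pieces(["ran", "dom", " tok", "ens", "\n"], delay_ms=30)
```

Flushing per token costs some throughput, which is a reasonable trade when the point is to watch the output arrive.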
cargo fmt
cargo test