entropizer

entropizer is an anti-LLM text generator: it uses a real LLM tokenizer vocabulary, then chooses tokens uniformly at random with no prompt, context, logits, grammar, corpus, or most-likely-next-token step.

Tokenizer

This project uses GitHub's MIT-licensed bpe-openai Rust crate from github/rust-gems. GitHub announced it on 2024-12-12 as a faster, more flexible byte-pair tokenizer. The default vocabulary is o200k_base.

That means the pieces are OpenAI-family BPE token chunks, but the sequence is not model output. Each token is drawn independently of every other draw, with no preference for any token over another.
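
As a rough illustration of what uniform, independent draws mean, here is a minimal Python sketch. It is not the project's Rust code: TOY_VOCAB and draw_tokens are hypothetical names, and the six-entry vocabulary stands in for a real BPE token table.

```python
import random

# Hypothetical toy vocabulary standing in for a real BPE token table.
TOY_VOCAB = [b"the", b" cat", b"ing", b" 42", b"\n", b" tok"]


def draw_tokens(vocab, count, seed=None):
    """Draw `count` token IDs uniformly at random, each independent of the rest."""
    rng = random.Random(seed)  # a fixed seed makes the draw sequence repeatable
    return [rng.randrange(len(vocab)) for _ in range(count)]


ids = draw_tokens(TOY_VOCAB, 8, seed=42)
# Every toy payload is valid UTF-8, so the joined bytes decode cleanly.
text = b"".join(TOY_VOCAB[i] for i in ids).decode("utf-8")
print(ids, repr(text))
```

No draw looks at the previous ones, which is the whole difference from model output.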

Usage

cargo run -- --tokens 200
cargo run -- --tokens 80 --seed 42
cargo run -- --tokenizer cl100k_base --tokens 120 --mode utf8
cargo run -- --tokens 80 --language japanese --seed 42
cargo run -- --tokens 500 --mode bytes > random-token-bytes.bin
cargo run -- --tokens 40 --show-ids
cargo run -- --tokens 200 --delay-ms 30

After building:

cargo build --release
./target/release/entropizer --tokens 100 --seed 123

Options

--tokens <n>: number of random tokens to draw. Default: 100.
--seed <u64>: deterministic seed. If omitted, a seed is drawn from OS randomness.
--tokenizer <name>: o200k_base, cl100k_base, or voyage3_base. Default: o200k_base.
--mode <mode>:
- utf8: default. Samples uniformly from standalone valid UTF-8 tokens and prints text.
- bytes: samples uniformly from the full token ID range and writes raw bytes.
- lossy: samples uniformly from the full token ID range and prints UTF-8-lossy text.
--language <filter>: approximate token filter. Supported values: any, english/ascii, latin, russian/cyrillic, greek, arabic, hebrew, hindi/devanagari, chinese/han, japanese, korean/hangul, thai.
--show-ids: streams metadata and sampled token IDs to stderr while stdout streams generated output.
--separator <s>: inserts a separator between decoded token pieces in utf8/lossy modes.
--delay-ms <n>: sleeps after each emitted token so very fast generations visibly stream instead of appearing as one burst.

Uncanny rank-50 LLM mode

For context-aware but deliberately wrong generation, scripts/uncanny_rank.py uses a small open model (HuggingFaceTB/SmolLM2-135M by default) and always picks the Nth most likely next token instead of sampling near the top.
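
The rank-picking idea can be sketched in plain Python without a model. This is an assumption-laden illustration, not the script's code: pick_rank is a hypothetical helper, the logits are made up, and clamping an out-of-range rank to the last position is a choice made here for the sketch.

```python
def pick_rank(logits, rank):
    """Return the token ID whose logit is the rank-th highest (rank 1 = most likely)."""
    # Token IDs ordered from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # Assumption for this sketch: clamp ranks beyond the vocabulary to the last entry.
    rank = max(1, min(rank, len(order)))
    return order[rank - 1]


# Made-up next-token logits for a four-token vocabulary.
logits = [0.1, 2.5, -1.0, 1.7]
top = pick_rank(logits, 1)
second = pick_rank(logits, 2)
clamped = pick_rank(logits, 50)
print(top, second, clamped)
```

With a real model the logits would come from a forward pass over the prompt plus the tokens chosen so far.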

Install optional Python deps, preferably with Python 3.10-3.12:

python3 -m pip install -r requirements-uncanny.txt

Run it:

./scripts/uncanny_rank.py --tokens 200 --rank 50 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --range 1,10 --seed 42 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --sequence 1,1,1,1,1,10 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --rank 50 --show-ids 2> ids.txt
./scripts/uncanny_rank.py --model HuggingFaceTB/SmolLM2-360M --rank 50 --tokens 200

--range START,END picks a fresh random rank from the inclusive range for every generated token. Use --seed to make that rank sequence reproducible.

--sequence 1,1,1,1,1,10 repeats a fixed rank pattern, so this example chooses the top token five times, then the 10th-ranked token once, then loops.
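
The two rank schedules above can be sketched as Python generators. The function names are hypothetical and the script's internals may differ; the sketch only mirrors the behaviour described here.

```python
import itertools
import random


def range_ranks(start, end, seed=None):
    """Yield a fresh random rank from the inclusive range [start, end] for each token."""
    rng = random.Random(seed)  # seeding makes the rank sequence reproducible
    while True:
        yield rng.randint(start, end)


def sequence_ranks(pattern):
    """Yield ranks from a fixed pattern, looping back to the start when it ends."""
    return itertools.cycle(pattern)


looped = list(itertools.islice(sequence_ranks([1, 1, 1, 1, 1, 10]), 8))
sampled = list(itertools.islice(range_ranks(1, 10, seed=42), 8))
print(looped)
print(sampled)
```

The looped pattern yields five top-ranked picks, one 10th-ranked pick, then starts over.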

The script streams decoded token pieces as they are selected. The first run may pause while the model downloads and loads.

Notes

utf8 mode filters the vocabulary to tokens whose individual byte payload is valid UTF-8. Use bytes mode for strict uniform sampling over the complete tokenizer vocabulary.
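
A minimal sketch of the standalone-UTF-8 check, in Python rather than the project's Rust. The helper name and the toy payloads are hypothetical; the point is that a token holding only part of a multi-byte character fails the check on its own.

```python
def is_standalone_utf8(payload: bytes) -> bool:
    """True when the token's bytes decode as UTF-8 without any neighbouring tokens."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical token payloads: two complete strings and three partial sequences.
TOY_VOCAB = [b"hello", "é".encode("utf-8"), b"\xc3", b"\xa9", b"\xe6\x97"]

utf8_ids = [i for i, payload in enumerate(TOY_VOCAB) if is_standalone_utf8(payload)]
print(utf8_ids)
```

Sampling uniformly from utf8_ids is uniform over the filtered set only, which is why bytes mode is the one that covers the complete vocabulary.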

Language filtering is script-based, not dictionary-based. Tokenizers do not label tokens by natural language, so --language japanese means “tokens made of Han/Hiragana/Katakana plus common punctuation,” --language english means ASCII alphabetic token pieces, and so on. Any non-any language filter also requires standalone valid UTF-8 token bytes, even in bytes/lossy modes.
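
A script-based filter of this kind might look like the following Python sketch. The code point ranges are an assumption chosen for illustration (CJK punctuation, Hiragana, Katakana, and the main Han block); the ranges the project actually uses are not stated here and may be wider or narrower.

```python
# Assumed ranges for illustration only, not the project's actual table.
JAPANESE_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Han)
]


def matches_script(payload: bytes, ranges) -> bool:
    """True when the payload is standalone UTF-8 and every character is in `ranges`."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return False  # partial byte sequences never pass a language filter
    if not text:
        return False
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in text)


print(matches_script("こんにちは".encode("utf-8"), JAPANESE_RANGES))
print(matches_script(b"hello", JAPANESE_RANGES))
```

The check never consults a dictionary, so a Han-only token passes the Japanese filter even if it reads as Chinese.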

Output streams token-by-token and flushes after each token, so long runs do not wait for the full generated string before printing. There is still an initial startup pause while bpe-openai loads/deserializes its tokenizer dictionary; after that, generation is usually fast enough to look like one burst unless you use --delay-ms.
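
The flush-per-token behaviour, with an optional delay, can be sketched in Python as follows. stream_pieces is a hypothetical helper and the in-memory buffer is only there to keep the example self-contained; the tool itself writes to stdout.

```python
import io
import sys
import time


def stream_pieces(pieces, delay_ms=0, out=sys.stdout):
    """Write each piece as soon as it is available, flushing after every write."""
    for piece in pieces:
        out.write(piece)
        out.flush()  # push the piece out now instead of waiting for a full buffer
        if delay_ms > 0:
            time.sleep(delay_ms / 1000)  # slow the stream so each piece is visible


# Demonstration against an in-memory buffer instead of the terminal.
buffer = io.StringIO()
stream_pieces(["ran", "dom", " pieces"], delay_ms=1, out=buffer)
print(buffer.getvalue())
```

Without the delay the loop finishes so quickly that the flushes are not noticeable to a reader at the terminal.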

Development

cargo fmt
cargo test
