mia-cx/entropizer

entropizer

entropizer is an anti-LLM text generator: it uses a real LLM tokenizer vocabulary, then chooses tokens uniformly at random with no prompt, context, logits, grammar, corpus, or most-likely-next-token step.

Tokenizer

This project uses GitHub's MIT-licensed bpe-openai Rust crate from github/rust-gems. GitHub announced it on 2024-12-12 as a faster, more flexible byte-pair tokenizer. The default vocabulary is o200k_base.

That means the pieces are OpenAI-family BPE token chunks, but the sequence is not model output. Every token draw is independent and indifferent.
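To illustrate what "independent" draws mean here, the following is a minimal sketch in Python rather than the project's Rust. `TOY_VOCAB` and `draw_tokens` are hypothetical names, and the six-entry vocabulary stands in for a real table such as o200k_base; this is not entropizer's own code.

```python
import random

# Hypothetical toy vocabulary; the real tool uses a BPE table such as o200k_base.
TOY_VOCAB = [b"the", b" cat", b"ing", b" 42", b"\n", b" entropy"]

def draw_tokens(n, seed=None, vocab_size=len(TOY_VOCAB)):
    """Draw n token IDs uniformly at random; no draw depends on any other."""
    # seed=None lets Python seed itself (from OS randomness where available).
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(n)]

ids = draw_tokens(8, seed=42)
print(ids)
print(b"".join(TOY_VOCAB[i] for i in ids).decode("utf-8"))
```

With a fixed seed the same ID sequence comes back on every run, which is the behaviour the --seed option describes.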

Usage

cargo run -- --tokens 200
cargo run -- --tokens 80 --seed 42
cargo run -- --tokenizer cl100k_base --tokens 120 --mode utf8
cargo run -- --tokens 80 --language japanese --seed 42
cargo run -- --tokens 500 --mode bytes > random-token-bytes.bin
cargo run -- --tokens 40 --show-ids
cargo run -- --tokens 200 --delay-ms 30

After building:

cargo build --release
./target/release/entropizer --tokens 100 --seed 123

Options

  • --tokens <n>: number of random tokens to draw. Default: 100.
  • --seed <u64>: seed for reproducible output. If omitted, a seed is drawn from OS randomness.
  • --tokenizer <name>: o200k_base, cl100k_base, or voyage3_base. Default: o200k_base.
  • --mode <mode>:
    • utf8: default. Samples uniformly from standalone valid UTF-8 tokens and prints text.
    • bytes: samples uniformly from the full token ID range and writes raw bytes.
    • lossy: samples uniformly from the full token ID range and prints UTF-8-lossy text.
  • --language <filter>: approximate token filter. Supported values: any, english/ascii, latin, russian/cyrillic, greek, arabic, hebrew, hindi/devanagari, chinese/han, japanese, korean/hangul, thai.
  • --show-ids: streams metadata and sampled token IDs to stderr while stdout streams generated output.
  • --separator <s>: inserts a separator between decoded token pieces in utf8/lossy modes.
  • --delay-ms <n>: sleeps after each emitted token so very fast generations visibly stream instead of appearing as one burst.

Uncanny rank-50 LLM mode

For context-aware but deliberately wrong generation, scripts/uncanny_rank.py uses a small open model (HuggingFaceTB/SmolLM2-135M by default) and always picks the Nth most likely next token instead of sampling near the top.
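The core of "pick the Nth most likely token" can be sketched in plain Python without loading a model. `pick_rank` and the logits below are hypothetical; how scripts/uncanny_rank.py actually obtains and ranks the model's logits is not shown in this README.

```python
def pick_rank(logits, rank):
    """Return the token ID whose logit is the rank-th largest (rank 1 = argmax)."""
    if not 1 <= rank <= len(logits):
        raise ValueError("rank must be between 1 and the vocabulary size")
    # Sort token IDs by descending logit; ties keep the lower ID first.
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return order[rank - 1]

# Hypothetical logits for a five-token vocabulary.
logits = [0.1, 2.5, -1.0, 1.7, 0.9]
print(pick_rank(logits, 1))  # -> 1 (greedy choice)
print(pick_rank(logits, 3))  # -> 4 (third most likely)
```

Rank 1 is ordinary greedy decoding. Larger ranks still depend on the context, but they deliberately skip the candidates the model prefers.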

Install optional Python deps, preferably with Python 3.10-3.12:

python3 -m pip install -r requirements-uncanny.txt

Run it:

./scripts/uncanny_rank.py --tokens 200 --rank 50 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --range 1,10 --seed 42 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --sequence 1,1,1,1,1,10 --prompt "Once upon a time"
./scripts/uncanny_rank.py --tokens 200 --rank 50 --show-ids 2> ids.txt
./scripts/uncanny_rank.py --model HuggingFaceTB/SmolLM2-360M --rank 50 --tokens 200

--range START,END picks a fresh random rank from the inclusive range for every generated token. Use --seed to make that rank sequence reproducible.

--sequence 1,1,1,1,1,10 repeats a fixed rank pattern, so this example chooses the top token five times, then the 10th-ranked token once, then loops.
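The two rank schedules described above can be sketched as generators. `range_ranks` and `sequence_ranks` are illustrative names and an assumption about the behaviour, not the script's actual functions.

```python
import itertools
import random

def range_ranks(start, end, seed=None):
    """Yield a fresh random rank from the inclusive range [start, end] forever."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(start, end)  # randint includes both endpoints

def sequence_ranks(pattern):
    """Yield the ranks in pattern in order, looping back to the start."""
    return itertools.cycle(pattern)

print(list(itertools.islice(sequence_ranks([1, 1, 1, 1, 1, 10]), 8)))
# -> [1, 1, 1, 1, 1, 10, 1, 1]
print(list(itertools.islice(range_ranks(1, 10, seed=42), 8)))
```

Each generated token would consume one rank from the chosen schedule, so a seeded range schedule reproduces the same rank sequence on every run.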

The script streams decoded token pieces as they are selected. The first run may pause while the model downloads and loads.

Notes

utf8 mode filters the vocabulary to tokens whose individual byte payload is valid UTF-8. Use bytes mode for strict uniform sampling over the complete tokenizer vocabulary.
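A minimal sketch of that filter, using hypothetical token payloads in Python; the real check lives in the Rust code and may differ in detail.

```python
# Hypothetical token byte payloads; real BPE vocabularies contain many
# tokens that are only fragments of a multi-byte UTF-8 character.
TOY_TOKENS = [b"hello", b" world", "日本".encode("utf-8"), b"\xe6\x97", b"\xff"]

def is_standalone_utf8(payload):
    """True if the token's bytes decode as UTF-8 on their own."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_ids = [i for i, tok in enumerate(TOY_TOKENS) if is_standalone_utf8(tok)]
print(utf8_ids)  # -> [0, 1, 2]

# Lossy-style decoding keeps every token and substitutes U+FFFD for bad bytes.
print(TOY_TOKENS[3].decode("utf-8", errors="replace"))
```

Token 3 is the first two bytes of a three-byte character, so it is excluded from the utf8 pool. It stays reachable in bytes and lossy modes, which sample from the full token ID range.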

Language filtering is script-based, not dictionary-based. Tokenizers do not label tokens by natural language, so --language japanese means “tokens made of Han/Hiragana/Katakana plus common punctuation,” --language english means ASCII alphabetic token pieces, and so on. Any non-any language filter also requires standalone valid UTF-8 token bytes, even in bytes/lossy modes.
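A script-based filter of this kind might look roughly like the sketch below. The punctuation allow-list, the function names, and the use of Unicode character names to detect scripts are all assumptions made for illustration; the tool's real rules may be broader or narrower.

```python
import unicodedata

# Assumed punctuation allow-list; the tool's actual list is not documented here.
JAPANESE_PUNCT = set("、。「」・ー")

def is_japanese_char(ch):
    """True for Han, Hiragana, Katakana, or allow-listed punctuation."""
    if ch in JAPANESE_PUNCT:
        return True
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA"))

def decode_or_none(payload):
    """Decode standalone UTF-8 bytes, or return None if they are invalid."""
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        return None

def matches_japanese(payload):
    text = decode_or_none(payload)
    return bool(text) and all(is_japanese_char(ch) for ch in text)

def matches_english(payload):
    text = decode_or_none(payload)
    return bool(text) and text.isascii() and text.isalpha()

print(matches_japanese("日本語".encode("utf-8")))  # -> True
print(matches_japanese(b"hello"))                  # -> False
print(matches_english(b"hello"))                   # -> True
```

Because the test looks only at characters, a Han-only token passes the japanese filter even if it would normally be read as Chinese, which is why the filter is described as approximate.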

Output streams token-by-token and flushes after each token, so long runs do not wait for the full generated string before printing. There is still an initial startup pause while bpe-openai loads/deserializes its tokenizer dictionary; after that, generation is usually fast enough to look like one burst unless you use --delay-ms.
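The write, flush, sleep pattern described above can be sketched in Python as follows. `stream_pieces` is a hypothetical helper, not the Rust implementation.

```python
import io
import sys
import time

def stream_pieces(pieces, out=sys.stdout, delay_ms=0):
    """Write each piece, flush immediately, then optionally pause."""
    for piece in pieces:
        out.write(piece)
        out.flush()  # push the piece out now instead of waiting for a full buffer
        if delay_ms:
            time.sleep(delay_ms / 1000)

# Visible streaming on the terminal, roughly like --delay-ms 30.
stream_pieces(["ran", "dom", " tok", "ens\n"], delay_ms=30)

# The same helper writing into an in-memory buffer.
buf = io.StringIO()
stream_pieces(["a", "b", "c"], out=buf)
print(buf.getvalue())  # -> abc
```

Without the per-piece flush, output sent to a pipe or file would typically sit in a buffer and arrive in large chunks regardless of the delay.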

Development

cargo fmt
cargo test
