Rllama is a Ruby gem that wraps the high-performance llama.cpp shared library and allows you to load LLMs and generate text from Ruby code. No external API or network connection is required once the model file is on disk. At DocuSeal, we built the Rllama gem to enable semantic search for our API documentation using local embedding models.
require 'rllama'
# 1) Load a local GGUF model (download one first or pass an absolute path)
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')
# 2) Generate text
result = model.generate('Write a two-line poem about Ruby.') { |t| print t }
puts result.text
# 3) Optional: basic generation stats
puts "tokens: #{result.stats[:tokens_generated]}, tps: #{result.stats[:tps]}, seconds: #{result.stats[:duration]}"
# 4) Clean up
model.close
The gem also ships with a CLI (rllama) for quick interactive chats. It can list installed models, show a curated list of popular ones, download models for you, and accept a local path or a direct URL to a GGUF model on Hugging Face.
With the script above, you point Rllama to a .gguf model file, ask for a completion, and print the answer. Under the hood, it uses llama.cpp for efficient CPU inference, which makes it a good fit for laptops, servers, and CI machines where you want low-latency, private inference without managing an external service.
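Text generation is only half of the story: the semantic search mentioned in the intro is built on embedding models loaded the same way. The snippet below is a minimal sketch rather than the gem's documented API; it assumes an embedding-capable GGUF model (the path is a placeholder) and an embed method that returns an array of floats, then ranks two documentation snippets against a query using plain-Ruby cosine similarity.
require 'rllama'
# Cosine similarity between two vectors (plain Ruby, no extra gems).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end
# Assumption: an embedding-capable GGUF model and an `embed` method
# returning an Array of Floats; check the gem's README for the exact call.
model = Rllama.load_model('/path/to/an-embedding-model.gguf')
docs = [
  'POST /submissions creates a new signature request.',
  'Webhooks notify your app when a document is completed.'
]
doc_vectors = docs.map { |d| model.embed(d) }
query = 'How do I get notified when signing is done?'
query_vector = model.embed(query)
# Rank the documents by similarity to the query and print the best match.
best = docs.zip(doc_vectors).max_by { |_, v| cosine(query_vector, v) }
puts best.first
model.close
In practice you would embed your documents once, cache the vectors, and only embed the query at request time.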
If you prefer to explore before writing code, the CLI is the fastest way to test a model. After installing the gem, run rllama
to see your downloaded models and a selection of popular ones; pick one to chat, supply a local path, or pass a direct model URL.
# Install the gem
gem install rllama
# Open an interactive chat (shows installed and popular models)
rllama
Rllama works with GGUF models. You can browse and download them from Hugging Face (look for GGUF), then point the gem at the file. The CLI can also download a model for you on demand. As a starting point, smaller instruction‑tuned models are great for laptops; you can switch to larger files when you need better reasoning.
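For example, if you have already downloaded a file yourself, loading it by filesystem path works just like the quickstart above; the path below is a placeholder, so substitute wherever you saved the model.
require 'rllama'
# Placeholder path; point this at the GGUF file you downloaded.
model = Rllama.load_model('/home/me/models/gemma-3-1B-it-QAT-Q4_0.gguf')
result = model.generate('Reply with a one-sentence greeting.')
puts result.text
model.close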