Run open-source LLMs locally with Ruby

Written by Pete Matsyburka

Rllama is a Ruby gem that wraps the high-performance llama.cpp shared library and allows you to load LLMs and generate text from Ruby code. No external API or network connection is required once the model file is on disk. At DocuSeal, we built the Rllama gem to enable semantic search for our API documentation using local embedding models.

require 'rllama'

# 1) Load a local GGUF model (download one first or pass an absolute path)
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

# 2) Generate text (the block streams tokens as they arrive)
result = model.generate('Write a two-line poem about Ruby.') { |t| print t }
puts result.text

# 3) Optional: basic generation stats
puts "tokens: #{result.stats[:tokens_generated]}, tps: #{result.stats[:tps]}, seconds: #{result.stats[:duration]}"

# 4) Clean up
model.close

Features

  • A Ruby API to load a local model and generate completions. You can stream tokens as they arrive, adjust generation parameters (max_tokens, temperature, top_k, top_p), and view simple stats like tokens per second.
  • Built-in chat context for multi-turn conversations, with support for system prompts and role-based message lists (see the sketch after this list).
  • Embeddings support to load an embedding model and embed single strings or arrays (also sketched below).
  • A CLI (rllama) for quick interactive chats. It can list installed models, show a curated popular list, download models for you, and accept a local path or direct URL to a GGUF model on Hugging Face.
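
The chat and embeddings features above can be sketched roughly as follows. The method names used here (init_chat, message, embed) and the embedding model file are illustrative assumptions rather than confirmed API, so check the gem's README for the exact calls.

require 'rllama'

model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

# Multi-turn chat with a system prompt (method names are illustrative)
chat = model.init_chat(system: 'You are a concise assistant.')
puts chat.message('What is a GGUF file?')
puts chat.message('How is it different from a raw safetensors checkpoint?')

model.close

# Embedding a string with a dedicated embedding model
# (the embed method and model filename are illustrative)
embedder = Rllama.load_model('nomic-embed-text-v1.5.Q8_0.gguf')
vector = embedder.embed('DocuSeal lets you create and sign documents.')
puts vector.length # dimensionality of the embedding vector

embedder.close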

How it works

With the script above, you point Rllama to a .gguf model file, ask for a completion, and print the answer. Under the hood, it uses llama.cpp for efficient CPU inference, which makes it a good fit for laptops, servers, and CI machines where you want low‑latency, private inference without managing an external service.
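
The features list above mentions adjustable generation parameters. Here is a minimal sketch, assuming max_tokens, temperature, top_k, and top_p are passed as keyword arguments to generate (the exact option names may differ in the current release):

require 'rllama'

model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

# Assumed keyword arguments, matching the parameter names listed under Features
result = model.generate(
  'Summarize what a GGUF file is in one sentence.',
  max_tokens: 128,   # cap the completion length
  temperature: 0.7,  # lower values give more deterministic output
  top_k: 40,         # sample from the 40 most likely tokens
  top_p: 0.9         # nucleus sampling threshold
)

puts result.text
model.close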

Install and try it from the terminal

If you prefer to explore before writing code, the CLI is the fastest way to test a model. After installing the gem, run rllama to see your downloaded models and a selection of popular ones; pick one to chat with, supply a local path, or pass a direct model URL.

# Install the gem
gem install rllama

# Open an interactive chat (shows installed and popular models)
rllama

Finding models

Rllama works with GGUF models. You can browse and download them from Hugging Face (look for GGUF), then point the gem at the file. The CLI can also download a model for you on demand. As a starting point, smaller instruction‑tuned models are great for laptops; you can switch to larger files when you need better reasoning.
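
Once a file is on disk, loading it is just a matter of passing its path. The directory and filename below are only examples:

require 'rllama'

# Point this at whatever GGUF file you downloaded (path is illustrative)
model = Rllama.load_model('/path/to/models/gemma-3-1B-it-QAT-Q4_0.gguf')

puts model.generate('Say hello in five words.').text
model.close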

When this is most useful

  • Great for prototyping Rails features, building offline agents, performing semantic search, summarizing text, and generating dummy data.
  • Privacy by default, with no reliance on third-party providers.
  • Simple deployment: the gem bundles prebuilt llama.cpp shared-library binaries.