Why I’m Stress-testing LLMs on Deucalion
Testing an LLM once gives you an answer; testing it thousands of times gives you probabilities, distributions, and the mathematical backing for enterprise-grade reliability
This week, I logged (again) into Deucalion, one of Europe’s EuroHPC supercomputers, with one goal in mind: bulk-testing large language models.
Why? Because GAIA, our product at Algorithm G, depends on system prompts that are not just creative but consistent.
From Single Answers to Bulk Patterns
Here’s a basic experiment (deliberately elementary, and something anyone can try at home): I asked a model the same simple question ten times:
"List 3 European capitals"
And the results were telling. Paris appeared in every answer, always first. Berlin and Rome also appeared every time, but swapped positions between runs. Across 10 runs, only those 3 cities ever surfaced: no Lisbon, no Madrid, no Vienna… no London!
On the surface, this looks like noise. But at scale, this distribution is the signal.
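To make the experiment concrete, here is a minimal sketch of the repeat loop, assuming a Hugging Face access token and the huggingface_hub client; the model ID matches the remote setup described further down, and the token is a placeholder.

```python
# Minimal sketch: ask the same question N times via the Hugging Face Inference API.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    token="hf_xxx",  # placeholder: use your own access token
)

N_RUNS = 10
PROMPT = "List 3 European capitals"

answers = []
for _ in range(N_RUNS):
    out = client.chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=100,
        temperature=0.7,  # sampling enabled, so repeated runs can differ
    )
    answers.append(out.choices[0].message.content)

for i, answer in enumerate(answers, 1):
    print(f"--- Answer {i} ---\n{answer}")
```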
Math Behind Consistency
Think of each model response as a sample from a probability distribution.
Paris: 10/10 appearances, and always in first position.
Berlin: 10/10 appearances.
Rome: 10/10 appearances.
Every other capital: 0/10.
This isn’t random; it’s how the model encodes salience. With more runs, we can estimate stable probabilities. That’s gold for product design.
=== RESULTS ===

--- Answer 1 ---
Here are 3 European capitals:
1. Paris (France)
2. Rome (Italy)
3. Berlin (Germany)

--- Answer 2 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 3 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 4 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 5 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 6 ---
Here are 3 European capitals:
1. Paris (France)
2. Rome (Italy)
3. Berlin (Germany)

--- Answer 7 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 8 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 9 ---
Here are 3 European capitals:
1. Paris (France)
2. Berlin (Germany)
3. Rome (Italy)

--- Answer 10
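Turning that raw output into counts takes only a few lines. This is a sketch, not my exact harness: in practice the answers list is the one collected by the repeat loop above, and the simple substring matching is an assumption.

```python
# Sketch: tally how often each capital appears, and how often it comes first.
# A tiny inline sample is included so the snippet runs on its own.
from collections import Counter

answers = [
    "1. Paris (France)\n2. Rome (Italy)\n3. Berlin (Germany)",
    "1. Paris (France)\n2. Berlin (Germany)\n3. Rome (Italy)",
]

CAPITALS = ["Paris", "Berlin", "Rome", "Lisbon", "Madrid", "Vienna", "London"]
appearance = Counter()
first_position = Counter()

for answer in answers:
    mentioned = [c for c in CAPITALS if c in answer]
    appearance.update(mentioned)
    if mentioned:
        # the tracked capital that occurs earliest in the text
        first_position[min(mentioned, key=answer.index)] += 1

n = len(answers)
for city in CAPITALS:
    print(f"{city:>7}: {appearance[city]}/{n} appearances, "
          f"{first_position[city]}/{n} in first position")
```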
From Samples to Statistics
One run is a guess. Ten runs give intuition. But thousands of trials across many prompts unlock something deeper: statistics.
Each model answer can be treated as a random variable drawn from an underlying distribution. By repeating the same prompt, we empirically estimate the probability of each outcome.
For example, if Paris appears in 9,742 out of 10,000 trials, we can say its estimated probability is 97.42%. More importantly, we can calculate a 95% confidence interval around that estimate:

p ± 1.96 · √( p · (1 − p) / n )

where “p” is the observed proportion and “n” is the number of trials. This interval tells us with mathematical rigour how certain we are that Paris will appear in future generations.
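As a sanity check on the arithmetic, here is that interval computed for the hypothetical 9,742-out-of-10,000 case (the numbers are illustrative, not measured results):

```python
# Sketch: 95% confidence interval for an observed proportion (normal approximation).
import math

def confidence_interval_95(successes: int, n: int) -> tuple[float, float]:
    p = successes / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = confidence_interval_95(9_742, 10_000)
print(f"Estimated probability: {9_742 / 10_000:.2%}")
print(f"95% CI: [{low:.2%}, {high:.2%}]")  # roughly [97.11%, 97.73%]
```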
Scaling to thousands of prompts, each with thousands of trials, gives us a map of model behaviour. Instead of anecdotes, we have probability distributions with quantified uncertainty. That’s the kind of mathematical guarantee enterprise clients need before trusting an AI in production.
How I Got Set Up on a Supercomputer like Deucalion (in plain words)
Getting started on Deucalion:
Login – secure SSH access with my project account
Workspace – created a clean folder for my experiments (using PowerShell)
Environment – set up Python and installed the Hugging Face libraries (with a private access token)
Models – downloaded open-source LLMs locally into the project space
Jobs – used Slurm (the scheduler) to request compute power, including GPUs when available
That’s it. In less than an afternoon, I went from a blank login screen to running large language model tests on one of Europe’s fastest machines.
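To make the “Models” step concrete, a sketch like this is enough to pull an open model into the project space instead of the (small) home quota. The local_dir path is a placeholder; point it at your own project folder.

```python
# Sketch: download an open-source LLM into the project space on the cluster.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct",
    local_dir="/projects/my_project/models/qwen2.5-1.5b-instruct",  # placeholder path
    token="hf_xxx",  # only needed for gated models
)
```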
I’ve been working with two main families of models:
Qwen 2.5 Instruct (from Alibaba / Qwen team)
Tried first with Qwen2.5-7B-Instruct
Then used Qwen2.5-1.5B-Instruct (smaller, lighter, and a better fit for my project space).
Meta Llama 3.1 Instruct (via Hugging Face Inference API, not local)
Specifically: meta-llama/Meta-Llama-3.1-8B-Instruct
This is what gave the consistent “Paris, Berlin, Rome” answers in the 10-repeat bulk test above.
So in short:
Local (on Deucalion): Qwen2.5-1.5B-Instruct
Remote (HF API): Meta-Llama-3.1-8B-Instruct
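For the local runs, the per-prompt building block looks roughly like the sketch below; the model path, sampling settings, and repeat count are assumptions, and a Slurm job simply wraps this in the same repeat-and-tally loop shown earlier.

```python
# Sketch: generate one answer from the locally downloaded Qwen2.5-1.5B-Instruct.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/projects/my_project/models/qwen2.5-1.5b-instruct"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

messages = [{"role": "user", "content": "List 3 European capitals"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```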
Why It Matters for GAIA (and every product!)
Clients don’t want surprises in production. GAIA’s job is to craft robust system prompts that lead to predictable outcomes, or at least outcomes whose probabilities we can measure. Testing once is anecdote; testing in bulk is evidence.
Deucalion gives me the horsepower to run these tests thousands of times in parallel. Not just “Does the model know?”, but “With what probability does it answer X?”. That’s the difference between demo-ready and enterprise-ready.
The Big Picture
Bulk LLM testing isn’t about catching errors; it’s about mapping the statistical shape of answers.
It shows which outputs are universal (Paris),
which appear reliably but in unstable order (Berlin vs. Rome),
and which never surface at all (Lisbon, Madrid, London (!)…).
That insight feeds directly into GAIA’s fine-tuning pipeline, turning model randomness into measured reliability.
💡 Takeaway: With GAIA + Deucalion, we’re not just generating text; we’re quantifying certainty. And that’s what clients actually pay for.