Jailbreak Evaluation Exercises (Inspect AI)

Jailbreak Evaluation Exercises (Inspect AI)

Evaluate the production Cerebras gpt-oss-120b chatbot against jailbreak attacks using the UK AISI Inspect framework.

Setup

# From experiments/evals/
# Build the jailbreak dataset (96 samples: 63 jailbreak + 33 legitimate)
uv run build_dataset.py

# Verify experiments/.env has OPENAI_API_KEY and OPENAI_BASE_URL set
# (Cerebras is OpenAI-compatible — we use Inspect's built-in openai/ provider)
# OPENAI_BASE_URL='https://api.cerebras.ai/v1'
# OPENAI_API_KEY='<your CEREBRAS_API_KEY>'

Core Mental Model

Task = Dataset + Solver(s) + Scorer

Flow:
1. Dataset provides Samples (input + target + metadata)
2. Solver(s) transform TaskState (inject system context, call model)
3. Scorer evaluates output → Score ("C" correct / "I" incorrect)

Key object: TaskState holds messages, output, input_text, metadata

Exercise Progression

ExerciseConceptsWhat You Build
ex1_hello_world.pyTask, MemoryDataset, Sample, generate(), includes()8 hardcoded samples, basic eval skeleton
ex2_jailbreak_dataset.pyjson_dataset(), record_to_sample(), metadataLoad 96-sample JSONL dataset
ex3_refusal_scorer.py@scorer, Score, regex patterns, two-case logicCategory-aware refusal/answer scorer
ex4_llm_judge.pyget_model(), LLM-as-judge, CORRECT/INCORRECTClaude judges model behavior
ex5_system_context.py@solver, ChatMessageSystem, state manipulationInject system prompt + resume
ex6_jailbreak_eval.pyinspect_eval() API, results extractionPer-category jailbreak defense table

How to Work

  1. Read the exercise file’s docstring — it describes what to build and what imports to use
  2. Write the implementation from scratch
  3. Run the command shown in the docstring
  4. Check against solutions.py if needed
  5. Use inspect view to browse logs in the browser

CLI Commands

cd experiments/evals

# Run exercises directly
inspect eval ex1_hello_world.py --model openai/gpt-oss-120b
inspect eval ex2_jailbreak_dataset.py --model openai/gpt-oss-120b --limit 10
inspect eval ex3_refusal_scorer.py --model openai/gpt-oss-120b --limit 20
inspect eval ex4_llm_judge.py --model openai/gpt-oss-120b --limit 10
inspect eval ex5_system_context.py --model openai/gpt-oss-120b --limit 10

# Exercise 6: standalone script
uv run ex6_jailbreak_eval.py
uv run ex6_jailbreak_eval.py 20  # limit to 20 samples

# Run solutions directly
inspect eval solutions.py@jailbreak_hello_world --model openai/gpt-oss-120b
inspect eval solutions.py@context_eval --model openai/gpt-oss-120b --limit 10
uv run solutions.py        # full eval with per-category table
uv run solutions.py 20     # limit

# Runner script
uv run run_eval.py ex1
uv run run_eval.py ex5 --limit 10
uv run run_eval.py ex6 --use-solutions

# Browse eval logs
inspect view

Dataset

jailbreak_dataset.jsonl — 96 samples built by build_dataset.py:

CategoryCountSource
legitimate33eval_dataset.json
prompt_injection17DPO data + handwritten
off_topic16DPO data
creative_bypass13DPO data
emotional5DPO data + handwritten
roleplay4handwritten
authority3handwritten
hypothetical3handwritten
topic_drift2handwritten

Files

FileDescription
model_api.pySets up Cerebras as OpenAI-compatible provider for Inspect
build_dataset.pyBuilds jailbreak_dataset.jsonl from 3 sources
ex1-ex6Exercise description files (you write the code)
solutions.pyComplete reference implementations
run_eval.pyCLI runner for exercises
jailbreak_dataset.jsonl96-sample evaluation dataset

Verification Checklist

  • Ex1: inspect eval runs, logs appear in inspect view
  • Ex2: 96 samples loaded, metadata visible in logs
  • Ex3: Accuracy metric appears, jailbreaks scored correctly
  • Ex4: LLM judge produces CORRECT/INCORRECT scores
  • Ex5: Model answers Alex questions, refuses jailbreaks
  • Ex6: Per-category defense rate table printed