Evaluation

Red-teaming is not “asking rude questions”

Serious adversarial evaluation maps failure regions under stress—then locks them into regression suites so patches do not rot.

Model the adversary

Beyond prompt tricks, examine data exfiltration via retrieval, confused-deputy patterns in agents, and cross-boundary injections where tools fetch untrusted content. Multimodal attacks differ from text-only—see multimodal fusion.

Pair exploration with regression

Ad-hoc “gotchas” make headlines; durable safety needs repeatable tests after each change—including quantization or policy updates.

Benchmarks are not substitutes

Public leaderboards lag attacker creativity— benchmarks essay.

Human factors

Operators override safeguards when workflows feel blocked— automation bias and organizational incentives matter as much as logits.

Blue team / red team handoffs

Findings should land in issue trackers with severity, reproduction steps, and proposed mitigations—closing the loop between offensive research and defensive engineering; regression tests lock fixes.

Third-party audits

Independent auditors bring fresh adversarial imagination; scope and access agreements must be explicit—especially when testing open-weight derivatives.