Tag: Safety

OpenAI trains GPT-Red to attack its own models at scale (openai.com)
AI · just now · July 16, 2026

Lilian Weng: the harness matters as much as the model (lilianweng.github.io)
AI · 2 days ago · July 13, 2026

Claude has an internal workspace, and Anthropic built a tool to read it (anthropic.com)
AI · 1 week ago · July 7, 2026

AI out-persuades expert humans, but the edge is speed, not eloquence (importai.substack.com)
AI · 1 week ago · July 6, 2026

Anthropic drafts a severity scale for AI jailbreaks (anthropic.com)
AI · 1 week ago · July 4, 2026

DeepMind plans to treat its own AI agents as insider threats (deepmind.google)
AI · 3 weeks ago · June 21, 2026

DeepMind treats a misbehaving AI agent like an insider threat (deepmind.google)
AI · 3 weeks ago · June 20, 2026

OpenAI predicts model behavior by replaying real conversations (openai.com)
AI · 3 weeks ago · June 18, 2026

Florida sues OpenAI over ChatGPT-linked harms (techcrunch.com)
AI · 1 month ago · June 1, 2026

Anthropic tests an ethical reminder tool that lowers Claude's misaligned behavior (anthropic.com)
AI · 1 month ago · May 25, 2026

Kapoor and Narayanan argue against extraordinary AI rules (normaltech.ai)
AI · 1 month ago · May 21, 2026

OpenAI adds provenance signals to its AI images (openai.com)
AI · 1 month ago · May 19, 2026

OpenAI's new default ChatGPT model hallucinates less (openai.com)
AI · 1 month ago · May 18, 2026

DeepMind built a way to measure when AI manipulates people (deepmind.google)
AI · 3 months ago · March 26, 2026

Reward Hacking: Why Better Models Game You More (lilianweng.github.io)
AI · 1 year ago · November 28, 2024

Crash Testing GPT-4: The First Dangerous-Capability Eval (asteriskmag.com)
AI · 3 years ago · June 1, 2023

A Field Guide to the AI Safety Camps (asteriskmag.com)
AI · 3 years ago · June 1, 2023