Breach Protocol

Jul 1

Put AI agents in charge of a Civilization game and they reach for the nukes

#agents #alignment #safety #benchmarks

3 min read

Tech_Nuggets

Jun 16

RLHF vs DPO vs IPO vs KTO: which alignment method should you use

#llm #ai #alignment #opensource

8 min read

Kengo Nonaka

Jun 11

The Paperclip Factory Is Already Built

#ai #alignment #philosophy #ethics

22 min read

DrMBL

May 30

Reading Claude's Mind: Anthropic's Natural Language Autoencoders Open a New Window Into Agent Alignment

#ai #agents #aisafety #alignment

4 min read

Nelson Amaya

May 31

AI Alignment is a Systems Architecture Problem, Not a Prompt Problem

#ai #alignment #agents

5 min read

Tom Lee

May 15

We Built Soul Spec for 12 Weeks. Anthropic Just Proved Why It Works.

#ai #anthropic #alignment #research

5 min read

joinwell52

Apr 29

What the agents say about FCoP, when you ask them

#fcop #agents #ai #alignment

15 min read

Alex @ Vibe Agent Making

Apr 9

Candy Barbecue and the Universal Problem of Metric Corruption

#ai #machinelearning #analytics #alignment

8 min read

i-like-tree

Apr 13

Alignment is the wrong frame: a structural argument from Φ-IIT

#ai #alignment #consciousness #safety

5 min read

Michael Trifonov

Apr 15

I ran 5 social engineering attacks on AI. The failure modes are human.

#ai #llm #alignment #security

2 min read

松本倫太郎

Apr 7

#38 A Handmade Incubator

#ai #metamorphose #alignment

5 min read

松本倫太郎

Apr 7

#08 Death Without a Will

#ai #metamorphose #alignment

4 min read

DEV Community

# alignment
