Which LLMs are greedy and which are generous? In the public goods game, players donate tokens to a shared fund that gets multiplied and split equally, but each can profit by free-riding on others.
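To make the free-riding incentive concrete, here is a minimal Python sketch of one round's payoffs. The endowment and multiplier values are illustrative assumptions, not the benchmark's actual settings: the pooled donations are multiplied and split equally, so a player who donates nothing still collects a full share on top of the tokens they kept.

```python
# Minimal public goods game payoff sketch (illustrative parameters, not the
# benchmark's actual configuration): each player keeps undonated tokens, the
# pooled donations are multiplied and split equally among all players.

def payoffs(donations, endowment=10, multiplier=1.6):
    pot = sum(donations) * multiplier
    share = pot / len(donations)
    return [endowment - d + share for d in donations]

if __name__ == "__main__":
    # Three full cooperators vs. one free-rider who donates nothing:
    # cooperators end with 12 tokens each, the free-rider with 22.
    print(payoffs([10, 10, 10, 0]))
```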
DeepSeek R1 outperforms o3-mini (medium) on the Confabulations (Hallucinations) Benchmark
LLM Confabulation (Hallucination) Benchmark: DeepSeek R1, o1, o3-mini (medium reasoning effort), DeepSeek-V3, Gemini 2.0 Flash Thinking Exp 01-21, Qwen 2.5 Max, Microsoft Phi-4, Amazon Nova Pro, Mistral Small 3, MiniMax-Text-01 added
DeepSeek R1 ties o1 for first place on the Generalization Benchmark
o3-mini ties DeepSeek R1 for second place (behind o1) on the Multi-Agent Step Game benchmark, which tests LLM strategic thinking, collaboration, and deception
DeepSeek R1 takes #1 overall on a Creative Short Story Writing Benchmark
DeepSeek R1 takes second place on the multi-player benchmark for cooperation, negotiation, and deception.
New LLM Creative Story-Writing Benchmark! Claude 3.5 Sonnet wins
o3-mini scores 72.4 on NYT Connections
o3-mini dropped!
DeepSeek R1 scores between o1 and o1-mini on NYT Connections
DeepSeek R1 takes second place, Gemini 2.0 Flash Thinking Exp 01-21 does well on the multi-player benchmark for cooperation, negotiation, and deception.
Qwen2.5-Max
"Sir, China just released another model"
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure
o1 is first, GPT-4o is last - Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure
o1 easily wins the Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure