Which LLMs are greedy and which are generous? In the public goods game, players donate tokens to a shared fund that gets multiplied and split equally, but each can profit by free-riding on others.
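To make the free-riding incentive concrete, here is a minimal Python sketch of one round's payoffs. The endowment and multiplier values are illustrative assumptions, not the benchmark's actual settings: the pooled donations are multiplied and split equally, so a player who donates nothing still collects a full share on top of the tokens they kept.

```python
# Minimal public goods game payoff sketch (illustrative parameters, not the
# benchmark's actual configuration): each player keeps undonated tokens, the
# pooled donations are multiplied and split equally among all players.

def payoffs(donations, endowment=10, multiplier=1.6):
    pot = sum(donations) * multiplier
    share = pot / len(donations)
    return [endowment - d + share for d in donations]

if __name__ == "__main__":
    # Three full cooperators vs. one free-rider who donates nothing:
    # cooperators end with 12 tokens each, the free-rider with 22.
    print(payoffs([10, 10, 10, 0]))
```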
DeepSeek R1 outperforms o3-mini (medium) on the Confabulations (Hallucinations) Benchmark
LLM Confabulation (Hallucination) Benchmark: DeepSeek R1, o1, o3-mini (medium reasoning effort), DeepSeek-V3, Gemini 2.0 Flash Thinking Exp 01-21, Qwen 2.5 Max, Microsoft Phi-4, Amazon Nova Pro, Mistral Small 3, MiniMax-Text-01 added
DeepSeek R1 ties o1 for first place on the Generalization Benchmark
o3-mini ties DeepSeek R1 for second place (behind o1) on the Multi-Agent Step Game benchmark, which tests LLM strategic thinking, collaboration, and deception
DeepSeek R1 takes #1 overall on a Creative Short Story Writing Benchmark
DeepSeek R1 takes second place on the multi-player benchmark for cooperation, negotiation, and deception.
New LLM Creative Story-Writing Benchmark! Claude 3.5 Sonnet wins
o3-mini scores 72.4 on NYT Connections
o3-mini dropped!
DeepSeek R1 scores between o1 and o1-mini on NYT Connections
DeepSeek R1 takes second place, Gemini 2.0 Flash Thinking Exp 01-21 does well on the multi-player benchmark for cooperation, negotiation, and deception.
Qwen2.5-Max
"Sir, China just released another model"
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure
o1 is first, GPT-4o is last - Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure
o1 easily wins the Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure