A study conducted by Tuhin Chakrabarty, assistant professor of computer science at Stony Brook University, and a team of researchers at Columbia University shows that the New York Times word game “Connections” can serve as a challenging benchmark for abstract reasoning in large language models (LLMs).
AI and machine learning systems regularly beat the world’s best chess players, but when it comes to Connections, even the best-performing LLM, Claude 3.5 Sonnet, fully solved the game only 18% of the time. The study examined AI responses to more than 400 Connections games and found that both novice and expert human players outperformed the AI at solving the puzzles.
In the game, players are presented with a 4×4 grid containing 16 words. The task is to group these words into four clusters of four words according to their common characteristics. For example, the words “believer,” “sheep,” “doll,” and “lemming” form a group because they can be classified as “conformists.”
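To make the format concrete, here is a minimal Python sketch of one way a puzzle and its “fully solved” check could be represented. Only the “conformists” group comes from the article; the other three groups and the checker function are hypothetical illustrations, not the study’s code.

```python
# Only the "conformists" group below comes from the article; the other three
# groups and the checker itself are invented placeholders for illustration.
SOLUTION = {
    "conformists": frozenset({"believer", "sheep", "doll", "lemming"}),
    "colors": frozenset({"red", "green", "blue", "yellow"}),       # hypothetical
    "planets": frozenset({"mars", "venus", "saturn", "mercury"}),  # hypothetical
    "metals": frozenset({"iron", "copper", "tin", "zinc"}),        # hypothetical
}

ALL_WORDS = frozenset().union(*SOLUTION.values())  # the 16-word grid


def is_fully_solved(proposed: list[set[str]]) -> bool:
    """True only if the four proposed groups of four match the solution exactly
    (the category labels themselves are not required)."""
    if len(proposed) != 4 or any(len(g) != 4 for g in proposed):
        return False
    if set().union(*proposed) != ALL_WORDS:
        return False
    return {frozenset(g) for g in proposed} == set(SOLUTION.values())


# "mercury" fits both the planets and metals groups, so swapping it with
# "zinc" produces a plausible-looking grouping that still fails the check.
guess = [
    {"believer", "sheep", "doll", "lemming"},
    {"red", "green", "blue", "yellow"},
    {"mars", "venus", "saturn", "zinc"},
    {"iron", "copper", "tin", "mercury"},
]
print(is_fully_solved(guess))  # False
```

Note how “mercury” plausibly belongs to two categories in this toy example; overlaps like that are exactly what make the game hard, as Chakrabarty explains below.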
To sort the words into the correct categories, players must reason over several kinds of knowledge, from semantic knowledge of word meanings to encyclopedic knowledge of the world.

“This may seem easy to some, but many of these words could just as easily be placed in several other categories,” Chakrabarty says. “For example, ‘likes’, ‘followers’, ‘shares’, ‘insults’, and so on may look like ‘social media interactions’ at first glance.” These plausible but incorrect groupings act as red herrings, and the game is deliberately designed around them, which makes it all the more interesting.
The study found that LLMs are relatively good at inferences involving semantic relations (“happy,” “joyful,” “enjoyable”) but struggle with multiword expressions (“kick the bucket,” meaning “to die”) and with reasoning that combines knowledge of word form and word meaning (adding the prefix “un-” to the verb “do” creates the word “undo,” with the opposite meaning).
In the study, the researchers tested five LLMs (Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT-4o, Meta’s Llama 3.1 405B, and Mistral Large 2) on 438 NYT Connections games and compared the results to human performance on a subset of those games. The results showed that while all the LLMs could partially solve some games, their “performance was far from ideal.”
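As a rough illustration of how a full-solve rate like the 18% figure could be computed over a set of games, here is a hedged sketch; `ask_model` is a stand-in for any LLM call, and nothing here is the study’s actual evaluation harness.

```python
from typing import Callable

# Hypothetical evaluation loop: a game counts as solved only if the model's
# four proposed groups exactly match the hidden solution, echoing the
# "fully solve" criterion described in the article.
def full_solve_rate(
    games: list[dict[str, frozenset[str]]],
    ask_model: Callable[[frozenset[str]], list[set[str]]],
) -> float:
    solved = 0
    for solution in games:
        words = frozenset().union(*solution.values())  # model sees only the 16 words
        proposal = ask_model(words)
        if {frozenset(g) for g in proposal} == set(solution.values()):
            solved += 1
    return solved / len(games)

# Usage would look like: rate = full_solve_rate(games, my_model_fn)
```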
Read the full article on the AI Innovation Institute website.