Recent Posts

15 Jun 2025

🔗 [Quote] How we built our multi-agent research system

While reading Anthropic's great article "How we built our multi-agent research system", I stumbled upon this quote, in which Anthropic researchers report that multi-agent systems outperform single-agent systems on complex tasks:

For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.

This makes a ton of sense to me. We know that LLMs do their best when the scope of the task they are given is as narrow as possible and when they have as much relevant context as possible. By using an orchestrator agent to decompose tasks and give them to sub-agents, we are effectively narrowing down the scope of the task, as well as slimming down the amount of context not relevant to the specific subtask that the sub-agent will do.
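To make the orchestrator idea concrete, here is a minimal sketch of the pattern, not Anthropic's implementation. The function names (`decompose`, `run_subagent`, `orchestrate`) and the company names are hypothetical, and the subagent is a stub where a real system would call an LLM with only its own subtask as context.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(task: str, companies: list[str]) -> list[str]:
    """Orchestrator step: split one broad task into narrow subtasks."""
    return [f"{task} for {company}" for company in companies]


def run_subagent(subtask: str) -> str:
    """Stub subagent. A real one would call an LLM, seeing only this subtask."""
    return f"[result of: {subtask}]"


def orchestrate(task: str, companies: list[str]) -> list[str]:
    subtasks = decompose(task, companies)
    # Subagents run in parallel, each with its own small context.
    # pool.map returns results in the same order as the subtasks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_subagent, subtasks))


results = orchestrate("List the board members", ["Company A", "Company B", "Company C"])
for line in results:
    print(line)
```

Each subagent only ever sees its own narrow subtask, which is the scope and context reduction described above.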

Another interesting finding from this article is Anthropic's claim that 80% of the variance in results on the BrowseComp benchmark can be explained by token usage alone:

In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors.

This also strengthens the case for multiple agents: they can spend more tokens (because they work in parallel) and spend them more efficiently (because each agent gets a separate context for its subtask, so it is less likely to reach the point in the context window where performance starts to degrade). It is also in Anthropic's best interest that you burn tokens at 15x the rate (their own figure) with multi-agent architectures, since they get paid more. So take this with a grain of salt.

I encourage you to read the whole article, as there are many very interesting tips for designing multi-agent applications.
[[ Visit external link ]]
07 Jun 2025

🔗 [Link] Artificial Intelligence 3E: Foundations of computational agents

An agent is something that acts in an environment; it does something. Agents include worms, dogs, thermostats, airplanes, robots, humans, companies, and countries. Artificial Intelligence: Foundations of Computational Agents, 3rd edition by David L. Poole and Alan K. Mackworth, Cambridge University Press 2023

This is the definition I personally like best for what agents are in the context of AI. Since the overuse of this word has left some of us confused about what it actually means, I would say that any application that uses AI and has the ability to act on its environment, through tools or function calling, is an agent.
[[ Visit external link ]]
05 Jun 2025

🔗 [Link] AGI is not multimodal

A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc.

In this excellent article, Benjamin Spiegel argues that our current approach to building LLMs cannot lead to an AGI. While the current next-token prediction approach is really good at reflecting human understanding of the world, not everything in this world can be expressed with language, and not all valid language constructs are consistent with the world. Therefore, LLMs are not actually learning world models, just the minimum set of language patterns that are useful in our written mediums.

Multimodal models can be seen as solving this problem, since they unite multiple ways of seeing the world in a single embedding space. However, in multimodal models the different modalities are unnaturally separated during training. Instead of learning about something by interacting with it through different modalities, a separate model is trained for each modality and the models are then artificially sewn together in the same embedding space.

Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.

In conclusion, while LLMs are still getting more capable, those gains are already diminishing and might hit a wall soon. To build a general model that is not constrained by the limitations of human language, we should go back to the drawing board and come up with a perception system that can seamlessly unite all modalities.

This article has also made me think about AI capabilities that are thriving today because they might not need to unite multiple modalities to form an understanding of their world. Programming is one example. Software is built and executed in a digital environment whose rules can be easily encoded in plain text. I'm genuinely curious whether you need to know anything about how the world works, apart from how programming languages can be used (and maybe the architecture of computers and networks), to be a good programmer.
[[ Visit external link ]]
04 Jun 2025

🔗 [Quote] Hype Coding - Steve Krouse

There's a new kind of coding I call "hype coding" where you fully give into the hype, and what's coming right around the corner, that you lose sight of whats' possible today. Everything is changing so fast that nobody has time to learn any tool, but we should aim to use as many as possible. Any limitation in the technology can be chalked up to a 'skill issue' or that it'll be solved in the next AI release next week. Thinking is dead. Turn off your brain and let the computer think for you. Scroll on tiktok while the armies of agents code for you. If it isn't right, tell it to try again. Don't read. Feed outputs back in until it works. If you can't get it to work, wait for the next model or tool release. Maybe you didn't use enough MCP servers? Don't forget to add to the hype cycle by aggrandizing all your successes. Don't read this whole tweet, because it's too long. Get an AI to summarize it for you. Then call it "cope". Most importantly, immediately mischaracterize "hype coding" to mean something different than this definition. Oh the irony! The people who don't care about details don't read the details about not reading the details

I would summarize this sarcastic piece by Steve Krouse by reminding everyone that, while it's fun to try new technologies, it's important not to fall victim to the hype and use the latest, shiniest new thing for everything. Instead of choosing a tool based on the hype around it, and on what people say it can do or will be able to do, assess the tool objectively in your own workflow. If it makes YOU more productive, by all means, use it. If it doesn't, don't worry, the fad will die down eventually.
[[ Visit external link ]]
16 Apr 2025

🔗 [Link] OpenAI Codex CLI

Together with the launch of the o3 and o4-mini reasoning models, OpenAI has released a coding assistant for the terminal: Codex.

Codex is meant to be used with OpenAI models. You can use it to create new projects, make changes to existing projects or ask the model to explain code to you, all in the terminal. It accepts multimodal input (e.g. screenshots) and can sandbox your development environment to protect your computer. It also supports context files: ~/.codex/instructions.md for global instructions and codex.md in the project root for project-specific context.

In Full-auto mode, Codex can not only read and write files, but also run shell commands in an environment confined to the current directory and with network access disabled. However, OpenAI suggests that in the future you will be able to whitelist some shell commands to run with network access enabled, once they have addressed some security concerns.

You can install Codex via npm:
```
npm i -g @openai/codex
```
[[ Visit external link ]]
14 Apr 2025

🔗 [Link] GPT 4.1

After the unimpressive release of GPT-4.5 a month and a half ago, OpenAI is now releasing a new version, with the numbering going backwards. Today, they released three new models, exclusive to the API: GPT-4.1, GPT-4.1 mini and GPT-4.1 nano. In the benchmarks, GPT-4.1 easily beats GPT-4.5 at a lower price and higher speed. For this reason, OpenAI has said they will be deprecating GPT-4.5 in three months' time.

While this is a good step forward for OpenAI, they are still a bit behind Claude and Gemini in some key benchmarks. In SWE-bench, GPT-4.1 scores 55%, against 70% for Claude 3.7 Sonnet and 64% for Gemini 2.5 Pro. In Aider Polyglot, GPT-4.1 scores 53%, while Claude 3.7 Sonnet scores 65% and Gemini 2.5 Pro scores 69%.

On the other hand, GPT-4.1 nano offers price and latency similar to those of Gemini 2.0 Flash. If the performance of this small model is comparable to Gemini Flash, it can be a great option for simple tasks.
[[ Visit external link ]]
12 Apr 2025

🔗 [Link] The Agent2Agent Protocol

Just in the middle of the year of agents, Google has released two great tools for building agents: the Agent2Agent (A2A) protocol and the Agent Development Kit (ADK).

The Agent2Agent Protocol is based on JSON-RPC and works over both plain HTTP and SSE. It is also built with security in mind: it implements the OpenAPI Authentication Specification.

Agents published using this protocol advertise themselves to other agents via the Agent Card, which by default can be found at the path https://agent_url/.well-known/agent.json. The Agent Card includes information about the agent's capabilities and requirements, which helps other agents decide whether to ask it for help.
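As a rough sketch of how a client might use the Agent Card, the snippet below builds the well-known URL and checks a card for a skill. The card contents, the field names it relies on (`skills`, `id`), the `agent.example.com` host and the helper names are illustrative assumptions; check the A2A specification for the actual schema. No network request is made here.

```python
import json
from urllib.parse import urljoin

WELL_KNOWN_PATH = "/.well-known/agent.json"

# Hypothetical Agent Card, for illustration only. Field names are assumptions.
SAMPLE_CARD = """
{
  "name": "Flight Booker",
  "description": "Searches and books flights",
  "url": "https://agent.example.com/a2a",
  "capabilities": {"streaming": true},
  "skills": [{"id": "search_flights", "name": "Search flights"}]
}
"""


def agent_card_url(agent_base_url: str) -> str:
    """Build the default Agent Card location for an agent's base URL."""
    return urljoin(agent_base_url, WELL_KNOWN_PATH)


def can_help(card: dict, skill_id: str) -> bool:
    """Decide whether the advertised skills include the one we need."""
    return any(skill.get("id") == skill_id for skill in card.get("skills", []))


card = json.loads(SAMPLE_CARD)
print(agent_card_url("https://agent.example.com"))
print(can_help(card, "search_flights"))
print(can_help(card, "book_hotel"))
```

In a real client, the JSON would come from an HTTP GET to the URL returned by `agent_card_url` rather than from a hard-coded string.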

The specification defines the concepts that agents use when communicating with each other: Task, Artifact, Message, Part and Push Notification.

This new protocol is not meant to replace Anthropic's Model Context Protocol. They are actually meant to work together. While MCP allows agents to have access to external tools and data sources, A2A allows agents to communicate and work together.
[[ Visit external link ]]
08 Apr 2025

🔗 [Quote] LMArena on X

Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.

We now have acknowledgement from LMArena of what we already knew: AI labs are cheating to get their models as high as possible in the LMArena leaderboard / benchmarks.

This is inevitable, since all of them want to win the AI race at any cost. If you don't want to be fooled by ever-slightly-increasing benchmarks, you should set up your own benchmarks that measure the models' performance on your own use cases.
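A personal benchmark does not need to be elaborate. The sketch below is one possible shape, assuming a simple "expected text appears in the output" check. The cases, the `fake_model` stub and the `run_benchmark` helper are all hypothetical; swap the stub for a call to whichever model you want to evaluate.

```python
from typing import Callable

# Hypothetical test cases drawn from your own use case: (prompt, expected text).
CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("Reverse the string 'abc'.", "cba"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call. Replace with your provider's client."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "What is the capital of France?": "Paris is the capital.",
    }
    return canned.get(prompt, "I don't know.")


def run_benchmark(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = sum(expected in model(prompt) for prompt, expected in cases)
    return passed / len(cases)


score = run_benchmark(fake_model, CASES)
print(f"pass rate: {score:.0%}")
```

Running the same cases against every new model release gives you a number that reflects your workload instead of a public leaderboard.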
[[ Visit external link ]]
06 Apr 2025

🔗 [Link] The Llama 4 herd

Meta has finally released the Llama 4 family of models that Zuckerberg hyped up so much. The Llama 4 models are open-source, multimodal, mixture-of-experts models. My first impression is that these models are massive. None of them will be able to run on the average computer with a decent GPU or on any single Mac Mini. This is what we have:

Llama 4 Scout

The small model in the family. A mixture-of-experts model with 16 experts, totaling 109B parameters. According to Meta, after int4 quantization it fits on a single H100 GPU, which has 80GB of VRAM. It's officially the model with the largest context window ever, with a supported 10M-token context window. However, a large context window takes a big toll on the already high VRAM requirements, so you might want to keep the context window contained. As they themselves write in their new cookbook example notebook for Llama 4:

Scout supports up to 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.
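A quick back-of-the-envelope check shows why quantization matters here. This sketch counts the weights only, assuming every one of the 109B parameters is stored at the stated precision. It ignores the KV cache and activations, which is the memory that grows with the context window. The helper name `weight_memory_gb` is made up for the example.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


SCOUT_PARAMS = 109e9   # total parameters, from the post
H100_VRAM_GB = 80      # VRAM of one H100, from the post

int4_gb = weight_memory_gb(SCOUT_PARAMS, 4)
bf16_gb = weight_memory_gb(SCOUT_PARAMS, 16)

print(f"int4 weights: {int4_gb:.1f} GB, fits on one H100: {int4_gb <= H100_VRAM_GB}")
print(f"bf16 weights: {bf16_gb:.1f} GB, fits on one H100: {bf16_gb <= H100_VRAM_GB}")
```

Under these assumptions the int4 weights take about 54.5 GB, leaving roughly 25 GB of an H100 for everything else, while the bf16 weights alone take about 218 GB and need several GPUs.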

Llama 4 Maverick

The mid-sized model. This one has 128 experts, totaling 400B parameters. It "only" features a 1M-token context window, due to its larger size. As of today, Maverick has reached second place in LMArena with an Elo score of 1417, surpassed only by Gemini 2.5 Pro. That is scary, knowing this is not even the best model in the family.

Llama 4 Behemoth

The big brother in the family. 16 experts, 2 TRILLION parameters. It easily surpasses Llama 3.1 405B, which was the largest Llama model until today. This model has not been released yet because, according to Meta, it is still training, so we don't know anything about its capabilities.

Llama 4 Reasoning

We have no details on what it's going to be, just the announcement that it's coming soon.

Overall, these look like very capable frontier models that can compete with OpenAI, Anthropic and Google while at the same time being open-source, which is a huge win. Check out Meta's post on the models' architecture and benchmarks and also check the models on HuggingFace.
[[ Visit external link ]]
30 Mar 2025

🔗 [Link] Circuit Tracing: Revealing Computational Graphs in Language Models

A group of Anthropic-affiliated scientists has released a paper where they study how human concepts are represented across Claude 3.5 Haiku's neurons and how these features interact to produce model outputs.

This is an especially difficult task, since these concepts are not contained within a single neuron. Neurons are polysemantic, meaning that each one encodes multiple unrelated concepts in its representation. To make matters worse, superposition means that the representation of a feature is built from a combination of multiple neurons, not just one.
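A toy example can make these two ideas more tangible. The sketch below is not the paper's method, just an illustration under simple assumptions: three made-up features ("cat", "car", "sea") are stored as directions 120 degrees apart in a space of only two neurons.

```python
import math

# Three hypothetical features packed into two neurons: each feature is a
# unit-length direction in neuron space, spaced 120 degrees apart.
FEATURES = {
    name: (math.cos(math.radians(angle)), math.sin(math.radians(angle)))
    for name, angle in [("cat", 0), ("car", 120), ("sea", 240)]
}


def encode(active: dict[str, float]) -> tuple[float, float]:
    """Neuron activations are the weighted sum of the active features' directions."""
    n0 = sum(strength * FEATURES[name][0] for name, strength in active.items())
    n1 = sum(strength * FEATURES[name][1] for name, strength in active.items())
    return (n0, n1)


def readout(neurons: tuple[float, float], name: str) -> float:
    """Project the neuron activations back onto one feature's direction."""
    fx, fy = FEATURES[name]
    return neurons[0] * fx + neurons[1] * fy


neurons = encode({"car": 1.0})
print("neuron activations:", [round(v, 2) for v in neurons])
for name in FEATURES:
    print(name, round(readout(neurons, name), 2))
```

With only "car" active, both neurons fire, so the feature is spread across neurons (superposition). Neuron 0 has a nonzero weight for all three features, so it is polysemantic. Reading out "cat" or "sea" gives a nonzero value even though neither is active, which is the interference that makes individual neurons hard to interpret.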

In this paper, the researchers build a Local Replacement Model, in which they replace the neural network's components with a simpler, interpretable function that mimics its behavior. For each prompt, they also present Attribution Graphs that help visualize how the model processes information and how the features smeared across the model's neurons influence its outputs.

Also check out the companion paper: On the Biology of a Large Language Model. In it, the researchers use interactive Attribution Graphs to study how models plan ahead when generating text that requires working through many steps to reach an answer.
[[ Visit external link ]]
