My name is Roger Oriol, I am a Software Architect based in Barcelona, Spain. I am a MSc graduate in Big Data Management, Technologies and Analytics. This blog will be the vehicle to divulgate and discuss topics on web development, data architecture, software architecture and much more.
Recent posts
-
๐ [Link] Claude Fable 5 and Mythos 5
Anthropic has announced it's most capable model with the name Fable 5. This model was previously hidden from the public and only made available only to a select number of companies with the name Claude Mythos Preview. The reported reason for hiding it was that it was "too powerful" to be made available to the broad public and therefore bad actors out there.
Apparently, now Anthropic is sufficiently confident in Fable's safeguards to be released to the broad public.
Two models have been released, Mythos 5, which is the same as the previous model only been released to some select people, now with a bit better benchmark results but still not publicly available. Then also Fable 5, which is Mythos 5 (they share the exact same benchmark results so it doesn't look like they are different models or finetuned) with a safeguard that appears to be a classifier that if it detects a query on cybersecurity, biology, chemistry or attempts to distill, it automatically degrades to Opus 4.8.
Still, unless you have a big budget you will probably not be playing around with this model a lot. Opus was already the big, expensive model from Anthropic and this is one is even bigger and more expensive. The price is $10 per million input tokens and $50 per million of output tokens. Opus was $5/$25, so double the price. When Mythos Preview was first announced, the price was even steeper, at $25/$125 per million tokens, so it looks like for now Anthropic has found a way to serve this model for cheaper. If you have a Pro or Max subscription, you will be able to use those models at no cost until June 22, from then those models will cost usage credits.
Another interesting point of the presentation is that Anthropic will require a 30-day retention for all traffic to Fable, Mythos and future models, for all platforms where those models are deployed. According to Anthropic, this is to help them defend against attacks and won't be used to finetune models.
[[ Visit external link ]] -
[[ Read more ]] ยท 52 minute read
Build A Basic AI Agent From Scratch: Long Task Planning
In the previous part of the Build A Basic AI Agent From Scratch series, we added the essential tools to our agent to allow it to work autonomously for us. We gave it the ability to find files, read and write files, run bash commands and get content from the web. We got a very cap...
-
[[ Read more ]] ยท 58 minute read
Build A Basic AI Agent From Scratch: Tools
In the previous part of the Build A Basic AI Agent From Scratch series, we built the most basic AI agent harness possible. It was just a connection to a model, a way to take user input, a store of context of the conversation and a loop that kept the agent running. Of course, this...
-
[[ Read more ]] ยท 8 minute read
Build a Basic AI Agent From Scratch
2026 is without a doubt the year of AI agents. Since the release of Claude Code, the power of these AI agents has become undeniable. Claude Code, Codex, OpenCode are a must for many developers nowadays. OpenClaw and Hermes are becoming many people's AI assistants. Agents are also...
-
๐ [Link] GPT-5
OpenAI has finally released it's GPT-5 model, and as we were already expecting, it's a hybrid reasoning model. Now the model itself chooses how much to think about each task, and you can force the reasoning effort as well. This probably means the end of the o series of reasoning models from OpenAI, as the regular language models and the reasoning models will now be unified.
Of course, the benchmarks look good but saturated. What stands out to me is that they announced a 74.9 score on SWE-bench (with high reasoning effort), which is just a tad over the score from Claude Opus 4.1 just announced this very same week (74.5).
With the GPT-5 iteration, come 4 new models: GPT-5, GPT-5-mini, GPT-5-nano and GPT-5 Chat. Free users will be allowed to use GPT-5, although when they hit the maximum quota, they will fallback to GPT-5-mini.
GPT-5 allows to set the reasoning effort using the "reasoning.effort" parameter, although you can also force it telling the model to "Think hard about this". These new models introduce a new reasoning tier called "minimal" which produces a few as possible reasoning tokens before answering. The output tokens can also be customized by setting the "verbosity" parameter, which didn't exist for past models. This parameter can be set to "high", "medium" or "low".
The new models also bring some new quality of life improvements for tool calling:
- Tool choice: While the models can choose to call zero, one or multiple tools, you can now set "tool_choice" to "forced" to force the invocation of at least one tool. You can also set a specific function that must be called by passing {"type": "function", "name": "function name"} to the "tool_choice" parameter. Finally, in "tool_choice" you can also specify a list of allowed tools from the list of tools provided to the model: {"type": "allowed_tools", "mode": "auto", "tools": []}.
- Tool preambles: New feature that makes the models explain the rationale behind why they are invoking a function. This provides transparency and better understanding on the model's process. By default, this feature is not enabled. To enable it, you have to include a system message like "Before you call a tool, explain why you are calling it.".
- Custom tools: This feature allows to define functions that allow unstructured, free-form text as input, which frees the model from using a structured JSON object to call the tool. This might improve the ability of the model to call these tools. This can be even more powerful when paired with Context-Free Grammar.
- Context-Free Grammar: This feature allows to set grammar rules for the free-form text, to make them follow a set of rules. You can define this rules using Lark or a regular expression.
The GPT-5 models are now available both in ChatGPT and in the OpenAI API, give them a try!
[[ Visit external link ]] -
๐ [Quote] GPT-5 variants
It's not at all straightforward to understand the variants of the GPT-5 model released today. The API docs describe four models: gpt-5, gpt-5-mini, gpt-5-nano and gpt-5-chat. However, the system card describes 6 models to replace older models, and none of the names match with the API:
It can be helpful to think of the GPT-5 models as successors to previous models:
Table 1: Model progressions
Previous model GPT-5 model GPT-4o gpt-5-main GPT-4o-mini gpt-5-main-mini OpenAI o3 gpt-5-thinking OpenAI o4-mini gpt-5-thinking-mini GPT-4.1-nano gpt-5-thinking-nano OpenAI o3 Pro gpt-5-thinking-pro
The answer is that the gpt-5 model is composed of the gpt-5-main model, the gpt-5-thinking model and a router that selects the model to send the prompt to:GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use...
The same applies to the mini model. Gpt-5-mini is made of a gpt-5-main-mini model, a gpt-5-thinking-mini model and a router. The nano model only seems to have a thinking variant, not a main, but this makes sense as a single model without a router will allow the model to be faster. This leaves only the gpt-5-thinking-pro model, which cannot be used via API, only via ChatGPT, with a Pro subscription:
[[ Visit external link ]]In the API, we provide direct access to the thinking model, its mini version, and an even smaller and faster nano version of the thinking model, made for developers (gpt-5-thinking-nano). In ChatGPT, we also provide access to gpt-5-thinking using a setting that makes use of parallel test time compute; we refer to this as gpt-5-thinking-pro.
-
๐ [Link] GPT-OSS
Just like Sam Altman hinted at a while ago, OpenAI just released two open-weight models trying to appease the common criticism of being a company with "Open" in the name that hasn't released any open language models in a long while (since GPT-2!).
The new open-weights models (not open-source like the name seems to imply) are Mixture-of-experts models with:
- 116.83 billion parameters with 5.13 billion active parameters. It has 128 experts and activates 4 experts for each token.
- 20.91 billion parameters with 3.61 billion active parameters. It has 32 experts and activates 4 experts for each token.
Both models are reasoning models and therefore OpenAI compares them to their own o3 and o4 models. It seems like the 120b version is comparable to o4-mini and the 20b version is comparable to o3-mini. The new models have been throughoutly trained for agentic tasks as in the post-training stage, they were trained specifically to use a browser tool and a python code execution tool, as well as other generic tools.
OpenAI has also introduced a new tokenized specially for these new models called harmony. What stands out about this tokenizer from others is that it introduces a "channels" concept that allows the model to separate the output between user-facing text and internal-facing outputs. Another interesting concept that it introduces is the "system message", which differs from the already known "system prompt". The system message allows for configuration of dates like: "Knowledge cutoff: 2024-06", "Current date: 2025-06-28". It also allows to set the reasoning effort with "Reasoning: high". Finally, it also allows the configuration of channels and what are they used for and tools that the model can use.
A great feature of these models is that it seems that OpenAI has optimized them to be able to easily fit in a single H100 80GB GPU for the largest model and in a 16GB consumer GPU for the small one. This was achieved using MXFP4 quantization after training to 4.25 bits per parameter, which very significantly reduces the model size. While it is possible to natively train models in this quantization to reduce model quality degradation, it looks that in this case the quantization was applied after training.
You can easily start using these models locally using Ollama. I recommend downloading the 20b model that fits in a consumer GPU. It runs really fast in my Macbook!
[[ Visit external link ]] -
๐ [Quote] How we built our multi-agent research system
While reading Anthropic's great article "How we built our multi-agent research system", I stumbled upon this quote where Anthropic researchers present the results where they found that multi-agent systems outperform single-agents for complex tasks:
For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.
This makes a ton of sense to me. We know that LLMs do their best when the scope of the task they are given is as narrow as possible and when they have as much relevant context as possible. By using an orchestrator agent to decompose tasks and give them to sub-agents, we are effectively narrowing down the scope of the task, as well as slimming down the amount of context not relevant to the specific subtask that the sub-agent will do.
Another interesting finding from this article is that Anthropic claims that 80% of the variance of results in the BrowseComp benchmark can be explained by more token usage:
In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors.
This also makes using multiple agents more optimal, because they can use more tokens (because they do so in parallel) more efficiently (because agents are less likely to hit a context window limit where the performance starts to degrade if the context is separated for each subtask). It is also in the best interest of Anthropic that you burn tokens at 15x (according to them) the token rate with multi-agent architectures, so they get paid more. So take this with a grain of salt.
I encourage you to read the whole article, as there are many very interesting tips for designing multi-agent applications.
[[ Visit external link ]] -
๐ [Link] Artificial Intelligence 3E: Foundations of computational agents
An agent is something that acts in an environment; it does something. Agents include worms, dogs, thermostats, airplanes, robots, humans, companies, and countries. Artificial Intelligence: Foundations of Computational Agents, 3rd edition by David L. Poole and Alan K. Mackworth, Cambridge University Press 2023
This is the definition I personally like the best for what agents are in the context of AI. Since the overuse of this word has left some of us confused on what it actually means, I would say that any application that uses AI and has the ability to act on its environment, through tools or function calling, is an agent.
[[ Visit external link ]] -
๐ [Link] AGI is not multimodal
A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc.
In this excellent article, Benjamin Spiegel argues that our current approach to building LLMs cannot lead to an AGI. While the current next-token prediction approach is really good at reflecting human-understanding of the world, not everything in this world can be expressed with language and not all valid language constructs are consistent with the world. Therefore, they are not actually learning world models, but just the minimum language patterns that are useful in our written mediums.
Multimodal models can be seen as solving this problem, since they unite multiple ways to see the world in a single embedding space. However, in multimodal models different modalities are unnaturally separated in the training process. Instead of learning about something by interacting with it via different modalities, two models are separately trained for each modality and then artificially sewn together in the same embedding space.
Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.
In conclusion, while LLMs are still getting more capable, those gains are already diminishing and might hit a wall soon. To build a general model that is not constrained by the limitations of human language we should go back to the drawing board and come up with a perception system that can seamlessly unite all modalities.
This article has also made me think about AI capabilities that are thriving today because they might not need to unite multiple modalities to form an understanding of that world. For example, programming. Software is built and executed in a digital environment and ruleset that can be easily encoded into plain text. I'm genuinely curious if you need to know about anything about how the world works, apart from just how programming languages can be used (and maybe architecture of the computer and networks), to be a good programmer.
[[ Visit external link ]]