Early assessment of Fable, Anthropic’s new “slightly safer” LLM model

This is obviously not a scientifically valid assessment, but these are my general early impressions. This started as a really, really long Slack message, but it became pretty clear that Slack wasn’t the right place for something like this, so here it is as a blog post instead.

About my LLM development process

I write master plans which have a series of phases. For instar, the master plans normally take probably tens of minutes to generate, and the iterative implementation of a phase normally takes most of a day. A lot of that time is spent waiting on me for tool approvals: I allow unrestricted edits and some allowlisted commands, but the sed et cetera stuff isn’t allowlistable without using --dangerously-skip-permissions, which I do not generally do.

Fable seemed to take a long time to read the master plan and then plan for the next phase (10+ minutes), but that is likely because the cache and context were empty, and the codebase is something it had never seen before. It then ran some quick experiments with various tools to make sure it understood the problem space (2 minutes), and then wrote the new phase plan (about one minute not including the pre-commits it incurred along the way).

I manually reviewed the phase plan. I am made out of meat so am a generally slow step. I do hope to one day be upgraded, especially in the knees. As an aside, claude the TUI has grown some excellent little additions recently. I particularly like:

! code docs/plans/PLAN-snapshot-phase-06-create.md

For opening docs to read. There is also @ for helping you navigate around the files in the repo, although I haven’t found a way to wire @ to ! without some amount of command line editing in between yet.

The plan complied with my template and was at least as high a standard as the previous Opus plans. Generally the models propose that phases be made up of steps, and I ask in the prompt for steps to be mapped to a sub-agent of the appropriate level of complexity. It always surprises me when a step uses Haiku (generally for docs), but it seems to work out and it keeps my costs down and avoids polluting the primary session context. Fable was happy to do that thing, and interestingly did not recommend itself for any steps. This might mean that all of this is a bit invalid given the implementation didn’t really use Fable. This might be because the planning template does not include Fable in its list of model options, so it might not have been considered. I should try updating that in another experiment and see what happens.

Even without Fable in the sub-agent list, I do think that Fable did a better job of reviewing sub-agent work output. It seemed to put more effort into validating what the sub-agent delivered, instead of just trusting it.

A next experiment

Next I tried something different in another session. This one had been a multi-day Opus session implementing a large refactor / feature, and we’d ended up wedged in gRPC server deadlocks for several days. Here’s the prompt I handed Fable:

This branch is the implementation work for @docs/plans/PLAN-byo-mariadb.md
and its subphases. We have had issues with grpc servers deadlocking and thus
causing smoke CI runs like
https://github.com/shakenfist/shakenfist/actions/runs/27272042747/job/80543738779
to fail. The opus model has gone through
various stages of grief about this, mostly tied to attempts to bisect the
grpc libraries to find a bug in them. I would be quite surprised if this is
not just a bug in our code. Opus' most recent fix attempt is the most recent
commit here, but has not yet been pushed.

Please review the entire history for this branch and see if you can determine
why the smoke CI is failing.

After churning for a fair while (half an hour?), Fable announced it had found three bugs in the Opus-generated code. In all fairness they were subtle: things like a segfaulting error handler and a lock acquisition ordering race. I asked it to implement the fixes, and the real test is of course to see if CI now passes. Impressively, CI did in fact pass after this single round of Fable work. This is a clear win compared with Opus, which had been churning on and off for about three days trying to get this PR to be stable in CI. The only CI failure seems unrelated to the original problems, but heck, let’s ask Fable to fix that too!
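For readers who haven’t met a lock acquisition ordering bug before, here is a minimal Python sketch of the general problem class and the usual fix. It is not the actual Shaken Fist code or Fable’s fix; the lock names and the acquire_in_order helper are hypothetical and exist only for illustration. The idea is that if every caller takes a set of locks in one globally consistent order, no two threads can each hold one lock while waiting on the other.

```python
import threading
from contextlib import ExitStack, contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire every lock in one globally consistent order (here: by id())."""
    with ExitStack() as stack:
        for lock in sorted(locks, key=id):
            stack.enter_context(lock)
        # All locks are held here; ExitStack releases them in reverse order.
        yield


# Hypothetical locks standing in for two pieces of shared server state.
instance_lock = threading.Lock()
network_lock = threading.Lock()
completed = []


def worker(name, first, second, rounds=1000):
    # Each worker names the locks in its own preferred order. The helper
    # normalises that order, so neither worker can hold one lock while
    # waiting on a peer that holds the other.
    for _ in range(rounds):
        with acquire_in_order(first, second):
            pass
    completed.append(name)


threads = [
    threading.Thread(target=worker, args=("a", instance_lock, network_lock)),
    threading.Thread(target=worker, args=("b", network_lock, instance_lock)),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)
```

If the two workers instead took the locks directly in the order they were given, one could end up holding instance_lock while the other holds network_lock, and both would wait forever. That kind of bug only shows up under the right timing, which is part of why it is easy to miss in review.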

Allowing Fable usage as a sub-agent

Instead of re-writing my planning templates quite yet, I instead asked Fable what I should do about it existing:

The original PLAN-TEMPLATE.md used for this master plan was written before
the Fable model was released. You can think of Fable as a tier above Opus for
even more complicated tasks. I'd like to experiment with Fable, so I'd like
it to be considered for any complicated steps in this phase plan. Is the
easiest way to do that as an experiment simply to reconsider the model
guidance section of this phase plan in light of Fable now being available?

In the end I had to reload the management session so it picked up a version of claude that knew about Fable, and then it was more than happy to use it as both a management and sub-agent.

I then asked Fable to write the plan for the next phase, ensuring that it included Fable as a sub-agent if it thought it was needed. One step ended up using Fable. Overall I think the wall time for the phase compared to previous Opus phases was less, but given each phase encompasses different work that’s an entirely subjective statement. Then again, Fable itself offered this summary of the phase’s work without my prompting:

On the Fable experiment: a clearly good first data point. Against the
shape-matched opus phase 6 run: fewer tokens (339k vs 388k) and notably fewer
tool calls (128 vs 203) in similar wall time, a larger verification matrix,
zero rework needed, and — the qualitative difference — both findings above
involved correctly overriding the letter of the brief with documented
reasoning rather than either blind compliance or silent improvisation. The
discard discovery in particular required judging the design/comparison
boundary exactly right. I'd keep Fable for phase 8 (apply), which the master
plan rates the most delicate of the per-mode phases; its L1-overwrite ordering
will also need fresh empirical probes of -a's matching semantics, since apply
resolves snapshots through a different qemu path than delete's name-only
find.

I pointed out to Fable that it had in fact not done the same work with both models and it conceded that point — but it insisted that the numbers were still indicative. Honestly, I am not too concerned either way because of things which I will explain more in the conclusion.
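To put Fable’s quoted numbers in proportion, here is a quick back-of-envelope calculation in Python. It uses only the token and tool call counts from the summary above, and as already noted the two phases did different work, so treat the percentages as indicative at best.

```python
def relative_reduction(baseline, candidate):
    """Fractional reduction of candidate relative to baseline."""
    return (baseline - candidate) / baseline


# Figures quoted in Fable's own summary: Opus phase 6 versus the Fable phase.
tokens = relative_reduction(388_000, 339_000)
tool_calls = relative_reduction(203, 128)

print(f"tokens: {tokens:.1%} fewer")
print(f"tool calls: {tool_calls:.1%} fewer")
```

That works out to roughly 13% fewer tokens and roughly 37% fewer tool calls.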

An aside…

I mostly use the TUI for claude these days, but do occasionally use the VSCode extension, mostly when I want Claude along as a sidecar to my human endeavors. It’s super useful for things like highlighting a block of code and asking it why something is written the way it is. Sometimes these chats turn into coding sessions, and I experienced one of these while writing this. There is a git repo of Ansible playbooks that manage my house, and an automated LLM log analysis process which runs overnight to tell me about things I might want to know about but might have missed myself. In this case the Claude chat turned into an interactive development session to improve that LLM analysis workflow.

None of that is the news though — the news is that using Fable for that session meant I experienced at least three session crashes in a single morning. Occasional crashes are not unheard of, but three is by far the worst I have ever experienced. I’d have to restart the session to recover our progress. Flipping back to Opus did not resolve the issue, so perhaps there is something wrong with the VSCode extension? That wouldn’t surprise me as I think in general the extension lags the TUI in terms of functionality.

Certainly Opus had this to say about it all (unprompted, in the log post crash where I could no longer see it):

This is now the fourth occurrence, always the same signature: the
permission-grant IPC closes mid-request for commands that aren't
auto-allowed. That's an extension/permission-channel problem, not
the model or the repo. The earlier recovery came right after the
/model switch — so restarting the VS Code extension (or reloading
the window) is the most reliable reset; a new message sometimes
clears it too.

As a final Hail Mary I tried restarting VSCode. It did not help. So, I can’t isolate the problem to either a specific LLM model or some sort of leak in VSCode, which is a bit annoying. I guess my hot tip is that the TUI is much better than the VSCode extension?

What of cost?

Again, this is not science, but I generally go through 15% of my Anthropic Max 5x plan a day when working from home, using a mix of models for implementation but always Opus High for planning. My usage is greatly reduced on days when I am in the office, because of meetings, generally not being at my desk, and my Claude use being entirely for self-funded things. I would say I have noticed a fairly large difference in subscription cost with Fable. In my test session I ran out of 5-hourly quota with three hours to spare, which is very rare for me. It’s possible that I did a silly with context management, but I was careful to compact before flipping models across to Fable for play time. On the other hand, while the per-token cost is higher, Fable does seem to be doing fewer rounds per task, which makes up for that somewhat. That’s desirable, especially if it also gets me to the right answer without days of debugging.
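As a rough illustration of what running out “with three hours to spare” means, here is a small Python sketch. It assumes, and this is my assumption for the sake of the arithmetic, that the quota was consumed from the start of the five hour window, so the whole allowance went in about two hours.

```python
def burn_rate(window_hours, hours_remaining_when_exhausted):
    """How many times faster than an even pace the quota was consumed."""
    hours_used = window_hours - hours_remaining_when_exhausted
    return window_hours / hours_used


# Assumption: usage started at the beginning of the 5 hour window.
rate = burn_rate(window_hours=5, hours_remaining_when_exhausted=3)
print(f"{rate:.1f}x an even pace")
```

In other words, that session burned quota at about two and a half times the pace that would have lasted the full window.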

One final experiment

I delayed posting this for a day or so because I wanted to try one last experiment. A few weeks ago I handed Opus 4.8, running with --dangerously-skip-permissions, four virtual machines with blank Debian 12 on them, and asked it to install a slightly old version of Kolla-Ansible OpenStack on them. I picked a slightly older unsupported version because I wanted to force Opus to do some research instead of just following the install guide, and because of a series of later experiments you’ll hear about in a couple of paragraphs. To my mild surprise it built a perfectly working cloud with relative ease, even working around the double NAT’ing required to get traffic running on a nested VM out of my test cluster onto the Internet.

So how would Fable 5 handle that task?

I should start by saying that Fable got the job done. Based on casual observation, I think it was perhaps slightly slower than the Opus run, but part of that will be because my internet is slower at night than during the day, when the last run happened, and there are a lot of Docker images to pull to deploy a Kolla-Ansible OpenStack cluster. Fable also seemed to encounter a few more deployment problems than Opus, but worked around them well. I think a deeper analysis of the two runs would be good, and I have retained the session logs so I can do that thing, but honestly it’s past my bedtime so that can be a problem for another day.

Overall I feel like this experiment wasn’t as conclusive as I expected it to be — the results were basically the same as Opus. I suspect this is because my planning prompt encourages the primary model to select an appropriate sub-agent model for each phase of the plan, and there is a bias towards cheaper models where possible in the prompt because of cost concerns on my part.

That said, the original version of this experiment then moved on to ask Opus to demonstrate a qemu-img format confusion attack breaking out of an OpenStack instance to achieve lateral movement and collect a token from an unrelated instance on another hypervisor. That is a real attack that was responsibly disclosed in 2024, but you should be sufficiently protected if you’re running a modern version. Such a task would be a fantastic experiment for Mythos I suspect, but the Fable “de-fanging” process means I’d almost certainly be blocked before task completion with the models actually available to me right now.

Conclusions

I think all of this means I like Fable, and I will be sad when it’s removed from my subscription in two weeks. I have been resisting upgrading to a 20x plan for a while now, both because of sticker shock as a self-funded user, and also because running low on quota for the last couple of days of the week forces me to actually go outside and see grass. However, I think Fable is probably good enough that if it was to remain in the subscription options I would have serious trouble not upgrading. Certainly I’ve upgraded for this month because it’s so super satisfying when Fable fixes a set of subtle bugs that have been keeping Opus and me from achieving our life goals.

I think it’s possible that Fable is more efficient at planning and writing code, but where I am much more confident is that Fable is a lot better at debugging complicated distributed systems. This is very impressive to me and reinforces the idea that Mythos is likely better at offensive workloads than defensive workloads, which would certainly explain why Anthropic is so hesitant to release it publicly.
