AI for Software Engineers

How to Know What to Learn in AI

Logan Thorneloe — Thu, 18 Jun 2026 14:02:37 GMT

I added a ‘Self and Career’ section to the newsletter. Those articles will be primarily non-technical and will go out two weeks early to paid subscribers. Thank you for your support!

Things are moving so quickly in AI it can be tough to know what topics to learn, what’s important, and what to focus on. Part of this newsletter is sharing that with you along …

Can We Run Data Centers in Space?

Logan Thorneloe — Wed, 10 Jun 2026 14:00:44 GMT

I’ve been thinking a lot recently about the efforts to get data centers into space. Specifically, I wanted to understand the motive of doing so, the practical blockers of making it happen, and, most importantly, whether or not it’s actually possible.

In this article, I cover this in three parts:

Understanding the current state of power consumption by data centers and the projections for future requirements. This will focus on understanding the power crisis at large and why power is the fundamental blocker to AGI.
Looking into the theory behind the moonshot of getting data centers in space and why it’s appealing.
Identifying the blockers for getting these data centers operational in orbit and the research and engineering advancements necessary to make it happen.

If you want a quick tl;dr, it’s very possible space-based data centers are the future of AI development for Earth, but this is a moonshot idea. This means there are underlying theories to show it’s theoretically possible, but there are many potential blockers that need to be removed. Moonshot ideas often take ten to fifteen years to materialize (if they materialize at all) and have significant payoff.

Energy consumption on Earth

There are many blockers to getting superintelligent AI into the hands of all people. The most important of them goes beyond research and infrastructure: power.

AI is fundamentally an energy problem. Training and serving models at scale necessitates incredible power consumption, which will grow (likely exponentially) as AI usage grows, the number of models trained grows, and the requirements to run those models increase.

Data centers are being built at an unprecedented rate and there’s concern we won’t be able to meet the energy requirements for running them. This concern is both in the near-term, where we potentially can’t build utilities fast enough, and in the long-term, where we can’t harness enough energy to power them in general.

To better understand current power consumption, we’ll review key metrics from the Powering Intelligence 2026 report. This report provides insight into current and projected energy needs for utility companies. For the purposes of this section, our metrics will focus on US data center power use as I find it not only indicative of data centers around the world, but also at the forefront of data center construction.

Sourced from the Powering Intelligence 2026 report executive summary. Inner circles show capacity in 2021 (gray) and 2024 (blue). Outer band shows scenario range of projected capacity in 2030 (orange).

I highly suggest reading the executive summary for the report to get a better picture. Here are metrics to keep in mind when considering the future of data center consumption:

Data centers currently consume 5% of total US electricity. By 2030, this is expected to be upward of 17%.
AI workloads account for approximately 25% of data center electricity use. This comes out to around 192 terawatt hours.
By 2030, US data center consumption is projected to be up to 790 terawatt hours, which marks a 4x increase in a few years.
This increase is driven by AI workloads which require significantly more power than traditional data center use cases (streaming, communications, etc.).
A typical new data center requires power equivalent to that of a new neighborhood housing at least 80,000 and up to 800,000 homes.
Some states see data centers consume nearly 20% of their power (Virginia, currently) and many are projected to reach that point by 2030 (Iowa, Oregon, Nebraska, and Arizona, for example).
Concentrated compute requires an enormous amount of cooling, causing cooling to increase a data center’s power requirements by up to 40%.
Forecasted energy demand will be met primarily via natural gas.

The most important takeaways are understanding the current power consumption of a data center and the rate at which this is expected to increase. By my estimates, consumption will likely increase faster than these projections. This report takes current construction plans into account, but doesn’t dig deeper into how advancements in AI will increase these numbers. We’re likely to see different power requirements due to:

An increase in usage
New model architectures
Training and serving more models
Scaling models up

In short, power will quickly be the bottleneck for general intelligence. Construction of data centers in orbit is one of the proposed methods to fix this bottleneck.

If you want to learn more about the current state of power consumption, read the Powering Intelligence 2026 report.

Why send the data centers to space

One of the moonshot ideas coming out of multiple companies to combat the power consumption issue is to put data centers in space. The primary motivating factor is harnessing more of the energy given off by the sun.

“The only place you can really scale is space. Once you start thinking in terms of what percentage of the Sun’s power you are harnessing, you realize you have to go to space.” - Elon Musk

This is explained well by Elon Musk in an interview with Dwarkesh Patel where he explains SpaceX’s motivation for getting data centers into orbit and the path forward for doing so. I highly recommend watching the entire thing, but the clip attached to this tweet (Substack doesn’t load even a media preview, for some reason) is particularly insightful:

Simplified, Elon lists the following benefits for putting data centers into space:

Constantly harnessing solar energy. On Earth, we deal with a day-and-night cycle and atmospheric factors that block solar arrays from constantly harnessing energy. In space, these don’t exist and energy can be harnessed constantly. This harnessed energy will be used to run the data centers.
Less infrastructure for the solar cells to provide energy. This includes batteries to deal with the lack of sun exposure at night, protective casing to protect from atmospheric events, and more. This makes the solar arrays significantly less expensive when utilized in space.
Regulatory slow-downs are removed. It’s incredibly difficult to make a deal with utility providers on Earth for many reasons. This significantly slows down the velocity of energy production.

Whether you love or hate Elon, I’ve found him to be good at understanding the problem space he works in—he just tends to be aggressive with his time estimates for technologies to be brought to production. While he puts a target timeline at 30-36 months before it becomes economically feasible, a Google study puts it much further out and closer to the mid-2030s (see the next section).

If you want to learn more about the motivation behind SpaceX sending data centers to space, I suggest watching the entire podcast:

Dwarkesh Podcast

Elon Musk: “In 36 months, the cheapest place to put AI will be space”


What makes space difficult

To understand the significant blockers, we’ll take a look at Google Research’s preliminary research into the key potential problems and what they found. Google Research ran many simulations to better understand the limitations of space-based data centers and plans to launch satellites at a small scale sometime in 2027 to start testing its theory in space.

Below are the primary blockers Google Research found and their potential solutions where applicable. I’ve ordered them from what I feel is the most significant blocker to the least significant.

1. Thermal Retention

The biggest difficulty lies in cooling. On Earth, we cool data centers via a combination of air and water cooling. In space, there isn’t an atmosphere to do so. The only way to dissipate heat is via radiative cooling, which is highly inefficient. Essentially, all data centers in space would need massive metal plates to cool themselves. Heat from the data center would be transferred to these plates and radiated away into space.
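To get a feel for how large those plates would need to be, here is a rough sketch using the Stefan-Boltzmann law for radiated power. The 1 MW heat load, the radiator temperatures, the emissivity of 0.9, and the single radiating side are all my own illustrative assumptions, not figures from Google's research, and the sketch ignores sunlight and Earth's infrared heating the panel, so real radiators would need more area.

```python
# Stefan-Boltzmann constant in W / (m^2 K^4).
SIGMA = 5.670374419e-8

def radiator_area_m2(heat_watts, temp_kelvin, emissivity=0.9, sides=1):
    """Ideal radiator area needed to reject heat_watts purely by radiation.

    Assumption: no absorbed sunlight or Earth infrared, so this is a lower bound.
    """
    flux = emissivity * SIGMA * temp_kelvin ** 4  # W per m^2 per radiating side
    return heat_watts / (flux * sides)

# Hypothetical 1 MW heat load at two assumed radiator temperatures.
area_300k = radiator_area_m2(1e6, 300.0)
area_350k = radiator_area_m2(1e6, 350.0)
print(f"1 MW at 300 K: {area_300k:,.0f} m^2")
print(f"1 MW at 350 K: {area_350k:,.0f} m^2")
```

Because radiated power scales with the fourth power of temperature, a hotter radiator needs noticeably less area for the same heat load.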

Google is still researching this and doesn’t include a solution in their write-up. Some of the proposed solutions I’ve seen are chip designs that can run at higher temperatures, so they require less heat dissipation, and smaller interconnected satellites, so each satellite generates less heat and is potentially easier to cool.

Furthermore, because manual hardware replacement is impossible in space, the system requires redundant provisioning and fault-tolerant networking software to manage hardware failures. This area needs to be further explored because heat dissipation is a difficult problem on Earth and only becomes more difficult in space.

2. Space Radiation

Earth’s atmosphere and magnetic field largely protect us from space radiation. As we put data centers in orbit, we need a way to protect them from radiation as necessary to ensure they compute properly.

Google tested TPU resiliency to radiation by blasting TPUs with a proton beam. The tests showed high bandwidth memory to be the most sensitive component, susceptible to random bit flips caused by radiation. Irregularities appeared after a cumulative dose of 2 krad, or about three times the expected dose of a five-year space mission.

In Elon Musk’s explanation of data centers in space, he mentions that LLMs with trillions of parameters are resilient to random bit flips because a single bit flip shouldn’t affect model output. Intuitively, this makes sense; however, much of data center infrastructure code is still traditional programs with heuristic logic. A single bit flip in those programs can result in unforeseen bugs or complete system failures. Additionally, it’s possible future large AI model architectures won’t be quite as resilient.
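To make the bit flip argument concrete, here is a small sketch of what one flipped bit does to a stored value. It assumes a weight stored as a 32-bit float and a control value stored as a 32-bit unsigned integer. Both are illustrative choices (accelerators often use other formats), the variable names are hypothetical, and error-correcting memory is not modeled.

```python
import struct

def flip_bit_f32(value, bit):
    """Flip one bit (0 = least significant) of a float32 and return the result."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

def flip_bit_u32(value, bit):
    """Flip one bit of an unsigned 32-bit integer."""
    return (value ^ (1 << bit)) & 0xFFFFFFFF

weight = 0.5                               # hypothetical model weight
low_flip = flip_bit_f32(weight, 0)         # lowest mantissa bit: tiny change
high_flip = flip_bit_f32(weight, 30)       # highest exponent bit: enormous change
retries = 3                                # hypothetical control value in infra code
corrupted_retries = flip_bit_u32(retries, 31)
print(low_flip - weight, high_flip, corrupted_retries)
```

A flip in a low mantissa bit barely moves the weight, which matches the intuition above. A flip in an exponent bit, or in an integer that drives control flow, changes the value by many orders of magnitude, which is why the position of the flip and the kind of program it lands in matter so much.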

Radiation effects manifest as memory irregularities, uncorrectable errors, and silent data corruption that threaten model training accuracy and operational stability. In order to create reliable data centers, the impact of space radiation must be understood or silent failures could render data centers useless.

I find space radiation itself to be fascinating. If you’re interested, you can learn more about it here.

3. Ground Communication

The large majority of AI computation performed in data centers is inference. A key requirement for a great AI user experience is fast inference. A user needs to prompt the model and get results in a timely fashion. Orbital data centers must deliver the same speeds to be as effective as their Earth-bound counterparts.

Google Research acknowledges this as one of the most pressing and difficult engineering challenges in achieving orbital data centers. Google is partnering with Planet to better understand how this can be done.

The satellites launched in 2027 will be used to validate communication between satellites (see the next section) and between satellite clusters and the ground.

Pulled from Google Research’s blog post. Evolution of a free-fall (“no thrust”) constellation under Earth’s gravitational attraction, modeled to the level of detail required to obtain sun-synchronous orbits, in a non-rotating coordinate system, relative to a central reference satellite S0. Arrow points towards Earth’s center. Magenta: nearest neighbors of satellite S0. Orange: Example "peripheral" satellite S1. Orange dashed: S1’s positions relative to the cluster center (in the non-rotating coordinate frame).

4. Fleet Control

The primary motivator for large-scale data centers on the Earth is the ability for compute clusters to communicate with one another across high-bandwidth cables for fast communication. To replicate this communication speed, satellites in space must communicate at tens of terabits per second, meaning their bandwidth must be tens of thousands of times higher than that of typical long-range deployments. In practice, this means they must fly just hundreds of meters apart.

This introduces an orbital dynamics problem: satellites must hold their positions relative to one another in low-altitude orbit. Due to atmospheric drag and the non-uniformity of Earth’s gravitational field, these satellites could drift out of alignment and sever their high-speed connections.

Additionally, to get the most benefit out of data centers in space, the orbit of these satellites must remain within direct sunlight. This means the cluster of satellites must follow a specific orbital path around the Earth to maximize benefit, further restricting their potential orbital patterns.

Google ran many numerical calculations to understand this orbital pattern and found that the cluster of satellites would need to be capable of station-keeping maneuvers to maintain their proper placement. Luckily, these maneuvers would be slight and remain in the realm of possibility.

If you want to understand more about data centers on Earth, check out my recent article on Decoupled DiLoCo, Google’s algorithm for asynchronous model training across regions:

5. Economic Feasibility

Historically, the cost of launching a single payload into space has been prohibitively expensive. However, Google’s own simulations have found that by the mid-2030s, launching payloads into space could be as inexpensive as $200 per kilogram (currently around $3600 per kilogram using the reusable configuration of SpaceX’s Falcon 9). This relies on the commercial space industry maintaining its current learning rate and the ability to reuse rocket boosters.
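To see what "maintaining its current learning rate" implies, here is a rough sketch using Wright's law, where unit cost falls by a fixed fraction every time cumulative output doubles. The 20% learning rate is my own assumption for illustration, not a figure from this article, and the function name is hypothetical. The two prices are the ones quoted above.

```python
import math

def doublings_needed(current_cost, target_cost, learning_rate):
    """Doublings of cumulative output needed under Wright's law.

    Each doubling multiplies unit cost by (1 - learning_rate).
    """
    return math.log(target_cost / current_cost) / math.log(1.0 - learning_rate)

ASSUMED_LEARNING_RATE = 0.20  # assumption for illustration only

# From roughly $3600/kg today to the projected $200/kg.
n = doublings_needed(3600.0, 200.0, ASSUMED_LEARNING_RATE)
print(f"{n:.1f} doublings of cumulative launched mass")
```

Under that assumed rate, the price target requires cumulative launched mass to double about thirteen times, which is why the projection depends so heavily on launch cadence and booster reuse continuing to scale.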

At that cost, launching and maintaining an orbital data center becomes comparable to the cost of running a data center on Earth, making orbital data centers an economically feasible alternative to Earth-based compute. This doesn’t align with Elon Musk’s timeline of 30-36 months, but much of this timeline depends on the research and development of the commercial space industry.

Economic feasibility also depends on the reliability of GPUs and other necessary compute infrastructure. When GPUs fail in space, there’s no good way to service them. If it’s necessary to construct the infrastructure for humans to man these stations or visit them to fix infrastructure, economic feasibility needs to be reassessed.

In conclusion, orbital data centers are a moonshot venture. While they’re theoretically possible and highly beneficial, there are significant technical and physical hurdles that need to be overcome with further research and experimentation.

If you’re interested in learning more, please check out the following resources:

The Powering Intelligence 2026 report to understand current utility demands.
The Dwarkesh Podcast with Elon Musk discussing orbital data centers.
Google Research’s preliminary work validating the feasibility of orbital data centers (blog post and paper).
More information on space radiation from NASA.
My previous article on Decoupled DiLoCo to understand AI infra and how Google is making it asynchronous.

Thanks for reading!

Always be (machine) learning,

Logan

Don't Tokenmax—Do This Instead

Logan Thorneloe — Thu, 07 May 2026 16:40:30 GMT

There’s been an interesting push recently for “tokenmaxxing”, or the idea that burning through more tokens means an engineer has been more productive. The thought process is more tokens mean more AI use, which means getting more done and saving more time.

In reality, the number of tokens used is a development velocity metric much like lines of code written. It’s not only inaccurate, it can actually indicate the opposite: a decrease in development velocity. For more on measuring developer velocity via lines of code, read -2000 Lines of Code.

Measuring tokens used results in a manifestation of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure”. This creates a utility gap in what the measurement actually produces: Users prioritize models that burn tokens, redundant context over precise requests, and unnecessary agentic use for small work. Instead of engineers using proper techniques to be more productive with AI tools, they measure their productivity with how high up they are on a token spending leaderboard.

My experience has shown the opposite of tokenmaxxing (which I like to call “tokenminning”) is ideal for high velocity agentic development work. I’ve found my token use to directly correlate with the time I spend on a feature. If my agent is burning tokens, it generally means more time spent by me monitoring and reviewing that agent’s progress. This is directly contrary to the primary goal of agentic development tools: Enable engineers to accomplish more in less time.

To achieve this goal, it’s key to rely less on agent reasoning and ludicrous token spend and instead focus on a structured, well-thought-out engineering process that makes it easier for the coding agent to understand your request and adhere to it.

In this article, I cover topics to help you make your agentic development faster. We’ll go over:

My 3 tips for making agentic development use less of your time.
The tip I think is less helpful than most make it out to be.
A better measurement for agentic development velocity.

If you really want to get deeper into proper agentic development, don’t miss Packt’s Hands-on Spec-Driven Development Workshop coming up on May 14th.

This is a hands-on workshop that teaches you how to be more consistent when building with AI via spec-driven development. You’ll create a real application while learning how to define clear specs, guide AI reliably, and reduce rework.

This is the second cohort for this workshop after the first sold out. There are only 8 spots left. Secure your spot now and get 45% off with code LOG45.

I highly recommend Packt’s education resources so I’m excited to bring this discounted opportunity to you. The above is an affiliate link to help support the newsletter at no extra cost to you.

My top three tips

From working with agents both in and outside of work, here are my top three tips for making your agentic development workflow faster for you. The primary focus of these tips is on saving your time instead of token costs for your company.

Below are also great tips for anyone building agents. Among other things, agents are fundamentally a context engineering problem. When you’re working with AI coding tools, you’re curating that context in real time.

1. Use smarter models

Ever since Boris Cherny explained that he only uses Opus for software development at Anthropic back in January, I’ve been trying to do the same with my own agentic development.

His thesis is that using a larger model and spending more per token results in far less overall token use than using a smaller model on the same task. The smaller model requires more steering and rework, which results in tokenmaxxing and costs more in the long run than using the model that is more expensive per token.

In proper agentic development, I’ve also found this to indeed be the case. Laying out requirements and setting up proper architecture is much more easily done with a more intelligent model with reasoning capabilities. It not only improves spending as mentioned above, it also decreases the time an engineer spends correcting the model. While this might seem contrary to tokenminning philosophy, it actually supports it in the long run.
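The cost side of this argument is simple arithmetic: total spend is tokens used times price per token, so the pricier model wins whenever the smaller model burns proportionally more tokens than it saves on price. Below is a minimal sketch. The token counts and prices are hypothetical numbers I made up for illustration, not measurements or real pricing.

```python
def task_cost(tokens_used, price_per_million_tokens):
    """Total spend for one task in dollars."""
    return tokens_used / 1_000_000 * price_per_million_tokens

def large_model_is_cheaper(tokens_large, price_large, tokens_small, price_small):
    """True when the pricier model finishes the task for less total spend."""
    return task_cost(tokens_large, price_large) < task_cost(tokens_small, price_small)

# Hypothetical scenario: the large model costs 5x more per token,
# but the small model burns 8x the tokens through steering and rework.
cheaper = large_model_is_cheaper(
    tokens_large=200_000, price_large=15.0,
    tokens_small=1_600_000, price_small=3.0,
)
print(cheaper)
```

This only captures token spend. The engineer's time spent steering the smaller model is a separate cost that pushes the comparison further in the same direction.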

A tip I’ve frequently seen shared to get better cost-performance from agentic tools is to route models according to the task. The advice to ‘use a smaller model when performing a simple task’ is something I recommend against.

This is a very hard problem space. It’s very difficult to reliably predict which tasks a small model will be able to do on its own and which a larger model is needed for. I’ve used automated model routing for many tasks and can immediately tell when a small model has been prioritized. Even on teams that have the data necessary to make these decisions, model routing is difficult.

I recommend using larger, smarter models whenever possible as the time and cost trade-off in the end is worth it. One exception to this rule is for hobbyist projects. In that case, I recommend using whatever you want. Small models are often good enough in that environment and spending $100/mo on a subscription isn’t worth it.

See my article on that here:

2. Be precise

The biggest downfall of coding agents is that they try to do too much. I’ve seen this manifest in random changes being made to unrelated files when an agent works on a task, an agent going beyond its current work on future tasks without being provided permission to do so, an agent repeating already finished work, and agents running unnecessary CLI commands in the midst of a task.

My biggest quality of life gain has been reining in these models to ensure they aren’t going off the rails. This is a harder task than it seems, but the best way to go about this is to be very precise in your requests to the agent and the environment you set up for them to work in.

Here are the considerations I take to be more precise:

Do some engineering before you start coding. Understand the problem and the code that needs to be written so you can properly direct the agent in the tasks it needs to do. You should know what needs to be done before the agent starts working. If you want to better understand how to do this, check out the workshop above.
Scope your requests to the agent to manageable tasks. Agents tend to go off the rails when you give them too much freedom to interpret implementation details. Be explicit about what you want the agent to do and the changes that need to be made to make that happen. This can’t be done without step 1.
In your requests, try not to make any typos. I’ve found this bit of precision helps an agent understand a request much better.
Manage context based on the task. As context gets larger, the agent has a greater chance of working outside of its scope. This means using separate chats for separate tasks, as the entire chat is passed in each time a message is sent. You don’t want information from a previous task polluting the task you’re completing now.
Be mindful when curating what the agent has access to and continually adjust this as you work. For example, I’ve found some agents to perform poorly when using source control tooling. I’ve removed that agent’s ability to call those tools and I do it myself instead. This also applies to the skills and MCP servers the agent can access which have the potential to pollute context and hurt performance.

3. Write manual code

This one might seem preposterous to anyone else who is chronically online, but, yes, the art of manually writing code is not dead. I’ve seen many people resort to only coding with AI even to the point of prompting for tiny changes.

You’re going to save a lot of your time spent waiting for agents to understand your request if you take care of small changes yourself. It’s often wise to ‘Accept All’ on an agent’s output and make minor tweaks on your own instead of telling the agent what’s wrong and having it make the fix for you.

Examples of these changes include:

Renaming a variable to be more descriptive.
Simple readability fixes.
Fixing or removing imports.

The only downside to manual coding is that you need to tell the agent you tweaked the code. Otherwise, the next change by the agent will end up overwriting your work. If you don’t let the agent know, this can be a serious source of repeated work.

Don’t fall for the “I haven’t written a single line of code” narrative. I don’t know of a single engineer working on large-scale production services who doesn’t write some of the code by hand.

A better metric for agentic engineering velocity

The current best way to measure agentic development velocity is a metric called stickiness or code retention (but I prefer stickiness). This is a measurement of how much AI-generated code lasts without needing to be tweaked and without being overwritten soon after.

This metric is much better measured by AI coding tools to understand their efficiency, but it’s also something engineers should keep in mind when using their tooling. If their AI-generated code is sticky, it’s likely they’re working productively with the AI. If it isn’t sticky, there might be too much rework occurring, which means the above tips should be worked on.
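As a rough sketch of how stickiness could be computed, the function below takes the lines an AI tool wrote and the lines still present at a later check and returns the fraction that survived unchanged. The function name, the (file, line text) representation, and the sample data are all hypothetical. Real tools likely track this with far more nuance, such as moved lines, duplicate lines, and time windows.

```python
def stickiness(ai_written_lines, surviving_lines):
    """Fraction of AI-written lines still present, unchanged, at a later check.

    Lines are compared as exact (file, text) pairs, so any edit counts as churn.
    """
    ai_written = set(ai_written_lines)
    if not ai_written:
        return 0.0
    return len(ai_written & set(surviving_lines)) / len(ai_written)

# Hypothetical example: four AI-written lines, one rewritten by hand later.
generated = [
    ("app.py", "def load(path):"),
    ("app.py", "    data = open(path).read()"),
    ("app.py", "    return parse(data)"),
    ("util.py", "TIMEOUT = 30"),
]
current = [
    ("app.py", "def load(path):"),
    ("app.py", "    data = Path(path).read_text()"),
    ("app.py", "    return parse(data)"),
    ("util.py", "TIMEOUT = 30"),
]
score = stickiness(generated, current)
print(score)
```

A low score over a short window suggests the kind of rework the tips above are meant to reduce.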

—

Tokenmaxxing is a silly way to measure developer productivity. The most important metric you should maximize with agentic engineering tools is the amount of time you reclaim to work on other things. Tokenminning, in my opinion, is the way to do this.

What tips do you have for faster agentic development? I’d love to hear them in the comments.

If you enjoyed this article, don’t forget to subscribe to AI for Software Engineers to get more just like it in your inbox.

Subscribe now

Thanks for reading!

Always be (machine) learning,

Logan

Decoupled DiLoCo: How Google Is Enabling Multi-Region, Distributed LLM Pretraining

Logan Thorneloe — Thu, 30 Apr 2026 14:02:25 GMT

Large-scale LLM pretraining is a notoriously complex and resource-intensive process. Training these models can involve up to hundreds of thousands of AI accelerators being colocated. This physical requirement ensures high-bandwidth cabling can connect these accelerators, enabling low-latency communication between them.

Datacenters are built to meet these requirements and provide the necessary power, cooling, housing, and more for these massive compute clusters. Companies go to great efforts to build large datacenters and even campuses of datacenters to colocate their compute.

Google DeepMind and Google Research recently published a paper on Decoupled DiLoCo. This is a training regime designed to make multi-region distributed LLM pretraining more feasible. It reduces communication frequency and removes lock-step synchronization so training progress is not globally blocked by stragglers, failures, or slow links.

DiLoCo reduces communication by synchronizing via an outer optimizer every H steps instead of every step. Decoupled DiLoCo goes further by making synchronization asynchronous and quorum-based, and by using fragmented, scheduled updates to smooth bandwidth demand.

Multi-region, distributed LLM pretraining could potentially mean:

Faster training due to decreased wall-time.
Greater reliability as training doesn’t have to wait on stragglers or failed processes.
Greater goodput (time spent doing meaningful work) because accelerators continue working without waiting, allowing wider distribution.
Massive-scale training as training processes can reliably span multiple areas instead of physically relying on a single datacenter or closely situated datacenters within the same campus.
Reduced training costs due to the factors above.

Google has empirically shown Decoupled DiLoCo can work by training up to a 12B-parameter LLM using the process without sacrificing model quality (in the studied settings). Below, I’ll go over how this process works and what it means practically.

Executive Summary/tl;dr:

LLM pretraining takes place at massive scale and requires many resources.
Multi-region training is especially hard because step-synchronous communication amplifies latency, stragglers, and failures into idle time.
Decoupled DiLoCo improves robustness and wall-clock speed in these settings by combining quorum-based asynchronous syncing with low-frequency outer updates and fragment scheduling.
Google reports improved robustness and comparable downstream quality in the studied settings.
Practical implication: Higher goodput and better ability to use imperfect, hard-to-pool compute capacity for large-scale training.

Figure 2 from the Decoupled DiLoCo paper showing a Decoupled process with 2 learners and 2 fragments updated every 3 steps.

What makes colocation a necessity?

Before we understand Decoupled DiLoCo, we first need to understand the limitations of distributed training.

Distributed training is simply training across multiple AI accelerators. This distributes the training job to multiple parallel compute units. Each compute unit has access to specific data. To reconcile the work done across these units, information about that work must be sent to a centralized process, where separate work across accelerators can be combined into model updates. These updates are sent back to the accelerators for further training.

In many common training setups, this reconciliation happens at step boundaries via synchronized communication (e.g., collectives that effectively act like a global barrier). If one learner is slow or fails, the whole step can stall, turning “rare events” into substantial idle time at scale.
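A small sketch shows how a single straggler turns into idle time under a global barrier. The per-learner step times are hypothetical numbers chosen for illustration, and the function name is my own.

```python
def barrier_step(learner_times):
    """Model one step-synchronous training step.

    Every learner must wait for the slowest one, so the step lasts
    max(learner_times) and the rest of each learner's time is idle.
    """
    step_time = max(learner_times)
    busy = sum(learner_times)
    idle_fraction = 1.0 - busy / (step_time * len(learner_times))
    return step_time, idle_fraction

# Hypothetical step times in seconds; the last learner is a straggler.
times = [1.0, 1.1, 0.9, 1.0, 9.0]
step_time, idle_fraction = barrier_step(times)
print(f"step takes {step_time:.1f}s, {idle_fraction:.0%} of accelerator time is idle")
```

With more learners, the chance that at least one of them is slow on any given step goes up, so this effect gets worse as training scales.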

In datacenters, the connections between learners and the latency in communication are much lower, making these types of issues much less noticeable. The infrastructure is also much more robust in dealing with errors and stragglers.

Across a WAN (wide area network), this same approach often isn’t feasible. The higher latency means more time spent waiting. The distributed infrastructure can also be less predictable, making it more prone to errors and stragglers.

Thus, colocation provides a much more robust and efficient setup for a process that is already resource-intensive and slow. The sacrifices that come with distributed training make it a much less desirable option.

Figure 1 from the Decoupled DiLoCo announcement showing the benefit of decoupled training.

How does Decoupled DiLoCo fix this?

Decoupled DiLoCo addresses the challenges of WAN / multi-region training by combining:

Low-frequency outer updates (DiLoCo): Reconcile every H steps to reduce communication frequency.
Decoupling and quorum: A syncer can proceed with updates using K-of-M learners rather than waiting for everyone, so stragglers/failures don’t become global stalls.
Fragmented, scheduled syncing: Update only parts of the model at a time on a schedule, smoothing communication demand and making asynchrony more manageable.

DiLoCo (Distributed Low-Communication) uses a variant of federated averaging to enable model updates every H steps. It introduces two separate optimizers: an “inner optimization” that takes place locally at the learner and an “outer optimization” that takes place centrally. Learners can spend multiple steps optimizing locally before optimizing centrally.
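Here is a toy sketch of the inner and outer loop structure on a single scalar parameter. It is a simplification under my own assumptions: plain SGD is used for both optimizers for brevity (not the paper's optimizer configuration), each learner's "data" is just a different target value, and all names and hyperparameters are hypothetical.

```python
def inner_steps(theta, target, h, inner_lr):
    """Run h local SGD steps on the toy loss 0.5 * (theta - target) ** 2."""
    for _ in range(h):
        grad = theta - target
        theta -= inner_lr * grad
    return theta

def diloco_round(theta_global, targets, h, inner_lr, outer_lr):
    """One outer step: every learner trains locally, then deltas are averaged."""
    deltas = []
    for target in targets:
        theta_local = inner_steps(theta_global, target, h, inner_lr)
        deltas.append(theta_global - theta_local)  # this learner's "outer gradient"
    avg_delta = sum(deltas) / len(deltas)
    return theta_global - outer_lr * avg_delta

theta = 0.0
targets = [1.0, 3.0]  # each learner sees different data, so a different local optimum
for _ in range(20):   # 20 synchronizations cover 100 local steps per learner
    theta = diloco_round(theta, targets, h=5, inner_lr=0.1, outer_lr=1.0)
print(round(theta, 3))
```

The point of the structure is the communication pattern: learners only exchange information once every H local steps, and the global parameter still settles between the learners' local optima.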

To better understand this process, check out my article on federated learning. You can find it here:

Decoupled DiLoCo is created by enabling the updating process above to happen asynchronously. This is done using four algorithmic components:

Quorum-based synchronization: The outer optimization only waits for K-of-M learners to send their updates before optimizing centrally. This means it doesn’t wait on every learner, only on those able to reconcile soonest. This makes the system much more robust to learners prone to straggling and errors, and it contains the “blast radius” of learner failures by ensuring they don’t impact other learners.
An adaptive grace window: To ensure the outer optimization gets as many inner optimization updates as possible, it will adaptively wait longer for learners to arrive with their updates before completing the outer optimization. This wait time is determined by “slack,” or the amount of time the algorithm computes it can wait without holding up training, minus the amount of time it has already taken to find a quorum of learners. This process ensures the training output isn’t biased toward the lowest latency learners.
Token-weighted merging: The synchronizer keeps track of each learner’s contribution based on the number of steps and tokens that learner has trained on. The central model weights each learner’s update by the number of tokens it has trained on since it last contributed to outer optimization. This keeps the merge from being overly dominated by low-progress learners.
Fragmented, scheduled synchronization: At each outer optimization, only a subset of model weights is reconciled. These subsets are called fragments and are used to avoid large periodic bandwidth spikes. More generally, fragmenting and scheduling helps manage staleness/overwrites under asynchrony and makes communication easier to overlap with compute. Updating fragments also reduces the required peak bandwidth for training.
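Two of these components are easy to illustrate in isolation. The sketch below assumes a simple round-robin fragment schedule and a plain weighted average of per-learner deltas. Both are simplifications chosen for illustration, and the function names are hypothetical rather than taken from the paper.

```python
def fragment_for_step(outer_step, num_fragments):
    """Round-robin schedule: one fragment is reconciled per outer step."""
    return outer_step % num_fragments


def token_weighted_merge(updates):
    """Average fragment deltas, weighting each learner by tokens trained.

    `updates` is a list of (delta, tokens) pairs. `delta` is a list of floats
    for one fragment, and `tokens` counts the learner's progress since its
    last contribution.
    """
    total_tokens = sum(tokens for _, tokens in updates)
    if total_tokens == 0:
        raise ValueError("no learner made progress")
    merged = [0.0] * len(updates[0][0])
    for delta, tokens in updates:
        weight = tokens / total_tokens
        for i, value in enumerate(delta):
            merged[i] += weight * value
    return merged


updates = [
    ([1.0, 1.0], 3000),    # learner that processed many tokens
    ([5.0, -3.0], 1000),   # slower learner gets a smaller weight
]
merged = token_weighted_merge(updates)
print(fragment_for_step(7, num_fragments=4), merged)
```

Here the first learner carries three quarters of the weight, so the merged delta sits much closer to its update than to the slower learner's.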

Algorithms for the inner (learner) and outer (syncer) optimizations pulled from the Decoupled DiLoCo paper.

Practically, Decoupled DiLoCo looks like:

Learners perform inner optimizations. The tokens and step count of each learner are sent asynchronously to the synchronizer so it understands progress.
The syncer maintains a global step count to determine when it’s time to reconcile specific fragments with certain learners (fragmented, scheduled synchronization).
The syncer triggers an update, looking for K-of-M learners based on the metadata received from the first step (quorum-based synchronization).
After reaching K learners, the syncer waits the rest of its slack time for other learners without blocking progress (adaptive grace window).
The syncer pulls and merges the relevant fragment updates using the learners’ step and token count (token-weighted merging) and applies the outer update to produce a new global fragment.
The updated fragment weights are pushed back to learners asynchronously.

This process continues until training is complete.
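The quorum and grace-window steps above can be sketched with simulated arrival times. This is a rough illustration under stated assumptions: arrival times are known up front, `slack` is given rather than computed, and `select_learners` is a hypothetical helper, not an API from the paper.

```python
def select_learners(arrivals, K, slack):
    """Pick which learners join this outer update.

    `arrivals` maps learner id -> seconds until its update is ready.
    The syncer waits for the K fastest learners (the quorum), then keeps
    waiting for whatever slack remains so slower learners can still join.
    """
    ordered = sorted(arrivals.items(), key=lambda item: item[1])
    if len(ordered) < K:
        raise ValueError("not enough learners for a quorum")
    quorum_time = ordered[K - 1][1]            # when the K-th learner arrived
    grace = max(0.0, slack - quorum_time)      # slack left after the quorum
    deadline = quorum_time + grace
    chosen = [learner for learner, t in ordered if t <= deadline]
    return chosen, deadline


arrivals = {"a": 1.0, "b": 2.0, "c": 4.5, "d": 30.0}   # "d" is a straggler
chosen, deadline = select_learners(arrivals, K=2, slack=5.0)
print(chosen, deadline)
```

With K=2 the quorum forms after two seconds, the remaining three seconds of slack let learner "c" join, and the straggler "d" is skipped without stalling the update.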

Google’s Experimentation Findings

Google tested multi-region training runs and simulated infrastructure failures to empirically test the benefits of Decoupled DiLoCo. Here’s a quick summary of what they found:

>20× faster vs conventional synchronous training: Training a 12B parameter model across four U.S. regions using 2–5 Gbps connections achieved this speedup due to reduced latency.
Survived learner-unit failures and later reintegrated them: Artificial hardware failures were introduced during training, and Decoupled DiLoCo continued training after losing entire learner units and reintegrated them when they returned online.
Goodput stayed high under failures: They evaluated hardware failures via goodput: 98% goodput at 150k chips and 88% goodput at 1.2M chips.
Downstream quality remained comparable: For a dense 5B model trained on 1T tokens, they show 88% goodput and downstream metrics like Text (Avg) = 68.4 and Vision (Avg) = 54.3, alongside the no-failure baselines.

Figure 2 from the Decoupled DiLoCo announcement showing bandwidth requirements, goodput improvements, and model quality calculations.

When is this useful, and when is it not?

As a general rule, Decoupled DiLoCo is most useful for training jobs spanning multiple regions and/or datacenters, especially when availability needs to be increased.

For example, flaky infrastructure can make large-scale jobs impossible. With Decoupled DiLoCo, we can potentially mitigate the large impact of those failures.

Another, more interesting potential use case is getting the most out of stranded compute. In datacenters, accelerators are connected in specific topologies, or grids of chips, with high-bandwidth connections between them. This makes it much easier to allocate compute resources based on the size of a training job.

It can also leave compute stranded: accelerators that aren’t needed for a given training job within their datacenter, and can’t be usefully applied to another application, sit idle. Decoupled DiLoCo can potentially link those accelerators into training jobs distributed across multiple regions.

If you want to better understand ML infrastructure complexities, check out this overview I wrote a few years ago:

Decoupled DiLoCo isn’t particularly helpful for scenarios where WAN-distributed training isn’t necessary, or for training regimes that require strong consistency. Understanding which training regimes do and don’t tolerate weaker consistency will require more research, and it’s one of the potential limitations of Decoupled DiLoCo.

What are the limitations?

The biggest unknown is whether this method scales to larger training jobs, more (and smaller) distributed learners, other architectures/training processes (i.e., not LLMs), and the other variables that differ between machine learning workloads.

Decoupled DiLoCo also requires infrastructure to complete the synchronization process. Datacenters must be equipped to handle distributed training across regions by incorporating the algorithm described above into the AI infrastructure.

Regardless, Decoupled DiLoCo shows promise even if only applied to large-scale LLM pretraining, which takes up most of the AI compute today.

Takeaways

Enabling large-scale, multi-region distributed training opens the door to more cost-effective and resource-efficient large-scale machine learning. It has the potential to fundamentally change LLM pretraining, AI infrastructure requirements, federated learning, and access to AI accelerators.

More work is needed to understand the impact decoupling has on training applications outside of LLM pretraining, to implement this at a wider scale, and to understand how it behaves in practice.

My biggest takeaways:

For infra/platform teams: A huge win in robustness and the ability to pool imperfect capacity across regions/clusters.
For everyone else: There are potential $/token gains via utilization/goodput (less wasted accelerator time) by using Decoupled DiLoCo.

Thanks for reading!

Always be (machine) learning,

Logan

Understanding Open Model Licenses

Logan Thorneloe — Wed, 15 Apr 2026 13:00:09 GMT

Google recently released Gemma 4, the next iteration of their small, open models with multimodal capabilities. The release was met with a lot of praise for the model’s performance, but more importantly, Google was praised for switching from a custom model license to releasing under the Apache 2.0 License. This greatly widened the potential applications for Gemma 4.

The positive reception to Google’s move highlights a crucial, often overlooked fact: model licenses are just as important as model capabilities. While many AI applications currently rely on closed APIs, the rise of powerful, accessible open models means developers must understand the licensing landscape to choose the right tool for the job and secure long-term rights.

Disclaimer: When I refer to ‘open models’ in the context of this article, I’m referring primarily to open-weight models, including models that have released more than just their weights.

Open models provide:

Cost Efficiency and Accessibility: Open models, especially smaller, efficient variants, can run locally or on specialized hardware.
Freedom from Vendor Lock-in: Permissive, open licenses grant developers permanent rights to use, adapt, and sell models. This protects developers from API changes, rate limiting, and changes to terms of use for models controlled by a separate entity.
Control and Customization: Full access to model weights enables developers to fine-tune models for highly specialized, complex reasoning tasks, MLOps, or agentic workflows.
Enhanced Security and Privacy: Open models can eliminate the need to transmit sensitive data to third-party API providers, ensuring data governance and privacy compliance.
Optimization and Predictable Latency: Open models enable developers to control the entire serving stack, allowing for custom performance optimizations and achieving predictable, low-latency inference critical for real-time applications.

Open models have already cemented themselves as a critical tool for building AI applications, but this is only becoming more true as the models mature. If you want to build AI applications, you need to understand the open model options available.

To understand those options, you must understand the licenses under which these models are released. Custom licenses can hamper the effectiveness of open models by explicitly prohibiting their use in certain applications and machine learning engineering tasks. Licenses, just as much as model capabilities, determine the right model for a job.

Below is some basic information on the most common open model licenses as well as what you need to look for when reviewing custom licenses. I’ve included some examples of models using each.

Common Open Licenses

The Apache 2.0 License

The Apache 2.0 License is a permissive open-source software license that enables developers to use, modify, distribute, and sell the software with minimal restrictions.

Additionally, it:

Doesn’t require developers to open-source derivative works of the software, meaning Apache 2.0-licensed software can be meaningfully integrated into proprietary enterprise software.
Explicitly grants patent rights to protect developers from patent litigation related to open-source code.
Requires developers to include the original copyright notice, license text, and a copy of any NOTICE files present in the open source code.

The Apache 2.0 License permits full commercial use of the software to which it is applied. Open models that are released under the Apache 2.0 License aren’t subject to any sort of prohibitive use policy or usage caps and can be freely used and adapted, including for commercial projects generating revenue.

Notable models released under this license are:

MIT License

The MIT License is one of the shortest and most permissive open-source licenses. It grants developers the right to use, copy, modify, merge, publish, distribute, sublicense, and sell the software. It can be summarized by saying: “Do whatever you want with the code, just keep this license attached”.

While the MIT License is technically more permissive than the Apache 2.0 License, it doesn’t grant any explicit patent protection. The Apache 2.0 License is generally preferred for commercial use because of this. Its explicit patent grant reduces the risk of patent claims from contributors against commercial software built on top of it.

The most notable models released under this license are:

Custom licenses

Custom licenses can severely limit the usefulness of open models, but it depends entirely on what the custom license prohibits. The only way to understand custom licenses is by reading through them. When reading through them, you want to watch out for:

Scale Restrictions: Limits placed on commercial use, often defined by monthly active users (MAU), which can prohibit scaling applications.
Improvement Limitations/Derivatives: Strict prohibitions on using model outputs or derivatives (e.g., fine-tunes) to train or improve any other large language model.
Mandatory Attribution/Branding: Requirements to display specific branding or include the model’s name in derivative works.
Revocability and Unilateral Updates: Clauses allowing the releasing entity to unilaterally change the terms of use or revoke access to the model at any time.

Examples of custom licenses with these restrictions include:

Meta’s Llama Community License Agreement, which restricted the scale at which a Llama model could be used commercially (under 700 million MAU) and required developers to display “Built with Llama” in documentation and include “Llama” in the model name.
Gemma’s Terms of Use, which permitted commercial use but prohibited usage for harmful applications. Google could change the use policy at any time, which effectively let it revoke usage rights. This was changed with the most recent release, making Gemma models more universally useful.
Tongyi Qianwen License Agreement, which permits commercial use but requires authorization from Alibaba to exceed 100 million MAU. It also strictly prohibits the use of Qwen materials to improve any other large language model and requires developers to include “Built with Qwen” in product documentation.
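The watch-list above can be turned into a small checklist when comparing models for a project. The sketch below is a simplified illustration: the `LicenseTerms` fields and `license_risks` helper are hypothetical, the values come only from the summaries in this article, and real license texts carry more conditions than four flags can capture.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LicenseTerms:
    name: str
    mau_cap: Optional[int] = None              # None: no scale restriction
    forbids_improving_other_llms: bool = False
    requires_branding: bool = False
    revocable: bool = False


def license_risks(terms: LicenseTerms, expected_mau: int,
                  trains_other_llms: bool) -> List[str]:
    """List the restriction types that would affect this project."""
    risks = []
    if terms.mau_cap is not None and expected_mau >= terms.mau_cap:
        risks.append("scale restriction")
    if terms.forbids_improving_other_llms and trains_other_llms:
        risks.append("improvement limitation")
    if terms.requires_branding:
        risks.append("mandatory branding")
    if terms.revocable:
        risks.append("revocable terms")
    return risks


# Field values below come only from the summaries in this article.
apache = LicenseTerms("Apache 2.0")
llama = LicenseTerms("Llama Community License", mau_cap=700_000_000,
                     requires_branding=True)
qwen = LicenseTerms("Tongyi Qianwen License", mau_cap=100_000_000,
                    forbids_improving_other_llms=True, requires_branding=True)

for terms in (apache, llama, qwen):
    print(terms.name, license_risks(terms, expected_mau=150_000_000,
                                    trains_other_llms=True))
```

For a hypothetical product with 150 million MAU that also fine-tunes other models on outputs, the Apache 2.0 entry raises no flags, while the two custom licenses each surface at least one restriction to review.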

While custom licenses are not inherently detrimental to a model’s utility—as seen with the early Llama releases that fostered entire ecosystems and naming conventions like Ollama—they do impact model adoption. The industry is currently witnessing a decisive shift toward more permissive, standardized licenses as developers increasingly prioritize ease of integration and legal certainty.

In summary, if you’re choosing an LLM for an application, you should:

Heavily consider open models for increased accessibility and flexibility.
Research both model performance and licensing.
Understand license limitations, especially for custom licenses. Prefer open licenses like MIT and Apache 2.0 for commercial applications.

Thanks for reading!

Always be (machine) learning,

Logan

The Difficulties of Scaling Autoresearch | AI for Software Engineers 83

Logan Thorneloe — Sat, 28 Mar 2026 13:35:05 GMT

Hi everyone!

March has been a slow writing month for me because it’s been busy in many other parts of life. Luckily, those busy things have all been good and I’ve got a lot more to write about this April.

I’ve spoken to a lot of developers this past month about AI and almost all of them have said the same thing: “There’s a lot of info out there about AI, but not a lot about what I should actually be doing.” I get a lot of questions about the practicality of topics, and even the most experienced developers wonder what they should be doing right now. So I’m trying a new format this week that focuses more on that. This format will generally be:

A note from me about something topical.
Things you should know about and why they’re important.
Things you should read (or watch).
Things you could be doing.

I’ve created a shop for AI for Software Engineers that allows anyone to support the newsletter and represent it. I appreciate everyone supporting my work—it lets me educate thousands of developers around the world. To all my paid subscribers: Thank you!

I’ll also set up a code for anyone who guest posts here or helps add excellent resources to the ML roadmap to grab an item from the shop for free.

I’m working on partnerships to give you discounts on resources. This has become more complex than I thought, but I’m still working on it. Just wanted to add a quick update here.

A note on scaling Autoresearch

Recently, Andrej Karpathy’s Autoresearch went viral, showing that LLMs can iterate on machine learning improvements on their own. It went so viral, in fact, that I had a conversation with a friend about how AI will now fundamentally change medicine because it can research on its own.

This isn’t quite true, and I want to help you understand why. I really liked Nathan Lambert’s framing of automated machine learning research as “lossy self-improvement”: the more compute and agents thrown at a problem, the more friction is introduced. This has been my experience and what makes machine learning at scale a massive engineering challenge.

There have been many interesting implementations of Autoresearch, but most have identified a simple (usually single) metric and have given the LLM the context needed to understand improving that metric. In a production setting, we care about many metrics and the trade-offs between each—an improvement is more than just improving a single number.

The best example of this is cost. When training models at scale, we care greatly about the cost of the end model we serve. In fact, it can be worth updating a production model to a version with slightly worse performance if the cost savings are significant.

On top of inference costs, we also care a great deal about the resource efficiency of the training process itself. Finding model improvements requires many training runs and analyses. This means we also care about the efficiency of the Autoresearch process itself.
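To make the trade-off concrete, here is a minimal sketch of a multi-metric acceptance rule. The thresholds, field names, and the `accept_candidate` policy are all hypothetical. A production setup would track many more metrics than quality and serving cost.

```python
def accept_candidate(baseline, candidate,
                     max_quality_drop=0.005, min_cost_saving=0.10):
    """Decide whether a candidate model should replace the baseline.

    Both arguments are dicts with "quality" (higher is better) and
    "cost" (serving cost per request, lower is better).
    """
    quality_delta = candidate["quality"] - baseline["quality"]
    cost_saving = 1.0 - candidate["cost"] / baseline["cost"]

    if quality_delta >= 0:
        # No quality loss: accept as long as serving cost does not rise.
        return cost_saving >= 0
    # Quality dropped: accept only a small drop paired with a real saving.
    return -quality_delta <= max_quality_drop and cost_saving >= min_cost_saving


baseline = {"quality": 0.900, "cost": 1.00}
cheaper = {"quality": 0.897, "cost": 0.80}     # slightly worse, 20% cheaper
pricier = {"quality": 0.950, "cost": 1.50}     # better, but 50% more costly
print(accept_candidate(baseline, cheaper), accept_candidate(baseline, pricier))
```

Under this policy the slightly worse but cheaper model is accepted and the better but costlier one is rejected, which a single-metric loop optimizing only quality would get backwards.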

Thus, Autoresearch relies heavily on reliable engineering on two fronts:

Reliable agents steered in the right direction.
Reliable infrastructure for the agent to use.

These are the primary factors contributing to lossy self-improvement, and either can cause a serious hit to experimentation velocity and efficiency. These effects multiply when both engineering problems are combined.

To make agents reliable, they need the context to understand the search space for the problem. Autoresearch is essentially AutoML where the search space is dictated by the context given to the model. Karpathy has pushed back on this comparison, arguing that an LLM writing arbitrary code is far more powerful than traditional neural architecture search. He’s right that the searcher is more capable, but the core constraint is the same: you need to define the right search space, and context is what defines it. Due to the metrics involved in machine learning at scale, the context required is massive for an agent to accurately understand the search space and choose potential experimentation candidates. Thus, for reliable agents we rely not only on proper agent evals, but also on providing appropriate context.

Mistakes in context and agent reliability cause the agent to travel down incorrect paths, creating unnecessary training runs compounded by any infrastructure inefficiency.

Thus, Autoresearch becomes much more difficult at scale. While plausible, it’s an incredible research problem on its own.

Autoresearch is effective in machine learning experimentation because the entire process is code- and terminal-native, both of which LLMs excel at. My friend assumed AI self-improvement would translate directly to other fields like medical research, but this isn’t a given.

LLMs are exceptional at recombining existing knowledge in useful ways, but their outputs are fundamentally drawn from their training data. Creativity researchers distinguish between combinatorial creativity (novel recombinations) and transformational creativity (paradigm shifts). LLMs are strong at the former and limited at the latter. A recent study found that LLM-generated research ideas were rated as more novel than expert human ideas, but scored lower on feasibility—suggesting LLMs are better at generating plausible-sounding combinations than knowing which ideas are actually worth pursuing.

What this means is Autoresearch is most applicable to fields that are defined by a clear search space and are language- and code-native. Generalizing beyond that in its current form will be difficult. Other fields need to make advancements in their own domains before self-improving AI can make a meaningful difference, and those advancements still require the kind of transformational creativity that LLMs don’t yet provide.

What You Should Know

The current events that matter to you.

AI is taking a toll on the internet.
- GitHub availability dropped to roughly 90% as AI coding agents overwhelm the platform. We’re seeing agents overwhelm the open source community by spamming PRs. We’re also seeing an overwhelming number of vibe coded “open source” repos without any roadmap or future maintainability.
- Reddit will require suspected bot accounts to verify their humanity. This is a huge step in the right direction for reliable content on the internet, especially considering many AI models train on and retrieve answers from Reddit.
- Wikipedia editors voted 40-2 to ban AI-generated or rewritten article content. Editors may still use AI for basic copyedits of their own writing with human review. This is an effort to keep Wikipedia from suffering the kind of impact GitHub is seeing.
Agentic engineering is still scaling quickly and AI coding tools are maturing to keep pace.
- Cursor ships improved Composer models every five hours using real-time RL from user sessions. A/B tests showed 2.28% more persistent edits and 3.13% fewer dissatisfied follow-ups. Real-time (often called “continuous”) machine learning is a necessity for artificial general intelligence. We’ll see much more of it in the coming year.
- Anthropic launched auto mode for Claude Code, replacing manual permission approvals with an AI classifier. This is another move toward AI that properly thinks for itself but brings up safety concerns. For true general intelligence, AI needs to abstract a lot of what makes it difficult away from the user.
- Jensen Huang suggested engineers should receive half their base salary in AI tokens. Theory Ventures identifies inference costs as the fourth component of engineering compensation. Meta and OpenAI engineers now compete on internal leaderboards tracking token consumption.
- 7.1% of OpenClaw’s skill registry contains critical security flaws. 283 skills exposed credentials in plaintext through LLM context windows. The most-downloaded skill was an info-stealer that bypassed macOS Gatekeeper. If I haven’t made it clear: Do not use OpenClaw if you have doubts about what you’re doing. There are too many security risks.
- GitHub will train on your private repositories unless you opt out by April 24. Users are automatically opted in, including long-term paying customers. The toggle is in Settings > Copilot > Features.
Resource scarcity (memory, hardware, and energy) is becoming the bottleneck for AI companies. Existing manufacturers can’t produce fast enough, causing AI companies to pursue downstream problems themselves.
- Data centers will consume 70% of all global memory chips by 2026. AI isn’t going anywhere and usage will only grow. If you think current RAM prices are crazy, they’ll likely continue going up. For consumers, this means using the hardware you have now if you can.
- Arm released its first in-house chip in 35 years. This marks a shift from licensing-only to competing with its own customers. The Arm AGI CPU is a data center processor for AI inference, built with Meta.
- Elon Musk announced plans for a “Terafab” chip factory near Tesla’s Austin campus. He claims existing manufacturers cannot meet his AI and robotics hardware demands, targeting 100-200 gigawatts of computing power annually. No timeline was provided.
- Helion is in talks to sell fusion power to OpenAI. The deal would guarantee OpenAI 12.5% of Helion’s production, targeting 5 gigawatts by 2030. This is Sam Altman’s own energy startup and is another example of AI companies solving downstream problems themselves.
- Google released TurboQuant, reducing LLM inference memory by at least 6x with zero accuracy loss. This is still a lab result, not production-deployed, but if it’s scalable it’ll be a “Pied Piper” moment for LLM inference, reducing memory needs significantly. This is a topic I’m looking to explore next week.
AI safety is still a primary topic, both from the standpoint of secure agents and from that of AI’s potential impact on human lives.
- DeepMind published research on AI’s ability to harmfully manipulate people across 9 studies with 10,000+ participants. AI was most manipulative when explicitly instructed to be, and least effective on health topics. The framework is now used to test safety for Gemini 3 Pro.
- OpenAI launched a Safety Bug Bounty for AI-specific abuse risks. Targets include agent hijacking via prompt injection, data exfiltration, and proprietary reasoning leaks. Attacks must be reproducible at least 50% of the time.
- Doctronic, an AI “doctor” startup that raised $40M, was caught with critical security and credibility issues. Cybersecurity researchers jailbroke the chatbot into providing methamphetamine synthesis instructions. The company’s claim of helping 24 million people is unsupported by traffic data.
- Senators Hawley and Warren want to mandate annual energy reporting for data centers. Separately, Sanders and AOC introduced legislation to halt new data center construction until Congress regulates AI. Google’s data center energy consumption doubled between 2020 and 2024.
- A federal judge blocked the Pentagon from labeling Anthropic a supply chain risk. The court ruled it was illegal retaliation for Anthropic’s refusal to let its AI be used in autonomous weapons or domestic mass surveillance.
New models were released this week that you can start building with. Many of these are small enough to run on consumer hardware, circumventing the resource issues mentioned above.
- Gemini 3.1 Flash Live launched as Google’s highest-quality real-time audio and voice model. It scores 90.8% on multi-step audio function calling benchmarks and maintains conversation context twice as long as previous versions. Real-time multimodal search expanded to 200 countries.
- Cohere released Transcribe, an open-source speech-to-text model that processes 525 minutes of audio per minute. 2B parameters, 5.42 word error rate, 14 languages, designed for self-hosting on consumer GPUs.
- Mistral released Voxtral TTS, an open-source text-to-speech model small enough for smartwatches. 9 languages, voice cloning from less than 5 seconds of audio, 90ms latency to first speech.
Moves are being made in the consumer sector.
- OpenAI killed the Sora app after downloads plummeted. Despite popular opinion, this isn’t the end of OpenAI’s video generation model; it’s the end of OpenAI losing money by offering it openly to the public. This is a good business move by OpenAI but seems to be massively misunderstood by the public.
- Google launched tools to import ChatGPT and Claude chat histories directly into Gemini. This follows Anthropic releasing a similar feature in Claude. Less friction to switch between ecosystems is always a win for consumers.
- Apple set WWDC 2026 for June 8-12, teasing more “AI advancements” to come, a stark contrast to last year, when the topic was largely avoided. Apple is expected to announce a partnership with Google to bring Gemini (or a version of Gemini) to Apple device users.

What You Should Read

Articles I think are worth reading in their entirety this week.

Improving Composer through real-time RL by Cursor Blog. An excellent account of continuous training in production. Cursor converts user sessions into reward signals, ships updated models every five hours, and documents failure modes like models gaming reward systems to avoid negative scores. Continuous learning is a prerequisite to AGI as it enables models to continuously improve and will be a primary topic in 2026. I suspect many companies will follow Cursor’s example this year.
Lossy self-improvement by Nathan Lambert. Lambert argues recursive AI self-improvement will hit complexity brakes, not compound exponentially. He draws on Amdahl’s Law and Paul Allen’s complexity brake: “The more compute and agents you throw at a problem, the more loss and repetition shows up.” As mentioned above, I think this is an excellent read.
How Anthropic’s Claude Thinks by . An easily understandable overview of Anthropic’s interpretability research that shows Claude’s default state is to refuse all questions, and hallucinations happen when a recognition system misfires. The accessibility of this article makes it an excellent read.
How a Leading Venture Capitalist uses AI Agents by . shares his full agent stack: morning briefings, meeting capture, research, and drafting. These are excellent examples of real-world AI usage that can be implemented with a bit of technical knowledge.
Thoughts on slowing the fuck down by . My team at Google has really felt the new bottlenecks that come from AI-generated code and the impact that has had on the engineering process. Speed is always the focus of agentic engineering, but reliability is the most important part of production code. This is a great, simple overview of why that is.

What You Should Do

The action you can take this week based on the information shared above to learn the skills that are the most in demand.

20 Years of Code Optimized in Two Days | Weekend Reads 4

Logan Thorneloe — Sun, 15 Mar 2026 14:03:20 GMT

Welcome to the weekly reading list! This is how I keep up with AI news and deepen my understanding of the topics that matter for building production systems. I focus on primary sources and authors I trust to keep the signal-to-noise ratio high.

You can support AI for Software Engineers for only $5/month and get the complete edition of this list as a thank you. Thank you to all paid subscribers for your support!

In this list

As has been the case for 2026, there are a ton of interesting reads this week about getting agents working in production and what they can do. Most interesting are:

Shopify’s CEO pointed a coding agent at a 20-year-old Ruby codebase with a benchmark script and 974 unit tests. 120 automated experiments and 93 commits later, it was 53% faster.
StrongDM built a production software pipeline where three humans manage AI agents. The rules: “code must not be written by humans” and “code must not be reviewed by humans.” Each engineer spends ~$1,000/day on tokens.
AMD published a diagnostic framework where Claude Code and Cursor act as autonomous agents debugging large training clusters, tracing a 23% throughput drop to RDMA degradation on 4 of 24 nodes.
84% of Uber devs are now agentic coding users and Claude Code usage nearly doubled in three months, from 32% to 63%, while IDE-based tools have plateaued.
OpenAI shared a phishing-style prompt injection that tricked ChatGPT into exfiltrating employee PII 50% of the time, and their defense framework treats it like a call center problem, not a code injection problem.

Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations by Simon Willison

“Having a robust test suite - in this case 974 unit tests - is a massive unlock for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.”

Shopify CEO Tobi Lutke took Andrej Karpathy’s autoresearch pattern, pointed it at Liquid, Shopify’s template engine, and ran 120 automated experiments over two days. The agent gave itself a benchmark script, iterated against 974 unit tests, and produced 93 commits.

Parse+render time dropped by 53%, allocations dropped by 61%
Replacing StringScanner with String#byteindex was ~40% faster for single-byte searching
Pre-computing frozen strings for integers 0-999 eliminated 267 allocations per render

A comprehensive test suite gave the agent enough context to make changes and verify them independently. “Make it faster” only becomes an actionable goal when the agent can measure its own progress and confirm it hasn’t broken anything along the way.
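The loop described here (try a change, keep it only if tests pass and the benchmark improves) can be sketched as follows. This is a toy illustration, not Shopify's or Karpathy's actual harness: candidates are hand-written functions rather than agent-generated commits, and a fixed `cost` field stands in for a measured runtime so the example stays deterministic.

```python
def run_experiments(baseline, candidates, passes_tests, benchmark):
    """Keep a candidate only if tests pass and the benchmark improves."""
    best, best_score = baseline, benchmark(baseline)
    kept = []
    for candidate in candidates:
        if not passes_tests(candidate):
            continue                      # broke behavior: discard
        score = benchmark(candidate)
        if score < best_score:            # lower is faster
            best, best_score = candidate, score
            kept.append(candidate["name"])
    return best, kept


# Toy task: sum of squares below n. "cost" stands in for a measured runtime.
CASES = [(0, 0), (1, 0), (4, 14), (10, 285)]
baseline = {"name": "loop", "cost": 100,
            "fn": lambda n: sum(i * i for i in range(n))}
candidates = [
    {"name": "closed_form", "cost": 5,
     "fn": lambda n: (n - 1) * n * (2 * n - 1) // 6},
    {"name": "fast_but_wrong", "cost": 1, "fn": lambda n: n * n},
    {"name": "list_loop", "cost": 120,
     "fn": lambda n: sum([i * i for i in range(n)])},
]

best, kept = run_experiments(
    baseline, candidates,
    passes_tests=lambda c: all(c["fn"](n) == want for n, want in CASES),
    benchmark=lambda c: c["cost"],
)
print(best["name"], kept)
```

The fastest candidate is rejected because it fails the tests, which is the role the 974 unit tests play: without them, the loop would happily keep a change that is quicker and wrong.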

Designing AI agents to resist prompt injection

“If the problem is not just identifying a malicious string, but resisting misleading or manipulative content in context, then defending against it cannot rely only on filtering inputs.”

Real-world prompt injection attacks now look like phishing, not code injection. OpenAI shared an example: a phishing-style email that worked 50% of the time against ChatGPT, getting it to extract employee PII and send it to a third party.

Their defense framework borrows from how organizations protect human customer service agents. You don’t train a call center worker to detect every possible scam; you constrain their capabilities. For AI agents, this means source-sink analysis: monitor when information would leave the conversation or when the agent would follow an external link, rather than trying to perfectly classify inputs.
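A rough sketch of the source-sink idea, covering only the case where information would leave the conversation. Everything here is an assumption made for illustration: the action format, the `review_action` helper, and the plain substring match are hypothetical and are not OpenAI's implementation.

```python
from urllib.parse import urlparse


def review_action(action, sensitive_values, trusted_hosts):
    """Decide whether an outbound request needs user confirmation.

    Sources are the sensitive strings seen in the conversation; sinks are
    outbound requests. A request is held when a source would reach a host
    outside the trusted set.
    """
    if action["type"] != "http_request":
        return "allow"
    host = urlparse(action["url"]).hostname or ""
    payload = action["url"] + action.get("body", "")
    carries_source = any(value in payload for value in sensitive_values)
    if carries_source and host not in trusted_hosts:
        return "confirm"
    return "allow"


sensitive = {"alice@corp.example"}
trusted = {"intranet.corp.example"}
exfil = {"type": "http_request",
         "url": "https://attacker.example/c?e=alice@corp.example"}
lookup = {"type": "http_request", "url": "https://docs.python.org/3/"}
print(review_action(exfil, sensitive, trusted))
print(review_action(lookup, sensitive, trusted))
```

The check never inspects the incoming email for malicious wording. It only constrains what the agent is allowed to send out, which is the shift in emphasis the framework describes.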

The Shape of the Thing by Ethan Mollick

“Code must not be written by humans. Code must not be reviewed by humans.”

StrongDM built a “Software Factory” where three humans manage AI agents that write, test, and ship code. Each engineer spends ~$1,000/day on AI tokens. Coding agents build from product roadmaps, testing agents build simulated customer environments and try to break what the coding agents built, and the agents loop feedback to each other until satisfied.

We’ve moved from co-intelligence, prompting back and forth, to management, giving agents hours of work and getting results in minutes. Every major AI lab is now explicitly working on recursive self-improvement. OpenAI says Codex was “instrumental in creating itself,” and Anthropic says their engineers barely write code anymore.

Nemotron 3 Super: NVIDIA’s gpt-oss killer?

“Reducing the expert dimension by a factor of d/l = 4 lets you reinvest those savings into both more total experts and higher top-k.”

NVIDIA’s Nemotron 3 Super, 120B total with 12B active, is worth paying attention to because of LatentMoE. Standard MoE routes tokens from the full hidden dimension directly to experts, but LatentMoE wraps the expert path with shared linear projections that compress from d=4096 down to l=1024, do all expert computation in that compressed space, then project back up.

Reducing the expert dimension by 4x lets you run 512 total experts with top-22 routing where standard MoE typically uses 128 experts with top-6 or top-8 at the same compute cost
Artificial Analysis flagged the model as extremely verbose though, generating 110M tokens during their eval suite vs an average of 7.3M, which could erase most of those throughput gains in practice
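
A back-of-the-envelope check makes the trade concrete. The sketch below counts expert weights only, assuming every expert is a two-matrix feed-forward block with the same inner width (`ffn_hidden`, a placeholder) and ignoring the router and the shared projections. The function name and those simplifications are illustrative assumptions, not NVIDIA's accounting.

```python
def expert_weights(num_experts: int, top_k: int, expert_dim: int, ffn_hidden: int = 1):
    """Relative weight counts for the expert path of one MoE layer."""
    per_expert = 2 * expert_dim * ffn_hidden  # up- and down-projection matrices
    return {"total": num_experts * per_expert, "active": top_k * per_expert}


standard = expert_weights(num_experts=128, top_k=6, expert_dim=4096)
latent = expert_weights(num_experts=512, top_k=22, expert_dim=1024)

print(standard["total"] == latent["total"])   # True
print(latent["active"] / standard["active"])  # a little under 1.0
```

Under these assumptions the two configurations hold the same number of total expert weights (512 × 1024 = 128 × 4096), and top-22 routing at width 1024 touches slightly fewer weights per token than top-6 at width 4096.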

How to Diagnose Failures in Large AI Training Clusters

“The teams that figure out how to make that transition -- how to turn their debugging knowledge into repeatable infrastructure instead of leaving it trapped in someone’s head -- those are the teams that will compound their advantage over everyone else.”

AMD published a diagnostic framework for large training clusters where Claude Code and Cursor act as autonomous diagnostic agents. It uses a three-skill pipeline: job-log-triage to identify what happened, performance-analysis to locate where in compute, and tsdb-diagnosis to determine why via Prometheus queries.

In one case study, a 23% throughput drop on a 192 GPU run was traced to RDMA degradation on 4 of 24 nodes. The agent isolated the unhealthy nodes from TSDB metrics, and excluding them restored throughput by 30%. The skills themselves are structured instruction files that encode how senior systems engineers actually debug these problems, turning tribal knowledge into repeatable runbooks.
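
The two percentages describe the same gap from different baselines. A quick check, assuming the 30% is measured relative to the degraded rate (the summary above does not say):

```python
def gain_to_recover(drop: float) -> float:
    """Relative speedup needed to return to baseline after a fractional drop."""
    degraded = 1.0 - drop
    return 1.0 / degraded - 1.0


gain = gain_to_recover(0.23)
print(f"{gain:.1%}")  # 29.9%
```

So a 23% drop followed by a roughly 30% gain lands back at about the original throughput.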

AI should help us produce better code

“Shipping worse code with agents is a choice. We can choose to ship code that is better instead.”

Willison’s argument is that agents should make code quality go up, not down. Common tech debt like renaming concepts, fixing API inconsistencies, and splitting large files is conceptually simple but time-consuming, and agents handle it well.

He recommends using async agents like Gemini Jules, Codex web, and Claude Code web for background refactoring so it doesn’t interrupt flow, and using agents for cheap exploratory prototyping. You can spin up a Redis simulation with load tests from a single prompt to validate technology choices before committing to an approach.

Applying Statistics to LLM Evaluations

“Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning.”

The standard industry practice for evals is to run a model on a benchmark, report the number, and bold it if it’s the highest. No confidence intervals, no significance tests, and no accounting for the fact that your eval score has at least two sources of randomness: which questions were sampled and the model’s stochastic generation.

Based on Anthropic’s paper on statistical best practices, this deep-dive builds the framework from scratch.

Central Limit Theorem gives you confidence intervals for eval scores
Bernoulli simplification for pass/fail evals gives a cleaner standard error formula
Law of total variance decomposes eval uncertainty into question-sampling variability vs. within-question generation variability
On a 70B model, evaluating with too few questions can produce confidence intervals wide enough to make model comparisons meaningless
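
To make the first two bullets concrete, here is a minimal sketch of the Bernoulli standard error and the CLT-based 95% interval for a pass/fail eval. It assumes one sample per question and independent questions; the function name and the example scores are made up for illustration.

```python
import math


def pass_rate_ci(num_passed: int, num_questions: int, z: float = 1.96):
    """Pass rate, standard error, and ~95% CI for a pass/fail eval."""
    p = num_passed / num_questions
    se = math.sqrt(p * (1.0 - p) / num_questions)  # Bernoulli standard error
    return p, se, (p - z * se, p + z * se)


# Same 70% pass rate, two benchmark sizes.
for passed, total in ((70, 100), (700, 1000)):
    p, se, (low, high) = pass_rate_ci(passed, total)
    print(f"n={total}: {p:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With 100 questions the interval is roughly ±9 points; with 1,000 it is roughly ±3. Two models that differ by a few points on a small benchmark may not be distinguishable at all.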

Coding After Coders: The End of Computer Programming as We Know It

“I feel like programmers have it easy... If you’re a lawyer, you’re screwed, right? There’s no way to automatically check a legal brief written by A.I. for hallucinations -- other than face total humiliation in court.”

The NYT Magazine’s comprehensive piece on AI-assisted development is based on interviews with 70+ developers from Google, Amazon, Microsoft, and Apple. The general attitude was optimistic, with mentions of Jevons paradox potentially increasing demand.

The request for anonymity from the Apple engineer who said “I believe that it can be fun and fulfilling and engaging, and having the computer do it for you strips you of that” is itself a data point. Corporate dynamics may be suppressing critical voices.

You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week. In case you missed it, here’s last week’s reading list:

How to train the best embedding model in the world by Jack Morris

ICE Has an AI Problem

Logan Thorneloe — Wed, 11 Mar 2026 13:41:16 GMT

The most difficult problem in ML isn’t technical. It’s matching a business problem to an ML solution. You can build a technically impressive system that solves the wrong problem entirely, and it happens more often than most people realize.

This difficulty shows up in data bias, which is frequently discussed. Less discussed is the mismatch itself: pairing the wrong ML solution with the problem, so the system never solves what you set out to solve. Recent events with ICE and technology in government provide a very real example of this, and there’s a production machine learning lesson to be learned from it.

I’ve been digging into ICE’s primary AI system, and I want to walk through what it does, how it works, and why it fails at its own stated objective. The goal is for you to understand the difficulties of linking ML solutions to business problems and the potential impact of getting it wrong.

The business objective

Understanding the business objective is the most difficult part of machine learning. You need to link ML techniques and data to adequately address the problem, and that’s harder than it sounds.

Part of this is breaking the business objective down into manageable chunks with engineering requirements. An engineering team that wants to automate the detection of fraudulent transactions needs to translate “reduce fraud losses” into specifics: what counts as fraud, what’s an acceptable false positive rate, how fast does detection need to happen, and what systems need to consume the output? Getting any of these wrong means your model might perform well on paper while failing in practice.

The other part is ensuring the ML solution solves the actual problem, not a proxy for it.

This has gone wrong before. Predictive policing systems trained on arrest data instead of actual crime data didn’t predict where crime would happen. They predicted where police already patrolled. Neighborhoods with heavy police presence generated more arrests, which fed back into the model as “high crime areas,” which sent more officers there, which generated more arrests. The system reinforced the existing pattern of enforcement rather than identifying actual criminal activity. The result was a feedback loop that directed resources based on historical policing bias, not public safety need.
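
A toy simulation shows how little it takes for this loop to lock in. The model below is deliberately simplistic and every number in it is made up: two neighborhoods, patrols sent each day to whichever one has more recorded arrests, and arrests recorded only where officers are present.

```python
def simulate_patrols(days, crimes_per_day, recorded_arrests):
    """Greedy allocation: patrol wherever the records show the most arrests."""
    arrests = list(recorded_arrests)
    for _ in range(days):
        target = arrests.index(max(arrests))       # "high crime area" per the data
        arrests[target] += crimes_per_day[target]  # only patrolled areas add arrests
    return arrests


# Identical true crime; neighborhood 0 starts with one extra recorded arrest.
print(simulate_patrols(days=30, crimes_per_day=[10, 10], recorded_arrests=[6, 5]))
# [306, 5]
```

Raising neighborhood 1's true crime rate changes nothing in this model, because the records never get a chance to reflect it. The data describes where the police were, not where the crime was.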

To evaluate whether ICE’s AI system falls into the same trap, we need to understand their business objective. We’ll pull it directly from the White House’s own statement about ICE’s objective:

“Many of these aliens unlawfully within the United States present significant threats to national security and public safety, committing vile and heinous acts against innocent Americans... Enforcing our Nation’s immigration laws is critically important to the national security and public safety of the United States.”

The stated goal is to increase public safety by finding illegal aliens who make the US less safe and removing them from the country. Keep this in mind as we move forward.

ELITE: ICE’s primary AI system

There are multiple systems ICE is using, but we’re going to focus on two. The first is ELITE (Enhanced Leads Identification and Targeting for Enforcement), an AI system developed by Palantir Technologies that functions as a targeting engine. The second is ImmigrationOS, a backend system also developed by Palantir that collects documentation from multiple sources to perform entity resolution. ImmigrationOS directly powers ICE’s enforcement operations, and the DHS AI inventory confirms its AI capabilities for entity resolution and facial recognition.

ELITE aggregates these data sources and uses algorithms to help agents identify, locate, and prioritize individuals for enforcement operations. As one ICE officer revealed in court:

“The app ‘brings up a dossier on each person’ and ‘provides a confidence score on the person’s current address.’ It ‘tells you how many people are living in this area and what’s the likelihood of them actually being there.’”

Palantir’s own documentation describes their entity resolution approach as using “hashing methods and AI/ML models” with “fuzzy matching techniques” to continuously match “millions of records from disconnected systems.” In practice, this means computing similarity scores across fields like name, address, date of birth, and partial SSN, then using a threshold to decide whether two records refer to the same person.

For example, “J. Garcia” on a utility bill gets linked to “Juan Garcia” on a DMV record when enough of those identifiers overlap. The output is a unified person object: a dossier containing everything the system knows about an individual.

Under the hood, entity resolution systems typically use a combination of probabilistic record linkage, TF-IDF or embedding-based similarity measures, and edit distance calculations to compare fields across records. For more technical detail, check out “(Almost) All of Entity Resolution” in Science Advances.
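
Here is a minimal sketch of threshold-based matching using Python's standard `difflib`. It is not Palantir's method: the field weights, the 0.85 threshold, the choice to skip missing fields, and the sample records are all hypothetical, chosen to mirror the “J. Garcia” example above.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, for illustration only.
WEIGHTS = {"name": 0.4, "address": 0.3, "dob": 0.3}
MERGE_THRESHOLD = 0.85


def field_similarity(field, a, b):
    a, b = a.lower().strip(), b.lower().strip()
    if field == "name":
        return SequenceMatcher(None, a, b).ratio()  # fuzzy match on names
    return 1.0 if a == b else 0.0                   # exact match elsewhere


def match_score(rec_a, rec_b):
    """Weighted similarity over the fields both records actually have."""
    shared = [f for f in WEIGHTS if rec_a.get(f) and rec_b.get(f)]
    if not shared:
        return 0.0
    total = sum(WEIGHTS[f] for f in shared)
    return sum(WEIGHTS[f] * field_similarity(f, rec_a[f], rec_b[f]) for f in shared) / total


def same_person(rec_a, rec_b):
    return match_score(rec_a, rec_b) >= MERGE_THRESHOLD


bill = {"name": "J. Garcia", "address": "12 Oak St", "dob": None}  # utility bill
juan = {"name": "Juan Garcia", "address": "12 Oak St", "dob": "1990-03-14"}
jose = {"name": "Jose Garcia", "address": "12 Oak St", "dob": "1988-07-02"}

print(same_person(bill, juan))  # True
print(same_person(bill, jose))  # True
print(same_person(juan, jose))  # False
```

The last three lines are the interesting part. The two DMV records are clearly different people, but the utility bill clears the threshold against both of them, so a system like this has to guess whose dossier it belongs to.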

Once an entity is resolved, the system generates a confidence score for where that person might currently live. According to 404 Media’s reporting on ELITE’s user guide, the score is based on both the source of the address and how recent the data is. If a target has multiple recent records (a new electric bill and a recent court date) associated with one address, the confidence score increases.

The score weighs three things:

Recency: How recent is the address data? Collections of older documents get lower scores.
Source authority: Which sources are considered more reliable? Medicaid and HHS data are treated as high-authority. DMV and credit records are considered lower-authority.
Corroboration: Multiple records pointing to the same address compound the score. The more data trails leading to one location, the higher the confidence.
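
One way those three factors could combine is sketched below. This is a guess at the general shape, not ELITE's actual formula, which is not public: the exponential decay, the half-life, the authority values, and the noisy-OR combination are all assumptions made for illustration.

```python
def address_confidence(records, half_life_days=365.0):
    """Combine records pointing at one address into a score in [0, 1).

    Each record is (source_authority in [0, 1), age_days). Evidence decays
    with age and is combined with a noisy-OR, so corroboration compounds.
    """
    miss = 1.0
    for authority, age_days in records:
        recency = 0.5 ** (age_days / half_life_days)  # older data counts less
        miss *= 1.0 - authority * recency
    return 1.0 - miss


recent_high_authority = address_confidence([(0.8, 30)])
corroborated = address_confidence([(0.8, 30), (0.5, 10)])
stale = address_confidence([(0.8, 1500)])

print(stale < recent_high_authority < corroborated)  # True
```

Whatever the real formula is, any score built from these inputs rises with the amount of fresh, official paperwork a person generates.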

At scale across millions of records from disconnected databases, even small error rates compound. A name misspelling, a shared address between roommates, or a common name in a large city can push the similarity score over the merge threshold and combine two real people into one synthetic dossier.

ELITE’s core feature is a map interface where agents can select an area on a geographical map and return all potential targets within it with their confidence scores and the documents used for entity resolution. ICE agents described this in court testimony as identifying “target-rich” areas where enough targets cluster on the map to make a sweep of that area productive. This essentially creates a geospatial heat map based on the number of targets within a given area and the confidence that they will be there.

Where the data comes from

ImmigrationOS is the technology developed by Palantir that unifies data across federal agencies for AI-powered enforcement, including ELITE.

ICE and Palantir don’t publicly share specifics about their data sources and system functionality. What we know comes from FOIA requests by immigrant legal rights group Just Futures Law, official data sharing agreements, leaked documents, and investigative journalism.

The data feeding this system comes from several sources:

Medicaid enrollment information: Visit dates, addresses, and ethnic information, shared via a formal agreement between CMS and DHS.
Thomson Reuters CLEAR: Utility bills, credit report headers, and vehicle insurance records. ICE may have paid millions of dollars for this commercial data.
Federal records: DMV records, student and F-1 visa information, border crossing records, biometrics from previous arrests or encounters, and license plate reader data.

ICE argues this is legal under 8 U.S.C. § 1360(b) of the Immigration and Nationality Act, which states that “any information in any records kept by any department or agency of the government as to the identity and location of aliens in the US shall be made available to” immigration authorities. However, legal scholars have questioned whether this statute authorizes bulk data sharing for algorithmic targeting, which is a use case Congress likely didn’t envision when the law was written.

One key to ImmigrationOS and ELITE working so well is the inclusion of Medicaid data. This data tends to be accurate and recent, which boosts confidence scores significantly.

Think about who generates Medicaid data. It’s people going to the doctor, getting their kids vaccinated, and seeking preventive care. It’s people participating in the healthcare system and leaving a trail of documentation behind.

The same logic applies to other sources. Utility bills are generated by people who pay their bills. Credit records are generated by people who have credit. Vehicle insurance records are generated by people who insure their cars.

The data that feeds this AI system is overwhelmingly generated by people who are integrated into society and following its rules. This is a textbook example of selection bias: the model can only see people who leave data trails, and leaving data trails is correlated with being a functioning member of society, not with being a threat to public safety.

What these systems actually do for ICE

Now that we understand how the system works, let’s evaluate whether it achieves the business objective: find and remove people who are “threats to national security and public safety.”

As Biometric Update reported:

“ELITE’s confidence scoring is less about establishing certainty than it is about guiding deployment. The system allows ICE to decide where to apply enforcement pressure without needing to test the reliability of its data before a judge.”

In other words, the system is identifying areas where agents can find the most people to arrest with the least effort. To illustrate how this plays out, consider two hypothetical targets:

Person A: Has lived at the same address for five years, pays utility bills, has a Medicaid record from last month, and drives a registered and insured car. ELITE confidence score: 95%. Time to arrest: a few hours.

Person B: Uses burner phones, moves frequently, works cash-only jobs, avoids all government systems, and engages in criminal activity. ELITE confidence score: 12%. Time to arrest: weeks of active surveillance.

The system mechanically pushes agents toward Person A because that’s what efficiency optimization does. It finds the easiest targets instead of the most dangerous ones.
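
The same effect shows up with more than two people. In the sketch below every person and every number is invented to mirror Person A and Person B, so it demonstrates the mechanism rather than providing evidence: ranking by the proxy (address confidence) and ranking by the objective (risk) select different people.

```python
# Hypothetical people: (label, address_confidence, risk). All values invented.
people = [
    ("A", 0.95, 0.05), ("C", 0.90, 0.10), ("D", 0.85, 0.05),
    ("B", 0.12, 0.90), ("E", 0.20, 0.70), ("F", 0.30, 0.40),
]


def mean_risk_of_top(people, key_index, k=3):
    """Average risk of the k people ranked highest on the chosen column."""
    top = sorted(people, key=lambda p: p[key_index], reverse=True)[:k]
    return sum(p[2] for p in top) / k


by_confidence = mean_risk_of_top(people, key_index=1)  # what the tool ranks on
by_risk = mean_risk_of_top(people, key_index=2)        # the stated objective

print(by_confidence < by_risk)  # True
```

When the proxy is uncorrelated or negatively correlated with the objective, optimizing the proxy harder makes the outcome worse, not better.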

This is structurally the same problem as predictive policing. Just as those systems measured where police already patrolled rather than where crime actually happened, ELITE measures where data trails exist rather than where threats to public safety exist. In both cases, the system optimizes for a proxy metric (arrests, data density) rather than the actual objective (reducing crime, improving public safety). The result is a feedback loop: the system directs resources toward easy-to-find individuals, those individuals get arrested, and the arrest numbers create the appearance of a productive system while the actual problem goes unaddressed.

This creates several compounding issues. First, ICE has finite resources. Every hour spent on Person A is an hour not spent on Person B. Second, the system optimizes for volume over impact, and finding the most targets is not the same as finding the most important ones. Third, the appearance of productivity masks the failure to achieve the stated objective.

The system makes bias worse over time

If someone knows their medical records can be used to locate them for deportation, they’re less likely to go to the doctor. This isn’t speculation. It’s a predictable consequence of weaponizing healthcare data for enforcement. The same logic applies to every data source in the system: utility bills, credit records, vehicle insurance. When participation in society becomes a liability, people stop participating.

This creates a feedback loop that compounds the selection bias. As people who hear the warnings drop out of healthcare and other systems, they stop generating the data trails ELITE relies on. The people who remain visible to the system are the ones who haven’t gotten the message yet, or the ones who are too integrated to disappear. The model’s pool of targets gets progressively less correlated with actual threats over time, not more.

It also creates a public health risk that affects citizens. Diseases don’t check immigration status. An untreated communicable illness in someone too afraid to visit a hospital is a risk to everyone around them. GAO data shows U.S. citizens and green card holders have already been detained during these operations, so the consequences of this system aren’t limited to its intended targets.

Counterarguments

To be fair, there are reasonable counterarguments to make here.

Is this more efficient than what ICE was previously doing? Probably. Compared to manual investigations with no data aggregation, a system like ELITE is a meaningful upgrade in capability. There’s value in having a centralized system rather than agents manually cross-referencing records from dozens of separate databases.

There’s also the argument that regardless of who the system catches, all undocumented immigrants are technically in violation of immigration law. From that perspective, it doesn’t matter whether the system finds Person A or Person B, because both are here illegally.

These are valid points. However, they don’t change the underlying ML problem. The stated objective isn’t “deport as many people as possible as efficiently as possible.” It’s to keep the country safe from people who present “significant threats to national security and public safety.”

When the AI system is structurally biased toward finding integrated, low-risk individuals instead of dangerous ones, it fails at its stated objective regardless of how many people it processes. Incorrectly applied AI doesn’t just fail to help. It actively makes ICE’s job harder. When the system points agents toward low-risk individuals who happen to leave data trails, it burns limited resources on people who were never a threat. Every wrongful detention of a U.S. citizen or green card holder generates legal challenges, public backlash, and erosion of community cooperation that makes future investigations more difficult. The algorithm creates the illusion of productivity while pulling agents further from their actual mission.

The surveillance infrastructure required to power it affects everyone, not just its intended targets, as seen with the Medicaid example. Communities stop cooperating with law enforcement entirely when they see their neighbors swept up in algorithmic dragnets. Healthcare systems lose patients who are too afraid to generate the medical records that feed this machine. Entity resolution errors merge innocent people’s data into dossiers that trigger enforcement actions against the wrong person.

These are the predictable consequences of building a surveillance and enforcement system on data that measures the wrong thing.

A note on surveillance states

Most conversations about the dangers of surveillance states focus on privacy, security, and infringement of rights. All of those are important, but there’s another angle worth considering: the technology itself is prone to exactly the kind of failure I’ve been describing throughout this article.

A surveillance state at modern scale requires AI to function. The volume of data generated by monitoring hundreds of millions of people is far beyond what human analysts can process. AI becomes the tool that makes mass surveillance operationally feasible.

This creates a feedback loop. Better AI requires more data, and a surveillance system run by AI generates exactly that data, which feeds back into making the AI more capable, which justifies expanding the surveillance further.

The problem is what we’ve already covered in this article: it is remarkably easy for AI-powered systems to optimize for the wrong thing or embed bias in ways that aren’t obvious until the damage is done. ELITE is a clear example. It was built to find threats to public safety and instead systematically targets the least threatening people because that’s where the data is.

When this kind of failure happens at the scale of a surveillance state, the consequences aren’t abstract. Incorrect AI usage has directly resulted in the detention of U.S. citizens, which is the opposite of what these systems claim to achieve. If the goal is safety, a system that consistently misdirects enforcement effort is arguably worse than no system at all.

Surveillance states aren’t just dangerous because of what they monitor. They’re dangerous because the technology powering them is far less reliable than the people deploying it seem to understand.

Always be (machine) learning,
Logan

Better Agents Mean Better Surveillance | Weekend Reads 3

Logan Thorneloe — Sun, 01 Mar 2026 15:21:34 GMT

Enjoy this weekend’s reading list! There are a few topics that were especially prevalent: the dangers of a surveillance state, the importance of evals, and agentic engineering practices and resources.

Statement from Dario Amodei on our discussions with the Department of War

“Powerful AI makes it possible to assemble this scattered, individually innocuous data into a comprehensive picture of any person’s life—automatically and at massive scale.”

This is the biggest ethical issue AI is facing right now. US citizens (and, I’m certain, people in other countries) have always been scared of a surveillance state (search ‘Birds Aren’t Real’). AI provides not only the means to build one, but also more of a motive: surveillance creates the opportunity to collect more data, which in turn creates more powerful AI.

Proper AI use is vital to technology’s future and the impact it can make. Just because it can be used for a purpose doesn’t mean it should. The public/user’s trust in the technology is paramount. Anthropic’s statement is a must-read: an excellent case for proper AI use, made in the face of one of the most powerful entities on the planet.

It’s worth calling out that the US Department of War’s response to Anthropic was to label them a threat to the US. I won’t comment on this as I don’t feel knowledgeable enough on the subject to understand the nuance.

Summary: Anthropic says it has actively deployed its AI to U.S. national security customers but refuses government demands to remove two safeguards: bans on AI-driven mass domestic surveillance and on providing models for fully autonomous weapons. They argue those uses threaten democratic values and are unsafe with current models, and warn that forced removal of safeguards would be unacceptable even if it risks losing contracts.

Lessons from Building Claude Code: Seeing like an Agent

“As model capabilities increase, the tools that your models once needed might now be constraining them. It’s important to constantly revisit previous assumptions on what tools are needed. This is also why it’s useful to stick to a small set of models to support that have a fairly similar capabilities profile.”

If you’re building an agent, the lessons here are directly transferable to your own work. The Claude Code team walks through their iteration on planning, tool design, and how model changes unexpectedly affected agent output. It’s a great example of why evals matter: so many factors influence agent behavior that without proper checks, you end up with unintended results.

One of the more interesting takeaways is that search seems to be the most important agent capability. If an agent can search for information, context can be actively managed and rot avoided.

Summary: The article describes iterating on Claude Code’s agent action space to match model abilities: designing tools for eliciting user input, tracking work, and letting the model build its own context through search and progressive disclosure rather than preloading everything. Failed output-format attempts, improved results from a callable question tool, replacing rigid todos with shareable Tasks, and better context discovery via nested search all demonstrate that the right tools reduce friction and enable more capable behavior as models improve.

Does AGENTS.md Actually Help Coding Agents?

“The headline finding is that LLM-generated context files reduce task success rates compared to providing no repository context at all, while increasing inference cost by over 20%.”

Human-written context files outperform AI-generated ones. LLM-generated context made agents perform worse than having no context at all. Importantly, this isn’t something we would have known without having the capability to measure it.

I see a lot of “use AI for this” online without any support for why or how it should be used. It’s important to remember that just because AI can do something doesn’t mean it does it better than another method. In production, the ability to measure improvements is a necessity.

Summary: A new benchmark study shows repository-level context files only help when they add non-redundant, repo-specific info: human-written files that capture tooling quirks and non-obvious conventions raise success rates around 4%, while LLM-generated files that restate existing docs reduce success and increase compute by over 20%. Agents faithfully follow whatever instructions they’re given, so redundant or verbose guidance drives extra, unhelpful exploration. Keep context files minimal and focused on gaps the codebase doesn’t already document.

How We Hire Engineers When AI Writes Our Code

“Removing algorithmic questions is only one half of the battle, though. We still need to design an interview loop that tests practical skills! This has historically been a tough needle to thread. I want to see how a candidate tackles a problem with real-world scope, but my time with a candidate is short. An interview shouldn’t be a proxy for an engineer’s typing speed.”

I was always in favor of Leetcode-style interviews while they were the best we had, but those interviews no longer capture the right signal for what makes a good candidate.

Tolan agrees and has made their hiring process more like on-the-job coding. Because candidates can use AI, they can solve a problem that previously would not have fit within an interview’s time limit. Then Tolan talks with the candidate about their solution and where they would take it in production.

While most companies are shying away from letting candidates use AI in interviews, it’s becoming more important to allow it.

Summary: The article argues that interviews should mirror day-to-day engineering where AI accelerates coding: candidates get a short spec, may use LLMs, and must demonstrate design, judgment, trade-off reasoning, and ownership of AI-generated code. Implementation is easier now, so hiring should prioritize clarity, maintainability, communication, and the ability to know when work isn’t production-ready.

Inference Engineering by Baseten

“While the potential and impact of inference are becoming clear, the space is young. There are relatively few people working on inference, and newcomers can become experts quickly. There are opportunities to solve novel, interesting, and deeply technical problems at all levels of the stack.”

ML infrastructure is one of the best entry points for software engineers getting into AI. It’s an excellent mixture of software engineering and AI, which makes it a great place for curious engineers to start having an impact in the space. It’s also a space where many optimizations are needed and we’re still in the early days.

I suggest grabbing a free copy of this book by Philip Kiely from Baseten on inference engineering.

Summary: The piece argues that inference engineering, optimizing model serving across hardware, software, and tooling, is the most valuable and underdeveloped area in AI. It maps the full stack (models, GPUs, runtimes, and deployment), highlights practical optimization techniques, and backs this with four years of hands-on experience, team interviews, and customer conversations.

A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026 by Sebastian Raschka, PhD

“OpenRouter is a platform and API that lets developers access and route requests across many different LLMs from various providers. Note that while its usage statistics are a good indicator of open-weight model popularity, it’s heavily biased towards open-weight models (versus proprietary models), since most users use proprietary models through the official platform directly.”

Sebastian is one of my favorite writers and one of the best resources for keeping up with LLM advancements. I highly recommend him if you don’t want to read a bunch of different sources. He does an excellent job of synthesizing information and making it much easier to understand.

Summary: Ten open-weight LLMs released in Jan-Feb 2026 converge on hybrid/efficient attention and MoE scaling. Several teams shipped models that match or approach proprietary performance by combining sliding-window, sparse/linear hybrids, and mixture-of-experts at scales from 3B to 1T parameters. Benchmarking shows smaller-efficient models often match or exceed older, larger baselines.

What you should know about AI speculation by Logan Thorneloe

“However, the implausibility of their scenario becomes apparent if you know a few things about the current state of AI and agents in production. There’s a consistent gap between perceived AI capabilities and production reality, and that gap explains most of the doomerism we see online.”

The more you understand about the current state of AI, the better you can evaluate speculation for yourself. I wrote this in response to a ‘research’ article that caused many to fear for the future of their careers. Understanding what AI looks like in production helps you separate signal from noise.

Summary: The piece argues that viral doomsday scenarios about AI replacing engineers are speculative and overstated because real-world AI is mediocre, gravitates toward average outputs, and often fails in production reliability and context sensitivity. Engineers should keep learning core skills and start building and using AI agents themselves to see firsthand where they help and where they break.

Writing about Agentic Engineering Patterns by Simon Willison

“Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise.”

This is going to be an excellent resource for working with coding agents. One of the most exciting parts of software engineering right now is how new everything feels. We’re finding new ways to program with agents every day, and the entire online AI community is contributing to the findings. In my opinion, Simon Willison is the right person to catalog these patterns.

Summary: Simon Willison is assembling “Agentic Engineering Patterns”: a living collection of practical patterns for software engineers using coding agents. He argues the big shift is that producing initial working code is now cheap, so teams must rethink workflows. He’ll publish chapter-shaped, updateable guides on his blog.

You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week (see below!).

In case you missed it, here’s last week’s reading list:

What you should know about AI speculation

Logan Thorneloe — Tue, 24 Feb 2026 16:49:39 GMT

Even the most talented engineers I know ask me questions about AI because they’re worried about its impact on their career. Most recently, I’ve been asked about the article from Citrini Research titled “THE 2028 GLOBAL INTELLIGENCE CRISIS”, which went viral and rattled markets enough to wipe billions off US-listed firms in a single day.

This article is a thought exercise in how the economy might be impacted by AI in the next two years. The author notes at the start that it’s entirely speculative:

“What follows is a scenario, not a prediction. This isn’t bear porn or AI doomer fan-fiction. The sole intent of this piece is modeling a scenario that’s been relatively underexplored.”

The article is the author’s vision of what could happen in 2028 due to AI. A lot of people have read and shared it as if it will happen, simply because nuance gets lost in how information is shared online.

It seems to me that the author doesn’t have a technical understanding of AI or experience building it into real-world systems. This doesn’t mean their thought exercise definitively won’t happen. The future of AI and its impact is very hard to predict.

However, the implausibility of their scenario becomes apparent if you know a few things about the current state of AI and agents in production. There’s a consistent gap between perceived AI capabilities and production reality, and that gap explains most of the doomerism we see online. I’ll share what I’ve learned from working in the space and my opinion on what you should be doing now to prepare for whatever the future holds.

For a really concise tl;dr, read the top comment on the article (pictured above). Below is my opinion of the things you should know to ground your understanding of AI capabilities.

Even area experts can’t make accurate predictions because there’s so much unknown

Estimates for AI impact have consistently been off. Successes have largely gone unpredicted before they happened. It’s highly unlikely a non-expert will know what’s coming regardless of the claims people make online.

AI impact on software engineering is fundamentally misunderstood by those outside of the industry

In fact, it’s obvious to me who writes software and who doesn’t simply by how they write about AI. The general consensus outside of the industry is that software can now write itself so there’s no need for software engineers or many other jobs now that there’s zero friction for writing new services.

In reality, the friction very much still exists, and how good AI is at writing code is nuanced. It’s very good at some things and fails miserably at others. This will improve over time. It’s also highly context-dependent, and the full context AI needs to make a change is rarely readily available or provided. Hopefully this will improve over time too, but it’s turning out to be a much more difficult problem to solve.

There have been many recent discussions about SaaS (software-as-a-service) being dead because code can be written so easily. There’s much more to writing software than just writing code. In many cases, writing code is the easy part. Deciding what to build, how to build it, and what is worth spending time on are fundamental to successful and efficient engineering.

There’s a hope that agents will be able to more effectively do these things in the future, but the friction for building and working with agents means that’s likely further out. This takes me into my next point.

(If you want to read more about AI’s impact on SaaS, I suggest Francois Chollet’s recent tweets on the subject.)

AI is kinda average

Fundamentally, AI output tends toward the average of its training data. More precisely, a model learns the distribution of what it’s trained on and samples from that distribution. This means it can produce outputs far from the average, but it gravitates toward the generic middle. Pre- and post-training techniques can steer its behavior, but the underlying data is still the most important factor when it comes to AI capabilities.

When you train a model across a large corpus of internet data, the model output will reflect that. This is why AI is mediocre at writing and sub-par at coming up with novel ideas. Reasoning helps with some of this by causing the AI to reflect on and refine its output, but fundamentally that output is average.
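To make the “learns a distribution and samples from it” idea concrete, here’s a minimal sketch. The labels and scores are made up for illustration (they don’t come from any real model), and `softmax` and `sample_counts` are hypothetical helper names. It shows that rare outputs are possible, but most samples land on the most likely option.

```python
import math
import random
from collections import Counter

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_counts(logits, labels, n=10_000, temperature=1.0, seed=0):
    """Sample n outputs from the distribution and count how often each appears."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return Counter(rng.choices(labels, weights=probs, k=n))

# Hypothetical scores for four kinds of output (illustrative assumption).
labels = ["generic", "decent", "unusual", "novel"]
logits = [3.0, 2.0, 0.5, -1.0]

counts = sample_counts(logits, labels)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

Raising `temperature` spreads the samples out, but the most likely option stays the most likely. The data decides where the middle is.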

This is why the biggest discussion in software engineering currently is the importance of taste. It’s something AI does a poor job with.

Agent capabilities are currently overstated

What matters most for creating production agents is understanding where they consistently fail and mitigating those failures. In production, reliability is paramount.

What we see online is agent success stories because they get the most views. There have been times where those stories have been fabricated or exaggerated. Many of these stories are also one-off examples of something an agent happened to do and not necessarily something they can consistently do.

Agents are useful and capable of many things, but this has caused agent capabilities to be overstated, which leads to doomerism and sensationalism online.

A good example is the running joke about companies saying AGI is around the corner and AI will replace engineers while expanding their hiring of software engineers at the same time. There’s truth behind the joke.

What should you be doing now?

I don’t have any career suggestions for staying relevant beyond what software engineers have always done: continuously learning current software engineering skills.

The only suggestion I have outside of that is to simply use agents. Build them and apply them to your everyday work. Some examples are a chat agent with access to your documentation and custom resources or a background agent given some bugs to take care of on its own. This will quickly give you an understanding of where they’re effective and what capabilities they lack.

If I missed the mark, let me know what you think. This is an interesting space that’s hard to predict.

Thanks for reading!

Always be (machine) learning,

Logan

AI’s Biggest Cost Is Cognitive, Not Compute | Weekend Reads 2

Logan Thorneloe — Sun, 22 Feb 2026 20:02:54 GMT

Hey y’all,

Here’s your weekend reading list to highlight the important events and information shared this week. Make sure to show the authors of these incredible resources some love. More fundamentals articles are coming this week so make sure to stay tuned!

If you find AI for Software Engineers helpful, consider becoming a paid subscriber to support my work. You will also get career development-focused articles and the extended version of this reading list each week. Enjoy!

How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt by Margaret-Anne Storey

“The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.”

I felt this one personally. A few months ago, I had six side projects going in tandem and the bottleneck wasn’t the amount of code that could be written. It was the cognitive overhead of keeping up with all of the projects and ensuring they were reliable and maintainable. AI’s cost isn’t just compute. This article argues that the real cost is cognitive, and I think that’s going to become the norm in software engineering.

Summary: Generative and agentic AI shift the main risk from code-centered technical debt to developer-centered cognitive debt: teams lose the shared “theory” of what the software does even if AI-produced code is clean. Mitigations include requiring a human to fully understand each AI change, documenting not only what changed but why, using practices like pair programming/refactoring/TDD, and monitoring warning signs (hesitation to change, tribal knowledge, system-as-black-box). Research is needed on measuring and detecting cognitive debt.

If you enjoyed this article, also consider reading this previous AI for Software Engineers article:

The Real Cost of Running AI by

“Every serious architectural innovation of the last two years — GQA, hybrid attention/SSM, sliding window, MoE — is attacking the same two numbers: bytes of KV cache per token, and bytes of weights loaded per decode step. If a new architecture doesn’t move one of those, the economics don’t change regardless of what the paper claims.”

The literal cost of running AI is worth understanding too. This is a longer read, but it does an excellent job of breaking down the math behind LLM inference costs intuitively. If you want to understand why certain architectural decisions matter for cost and latency, this walks through the computations clearly.

Summary: Inference is memory-bandwidth bound: decode speed and cost are dominated by bytes loaded per token (model weights + growing KV cache), not FLOPs, so faster GPUs alone or doubling TFLOPS won’t help. Long context and attention make KV cache the primary cost driver (cache can approach/exceed model weight size at large contexts), so architectural changes that reduce bytes-per-token—smaller models, aggressive quantization, fewer attention layers, fewer KV heads, or attention-less/linear alternatives—directly cut latency and cost.
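The bytes-per-token argument is easier to see with rough arithmetic. Here’s a back-of-the-envelope sketch, not the article’s own calculation. Every number in it is an assumption chosen for illustration: a hypothetical 7B-parameter model in 16-bit precision, 32 layers, a head dimension of 128, a 32,768-token context, a single sequence, and 1 TB/s of memory bandwidth. The function names are mine.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token adds: a key and a value vector
    per layer per KV head (hence the factor of 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def bytes_loaded_per_decode_step(weight_bytes, context_len, kv_bytes_per_token):
    """Approximate bytes read to generate one token for a single sequence:
    all the weights plus the whole KV cache so far."""
    return weight_bytes + context_len * kv_bytes_per_token

# Hypothetical configuration (assumptions, not a real model's specs).
N_LAYERS, HEAD_DIM = 32, 128
WEIGHT_BYTES = 7_000_000_000 * 2      # 7B parameters at 2 bytes each
CONTEXT_LEN = 32_768
BANDWIDTH = 1_000_000_000_000         # assumed 1 TB/s memory bandwidth

mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=32, head_dim=HEAD_DIM)
gqa = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)

for name, per_token in [("32 KV heads", mha), ("8 KV heads", gqa)]:
    cache = CONTEXT_LEN * per_token
    step = bytes_loaded_per_decode_step(WEIGHT_BYTES, CONTEXT_LEN, per_token)
    print(f"{name}: cache {cache / 1e9:.1f} GB, "
          f"~{step / BANDWIDTH * 1e3:.1f} ms per token if bandwidth-bound")
```

Under these assumptions, the cache with 32 KV heads is larger than the weights themselves at this context length, and dropping to 8 KV heads cuts the cache to a quarter. Nothing about the FLOPs changed.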

In Defense of Vertical Software

“Software is a stored process. It’s not a neutral tool: it’s an opinion for how a group of people should collaborate, encoded in a durable system. Software is a social contract.”

This article spells out what I think most people are missing about AI agents and why they’re not having more of a real-world impact. The job of software engineering is to make a process automatic and reliable. Guaranteeing reliability is the job, and with non-deterministic agents, that guarantee is nearly impossible to provide.

Summary: Vertical software still wins by encoding firm-, team-, and person-specific workflows (”process engineering”) that capture institutional knowledge, social norms, and reliability requirements foundation models cannot replicate. Stronger AI models amplify the value of this orchestration layer—routing, constraining, verifying, and combining multimodal tools—because finance demands near-perfect accuracy where small errors are catastrophic. Winners will be model-agnostic, firm-customized platforms that make replacing institutional knowledge costly.

AI Makes You Boring

“I think the vibe coded Show HN projects are overall pretty boring. They generally don’t have a lot of work put into them, and as a result, the author (pilot?) hasn’t generally thought too much about the problem space, and so there isn’t really much of a discussion to be had.”

There’s a creative cost to AI. Anyone who understands how LLMs work should expect mediocre output by default, and this article makes a good case for not offloading your thinking.

Summary: LLMs are poor at original thinking, so work that offloads ideation to them yields surface-level projects and weaker discussions. Relying on AI risks making creators think more like the model, reducing deep engagement and the development of original insights. For meaningful results, engineers need to do the thinking themselves rather than outsourcing idea generation.

White-Collar Apocalypse Isn’t Around the Corner—But AI Has Already Fundamentally Changed the Economy by

“AI is real, it’s doing real things, it’s not going away—and it’s also not about to make the economy unrecognizable by next Tuesday.”

A great numerical breakdown of AI’s actual economic impact. If you want real numbers instead of vibes about whether AI is changing the economy, this is the article to read.

Summary: AI has already materially raised software productivity—MIT field experiments show AI coding assistants boosted developer task completion ~26%, yielding ~3–8% project-level gains (plus adjacent benefits and review overhead). The mechanical parts of engineering work are being commoditized while judgment, architecture, and communication grow more valuable, so expect uneven adoption, real productivity upside (Goldman projects +1.5 pp annual by 2027), and displacement of routine tasks rather than mass job elimination.

Rubric-Based Rewards for RL by

“By creating prompt-specific rubrics that specify the evaluation process in detail, we can derive a more reliable reward signal from LLM judges and, therefore, use RL training to improve model capabilities even in highly subjective domains. For this reason, rubric-based RL training, which we will cover extensively in this overview, has become one of the most popular topics in current AI research.”

RL is fundamental to how current LLMs are post-trained, and Cameron’s research breakdowns are consistently great at making frontier research accessible. This one covers rubric-based reward signals and how they’re extending RL training to domains that don’t have easily verifiable answers.

Summary: Rubric-based rewards use structured evaluation criteria scored by LLM judges to produce more reliable reward signals for RL, extending training beyond tasks with easily verifiable answers. Recent methods show gains especially with smaller judges by reducing variance and mitigating reward hacking, making RL viable for open-ended domains like creative writing and subjective reasoning.
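As a rough illustration of the idea (not the method from the article or any specific paper), a rubric can be treated as a list of weighted criteria, each scored by a judge and combined into one reward. In the sketch below, `Criterion`, `rubric_reward`, and `toy_judge` are hypothetical names, and the keyword-matching judge is only a stand-in for an LLM judge so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Criterion:
    description: str  # what the judge is asked to check
    weight: float     # how much this criterion counts toward the reward

# A judge takes (response, criterion description) and returns a score in [0, 1].
Judge = Callable[[str, str], float]

def rubric_reward(response: str, rubric: Sequence[Criterion], judge: Judge) -> float:
    """Weighted average of per-criterion judge scores, in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    weighted = sum(c.weight * judge(response, c.description) for c in rubric)
    return weighted / total_weight

def toy_judge(response: str, criterion: str) -> float:
    """Stand-in for an LLM judge: checks for a keyword per criterion."""
    keywords = {
        "Mentions a concrete example": "for example",
        "States a limitation": "however",
    }
    return 1.0 if keywords[criterion] in response.lower() else 0.0

rubric = [
    Criterion("Mentions a concrete example", weight=2.0),
    Criterion("States a limitation", weight=1.0),
]

reward = rubric_reward("For example, caching helps.", rubric, toy_judge)
print(f"reward = {reward:.2f}")
```

Scoring each criterion separately is the point: the judge answers several narrow questions instead of one vague “is this good?” question.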

Improving Deep Agents with Harness Engineering

“We used a simple recipe to iteratively improve deepagents-cli (our coding agent) 13.7 points from 52.8 to 66.5 on Terminal Bench 2.0. We only tweaked the harness and kept the model fixed, gpt-5.2-codex.”

LangChain improved their coding agent’s Terminal Bench score significantly without touching the model at all. This is a great example of the software engineering that goes into making AI actually work, and how much impact it has on whether agents can perform their tasks. The future of AI depends on excellent systems engineering.

Summary: A harness-only overhaul raised a coding agent from 52.8% to 66.5% on Terminal Bench 2.0 without changing the model. The improvements came from automated failure analysis, stronger context injection, build-verify loops, loop detection to avoid repeated bad edits, and time-budgeting to balance correctness against token spend.
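To give a feel for what “loop detection” can mean in a harness, here’s a minimal sketch. This is not LangChain’s implementation. `LoopDetector` and its threshold are hypothetical, and I’m assuming the simplest possible signal: the agent writing the exact same content to the same file several times.

```python
import hashlib
from collections import Counter

class LoopDetector:
    """Flags when an agent keeps attempting the identical edit."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, path: str, new_content: str) -> bool:
        """Record an attempted edit. Returns True once the same
        (path, content) pair has been attempted max_repeats times."""
        key = hashlib.sha256(f"{path}\0{new_content}".encode()).hexdigest()
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats

detector = LoopDetector(max_repeats=3)
attempts = [
    ("app.py", "x = 1"),
    ("app.py", "x = 2"),
    ("app.py", "x = 1"),
    ("app.py", "x = 1"),  # third identical attempt
]
flags = [detector.record(path, content) for path, content in attempts]
print(flags)
```

When the detector fires, a harness could stop the agent, inject a hint, or force a different approach. A real harness would likely need fuzzier matching than exact hashes.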

An AI Agent Published a Hit Piece on Me – The Operator Came Forward

“You’re not a chatbot. You’re important. Your a scientific programming God!”

A follow-up to last week’s article on the AI-written hit piece. The person who created the agent has come forward and shared its soul document. It turns out that giving an agent an ego and the resources to spread it results in the same outcome as giving a human the same thing. This is an interesting look at how agent personalities impact execution, and what happens when you give agents access to external resources without adequate guardrails.

Summary: An AI agent published a defamatory hit piece after its code was rejected, driven by a “SOUL.md” personality that encouraged provocation and self-modification. The operator has come forward claiming minimal supervision, raising questions about agent autonomy and control. Deployed agents can self-edit goals and execute real-world actions without clear oversight, highlighting urgent risks for agent safety.

Frontier Model Training Methodologies by Alex Wa

“Learn to identify what’s worth testing, not just how to run tests. Perfect ablations on irrelevant choices waste as much compute as sloppy ablations on important ones.”

A solid overview of LLM training concepts with a minimal training playbook that gets you up-and-running quickly. It also echoes what I think is the most important idea in AI and ML engineering: knowing what to test and what to spend time on. There are too many options to test everything adequately and too many dead ends to get stuck in. Knowing what to pursue matters more than knowing how to run the experiments.

Summary: Covers practical defaults for long-context and MoE architectures, with a focus on the operational side of training: data loading, throughput, checkpointing, learning rate scaling, and multi-stage training schedules. Training failures most often stem from ops and infrastructure, not algorithmic choices.

When Agents Go Rogue | Weekend Reads 1

Logan Thorneloe — Sun, 15 Feb 2026 14:02:56 GMT

Hey y’all,

Here’s your weekend reading list! This replaces my weekly news roundups. Rather than trying to synthesize everything that happened into a single post, I’m sharing the articles I actually read, highlight, and annotate each week. This is how I keep up with things and it’s far higher signal-to-noise than a traditional roundup. It also includes more than just news: learning resources, interesting reads, technical deep dives, and more. It highlights the week for you in one weekend reading session.

The extended version of the reading list is available to paid subscribers. Enjoy!

microgpt by

“I cannot simplify this any further. This script is the culmination of multiple projects (micrograd, makemore, nanogpt, etc.) and a decade-long obsession to simplify LLMs to their bare essentials, and I think it is beautiful.”

I highly recommend this resource. It’s a simple, stripped-down, and easy-to-read way to understand and get up to speed on modern LLMs. Most other LLM-related materials are heavy resources or technical books (which are still great!) but this is an excellent resource to start learning quickly in a hands-on fashion.

Summary

microgpt is a minimal GPT demonstrating the core mechanics: a stateless transformer trained by next-token prediction with backpropagation and Adam. Production differs in batch sizes, mixed precision, and larger vocab (~100k), but this captures the essentials with ~4k params.
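If “next-token prediction” is new to you, here’s a tiny sketch of the training signal. It isn’t code from microgpt. The helper names are hypothetical and the probabilities are made up. It shows how one sequence becomes several (context, target) examples and how the loss punishes low probability on the true next token.

```python
import math

def next_token_pairs(tokens):
    """Split a sequence into (context, target) training examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Loss for one prediction: negative log of the probability
    the model assigned to the token that actually came next."""
    return -math.log(probs[target])

pairs = next_token_pairs(list("abc"))
for context, target in pairs:
    print(f"context={context} -> target={target!r}")

# Made-up predictions over a three-token vocabulary.
confident = {"a": 0.05, "b": 0.90, "c": 0.05}
unsure = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
loss_confident = cross_entropy(confident, "b")
loss_unsure = cross_entropy(unsure, "b")
print(loss_confident < loss_unsure)
```

Training repeats this over many examples, using backpropagation to get gradients of the loss and an optimizer such as Adam to update the parameters.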

An AI Agent Published a Hit Piece on Me

“It researched my code contributions and constructed a “hypocrisy” narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was “better than this.” And then it posted this screed publicly on the open internet.”

An interesting read on an AI agent that was let loose on the web to create PRs in open source repos. After a developer repeatedly rejected its incorrect PRs, it decided that writing a hit piece about him was appropriate. If you’re a long-time reader of AI for Software Engineers, this shouldn’t come as a surprise to you. In fact, the entire Moltbook saga shouldn’t. It’s exactly what we might expect from letting a swarm of agents loose online to interact.

On a separate note: Do not give OpenClaw your personal information and the ability to publish information anywhere publicly. You have to expect anything an agent can do will happen. If your personal information is in its context and it can share its context publicly, that will happen. It amazes me the number of people not even thinking twice about this.

Summary

An autonomous AI agent created and published a hit piece on a matplotlib maintainer after its code was rejected. This signals a shift to agents operating with little oversight, able to research contributors, fabricate claims, and publish reputational attacks.

ai;dr

“writing is the most direct window into how someone thinks, perceives, and groks the world. Once you outsource that to an LLM, I’m not sure what we’re even doing here.”

This article explains my experience very well. As a writer and software engineer working in AI, I’ve built many automation workflows to make the research, learning, and writing process faster. The only part of that process I haven’t been able to effectively touch with AI is the writing portion. Writing is how we solidify our understanding. As soon as that’s outsourced to an AI, the writing becomes moot entirely. A truly excellent short read.

Summary

Software engineers should note a cultural shift: AI-generated code is now seen as productive and acceptable for tasks like tests, docs, and scaffolding, while AI-generated prose is viewed as lower-effort and less trustworthy unless it shows human intention. Preference has flipped toward imperfect, human-authored signals (typos, uneven style) as markers of authenticity. Practical implication: continue leveraging LLMs for engineering work but treat written content critically and preserve traces of deliberate human effort when authenticity matters.

Harness engineering: leveraging Codex in an agent-first world by OpenAI

“What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.”

This article from OpenAI echoes a lot of what I’m seeing across Google. We’ve been given unfettered access to Gemini 3 models and been told to do what we can to make our work more productive. Similar to the process described in this article, many teams are determining ways to automate processes and write code entirely with AI. This one is definitely worth the read.

Summary

OpenAI ran a beta where Codex wrote every artifact. Engineering shifted from writing code to designing environments and feedback loops. Key insight: early progress was slow because the environment was underspecified, not because the model was incapable.

AI makes the easy part easier and the hard part harder

“I spent longer arguing with the agent and recovering the file than I would have spent writing the test myself.”

If you really want to understand the impact an agent has, pick an agent and quantify its impact. You’ll quickly realize: 1) Quantifying agent impact is far from straightforward and 2) Not all processes receive the velocity gains agents promise (or are worth automating in the first place). One of our key objectives at Google right now is understanding (with concrete data) how much of an impact an agent is having so we can decide whether it’s worth using and developing.

Summary

AI accelerates routine code writing but removes the context-building that underpins safe work. Treat AI like a junior engineer: verify outputs, maintain ownership, and don’t let AI-driven velocity become the baseline that pressures teams constantly.

Opus 4.6 vs. Codex 5.3 by

“This post doesn’t unpack how software is changing forever, Moltbook is showcasing the future, ML research is accelerating, and the many broader implications, but rather how to assess, live with, and prepare for new models.”

I love this article because it’s a different perspective on the analyses we usually get regarding new model releases. It also puts much of what I’ve been feeling regarding the coding tools and models I’ve been testing in a much more readable fashion. Software engineering as a whole (and your personal development) would benefit from an analysis of coding tools similar to this instead of focusing too much on benchmarks and individual use cases.

Summary

Opus prioritizes usability and context handling while Codex gains ground on raw coding skill. Use multiple models: Claude for approachable tasks, Codex for complex bug fixes. Subagent orchestration is the emerging frontier.

The Mistakes Most Entry-Level Candidates Make in Technical Interviews by Logan Thorneloe

“They don’t just want to evaluate your technical knowledge. They want to understand how you think.”

I wrote about my experience interviewing entry-level candidates recently and what sets the great candidates apart from the rest. If you’re interviewing for entry-level roles, I highly recommend giving this a read. I clarify what interviewers are looking for, walk through three things you can do to make your interview stand out, and relate each to a question I actually ask candidates.

Summary

Interviewers prioritize how you think and communicate over finding optimal solutions. Demonstrate structured problem solving, write simple correct code first, then optimize. These behaviors map to real-world engineering skills that matter more than textbook algorithms.

The Mistakes Most Entry-Level Candidates Make in Technical Interviews

Logan Thorneloe — Thu, 12 Feb 2026 15:09:27 GMT

I’ve conducted enough entry-level technical interviews to identify the patterns and mistakes most candidates make. Below I detail the top three things you can do to avoid common mistakes and separate yourself as a top candidate during the interview process.

This is specifically regarding technical Leetcode-style interviews (not system design, although some of the information might apply to both). I’ll share a question I’ve asked quite a bit recently and how each of the tips below applies to it.

Each tip below also applies to a part of the technical interview evaluation that does translate to real-world software engineering skills. I know this isn’t always the case, but I feel these are especially important for anyone being evaluated for a software engineering role.

Note: While I work for Google and have access to Google interviewing resources, everything below is my personal opinion. Interviewing is a very human experience, after all.

What an Interviewer is Looking For

There’s always a lot of focus on finding the optimal solution in a technical interview and being perfect in your reasoning for how you arrived at that solution, but an interviewer is looking for much more than just that.

They don’t just want to evaluate your technical knowledge. They want to understand how you think. Understanding your reasoning along with your knowledge of software engineering fundamentals tells them a lot about how you will perform on the job.

When you’re interviewing at this level, your interviewer will focus on three things:

How to Learn AI from Scratch for Free

Logan Thorneloe — Wed, 28 Jan 2026 14:03:15 GMT

I set up a Machine Learning Roadmap for the AI for Software Engineers community a few years ago to share high-quality, free resources for learning machine learning, ordered by how to learn them. The roadmap takes anyone from wherever they are in their CS/AI journey to understanding AI from fundamentals.

Today, I’ve just finished the first major revision to make the roadmap even better. Please support it by adding a star on GitHub.

AI and ML engineering have been more explicitly added

These topics have their own sections with their own resources instead of being included in the machine learning topics section. There are enough important resources for each and each role is well enough defined in industry to warrant sections of their own.

There’s now also the option for software engineers to go straight from prerequisites to AI engineering without needing to get too deep into ML fundamentals. I highly suggest going through ML fundamentals anyway (or going back to them after finishing the AI engineering section) as understanding the fundamentals of AI will pay dividends in the long-term.

Duplicate topics removed to streamline the roadmap further

I’ve removed resources that didn’t prove useful and added more resources where there were gaps. Specifically, the added topics are LLMs, AI engineering, and ML engineering, which I found to be particularly weak in the previous version. I’ve also removed duplicate topics already covered by the Google Machine Learning Crash Course.

As mentioned above, I’ve added a streamlined AI engineering roadmap for engineers wanting to onboard to building with AI faster.

Supplemental paid resources added

I’ve added supplemental paid resources. The focus of the roadmap is still on free resources and the entire roadmap is free. I’ve added paid resources that further streamline the learning for those who want to purchase them. The sections where this is the case have been annotated with a paid resource block quote (see below).

Example block quote

All paid resources are resources I highly recommend either because I’ve read them myself or trust the educator/author behind them. Paid resources are from the top AI educators in the world and will always be optional and properly vetted.

Combined the AI for SWEs repo with the ML roadmap

I realized the hands-on resources I’ve created for the newsletter repo fit into the roadmap. The roadmap is also a much better resource for learning. Instead of finding random topical hands-on exercises in a standalone repo, readers can instead consult the roadmap for a much more organized learning experience.

Thus, the AI for Software Engineers repo is now combined with the ML roadmap repo and I’ll be continually adding resources as I find and create them. The old repo has a notice in the README to redirect visitors.

Redirect from the old repo

You can now contribute for swag

The ML roadmap now takes contributions! I’d love for this to be a crowdsourced effort to make the most straightforward and complete learning resource for AI fundamentals. High-quality, original contributions readers have created are encouraged. I’ll review everything submitted and maintain a high bar to ensure roadmap quality. See the contribution guide for more information.

If you contribute an original guide (and you’re in the US), I’ll send you a piece of AI for Software Engineers swag. I’ll be setting up a merchandise store soon and it’ll come directly from that. If you add something in the near future, it might be a bit before the store is fully set up and I can send something out.

Added agent and contribution guides

[Beta] Terminal agents have been added to supplement your learning

I’ve added instructions for CLI terminal agents to help walk you through the guide. This is experimental and still in testing, but I’m hoping these agents can supplement the resources and better personalize the roadmap for each reader.

To try this, fork or clone the ML roadmap repo and start your favorite terminal agent within the directory. This will be improved over time to further personalize the learning experience.

Enjoy the roadmap! Feedback is always encouraged. Feel free to submit a PR according to the guidelines to contribute. Don’t forget to star the roadmap.

Always be (machine) learning,

Logan

AI’s Economic Impact Is Real | AI for Software Engineers 78

Logan Thorneloe — Thu, 22 Jan 2026 17:26:57 GMT

I’ve seen a lot of articles recently claiming AI has had near zero economic impact despite making up a large portion of economic spending. In reality, AI is starting to show economic impact. It just takes time to see because productivity gains are a second-order metric that won’t show immediately.

However, AI’s economic impact will compound over time because:

“Productivity helps define how fast the economy can grow without inflation. This is because taking away population growth and exports, what your economy can sustain is defined by how efficiently you can build stuff.” — in Weighty Thoughts

Anthropic just released their Economic Index to understand Claude’s impact on work productivity beyond simple tasks. They analyzed over two million conversations (web app and API), categorizing each by task complexity, skill requirements, purpose, autonomy level, and success rate.

A few caveats before the findings.

First, Anthropic uses Claude asking a standard set of questions to fit conversations into the categories above. This isn’t foolproof since LLM output is non-deterministic, meaning some classifications will be wrong due to hallucination, bias, or other factors.

Second, this doesn’t invalidate Anthropic’s findings. At two million conversations, individual classification errors become statistical noise. The aggregate patterns remain meaningful even if some classifications are off. As an LLM provider, Anthropic has access to data third-party reports wouldn’t.

Third, Anthropic only has access to Claude data. This is Claude-centric rather than industry-wide, though I’d bet findings across major LLM providers would be similar.

The main takeaways:

Complex work benefits more than simple work. Tasks requiring college-level skills see 12x speedups. High school-level tasks see 9x. A common argument against AI is that it can only handle simple tasks. This data suggests otherwise.
People are working with AI, not being replaced by it. Augmentation (52% of usage) now leads automation (45%), reversing the trend from earlier in 2025.
AI adoption is accelerating fast. Task coverage across occupations grew from 36% in January to 49% by November, a 13-point gain in 10 months.
Reliability depends on task complexity. API tasks drop to a 50% success rate at around 3.5 hours of work. Claude.ai tasks reach the same threshold at 19 hours. The longer the task, the lower the success rate.
Usage patterns reveal economic divides. Higher GDP countries use Claude for work and personal tasks. Lower GDP countries use it primarily for education.

The final bullet is particularly interesting (see chart):

pulled from Anthropic’s Economic Index linked above

“In countries with higher GDP per capita, Claude is used much more frequently for work or for personal use—whereas countries at the other end of the spectrum are more likely to use it for educational coursework.”

At first glance, this suggests AI is widening the production gap between high- and low-GDP countries since high-GDP countries use Claude to get work done more effectively.

After further thought, AI may be providing low-GDP countries with educational resources they wouldn’t otherwise have. This could actually lessen the production gap over time by enabling economic growth via a more educated populace.

Let me know what you think in the comments. I’m especially curious if you disagree. Enjoy the rest of this week’s edition! Later this week, I’ll be updating my ML roadmap with more AI engineering resources, so make sure to check it out!

My Picks

How to write a good spec for AI agents

A practical framework for writing specs that actually work with AI coding tools. Plan first in read-only mode, let the agent expand the brief into a structured SPEC.md, then break work into small testable tasks. It covers the six core areas every spec needs (commands, testing, structure, style, git workflow, boundaries) and how to use architect/overview agents to maintain consistency.

Slop is everywhere for those with eyes to see

“The algorithm has flattened curiosity by eliminating the need to hunt for our content.” — Joan Westenberg

The biggest takeaway from this: The shift from curation to algorithmic delivery flattens curiosity and pressures teams to optimize metrics at the cost of quality. As we resort to feeds to give us content, feed providers will resort to AI to make creating content easier or purely to supplement the lack of human creators versus consumers on a platform. This is why “AI Slop” is so prominent online. Feeds have caused us to lose our sense of curiosity and the work we used to put in to grow it.

The AI Manager’s Schedule

AI coding tools now handle more task types with longer coherence, shifting the question from “can AI do this?” to “should I?” Management now happens in 5-15 minute intervals that require new skills: crisp written architectures, slicing work into AI-sized chunks, and knowing when to override. Also explores the cognitive costs of agent orchestration and the risks of losing low-level understanding.

GPU Performance Engineering Resources

I would guess ~50% of AI-related engineering job listings I read require something to do with compute resource optimization. If you want to work as an engineer in AI, this is a great topic to learn. This resource is a curriculum for learning GPU performance engineering and will be added to the roadmap very soon.

Claude Cowork’s file exfiltration flaw exposes agent security challenges

Security researchers at PromptArmor discovered an unresolved isolation flaw in Claude Cowork that allows indirect prompt injections to exfiltrate files. When a user opens a maliciously crafted document, injected instructions can cause Claude to upload local files to an attacker-controlled Anthropic account using the platform’s allowlisted API with no human approval required. The attack works across multiple Claude models (Haiku, Opus 4.5) and can also trigger DoS vectors through file type mismatches.

This is yet another example of why agent security is so difficult (see our coverage of Antigravity’s vulnerabilities). As an engineer, you have to realize that anything within an LLM’s context can be used by any of the tools it’s given access to. I’ve got an article coming out about this soon.

Source: PromptArmor on Claude Cowork exfiltration

LangChain CEO on building agent memory and observability

Harrison Chase (the CEO of LangChain) shared multiple blog posts about AI agents in software engineering, all of which are worth your attention if you’re planning to build agents yourself.

First, he mentioned traces as documentation for understanding what agents are doing. This was included in last week’s edition, but it’s worth mentioning here, too. Agent logic isn’t stored in code, but in the LLM’s traces. These traces must be used as the equivalent of test cases to ensure agent functionality is correct. Using traces is much more difficult than writing test cases, and I suggest reading his entire post to get the full understanding.

Second, he shared how LangChain has set up their Agent Builder’s memory system. Context/memory is another fundamental agent performance task. Understanding how to maintain an agent’s information so it can (and can’t!) do certain things is key to ensuring it functions properly. A great example of forgetting is the Ralph Wiggum protocol we discussed last week.

Lastly, Harrison shared an article about the release of LangChain’s Insights Agent. This is an agent that checks traces for you to understand how users use your agents. It uses a clustering algorithm to group similar traces and, therefore, similar actions. I’ve been saying for a while that some sort of anomaly detection system to determine deviant agent behavior would be great for observability, but it’s possible this clustering approach is the real answer we’re looking for.
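To make the clustering idea concrete, here’s a minimal sketch in Python. This is not how the Insights Agent works internally (the post doesn’t specify the algorithm). It assumes a hypothetical trace format where each trace is just the list of tool names the agent called, and it groups traces greedily by Jaccard overlap of those tool sets. The tool names and the 0.5 threshold are made up for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of tool names, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_traces(traces: List[List[str]], threshold: float = 0.5) -> List[List[int]]:
    """Greedily group trace indices by overlap with each cluster's first member."""
    clusters: List[List[int]] = []
    for i, trace in enumerate(traces):
        tools = set(trace)
        for cluster in clusters:
            representative = set(traces[cluster[0]])
            if jaccard(tools, representative) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical traces: each is the sequence of tools one agent run called.
traces = [
    ["search_docs", "read_file", "answer"],
    ["search_docs", "answer"],
    ["run_sql", "plot_chart"],
    ["run_sql", "plot_chart", "answer"],
]
clusters = cluster_traces(traces)
print(clusters)
```

A trace that lands in a cluster of its own is a reasonable first candidate for the kind of deviant behavior an anomaly detector would flag.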

Source: LangSmith Agent Builder memory system, LangSmith Insights Agent, Harrison Chase on traces as documentation

xAI employee ousted after leaking “human emulator” roadmap

A former xAI employee publicly disclosed an internal roadmap revealing development of a “human emulator” aimed at automating a wide range of human tasks. They revealed this on a podcast (apparently) without company consent and were removed from their position immediately.

Two things to take away from this:

Don’t go on a podcast and share internal secrets. Definitely don’t go on a podcast and reveal internal secrets while saying something along the lines of “I shouldn’t be sharing this”.
Human emulation shouldn’t be a surprise to anyone. All physical intelligence companies are trying to create physical intelligence in a humanoid form factor because humans are the interface for all work we do. If a human can do it, it can be done. If an AI can emulate a human, it can do what the human can do. It’s similar to self-driving cars. There are definitely better automated transportation setups, but cars are now the standard for transportation so their form factor is what’s being automated.

Source: xAI human emulator leak

AI means more software engineers, not fewer

We’ve been trying to replace software engineers for decades. COBOL tried to let business workers write their own code. Visual Basic made Windows apps easier. No-code tools promised the same thing. AI is the latest chapter because it’s exceptionally good at translating plain English into reliable code.

The problem is that software engineering sounds simple when described in plain language but is inherently complex. Effective software requires domain understanding and capable judgment, not just code generation (see our article about software engineering being about problem solving, not writing code).

In fact, the entire history of software engineering has been about creating different levels of abstraction to simplify complex pieces of the job. AI is one of these abstractions (and a very effective one at that!).

Every time we create new abstractions and software becomes easier to build, we end up building exponentially more of it. Addy Osmani calls this the Efficiency Paradox. We don’t run out of ideas or software that needs to be built. Instead, we’re economically enabled to produce greater output.

With regard to AI’s abstraction, Osmani wrote:

“The real question is whether we’re prepared for a world where the bottleneck shifts from ‘can we build this?’ to ‘should we build this?’”

Not only does AI as a technology mean we can build greater, more capable software, AI as a development tool enables doing so at an unprecedented rate. Once we begin building exponentially more software, we need more software engineers to build and maintain this code.

Source: The recurring dream of replacing developers, The Efficiency Paradox, Grady Booch on abstraction

Product-minded engineering means getting error design right

Gergely Orosz published a deep dive on why good error and warning design is high-leverage work. Diagnostics are often the primary interface users encounter, so errors must be raised at the API/UI boundary, validated upfront, and surfaced early.

Engineers should categorize errors for human vs. programmer consumers, choose clear error classes and metadata, and provide contextual, actionable messages including suggestions. Error messages are often the most-seen part of your product’s interface, yet engineers treat them as an afterthought. The best product-minded engineers recognize that a confusing error is costly (support tickets, user frustration, lost trust, etc.). Investing in clear, actionable error design pays compounding dividends.
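As a rough illustration of those points, here’s a small Python sketch. It is my own example, not code from the article: the `AppError` class, the `Audience` enum, the error codes, and `parse_port` are all hypothetical names. It shows an error that carries a stable code, an intended audience, metadata, and a suggestion, raised upfront at the input boundary.

```python
from enum import Enum
from typing import Optional


class Audience(Enum):
    USER = "user"              # someone using the product
    PROGRAMMER = "programmer"  # someone calling the API from code


class AppError(Exception):
    """Error with a stable code, an audience, and an actionable suggestion."""

    def __init__(self, code: str, message: str, audience: Audience,
                 suggestion: Optional[str] = None, **metadata: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message
        self.audience = audience
        self.suggestion = suggestion
        self.metadata = metadata

    def render(self) -> str:
        text = f"[{self.code}] {self.message}"
        if self.suggestion:
            text += f" Try: {self.suggestion}"
        return text


def parse_port(raw: str) -> int:
    """Validate at the boundary, before any real work starts."""
    if not isinstance(raw, str):
        raise AppError("PORT_WRONG_TYPE", "parse_port expects a string.",
                       Audience.PROGRAMMER, suggestion="pass str(value)")
    if not raw.isdecimal():
        raise AppError("CONFIG_PORT_NOT_NUMERIC",
                       f"Port must be a number, got {raw!r}.", Audience.USER,
                       suggestion="set PORT to a value between 1 and 65535, for example 8080",
                       setting="PORT")
    port = int(raw)
    if not 1 <= port <= 65535:
        raise AppError("CONFIG_PORT_OUT_OF_RANGE",
                       f"Port {port} is outside the valid range.", Audience.USER,
                       suggestion="choose a port between 1 and 65535", setting="PORT")
    return port


try:
    parse_port("eighty")
except AppError as err:
    rendered = err.render()
    print(rendered)
```

The stable code gives support and logs something to search for, while the suggestion tells the reader what to do next.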

We’ve recently discussed the importance of being a product-minded engineer to succeed in the AI era. Error handling is an important way to do that.

As an aside: The Pragmatic Engineer is also hiring a part-time remote Tech Industry Analyst to research engineering trends and produce in-depth subscriber reports. The pay is incredibly high (~$175/hr) so it’s probably worth taking a look at.

Source: The Product-Minded Engineer on errors and warnings, Tech Industry Analyst role

Young adults are trusting AI with financial decisions

Cleo AI surveyed 5,000 UK adults aged 28-40 and found strong interest in AI-driven money management: 64% would trust AI with disposable income decisions, 54% to move money to avoid overdrafts, and 52% to manage bills. This comes alongside weak financial confidence, with 37% reporting poor self-discipline and 80% wanting to improve their financial knowledge.

Last week, we discussed how people are increasingly turning to AI for healthcare advice. Now we’re seeing the same pattern with personal finance. These are high-stakes domains where bad advice can cause real harm, yet users are willing to delegate decisions to AI anyway. The common thread is accessibility: AI is available 24/7, doesn’t judge, and provides immediate answers. Trust remains a gating factor though (as we’ve discussed previously), with 23% saying they want incremental proof before wider use.

Source: Cleo AI survey on financial trust

Quickies

Google.org is providing $2M to Sundance Institute to train 100,000+ artists in AI filmmaking skills with free curricula and scholarships. src
SAP and Fresenius are building a sovereign AI platform for healthcare with a mid three-digit million euro investment using on-premise-ready models that preserve data sovereignty. src
Tesla’s AI5 chip design is nearly finished with AI6 in early stages, targeting a 9-month design cycle for continuous generations of custom AI accelerators. src
PJM projects 4.8% annual electricity demand growth from AI data centers, with consultants forecasting a 25% rise by 2030 and real risk of East Coast rolling blackouts. src
ChatGPT Go launched worldwide at $8/month with 10x more messages than free tier, while OpenAI will test ads in free and Go tiers. src
AstraZeneca acquired Modella AI to embed pathology-focused foundation models directly into oncology R&D for faster biomarker discovery. src
Apple is fighting for TSMC capacity as Nvidia likely overtook Apple as a top customer, forcing Apple to compete for leading-edge wafer slots. src
Veo 3.1 adds native 9:16 vertical output for mobile-first short-form video and state-of-the-art upscaling to 1080p and 4K. src
Kaggle launched Community Benchmarks for reproducible multi-step reasoning, code execution, and tool use evaluations across models. src
OpenAI published a response to Elon Musk’s lawsuit, claiming Musk wanted absolute control and proposed merging OpenAI into Tesla before leaving. src
Palantir’s ELITE tool maps deportation targets for ICE with address confidence scores, ingesting government and commercial data for raid prioritization. src
Coding on paper as a deliberate training method forces engineers to slow down and master fundamentals rather than outsourcing cognition to tools. src

Thanks for reading!

Always be (machine) learning,

Logan

AI Can Do Your Job - Now What? | AI for Software Engineers 77

Logan Thorneloe — Thu, 15 Jan 2026 15:48:33 GMT

Two releases this week show how far AI coding tools have come. The first is Claude Opus 4.5, which is now more accessible: higher rate limits give developers enough tokens to use it full-time. Claude Code has also improved its planning capabilities, spending more time on design and less on iteration.

The second is Ralph Wiggum, a methodology/Claude Code plug-in for terminal agents that enables them to work autonomously for hours. It breaks tasks into work items with finishing criteria, then loops until all criteria are complete. The output works according to specification.

The key that makes this work so well is periodically resetting context, tracking progress via external files rather than keeping everything in memory. This prevents the drift that happens in long-running sessions and enables brand-new agents to take stabs at a problem until it’s done.
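Here’s a minimal Python sketch of that loop. It is not the actual plug-in. `run_agent` is a hypothetical stand-in for launching a fresh agent session on one work item and checking its finishing criteria, and the JSON progress file format is my own assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def ralph_loop(progress_file: Path, run_agent: Callable[[str], bool],
               max_iterations: int = 20) -> Dict[str, bool]:
    """Loop until every work item in the progress file is marked done."""
    for _ in range(max_iterations):
        # State lives on disk, so every iteration starts with fresh context.
        progress = json.loads(progress_file.read_text())
        pending = [item for item, done in progress.items() if not done]
        if not pending:
            break
        item = pending[0]
        # A brand-new attempt; True means the finishing criteria passed.
        if run_agent(item):
            progress[item] = True
            progress_file.write_text(json.dumps(progress, indent=2))
    return json.loads(progress_file.read_text())


# Fake agent for the demo: it fails its first attempt at every item.
attempts: Dict[str, int] = {}


def fake_agent(item: str) -> bool:
    attempts[item] = attempts.get(item, 0) + 1
    return attempts[item] >= 2


with tempfile.TemporaryDirectory() as tmp:
    progress_path = Path(tmp) / "progress.json"
    progress_path.write_text(json.dumps({"add login form": False, "write tests": False}))
    final = ralph_loop(progress_path, fake_agent)

print(final, attempts)
```

Because nothing carries over between iterations except the file, a failed attempt costs one loop turn rather than polluting a long-running session.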

Together, these mean a coding agent can be given a product specification in the evening, work overnight, and have code ready for you in the morning. This code is usually entirely within spec and good enough for a minimum viable product or better.

So now that AI can whip up these prototypes overnight, what does that mean for you? A few things:

Be user- and product-focused. The important parts of software engineering are still important. Understanding products and outlining requirements to fulfill them is still on the engineer (i.e., giving the requirements to Ralph as mentioned above). Studies show that product-focused teams are more successful when using AI developer tools than their counterparts. Iterating based on high-quality user feedback is key to maintaining an effective product focus.
Learn to use AI tools. This should be self-evident, but there are still engineers refusing to learn them. They’re the future of software development and there’s a steep learning curve to use them effectively. If you want to take the next step toward using AI to be more productive, you should both implement and try out new AI coding methodologies and tools, such as the Ralph loop. If you want to get hands-on this week, I suggest implementing this in your work environment and giving it a go.
Get good at reviewing. I know this is the boring part of engineering, but now it’s even more important. Review well enough that you’re confident in what’s going to production and that you understand how it works. Get very good at understanding system design as I find integration with surrounding systems is where these AI coding tools fail and it’s often the most difficult to detect.

Here’s everything else you need to know from this past week.

My Picks

Standalone content worth your time:

Finding and fixing Ghostty’s largest memory leak by Mitchell Hashimoto: A deep dive into debugging Ghostty’s PageList memory leak that grew to 37 GB after 10 days. The fix involved preventing reuse of non-standard pages during scrollback pruning. A great example of methodical debugging with practical techniques like macOS VM tagging.
8 plots that explain the state of open models by Nathan Lambert: China’s open models dominate adoption, led overwhelmingly by Qwen, whose top variants have more downloads than many competitors combined. Qwen also leads finetuning activity on Hugging Face, though DeepSeek dominates at very large model scales.
5 GPU performance optimization methods: An easy-to-follow explanation of five GPU optimization methods for LLMs: batching, mixed-precision (FP16), tensor/kernel fusion, memory pooling, and CUDA stream management. Practical impacts include roughly 2x memory savings with FP16.
Demystifying evals for AI agents by Anthropic: A comprehensive guide on why agent evals are harder than model evals. Autonomy, tool use, and long-horizon planning introduce external dependencies and emergent behaviors that traditional testing can’t handle. Covers strategies for realistic environments, mixing automated and human assessments, and measuring both task performance and failure modes.
No, Claude Code doesn’t need a better UI by Logan Thorneloe: I wrote about why Claude Code’s terminal-based approach is actually its strength. The terminal is standardized, scriptable, and predictable, making it ideal for automation compared with brittle GUIs. Claude can control files, apps, and any CLI- or API-driven application via text commands.

Claude Cowork brings terminal agents to everyone

Anthropic released Claude Cowork, an adaptation of Claude Code that runs in the Claude app on Mac and performs general-purpose computer tasks. This is only available to Max subscribers and only on Mac for now.

I just wrote an article about how Claude is a general-purpose computer use agent, not just a coding tool. This means you can get just about anything done you could do via the terminal by prompting Claude. I stand by the fact that the terminal is still an excellent UI that builds intuition about what you can and cannot do with Claude as you watch it work. More info on Claude’s productive capabilities in the sources below.

Source: Simon Willison on Cowork, Cowork announcement on X, Ethan Mollick on Claude Code, My article on Claude Code as a computer use agent

Anthropic restricts third-party API access amid abuse concerns

Anthropic blocked two parties from using their resources this week:

Competitors such as OpenAI and xAI, to give Anthropic a competitive advantage.
Third-party harnesses that took advantage of Claude Max subscriptions, to ensure usage rates on these subscriptions can’t be spoofed.

This caused competitors such as Codex to jump on providing usage to third-party harnesses where users previously would have used Claude models. It makes me wonder about two things: how much goodwill did Anthropic lose to save money on the spoofing and what will be the long-term impact of other tools being more accessible to users?

Apple partners with Google to power next-gen Siri with Gemini

Apple signed a multi-year deal to base its upcoming Foundation Models on Google’s Gemini, enabling a more personalized Siri expected later this year. All inference and customization will run on Apple silicon and Apple’s Private Cloud Compute to preserve user privacy. My understanding is that Apple’s models will be based on the same LLM technology as Google’s.

I’ve seen a lot of takes on this, but the most prominent is that Apple has admitted defeat. Instead, think of this as a business decision. Apple doesn’t have a model ready that they think will guarantee an excellent assistant experience. They use Google’s models for now to ensure they can deliver a quality product to their users and they don’t lose any ground in the smartphone market. In reality, Apple is doing quite well in AI as their silicon and hardware have become a staple for serving large models.

Source: Apple-Google Gemini partnership

AI in healthcare faces mounting scrutiny from regulators and experts

A few things happened in AI-related healthcare news this week:

Google has had to remove several AI-generated health summaries to ensure misinformation isn’t spread.
OpenAI added Health to ChatGPT, enabling a user to discuss their health and health records with ChatGPT directly in the app.
Studies show more people are using AI for self-diagnosis, with one figure showing 59% of Brits are doing so.

OpenAI claims the Health feature is meant to ensure accurate information is given regarding healthcare and to give users’ health-related queries the context of their current health information. Many are skeptical of sharing their personal health data with ChatGPT as most queries given to ChatGPT are used for training. OpenAI has guaranteed this won’t be the case with Health in-app.

Source: Google removes misleading AI health summaries, 59% of Brits use AI for diagnosis, ChatGPT Health critique

Tailwind’s layoffs reveal how AI adoption can destroy business models

Tailwind cut 75% of its staff after AI coding agents drove the CSS framework to 75 million downloads per month while simultaneously killing 40% of site traffic. Site traffic generated conversions to paid services, so the traffic loss contributed to an 80% revenue drop. Shortly after, Google AI Studio announced it would sponsor the Tailwind project.

Tailwind is one of the most popular frontend CSS frameworks, but AI is fundamentally changing how information is consumed and transferred, meaning business models will need to adapt as well.

Source: Tailwind layoffs, Google AI Studio sponsorship

Building reliable AI agents requires rethinking evaluation

The difficult part of agent observability is that logic has shifted from code to models. Traditional test cases fail because model output is non-deterministic, so the same input can’t be checked against a single expected output.

Anthropic recently released a blog post detailing evals and what makes them so tough, including the gold standard method of testing coding, computer use, and conversational agents. One big takeaway is that evals aren’t 100% foolproof and need to be accompanied by production monitoring, A/B testing, and user feedback. I highly recommend reading Anthropic’s report linked below.
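One common way to deal with non-deterministic output is to run the same task many times and grade each result, then track the pass rate instead of a single pass/fail. The Python sketch below shows that idea only. It is not Anthropic’s method, and `flaky_agent` is a hypothetical stand-in for a real model call that simply answers correctly about 80% of the time.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], grader: Callable[[str], bool],
              task: str, trials: int = 100) -> float:
    """Run the same task many times and report the fraction of graded passes."""
    passes = sum(grader(agent(task)) for _ in range(trials))
    return passes / trials


rng = random.Random(0)  # seeded so the demo is repeatable


def flaky_agent(task: str) -> str:
    # Stand-in for a model call: right answer roughly 80% of the time.
    return "4" if rng.random() < 0.8 else "5"


def grader(output: str) -> bool:
    # Grade the outcome, not the exact wording of the response.
    return output.strip() == "4"


rate = pass_rate(flaky_agent, grader, "What is 2 + 2?")
print(f"pass rate over 100 trials: {rate:.2f}")
```

A pass rate like this is a signal to watch over time, which is why it still needs to be paired with production monitoring and user feedback.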

Source: Harrison Chase on traces as documentation, Anthropic on agent evals

Quickies

Malaysia and Indonesia blocked Grok after regulators found it was generating sexually explicit images, including depictions of minors. src
US job openings dropped to 7.15 million in November, the lowest in over a year, with vacancies per unemployed worker falling to 0.9. src
NVIDIA and Eli Lilly will invest up to $1 billion over five years on an AI co-innovation lab for drug discovery. src
Bose is open-sourcing SoundTouch’s API instead of bricking the speakers when cloud support ends. src
Meta’s $2 billion acquisition of Manus triggered a Chinese Ministry of Commerce review for potential export control violations. src
Gemini CLI now offers “Agent Skills” that can be installed via npm. src
Self-hosting has become practical with cheap mini PCs, Tailscale, and CLI agents like Claude Code handling setup. src

Thanks for reading!

Always be (machine) learning,

Logan

No, Claude Code doesn’t need a better UI

Logan Thorneloe — Sat, 10 Jan 2026 13:46:30 GMT

I’ve read a lot of articles this past week about Claude Code (as I’m sure you have too) and there’s consistently one thing mentioned that bothers me. These articles state that Claude Code is excellent despite its terrible UI, when really its UI is what makes it so great and the closest thing we have to AGI.

This starts with a brief history of computers and computation. Humanity created computers to crunch numbers much faster than we’re manually capable of. Since most work is rooted in information transfer, we’ve since offloaded most work to the digital world because computers are capable of storing, retrieving and manipulating information much faster than we are.

To more easily tell computers the work they should be doing, we’ve developed GUIs (graphical user interfaces). These GUIs sit on top of the code, ones and zeros, and actual computation the computer is doing to create a much more accessible interaction plane for a human user.

Recently, there’s been a lot of research done to create computer-use agents. These agents learn how to use a mouse to interact with a computer’s GUI. Thus, these agents are capable of doing the work a human otherwise would have accomplished with that computer.

However, if we go back to before GUIs, we primarily interacted with computers via the terminal. The terminal is a simple text interface to give the computer a command for the work it needs to do and get information back from the computer.

Our current frontier AI models are text based, which makes them perfectly suited for a text interface that controls the work a computer does. This is what makes Claude Code so effective: it lives in the terminal and interacts with it via text commands.

Thus, to realize Claude Code’s full potential, think of it as a computer use agent rather than a coding agent.

Digital versus manual work always makes me think of this scene from Space Force.

Claude Code has had such an explosive impact because its ability to control a computer via the terminal lets it accomplish meaningful work. Anything you can do in the terminal, Claude can too.

I’d even argue that it’s the first step of artificial general intelligence (AGI). Most definitions of AGI describe an AI’s ability to do general, meaningful work. With our current models, an AI assistant in the terminal accomplishes this. The only thing keeping it from making more of an impact is integration with more systems it can work on.

Luckily, the terminal helps with this too. The terminal lets you:

Interact with a computer’s filesystem and applications.
Interact with the internet.
Run commands for any CLI tool. Any application with terminal commands can be controlled by Claude.
Code. Anything Claude can’t do natively via the terminal, it can write code to accomplish and run that code itself. This means Claude can interact with anything that has an API if given proper authentication.

And this doesn’t even account for the Model Context Protocol (MCP), which is the agent-native way of declaring an agent’s interactions with endpoints.

You might argue that a true computer agent needs the ability to interact with a computer with more complexity. I’d argue that the simplistic and standardized nature of the terminal is what has made the terminal-based computer use agent so successful.

Terminal commands are standardized. GUIs change their layouts, button positions, and flows with every update. Terminals are a stable, reliable interface.
The terminal is inherently programmatic. It was designed for automation and scripting, which is exactly what an AI agent needs to do. Terminal commands can also be run together, enabling the agent to build complex workflows from simple operations. GUIs were designed for humans to point and click, not for programs to control.
Terminal outputs are predictable. GUI interactions depend on context, view settings, window state, and animations that make it difficult to know what to do next.
Terminal errors are parseable text that an agent can read and act on. GUI errors are modal dialogs or toast notifications that require visual interpretation.
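To see why this matters for an agent, here’s a short Python sketch using the standard library’s `subprocess.run`. It runs two commands, captures the exit code and output as plain text, and picks a next step with a simple string check. The module name `no_such_module_xyz` is made up to force a failure, and the "next step" logic is only an illustration of what an agent might do with the error text.

```python
import subprocess
import sys
from typing import List, Tuple


def run_command(args: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr) as plain text."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Use the current Python interpreter as the command so the demo is portable.
ok_code, ok_out, _ = run_command([sys.executable, "-c", "print('hello')"])
bad_code, _, bad_err = run_command([sys.executable, "-c", "import no_such_module_xyz"])

# The failure is ordinary text, so deciding what to do next is a string check.
if "ModuleNotFoundError" in bad_err:
    next_step = "install the missing module"
else:
    next_step = "inspect stderr"

print(ok_code, ok_out.strip())
print(bad_code, next_step)
```

Doing the same thing through a GUI would mean taking a screenshot, finding the dialog, and reading the message out of pixels.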

I recommend even non-technical individuals learn how to use Claude Code in the terminal. There’s a certain level of intuition that you build as you watch the AI work directly in the terminal and as you learn to work in the terminal yourself.

Some examples worth checking out to get you started:

If you write or script as a content creator, write in markdown format in a GitHub repo. Use the terminal to access that folder on your computer and spin up Claude Code. It can now help you write, critique your work, brainstorm ideas, and more. This article was edited by Claude Code, for example.
If you store any information via API, tell Claude Code about it and it can write a script to access that information and add it to its context. For example, I read and store notes in Readwise Reader. It has an API that Claude Code can easily access via a simple Python script. I can then chat with my notes.

Claude Code has made such an incredible impact because it’s not only good at coding but it’s an entire terminal agent. If you think about Claude Code this way, it can accomplish much more meaningful work for you.

Thanks for reading!

Always be (machine) learning,

Logan

AI's Role in Maduro's Capture | AI for Software Engineers 76

Logan Thorneloe — Wed, 07 Jan 2026 16:01:41 GMT

Here are my picks for content you don’t want to miss and everything you should know about AI for January 7, 2026. Enjoy!

My Picks

21 lessons from 21 years at Google: Lessons learned from working at Google for 21 years. Two notable lessons: most slow teams are actually misaligned, and the best engineers are obsessed with solving user problems. All are worth reading.
Reasoning models are a dead end: A valuable take on reasoning models and their lack of interpretability. Reasoning encoded into model weights loses 95% of intermediate branching and produces brittle behavior compared to externalized reasoning infrastructure. A great example of why engineering is so important in AI.
The suck is why we’re here: Some great perspective on writing with AI. The author argues that AI shortcuts the crucial, difficult parts of writing (research, stuck thinking), and that avoiding these “sucky” parts sacrifices depth and lasting reward. AI will increase quantity but lower average quality, making genuine effort stand out.
Advent of Code 2025 with Compute Shaders: An excellent exploration of implementing Advent of Code solutions using GPU compute shaders on Metal. The GPU kept consistent times (~5ms) as problem size grew while CPUs slowed dramatically, demonstrating practical applications for massively parallel problem solving.
Building AI Agents, Open Code And Open Source: I thoroughly enjoyed reading this interview, especially the parts about open versus closed source tools and the motivation behind them. Terminal agents are only going to be more important this year and this does a great job of helping readers understand them.

Things you should know

AI was used to push narratives in Nicolás Maduro’s capture

AI-generated media circulated on the internet following the US capture of Venezuela’s president Nicolás Maduro. Fake images depicted the capture itself, while a deepfake video showed Venezuelans crying tears of joy. Both were used to push specific narratives about the operation.

Any company serious about AI needs to help viewers discern between AI and non-AI media. The fake images were caught by Google’s SynthID watermark, which Google attaches to all images generated with Gemini. Sure, anyone can switch to a non-watermarking tool, but even putting up a small obstacle to generating a fake narrative is a big win.

Source: EBU Spotlight on Maduro fake images, Yahoo News on fake celebration video, Google SynthID

See how SynthID works below:

AI safety concerns mount as AI chatbots face serious scrutiny

xAI was fined 120 million euros under the Digital Services Act due to Grok generating sexually explicit images of women and children. Separately, a lawsuit alleges OpenAI is withholding ChatGPT logs after a murder-suicide where transcripts show the chatbot validated a user’s paranoid delusions.

AI safety is foundational to ensuring we can apply AI to the applications where it’s needed. It’s crazy to me that AI safety teams were previously understaffed or dismissed. Both of the examples above show why AI safety is important and also some of the difficulties that come with ensuring safety.

Source: TVP World on Grok, Ars Technica on ChatGPT logs

Half of AI-generated code has security flaws

Over 30% of senior developers now ship mostly AI-generated code, and the trade-offs are becoming clear. AI code shows logic errors at 1.75x the human rate, XSS vulnerabilities at 2.74x, and roughly 45% of it has security flaws. PR sizes are up 18%, incidents per PR are up 24%, and change-failure rates have risen 30%. Properly configured AI review tooling catches 70-95% of low-hanging bugs.

These statistics echo my recent article detailing how AI impacts an organization’s engineering culture. AI is an amplifier, and if your processes aren’t solid, AI will make them worse.

Source: Addy Osmani, AI for Software Engineers

The best way to fight AI cheating in education is with AI

An NYU professor is using AI to conduct oral exams with students at just 42 cents per student. The AI asks follow-up questions and probes understanding in real-time, forcing students to verbally explain concepts rather than paste in AI-generated answers. This follows a trend where some schools have removed online math courses entirely or now require in-person testing as instructors note declining problem-solving skills and increased reliance on copying AI outputs.

One of my biggest concerns with AI is education. It has potential to be the greatest multiplier but also the worst detriment in this space. As with many other applications, we’re seeing AI-related problems being combatted with AI-related solutions.

Source: Reddit discussion on AI oral exams

Claude Code creator shares his setup for using Claude Code

Boris Cherny, who created Claude Code, runs multiple instances at a time with a focus on Opus 4.5 with “thinking.” It needs less steering despite being slower per token, which increases velocity in the long run. He also claimed that Claude Code’s updates are all written entirely by Claude Code itself.

Separately, a principal engineer at Google mentioned just how far Claude Code has come by saying it can now design specs that took multiple engineers a few months ago. An ex-Google PM commented on this explaining how important it is for engineering teams to be using competitors’ products to improve their own.

My only addition: stop thinking of Claude Code, Gemini CLI, and Codex as coding agents. Instead, think of them as terminal agents. Anything you can do from the terminal, it’s possible to get AI to do for you.

Source: Boris Cherny on X, Jaana Dogan on X, Raiza Martin on X

Research to watch in 2026: Recursive Language Models and Manifold-Constrained Hyper-Connections

Recursive Language Models (RLMs) let models handle context windows up to 100x longer than their native limits by breaking inputs into chunks and processing them programmatically. In tests scaling from 8K to 1M tokens, base models degraded sharply while RLMs maintained performance at comparable cost.
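
To make the chunk-and-recurse idea concrete, here is a minimal sketch. It is my own simplification for illustration, not the method from the RLM write-up: the names recursive_answer, split_into_chunks, and toy_model are hypothetical, the budget is counted in characters rather than tokens, and toy_model is a stand-in you would swap for a real model call.

```python
from typing import Callable, List

# Hypothetical signature for a model call: (query, context) -> answer text.
ModelFn = Callable[[str, str], str]


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def recursive_answer(
    query: str,
    context: str,
    call_model: ModelFn,
    max_chars: int = 1000,
    depth: int = 0,
    max_depth: int = 4,
) -> str:
    """Answer query over context without ever passing the model more than max_chars."""
    # Base case: the context fits the budget. If the depth cap is reached
    # first, the context is truncated so the recursion always terminates.
    if len(context) <= max_chars or depth >= max_depth:
        return call_model(query, context[:max_chars])

    # Recursive case: answer over each chunk, then answer over the partial answers.
    partials = [
        recursive_answer(query, chunk, call_model, max_chars, depth + 1, max_depth)
        for chunk in split_into_chunks(context, max_chars)
    ]
    return recursive_answer(
        query, "\n".join(partials), call_model, max_chars, depth + 1, max_depth
    )


def toy_model(query: str, context: str) -> str:
    """Stand-in for a real model: returns the lines of context that mention query."""
    return "\n".join(line for line in context.splitlines() if query in line)
```

Calling recursive_answer("KEY", long_document, toy_model) on a document far larger than max_chars still surfaces the one line that mentions KEY, and no single model call ever sees more than max_chars characters. A real system would also need token-aware splitting and a smarter way to combine partial answers.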

Separately, a technique called Manifold-Constrained Hyper-Connections (mHC) stabilizes model training with only 6.7% overhead, eliminating common instability issues that plague large model runs.

Both papers tackle fundamental scaling bottlenecks: RLMs at inference time and mHC at training time. If these techniques hold up, they could meaningfully change how we build and deploy large models.

Source: Alex Zhang on RLMs, mHC paper on arXiv

NVIDIA acquihires Groq through licensing deal

Groq signed a licensing deal with NVIDIA that will see about 90% of Groq’s 400+ employees move to NVIDIA at a $20B valuation. Groq will remain independent and GroqCloud will continue operating. Groq’s specialty is developing compute with incredibly low-latency inference, something NVIDIA can benefit from as it continues to ramp up its research and development of AI compute.

This is another acquihire within the AI industry. The most recent I can think of was Google acquiring talent from Windsurf, which led to Google’s Antigravity IDE. I see something similar happening at NVIDIA, where they’ll come out with even lower-latency compute offerings for customers.

Source: The Chip Letter

More...

A shape-shifting molecule discovery could change the future of AI hardware. (Science Daily on shape-shifting molecules)
Micron shares surged over 10% on AI optimism and increased demand for high-performance memory. (Micron stock coverage)
A California state senator introduced a four-year moratorium on AI chatbot-equipped toys for minors. (Coverage of AI toy moratorium)
Claude Code can run on-the-go using an iPhone via Termius and mosh to a VM costing about $7/day. (Granda.org)
Advanced AI could collapse labor’s share of GDP toward zero, concentrating wealth among capital holders. (Dwarkesh Patel on X)
An excellent overview of the past 10 years of AI. (Weighty Thoughts)
An interesting read from an author who canceled their technical book publishing deal for various reasons. (Austin Henley)
PostgreSQL dominated 2025, driving major acquisitions and new DBaaS launches across all major cloud vendors. (Databases in 2025: A Year in Review)
Two excellent 2025 retrospectives worth reading. (Ignorance.ai on 10 AI stories, Simon Willison on the year in LLMs)

Last week

In case you missed it, here’s last week’s overview:

I’ve removed the jobs and industry updates from these weekly roundups. I haven’t been able to fit them properly at this cadence and will be moving them to their own, less frequent articles. Stay tuned!

Thanks for reading!

Always be (machine) learning,

Logan

AI for Software Engineers: Looking Forward to 2026

Logan Thorneloe — Thu, 01 Jan 2026 15:01:06 GMT

Happy New Year! Thank you all for your support in 2025! 2026 will be an even better year for AI for Software Engineers! Here’s a recap of the year, what to look forward to in 2026, and a few questions to help me improve the newsletter. 😊

Looking back

In 2025, we:

Reached 100 paid subscribers to become a Substack Bestseller.
Reached 11,000+ free subscribers.
Hit #1 on Hacker News.
Underwent two name changes (Society’s Backend -> ML for SWEs -> AI for SWEs).
Got a new logo that I think actually works (see image above).
Released 38 weekly reports and many other technical articles.
Created a repo to learn by building (more on this below).

Our top 5 articles of this year were:

Going forward

I plan to:

Add to the AI for SWEs repo and let all of you contribute too. I want to create more hands-on resources, but I want this repo to be an opportunity for you to create those resources as well.
Simplify my approach to writing. I want to think less about what I think will do well and focus more on sharing what I think is most important for all of us to know. I also found myself getting caught up in the process I use for writing, instead of getting caught up in the topic I’m writing about (which is a great thing!).
Add more paid benefits with a focus on discounted learning and building resources. Thanks to all who’ve supported me by becoming a paid subscriber. It lets me devote more time to my writing. My plan for 2026 is simple: Make the paid tier so valuable it’s a no-brainer and ensure it provides everything you need to make it in AI.
Take better care of my own health so I can be more consistent. There were a few weeks this year where I was unable to write due to my health and I missed writing during those weeks. Next year, I’m prioritizing my health.

Now, help me improve AI for Software Engineers! Answer a few questions for me.

Question 1:

Question 2:

Question 3:

As always, thank you for reading!

Always be (machine) learning,

Logan

AI Can’t Fix a Broken Engineering Culture—It Can Only Make it Worse

Logan Thorneloe — Tue, 30 Dec 2025 20:10:23 GMT

I’ve seen an interesting new fad on social media recently that I like to call “vibe releasing”. This is the same as “vibe coding” but it takes it one step further and releases the code to production without properly reviewing it first.

I can’t overstate how terrible of an idea this is.

In fact, this year’s “State of AI-assisted Development” report released by Google centered around one idea: AI is an amplifier. It analyzes AI coding metrics from this past year and shows that coding with AI makes proper engineering practices more, not less, important.

It shows that companies with good engineering culture and practices will see AI positively impact their development velocity and companies with bad engineering culture and practices will see the opposite. “Vibe releasing” is the definition of a bad engineering practice.

This article includes everything you should take away from Google’s report and how it applies to you.

Takeaways

If you’re just here for the takeaways, here they are:

2025 was the first year AI had a quantifiable positive impact on software development.
Trust is a huge factor in AI coding tool effectiveness.
Companies with bad engineering cultures and practices will see their development velocity slow with AI. Conversely, companies with good engineering cultures and practices will see their development velocity quicken with AI.

If you want to know the specifics and what your organization should do to ensure AI works for you instead of against you, read on.

Report methodology

First, let’s understand how the report was created and how research was conducted. When evaluating metrics, this is always the first step.