AI is already having a seismic impact on how software is written, with much of the grunt work of programming now carried out by swarms of agents and subagents. But as developers experiment with new interfaces and design elements for human-AI collaboration, it has become hard for even the most advanced AI labs to keep up.
The current trend is toward agentic software development, meaning systems where AI agents can work independently on coding tasks, epitomized by the Claude Code and Cowork apps. In the meantime, OpenAI has been steadily building out its Codex tool, which launched as a command-line tool last April and expanded to a web interface one month later.
Now OpenAI is taking a significant step toward catching up. On Monday, the company launched a new macOS app for Codex, integrating many of the agentic practices that have become popular in the past year. The new app is designed to work with multiple agents in parallel, integrating agent skills and other state-of-the-art workflows. The launch also comes less than two months after the release of GPT-5.2-Codex, OpenAI’s most powerful coding model, which the company hopes will be enough to tempt over Claude Code users.
“If you really want to do sophisticated work on something complex, 5.2 is the strongest model by far,” CEO Sam Altman told reporters on a press call. “However, it’s been harder to use, so taking that level of model capability and putting it in a more flexible interface, we think is going to matter quite a bit.”
While Altman’s confidence in GPT-5.2 is understandable, coding benchmarks tell a more complicated story. GPT-5.2 does hold the top spot on TerminalBench (a test measuring how well AI handles command-line programming tasks), at least as of press time. But agents from Gemini 3 and Claude Opus have logged roughly equal scores: lower, but within the benchmark’s margin of error. Results from SWE-bench, another coding benchmark that tests AI’s ability to fix real-world software bugs, are similar, showing no clear advantage for GPT-5.2. However, agentic use cases have been difficult to benchmark effectively, and state-of-the-art models can differ significantly in user experience.
The Codex app also comes with a range of new features that OpenAI says will help it achieve parity with or, in some cases, outpace the various Claude apps. The Codex app will allow for automations that can be set to run in the background on an automatic schedule, with results placed in a queue to be reviewed when the user returns. Users can also choose different personalities for the agent, from pragmatic to empathetic, depending on their working style.
But for the company, the biggest selling point is the sheer speed of development that’s made possible by AI. “You can use this from a clean sheet of paper, brand new, to make a really quite sophisticated piece of software in a few hours,” Altman said. “As fast as I can type in new ideas, that’s the limit of what can get built.”

