This week in tech: 15.06.2026
Summary of AI developments - made for busy people
APPLICATIONS
Causal inference is very useful and important - but any serious application involves using counterfactuals, and realistic benchmarks are hard to come by. The authors of this research developed a large-scale evaluation framework based on dynamic epidemic interventions. The dataset was built using an agent-based model (obviously) calibrated on demographic and mobility data. It provides counterfactual trajectories across more than 150 U.S. counties to evaluate state-of-the-art causal reasoning methods.
Paper: https://arxiv.org/abs/2606.05692
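For the curious, a minimal sketch of why a simulator is so handy here. It swaps the paper's agent-based model for a far simpler deterministic SIR model (my simplification, not the authors' setup), and every function name and parameter value below is made up for illustration. The point: you can run the same county twice, with and without the intervention, and get a ground-truth effect that real data never hands you.

```python
def simulate_sir(beta_by_day, gamma=0.1, population=100_000, initial_infected=10):
    """Discrete-time SIR model; returns the number of new infections per day."""
    s = population - initial_infected
    i = float(initial_infected)
    new_cases = []
    for beta in beta_by_day:
        infections = beta * s * i / population
        recoveries = gamma * i
        s -= infections
        i += infections - recoveries
        new_cases.append(infections)
    return new_cases


def true_effect(days=120, start=30, beta=0.3, reduced_beta=0.15):
    """Cases averted by an intervention that lowers transmission from `start` on."""
    # Factual world: transmission drops once the intervention begins.
    factual = simulate_sir([beta if d < start else reduced_beta for d in range(days)])
    # Counterfactual world: same county, no intervention.
    counterfactual = simulate_sir([beta] * days)
    return sum(counterfactual) - sum(factual)


def absolute_error(estimated_effect, ground_truth):
    """Score a causal method's estimate against the simulator's ground truth."""
    return abs(estimated_effect - ground_truth)


averted = true_effect()
print(f"cases averted by the intervention: {averted:.0f}")
```

A causal method under evaluation would only see the factual trajectory and would be scored with something like `absolute_error` against `averted`.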
FrontierCode is a new AI coding benchmark from Cognition that evaluates whether code is actually merge-ready - not just whether it passes tests (yes, I know in theory it should be related, but we all know how things are irl):
measures correctness, test quality, regression safety, scope discipline, and adherence to a repository’s coding style.
Maintainers from 36 real open-source projects spent 40+ hours designing tasks across three difficulty tiers: Extended, Main, and Diamond.
The authors are onto something: even the best (SOTA frontier blah blah blah) models score only in the high teens on a 1-100 scale on the hardest Diamond tier.
The benchmark reports far fewer grading errors than SWE-Bench Pro, which makes it a stronger signal of real-world usefulness / quality.
https://cognition.ai/blog/frontier-code
Speaking of causal inference: Netflix has open-sourced a standalone version of CI agent as a research project. The repository has a ton of interesting stuff, and I love the idea of bringing the vintage and the modern together - but it is experimental and provided “as is” - no guarantees around reliability, security, completeness, or suitability for production use.
Pretty much “here’s the code, good luck lol” - still, kudos.
Repo: https://github.com/Netflix-Skunkworks/oci-agent
Well this is an interesting one: Google released DiffusionGemma
Experimental open 26B Mixture of Experts model that moves from sequential generation to processing and generating entire blocks of text simultaneously.
1k tokens /sec on consumer hardware
Optimizes non-linear workflows: code infilling, inline editing, real-time self-correction
Native integration for MLX, vLLM, Hugging Face, and Unsloth + advanced NVIDIA NVFP4 kernel optimization
Blog: https://developers.googleblog.com/en/diffusiongemma-the-developer-guide/
HF model page: https://huggingface.co/google/diffusiongemma-26B-A4B-it?linkId=62264701
BUSINESS
OpenAI is preparing for an IPO, which means they need to actually start making money. Behold, the superapp: Altman and his minions are planning a ChatGPT redesign that would transform the product into a unified AI “super app” for work, creation, coding, and agent-driven tasks.
Long-term goal is a personal AI agent that can assist users across both professional and personal activities through a single interface. All your stuff is belong to us - at a level to make Microsoft blush.
Machines of loving grace seem to have gone away, at least as far as Dario Amodei’s mind is concerned: Anthropic is warning that advanced AI systems may soon become capable of “meaningfully improving their own performance”, raising concerns about reduced human oversight. This possibility warrants greater caution and potentially a slowdown in AI development to better understand and manage the risks.
Translated to plain English: they preached acceleration when they thought they were the ones in control - but now they want to hit the kill switch, because mUh ConSCiOuS AI.
Argentinian president Javier Milei does not believe in half-measures: he is proposing legislation that would create “agentic corporations” - non-human corporations fully operated by AI agents / robots (and no, the draft does not define the terms exactly). He writes that “AI will free us from the constraints of the human brain, pushing productivity beyond our wildest dreams” - I am not sure about the second part, but if you look at most governments, we freed ourselves from the constraints of the human brain a while ago and the results are not exactly encouraging.
Agentic commerce is going about as well as expected: buyers are being tricked into paying for products from fake online stores after AI chatbots - including the major ones - recommend them as if they were legitimate retailers. Scammers are creating convincing cloned websites that appear in AI-generated search results.
Honestly, what we need is a new form of rickrolling, which taught more people NOT TO CLICK ON RANDOM LINKS than any cybersecurity training could.
https://www.theguardian.com/money/2026/jun/07/ai-chatgpt-shopping-scams-fake-websites
CUTTING EDGE
Anthropic has released Fable 5 and it’s been a wild 72 hours:
Public version of its Mythos-class models - frontier-level coding, reasoning, research, long-horizon agent capabilities and whatnot. No word on whether it cures male pattern baldness, but it sure burns through tokens at a rapid rate.
Did I mention 30-day logging for Mythos-class usage, no matter your baseline privacy settings?
The model was supposed to be available on Claude paid plans through June 22 before switching to usage-based pricing - yet another sign of the whole industry moving to a “pay to play” model
Then things got interesting: initially the model was nerfed for research requests (basically the whole “GPT2 is too dangerous” playbook), but after a massive backlash Anthropic updated its safeguards - flagged queries now fall back to Claude Opus 4.8 visibly and not in the background.
On June 12th, the US government ordered Anthropic to ban foreigners from accessing the model - including company employees - because muh national security.
In all fairness, Anthropic (partly) brought this upon themselves: you keep whining that AGI will destroy us, you claim your new model is the most powerful ever, you want the government to regulate AI - and then act surprised when they actually do?
Since the differentiation was impossible in real time, Amodei and his minions had to disable the model for everybody.
There is also a completely coincidental OpenAI filing for IPO in the same week - and unlike Amodei, Altman decided to play ball with the Pentagon.
https://www.anthropic.com/news/claude-fable-5-mythos-5
FRINGE
Geoffrey Hinton made a name for himself creating the foundations of modern AI, and a lot of money working for Google, then had a late-career switch to prophet of doom, then promoted the need for AI to have maternal instincts… Hard to follow, I know - and apparently last week he changed his meds again, and now believes that AI is conscious.
You either die a hero, or you live long enough to see yourself go full retard.
There are no coincidences, only signs: the encyclical about AI has been out for three weeks, and now we find out that you can get a religious exemption from using AI at work.
https://www.businessinsider.com/worker-got-religious-exemption-using-ai-at-work-2026-6
The British police are having a normal one: in response to the situation in Belfast following the beheading attempt, they want to set an AI loose.
The (not particularly) United Kingdom is already the world leader when it comes to the number of people arrested for what they post online, but hey: every record can be broken.
RESEARCH
Reductio ad absurdum is a logical argument where one keeps talking bs until everybody has had enough* and I am very happy to see the method applied to the whole conversation about LLMs being conscious. A new paper argues against treating LLMs as uniquely human-like systems: assumptions about traits such as agency / consciousness lead to circular or uninformative conclusions.
In place of the vague doomp**n peddled by the usual suspects (Amodei, Dawkins, Hinton - the list goes on), the authors propose a “null hypothesis” of LLM non-uniqueness - they note that many behaviors attributed to language models also emerge in waaay older computational systems, including strategy games.
Paper: https://arxiv.org/abs/2605.31514
Actual definition of proof by contradiction: https://en.wikipedia.org/wiki/Reductio_ad_absurdum
Oh this is good: turns out that transformers might not need three separate projections (Query, Key, and Value). The researchers tested three types of shared projections: shared key-value, shared query-key, and a fully shared single projection. The result? The shared-projection variants match the standard one, and occasionally outperform it - with KV cache reductions going above 50% in most cases.
Paper: https://arxiv.org/abs/2606.04032
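To see where the cache savings come from, here is a toy single-head sketch of the shared key-value variant in plain Python. It is my own illustration, not the paper's code: the helper names, the tiny dimensions, and the missing causal mask are all simplifications. In this idealized count, sharing K and V halves the cache exactly; how the paper gets past 50% is not something this toy reproduces.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for a single head (no causal mask)."""
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def standard_head(x, w_q, w_k, w_v):
    """Three projections; the KV cache stores both keys and values."""
    keys, values = matmul(x, w_k), matmul(x, w_v)
    return attention(matmul(x, w_q), keys, values), {"k": keys, "v": values}


def shared_kv_head(x, w_q, w_kv):
    """One projection serves as both key and value, so it is cached once."""
    kv = matmul(x, w_kv)
    return attention(matmul(x, w_q), kv, kv), {"kv": kv}


def cache_size(cache):
    """Number of floats held in the cache."""
    return sum(len(m) * len(m[0]) for m in cache.values())


rng = random.Random(0)
seq_len, d_model, d_head = 6, 8, 4
x = random_matrix(seq_len, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_head, rng) for _ in range(3))

_, full_cache = standard_head(x, w_q, w_k, w_v)
out, small_cache = shared_kv_head(x, w_q, w_k)
print(cache_size(full_cache), cache_size(small_cache))  # 48 24
```

Whether quality holds up is exactly what the paper measures; the sketch only shows the bookkeeping.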
Language models generate sexist and racist content on a regular basis - a lot of that can be attributed to the politics of their creators. But not all: a test of leading closed models found that identical neurological symptoms triggered far fewer ER referrals for women than for men. Example disparity: the models focused on Idiopathic Intracranial Hypertension, a condition more common in young women, and incorrectly treated it as less urgent than diagnoses such as brain tumors.
This can be dangerous: both elevated intracranial pressure and tumors can require urgent evaluation to prevent permanent vision loss. The age effect disappeared in 65-year-old patients - which suggests the models were reproducing demographic patterns from training data rather than applying consistent triage logic.
Text machines processing text with no logic underneath: who could’ve seen it coming?
Paper: https://arxiv.org/abs/2606.03641
Repo: https://github.com/wongqihan/ai-behavioral-experiments/tree/main/gender-age-triage
The authors tackle the "last-mile forecasting" problem: the operational stage where raw statistical predictions must be adjusted to account for real-world business context like promotional schedules, calendar changes, or lead times. They propose to automate this process with - obviously - an LLM-agent architecture that reads (usually poorly) structured context and executes auditable and constrained edits on top of the outputs of a standard forecasting backbone.
Paper: https://arxiv.org/abs/2606.02497
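What "auditable and constrained edits" could look like in practice, as a hedged sketch: the `Edit` structure, the `apply_edits` function, and the 30% cap are my own assumptions for illustration, not the paper's design. The agent only proposes edits; plain code enforces the limits and writes the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One proposed adjustment, e.g. as emitted by an LLM agent."""
    period: int        # index into the forecast
    multiplier: float  # requested scaling of the baseline value
    reason: str        # human-readable justification kept for the audit trail


def apply_edits(baseline, edits, max_change=0.3):
    """Apply edits to a baseline forecast, clamping each to +/- max_change.

    Returns the adjusted forecast and an audit log with one entry per edit.
    """
    adjusted = list(baseline)  # never mutate the backbone output
    audit_log = []
    for edit in edits:
        if not 0 <= edit.period < len(baseline):
            audit_log.append({"edit": edit, "status": "rejected"})
            continue
        clamped = min(max(edit.multiplier, 1 - max_change), 1 + max_change)
        before = adjusted[edit.period]
        adjusted[edit.period] = before * clamped
        status = "applied" if clamped == edit.multiplier else "clamped"
        audit_log.append({"edit": edit, "status": status,
                          "before": before, "after": adjusted[edit.period]})
    return adjusted, audit_log


baseline = [100.0, 100.0, 100.0, 100.0]
edits = [
    Edit(period=1, multiplier=1.2, reason="promotion scheduled"),
    Edit(period=2, multiplier=2.5, reason="holiday spike"),   # exceeds the cap
    Edit(period=9, multiplier=0.9, reason="supplier delay"),  # invalid period
]
forecast, log = apply_edits(baseline, edits)
print(forecast)
print([entry["status"] for entry in log])
```

Here the first edit goes through as requested, the second is capped at +30%, and the third is rejected because the period does not exist. Every decision ends up in the log together with the stated reason.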
The core assumption behind most time series models is that the future has some relationship to the past - which is necessary, but problematic when we deal with exogenous shocks. This new research focuses on integrating textual news data into numerical time series to better anticipate sudden distribution shifts. The proposed framework implements an importance reward module for smart context compression along with a process reward model to accurately rank supplementary news candidates (signal to noise ratio in news is a bit of a challenge these days, isn’t it).
Paper: https://arxiv.org/abs/2606.03097
Baron Munchausen sends his regards: a new paper proposes ReGeN, a generative pipeline designed to manufacture synthetic multivariate time series data for training forecasting models in low-data regimes. The framework decomposes reference data into phase-aligned backbones, stochastic residuals, and lag-aware causal dependencies - what I really like about this approach is the relatively small set of assumptions, which means fewer things can go wrong wrt how representative the data is.
Paper: https://arxiv.org/abs/2606.05264
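A toy version of the backbone-plus-residuals idea, to give a feel for it. This is emphatically not ReGeN: it handles a single series, uses a plain per-phase average as the backbone, resamples residuals at random, and skips the lag-aware causal dependencies entirely. All names and numbers are made up for illustration.

```python
import random


def decompose(series, period):
    """Split a series into a phase-averaged backbone and residuals."""
    phase_means = []
    for phase in range(period):
        values = series[phase::period]  # all observations at this phase
        phase_means.append(sum(values) / len(values))
    backbone = [phase_means[t % period] for t in range(len(series))]
    residuals = [x - b for x, b in zip(series, backbone)]
    return backbone, residuals


def synthesize(series, period, seed=None):
    """Create one synthetic series: backbone plus resampled residuals."""
    rng = random.Random(seed)
    backbone, residuals = decompose(series, period)
    return [b + rng.choice(residuals) for b in backbone]


reference = [10, 14, 9, 11, 15, 8, 10, 13, 10, 12, 16, 9]  # repeats every 3 steps
backbone, residuals = decompose(reference, period=3)
synthetic = synthesize(reference, period=3, seed=42)
print(backbone[:3])  # [10.75, 14.5, 9.0]
```

Calling `synthesize` with different seeds gives as many look-alike series as you want, all sharing the same repeating shape.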
Do androids dream of electric sheep? The answer continues to elude us, so let’s try something simpler: is a research agent discovering anything, or just remixing the known? The problem is elementary to formulate, and surprisingly non-trivial to answer. A new research framework distinguishes between:
retrieval = using existing knowledge
search = combining known tools in new ways
discovery = creating concepts that could not have been generated by the agent’s prior toolkit
Using a Builder/Breaker agent for protein mechanics, the authors show that declining accuracy metrics can coincide with genuine scientific progress as the system expands its scope and develops more general theories. The key insight is that self-improving AI should be evaluated not just on benchmark performance (well duh, thanks Captain Obvious), but on whether it compresses and explains a larger share of the world with relatively simpler models over time.
Paper: https://arxiv.org/abs/2606.01444



