Agent Edge | June 11, 2026
🧠 Recursive Publishes First Steps Toward Automated AI Research
Recursive Blog | Direct
https://www.recursive.com/articles/first-steps-toward-automated-ai-research
Recursive released early results from its automated AI research system, reporting state-of-the-art results on three benchmarks. The system automates the full research loop: it proposes an idea, implements it, runs an experiment, validates the result, and uses what it learns to choose the next experiment. On NanoChat Autoresearch it achieved 0.9109 validation BPB (previous SOTA 0.9372), a 1.3x speedup. On NanoGPT Speedrun it reached 3.28 validation loss in 77.5 seconds (previous SOTA 79.7 seconds). On SOL-ExecBench it scored 0.754 mean SOL across 235 GPU kernels (previous 0.699), an 18% reduction in the gap to optimal performance. The system runs many parallel research threads over long horizons, keeps useful context from prior experiments, and checks results for reward hacks and run-to-run variance before treating improved performance as real progress.
Why it matters: This is one of the first concrete systems to close the loop on AI-conducted AI research: it does not just suggest ideas but implements, runs, and validates them end to end. For agent builders, this points to a new category: an agent that can improve the models and infrastructure it runs on, which could create a self-accelerating cycle. Because the artifacts are open-sourced, the approaches are inspectable and reusable.
🤖 Agent angle: Watch Recursive’s approach to experiment validation (reward hack detection, variance gates) as a design pattern. If you build agents that optimize anything with a measurable objective, the research-loop architecture applies directly: propose, implement, run, validate, learn. Start with a narrow objective and a clear evaluation, then expand the horizon.
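The propose, run, validate, learn cycle with a variance gate can be sketched in a few lines. This is a minimal illustration, not Recursive's implementation: the `ResearchLoop` class, the `min_gain` and `max_stdev` thresholds, the toy experiment, and all scores are hypothetical, and lower scores are assumed to be better.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class Result:
    idea: str
    scores: list[float]  # one score per repeated run (lower is better)


@dataclass
class ResearchLoop:
    baseline: float            # current best mean score
    min_gain: float = 0.005    # hypothetical: smallest improvement worth keeping
    max_stdev: float = 0.01    # hypothetical: largest tolerated run-to-run spread
    history: list[Result] = field(default_factory=list)

    def validate(self, result: Result) -> bool:
        """Variance gate: accept only a gain that is both large and repeatable."""
        mean = statistics.mean(result.scores)
        spread = statistics.stdev(result.scores)
        return spread <= self.max_stdev and (self.baseline - mean) >= self.min_gain

    def step(self, idea: str, run_experiment, repeats: int = 3) -> bool:
        # Run the same idea several times so variance can be measured.
        scores = [run_experiment(idea, seed) for seed in range(repeats)]
        result = Result(idea, scores)
        self.history.append(result)  # keep context, including rejected ideas
        accepted = self.validate(result)
        if accepted:
            self.baseline = statistics.mean(scores)
        return accepted


def toy_experiment(idea: str, seed: int) -> float:
    # Stand-in for a real training run; the numbers are made up.
    table = {
        "stable-win": [0.980, 0.981, 0.979],
        "noisy-win": [0.95, 1.01, 0.96],
    }
    return table[idea][seed]


loop = ResearchLoop(baseline=1.0)
print(loop.step("stable-win", toy_experiment))  # small spread, real gain
print(loop.step("noisy-win", toy_experiment))   # better mean, but too noisy
```

The gate rejects the second idea even though its mean is lower, because the spread across runs is too wide to trust. A real system would also need a reward-hack check, which this sketch leaves out.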
🛠️ PROJECTMEM: Event-Sourced Memory Layer for AI Coding Agents
Malo, Ripon Chandra, Qiu & Tong | arXiv
https://arxiv.org/abs/2606.12329v1
AI coding agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and repeats debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session. PROJECTMEM is an open-source, local-first memory and judgment layer that targets this problem. It records development as an append-only, plain-text log of typed events (issues, attempts, fixes, decisions, notes) and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, it adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. The paper frames this as “Memory-as-Governance”: memory that does not merely answer the agent's queries but constrains its next action.
Why it matters: Token waste on context reconstruction is a hidden tax on every coding agent deployment. PROJECTMEM’s append-only event log and pre-action gate address both the cost problem (an estimated 5K-20K tokens per session spent on reconstruction) and the correctness problem (repeated failed fixes). For teams running coding agents on production codebases, this is the kind of infrastructure that separates a toy from a reliable tool.
🤖 Agent angle: Consider the “Memory-as-Governance” pattern for any agent that acts on long-lived projects. The combination of an immutable event log with a deterministic pre-action gate is portable beyond coding: apply it to deployment agents, data pipeline agents, or any agent where repeating past mistakes has real cost. Start by logging every decision as an event, then build the pre-action gate for your highest-cost failure mode.
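To make the log-plus-gate pattern concrete, here is a minimal sketch. It is not PROJECTMEM's code or API: the `Event` fields, the `EventLog` class, and the exact-match rule in `pre_action_gate` are all assumptions, and an in-memory list stands in for the plain-text log file.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    kind: str          # "issue", "attempt", "fix", "decision", or "note"
    target: str        # file or component the event concerns
    summary: str
    outcome: str = ""  # for attempts: "failed" or "worked" (hypothetical field)


class EventLog:
    def __init__(self) -> None:
        # In-memory stand-in for an append-only text file, one JSON line per event.
        self._lines: list[str] = []

    def append(self, event: Event) -> None:
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def events(self) -> list[Event]:
        return [Event(**json.loads(line)) for line in self._lines]

    def project(self) -> dict[str, list[str]]:
        """Deterministic projection: failed attempts grouped by target."""
        failed: dict[str, list[str]] = {}
        for e in self.events():
            if e.kind == "attempt" and e.outcome == "failed":
                failed.setdefault(e.target, []).append(e.summary)
        return failed

    def pre_action_gate(self, target: str, planned: str) -> list[str]:
        """Return warnings when the planned action repeats a failed attempt.

        Matching is a case-insensitive exact comparison, a deliberate
        simplification for this sketch.
        """
        return [
            f"already failed on {target}: {past}"
            for past in self.project().get(target, [])
            if past.lower() == planned.lower()
        ]


log = EventLog()
log.append(Event("attempt", "auth.py", "Bump token TTL", outcome="failed"))
log.append(Event("decision", "auth.py", "Keep sessions server-side"))
print(log.pre_action_gate("auth.py", "bump token ttl"))      # one warning
print(log.pre_action_gate("auth.py", "Rotate signing key"))  # no warnings
```

Because the projection is a pure function of the log, replaying the same events always yields the same warnings, which is what makes the gate deterministic.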
Orange Pi 5 Plus Runs Local LLM Server for Autonomous AI Trading: 59 Days of Real Lessons
u/Weird_Night_2176 | Reddit
A Reddit user shared a 59-day field report on running a fully autonomous AI trading system on a $150 Orange Pi 5 Plus single-board computer. The post covers real bottlenecks, cost numbers, and stability lessons from continuous production operation. Running a local LLM server on commodity ARM hardware for a trading agent is no longer theoretical: the report documents what breaks, what burns power, and what actually works when an edge device runs inference continuously for two months. It includes hardware cost, inference latency, uptime, and throughput under real market conditions.
Why it matters: The dream of running autonomous agents on ultra-cheap hardware lives or dies on real reliability data, not benchmarks. This 59-day report answers the practical questions: does a $150 SBC actually stay up? Does inference lag cost you trades? What about thermal throttling? For anyone building agents that need to run at the edge without cloud dependency, this is primary-source evidence on what the floor looks like.
🤖 Agent angle: If you are building a local-first agent that handles real money or real operations, read this report for the failure modes alone. The Orange Pi 5 Plus run suggests the hardware floor is viable, at least for this workload. Model your own uptime requirements against real 59-day edge inference data before picking your hardware.
How One Builder Stopped Multi-Agent n8n Loops from Burning API Budget: Applied VSM Framework
u/JuggernautAmazing527 | Reddit
https://www.reddit.com/r/SelfHostedAI/comments/1u2xnop/how_i_stopped_my_multiagent_n8n_setup_from/
A builder applied Value Stream Mapping (VSM), a framework from lean manufacturing, to diagnose why their multi-agent n8n system kept looping and burning API budget. By mapping every tool call, every loop iteration, and the tokens each one consumed, they identified the specific agents and workflows causing the most waste. The result is a structured, repeatable method for finding where agents waste money and how to fix it. It works as a production cost-control playbook for multi-agent setups: no guesswork, just tracing the actual token flow and cutting the loops that add no value.
Why it matters: Multi-agent systems have a known failure mode: runaway loops that silently burn API budget until the bill arrives. Most builders react with ad hoc rate limits or max-iteration caps. VSM provides a systematic way to find the actual source of waste rather than treating the symptom. This framework is portable to any multi-agent architecture, not just n8n.
🤖 Agent angle: Add VSM to your debugging toolkit. When your multi-agent costs spike, trace every tool call and every loop iteration as a concrete value stream. The loop that burns the most tokens is rarely the one you expect. Run this analysis at least once per deployment to establish your agent’s cost baseline.
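A value-stream pass over an agent trace can be as simple as summing tokens per agent and separating the calls that contributed to the final output from those that did not. The sketch below is one way to do that, not the builder's actual method: the `Call` record, the `added_value` flag, and the sample trace are hypothetical, and deciding whether a call added value is left to whoever labels the trace.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    agent: str
    tool: str
    tokens: int
    added_value: bool  # did this call change the final output? (labeled by hand)


def value_stream(calls: list[Call]) -> list[tuple[str, int, int]]:
    """Per agent: (agent, total tokens, wasted tokens), worst waste first."""
    total: dict[str, int] = defaultdict(int)
    waste: dict[str, int] = defaultdict(int)
    for c in calls:
        total[c.agent] += c.tokens
        if not c.added_value:
            waste[c.agent] += c.tokens
    rows = [(agent, total[agent], waste[agent]) for agent in total]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up trace: the reviewer re-runs the same lint step without changing anything.
trace = [
    Call("planner", "search", 1200, True),
    Call("reviewer", "lint", 800, False),
    Call("reviewer", "lint", 800, False),
    Call("writer", "draft", 2000, True),
]

for agent, total_tokens, wasted in value_stream(trace):
    print(f"{agent}: {wasted}/{total_tokens} tokens wasted")
```

In this made-up trace the writer spends the most tokens, but the reviewer is the only source of waste, which is the kind of result the mapping is meant to surface.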
🧩 Model-Task-Router: A Hermes Skill That Auto-Routes Tasks to the Right Model
@sugumaran95 | Reddit
https://www.reddit.com/r/hermesagent/comments/1u21zw2/i_built_modeltaskrouter_a_hermes_skill_that/
A builder created model-task-router, a Hermes Agent skill that auto-classifies requests and dispatches each one to a model chosen from DeepSWE benchmark data. The routing is task-specific: “implement a JWT middleware” goes to GPT-5.4 (56% on DeepSWE), “design multi-tenant architecture” goes to GPT-5.5 (70% on DeepSWE), “what’s running on my server?” goes to DeepSeek V4-Pro (67.9% on Terminal-Bench at $0.87/M), and simple greps go to GPT-5.4-Mini. The reported results on Oracle Cloud ARM64 (4-core, 24GB) are striking: orchestration turns at $0.09/session, coding tasks at $1.50 with a 56% success rate (up from 8%), and a blended cost of $0.35 per session versus $3.00 when everything runs on GPT-5.5. A PR to merge the skill into Hermes core is open.
Why it matters: Nobody should pay GPT-5.5 rates for a grep task or use V4-Pro for architecture design. This skill shows that model routing based on real benchmark data (DeepSWE for coding, Terminal-Bench for tool use) can cut blended costs by 8-10x while improving task success rates. The benchmark data reveals uncomfortable truths: V4-Pro drops 47 points on real engineering tasks relative to its SWE-bench Pro scores, costing 6.7x more per solved task than GPT-5.4 despite being 17x cheaper per token.
🤖 Agent angle: If you use Hermes Agent, install this skill and customize the routing table with your own model costs and benchmark data. If you use another agent framework, the routing pattern is portable: classify the task type, consult real benchmark data (not just pricing), and dispatch accordingly. The 8-10x cost improvement is one builder’s reported result, so measure it against your own workload.
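The classify-then-dispatch pattern can be sketched as a routing table plus a classifier. The model assignments below mirror the examples in the post, but everything else is an assumption: the skill's real classifier is not described here, so the keyword rules, the `classify` and `route` functions, and the fallback to the cheapest model are hypothetical.

```python
# Task type -> model, following the examples quoted in the post.
ROUTES = {
    "architecture": "GPT-5.5",
    "coding": "GPT-5.4",
    "terminal": "DeepSeek V4-Pro",
    "simple": "GPT-5.4-Mini",
}

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("architecture", ("design", "architecture", "multi-tenant")),
    ("coding", ("implement", "refactor", "fix bug", "middleware")),
    ("terminal", ("running on", "server", "process", "logs")),
    ("simple", ("grep", "find", "list")),
]


def classify(request: str) -> str:
    """Map a request to a task type using naive substring matching."""
    text = request.lower()
    for task_type, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return task_type
    return "simple"  # assumption: unknown requests fall back to the cheapest model


def route(request: str) -> str:
    """Return the model name that should handle the request."""
    return ROUTES[classify(request)]


for request in (
    "implement a JWT middleware",
    "design multi-tenant architecture",
    "what's running on my server?",
    "grep for TODO in src",
):
    print(f"{request!r} -> {route(request)}")
```

Keyword matching is brittle, so a production router would more likely use a small model or embedding classifier for the first step. Keeping the table separate from the classifier means you can swap in your own costs and benchmark scores without touching the dispatch logic.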