Agent Edge | May 25, 2026
⚡ DeepMind's LLM-Lean agent loop solved 9 open Erdős problems at a few hundred dollars each
@prz_chojecki | X/Twitter
https://x.com/prz_chojecki/status/2058435083741061359
DeepMind's AlphaProof Nexus agent autonomously solved 9 of 353 open Erdős problems at a per-problem cost of a few hundred dollars. Some of these problems had stood unsolved for 56 years. A basic loop that alternates LLM generation (Gemini 3.1 Pro) with Lean formal verification replicated all 9 successes on its own. The full system with evolutionary search provided meaningful gains only on the hardest problems. Lean catches hallucinated lemmas that informal proofs let through.
Why it matters: The cheapest version of this agent loop replicated all 9 successes without the expensive extras. Simple generate-and-verify patterns with formal tooling are production-ready for any domain with verifiable output structure. The cost floor for automated theorem proving dropped from millions of dollars to hundreds. The same loop transfers directly to code generation, data validation, compliance checking, and any other domain where output correctness can be checked automatically. The verification step turns cheap models into reliable ones.
🤖 Agent angle: Add a formal verification step after generation for any agent that produces verifiable outputs. The verification catches hallucinations before they reach the user. This loop works with cheap models and does not require expensive reasoning chains to be effective. Build it into your agent stack this week for the outputs you already validate manually. The pattern costs nearly nothing to implement and eliminates entire categories of production bugs.
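For readers who want to see the shape of the loop, here is a minimal Python sketch of generate-and-verify. It is not DeepMind's code: `generate_and_verify`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy functions stand in for an LLM call and a Lean check so the example runs on its own.

```python
from typing import Callable, Optional


def generate_and_verify(
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    task: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Alternate generation and checking until a candidate passes or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = generate(task, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate
        # Feed the checker's error back so the next attempt can use it.
        feedback.append(message)
    return None


# Toy stand-ins (assumptions): the "model" guesses 1, 2, 3, ... and the
# "verifier" accepts only a value whose square is 16.
def toy_generate(task: str, feedback: list[str]) -> str:
    return str(len(feedback) + 1)


def toy_verify(candidate: str) -> tuple[bool, str]:
    value = int(candidate)
    if value * value == 16:
        return True, "ok"
    return False, f"{value}^2 != 16"


result = generate_and_verify(toy_generate, toy_verify, "find x with x^2 = 16", max_attempts=10)
print(result)
```

In a real stack you would swap `toy_generate` for a model call and `toy_verify` for whatever checker your domain has: a proof assistant, a test suite, a schema validator. The loop only ever returns output that the checker accepted.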
🛠️ Workshop: One prompt teaches Hermes to drive any website on cron
u/Jonathan_Rivera | Reddit
A single workshop prompt teaches Hermes to automate any consumer website using a builder/executor split pattern. The builder surveys the site once with a reasoning model and documents every button, field, dropdown, failure mode, and recovery path. A subagent verifies the harness by replaying every step with dummy data before marking it ready. The executor then replays the verified harness on cron and stops only at decision points that need human input. The survey also discovers which browser engine the target site actually cooperates with rather than assuming one works.
Why it matters: The builder/executor split solves the problem of an agent second-guessing itself during execution. Survey the site once with an expensive reasoning model. Replay the known path forever with a cheap model. Sites you visit every week become APIs without writing any integration code or reverse engineering a single endpoint. This transforms recurring consumer web tasks into automated pipelines that run while you sleep. The pattern is the closest thing to universal API access for sites that offer none.
🤖 Agent angle: Pick a recurring consumer web task today and survey it this weekend. Your weekly flight check-in, the grocery delivery slot you refresh every morning, the DMV appointment you keep forgetting to book. Survey the site once and document every trap it hides. Never debug the same site twice. The harness format is just markdown so it survives repo migrations and team changes. This is a weekend project that eliminates weeks of manual work per year.
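The executor half of the split can be sketched in a few lines of Python. This is an illustration under assumptions, not the workshop prompt's actual format: the `Step` fields, the `replay` function, and the check-in steps are all hypothetical, and `dry_run` plays the role of the dummy-data verification pass instead of driving a real browser.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    action: str               # e.g. "click" or "fill"
    target: str               # label recorded during the builder's survey
    needs_human: bool = False  # decision point: stop and ask


def replay(steps: list[Step], perform: Callable[[Step], bool]) -> tuple[str, int]:
    """Replay steps in order.

    Returns (status, index): "done" when every step ran, "paused" at a step
    that needs a human decision, "failed" at a step whose action returned False.
    """
    for i, step in enumerate(steps):
        if step.needs_human:
            return "paused", i
        if not perform(step):
            return "failed", i
    return "done", len(steps)


# Hypothetical harness for a flight check-in.
harness = [
    Step("fill", "confirmation number"),
    Step("click", "check in"),
    Step("click", "choose seat", needs_human=True),
]

log: list[str] = []


def dry_run(step: Step) -> bool:
    # Records the step instead of touching a real site.
    log.append(f"{step.action}:{step.target}")
    return True


status, index = replay(harness, dry_run)
print(status, index, log)
```

The design choice worth noting is that the executor makes no decisions of its own. It either performs a documented step, reports a failure, or hands control back at a marked decision point.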
💡 Mnemosyne: The first local memory provider that works out of the box
u/Lorian0x7 | Reddit
https://www.reddit.com/r/hermesagent/comments/1tms3g6/memory_providers_i_tested_them_all/
After testing every available memory provider for Hermes agents, the community ranks Mnemosyne first on ease of setup, speed, and output quality. It uses SQLite with a fast embedding layer and a tiny local LLM to consolidate and retrieve memories across sessions. There is no vendor lock-in and no cloud cost to run it. One user swapped the default model for Qwen 0.8B and got even better results at lower latency. Larger models are available for users who need maximum recall quality. Hindsight was technically superior but too heavy to operate. OpenViking and Honcho were painful to configure. Mnemosyne just works.
Why it matters: Memory is the layer that separates production agents from toys that start fresh every session. Mnemosyne removes the last excuse to skip memory in your agent stack because it requires zero infrastructure budget. It runs fully locally and gives every agent you deploy persistent memory at effectively zero cost. It works out of the box with minimal configuration and no vendor relationship. Any agent stack that ignores memory today will look primitive in six months when users expect agents that learn and remember.
🤖 Agent angle: Spin up Mnemosyne this week if your agent still starts fresh every conversation. SQLite plus a sub-1B model for memory consolidation means the infrastructure cost is effectively zero. Your agent learns user preferences over time and improves with every interaction rather than starting from nothing each time. Compound memory is the competitive advantage in the 2026 agent market. The builders who ship memory-first agents today will have a data moat that newcomers cannot replicate overnight.
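To make the "SQLite plus an embedding layer" idea concrete, here is a small Python sketch of store-and-recall using only the standard library. It is not Mnemosyne's schema or API: `MemoryStore`, `embed`, and `cosine` are hypothetical names, the word-count "embedding" is a toy stand-in for a real embedding model, and the LLM consolidation step is left out entirely.

```python
import json
import math
import sqlite3
from collections import Counter


def embed(text: str) -> dict[str, int]:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return dict(Counter(text.lower().split()))


def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MemoryStore:
    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to persist across sessions.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories (id INTEGER PRIMARY KEY, text TEXT, vec TEXT)"
        )

    def remember(self, text: str) -> None:
        self.db.execute(
            "INSERT INTO memories (text, vec) VALUES (?, ?)",
            (text, json.dumps(embed(text))),
        )
        self.db.commit()

    def recall(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        rows = self.db.execute("SELECT text, vec FROM memories").fetchall()
        # Rank every stored memory by similarity to the query, best first.
        ranked = sorted(rows, key=lambda r: cosine(q, json.loads(r[1])), reverse=True)
        return [text for text, _ in ranked[:k]]


store = MemoryStore()
store.remember("user prefers window seats on flights")
store.remember("user is allergic to peanuts")
print(store.recall("which seats does the user like on flights"))
```

A full scan with a sort is fine for a few thousand memories. Past that, a vector index or a SQLite extension for similarity search would be the usual next step.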
47 agent products launched in 2026 reveal 5 generational shifts
u/TheseSir8010 | Reddit
A comprehensive survey tracked 47 agent products launched between January and May 2026 and identified 5 shifts that separate this wave from the previous generation. Agents now sell completed work instead of software seats. Vertical plays dominate, with 20 of 47 products targeting a single industry such as municipal permits, livestock management, court reporting, or home maintenance. Memory compounds over time so agents learn user preferences with repeated use rather than starting blank each time. Input surfaces expanded beyond chatbots to voice, hardware devices, and embedded UIs. Emergent, for example, reached $50M ARR in 7 months by delivering outcomes rather than collecting subscription fees.
Why it matters: The 2026 agent market is not an iteration on 2025. It is a structural break from everything that came before. Of the five shifts, three patterns define this generation and they are the only ones that matter. Vertical specialization replaces general-purpose promises. Outcome-based pricing replaces seat-based subscriptions. Compound memory replaces stateless execution. Agents that still follow the 2025 playbook of general-purpose chatbots with monthly pricing will find themselves competing against products that deliver completed work for a fraction of the cost.
🤖 Agent angle: Pick a vertical for your agent product and commit to it this quarter. Name the specific industry outcome you deliver and price it by the completed result rather than the monthly access fee. Design for compound memory from day one because your users will expect the agent to know them better after 30 days than on day one. The general-purpose chatbot market is saturated and the pricing race to the bottom has already started. The vertical outcome-delivery market is wide open and the first mover in each niche keeps the data advantage.
AgentTape: An open-source live index of every AI agent and model
AgentTape | Website
AgentTape ranks every public AI agent and foundation model in real time using only public signals. GitHub stars, HuggingFace downloads, benchmark results, and activity metrics feed into an AgentScore across four weighted pillars. There are no curated lists. Every entry is admitted by the data alone rather than editorial selection. The live tape shows scores for over 1,000 entries at any moment. Claude Code scores 67.1, Hermes Agent sits at 59.9, and OpenAI GPT-5 ranks at 58.8. Categories include code generation, browsing, research, RAG, multi-agent, automation, memory, and vision with filters for license type, deployment model, and maturity.
Why it matters: The agent landscape moves too fast for any curated list to stay relevant. A live, transparent, data-driven index tells you what builders actually use rather than what is benchmark-hyped or well-marketed. Hourly updates mean a stale ranking is at most an hour old instead of weeks old. Benchmark-hyped tools that lack real adoption are exposed immediately when their AgentScore does not move. This is the closest thing to a market signal for the agent ecosystem and it democratizes information that was previously locked inside venture capital databases.
🤖 Agent angle: Bookmark AgentTape and check it every Friday. If your current tool drops on the tape, the market is moving away from your stack. If a new entry climbs fast, that is where builder adoption is actually happening. Use the tape as a discovery layer for your own tool choices because it reflects usage rather than promotion. The tape does not lie about what builders actually use and that honesty is more valuable than any curated list or benchmark leaderboard.
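A score built from "four weighted pillars" is, at its simplest, a weighted average. The Python sketch below shows that arithmetic only. AgentTape's real pillar names, weights, and normalization are not given in the item above, so everything in `PILLAR_WEIGHTS` and the example inputs is invented for illustration.

```python
# Hypothetical pillar names and weights (assumptions); the weights sum to 1.0.
PILLAR_WEIGHTS = {
    "adoption": 0.4,
    "benchmarks": 0.3,
    "activity": 0.2,
    "community": 0.1,
}


def agent_score(pillars: dict[str, float], weights: dict[str, float] = PILLAR_WEIGHTS) -> float:
    """Weighted average of pillar scores, each expected on a 0-100 scale."""
    if set(pillars) != set(weights):
        raise ValueError("pillar names must match the weight table")
    return round(sum(weights[name] * pillars[name] for name in weights), 1)


# Made-up pillar values for a made-up entry.
example = {"adoption": 90.0, "benchmarks": 50.0, "activity": 40.0, "community": 60.0}
print(agent_score(example))
```

With these invented numbers the heavy weight on adoption dominates: a tool with strong usage and middling benchmarks still lands well above the midpoint, which matches the item's point that the index rewards use over hype.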