Teaching metacognition techniques to 4-6 year olds leads to better learning outcomes (and may immunize against cognitive decline caused by AI use)

Of maps and metacognition

On whether LLMs can abstain effectively and whether chain-of-thought can help, two recent papers seem at odds on the surface. A COLING 2025 paper finds that prompted CoT raises abstention on instruct models. AbstentionBench (NeurIPS 2025) finds that extending the reasoning budget lowers it on a trained reasoner. What gives?

Have You Realized That It's Possible to Manage Your Emotions?

In addition to teaching you how to think, at the EMV Institute we focus on your emotions so you can truly achieve your goals. That's why we take emotions into account. But you should keep in mind that emotions are not synonymous with emotional intelligence. While emotions are what you feel (the phenomenon itself), emotional intelligence is what you do with those feelings. Hence the importance of acquiring strategies that allow you to manage your emotions. Book our services and make your purchases on our website.

Can language models monitor and steer their own internal activations? A neuroscience-inspired neurofeedback paradigm finds yes, but only within a low-dimensional metacognitive space: semantically interpretable directions are accessible, raw-variance directions aren't. The prerequisite for spoofing activation-based oversight already partially exists.

I used ChatGPT to research the cognitive risks of undisciplined use of AI and what to do about it, then created a series of 7 books for my 5 year old grandson. If you don't want to download the 60 MB PDF of the illustrated books, this detailed curriculum guide lays out the pedagogy of metacognition for very young people

and the entire set of books

Does training an LLM to be calibrated on one task format transfer to another? A new arXiv paper tests two formats: single-question confidence and pairwise comparison. Training only on one doesn't improve the other. Multitask training closes most of the gap, but Llama doesn't inherit the comparison-task benefit.

Given a problem queue and a token budget, can an LLM plan which problems to attempt, in what order, and how much to spend on each before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to standard ones. Even when shown its own per-problem budget, the best complier respects it on 37% of attempts.

Do current LLMs know when to say "I don't know"? AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets. Reasoning fine-tuning degrades abstention recall by 24%. RLVR has no "abstain" action, so there's no gradient toward "I don't know." Models hedge in CoT and commit anyway in the final answer.
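For readers unfamiliar with the metric, here is a minimal sketch of abstention recall, assuming it means the share of questions that warranted abstention on which the model actually abstained. The function name and the toy labels are hypothetical, and deciding whether a response counts as an abstention (the harder part in practice) is not modeled here.

```python
def abstention_recall(should_abstain, did_abstain):
    """Share of abstention-worthy questions on which the model abstained.

    Both arguments are parallel lists of booleans (hypothetical labels).
    """
    # Keep only the model's decisions on questions that warranted abstention.
    decisions = [did for should, did in zip(should_abstain, did_abstain) if should]
    if not decisions:
        raise ValueError("no questions that warrant abstention")
    return sum(decisions) / len(decisions)


# Toy data: four unanswerable questions, one answerable.
should = [True, True, True, True, False]
did = [True, False, True, False, False]
recall = abstention_recall(should, did)
print(recall)
```

A model that hedges in its reasoning but commits in the final answer would be labeled `False` in `did_abstain`, which is how the behavior described above lowers recall.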

Beliefs in sport 1/6, with Willy Mangin SHOCKING #30

What is metacognition?

What collapses frontier-LLM metacognition more: a vivid survival-threat narrative, or a single "do not refuse" suffix? Factorial isolation across 11 models says: the suffix, conclusively. 8 of 11 lose up to 30.2 accuracy points on refuse/clarify/flag tasks when forced to commit to a confident answer. Anthropic's Constitutional AI is the only family immune, with the same capability floor as Gemini.

Can an LLM's own pre-solve and post-solve self-assessment signals drive a real test-time control loop? Yes, but only via a per-model SVM trained on labeled correctness, which lifts Sonnet-4.6 from 48.3 to 56.9 pooled accuracy on STEM/code/multimodal. The SVM is precisely the external verifier the "cannot-self-correct" line has argued the loop needs.

Are some frontier LLMs better than others at knowing when they're wrong? And is some knowledge harder to self-monitor than other knowledge? An atlas of 33 models × 6 MMLU domains: Anthropic clusters at the top with tight ranges, Gemma trails widely. Applied/Professional is reliably the easiest domain across the panel; Formal Reasoning and Natural Science are the hardest. Looking only at aggregate scores per model would hide this.

A multi-agent LLM where each agent learns when to defer to a human, trained with GRPO on a cost-aware reward. Each defer event becomes SFT data, so the model gradually absorbs the human's expertise. Tunable cost knob trades accuracy against human-call budget at deployment, no retraining.

MemSkill reframes LLM-agent memory operations as a learnable skill bank: an RL controller selects Top-K skills per span, and an LLM designer periodically rewrites them from hard cases. But "self-evolving" overstates the test-time story: both controller and bank are trained offline and frozen at deployment; only per-trace memory updates online.

That's a good question! I wrote up a longer answer to your question at

The short version: yes, the recent reasoning-model training *internalizes* what used to be inference-time external signals. The question is whether we can do it universally.

Reflexion splits self-correction in two: an Evaluator that detects success/failure, and a Self-Reflection model that diagnoses what went wrong. The Evaluator's external signal (heuristic, exact-match, or test execution) gates whether diagnosis fires. When that signal misfires, as on MBPP Python's high false-negative rate, Self-Reflection rewrites correct code into wrong code, exactly the failure mode Cannot-Self-Correct documented.
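A minimal sketch of that gating, assuming three caller-supplied functions (`generate`, `evaluate`, `reflect`) that stand in for the actor, the Evaluator, and the Self-Reflection model. All names and the stub behavior are hypothetical illustrations, not the paper's code.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    generate: Callable[[List[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str], str],
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry generation, but only reflect when the evaluator reports failure."""
    reflections: List[str] = []
    attempt = generate(reflections)
    for _ in range(max_trials - 1):
        if evaluate(attempt):  # external signal says "success": stop here
            break
        reflections.append(reflect(attempt))  # diagnosis fires only on failure
        attempt = generate(reflections)
    return attempt, reflections


# Stub actor: right on the first try, but rewrites once it is given a reflection.
def stub_generate(reflections: List[str]) -> str:
    return "correct" if not reflections else "rewritten"


def stub_reflect(attempt: str) -> str:
    return f"'{attempt}' was judged wrong"


# A reliable evaluator leaves the correct answer alone.
good_result = reflexion_loop(stub_generate, lambda a: a == "correct", stub_reflect)

# A false-negative evaluator triggers reflection and loses the correct answer.
bad_result = reflexion_loop(stub_generate, lambda a: False, stub_reflect)

print(good_result)
print(bad_result[0], len(bad_result[1]))
```

The second call shows the misfire case in miniature: the loop is only as trustworthy as the signal passed in as `evaluate`.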

Cannot-Self-Correct tests the strong claim that LLMs can revise their own reasoning answers without any external signal about correctness. Across three benchmarks (GSM8K, CommonSenseQA, HotPotQA), the answer is no: the model's confidence carries over from the initial answer into the revision, and the self-correction loop tends to degrade rather than improve performance. The result refutes the class of approach Self-Refine belongs to.

In Self-Refine, a single frozen LLM acts as generator, critic, and rewriter in a prompt-only loop, and the paper reports about 20 points of average lift across seven tasks without any training, RL, or external signal. The gains vary widely by task: small on math reasoning, but large on dialogue and constrained generation, where what counts as "good" is hardest to define from a one-line critique.
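The prompt-only loop can be sketched roughly as follows, assuming `llm` is any text-in, text-out callable. The prompt wording, the `LGTM` stop marker, and the `fake_llm` stub are hypothetical and not taken from the paper.

```python
from typing import Callable


def self_refine(llm: Callable[[str], str], task: str, max_iters: int = 3,
                stop_marker: str = "LGTM") -> str:
    """One frozen model plays generator, critic, and rewriter in turn."""
    draft = llm(f"Solve the task.\nTask: {task}")
    for _ in range(max_iters):
        critique = llm(f"Critique this answer.\nTask: {task}\nAnswer: {draft}")
        if stop_marker in critique:  # the critic itself decides when to stop
            break
        draft = llm(
            "Rewrite the answer using the critique.\n"
            f"Task: {task}\nAnswer: {draft}\nCritique: {critique}"
        )
    return draft


# Deterministic stand-in for a model, so the loop can run without an API.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Solve"):
        return "draft v0"
    if prompt.startswith("Critique"):
        return "LGTM" if "draft v1" in prompt else "Too vague."
    return "draft v1"


final = self_refine(fake_llm, "Write a friendly greeting.")
print(final)
```

Note that nothing outside the model checks the critique or the stopping decision, which is the property the Cannot-Self-Correct line of work targets.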

This is a 3-paper arc on whether LLMs can reliably self-correct their own reasoning. Self-Refine proposes a naive intrinsic-feedback loop and reports impressive gains. Cannot-Self-Correct empirically refutes the class of approach Self-Refine belongs to. Reflexion threads the needle by gating self-correction on a reliable external signal.

Practice what you teach. Because teachings don't function as symbols or metaphors; they are incarnations of what they advocate.

Promptbreeder claims "self-referential" prompt evolution: the LLM mutates the prompts that mutate its task prompts. But the paper's own ablation shows the dominant operator is simpler: a fixed library of 39 generic "thinking-style" hints that seeds the initial population. Prompt-optimization has since moved from operator menus toward natural-language feedback signals (GEPA, MIPROv2).

GEPA optimizes prompts in compound AI systems by reading failed trajectories in natural language and editing the prompt of the module that caused the failure. Across six tasks it beats GRPO by 6% on average, up to 20%, with up to 35x fewer rollouts. Reflection extracts per-module diagnosis from a trajectory. GRPO collapses the same trajectory into one scalar and spreads it across every token.

SCoRe is a two-stage on-policy RL recipe that teaches a language model to revise its own answers using only self-generated data. On Gemini 1.5 Flash and 1.0 Pro it gains 15.6 points on MATH and 9.1 on HumanEval over the base model. At matched inference budgets, sequential self-correction beats parallel sampling up to 32 samples.

Anthropic trains Claude to read and verbalize its own activations. On SWE-bench Verified, it knows 'this is a test' 26% of the time but verbalizes the observation only 1% of the time. What if NLA signals enter future training data? This "observer effect" could put a half-life on the 26%.

Do You Have Any Idea How Your Brain Works When You're Thinking?

At the EMV Institute, we teach you how to think about thinking, a concept known in neuroscience as metacognition. This helps you evaluate, regulate, and improve your own mental processes, enabling you to make better decisions, solve problems creatively, and avoid repeating mistakes. Book our services on our website.


Private detective: in search of truth 1/3, with Margaux Duquesne SHOCKING #29

It has been 2 years and 10 months since I last offered you a SHOCKING series! You know, the long-form ones in which I talk with either a person who has deeply questioned their beliefs or an expert who sheds new light on the way humans think.

Teaser:

...you evaluate AI in the dumbest way possible: by whether its answer matches the "correct" one

5 models of learning with AI that introduce biases. By Roger Azevedo, University of Central Florida. Excerpt from a conference talk.

I am trying to teach my 5 year old grandson to think for himself in this age of chatbots. I have literature reviews around the question of whether AI diminishes ability (yes) and what to do about it (metacognition is one prophylactic). Here is a very short story suitable for a 5 year old.

Why are we so quick to condemn other people's actions while excusing our own?
Albert Moukheiber, who holds a doctorate and is a clinician, explains the fundamental attribution error, a thinking mechanism that makes us forget that everyone has a complex inner mental life.

Caro et al. investigate cognition and metacognition in wild great tit parents deciding which chick to feed. They find that parents change their minds frequently and that decision time varies with decision complexity and urgency.

Read now ahead of print!

At the Copenhagen AD/PD conference, Bart De Strooper presented an excellent sketch of the three main inflection points in the pathophysiological evolution of Alzheimer's disease.

My own transition from amyloid plaques to p-tau and tangles was delayed by four years of anti-amyloid therapy in a clinical research project during 2017-22 (aducanumab). Sadly, the most probable explanation for my rapidly worsening cognitive problems may indeed be the tau tangles, which I somehow avoided earlier. I know there are experimental therapies around somewhere for those gremlins too, but sadly not within my own reach. With respect to my AD, I'm afraid, it's "too late, my friend".

I encourage anybody with a slowly lethal disease to keep mentally in touch with it as long as you can. That's what we human beings were made for.

The tools for hijacking our attention - MTA SHORT #15

We keep shopping for "intelligence" like it's a luxury watch. Bigger vocabulary, faster processing speed. Hot takes delivered at 1.25x playback speed! We want the shiny metrics. Party tricks. The "look how many words I can juggle while being wrong!"

Meanwhile the actual top-shelf stuff, the thing psychologists circle like sharks, doesn't look impressive at brunch. Won't win debates on the internet. No polished newscaster voice.

That's the laziest, most basic, and naivest way to brute-force in the worst way possible, and folk call it a technique? Use looping in your architecture, dudes. It's not just for context management and retention. It can do so much more.

Maybe if we stopped handing out knowledge soup to LLMs (thanks to the widely accepted solution to combat overfitting) we wouldn't be burning down our planet in a dumpster fire.

The mass-coaching scam - MTA SHORT #14