What We Tested
Any assessment can sound plausible on paper. The harder question is whether it can recover the structure it claims to measure.
That is the job of the ICOSA-D40. Forty items on a 7-point bipolar scale are supposed to recover a 20-cell grid: four capacities crossed with five domains. Everything downstream depends on that recovery. If the items drift, the formations drift. The gateways drift. The clinical picture drifts.
This test asked a simple question: if the grid is already known, can the assessment find it again without being shown the answer?
We ran that check on 165 synthetic clinical personas. Each persona had two source objects:
- a hand-built Icosa grid, used here as ground truth
- a 1,500- to 2,000-word vignette written from inside the condition
An AI read only the vignette. It never saw the grid. It answered the same 40 questions a person would answer, in character, one item at a time. Then the recovered grid was scored against the source grid.
This is not human-outcome validation. It is an internal round-trip test. The point is narrower and still important: can the item set preserve the intended structure when all it gets is lived narrative?
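The round-trip check described above can be sketched in a few lines. This is a minimal illustration only, under assumed data layouts; `score_round_trip`, the dict shapes, and the cell keys are all hypothetical names, not the project's actual pipeline.

```python
# Hypothetical sketch of the round-trip check. Names and data shapes are
# illustrative assumptions, not the real implementation.
from statistics import mean

def score_round_trip(source_grid, recovered_answers, item_to_cell):
    """Compare recovered item answers (1-7) against the ground-truth grid.

    source_grid: dict mapping cell key -> expected 1-7 value
    recovered_answers: dict mapping item_id -> answered 1-7 value
    item_to_cell: dict mapping item_id -> cell key
    """
    # Signed per-item error: answered minus expected.
    errors = [
        recovered_answers[item] - source_grid[cell]
        for item, cell in item_to_cell.items()
    ]
    return {
        "mae": mean(abs(e) for e in errors),        # mean absolute error
        "within_1": mean(abs(e) <= 1 for e in errors),  # fraction within 1 pt
        "exact": mean(e == 0 for e in errors),      # fraction exact matches
    }
```

The grid is never shown to the responder; it enters only at scoring time, which is what makes this a round trip rather than a copy.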
The Persona Set
The 165 personas cover the full severity range, from secure attachment and recovered conditions through active psychosis.
| Band | Personas | Examples |
|---|---|---|
| Thriving | 33 | secure attachment, recovered depression, recovered anxiety, ADHD-thriving |
| Steady | 31 | ADHD variants, mild-to-moderate depression, generalized anxiety, early burnout |
| Strained | 49 | panic disorders, PTSD, OCD, somatic symptom patterns, active eating disorders |
| Burdened | 36 | severe anxiety, active substance use, dissociative states, severe personality pathology |
| Severe | 16 | active psychosis, schizophrenia, psychotic mania, severe OCD, DID in activated state |
The important point is not just count. It is spread. The test includes healthy profiles, defended profiles, flooded profiles, numbed profiles, and presentations where subjective experience badly misstates structural reality. If the assessment can hold its shape across that range, it is measuring something real.
Results
Across all 165 personas and 6,600 individual item scores:
- Mean MAE: 0.89 - less than one scale point of average error on a 7-point scale
- 82% of answers land within one point of the expected value
- 38% are exact matches
- 69% directional accuracy - the answer falls on the correct side of the midpoint
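As a rough illustration of how those headline numbers fall out of paired (expected, answered) item scores, here is a hedged sketch. It assumes the scale midpoint is 4 and that midpoint-valued expectations are excluded from the directional count, since they have no "correct side"; `headline_metrics` and the tuple layout are assumptions, not the actual analysis code.

```python
# Illustrative computation of the headline metrics on a 1-7 scale.
# Function name and data layout are hypothetical.
MIDPOINT = 4  # assumed midpoint of the 7-point bipolar scale

def headline_metrics(pairs):
    """pairs: list of (expected, answered) tuples, each in 1..7."""
    n = len(pairs)
    abs_errors = [abs(a - e) for e, a in pairs]
    # Directional hit: answer and expectation on the same side of midpoint.
    directional = [
        (a - MIDPOINT) * (e - MIDPOINT) > 0
        for e, a in pairs
        if e != MIDPOINT  # midpoint expectations have no side
    ]
    return {
        "mean_mae": sum(abs_errors) / n,
        "within_1_pct": 100 * sum(x <= 1 for x in abs_errors) / n,
        "exact_pct": 100 * sum(x == 0 for x in abs_errors) / n,
        "directional_pct": 100 * sum(directional) / max(len(directional), 1),
    }
```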
Accuracy declines with severity, but it declines cleanly rather than collapsing.
| Band | Personas | Mean MAE | Within 1 pt |
|---|---|---|---|
| Thriving | 33 | 0.57 | 92% |
| Steady | 31 | 0.89 | 82% |
| Strained | 49 | 0.96 | 79% |
| Burdened | 36 | 1.02 | 77% |
| Severe | 16 | 1.02 | 77% |
Two result patterns matter most.
First, health reconstructs almost perfectly. Twenty-one personas came in below 0.60 MAE. The secure-continuous persona hit 0.00 MAE: every answer matched. The five recovered personas - depression, anxiety, eating disorder, trauma, and substance use - averaged 0.14.
Second, the error curve is orderly. 107 of the 165 personas, about 65%, stayed below 1.0 MAE. Even at the severe end, the assessment usually kept the broad structure right and missed on magnitude, not direction.
Where It Misses
Distressed people underrate capacity
The largest bias sits in capacity items: what the system can do, or how strongly a function is operating. Across the full set, capacity items ran 0.43 points below ground truth on average (mean signed error -0.43); domain items ran 0.21 points below.
That makes sense in practice. When a function is active in a painful way, people do not report it as capacity. They report it as suffering.
Anxious body-monitoring does not feel like strong tracking of bodily state. It feels miserable. Codependent emotional scanning does not feel like sharp emotional pickup. It feels like losing yourself in someone else. A person whose focus is locked by OCD may sincerely say meaning is gone, even when the system is still assigning crushing significance to everything.
The grid sees activity. The person feels impairment. Both descriptions can be true at once.
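The capacity-versus-domain split behind those bias numbers can be sketched as a signed-error aggregation. This is a hypothetical layout: the record fields, type labels, and function name are assumptions for illustration.

```python
# Hypothetical sketch of the signed-bias split by item type.
# Record fields and type labels are illustrative assumptions.
from collections import defaultdict

def signed_bias_by_type(records):
    """records: iterable of dicts with keys 'item_type'
    ('capacity' or 'domain'), 'expected', and 'answered'."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        # Signed error: negative means the persona answered below
        # ground truth, i.e. underrated the cell.
        sums[r["item_type"]] += r["answered"] - r["expected"]
        counts[r["item_type"]] += 1
    return {t: sums[t] / counts[t] for t in sums}
```

Keeping the sign, rather than taking absolute values, is what distinguishes a consistent underrating bias from ordinary noise.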
Meaning gets scored too low
Eight of the 40 items sit in the spiritual or meaning domain. Those eight items account for six of the ten worst-performing questions, with average error 24% higher than the rest of the assessment.
The direction of the miss is consistent. People score this domain lower than the grid expects, especially when the grid says meaning is moderate or high.
The thinking traces make the pattern clear. About 30% of the misses came from personas who had live sources of meaning but refused to count them: creative work, time in nature, service, religious practice, even a nightly ritual tied to family memory. The internal logic was usually some version of, “Yes, that matters to me, but I don’t think that is what the question means.”
The rest were in crisis. They could name former sources of meaning and still score low because those sources felt unavailable in the present.
That is useful. It suggests a real self-report risk, not just a synthetic quirk. If a client has meaning in work, beauty, duty, family, or craft but does not use spiritual language, the score may come in lower than clinical observation would predict.
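The worst-item analysis above rests on a simple per-item ranking: collect each item's absolute errors across all personas and sort by the mean. A minimal sketch, with an assumed data layout and a hypothetical `worst_items` name:

```python
# Illustrative per-item error ranking used to surface the
# worst-performing questions. Data layout is an assumption.
def worst_items(item_errors, k=10):
    """item_errors: dict mapping item_id -> list of absolute errors
    across personas. Returns the k items with highest mean error."""
    ranked = sorted(
        item_errors.items(),
        key=lambda kv: sum(kv[1]) / len(kv[1]),
        reverse=True,
    )
    return [item for item, _ in ranked[:k]]
```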
Defenses hide structure
The hardest personas were not simply the sickest ones. They were the ones whose first-person report was farthest from the structural pattern underneath it.
PTSD-intrusion profiles often lead with numbness even when the grid encodes flooding. Social anxiety can look like low attunement from the inside even when the person is scanning every micro-expression in the room. Narcissistic presentations may report low awareness of others while still reading the room with precision.
The assessment meets the same wall a clinician meets in a first session: the person reports the defense first.
That is not a failure of the item set. It is the terrain the item set has to cross.
The constructs stayed separate
A weaker assessment would fail in a different way: it would blur capacity and domain into one vague thing. That is not what happened here.
Across 6,600 thinking traces, capacity items elicited reasoning about ability, tracking, or access. Domain items elicited reasoning about intensity, presence, or volume. The misses were calibration misses. There was no evidence that the assessment confused what it was asking.
What the Result Means
This test does not prove everything. It does not replace human validation, and it does not tell a practitioner to ignore clinical judgment.
It does show three important things.
The item set carries the intended signal. Forty responses were enough to recover a 20-cell clinical grid with sub-one-point average error across a wide synthetic set.
The misses are interpretable. They cluster where clinicians would expect trouble: distressed people underrating capacity, non-spiritual people discounting meaning, defended presentations reporting the wall instead of the wound.
The distortion is informative rather than random. When the recovered grid drifts, the drift itself tells you something about self-report under stress.
That is the practical bar. The D40 does not need to replace a clinician’s eye. It needs to get close enough to structural reality that the resulting map is worth working from.
On this test, it does.
See your own formation
Discover how your twenty harmonies are organized — and where your centering path leads.
Take the Assessment →