Meta-Analysis

Computational Validation of the Icosa Personality Model: A Program-Level Synthesis

Keywords: Icosa, personality assessment, psychometrics, meta-analysis, aggregated evidence, effect size, research program, systematic review

Scope and Framing

This paper synthesizes 77 synthetic-evidence studies conducted within the Icosa personality model’s internal benchmarking program, spanning 13 research domains and producing 441 individual hypothesis tests. The program evaluates the computational architecture of a 4x5 personality grid model — four capacities (Open, Focus, Bond, Move) crossed with five experiential domains (Physical, Emotional, Mental, Relational, Spiritual) — that derives higher-order constructs including coherence bands, formations, traps, basins, fault lines, gateways, resonance trees, and centering paths from this grid geometry.
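
For orientation, the sketch below illustrates the shape of this grid and its row and column marginals. It assumes nothing about the engine's actual scoring code; the cell values and the marginal summaries shown are placeholders.

    import numpy as np

    # Illustrative layout of the 4x5 Icosa grid: rows are capacities, columns are
    # experiential domains. Cell values stand in for the engine's per-center
    # health scores; the real scoring logic is not reproduced here.
    CAPACITIES = ["Open", "Focus", "Bond", "Move"]
    DOMAINS = ["Physical", "Emotional", "Mental", "Relational", "Spiritual"]

    rng = np.random.default_rng(0)
    grid = rng.uniform(0.0, 1.0, size=(len(CAPACITIES), len(DOMAINS)))  # 4x5 health scores

    # Row and column marginals: the summaries that, per the geometry benchmarks,
    # are not sufficient to recover the full 20-cell structure.
    capacity_health = grid.mean(axis=1)  # one value per capacity (row)
    domain_health = grid.mean(axis=0)    # one value per domain (column)

    for cap, health in zip(CAPACITIES, capacity_health):
        print(f"{cap:>5}: {health:.2f}")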

The entire evidence base is synthetic. Every profile was computationally generated by the Icosa engine. Every outcome was measured within the model’s own parameter space. No human respondents participated. No clinical populations were tested. No treatment outcomes were assessed. The studies collectively characterize the internal behavior of a computational scoring architecture: whether its formulas implement their specifications faithfully, whether its constructs differentiate from each other as designed, whether its vocabulary carries semantic grounding in distributional space, and whether its higher-order outputs remain coherent under adversarial and extreme conditions.

This synthesis does not constitute external validation. It constitutes a systematic internal audit. The central question it addresses is not “does the Icosa model describe human personality accurately” but rather “does the Icosa model behave as its mathematical specification claims, and are there structural incoherences, dead zones, or boundary failures that would need resolution before external testing could be interpretable.” That distinction governs every interpretive claim that follows.

The synthesis draws on three tiers of source material: program-level aggregation statistics covering study counts, finding distributions, and anchor effects; per-study summaries providing title, abstract, domain, evidence type, and signal profiles for each of the 77 studies; and domain-level synthesis abstracts from 13 whitepapers that provide the interpretive layer for each research area.

Study Taxonomy and Evidence Classes

The 77 studies distribute across seven evidence types, each carrying different epistemic weight and different vulnerability to circularity.

Formula verification (38 studies, 49.4% of the program). These confirm that the engine implements its own mathematical rules faithfully. When a formula verification study reports that trap count inversely predicts coherence, this establishes that the scoring code does what the scoring specification says — not that traps correspond to clinically meaningful personality dysfunction. Formula verification is the backbone of the program and accounts for the majority of its uniformly positive findings. Its inherent limitation is that it cannot discover anything the formula does not already encode.

Synthetic benchmark (16 studies, 20.8%). These test structural hypotheses within the model’s simulated parameter space. Unlike formula verification, synthetic benchmarks can surface unexpected behavior: dimensional structures that compress differently than predicted, noise degradation curves that reveal measurement boundaries, or round-trip translation losses that identify where the architecture discards information. The crosswalk compression studies, the measurement engine comparison, and several dyadic mechanism benchmarks fall in this category. Synthetic benchmarks carry more discovery potential than formula verification but remain model-internal: they characterize the geometry of computational output, not the geometry of human personality.

Semantic grounding (11 studies, 14.3%). These use embedding-based similarity analysis to test whether the model’s construct vocabulary — trap names, formation labels, gateway descriptions, capacity-domain definitions — carries meaningful structure in distributional semantic space. Semantic grounding studies provide the closest link between the model’s internal architecture and external interpretive meaning, but they test against language model embeddings, not against human judgment or clinical observation. A construct that clusters meaningfully in embedding space may still fail to differentiate clinical populations. Still, a construct that fails in embedding space — that cannot be distinguished from random labels by a language model trained on the full breadth of written English — faces a higher burden of proof for any claimed clinical meaning.

Applied synthetic (9 studies, 11.7%). These test whether structural features of the Icosa grid predict clinically relevant proxy outcomes — compensation detection, stress forecasting, intervention prioritization, differential pattern separation, termination signaling, dyadic lock-buffer forecasting — within the model’s own simulator. Applied synthetic studies sit at the boundary between internal consistency and utility demonstration. They show that the model’s constructs can do nontrivial work within its own simulation environment, which is necessary but not sufficient for real-world applicability.

Simulator comparison (1 study). The centering plan policy benchmark compared gateway-based intervention ordering against random ordering using Monte Carlo simulation with the paths engine’s stateful simulator, testing whether the model’s prioritization logic yields trajectory advantages within its own dynamics.

Stress test (1 study). The scale-sensitivity study examined whether cross-capacity and cross-domain variance measures predict coherence as expected under extreme conditions, testing boundary behavior of the aggregation formula.

Internal consistency (1 study). The coherence-resonance convergent study tested whether the model’s two primary summary metrics relate to each other as their computational pathways would imply.

The distribution is top-heavy: formula verification and synthetic benchmarks account for 70% of all studies. This is appropriate for an early-stage internal audit, since the engine must be verified before asking whether it measures anything real, but it means the program's strongest claims are about implementation fidelity, not about construct validity.
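
To make the semantic grounding evidence type concrete, the sketch below shows one plausible operationalization: a permutation test of whether construct labels in the same category sit closer together in embedding space than labels from different categories. The embeddings are assumed to come from some external sentence-embedding model; this is an illustrative harness, not the program's actual test code.

    import numpy as np

    def grounding_permutation_test(embeddings, labels, n_perm=5000, seed=0):
        """Permutation test of within-category vs between-category cosine similarity.

        embeddings: (n, d) array of construct-description embeddings, assumed to be
        produced elsewhere by a sentence-embedding model.
        labels: length-n category assignments (e.g. capacity or domain membership).
        Returns the observed within-minus-between similarity gap and a p-value.
        """
        X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sim = X @ X.T                              # cosine similarity matrix
        labels = np.asarray(labels)
        iu = np.triu_indices(len(labels), k=1)     # each unordered pair once
        same = (labels[:, None] == labels[None, :])[iu]

        def gap(mask):
            return sim[iu][mask].mean() - sim[iu][~mask].mean()

        observed = gap(same)
        rng = np.random.default_rng(seed)
        null = np.empty(n_perm)
        for i in range(n_perm):
            shuffled = rng.permutation(labels)
            null[i] = gap((shuffled[:, None] == shuffled[None, :])[iu])
        p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
        return observed, p_value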

Quantitative Overview

Across all 77 studies and 441 hypothesis tests, 364 findings (82.5%) were fully reportable: they reached final reportable significance after Holm-Bonferroni correction and met pre-registered practical significance thresholds. Another 10 findings were significant but fell below practical thresholds, 19 were exploratory-positive, 2 were FDR-only, 1 was raw-only, and 45 were null. Zero findings were classified as not evaluable.
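
The reporting tiers above can be illustrated with a small sketch. The Holm-Bonferroni step-down and the absolute-effect-size check against a pre-registered practical threshold follow the description in this section; the tier names are simplified and the code is not the program's actual analysis pipeline.

    import numpy as np

    def holm_reject(p_values, alpha=0.05):
        """Holm-Bonferroni step-down correction; returns a boolean rejection mask."""
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        order = np.argsort(p)
        reject = np.zeros(m, dtype=bool)
        for rank, idx in enumerate(order):
            if p[idx] <= alpha / (m - rank):
                reject[idx] = True
            else:
                break                      # step-down stops at the first failure
        return reject

    def classify_findings(p_values, effect_sizes, practical_threshold, alpha=0.05):
        """Simplified classification of a family of tests into reporting tiers."""
        reject = holm_reject(p_values, alpha)
        tiers = []
        for rejected, es, p in zip(reject, effect_sizes, p_values):
            if rejected and abs(es) >= practical_threshold:
                tiers.append("fully reportable")
            elif rejected:
                tiers.append("significant, below practical threshold")
            elif p < alpha:
                tiers.append("uncorrected support only (exploratory / raw-only)")
            else:
                tiers.append("null")
        return tiers

    print(classify_findings([0.001, 0.004, 0.03, 0.40], [0.6, 0.2, 0.5, 0.1],
                            practical_threshold=0.3))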

Eleven studies included circularity governance flags, totaling 22 governed analyses where the tested relationship shares computational ancestry with a metric it purports to validate. All 77 studies achieved complete provenance documentation. Those governed analyses account for 5.0% of the 441 findings, and they were handled appropriately — reported as implementation fidelity checks or shared-anchor benchmarks rather than independent discoveries.

The statistical test distribution reflects the program’s design:

  • Pearson correlation: 141 tests (32.0%)
  • Spearman rank correlation: 118 tests (26.8%)
  • Permutation-paired: 52 tests (11.8%)
  • Correlation comparison: 35 tests (7.9%)
  • t-test: 22 tests (5.0%)
  • Permutation-independent: 19 tests (4.3%)

The remaining tests span PCA, bootstrap, chi-square, logistic regression, Monte Carlo, McNemar, cross-validation, Fleiss kappa, ICC, Kendall W, Krippendorff alpha, Mann-Whitney, paired-t, TOST equivalence, and multiple-test composites. The predominance of correlation-based tests is consistent with a program focused on structural relationships within a continuous parameter space.

Domain-by-Domain Benchmark Patterns

Geometry (6 studies, 16 findings, 0 nulls)

The geometric foundation of the model — whether the 4x5 grid structure is real in its own output space — receives uniformly positive results. PCA on the 20 center health scores yields 18 effective dimensions, confirming that the grid is not collapsible to a smaller factor structure: the model’s claimed 20-dimensional space uses most of its dimensionality. The hot-core/cool-periphery distinction shows strong alignment with coherence (r = 0.94), and within-profile core-periphery correlation confirms that geometric position carries structural meaning within the model’s scoring. Capacity independence (low inter-row correlation) and domain independence (low inter-column correlation) are confirmed at the formula-verification level. A single synthetic benchmark tested whether the d40 assessment recovers geometry beyond row and column marginals; it confirmed that the full 20-cell structure carries recoverable information that simpler summaries discard. Two findings carry circularity flags (expected formula behavior), appropriately governed.
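
The sketch below shows one way an effective-dimensionality check of this kind could be run. The 95% cumulative-variance criterion and the eigendecomposition of standardized scores are illustrative assumptions; the geometry study's exact criterion is not specified here.

    import numpy as np

    def effective_dimensions(scores, variance_target=0.95):
        """Count the principal components needed to reach a cumulative
        explained-variance target for a (profiles x cells) score matrix."""
        X = np.asarray(scores, dtype=float)
        X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each cell
        eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
        explained = np.cumsum(eigvals) / eigvals.sum()
        return int(np.searchsorted(explained, variance_target) + 1)

    # Synthetic illustration: 20 cell scores with only a weak shared factor keep
    # most of their dimensionality, in the spirit of the 18-dimension result.
    rng = np.random.default_rng(1)
    profiles = rng.normal(size=(500, 20)) + 0.3 * rng.normal(size=(500, 1))
    print(effective_dimensions(profiles))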

The zero-null rate across geometry is expected: these are largely formula verification studies testing whether a deterministic scoring system produces the output its equations specify. The real question — whether this 20-dimensional structure corresponds to anything in human personality — remains entirely unaddressed.

Structural (22 studies, 143 findings, 17 nulls)

The structural domain is the largest and most heterogeneous research area, spanning formula verification, synthetic benchmarks, and semantic grounding studies across the model’s full vocabulary of constructs, topological features, and dynamic subsystems.

Formula verification studies confirm that capacity and domain dimensions remain distinct (PCA shows no single component dominates), that fault lines predict coherence loss and trap formation as specified, that mirror pairs maintain bounded structural polarity, that repeller cells resist centering, that wells and sources show the predicted coherence asymmetries, and that direct pathology (traps) carries stronger coherence impact than cascading disturbance (resonance propagation). These are uniformly positive and establish that the engine’s construct relationships are internally consistent.

The synthetic benchmark studies (fault-line cascade, Mental dampening, directed resonance, resonance tree) extend beyond formula verification by testing whether emergent patterns behave as the theory predicts under synthetic population variation. All produced reportable findings. The Mental-dampening study confirmed that Mental-domain centering substitutes analytical processing for felt experience in the model’s resonance transmission — a structural prediction with direct clinical interpretation if the model proves externally valid.

The semantic grounding studies produce the domain’s most informative results and its highest concentration of mixed evidence. Across the structural domain, 38 findings were not fully reportable: 17 null, 1 below threshold, 19 exploratory-positive, and 1 FDR-only. Much of that mixed signal sits in the semantic grounding layer. The pattern vocabulary study confirmed that patterns sharing capacities or domains cluster semantically and that centered patterns convey health rather than pathology. The severity ordering study confirmed that coherence bands (Severe through Thriving) are organized by severity in embedding space. The clinical-patterns study confirmed grounding for traps, compensations, basins, and fault lines, though two findings were exploratory-only. The grid-reality study confirmed row/column semantic structure but produced one null and one below-threshold finding. The topology study — testing whether gateways, hot core, resonance tree, healing power, and transformation paths are grounded in semantic space — produced one reportable finding, one exploratory-positive finding, one FDR-only finding, and four nulls. The crosswalk validity study (MBTI and Enneagram mappings) produced zero reportable findings: all 11 analyses returned either null (4) or exploratory-only (7).

The clinical-skepticism study directly confronted adversarial challenges: that Icosa is just Big Five with extra steps, that Bond is just attachment theory, that the grid adds nothing over existing frameworks, and that the definitions are Barnum-generic. Of 13 tests, 7 were reportable, 4 null, and 2 exploratory. This is the most mixed result in the program and the most honest: the model partially but not fully withstands skeptical semantic challenge.

The structural domain’s null structure is the program’s most important diagnostic. The nulls concentrate in semantic grounding: the model’s topological vocabulary (gateways, resonance trees, transformation paths) and its crosswalk mappings to external frameworks do not yet show reliable grounding in distributional semantic space. This does not mean the constructs are invalid — embedding-based tests have limited sensitivity to geometric meaning — but it identifies a specific vulnerability: the model’s most distinctive higher-order constructs are the ones least supported by the available semantic evidence.

Constructs (7 studies, 65 findings, 10 nulls)

The construct system — traps, basins, gateways, and fault lines — is the model’s primary clinical vocabulary and receives detailed benchmarking across formula verification and one synthetic benchmark (equifinal paths). Across 65 hypothesis tests, 51 (78%) reached full reportable significance, 3 were significant but below practical thresholds, 1 showed raw-only support, and 10 were null.

The interaction study confirmed that traps and basins co-occur, that gateways buffer against trap formation, and that fault lines predict centered-count depletion. Gateway outcome studies showed that gateway impairment signals predict trap formation, though 3 of 19 tests were null and 3 fell below practical thresholds — establishing that the gateway-trap pathway is real but noisier than other construct relationships. The trap taxonomy study confirmed that traps in different categories (somatic, emotional, relational, identity) produce distinguishable severity distributions.

The basin-discovery study is the construct domain’s primary source of nulls: 6 of 9 tests were null. Basin count co-occurred with trap count as predicted, but several hypothesized basin-coherence relationships did not reach significance. The domain synthesis notes that basin behavior may require different population composition or more extreme input conditions to manifest reliably. This is a confirmed boundary: the model defines basins as attractors that resist change, but the synthetic evidence does not yet confirm that they carry independent predictive value beyond their co-occurrence with traps.

The equifinal-paths study tested whether different activation routes leading to the same trap produce distinguishable structural neighborhoods. Two of four tests were reportable; two were null. The finding that path-level heterogeneity exists is confirmed, but the evidence that it maps to differential intervention response within the model is incomplete.

Dyadic (17 studies, 61 findings, 6 nulls)

The dyadic domain is the program’s second-largest, reflecting the model’s substantial investment in relational assessment. Nine formula verification studies, six synthetic benchmarks, and two applied synthetic benchmarks collectively test the dyadic engine’s cross-domain transmission, fault cascade topology, interaction tensor structure, gateway compatibility, relational basin stability, formation pattern recognition, trap dynamics, and external framework alignment.

The formula verification studies are uniformly positive: relational basin stability supports structural safety (functional dyad basin), gateway compatibility predicts dyadic coherence, cross-band coherence pairing effects behave as specified, relational provision composites (TMRC) correlate with coherence, and dyadic interaction types (reinforcing vs catalytic) differentiate as the theory predicts.

The synthetic benchmarks introduce the domain’s most productive nulls. The domain-channeling study tested whether the engine’s domain flow distributions are domain-specific rather than arbitrary; of 3 tests, 2 were null and only 1 was reportable. The emergent-phenomena study confirmed that emergent trap presence is better predicted by full dyadic interaction features than by individual-only features, but 2 of 4 tests were null, limiting the scope of emergence claims. The cross-domain asymmetry study confirmed that symmetric domain adjacency weights produce directional flow asymmetry, but 1 of 4 findings was null and 1 fell below threshold.

Hidden validation arms were present in all 17 dyadic studies. Thirteen confirmed that dyadic constructs capture latent relational information beyond what resonance coupling alone can index. Four returned null. This is a structured boundary: the model’s dyadic engine reliably differentiates relational structure across some dimensions (fault cascade, shadow alignment, lock-buffer forecasting) but not others (domain channeling specificity, some emergence pathways).

A further synthetic benchmark tested whether d40 assessments recover hidden dyadic channel potential (confirmed). The two applied synthetic benchmarks tested whether dyadic structure forecasts lock versus buffer states (confirmed) and whether dyadic constructs align with Gottman and attachment framework templates (all 5 findings reportable). The template-alignment study is notable: it suggests that the model’s relational geometry, when probed against relationship science frameworks within synthetic conditions, produces structural distinctions those frameworks emphasize. This is a semantic-correspondence finding, not a clinical validation, but it is the closest the program comes to bridging internal architecture and external relational theory.

Clinical (7 studies, 35 findings, 8 nulls)

The clinical domain tests whether the model’s structural features predict clinically relevant proxy outcomes within its own simulator. Six applied synthetic benchmarks and one synthetic benchmark collectively address compensation detection, stress forecasting, intervention prioritization, differential diagnosis, termination signaling, and cascade correlation architecture.

The strongest results come from applied synthetic benchmarks with clear structural predictions. The d40 compensation-brittleness benchmark confirmed that the model detects compensation-masked brittleness beyond what surface severity indicates. The d40 stress-challenge forecast confirmed that a structural reading of a single assessment forecasts later destabilization better than severity alone. The intervention-priority study confirmed that hot core health predicts centered count and that topological features track intervention ordering logic (8 of 9 findings reportable).

The termination-markers study is the clinical domain’s primary weakness: only 1 of 3 findings was reportable, with 2 nulls. Resonance total and domain-health contrasts do not yet reliably predict centered count as a completion signal. This is practically significant: if the model cannot identify when a centering process is approaching completion, the clinical path engine lacks a reliable stopping criterion.

The cascade-correlation benchmark tested whether changes in one capacity domain propagate predictably to others. Six of 8 findings were reportable; 2 were null. The nulls identify specific cross-domain propagation pathways that do not behave as the model’s cascade dynamics would predict, suggesting that the cascade architecture may need recalibration for certain domain pairs.

Capacities (2 studies, 12 findings, 0 nulls)

Both capacity studies are formula verification, and all 12 findings are reportable with large effect sizes. Capacity health scores jointly and individually predict coherence. Open and Bond variance predict trap vulnerability. Move variance predicts gateway bonus. Cross-axis health interactions confirm that row-level and column-level health co-vary through structural mechanisms rather than trivially.

Two of the 12 findings carry circularity flags (disposition: expected formula check) and are reported as implementation fidelity checks. The complete absence of nulls across both studies is itself a datum: in synthetic benchmarking of a deterministic formula, universal confirmation may reflect internal coherence or insufficient adversarial coverage. The domain synthesis explicitly notes this concern.

States (2 studies, 16 findings, 0 nulls)

The states subsystem tests the asymmetric penalty mechanism through under-expression-present versus absent contrasts and under-condition-present versus absent contrasts, alongside the capacity centering target system. All 16 findings are reportable. The asymmetric-impact study confirmed large coherence penalties when under-expression conditions are present across capacity and domain axes, with effect sizes at or above d = 1.0 for several domain-level contrasts (Relational d = 1.05, Physical d = 1.03, Mental d = 1.00). It did not run a matched under-versus-over head-to-head comparison. The capacity-target-validity study confirmed that centered states are computationally distinct from deviations and that centering counts propagate reliably into downstream scoring.
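
A small sketch of how a present-versus-absent coherence contrast of this kind could be quantified follows. Pooled-standard-deviation Cohen's d is an assumed estimator, and the coherence values below are arbitrary stand-ins rather than engine output.

    import numpy as np

    def cohens_d(coherence_present, coherence_absent):
        """Pooled-SD Cohen's d for a condition-present vs condition-absent contrast.
        A positive value means the condition's presence lowers coherence."""
        a = np.asarray(coherence_present, dtype=float)
        b = np.asarray(coherence_absent, dtype=float)
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
        return (b.mean() - a.mean()) / np.sqrt(pooled_var)

    rng = np.random.default_rng(2)
    with_under_expression = rng.normal(0.45, 0.12, size=400)      # penalized coherence
    without_under_expression = rng.normal(0.58, 0.12, size=400)
    print(f"d = {cohens_d(with_under_expression, without_under_expression):.2f}")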

Neither study raised circularity flags. Both are formula verification. The zero-null rate is expected for tests of implementation fidelity in a deterministic system.

Formations (3 studies, 14 findings, 0 nulls)

The formation system — 76 higher-order personality structures derived from coherence band and grid variance patterns — receives uniformly positive results across three formula verification studies. Grid completion and pair density predict formation emergence (r = 0.81 for pair density and coherence). Topology fulcrum health predicts coherence (r = 0.82). Dynamics momentum correlates with coherence (r = 0.77). PCA on the seven dynamics metrics confirms they organize into interpretable dimensions rather than representing redundant facets.

The zero-null rate across formations is notable given the construct’s complexity: 76 formations computed from multiple grid-level properties could easily produce dimensional collapse or measurement artifacts. That they do not is evidence of implementation integrity, though the absence of semantic grounding studies for formations is a gap — the program has not yet tested whether formation labels carry meaning in distributional semantic space.

Coherence (1 study, 1 finding, 0 nulls)

A single internal consistency study confirmed that coherence correlates inversely with resonance total (r = -0.48), the model’s measure of cross-domain dissonance propagation. This is a modest but meaningful internal relationship: the two metrics share a common data source (the 20 harmony scores) but process them through different computational pathways, and the observed correlation is substantially below unity, indicating they capture partially distinct variance. Two original hypotheses (fault lines vs coherence, grid completion vs coherence) were removed as tautological, which is itself a positive signal for circularity governance.

Measurement (1 study, 25 findings, 1 null)

The engine-comparison benchmark tested cross-architecture consistency across the three assessment engines (d40, c135, p180) that independently measure the same 20-dimensional structure. The ICC on continuous cell values was 0.998. Fleiss kappa for cell state classification was 0.94 (computed across 2 engines for the cell-state agreement measure). Mean per-cell accuracy at sigma = 0.50 was 0.93. These are high internal concordance values, confirming that the three engines converge on the same structural portraits from the same input. Nine findings carry circularity flags (shared computational ancestry with the coherence formula) and are governed as shared-anchor benchmarks. The single null came from a separate noise-sensitivity boundary test rather than from that governed set.
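
The sketch below shows a two-way random-effects ICC of the kind such a cross-engine concordance check might use. The ICC(2,1) single-measure, absolute-agreement form is an assumption; the benchmark's exact estimator is not specified in the source material.

    import numpy as np

    def icc2_1(ratings):
        """ICC(2,1): two-way random effects, absolute agreement, single measure.
        ratings: (n_targets, k_raters) array, e.g. one row per grid cell and one
        column per assessment engine (d40, c135, p180)."""
        Y = np.asarray(ratings, dtype=float)
        n, k = Y.shape
        grand = Y.mean()
        ss_rows = k * np.sum((Y.mean(axis=1) - grand) ** 2)
        ss_cols = n * np.sum((Y.mean(axis=0) - grand) ** 2)
        ss_error = np.sum((Y - grand) ** 2) - ss_rows - ss_cols

        msr = ss_rows / (n - 1)                 # between-target mean square
        msc = ss_cols / (k - 1)                 # between-rater mean square
        mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
        return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    # Illustration: three engines scoring the same cells with tiny disagreements
    # produce an ICC near 1, in the spirit of the reported 0.998.
    rng = np.random.default_rng(3)
    truth = rng.uniform(0, 1, size=200)
    engines = np.column_stack([truth + rng.normal(0, 0.01, size=200) for _ in range(3)])
    print(f"ICC(2,1) = {icc2_1(engines):.3f}")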

Robustness (2 studies, 12 findings, 0 nulls)

Both robustness studies examined coherence formula behavior under extreme input conditions. Capacity-row variance predicts coherence under extreme conditions (r = -0.69 for both Move and Focus). The stress test confirmed that cross-capacity and cross-domain variance measures track coherence as expected without undue sensitivity to any single scale direction. The zero-null rate is appropriate for boundary testing of a deterministic aggregation formula: the formula either degrades gracefully or it does not.

Paths (5 studies, 21 findings, 2 nulls)

The centering paths subsystem receives substantial benchmarking across formula verification, simulator comparison, and applied synthetic evidence types. The simulator comparison confirmed that gateway-based intervention ordering yields a modest but reliable trajectory advantage over random ordering. Formula verification confirmed that path-level dynamics (compensation, oscillation), overlay modulation channels (cascade-trap, compensation-count), and path efficiency metrics (milestone count predicts coherence at rs = 0.73) behave as specified.
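
A hedged sketch of the shape of such a policy comparison appears below. The simulator, the trajectory score, and the priority ordering are stubs supplied by the caller, not the paths engine's actual dynamics or gateway logic.

    import numpy as np

    def compare_orderings(simulate, profiles, priority_order, n_runs=1000, seed=0):
        """Monte Carlo comparison of a prioritized intervention ordering against a
        random ordering of the same targets.

        simulate(profile, ordering, rng) -> trajectory score (stubbed by caller).
        priority_order(profile) -> list of intervention targets in priority order.
        Returns the mean per-run advantage of the prioritized policy."""
        rng = np.random.default_rng(seed)
        advantages = []
        for _ in range(n_runs):
            profile = profiles[rng.integers(len(profiles))]
            ordered = list(priority_order(profile))
            shuffled = list(rng.permutation(ordered))
            advantages.append(simulate(profile, ordered, rng) - simulate(profile, shuffled, rng))
        return float(np.mean(advantages))

    # Toy usage with a stub simulator that rewards addressing weaker cells earlier.
    def stub_simulate(profile, ordering, rng):
        gains = 1.0 / (np.arange(len(ordering)) + 1)
        return float(np.sum((1 - profile[ordering]) * gains) + rng.normal(0, 0.01))

    profiles = np.random.default_rng(4).uniform(0, 1, size=(50, 20))
    print(compare_orderings(stub_simulate, profiles, lambda p: list(np.argsort(p)), n_runs=200))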

The applied synthetic study tested practical translation: whether path outputs map to interpretable structural signals. Eleven of 12 findings were reportable, with one null. Grid completion strongly predicts path availability (rs = -0.96 for grid completion vs single path count). This is among the program’s strongest applied results: it confirms that the path engine’s outputs are structurally grounded in the grid state they claim to address.

Crosswalk (2 studies, 20 findings, 1 null)

Two synthetic benchmarks quantified information loss when Icosa profiles are translated through Big Five, MBTI, and Enneagram frameworks and reconstructed. The round-trip fidelity study (12 findings, all reportable) showed that capacity-level structure survives Big Five translation with moderate fidelity (r = 0.65 for Focus and Move) but with stronger retention for non-Physical domains. The compression benchmark tested direct reconstruction fidelity; Big Five reconstruction reached r = 0.74, below the pre-registered threshold of r >= 0.75.
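
The sketch below shows one way round-trip reconstruction fidelity could be computed: project profiles into a coarser summary space with a least-squares linear map, map them back, and correlate original with reconstructed cells on held-out data. The linear mappings and the synthetic data are illustrative assumptions, not the benchmark's actual crosswalk definitions.

    import numpy as np

    def roundtrip_fidelity(icosa, summary, seed=0):
        """Mean per-cell correlation between original and round-trip-reconstructed
        profiles, with forward and backward linear maps fit on a training half."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(icosa))
        train, test = idx[: len(icosa) // 2], idx[len(icosa) // 2 :]

        W_fwd, *_ = np.linalg.lstsq(icosa[train], summary[train], rcond=None)   # 20 -> 5
        W_bwd, *_ = np.linalg.lstsq(summary[train], icosa[train], rcond=None)   # 5 -> 20

        reconstructed = icosa[test] @ W_fwd @ W_bwd
        per_cell_r = [np.corrcoef(icosa[test][:, j], reconstructed[:, j])[0, 1]
                      for j in range(icosa.shape[1])]
        return float(np.mean(per_cell_r))

    # Illustration: 20 dimensions squeezed through 5 summary dimensions cannot be
    # fully reconstructed, which is the information-loss pattern the benchmark probes.
    rng = np.random.default_rng(5)
    icosa_profiles = rng.normal(size=(1000, 20))
    big_five_like = icosa_profiles @ rng.normal(size=(20, 5))   # stand-in summary scores
    print(f"mean round-trip r = {roundtrip_fidelity(icosa_profiles, big_five_like):.2f}")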

The crosswalk results establish a clear structural finding: the Icosa model carries information that coarser personality frameworks compress away, and the compression is not uniform — Physical domain content is least represented in Big Five summary space. This is useful architecture knowledge for crosswalk design, though it says nothing about whether the additional information Icosa retains corresponds to real personality variation.

Null Structure

Forty-five findings across the program were strict nulls. Their distribution is non-random and diagnostically informative. A separate layer of 10 below-threshold findings and 22 secondary positives (19 exploratory, 2 FDR-only, 1 raw-only) marks partial or provisional support rather than strict null structure.

Structural domain accounts for 17 of 45 strict nulls (37.8%), concentrated almost entirely in semantic grounding studies. The topology grounding study contributes 4 nulls, the crosswalk validity study 4 nulls, and the clinical-skepticism study 4 nulls, identifying a specific architectural vulnerability: the model’s higher-order topological vocabulary and its cross-framework mappings are the constructs least supported by distributional semantic evidence. The crosswalk validity finding is particularly notable: zero reportable findings across 11 tests of MBTI and Enneagram mapping validity means the semantic case for these crosswalks is currently unestablished.

Constructs domain accounts for 10 of 45 strict nulls (22.2%), concentrated in basin discovery (6 nulls) and equifinal paths (2 nulls). Basins are the weakest construct in the system: their hypothesized independent contribution to coherence beyond trap co-occurrence is not confirmed.

Clinical domain accounts for 8 of 45 strict nulls (17.8%), concentrated in termination markers (2 nulls) and cascade correlations (2 nulls). The termination-marker weakness has direct operational implications for the paths engine.

Dyadic domain accounts for 6 of 45 strict nulls (13.3%), distributed across domain channeling (2), emergent phenomena (2), and single nulls in cross-domain asymmetry and shadow alignment. The dyadic nulls establish boundaries around specific mechanisms (domain-channeling specificity, some emergence pathways) rather than threatening the engine’s overall architecture.

Paths, crosswalk, and measurement contribute 2, 1, and 1 nulls respectively. Capacities, coherence, formations, geometry, robustness, and states contribute zero nulls — a pattern consistent with their formula-verification-dominated evidence bases, where the engine either implements its specification or it does not.

The null structure’s concentration in semantic grounding studies carries a program-level message: the Icosa model’s computational architecture passes internal consistency checks at high rates, but its vocabulary and cross-framework claims face unresolved semantic challenges. The gap between “the formula works” and “the words mean what we say they mean” is the program’s central unresolved tension.

Circularity and Provenance

Eleven studies (14.3%) contain explicit circularity governance flags, totaling 22 governed analyses (5.0% of all findings). Those analyses are distributed across measurement (9), structural (5), capacities (2), clinical (2), geometry (2), formations (1), and paths (1). All 77 studies achieved complete provenance documentation.

The circularity governance rate is low, and the governed findings are handled appropriately — none are presented as independent discoveries. Two additional governance actions deserve mention: the coherence-convergent study removed two hypotheses (fault lines vs coherence, grid completion vs coherence) as tautological before analysis, and the clinical-skepticism study explicitly framed its mixed results as partial support rather than claiming that the model fully withstands skeptical challenge.

The more consequential circularity concern is structural rather than statistical: the entire program tests a model against itself. Formula verification studies check whether code implements equations. Synthetic benchmarks generate data from the model and test whether the model recovers patterns from that data. Even semantic grounding studies, the least circular evidence type, test whether language-model embeddings of the model’s vocabulary show the structure the model claims. At no point does an external criterion — a human response pattern, a clinical outcome, a behavioral observation — enter the evidence chain. This is not a flaw in the research program; it is the research program’s explicit scope. But it means the program cannot, by construction, produce evidence of external validity.

Anchor Findings and Effect-Size Distribution

The program reports 37 anchor findings — the strongest effects per domain. The effect-size distribution reveals the architecture’s internal signal strength.

At the top end, the measurement domain shows near-perfect cross-engine concordance (ICC = 0.998, kappa = 0.94). PCA-based dimensionality checks confirm full-rank structure in both the 20-center grid and the 4-capacity/5-domain subsystems. The paths domain shows extremely strong grid-completion-to-path-availability coupling (rs = -0.96). These values are high for any benchmarking context and indicate that the fundamental computational architecture is tightly coupled.

In the moderate-to-large range, geometry studies show hot-core/coherence alignment at r = 0.94, core-periphery topology at r = 0.89, and polarity structure metrics in the r = 0.77-0.82 range. Formation studies show topology-coherence and density-coherence correlations of r = 0.81 and r = 0.82. The states domain shows large under-expression-related coherence penalties with effect sizes consistently above d = 1.0. The dyadic domain shows fault-cross prevalence differences above d = 1.27, cascade vulnerability effects at r_rb = 0.93, and cross-domain asymmetry at d = 0.97.

The construct interaction effects are among the largest standardized differences in the program, with gateway-state impacts on trap count between d = 1.60 and d = 1.72. In the moderate range, the clinical domain shows intervention-priority signals around rs = 0.59, and the crosswalk domain shows reconstruction fidelity at r = 0.65-0.74, which is moderate by psychometric standards and establishes the information-loss boundary.

The coherence-resonance relationship at r = -0.48 is the program’s most conservative anchor effect, appropriately so: it measures the relationship between two summary metrics that share input data but process it differently.

The effect-size landscape is consistent with a tightly specified computational system: internal relationships between constructs derived from the same grid tend to be strong, and the strength tracks the degree of shared computational ancestry. This is neither surprising nor uninformative — it tells you the architecture is self-consistent — but it should not be mistaken for large effects in a population-measurement context.

Cross-Domain Architectural Implications

Several patterns emerge from reading across all 13 domain syntheses.

The coherence formula is the gravitational center. Nearly every domain’s studies reference coherence as the primary dependent variable. The formula aggregates five input streams (capacity good-flow, domain stability, asymmetric penalty, gateway healing power, trap/basin penalties), and the program’s studies collectively validate that each stream contributes as specified. The risk is monoculture: if coherence is the only validated summary metric, the model’s clinical utility depends entirely on whether coherence corresponds to an externally meaningful construct. The coherence-resonance internal consistency check (r = -0.48) suggests that resonance total provides partially distinct information, but this secondary metric has received far less benchmarking.
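
The sketch below illustrates the aggregation shape described here, nothing more: the weights, signs, and clipping are placeholder assumptions, not the engine's actual coherence formula.

    import numpy as np

    def coherence_score(capacity_good_flow, domain_stability, asymmetric_penalty,
                        gateway_healing_power, trap_basin_penalty,
                        weights=(0.3, 0.3, 0.2, 0.1, 0.1)):
        """Toy aggregation of the five input streams named in the synthesis:
        supportive streams add to coherence, penalty streams subtract. All weights
        are placeholders; the result is clipped to [0, 1] for readability."""
        w = np.asarray(weights, dtype=float)
        supportive = (w[0] * capacity_good_flow
                      + w[1] * domain_stability
                      + w[3] * gateway_healing_power)
        penalties = w[2] * asymmetric_penalty + w[4] * trap_basin_penalty
        return float(np.clip(supportive - penalties, 0.0, 1.0))

    # Example: strong supportive streams with mild penalties.
    print(coherence_score(0.9, 0.8, 0.2, 0.7, 0.1))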

The grid’s dimensionality is confirmed but its topological overlay is not yet grounded. The 20-center grid passes all dimensionality and independence checks. The higher-order topological vocabulary (gateways, resonance trees, transformation paths, hot core, wells, sources) passes formula verification but largely fails semantic grounding. This creates a split architecture: the grid’s cells are computationally and semantically real; the grid’s topology is computationally real but semantically unestablished.

Dyadic assessment is the model’s most ambitious extension and its most boundary-rich. The 17 dyadic studies confirm that the relational engine produces structurally differentiated outputs across multiple mechanism channels, but the null rate on domain-channeling and emergence pathways indicates that some dyadic mechanisms do not yet produce the specificity the theory claims. The hidden validation arms — 13 confirmed, 4 null — provide the most granular boundary information anywhere in the program.

The construct hierarchy has a weak link. Traps, gateways, and fault lines all receive strong formula-verification support. Basins do not. Six null findings in basin discovery suggest that basins either require different analytic conditions to manifest or do not carry the independent predictive value the model ascribes to them. This has practical consequences: basins appear prominently in the model’s clinical vocabulary, and if they lack independent structural grounding, that vocabulary is partially unsupported.

Crosswalk mappings face a dual challenge. The compression benchmarks establish that Icosa carries more information than Big Five, MBTI, or Enneagram can reconstruct. But the semantic grounding studies for crosswalk validity returned zero reportable findings for MBTI and Enneagram mappings. The model can quantify how much information crosswalks lose but cannot yet demonstrate that its crosswalk assignments are semantically valid. This is a significant gap for a feature that clinicians would use to translate between frameworks.

The paths engine works but lacks a stopping criterion. Path efficiency, overlay modulation, and practical translation all receive strong benchmarking. The simulator comparison confirms trajectory advantages for gateway-based ordering. But the termination-markers study’s 2-of-3 null rate means the model does not yet have reliable structural signals for when a centering process is approaching completion. For a clinical intervention tool, knowing when to stop is as important as knowing where to start.

What External Evidence Is Still Missing

The program identifies a specific, structured set of external evidence requirements that no amount of internal benchmarking can address.

Human response data. The foundational question: do the 20 centers, 4 capacities, 5 domains, and their derived constructs correspond to measurable variation in human personality? This requires administering the d40, c135, or p180 assessments to human populations and testing whether the resulting grid structures show the dimensional independence, construct differentiation, and coherence relationships that the synthetic program confirms. Without this, the model is a validated computation, not a validated personality assessment.

Test-retest reliability. The measurement benchmark shows that three engines converge on the same structural portrait from the same input (ICC = 0.998), but human respondents introduce response variability that synthetic profiles do not. Whether the grid structure is stable across re-administration is unknown.

Criterion validity. The applied synthetic benchmarks show that the model’s structural features predict clinically relevant proxy outcomes within the simulator: compensation-masked brittleness, stress-challenge forecasting, lock-buffer prediction. Whether these structural features predict actual clinical outcomes — therapy response, symptom trajectories, relational satisfaction, occupational functioning — is the model’s ultimate validity test and is entirely unaddressed.

Convergent and discriminant validity against established instruments. The crosswalk compression benchmarks quantify information loss in translation, but they do not test whether the Icosa model’s unique information — the portion that Big Five, MBTI, and Enneagram discard — corresponds to measurable personality variation not captured by those instruments. This is the value proposition: if the model’s additional dimensionality does not track additional real-world variation, the 20-center grid offers computational complexity without measurement benefit.

Clinical sensitivity and specificity. The trap taxonomy, basin discovery, and fault-line cascade studies establish internal construct differentiation, but whether these constructs differentiate clinical populations from each other and from non-clinical populations is untested. The model’s 50 traps, 76 formations, and multiple basin types represent a rich clinical vocabulary that may or may not carve nature at its joints.

Dyadic external validation. The dyadic program’s alignment with Gottman and attachment frameworks is a semantic-correspondence finding within synthetic conditions. Whether the model’s relational constructs (collision risk, structural safety, transmission efficiency, shadow alignment) predict observable relationship outcomes requires dyadic data from real couples.

Semantic grounding with human judges. The embedding-based semantic grounding studies provide one operationalization of whether construct vocabulary is meaningful. An alternative, potentially more valid operationalization is expert-judge sorting: whether clinicians, given construct descriptions without framework labels, sort them into the categories the model predicts. This would address the topology-grounding weakness (4 of 7 null) from a different methodological angle.

Longitudinal sensitivity to change. The paths engine computes intervention trajectories, and the simulator comparison confirms ordering advantages. Whether these computed trajectories track actual change processes in therapy requires longitudinal assessment data, ideally with repeated administration across a treatment course.

Summary

The Icosa internal research program produces a clear, bounded result. The computational architecture is internally consistent: formulas implement their specifications, constructs differentiate from each other as designed, and the grid’s claimed dimensionality holds in the model’s output space. Effect sizes for internal relationships are large, as expected for a tightly specified deterministic system.

The program’s nulls concentrate in semantic grounding, particularly for topological vocabulary and crosswalk mappings to external frameworks. Basins are the weakest construct. Termination markers are the weakest operational component. Domain-channeling specificity is the weakest dyadic mechanism.

The evidence base is entirely model-internal. It establishes implementation fidelity, identifies architectural stress points, and defines the external evidence program required for validity claims. It does not and cannot establish that the Icosa model measures human personality accurately, predicts clinical outcomes, or provides assessment utility beyond existing instruments. Those questions require human data that this program was not designed to produce.

The 82.5% reportable rate across 441 findings, with 45 strict nulls concentrated in identifiable domains and 22 governed analyses across 11 studies, indicates a mature internal benchmarking program. The next phase of evidence — human response data, criterion validation, and clinical sensitivity testing — will determine whether the architecture this program validates corresponds to anything beyond its own elegant geometry.

Downloads

Replication materials for the component studies in this paper.

Formula Verification: Capacity and Domain Aggregates with Variance Sanity Checks
Formula Verification: Cross-Axis Health Co-Variation in the Icosa Grid
Synthetic Applied Benchmark: Individual Relational Risk Signals for Dyadic Planning
Synthetic Applied Benchmark: D40 Detection of Compensation-Masked Brittleness
Synthetic Applied Benchmark: D40 Structural Forecasting Under Hidden Stress Challenge
Synthetic Applied Benchmark: Structural Triage Signals and Latent Scenario Differentiation
Synthetic Benchmark: Domain Cascade Correlations in the Icosa Model
Synthetic Applied Benchmark: Planner-First Prioritization Signals
Synthetic Applied Benchmark: False-Readiness Markers in Centering Completion
Internal Consistency: Coherence and Resonance Summary Metrics
Synthetic Benchmark: Attractor Basins and Latent Rigidity in the Icosa Grid
Synthetic Benchmark: Latent Risk Architecture Across Traps, Basins, Gateways, and Fault Lines
Equifinal Path Discrimination: Structural Neighborhoods Across Activation Routes
Synthetic Benchmark: Gateway States as Forecasts of Destabilization and Fragility
Formula Verification: Gateway Open-State and Cascade-State Alignment
Synthetic Benchmark: Geometric Precursors of Latent Trap Liability
Synthetic Benchmark: Trap-Class Signatures in Latent Failure Modes
Synthetic Benchmark: Cross-Framework Compression in Big Five, MBTI, and Enneagram Translation
Synthetic Benchmark: Big Five Crosswalk Round-Trip Fidelity
Synthetic Benchmark: D40 Recovery of Hidden Dyadic Channel Potential
Synthetic Applied Benchmark: D40 Forecasting Dyadic Lock vs Buffer States
Cross-Domain Flow Asymmetry: Directional Spillover from Symmetric Weights
Cross-Fault Cascade Vulnerability: Interlocking Partner Risks
Flow Tensor Domain Channeling: Domain-Specific Dynamic Expression
Emergent Dyadic Phenomena: Interaction Effects Beyond Individual Profiles
Shadow Alignment: Pathological vs Healthy Agreement Under Matched Total Alignment
Relational Basin States as Forecasts of Latent Dyadic Stability
Cross-Band Pairing Effects as Forecasts of Dyadic Buffering and Lock Risk
Structural Patterns in Dyadic Pairs as Forecasts of Latent Stability
Gateway Compatibility as a Forecast of Latent Channel Capacity
Reinforcing and Catalytic Interaction Pathways in Dyadic Systems
Composite Relational Provision as a Forecast of Buffering and Lock Risk
Risk and Protection Factors as Forecasts of Hidden Dyadic Asymmetry and Buffering
Interaction Tensor Structure as a Forecast of Latent Dyadic Capacity
Relational Trap Dynamics as Forecasts of Latent Dyadic Failure Modes
Synthetic Applied Benchmark: Dyadic Template Alignment with Gottman and Attachment Frameworks
Synthetic Benchmark: Formation-Supportive Geometry and Latent Stability
Formula Verification: Topology Summaries and Their Structural Correlates
Synthetic Benchmark: Dynamics Markers and Latent Stability in Icosa Formations
Formula Verification: Low-Redundancy Checks for Capacity Rows and Domain Columns
Synthetic Benchmark: D40 Geometry Beyond Row and Column Marginals
Formula Verification: Independence of Icosa Domains and Capacity Rows
Formula Verification: Dimensionality and Non-Redundancy of the Icosa Grid
Synthetic Benchmark: Polarity Balance as a Screen for Latent Fragility
Formula Verification: Core/Periphery Alignment in the Icosa Grid
Synthetic Benchmark: Cross-Architecture Measurement Consistency in Structural Personality Assessment
Simulation Benchmark: Gateway Ordering Policy vs Random Intervention Sequencing
Synthetic Applied Benchmark: Ordering Risk Signals in Computed Centering Plans
Synthetic Applied Benchmark: Overlay Dynamics as Modulators of Centering Plan Risk
Synthetic Applied Benchmark: Risk-Adjusted Efficiency of Computed Centering Paths
Synthetic Applied Benchmark: Translating Computed Centering Paths into Actionable Signals
Formula Verification: Icosa Behavior at Extreme Input Values
Stress Test: Coherence Response to Cross-Capacity and Cross-Domain Variance
Formula Verification: Under- Versus Over-Expression Penalty Behavior
Formula Verification: Capacity Centering as an Icosa Benchmark
Structural Benchmark: Geometrically derived escape centers align with independently coded clinical int
Structural Benchmark: Dyadic dynamics descriptions are semantically grounded in established couple-the
Structural Benchmark: Clinical patterns (traps, compensations, basins, fault lines) are grounded in gr
Structural Benchmark: Icosa model withstands skeptical challenges on framework separation and semantic
Structural Benchmark: Harmonies are compositional and 9-state structure is grounded
Structural Benchmark: MBTI and Enneagram crosswalk mappings are semantically valid and preserve struct
Structural Benchmark: The 4x5 grid structure is real in semantic space
Structural Benchmark: Pattern vocabulary clusters by dimension identity and centered = positive
Structural Benchmark: Coherence bands and formations are organized by severity in semantic space
Structural Benchmark: Structural dynamics constructs (dynamics factors, hidden traps, SAI, gateway sta
Structural Benchmark: The Icosa grid topology is real in semantic space
Structural Benchmark: The 4 capacities are distinct but partially correlated constructs
Structural Benchmark: Directed resonance load is directional and increases as domain health and cohere
Structural Benchmark: The 5 domains are distinct constructs with modest shared column structure
Structural Benchmark: Fault lines predict coherence loss and trap formation — they are the structural
Structural Benchmark: Mental domain centered state dampens resonance transmission across the grid
Structural Benchmark: Mirror pairs show bounded structural polarity rather than collapse
Structural Benchmark: Repeller cells (BS, VE) resist centering and destabilize coherence
Structural Benchmark: Unresolved resonance load (resonance_total) inversely correlates with gateway he
Structural Benchmark: Traps, shadows, and fault lines co-occur as a cluster of structural fragility
Structural Benchmark: OE well sinks coherence; VP source rises with coherence
Structural Benchmark: Pathology impact asymmetry — direct traps damage coherence more than cascading r