Measurement Consistency: Cross-Engine Agreement and Architectural Comparison

Icosa · personality assessment · psychometrics · engine comparison · cross-architecture · measurement consistency

Scope and Evidentiary Status

This synthesis draws on a single large-scale synthetic benchmark study that evaluated cross-architecture measurement consistency across the three assessment engines implementing the Icosa personality model’s 4x5 grid structure. All evidence reported here is model-internal. Ground-truth profiles were generated from the Icosa grid model itself, and all scoring, perturbation, and evaluation occurred within the same computational framework. No human respondents participated. The results characterize the internal measurement architecture under controlled conditions and identify stress points, boundary behaviors, and open questions that require external evidence to resolve.

The evidence taxonomy governing this synthesis is as follows. Formula verification studies check implementation fidelity, not external discovery. Semantic grounding studies support alignment to a mapped corpus, not human outcomes. Synthetic benchmarks show simulated behavior and boundary conditions. Applied synthetic studies show scenario-level usefulness inside the simulator. The single study under review is a synthetic benchmark: it tests whether three engines that target the same 20-dimensional construct space produce the same structural portraits from the same input, and how those portraits degrade under simulated noise.

The study’s circularity audit flagged nine analyses (all in hypothesis families H7 and H8) as sharing computational ancestry with the Coherence formula. These are governed as shared-anchor benchmark checks throughout this synthesis. Their results are reported for completeness but are not treated as independent evidence of construct validity.

Benchmark Design

The study, titled “Synthetic Benchmark: Cross-Architecture Measurement Consistency in Structural Personality Assessment,” evaluated three engines: ICOSA-D40 (a direct 40-item battery), ICOSA-C135 (an adaptive convergent algorithm using dual-track Bayesian inference with 3-way majority voting), and ICOSA-P180 (a full-posterior probabilistic engine using EP/Laplace/MC approximation with confidence tiers). All three target the same 20-center personality structure defined by the Icosa model’s four Capacities (Open, Focus, Bond, Move) crossed with five Domains (Physical, Emotional, Mental, Relational, Spiritual).

The three engines differ substantially in their psychometric strategies. The d40 uses direct scoring: each of 40 items maps to specific cells, and the profile is computed from item responses without adaptation. The c135 uses dual-track Bayesian convergence: two parallel inference tracks (pattern and formation) adaptively select items based on current posterior estimates, converging via 3-way majority voting across approximately 93 items. The p180 uses full posterior estimation via expectation propagation, Laplace approximation, and Monte Carlo methods across 58-70 items, producing not just point estimates but calibrated confidence tiers for each cell value. These architectural differences mean that even when processing the same underlying response pattern, the engines traverse different item sequences, apply different estimation algorithms, and produce their final profiles through distinct computational paths.
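To make the cross-engine comparison concrete, the sketch below shows one way the shared 4x5 grid output could be represented and compared. The function name, value range, and comparison metric are illustrative assumptions, not the engines’ actual interfaces.

```python
import numpy as np

CAPACITIES = ["Open", "Focus", "Bond", "Move"]        # grid rows
DOMAINS = ["Physical", "Emotional", "Mental",
           "Relational", "Spiritual"]                 # grid columns

def max_cell_disagreement(d40, c135, p180):
    """Largest per-cell spread across the three engines' 4x5 outputs."""
    stack = np.stack([d40, c135, p180])               # shape (3, 4, 5)
    return float(np.ptp(stack, axis=0).max())         # peak-to-peak per cell

rng = np.random.default_rng(0)
profile = rng.uniform(0, 1, size=(4, 5))              # one synthetic profile
print(round(max_cell_disagreement(profile, profile + 0.01, profile - 0.01), 3))
```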

The benchmark used 665 base synthetic profiles: 500 random configurations and 165 clinical persona archetypes. The random configurations sampled uniformly across the 20-dimensional cell-value space, while the clinical persona archetypes represented structured profile patterns corresponding to specific psychological presentations. For specific analyses, those base profiles were expanded into a larger set of engine-profile pairings spanning all three engines and both deterministic and stochastic scoring conditions. Eight pre-registered hypothesis families (H1 through H8) tested cross-engine convergence, noise robustness, p180 calibration, c135 internal track agreement, precision-efficiency frontiers, severity equivalence, per-Capacity consistency, and per-Domain consistency. Holm-Bonferroni correction was applied within the family of 25 tests, and program-level FDR correction was applied across the same 25 analyses (drawn from a larger pool of 436 tests). Of the 25 analyses, 24 survived correction as final reportable findings; one result was null.
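For reference, here is a minimal sketch of the Holm-Bonferroni step-down procedure named above (statsmodels’ multipletests with method='holm' provides an equivalent off-the-shelf routine). The p-values are fabricated for illustration and are not the study’s, apart from reusing the reported null p = .055.

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: test p-values in ascending order against
    alpha / (m - rank); stop at the first non-rejection."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] > alpha / (m - rank):
            break                          # all larger p-values also fail
        reject[idx] = True
    return reject

# Illustrative p-values only (not the study's): 24 strong results, 1 at .055.
pvals = [0.0001] * 24 + [0.055]
print(holm_bonferroni(pvals).sum(), "of 25 rejected at alpha = .05")
```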

The signal profile for the study was: 24 reportable findings, 1 null, 0 below practical threshold, 0 exploratory positives, 0 FDR-only, 0 raw-only, 0 not evaluable. The absence of below-threshold or exploratory findings reflects the benchmark’s self-consistency design: these engines were built to target the same construct space, so strong convergence under clean conditions is the expected baseline, not a surprising discovery.

Major Findings

Cross-Engine Convergence on Continuous and Categorical Profiles

The headline result is near-perfect agreement across engines on continuous cell values: ICC(2,k) = 1.00, p < .001, ES = 0.998. This exceeded the pre-registered threshold of 0.85 by a wide margin. Categorical agreement was also strong: Fleiss’ kappa for cell-state classification was .94 (p < .001), Krippendorff’s alpha for ordinal band agreement was .87 (p < .001), and Kendall’s W for band concordance was .91 (p < .001). All four metrics exceeded their respective thresholds of 0.80.
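As a reference point for the categorical metrics, here is a minimal numpy implementation of Fleiss’ kappa, the statistic reported for cell-state classification (statsmodels.stats.inter_rater.fleiss_kappa is an equivalent library routine). The three-engine labeling setup is illustrative, not the study’s actual data.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_subjects, n_categories) count matrix whose
    rows each sum to the number of raters (here, 3 engines)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                        # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()          # category prevalence
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    return (P_i.mean() - np.square(p_j).sum()) / (1 - np.square(p_j).sum())

# Illustrative: three engines assigning each of 100 cells to one of 5 states.
rng = np.random.default_rng(2)
labels = rng.integers(0, 5, size=(100, 3))           # random => kappa near 0
table = np.stack([np.bincount(row, minlength=5) for row in labels])
print(round(fleiss_kappa(table), 3))
```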

This level of agreement is partly expected. The three engines target the same 20-center construct space, and downstream constructs (Coherence, Gateways, Traps) are deterministic functions of those 20 cell values. Agreement at the cell level mathematically entails agreement at all higher deterministic levels. What the result confirms is that the three architecturally distinct estimation procedures — direct scoring, dual-track Bayesian convergence, and full posterior approximation — introduce negligible measurement error under synthetic conditions. Adaptive engines make path-dependent item selection decisions; probabilistic engines introduce estimation noise through approximation methods. The ICC result indicates that these architectural differences do not produce divergent structural portraits when the input is clean.

This is a necessary condition for trustworthy output with human respondents, but it is not sufficient. The result tells us the engines can agree; it does not tell us they will agree when processing actual human responses with their structured noise patterns.

The Formation Classification Boundary

Formation-level agreement was sharply lower: Fleiss’ kappa = .09, p < .001, ES = 0.092. While this exceeded the pre-registered threshold of 0.05, it represents only slight agreement on the Landis-Koch scale. The 76-category Formation taxonomy is derived from Coherence Band and grid variance pattern, creating a classification space where small continuous disagreements — invisible at the cell or band level — push profiles across Formation boundaries.

This finding is structurally informative rather than alarming. It reveals a mathematical property of high-resolution categorical taxonomies: they amplify continuous measurement error. A profile sitting near a Formation boundary can be assigned to different Formations by engines that agree on its continuous cell values to several decimal places. The practical implication is that Formation labels should be interpreted as approximate descriptors rather than precise diagnostic categories. Confidence in Formation assignment depends on the profile’s distance from the nearest boundary, a quantity that the current benchmark did not explicitly compute but that future work could operationalize as a boundary-proximity metric.
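The paragraph above proposes a boundary-proximity metric as future work. A minimal sketch of one possible operationalization follows; the perturbation scheme, the `classify` interface, and the toy taxonomy are all hypothetical stand-ins.

```python
import numpy as np

def boundary_proximity(cells, classify, eps=0.05, n_probe=200, seed=0):
    """Hypothetical score: fraction of small random perturbations that
    change a profile's Formation label (0 = stable, 1 = on a knife edge).
    `classify` is an assumed callable from a 20-vector to a label."""
    rng = np.random.default_rng(seed)
    base = classify(cells)
    probes = cells + rng.normal(0, eps, size=(n_probe, cells.size))
    return sum(classify(p) != base for p in probes) / n_probe

# Toy stand-in for the 76-category Formation taxonomy.
toy_classify = lambda v: int(v.mean() * 10)
print(boundary_proximity(np.full(20, 0.5), toy_classify))   # near a boundary
```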

Noise Tolerance and Severity Equivalence

Under sigma = 0.5 stochastic perturbation, per-cell accuracy remained at 93% and band accuracy at 80%, both exceeding their pre-registered thresholds of 85% and 75% respectively. These figures establish a lower bound on measurement fidelity under moderate response noise, though the Gaussian noise model is a mathematical convenience rather than a realistic simulation of human response inconsistency (which includes structured patterns like social desirability, fatigue, and item misinterpretation).
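A minimal sketch of the perturb-and-rescore loop follows. The 0-10 cell-value scale and the five equal-width bands are illustrative assumptions about the study’s scoring, so the printed accuracy will not reproduce the reported figures.

```python
import numpy as np

def band_accuracy(truth, sigma=0.5, n_reps=100, seed=0):
    """Fraction of cells whose band assignment survives Gaussian noise.
    Assumes cell values on a 0-10 scale cut into five bands of width 2."""
    rng = np.random.default_rng(seed)
    to_band = lambda v: np.clip(v // 2, 0, 4).astype(int)
    noisy = np.clip(truth + rng.normal(0, sigma, (n_reps,) + truth.shape), 0, 10)
    return float((to_band(noisy) == to_band(truth)).mean())

rng = np.random.default_rng(3)
profiles = rng.uniform(0, 10, size=(665, 20))   # study-sized synthetic set
print(round(band_accuracy(profiles), 3))
```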

TOST equivalence testing confirmed that noise tolerance did not differ across severity categories: max |d| = 0.10, p = .022, within the pre-registered equivalence bound of delta = 0.12. Profiles spanning the full range from Thriving through Severe Coherence bands were measured with statistically equivalent accuracy under noise. This addresses the common concern that measurement precision varies with profile severity. Under synthetic conditions, the engines do not selectively lose precision for the most impaired profiles. Whether this holds with human respondents remains an open question.
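A sketch of the TOST logic using statsmodels’ ttost_ind, with fabricated accuracy scores for two severity bands. Converting the standardized bound (|d| <= 0.12) to raw units via the pooled SD is my assumption about how the bound would be applied, not a detail reported by the study.

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(4)
acc_thriving = rng.normal(0.93, 0.04, 665)   # fabricated example scores
acc_severe = rng.normal(0.93, 0.04, 665)

# Translate the standardized equivalence bound |d| <= 0.12 into raw units.
pooled_sd = np.sqrt((acc_thriving.var(ddof=1) + acc_severe.var(ddof=1)) / 2)
bound = 0.12 * pooled_sd

p_value, _, _ = ttost_ind(acc_thriving, acc_severe, -bound, bound)
print(f"TOST p = {p_value:.3f}")   # small p supports equivalence within bound
```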

The Null Finding: Degradation Gradient Across Feature Families

The single null finding in the benchmark concerns the predicted degradation gradient across structural feature families. The hypothesis was that compound features (Traps, Basins, Formations) — which aggregate across multiple centers — would degrade faster than atomic features (cell values, band classifications) under noise. The observed effect was in the predicted direction with a large effect size (r_s = 0.714), but the test failed to reach significance at alpha = .05 (p = .055).

This null deserves attention rather than dismissal. The most likely explanation is a ceiling effect: with per-cell accuracy already at 93% under sigma = 0.5, there was limited degradation variance to distribute across feature families. A sharper noise condition (e.g., sigma = 1.0 or 1.5) might separate the families more cleanly. An alternative explanation is that the assumed linear ordering of geometric complexity does not map straightforwardly onto noise sensitivity: a Basin involving four centers in the same Capacity row could be more resistant than a two-center Trap spanning different rows. Revisiting this hypothesis with a wider noise range and finer feature-family granularity is a clear next step.

Engine-Specific Findings

p180 Calibration

The p180 engine’s posterior-derived confidence tiers showed strong calibration: five-fold cross-validated expected calibration error averaged ECE = .05 (p < .001), well below the pre-registered threshold of 0.15. Active-tier predictions (posterior probability > 0.75) achieved precision of .92, and probable-tier predictions (0.50 to 0.75) achieved precision of .74; both exceeded their respective thresholds of 0.75 and 0.65.

The gap between active-tier and probable-tier precision confirms that the confidence tiers carry genuine discriminative information. When the p180 assigns high confidence, it is correct 92% of the time; at moderate confidence, it is still correct nearly three-quarters of the time. This tiered reliability structure could inform how downstream systems weight or flag p180 output, providing a built-in uncertainty quantification that the d40 and c135 engines lack. However, calibration was assessed under synthetic conditions; human response data may introduce systematic biases that shift the calibration curve.
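For readers unfamiliar with the calibration metric, here is a minimal sketch of expected calibration error over equal-width confidence bins, without the study’s five-fold cross-validation. The synthetic predictions are constructed to be well calibrated, so the printed ECE is small.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: occupancy-weighted mean |accuracy - mean confidence| over
    equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.clip((confidence * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidence[mask].mean())
    return ece

# Illustrative: predictions constructed to be well calibrated -> low ECE.
rng = np.random.default_rng(5)
conf = rng.uniform(0.5, 1.0, 5000)
hit = rng.uniform(size=5000) < conf
print(round(expected_calibration_error(conf, hit), 3))
```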

c135 Track Convergence

The c135 engine’s dual inference tracks (pattern track and formation track) agreed on 81% of deterministic cases, exceeding the pre-registered threshold of 70%. In the remaining 19% of cases where the tracks disagreed, that disagreement predicted structural error with AUC = .60 (p < .001), exceeding the threshold of 0.55. The discriminative power is modest but real: profiles that are difficult for one track tend to be difficult for both, and internal track divergence functions as a crude signal of measurement difficulty.

At AUC = .60, this signal has limited standalone practical value. Its greater potential lies in combination with p180 confidence tiers: a composite uncertainty index drawing on both sources could identify profiles where the measurement architecture is least certain, enabling flagging for re-test or clinician review. This combination was not tested in the current benchmark.
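Since the combination was not tested in the benchmark, the following is purely a hypothetical sketch of what such a composite index could look like. The weighting scheme, the inputs, and their scaling are all placeholder assumptions.

```python
import numpy as np

def composite_uncertainty(p180_conf, c135_tracks_agree, engine_spread,
                          weights=(0.5, 0.3, 0.2)):
    """Hypothetical composite index (untested in this benchmark): blend
    p180 posterior confidence, c135 track agreement, and cross-engine
    spread into one 0-1 uncertainty score. Weights are placeholders."""
    w1, w2, w3 = weights
    return (w1 * (1.0 - p180_conf)                      # weak posterior confidence
            + w2 * (0.0 if c135_tracks_agree else 1.0)  # track divergence
            + w3 * float(np.clip(engine_spread, 0.0, 1.0)))

# Moderate confidence, disagreeing tracks, small cross-engine spread:
print(round(composite_uncertainty(0.70, False, 0.05), 3))   # -> 0.46
```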

Precision-Efficiency Frontiers

Both c135 and p180 showed statistically non-linear (concave) precision-efficiency frontiers, consistent with diminishing marginal returns to additional items administered: marginal precision gains shrink toward zero once a minimum item count is reached, after which precision plateaus. The practical question — where each engine sits on its frontier during typical administration — cannot be resolved under synthetic conditions. If the c135 reaches its precision plateau at item 60 of its approximately 93-item administration, the remaining items contribute negligible accuracy. Identifying these plateaus with human response data would enable principled decisions about early stopping and engine selection based on the precision required for a given use case.
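One possible plateau-detection rule for the frontier analysis proposed above is sketched here. The tolerance value and the synthetic concave frontier are illustrative; the study did not specify a stopping criterion.

```python
import numpy as np

def precision_plateau(items, precision, tol=0.002):
    """Hypothetical stopping rule: first item count after which the
    marginal precision gain per item stays below `tol` for good."""
    gains = np.diff(precision) / np.diff(items)
    for i in range(len(gains)):
        if (gains[i:] < tol).all():          # gains never recover above tol
            return int(items[i + 1])
    return int(items[-1])

# Illustrative concave frontier: fast early gains, then a plateau.
items = np.arange(10, 95, 5)
precision = 1 - np.exp(-items / 20)
print(precision_plateau(items, precision))   # plateau onset (item count)
```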

Governed Benchmark Checks: Capacity and Domain Health

Nine analyses (H7 and H8) examined correlations between per-Capacity health scores, per-Domain health scores, and engine-mean Coherence. All nine produced large effects (r = .52 to .57, all p < .001). Specifically, the four Capacity health correlations with Coherence were: Open r = .52, Focus r = .53, Bond r = .57, Move r = .54. The five Domain health correlations were: Physical r = .53, Emotional r = .55, Mental r = .55, Relational r = .54, Spiritual r = .53. All were computed on N = 665 profiles, with Shapiro-Wilk tests indicating non-normality for both health scores and Coherence, though the large sample affords reasonable robustness to this assumption violation.

These results are governed as shared-anchor benchmark checks because both health scores and Coherence are computed from the same 20 cell values. The circularity audit identified the specific shared ancestors for each analysis: each Capacity health score shares its five constituent cell health values with the Coherence formula, and each Domain health score shares its four constituent cell health values. The observed correlations confirm that the computational pipeline correctly propagates cell-level information to row-level, column-level, and global summary statistics.

The narrow range across all nine analyses (r = .52 to .57, a span of .05) is consistent with each row and column contributing a similar proportion of variance to the Coherence computation. Bond Capacity showed the strongest Capacity association (r = .57), and the Emotional and Mental Domains tied for the strongest Domain association (r = .55); these small differences may reflect the asymmetric penalty structure in the Coherence formula, though any such interpretation should be held lightly given the shared-anchor status. The uniformity of these correlations is itself a verification finding: if one Capacity or Domain showed a markedly different correlation with Coherence, it would suggest an asymmetry in the formula that might warrant investigation. The observed uniformity confirms balanced contribution.
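The shared-anchor effect can be made concrete with a toy calculation: a row mean that shares five of the twenty cells with a global mean correlates with it by construction, even on pure noise. A plain grid mean stands in here for the actual Coherence formula, which the study describes as having an asymmetric penalty structure.

```python
import numpy as np
from scipy.stats import spearmanr

# A row mean sharing 5 of 20 cells with a grid mean correlates with it by
# construction; a plain mean stands in for the actual Coherence formula.
rng = np.random.default_rng(7)
profiles = rng.uniform(0, 1, size=(665, 4, 5))        # study-sized noise grids
capacity_health = profiles[:, 2, :].mean(axis=1)      # one row, e.g. "Bond"
coherence_proxy = profiles.reshape(665, -1).mean(axis=1)
rho, p = spearmanr(capacity_health, coherence_proxy)
print(f"rho = {rho:.2f}")   # ~.5 from pure noise, near the observed .52-.57
```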

These results cannot be upgraded to independent evidence of construct validity. They verify internal computational consistency — that the formulas work as designed — and nothing more. Independent confirmation would require external criteria (clinician ratings, behavioral outcomes, convergent/discriminant validity with established instruments) that are outside the scope of this benchmark.

Architectural Implications

Engine Selection Can Be Pragmatic

The convergence results suggest that the choice among d40, c135, and p180 can be made on pragmatic grounds — test length, respondent burden, uncertainty quantification needs — without sacrificing structural agreement at the cell or band level. The d40 offers the shortest administration. The c135 provides internal track disagreement as a crude uncertainty signal. The p180 provides calibrated confidence tiers. Under synthetic conditions, all three recover the same 20-dimensional portrait. This pragmatic-selection conclusion holds only insofar as the synthetic convergence extends to human response data.

Formation Labels Need Boundary-Aware Interpretation

The sharp drop from cell-level agreement (kappa = .94) to Formation-level agreement (kappa = .09) has direct implications for how Formation labels are communicated to practitioners and respondents. The current 76-category taxonomy produces categorical distinctions that the measurement architecture cannot reliably sustain across engines. Options include: (a) reporting Formation labels with an explicit boundary-proximity confidence metric, (b) collapsing the taxonomy into fewer, more stable categories, or (c) presenting Formation as a secondary descriptor below the more reliable band and cell-level profiles. The benchmark data support option (a) as the most informative approach, but the required proximity metric has not yet been developed.

Uncertainty Quantification Is Unevenly Distributed

The p180 engine provides calibrated confidence tiers; the c135 provides track-disagreement signals; the d40 provides neither. For applications where uncertainty quantification matters — clinical screening, high-stakes assessment, automated flagging for clinician review — the p180 has a structural advantage that the convergence results do not erase. A composite uncertainty framework that integrates p180 calibration, c135 track disagreement, and cross-engine disagreement (where multiple engines are administered) would provide the richest uncertainty picture, but this framework remains unbuilt.

The Gaussian Noise Model Is a Placeholder

The sigma = 0.5 Gaussian perturbation model served its purpose for establishing a baseline noise tolerance threshold, but it does not capture the structured patterns of human measurement error. Socially desirable responding tends to inflate certain Domains (Relational, Spiritual) more than others. Fatigue effects systematically depress precision in later items. Acquiescence bias introduces correlated errors across items with similar valence. All of these would produce patterned, non-Gaussian distortions of the 20-cell profile that the current benchmark did not test. These tolerance results should be treated as a lower bound that will likely shift — in an unknown direction — when confronted with human response patterns.
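A sketch of what a structured-noise generator might look like follows, using column index as a crude proxy for item order. Every rate, mapping, and the 0-1 value range are placeholder assumptions, not a specification from the study.

```python
import numpy as np

def structured_noise(profile, rng, desirability=0.3, fatigue=0.2, acq=0.15):
    """Hypothetical structured distortion of a 4x5 profile. Column order is
    assumed (Physical, Emotional, Mental, Relational, Spiritual) and column
    index serves as a crude proxy for item order; all rates are placeholders."""
    noisy = profile.copy()
    noisy[:, 3:] += desirability                      # inflate Relational, Spiritual
    noisy += rng.normal(0, fatigue, profile.shape) * np.linspace(0, 1, 5)
    noisy += acq * rng.standard_normal()              # one correlated shift
    return np.clip(noisy, 0, 1)                       # assumed 0-1 value range

rng = np.random.default_rng(6)
base = rng.uniform(0, 1, size=(4, 5))
print(np.round(structured_noise(base, rng) - base, 2))
```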

What This Synthesis Cannot Establish

Several conclusions that readers might be tempted to draw are explicitly outside the evidentiary reach of this synthesis:

  1. External validity. Agreement among engines scoring the same synthetic profiles does not demonstrate that any engine accurately captures human personality structure. The engines agree with each other; whether they agree with reality is untested.

  2. Test-retest reliability. The benchmark used single-administration scoring. Whether the same human respondent would receive the same profile across repeated administrations is unknown.

  3. Clinical utility. The severity equivalence finding (TOST) shows that synthetic noise tolerance does not vary by severity band. It does not show that the severity bands themselves correspond to clinically meaningful distinctions in human populations.

  4. Independent construct confirmation. The H7 and H8 results (Capacity and Domain health correlations with Coherence) are governed benchmark checks sharing computational ancestry with the Coherence formula. They verify pipeline fidelity, not construct validity.

  5. Superiority to existing instruments. No comparison with established personality measures (Big Five, HEXACO, MMPI) was conducted. The benchmark is self-referential by design.

Next-Step Research Priorities

The following priorities emerge from the benchmark results, ordered by their potential to advance the evidentiary base from synthetic to externally grounded:

Priority 1: Human-Data Replication. The same analytic framework (ICC, kappa, Krippendorff’s alpha, TOST equivalence, ECE calibration) should be applied to profiles generated from human respondent data. This is the single most important next step. The entire benchmark architecture was designed to transfer directly to human data without modification.

Priority 2: Structured Noise Models. Replace the Gaussian perturbation with response-style simulations: acquiescence, social desirability, fatigue gradients, random responding. Test whether the cross-engine convergence and noise tolerance results hold under structured distortion. This can be done synthetically and should precede the human-data study to generate specific predictions.

Priority 3: Formation Boundary Proximity. Develop and validate a boundary-proximity metric for Formation assignment. Profiles near Formation boundaries should receive explicit uncertainty markers. This is a computational development that can be built and tested within the current synthetic framework.

Priority 4: Composite Uncertainty Index. Combine p180 confidence tiers, c135 track disagreement, and (where available) cross-engine disagreement into a single uncertainty score. Test against known difficult profile types: profiles near Formation boundaries, low-Coherence profiles, profiles with unusual grid variance patterns.

Priority 5: Degradation Gradient Revisited. Re-run the feature-family degradation analysis with sigma values of 0.75, 1.0, and 1.5, and with finer feature-family granularity that separates within-row from cross-row compound features. The geometrically motivated prediction deserves a more powerful test.

Priority 6: Precision-Efficiency Plateau Mapping. Using human response data (once available), identify where each adaptive engine sits on its precision-efficiency frontier during typical administration. This determines whether early stopping rules can reduce respondent burden without meaningful precision loss.

Summary

This synthesis reports on a single large-scale synthetic benchmark evaluating measurement consistency across three assessment engines implementing the Icosa model’s 20-center personality structure. The engines showed near-perfect agreement on continuous cell values (ICC = 1.00) and strong agreement on categorical classifications (kappa = .94 for cell states, alpha = .87 for ordinal bands), with a sharp and informative drop to slight agreement on the 76-category Formation taxonomy (kappa = .09). Noise tolerance was adequate and equivalent across severity categories. The p180 engine demonstrated strong confidence-tier calibration (ECE = .05). One hypothesis — that compound features degrade faster than atomic features under noise — failed to reach significance (p = .055) despite a large observed effect, likely due to ceiling effects at the tested noise level. Nine governed benchmark checks confirmed computational pipeline fidelity without providing independent construct evidence. All findings are synthetic-benchmark results establishing internal measurement consistency. The most pressing next step is replication with human response data, where the convergence observed under idealized conditions will face the structured noise, bias, and distributional properties of real-world personality measurement.

Downloads

Replication materials for the component study synthesized in this paper.

Synthetic Benchmark: Cross-Architecture Measurement Consistency in Structural Personality Assessment