Literate Code Walkthrough
A statistical reader's guide to the regression-priors codebase — critiquing "LLM Embedding for Regression Priors" (Li et al., ICAIF '25)
1. The Big Picture
This codebase does one thing: it takes a published paper's claim — that embedding cosine similarities make good Bayesian regression priors — and subjects it to a battery of falsification tests. Along the way, it implements an alternative (LLM-elicited priors) and shows it works better.
The paper under scrutiny is Li et al.'s "Semantic Prior Regression" (SPR). Their idea:
- Embed the target variable description (e.g., "consumer price inflation...") and each feature description (e.g., "unemployment rate...") into a vector space using an LLM embedding model.
- Compute cosine similarity between each feature embedding and the target embedding.
- Use those similarities as the prior mean in a Bayesian linear regression: βⱼ ~ N(κ · simⱼ, τ²).
- Show R² improves over OLS on a grid search over κ and τ².
Our critique: the "improvement" comes from regularization (shrinking coefficients toward anything), not from the informative content of the similarities. We show this by running the exact same Bayesian regression with shuffled, uniform, and random priors — they all do just as well.
Meanwhile, if you just ask an LLM directly "what's the regression coefficient for unemployment on inflation?", it gives you a signed, calibrated answer that actually helps.
2. Project Structure
The code is organized in layers, from data at the bottom to experiment scripts at the top. Every layer depends only on the layers below it.
The project uses uv for package management, python-dotenv for API keys (kept in a .env file, not committed), and litellm as a unified interface to embedding and LLM APIs (Gemini, OpenAI, Cohere, etc.). All LLM calls in tests are mocked — the test suite runs without any API keys.
3. Layer 1: Data
3.1 Variable Descriptions
Everything starts with what we're forecasting and what we're using as features. The paper forecasts CPI inflation using five macro variables. Each variable has a FRED series ID, a natural-language description (used by both embedding and LLM approaches), and a data transform.
src/regression_priors/data/descriptions.py

```python
@dataclass
class VariableInfo:
    fred_id: str      # e.g., "CPIAUCSL"
    description: str  # Natural language, used for embeddings AND LLM prompts
    transform: str    # "log", "diff_log", "diff_log_12", or "level"

# The target: Consumer Price Index
TARGET = VariableInfo(
    fred_id="CPIAUCSL",
    description="Consumer price inflation reflecting broad-based price level increases across urban consumption baskets, influenced by monetary policy transmission, labour market dynamics, and aggregate demand-supply balance",
    transform="diff_log_12",  # Year-over-year log growth rate
)

# The five features from Table 1 of the paper
FEATURES = [
    VariableInfo("INDPRO", "Industrial production capturing real economic activity...", "diff_log_12"),
    VariableInfo("UNRATE", "Unemployment rate indicating labor market slack...", "diff_log_12"),
    VariableInfo("FEDFUNDS", "Federal funds effective rate representing monetary policy...", "diff_log_12"),
    VariableInfo("M2SL", "M2 money stock providing insight into monetary expansion...", "diff_log_12"),
    VariableInfo("PCE", "Personal consumption expenditures reflecting demand-side...", "diff_log_12"),
]
```
Every series, the target included, uses the diff_log_12 transform (year-over-year log growth).
3.2 FRED Fetching & Transforms
The fred.py module handles downloading data from the Federal Reserve's FRED API and applying the right transform to each series. The key function is apply_transform:
src/regression_priors/data/fred.py

```python
# FEDFUNDS was 0-0.25% during 2008-2015. Can't take log(0).
_ZERO_FLOOR = 0.01

def apply_transform(series, transform):
    if transform == "diff_log_12":
        # Year-over-year growth rate: log(x_t) - log(x_{t-12})
        safe = series.where(series > 0, other=_ZERO_FLOOR)
        log_series = np.log(safe)
        return log_series - log_series.shift(12)
    ...
```
The fetch_and_preprocess function downloads each series, applies its transform, joins them into a single DataFrame, and drops any rows with missing values (the first 12 months are always NaN because of the 12-month differencing). The result is a clean matrix ready for regression.
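The repository's version of that function isn't reproduced here; a minimal sketch of its likely shape, assuming a hypothetical download_series helper and an import path inferred from the file layout, looks like this:

```python
import pandas as pd

from regression_priors.data.fred import apply_transform  # assumed import path

def fetch_and_preprocess(variables, download_series):
    """variables: list of VariableInfo; download_series(fred_id) -> raw pd.Series."""
    columns = {}
    for var in variables:
        raw = download_series(var.fred_id)                  # one FRED series
        columns[var.fred_id] = apply_transform(raw, var.transform)
    df = pd.DataFrame(columns)                              # align on the shared date index
    return df.dropna()                                      # first 12 months are NaN from diff_log_12
```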
3.3 Extended Variable Set
extended_variables.py defines 24 macroeconomic variables across categories (output, prices, labor, interest rates, money supply, consumer, housing) for our broader tests. Six of these are designated as "target candidates" for the multi-target analysis: CPI, industrial production, unemployment, Fed Funds rate, PCE, and M2.
4. Layer 2: Prior Construction
This is the conceptual heart of the debate. Both approaches produce the same thing — a Prior object with a mean vector and a variance vector — but they get there very differently.
4.1 The Prior Dataclass
A Prior is just a diagonal Gaussian: β ~ N(μ₀, diag(σ₁², ..., σₖ²)). In code:
src/regression_priors/bayesian.py

```python
@dataclass
class Prior:
    """Gaussian prior on regression coefficients: β ~ N(mean, diag(variances))."""
    mean: NDArray[np.float64]       # μ_0: what you think the coefficients are
    variances: NDArray[np.float64]  # diag(Σ_0): how sure you are about each one
```
Two numbers per coefficient: your best guess (mean) and how confident you are (variances). Small variance = strong opinion; large variance = "I don't really know."
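As a concrete (purely illustrative) example, a hand-written prior expressing a confident negative belief about one coefficient and near-agnosticism about another:

```python
import numpy as np
from regression_priors.bayesian import Prior  # assumed import path

# Illustrative numbers only: a strong belief in a negative coefficient
# (small variance) next to a vague one (large variance).
hand_prior = Prior(
    mean=np.array([-0.4, 0.1]),
    variances=np.array([0.3, 2.0]),
)
```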
4.2 Embedding Priors (The Paper's Approach)
The paper's "Semantic Prior Regression" works like this:
1. Embed the target description and each feature description with the LLM embedding model.
2. Compute cosine similarity: sⱼ = cos(e_target, e_feature_j).
3. Prior mean: μⱼ = κ · sⱼ, scaled by a hyperparameter κ.
4. Prior variance: σ²ⱼ = τ², the same for all coefficients.
5. Grid-search over (κ, τ²) to find the best R².
src/regression_priors/priors/embedding.py

```python
class EmbeddingPriorConstructor:
    def __init__(self, kappa=1.0, tau_sq=0.1, model="gemini/gemini-embedding-001"):
        self.kappa = kappa
        self.tau_sq = tau_sq
        self.model = model

    def construct(self, target_description, feature_descriptions):
        # 1. Embed all descriptions in one API call
        all_texts = [target_description] + feature_descriptions
        response = litellm.embedding(model=self.model, input=all_texts)

        # 2. Extract and normalize embedding vectors
        embeddings = np.array([d.embedding for d in response.data])
        target_emb = embeddings[0]
        feature_embs = embeddings[1:]

        # 3. Cosine similarities (dot product of unit vectors)
        target_norm = target_emb / np.linalg.norm(target_emb)
        feature_norms = feature_embs / np.linalg.norm(feature_embs, axis=1, keepdims=True)
        similarities = feature_norms @ target_norm

        # 4. Pack into Prior: mean = κ·sim, variance = τ² (uniform)
        return Prior(
            mean=self.kappa * similarities,
            variances=np.full(len(feature_descriptions), self.tau_sq),
        )
```
Notice two critical properties:
- Cosine similarities between related texts are always positive (typically 0.5–0.76 for our variables). So the prior always predicts positive coefficients. Unemployment's relationship to inflation? Positive, says the embedding. (It should be negative — Phillips curve.)
- All coefficients get the same variance τ². The LLM has no way to express "I'm more certain about this relationship than that one."
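A quick numeric sketch of the first point: because the similarities are all positive and tightly clustered, shuffling them barely moves the prior. The values below are representative of the range quoted above, and κ = 5 is an arbitrary mid-grid choice, not a measured result.

```python
import numpy as np

rng = np.random.default_rng(0)
sims = np.array([0.64, 0.67, 0.64, 0.66, 0.72])   # tightly clustered, all positive
kappa = 5.0

real_mean = kappa * sims
shuffled_mean = kappa * rng.permutation(sims)

# Elementwise gap is at most κ·(max - min) = 5 · 0.08 = 0.4,
# and every entry stays positive, including unemployment's.
print(np.abs(real_mean - shuffled_mean).max())
```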
4.3 LLM-Elicited Priors (Our Approach)
Instead of embedding descriptions and computing geometric similarity, we just ask the LLM what it thinks the regression coefficient is:
src/regression_priors/priors/elicited.py

```python
_SYSTEM_PROMPT = """You are a statistical expert helping construct Bayesian regression priors.
Estimate the regression coefficient for this feature:
- prior_mean: Your best estimate of the standardized regression coefficient.
  Positive means the feature and target move together; negative means opposite.
- prior_variance: Your uncertainty about this estimate. Use 0.1-0.5 for
  well-established relationships, 2.0-5.0 when uncertain.
- reasoning: Brief explanation (1-2 sentences)."""

class CoefficientEstimate(BaseModel):  # Pydantic model for structured output
    reasoning: str
    prior_mean: float
    prior_variance: float
```
Each (target, feature) pair gets its own LLM call. The LLM returns structured JSON with a mean, variance, and reasoning. The construct method loops over features:
src/regression_priors/priors/elicited.py (continued)

```python
class ElicitedPriorConstructor:
    def construct(self, target_description, feature_descriptions):
        estimates = []
        for desc in feature_descriptions:
            estimate = self.elicit_single(target_description, desc)
            estimates.append(estimate)

        means = np.array([e.prior_mean for e in estimates])
        variances = np.array([max(e.prior_variance, 1e-6) for e in estimates])
        return Prior(mean=means, variances=variances)
```
The key advantages over the embedding approach:
- Signed means: The LLM can say "unemployment has a negative effect on inflation" (prior mean = −0.4). Embeddings can't do this.
- Per-coefficient variances: The LLM can say "I'm very sure about the Phillips curve (σ² = 0.3) but unsure about M2's effect (σ² = 2.0)."
- No hyperparameter tuning: No grid search over κ and τ². The LLM's estimates are used directly.
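For illustration, a single elicited estimate for the unemployment feature might look like the following. The numbers echo those quoted above; this is not actual API output.

```python
from regression_priors.priors.elicited import CoefficientEstimate  # assumed import path

# Representative example only, not a real model response.
estimate = CoefficientEstimate(
    reasoning="Phillips curve: more labor-market slack is associated with lower "
              "inflation, so the standardized coefficient should be negative.",
    prior_mean=-0.4,
    prior_variance=0.3,
)
```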
5. Layer 3: Bayesian Regression
This is where the statistics lives. Given a prior and data, compute the posterior and make predictions. The file bayesian.py contains two formulations: the original fixed-σ² model and our added Normal-Inverse-Gamma (NIG) model.
5.1 Fixed-σ² Model
The standard conjugate normal-normal setup:
Prior: β ~ N(μ₀, diag(v₁, ..., vₖ))
Posterior precision: Σ_post⁻¹ = XᵀX / σ² + diag(1/vⱼ)
Posterior mean: β_post = Σ_post (Xᵀy / σ² + μ₀ / v), where μ₀ / v is element-wise division by the prior variances
In code, we use Cholesky decomposition for numerical stability (the posterior precision matrix is always symmetric positive definite, so Cholesky is the right tool):
src/regression_priors/bayesian.py

```python
def bayesian_regression(X, y, prior, sigma_sq=None):
    n, k = X.shape

    # If σ² not given, estimate from OLS residuals
    if sigma_sq is None:
        beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_ols
        sigma_sq = np.sum(residuals**2) / (n - k)

    # Precision contributions from data and prior
    XtX = X.T @ X                            # k×k, data information
    prior_precision = 1.0 / prior.variances  # diagonal prior precision

    # Posterior precision = data precision + prior precision
    posterior_precision = XtX / sigma_sq + np.diag(prior_precision)

    # Solve for posterior mean via Cholesky (numerically stable)
    rhs = X.T @ y / sigma_sq + prior.mean * prior_precision
    cho = cho_factor(posterior_precision)   # LL' = Σ_post^{-1}
    posterior_mean = cho_solve(cho, rhs)    # β_post = Σ_post · rhs

    # Posterior covariance (for uncertainty, not used in prediction)
    posterior_covariance = cho_solve(cho, np.eye(k))
```
Here scipy's cho_factor computes the Cholesky factor L, and cho_solve uses it to solve the resulting linear system.
The key insight: when sigma_sq is None, we estimate it from OLS residuals. This creates a subtle circularity — we use the data to estimate σ², then use σ² to evaluate how well the prior explains the data. This is why we later added the NIG formulation.
5.2 Normal-Inverse-Gamma Model
The NIG model integrates out both β and σ², eliminating the circularity. The prior is now a joint distribution:
Prior on coefficients: β | σ² ~ N(μ₀, σ² · diag(v₁, ..., vₖ))
Prior on noise variance: σ² ~ Inv-Gamma(α₀, β₀)
Note: the coefficient prior variance scales with σ². This is what makes it conjugate.
After observing the data, the posterior updates are:
Σₙ⁻¹ = diag(1/v) + XᵀX
μₙ = Σₙ (diag(1/v) · μ₀ + Xᵀy)
αₙ = α₀ + n/2
βₙ = β₀ + ½ (yᵀy + μ₀ᵀ diag(1/v) μ₀ − μₙᵀ Σₙ⁻¹ μₙ)
src/regression_priors/bayesian.py

```python
def bayesian_regression_nig(X, y, prior, alpha_0=1.0, beta_0=1.0):
    n, k = X.shape
    prior_precision = 1.0 / prior.variances

    # Posterior precision for β (no σ² — it cancels in NIG)
    posterior_precision = np.diag(prior_precision) + X.T @ X

    # Posterior mean via Cholesky
    rhs = prior_precision * prior.mean + X.T @ y
    cho = cho_factor(posterior_precision)
    posterior_mean = cho_solve(cho, rhs)

    # Posterior Inverse-Gamma parameters
    alpha_n = alpha_0 + n / 2.0
    beta_n = beta_0 + 0.5 * (
        np.dot(y, y)
        + np.dot(prior.mean * prior_precision, prior.mean)
        - np.dot(posterior_mean, posterior_precision @ posterior_mean)
    )
```
5.3 Log Marginal Likelihood
The log marginal likelihood is our key model comparison metric. It answers: "how probable is the observed data, given this prior?" — integrating over all possible coefficient values.
For the fixed-σ² model, integrating out β leaves a Gaussian marginal: log p(y) = log N(y; Xμ₀, σ²Iₙ + X diag(v) Xᵀ).
For the NIG model (integrating out both β and σ²): log p(y) = −(n/2) log(2π) + ½ log det Λ₀ − ½ log det Λₙ + α₀ log β₀ − αₙ log βₙ + log Γ(αₙ) − log Γ(α₀), with Λ₀ = diag(1/v) and Λₙ = Λ₀ + XᵀX.
A higher log ML means the prior does a better job of explaining the data. If embedding priors carry real information, their log ML should beat shuffled priors. (Spoiler: it doesn't.)
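For the fixed-σ² case this is just a Gaussian log density in y, so a sketch of the computation looks like the following. The helper here is hypothetical, not the repository's function.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_ml_fixed_sigma(X, y, prior, sigma_sq):
    """log p(y) under the marginal y ~ N(Xμ₀, σ²I + X diag(v) Xᵀ)."""
    n = X.shape[0]
    mean = X @ prior.mean
    cov = sigma_sq * np.eye(n) + X @ np.diag(prior.variances) @ X.T
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)
```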
6. Layer 4: Experiment Pipeline
6.1 Rolling Window Forecast
This is Algorithm 2 from the paper: a true out-of-sample forecasting evaluation. At each time step t, we train on a window of past data and predict the next observation.
src/regression_priors/experiment.py

```python
def rolling_window_forecast(df, prior, sigma_sq, window_size=None,
                            min_train_size=12, use_nig=False, ...):
    # For each time step t:
    for t in range(start, n):
        # 1. Grab the training window (last `window_size` observations)
        X_train = data[t - window_size : t]    # includes intercept column
        y_train = target[t - window_size : t]

        # 2. Fit: Bayesian regression (or OLS if no prior)
        if prior is not None:
            # Add an uninformative prior for the intercept (variance = 1e10)
            full_prior = Prior(
                mean=np.concatenate([[0.0], prior.mean]),
                variances=np.concatenate([[1e10], prior.variances]),
            )
            result = bayesian_regression(X_train, y_train, full_prior, sigma_sq)
            beta = result.posterior_mean
            total_log_ml += result.log_marginal_likelihood

            # Optionally also compute NIG log ML for robustness
            if use_nig:
                result_nig = bayesian_regression_nig(X_train, y_train, full_prior)
                total_log_ml_nig += result_nig.log_marginal_likelihood
        else:
            beta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]  # plain OLS

        # 3. Predict the next observation
        y_pred = x_test @ beta

    return y_true_all, y_pred_all, total_log_ml, total_log_ml_nig
```
The function returns four things: true values, predicted values, the accumulated fixed-σ² log ML, and the accumulated NIG log ML. The first two give us R²; the latter two give us model evidence.
6.2 Evaluation Metrics
Three metrics, each measuring something different:
src/regression_priors/evaluation.py

```python
def r_squared(y_true, y_pred):
    """Classic R²: 1 - SS_res / SS_tot. Higher is better."""
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - y_true.mean())**2)
    return 1.0 - ss_res / ss_tot

def bic(log_ml, n, k):
    """BIC = k·log(n) - 2·log_ml. Lower is better."""
    return k * np.log(n) - 2.0 * log_ml

def directional_accuracy(prior_signs, true_signs):
    """What fraction of prior signs match OLS coefficient signs?"""
    return np.mean(np.sign(prior_signs) == np.sign(true_signs))
```
7. Layer 5: Run Scripts
Each run_*.py file is a self-contained experiment that imports from the layers below, runs analysis, prints results, and saves CSVs. They all follow the same pattern: load data, construct priors, run rolling window forecasts, compare.
7.1 run_experiment.py — Grid Search Replication
Reproduces the paper's main result: grid search over κ ∈ {2,...,10} and τ² ∈ {0.05,...,0.25} (81 points). For each (κ, τ²) pair, runs rolling window forecast with the embedding prior and reports R². Also runs our LLM-elicited approach for comparison.
This is the entry point that showed us the paper's R² improvement is real (if tiny) but doesn't tell us why.
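A sketch of the grid-search loop follows. The exact τ² spacing beyond the stated range, and the import paths, are assumptions inferred from the file layout; df and similarities are the preprocessed data and measured cosine similarities.

```python
import numpy as np

from regression_priors.bayesian import Prior                    # assumed import paths
from regression_priors.experiment import rolling_window_forecast
from regression_priors.evaluation import r_squared

def grid_search(df, similarities):
    """Return the best (κ, τ², R²) over the 81-point grid."""
    kappas = np.arange(2, 11)                 # κ ∈ {2, ..., 10}
    tau_sqs = np.linspace(0.05, 0.25, 9)      # assumed 9-point τ² grid → 81 combos
    results = []
    for kappa in kappas:
        for tau_sq in tau_sqs:
            prior = Prior(mean=kappa * similarities,
                          variances=np.full(len(similarities), tau_sq))
            y_true, y_pred, _, _ = rolling_window_forecast(df, prior=prior, sigma_sq=None)
            results.append((kappa, tau_sq, r_squared(y_true, y_pred)))
    return max(results, key=lambda r: r[2])
```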
7.2 run_ablation.py — Prior Ablation Study
The core falsification test. For each of 6 target variables, runs 8 different prior conditions:
| Condition | Prior mean | Tests what? |
|---|---|---|
| OLS | No prior | Baseline |
| Embedding SPR | κ · cos_sim | Real embedding prior |
| Shuffled (50 trials) | Random permutation of similarities | Same values, wrong assignment |
| Uniform | Mean similarity for all | Pure shrinkage, no differentiation |
| Random (50 trials) | Drawn from N(μ, σ²) of similarities | Same distribution, random values |
| Inverted | Rank-reversed similarities | Should hurt if info exists |
| Negated | −cos_sim | Tests sign sensitivity |
| LLM-Elicited | Direct LLM estimates | Our approach |
The punchline: if shuffled, uniform, and random priors all produce the same R² as the real embedding prior, then the similarities carry no useful information — the improvement is just regularization.
run_ablation.py (simplified)

```python
def eval_prior(mean_values):
    """Run rolling window forecast with a given prior mean vector."""
    prior = Prior(mean=kappa * mean_values,
                  variances=np.full(n_features, tau_sq))
    _, y_pred, log_ml, log_ml_nig = rolling_window_forecast(
        df, prior=prior, sigma_sq=None, use_nig=True)
    return r_squared(y_true, y_pred), log_ml, log_ml_nig

# Real embedding prior
r2_real, log_ml_real, log_ml_nig_real = eval_prior(similarities)

# Shuffled: 50 random permutations of the same similarity values
for seed in range(50):
    shuffled_sims = rng.permutation(similarities)
    r2_s, log_ml_s, log_ml_nig_s = eval_prior(shuffled_sims)
    ...

# Uniform: all features get the mean similarity
uniform_sims = np.full(n_features, similarities.mean())
r2_uniform, log_ml_uniform, log_ml_nig_uniform = eval_prior(uniform_sims)
```
7.3 run_grid_ablation.py — Grid-Search Ablation
The paper's exact setup (CPIAUCSL, 5 features, expanding window) but with shuffled and uniform controls at every single grid point. This is the 81-point grid × 50 shuffled trials = 4,050 additional rolling window experiments.
This is the most computationally expensive test but also the most definitive. If the embedding prior can't beat shuffled at any (κ, τ²) combination, the information content is zero.
7.4 run_small_data.py — Window-Size Analysis
Tests how each approach performs as the training window shrinks (where priors should matter most). Runs window sizes of 24, 36, 48, 60, and 120 months, plus an expanding window, for both p=5 and p=10 features.
run_small_data.py (core loop)

```python
WINDOW_SIZES = [24, 36, 48, 60, 120, None]  # None = expanding

for win_size in WINDOW_SIZES:
    # OLS, Embedding SPR, 50 Shuffled trials, LLM-Elicited
    # ... all run at this window size ...
```
7.5 run_antonym_test.py — Antonym Test
A direct test of whether each approach can distinguish opposite economic relationships. We construct 10 pairs of descriptions that describe opposite scenarios (e.g., "GDP growing" vs "GDP contracting") and test whether each method assigns opposite signs.
src/regression_priors/evaluation_extended.py

```python
ANTONYM_PAIRS = [
    ("Economic growth accelerating with rising GDP...",
     "Economic recession with falling GDP...", "opposite"),
    ("Inflation rising due to increased consumer demand...",
     "Inflation falling due to decreased consumer demand...", "opposite"),
    # ... 8 more pairs ...
]
```
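A sketch of how a single pair might be scored for the embedding side; the embed helper and the scoring convention are assumptions, and the real harness in evaluation_extended.py may differ:

```python
import numpy as np

def embedding_distinguishes(target_text, text_a, text_b, embed):
    """embed(texts) -> array of embedding vectors (hypothetical helper)."""
    vecs = np.asarray(embed([target_text, text_a, text_b]), dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim_a, sim_b = unit[1] @ unit[0], unit[2] @ unit[0]
    # Opposite scenarios should land on opposite sides of zero; in practice
    # both similarities come out positive, so this returns False.
    return np.sign(sim_a) != np.sign(sim_b)
```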
Result: Embeddings correctly distinguish direction on 0 of 10 pairs (both descriptions get positive similarity to the target). The LLM correctly distinguishes 9 of 10 pairs.
7.6 run_sign_accuracy.py — Large-Scale Sign Accuracy
For 6 targets × 23 features = 138 (target, feature) pairs, checks whether each method predicts the correct sign of the OLS coefficient. Embeddings always predict positive (cosine similarity is always positive), so they get every negative coefficient wrong.
| Metric | Embedding | LLM-Elicited |
|---|---|---|
| Overall sign accuracy | 63.8% | 69.6% |
| Accuracy on negative coefficients | 0% | 52% |
| Spearman ρ with OLS | 0.187 | 0.562 |
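The table's three metrics reduce to a few lines of numpy/scipy. A sketch, assuming prior_means and ols_coefs are aligned arrays over the 138 pairs:

```python
import numpy as np
from scipy.stats import spearmanr

def sign_and_rank_metrics(prior_means, ols_coefs):
    match = np.sign(prior_means) == np.sign(ols_coefs)
    overall = match.mean()                       # overall sign accuracy
    negative = ols_coefs < 0
    neg_accuracy = match[negative].mean()        # accuracy on negative coefficients
    rho, _ = spearmanr(prior_means, ols_coefs)   # rank agreement with OLS
    return overall, neg_accuracy, rho
```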
7.7 run_cross_model_embeddings.py — Cross-Model Comparison
Tests whether the tight clustering of cosine similarities is specific to one embedding model or is a universal property. Runs 6 different models (Gemini, OpenAI large/small, Voyage, Cohere, Mistral) through the same tests.
Result: All embedding models produce tightly clustered, all-positive similarities. The inability to distinguish directional relationships is a fundamental limitation of embedding geometry for this task, not a model-specific issue.
8. Layer 6: Test Suite
The test suite has 50 tests across 5 files. Every LLM and API call is mocked — the tests verify pure math and logic. This was written test-first (TDD).
Test Categories
test_bayesian.py (19 tests)
The mathematical core. Tests things like:
- Uninformative prior → OLS: With variance = 10¹², the posterior mean should equal the OLS estimate.
- Very noisy data → prior dominates: With σ² = 10¹², the posterior mean should equal the prior mean.
- Known analytic case: One observation (y=3), one predictor (x=1), prior N(0, 4), σ²=1. Posterior mean should be exactly 2.4, posterior variance exactly 0.8.
- Log ML ranking: A tight prior centered at the truth should have higher log ML than a tight prior centered far from the truth.
- NIG convergence: When the Inverse-Gamma prior on σ² is very precise (α0 = 100,000), the NIG log ML should approximately match the fixed-σ² log ML.
tests/test_bayesian.py

```python
def test_known_analytic_solution():
    """Verify against hand-computed posterior for simple case.

    Prior: β ~ N(0, τ²=4)
    Data: y=3, x=1, σ²=1
    Posterior mean = τ²x / (σ² + τ²x²) · y = 4/(1+4) · 3 = 2.4
    Posterior var  = σ²τ² / (σ² + τ²x²) = 4/5 = 0.8
    """
    X = np.array([[1.0]])
    y = np.array([3.0])
    prior = Prior(mean=np.array([0.0]), variances=np.array([4.0]))

    result = bayesian_regression(X, y, prior, sigma_sq=1.0)

    assert_allclose(result.posterior_mean, [2.4], atol=1e-10)
    assert_allclose(result.posterior_covariance, [[0.8]], atol=1e-10)
```
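As a second example, the uninformative-prior check from the first bullet might look roughly like this; the test name, synthetic data, and import paths are illustrative, not the repository's exact code:

```python
import numpy as np
from numpy.testing import assert_allclose
from regression_priors.bayesian import Prior, bayesian_regression  # assumed import path

def test_uninformative_prior_matches_ols():
    """With huge prior variance, the posterior mean should collapse to OLS."""
    rng = np.random.default_rng(42)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

    prior = Prior(mean=np.zeros(3), variances=np.full(3, 1e12))
    result = bayesian_regression(X, y, prior, sigma_sq=1.0)

    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    assert_allclose(result.posterior_mean, beta_ols, atol=1e-6)
```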
test_priors_embedding.py (6 tests)
Tests the embedding prior constructor with mocked embedding responses. Verifies that κ scales correctly, orthogonal embeddings give zero similarity, and (critically) that antonym pairs get similar embeddings.
test_priors_elicited.py (7 tests)
Tests the LLM-elicited prior constructor with mocked LLM responses. Verifies structured JSON parsing, one-call-per-feature behavior, negative variance clamping, and that antonym pairs get opposite signs.
test_data_fred.py (9 tests)
Tests data transforms (log, diff_log, diff_log_12) and FRED fetching with mocked API. Verifies zero-flooring for FEDFUNDS, NaN handling, and date alignment.
test_evaluation.py (9 tests)
Tests R² against known values, directional accuracy, and BIC formula.
9. Putting It All Together
Here's the full data flow for a single experiment run, showing how every piece connects:
1. descriptions.py defines CPIAUCSL as the target with 5 features.
2. fred.py downloads monthly data from 1959–2025 and applies the diff_log_12 transform.
3. Two prior paths:
   - (3a) embedding.py embeds all descriptions and computes cosine similarities [0.64, 0.67, 0.64, 0.66, 0.72].
   - (3b) elicited.py asks Gemini 3.0 Pro for coefficient estimates [+0.35, −0.40, −0.40, +0.45, +0.45].
4. Both produce a Prior(mean=..., variances=...).
5. experiment.py runs the rolling window forecast: for each month, train on the past window, predict the next month.
6. At each window, bayesian.py computes the posterior mean and log marginal likelihood (fixed-σ² and NIG).
7. evaluation.py computes R² on the collected predictions.
8. A run_*.py script orchestrates the comparison: OLS vs embedding vs shuffled vs elicited.
The ablation scripts add a twist at step 3a: instead of using the real similarities, they permute them (shuffled), average them (uniform), draw random values (random), reverse-rank them (inverted), or negate them (negated). If the real similarities carry information, these corrupted versions should perform worse. They don't.
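Concretely, the corrupted prior-mean vectors are simple transformations of the similarity vector. A sketch of how they could be generated (the helper name and exact constructions are ours, not necessarily the repository's; the example similarities are the representative values from step 3a):

```python
import numpy as np

def corrupted_means(sims, rng):
    """Build the ablation conditions from the real cosine similarities."""
    inverted = np.empty_like(sims)
    inverted[np.argsort(sims)] = np.sort(sims)[::-1]      # largest value where the smallest was
    return {
        "shuffled": rng.permutation(sims),                # same values, wrong features
        "uniform": np.full_like(sims, sims.mean()),       # pure shrinkage, no differentiation
        "random": rng.normal(sims.mean(), sims.std(), size=sims.shape),  # same distribution
        "inverted": inverted,                             # rank-reversed
        "negated": -sims,                                 # flipped signs
    }

conditions = corrupted_means(np.array([0.64, 0.67, 0.64, 0.66, 0.72]),
                             np.random.default_rng(0))
```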
The Verdict
| Test | Embedding Priors | LLM-Elicited Priors |
|---|---|---|
| R² vs shuffled (81 grid points) | Mean Δ = +0.0002 (noise) | Beats embedding 81/81 |
| Sign accuracy (138 pairs) | 63.8% (0% on negatives) | 69.6% (52% on negatives) |
| Antonym test (10 pairs) | 0/10 correct | 9/10 correct |
| Spearman ρ with OLS | 0.187 | 0.562 |
| Small windows (≤60 months) | Destroys R² (−0.06 to −0.18) | Matches or beats OLS |
| NIG model evidence | ≈ shuffled | Orders of magnitude higher |
Embedding cosine similarities are a geometric property of text — they measure topical relatedness, not directional economic relationships. They're always positive, tightly clustered, and carry approximately zero information about regression coefficients beyond generic regularization. Asking the LLM directly is simpler, cheaper, and actually works.