Literate Code Walkthrough
A statistical reader's guide to the regression-priors codebase — critiquing "LLM Embedding for Regression Priors" (Li et al., ICAIF '25)
1. The Big Picture
This codebase does one thing: it takes a published paper's claim — that embedding cosine similarities make good Bayesian regression priors — and subjects it to a battery of falsification tests. Along the way, it implements an alternative (LLM-elicited priors) and shows it works better.
The paper under scrutiny is Li et al.'s "Semantic Prior Regression" (SPR). Their idea:
- Embed the target variable description (e.g., "consumer price inflation...") and each feature description (e.g., "unemployment rate...") into a vector space using an LLM embedding model.
- Compute cosine similarity between each feature embedding and the target embedding.
- Use those similarities as the prior mean in a Bayesian linear regression: βⱼ ~ N(κ · simⱼ, τ²).
- Show R² improves over OLS on a grid search over κ and τ².
Our critique: the "improvement" comes from regularization (shrinking coefficients toward anything), not from the informative content of the similarities. We show this by running the exact same Bayesian regression with shuffled, uniform, and random priors — they all do just as well.
Meanwhile, if you just ask an LLM directly "what's the regression coefficient for unemployment on inflation?", it gives you a signed, calibrated answer that actually helps.
2. Project Structure
The code is organized in layers, from data at the bottom to experiment scripts at the top. Every layer depends only on the layers below it.
The project uses uv for package management, python-dotenv for API keys (kept in a .env file, not committed), and litellm as a unified interface to embedding and LLM APIs (Gemini, OpenAI, Cohere, etc.). All LLM calls in tests are mocked — the test suite runs without any API keys.
3. Layer 1: Data
3.1 Variable Descriptions
Everything starts with what we're forecasting and what we're using as features. The paper forecasts CPI inflation using five macro variables. Each variable has a FRED series ID, a natural-language description (used by both embedding and LLM approaches), and a data transform.
src/regression_priors/data/descriptions.py

```python
@dataclass
class VariableInfo:
    fred_id: str      # e.g., "CPIAUCSL"
    description: str  # Natural language, used for embeddings AND LLM prompts
    transform: str    # "log", "diff_log", "diff_log_12", or "level"

# The target: Consumer Price Index
TARGET = VariableInfo(
    fred_id="CPIAUCSL",
    description="Consumer price inflation reflecting broad-based price level increases across urban consumption baskets, influenced by monetary policy transmission, labour market dynamics, and aggregate demand-supply balance",
    transform="diff_log_12",  # Year-over-year log growth rate
)

# The five features from Table 1 of the paper
FEATURES = [
    VariableInfo("INDPRO", "Industrial production capturing real economic activity...", "diff_log_12"),
    VariableInfo("UNRATE", "Unemployment rate indicating labor market slack...", "diff_log_12"),
    VariableInfo("FEDFUNDS", "Federal funds effective rate representing monetary policy...", "diff_log_12"),
    VariableInfo("M2SL", "M2 money stock providing insight into monetary expansion...", "diff_log_12"),
    VariableInfo("PCE", "Personal consumption expenditures reflecting demand-side...", "diff_log_12"),
]
```
Every series, the target included, uses the diff_log_12 transform (year-over-year log growth).
3.2 FRED Fetching & Transforms
The fred.py module handles downloading data from the Federal Reserve's FRED API and applying the right transform to each series. The key function is apply_transform:
src/regression_priors/data/fred.py

```python
# FEDFUNDS was 0-0.25% during 2008-2015. Can't take log(0).
_ZERO_FLOOR = 0.01

def apply_transform(series, transform):
    if transform == "diff_log_12":
        # Year-over-year growth rate: log(x_t) - log(x_{t-12})
        safe = series.where(series > 0, other=_ZERO_FLOOR)
        log_series = np.log(safe)
        return log_series - log_series.shift(12)
    ...
```
The fetch_and_preprocess function downloads each series, applies its transform, joins them into a single DataFrame, and drops any rows with missing values (the first 12 months are always NaN because of the 12-month differencing). The result is a clean matrix ready for regression.
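The repository's version of that function isn't reproduced here; a minimal sketch of its likely shape, assuming a hypothetical download_series helper and an import path inferred from the file layout, looks like this:

```python
import pandas as pd

from regression_priors.data.fred import apply_transform  # assumed import path

def fetch_and_preprocess(variables, download_series):
    """variables: list of VariableInfo; download_series(fred_id) -> raw pd.Series."""
    columns = {}
    for var in variables:
        raw = download_series(var.fred_id)                  # one FRED series
        columns[var.fred_id] = apply_transform(raw, var.transform)
    df = pd.DataFrame(columns)                              # align on the shared date index
    return df.dropna()                                      # first 12 months are NaN from diff_log_12
```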
3.3 Extended Variable Set
extended_variables.py defines 24 macroeconomic variables across categories (output, prices, labor, interest rates, money supply, consumer, housing) for our broader tests. Six of these are designated as "target candidates" for the multi-target analysis: CPI, industrial production, unemployment, Fed Funds rate, PCE, and M2.
4. Layer 2: Prior Construction
This is the conceptual heart of the debate. Both approaches produce the same thing — a Prior object with a mean vector and a variance vector — but they get there very differently.
4.1 The Prior Dataclass
A Prior is just a diagonal Gaussian: β ~ N(μ₀, diag(σ₁², ..., σₖ²)). In code:
src/regression_priors/bayesian.py

```python
@dataclass
class Prior:
    """Gaussian prior on regression coefficients: β ~ N(mean, diag(variances))."""
    mean: NDArray[np.float64]       # μ_0: what you think the coefficients are
    variances: NDArray[np.float64]  # diag(Σ_0): how sure you are about each one
```
Two numbers per coefficient: your best guess (mean) and how confident you are (variances). Small variance = strong opinion; large variance = "I don't really know."
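As a concrete (purely illustrative) example, a hand-written prior expressing a confident negative belief about one coefficient and near-agnosticism about another:

```python
import numpy as np
from regression_priors.bayesian import Prior  # assumed import path

# Illustrative numbers only: a strong belief in a negative coefficient
# (small variance) next to a vague one (large variance).
hand_prior = Prior(
    mean=np.array([-0.4, 0.1]),
    variances=np.array([0.3, 2.0]),
)
```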
4.2 Embedding Priors (The Paper's Approach)
The paper's "Semantic Prior Regression" works like this:
1. Embed the target description and each feature description with the LLM embedding model.
2. Compute cosine similarity: sⱼ = cos(e_target, e_feature_j).
3. Prior mean: μⱼ = κ · sⱼ, scaled by a hyperparameter κ.
4. Prior variance: σ²ⱼ = τ², the same for all coefficients.
5. Grid-search over (κ, τ²) to find the best R².
src/regression_priors/priors/embedding.py

```python
class EmbeddingPriorConstructor:
    def __init__(self, kappa=1.0, tau_sq=0.1, model="gemini/gemini-embedding-001"):
        self.kappa = kappa
        self.tau_sq = tau_sq
        self.model = model

    def construct(self, target_description, feature_descriptions):
        # 1. Embed all descriptions in one API call
        all_texts = [target_description] + feature_descriptions
        response = litellm.embedding(model=self.model, input=all_texts)

        # 2. Extract and normalize embedding vectors
        embeddings = np.array([d.embedding for d in response.data])
        target_emb = embeddings[0]
        feature_embs = embeddings[1:]

        # 3. Cosine similarities (dot product of unit vectors)
        target_norm = target_emb / np.linalg.norm(target_emb)
        feature_norms = feature_embs / np.linalg.norm(feature_embs, axis=1, keepdims=True)
        similarities = feature_norms @ target_norm

        # 4. Pack into Prior: mean = κ·sim, variance = τ² (uniform)
        return Prior(
            mean=self.kappa * similarities,
            variances=np.full(len(feature_descriptions), self.tau_sq),
        )
```
Notice two critical properties:
- Cosine similarities between related texts are always positive (typically 0.5–0.76 for our variables). So the prior always predicts positive coefficients. Unemployment's relationship to inflation? Positive, says the embedding. (It should be negative — Phillips curve.)
- All coefficients get the same variance τ². The LLM has no way to express "I'm more certain about this relationship than that one."
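A quick numeric sketch of the first point: because the similarities are all positive and tightly clustered, shuffling them barely moves the prior. The values below are representative of the range quoted above, and κ = 5 is an arbitrary mid-grid choice, not a measured result.

```python
import numpy as np

rng = np.random.default_rng(0)
sims = np.array([0.64, 0.67, 0.64, 0.66, 0.72])   # tightly clustered, all positive
kappa = 5.0

real_mean = kappa * sims
shuffled_mean = kappa * rng.permutation(sims)

# Elementwise gap is at most κ·(max - min) = 5 · 0.08 = 0.4,
# and every entry stays positive, including unemployment's.
print(np.abs(real_mean - shuffled_mean).max())
```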
4.3 LLM-Elicited Priors (Our Approach)
Instead of embedding descriptions and computing geometric similarity, we just ask the LLM what it thinks the regression coefficient is:
src/regression_priors/priors/elicited.py

```python
_SYSTEM_PROMPT = """You are a statistical expert helping construct Bayesian regression priors.
Estimate the regression coefficient for this feature:
- prior_mean: Your best estimate of the standardized regression coefficient.
  Positive means the feature and target move together; negative means opposite.
- prior_variance: Your uncertainty about this estimate. Use 0.1-0.5 for
  well-established relationships, 2.0-5.0 when uncertain.
- reasoning: Brief explanation (1-2 sentences)."""

class CoefficientEstimate(BaseModel):  # Pydantic model for structured output
    reasoning: str
    prior_mean: float
    prior_variance: float
```
Each (target, feature) pair gets its own LLM call. The LLM returns structured JSON with a mean, variance, and reasoning. The construct method loops over features:
src/regression_priors/priors/elicited.py (continued)

```python
class ElicitedPriorConstructor:
    def construct(self, target_description, feature_descriptions):
        estimates = []
        for desc in feature_descriptions:
            estimate = self.elicit_single(target_description, desc)
            estimates.append(estimate)

        means = np.array([e.prior_mean for e in estimates])
        variances = np.array([max(e.prior_variance, 1e-6) for e in estimates])
        return Prior(mean=means, variances=variances)
```
The key advantages over the embedding approach:
- Signed means: The LLM can say "unemployment has a negative effect on inflation" (prior mean = −0.4). Embeddings can't do this.
- Per-coefficient variances: The LLM can say "I'm very sure about the Phillips curve (σ² = 0.3) but unsure about M2's effect (σ² = 2.0)."
- No hyperparameter tuning: No grid search over κ and τ². The LLM's estimates are used directly.
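For illustration, a single elicited estimate for the unemployment feature might look like the following. The numbers echo those quoted above; this is not actual API output.

```python
from regression_priors.priors.elicited import CoefficientEstimate  # assumed import path

# Representative example only, not a real model response.
estimate = CoefficientEstimate(
    reasoning="Phillips curve: more labor-market slack is associated with lower "
              "inflation, so the standardized coefficient should be negative.",
    prior_mean=-0.4,
    prior_variance=0.3,
)
```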
5. Layer 3: Bayesian Regression
This is where the statistics lives. Given a prior and data, compute the posterior and make predictions. The file bayesian.py contains two formulations: the original fixed-σ² model and our added Normal-Inverse-Gamma (NIG) model.
5.1 Fixed-σ² Model
The standard conjugate normal-normal setup:
Prior: β ~ N(μ₀, diag(v₁, ..., vₖ))
Posterior precision: Σ_post⁻¹ = XᵀX / σ² + diag(1/vⱼ)
Posterior mean: β_post = Σ_post (Xᵀy / σ² + μ₀ / v), where μ₀ / v is element-wise division by the prior variances
In code, we use Cholesky decomposition for numerical stability (the posterior precision matrix is always symmetric positive definite, so Cholesky is the right tool):
src/regression_priors/bayesian.py

```python
def bayesian_regression(X, y, prior, sigma_sq=None):
    n, k = X.shape

    # If σ² not given, estimate from OLS residuals
    if sigma_sq is None:
        beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_ols
        sigma_sq = np.sum(residuals**2) / (n - k)

    # Precision contributions from data and prior
    XtX = X.T @ X                            # k×k, data information
    prior_precision = 1.0 / prior.variances  # diagonal prior precision

    # Posterior precision = data precision + prior precision
    posterior_precision = XtX / sigma_sq + np.diag(prior_precision)

    # Solve for posterior mean via Cholesky (numerically stable)
    rhs = X.T @ y / sigma_sq + prior.mean * prior_precision
    cho = cho_factor(posterior_precision)   # LL' = Σ_post^{-1}
    posterior_mean = cho_solve(cho, rhs)    # β_post = Σ_post · rhs

    # Posterior covariance (for uncertainty, not used in prediction)
    posterior_covariance = cho_solve(cho, np.eye(k))
```
Here scipy's cho_factor computes the Cholesky factor L, and cho_solve uses it to solve the resulting linear system.
The key insight: when sigma_sq is None, we estimate it from OLS residuals. This creates a subtle circularity — we use the data to estimate σ², then use σ² to evaluate how well the prior explains the data. This is why we later added the NIG formulation.
5.2 Normal-Inverse-Gamma Model
The NIG model integrates out both β and σ², eliminating the circularity. The prior is now a joint distribution:
Prior on coefficients: β | σ² ~ N(μ₀, σ² · diag(v₁, ..., vₖ))
Prior on noise variance: σ² ~ Inv-Gamma(α₀, β₀)
Note: the coefficient prior variance scales with σ². This is what makes it conjugate.
After observing the data, the posterior updates are:
Σₙ⁻¹ = diag(1/v) + XᵀX
μₙ = Σₙ (diag(1/v) · μ₀ + Xᵀy)
αₙ = α₀ + n/2
βₙ = β₀ + ½ (yᵀy + μ₀ᵀ diag(1/v) μ₀ − μₙᵀ Σₙ⁻¹ μₙ)
src/regression_priors/bayesian.py

```python
def bayesian_regression_nig(X, y, prior, alpha_0=1.0, beta_0=1.0):
    n, k = X.shape
    prior_precision = 1.0 / prior.variances

    # Posterior precision for β (no σ² — it cancels in NIG)
    posterior_precision = np.diag(prior_precision) + X.T @ X

    # Posterior mean via Cholesky
    rhs = prior_precision * prior.mean + X.T @ y
    cho = cho_factor(posterior_precision)
    posterior_mean = cho_solve(cho, rhs)

    # Posterior Inverse-Gamma parameters
    alpha_n = alpha_0 + n / 2.0
    beta_n = beta_0 + 0.5 * (
        np.dot(y, y)
        + np.dot(prior.mean * prior_precision, prior.mean)
        - np.dot(posterior_mean, posterior_precision @ posterior_mean)
    )
```
5.3 Log Marginal Likelihood
The log marginal likelihood is our key model comparison metric. It answers: "how probable is the observed data, given this prior?" — integrating over all possible coefficient values.
For the fixed-σ² model, integrating out β leaves a Gaussian marginal: log p(y) = log N(y; Xμ₀, σ²Iₙ + X diag(v) Xᵀ).
For the NIG model (integrating out both β and σ²): log p(y) = −(n/2) log(2π) + ½ log det Λ₀ − ½ log det Λₙ + α₀ log β₀ − αₙ log βₙ + log Γ(αₙ) − log Γ(α₀), with Λ₀ = diag(1/v) and Λₙ = Λ₀ + XᵀX.
A higher log ML means the prior does a better job of explaining the data. If embedding priors carry real information, their log ML should beat shuffled priors. (Spoiler: it doesn't.)
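For the fixed-σ² case this is just a Gaussian log density in y, so a sketch of the computation looks like the following. The helper here is hypothetical, not the repository's function.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_ml_fixed_sigma(X, y, prior, sigma_sq):
    """log p(y) under the marginal y ~ N(Xμ₀, σ²I + X diag(v) Xᵀ)."""
    n = X.shape[0]
    mean = X @ prior.mean
    cov = sigma_sq * np.eye(n) + X @ np.diag(prior.variances) @ X.T
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)
```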
6. Layer 4: Experiment Pipeline
6.1 Rolling Window Forecast
This is Algorithm 2 from the paper: a true out-of-sample forecasting evaluation. At each time step t, we train on a window of past data and predict the next observation.
src/regression_priors/experiment.py

```python
def rolling_window_forecast(df, prior, sigma_sq, window_size=None,
                            min_train_size=12, use_nig=False, ...):
    # For each time step t:
    for t in range(start, n):
        # 1. Grab the training window (last `window_size` observations)
        X_train = data[t - window_size : t]    # includes intercept column
        y_train = target[t - window_size : t]

        # 2. Fit: Bayesian regression (or OLS if no prior)
        if prior is not None:
            # Add an uninformative prior for the intercept (variance = 1e10)
            full_prior = Prior(
                mean=np.concatenate([[0.0], prior.mean]),
                variances=np.concatenate([[1e10], prior.variances]),
            )
            result = bayesian_regression(X_train, y_train, full_prior, sigma_sq)
            beta = result.posterior_mean
            total_log_ml += result.log_marginal_likelihood

            # Optionally also compute NIG log ML for robustness
            if use_nig:
                result_nig = bayesian_regression_nig(X_train, y_train, full_prior)
                total_log_ml_nig += result_nig.log_marginal_likelihood
        else:
            beta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]  # plain OLS

        # 3. Predict the next observation
        y_pred = x_test @ beta

    return y_true_all, y_pred_all, total_log_ml, total_log_ml_nig
```
The function returns four things: true values, predicted values, the accumulated fixed-σ² log ML, and the accumulated NIG log ML. The first two give us R²; the latter two give us model evidence.
6.2 Evaluation Metrics
Three metrics, each measuring something different:
src/regression_priors/evaluation.py

```python
def r_squared(y_true, y_pred):
    """Classic R²: 1 - SS_res / SS_tot. Higher is better."""
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - y_true.mean())**2)
    return 1.0 - ss_res / ss_tot

def bic(log_ml, n, k):
    """BIC = k·log(n) - 2·log_ml. Lower is better."""
    return k * np.log(n) - 2.0 * log_ml

def directional_accuracy(prior_signs, true_signs):
    """What fraction of prior signs match OLS coefficient signs?"""
    return np.mean(np.sign(prior_signs) == np.sign(true_signs))
```
7. Layer 5: Run Scripts
Each run_*.py file is a self-contained experiment that imports from the layers below, runs analysis, prints results, and saves CSVs. They all follow the same pattern: load data, construct priors, run rolling window forecasts, compare.
7.1 run_experiment.py — Grid Search Replication
Reproduces the paper's main result: grid search over κ ∈ {2,...,10} and τ² ∈ {0.05,...,0.25} (81 points). For each (κ, τ²) pair, runs rolling window forecast with the embedding prior and reports R². Also runs our LLM-elicited approach for comparison.
This is the entry point that showed us the paper's R² improvement is real (if tiny) but doesn't tell us why.
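A sketch of the grid-search loop follows. The exact τ² spacing beyond the stated range, and the import paths, are assumptions inferred from the file layout; df and similarities are the preprocessed data and measured cosine similarities.

```python
import numpy as np

from regression_priors.bayesian import Prior                    # assumed import paths
from regression_priors.experiment import rolling_window_forecast
from regression_priors.evaluation import r_squared

def grid_search(df, similarities):
    """Return the best (κ, τ², R²) over the 81-point grid."""
    kappas = np.arange(2, 11)                 # κ ∈ {2, ..., 10}
    tau_sqs = np.linspace(0.05, 0.25, 9)      # assumed 9-point τ² grid → 81 combos
    results = []
    for kappa in kappas:
        for tau_sq in tau_sqs:
            prior = Prior(mean=kappa * similarities,
                          variances=np.full(len(similarities), tau_sq))
            y_true, y_pred, _, _ = rolling_window_forecast(df, prior=prior, sigma_sq=None)
            results.append((kappa, tau_sq, r_squared(y_true, y_pred)))
    return max(results, key=lambda r: r[2])
```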
7.2 run_ablation.py — Prior Ablation Study
The core falsification test. For each of 6 target variables, runs 8 different prior conditions:
| Condition | Prior mean | Tests what? |
|---|---|---|
| OLS | No prior | Baseline |
| Embedding SPR | κ · cos_sim | Real embedding prior |
| Shuffled (50 trials) | Random permutation of similarities | Same values, wrong assignment |
| Uniform | Mean similarity for all | Pure shrinkage, no differentiation |
| Random (50 trials) | Drawn from N(μ, σ²) of similarities | Same distribution, random values |
| Inverted | Rank-reversed similarities | Should hurt if info exists |
| Negated | −cos_sim | Tests sign sensitivity |
| LLM-Elicited | Direct LLM estimates | Our approach |
The punchline: if shuffled, uniform, and random priors all produce the same R² as the real embedding prior, then the similarities carry no useful information — the improvement is just regularization.
run_ablation.py (simplified)

```python
def eval_prior(mean_values):
    """Run rolling window forecast with a given prior mean vector."""
    prior = Prior(mean=kappa * mean_values,
                  variances=np.full(n_features, tau_sq))
    _, y_pred, log_ml, log_ml_nig = rolling_window_forecast(
        df, prior=prior, sigma_sq=None, use_nig=True)
    return r_squared(y_true, y_pred), log_ml, log_ml_nig

# Real embedding prior
r2_real, log_ml_real, log_ml_nig_real = eval_prior(similarities)

# Shuffled: 50 random permutations of the same similarity values
for seed in range(50):
    shuffled_sims = rng.permutation(similarities)
    r2_s, log_ml_s, log_ml_nig_s = eval_prior(shuffled_sims)
    ...

# Uniform: all features get the mean similarity
uniform_sims = np.full(n_features, similarities.mean())
r2_uniform, log_ml_uniform, log_ml_nig_uniform = eval_prior(uniform_sims)
```
7.3 run_grid_ablation.py — Grid-Search Ablation
The paper's exact setup (CPIAUCSL, 5 features, expanding window) but with shuffled and uniform controls at every single grid point. This is the 81-point grid × 50 shuffled trials = 4,050 additional rolling window experiments.
This is the most computationally expensive test but also the most definitive. If the embedding prior can't beat shuffled at any (κ, τ²) combination, the information content is zero.
7.4 run_small_data.py — Window-Size Analysis
Tests how each approach performs as the training window shrinks (where priors should matter most). Runs window sizes of 24, 36, 48, 60, and 120 months, plus an expanding window, for both p=5 and p=10 features.
run_small_data.py (core loop)

```python
WINDOW_SIZES = [24, 36, 48, 60, 120, None]  # None = expanding

for win_size in WINDOW_SIZES:
    # OLS, Embedding SPR, 50 Shuffled trials, LLM-Elicited
    # ... all run at this window size ...
```
7.5 run_antonym_test.py — Antonym Test
A direct test of whether each approach can distinguish opposite economic relationships. We construct 10 pairs of descriptions that describe opposite scenarios (e.g., "GDP growing" vs "GDP contracting") and test whether each method assigns opposite signs.
src/regression_priors/evaluation_extended.py

```python
ANTONYM_PAIRS = [
    ("Economic growth accelerating with rising GDP...",
     "Economic recession with falling GDP...", "opposite"),
    ("Inflation rising due to increased consumer demand...",
     "Inflation falling due to decreased consumer demand...", "opposite"),
    # ... 8 more pairs ...
]
```
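A sketch of how a single pair might be scored for the embedding side; the embed helper and the scoring convention are assumptions, and the real harness in evaluation_extended.py may differ:

```python
import numpy as np

def embedding_distinguishes(target_text, text_a, text_b, embed):
    """embed(texts) -> array of embedding vectors (hypothetical helper)."""
    vecs = np.asarray(embed([target_text, text_a, text_b]), dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim_a, sim_b = unit[1] @ unit[0], unit[2] @ unit[0]
    # Opposite scenarios should land on opposite sides of zero; in practice
    # both similarities come out positive, so this returns False.
    return np.sign(sim_a) != np.sign(sim_b)
```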
Result: Embeddings correctly distinguish direction on 0 of 10 pairs (both descriptions get positive similarity to the target). The LLM correctly distinguishes 9 of 10 pairs.
7.6 run_sign_accuracy.py — Large-Scale Sign Accuracy
For 6 targets × 23 features = 138 (target, feature) pairs, checks whether each method predicts the correct sign of the OLS coefficient. Embeddings always predict positive (cosine similarity is always positive), so they get every negative coefficient wrong.
| Metric | Embedding | LLM-Elicited |
|---|---|---|
| Overall sign accuracy | 63.8% | 69.6% |
| Accuracy on negative coefficients | 0% | 52% |
| Spearman ρ with OLS | 0.187 | 0.562 |
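The table's three metrics reduce to a few lines of numpy/scipy. A sketch, assuming prior_means and ols_coefs are aligned arrays over the 138 pairs:

```python
import numpy as np
from scipy.stats import spearmanr

def sign_and_rank_metrics(prior_means, ols_coefs):
    match = np.sign(prior_means) == np.sign(ols_coefs)
    overall = match.mean()                       # overall sign accuracy
    negative = ols_coefs < 0
    neg_accuracy = match[negative].mean()        # accuracy on negative coefficients
    rho, _ = spearmanr(prior_means, ols_coefs)   # rank agreement with OLS
    return overall, neg_accuracy, rho
```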
7.7 run_cross_model_embeddings.py — Cross-Model Comparison
Tests whether the tight clustering of cosine similarities is specific to one embedding model or is a universal property. Runs 6 different models (Gemini, OpenAI large/small, Voyage, Cohere, Mistral) through the same tests.
Result: All embedding models produce tightly clustered, all-positive similarities. The inability to distinguish directional relationships is a fundamental limitation of embedding geometry for this task, not a model-specific issue.
8. Layer 6: Test Suite
The test suite has 50 tests across 5 files. Every LLM and API call is mocked — the tests verify pure math and logic. This was written test-first (TDD).
Test Categories
test_bayesian.py (19 tests)
The mathematical core. Tests things like:
- Uninformative prior → OLS: With variance = 10¹², the posterior mean should equal the OLS estimate.
- Very noisy data → prior dominates: With σ² = 10¹², the posterior mean should equal the prior mean.
- Known analytic case: One observation (y=3), one predictor (x=1), prior N(0, 4), σ²=1. Posterior mean should be exactly 2.4, posterior variance exactly 0.8.
- Log ML ranking: A tight prior centered at the truth should have higher log ML than a tight prior centered far from the truth.
- NIG convergence: When the Inverse-Gamma prior on σ² is very precise (α0 = 100,000), the NIG log ML should approximately match the fixed-σ² log ML.
tests/test_bayesian.py

```python
def test_known_analytic_solution():
    """Verify against hand-computed posterior for simple case.

    Prior: β ~ N(0, τ²=4)
    Data: y=3, x=1, σ²=1
    Posterior mean = τ²x / (σ² + τ²x²) · y = 4/(1+4) · 3 = 2.4
    Posterior var  = σ²τ² / (σ² + τ²x²) = 4/5 = 0.8
    """
    X = np.array([[1.0]])
    y = np.array([3.0])
    prior = Prior(mean=np.array([0.0]), variances=np.array([4.0]))

    result = bayesian_regression(X, y, prior, sigma_sq=1.0)

    assert_allclose(result.posterior_mean, [2.4], atol=1e-10)
    assert_allclose(result.posterior_covariance, [[0.8]], atol=1e-10)
```
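As a second example, the uninformative-prior check from the first bullet might look roughly like this; the test name, synthetic data, and import paths are illustrative, not the repository's exact code:

```python
import numpy as np
from numpy.testing import assert_allclose
from regression_priors.bayesian import Prior, bayesian_regression  # assumed import path

def test_uninformative_prior_matches_ols():
    """With huge prior variance, the posterior mean should collapse to OLS."""
    rng = np.random.default_rng(42)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

    prior = Prior(mean=np.zeros(3), variances=np.full(3, 1e12))
    result = bayesian_regression(X, y, prior, sigma_sq=1.0)

    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    assert_allclose(result.posterior_mean, beta_ols, atol=1e-6)
```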
test_priors_embedding.py (6 tests)
Tests the embedding prior constructor with mocked embedding responses. Verifies that κ scales correctly, orthogonal embeddings give zero similarity, and (critically) that antonym pairs get similar embeddings.
test_priors_elicited.py (7 tests)
Tests the LLM-elicited prior constructor with mocked LLM responses. Verifies structured JSON parsing, one-call-per-feature behavior, negative variance clamping, and that antonym pairs get opposite signs.
test_data_fred.py (9 tests)
Tests data transforms (log, diff_log, diff_log_12) and FRED fetching with mocked API. Verifies zero-flooring for FEDFUNDS, NaN handling, and date alignment.
test_evaluation.py (9 tests)
Tests R² against known values, directional accuracy, and BIC formula.
9. Putting It All Together
Here's the full data flow for a single experiment run, showing how every piece connects:
1. descriptions.py defines CPIAUCSL as the target with 5 features.
2. fred.py downloads monthly data from 1959–2025 and applies the diff_log_12 transform.
3. Two prior paths:
   - (3a) embedding.py embeds all descriptions and computes cosine similarities [0.64, 0.67, 0.64, 0.66, 0.72].
   - (3b) elicited.py asks Gemini 3.0 Pro for coefficient estimates [+0.35, −0.40, −0.40, +0.45, +0.45].
4. Both produce a Prior(mean=..., variances=...).
5. experiment.py runs the rolling window forecast: for each month, train on the past window, predict the next month.
6. At each window, bayesian.py computes the posterior mean and log marginal likelihood (fixed-σ² and NIG).
7. evaluation.py computes R² on the collected predictions.
8. A run_*.py script orchestrates the comparison: OLS vs embedding vs shuffled vs elicited.
The ablation scripts add a twist at step 3a: instead of using the real similarities, they permute them (shuffled), average them (uniform), draw random values (random), reverse-rank them (inverted), or negate them (negated). If the real similarities carry information, these corrupted versions should perform worse. They don't.
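Concretely, the corrupted prior-mean vectors are simple transformations of the similarity vector. A sketch of how they could be generated (the helper name and exact constructions are ours, not necessarily the repository's; the example similarities are the representative values from step 3a):

```python
import numpy as np

def corrupted_means(sims, rng):
    """Build the ablation conditions from the real cosine similarities."""
    inverted = np.empty_like(sims)
    inverted[np.argsort(sims)] = np.sort(sims)[::-1]      # largest value where the smallest was
    return {
        "shuffled": rng.permutation(sims),                # same values, wrong features
        "uniform": np.full_like(sims, sims.mean()),       # pure shrinkage, no differentiation
        "random": rng.normal(sims.mean(), sims.std(), size=sims.shape),  # same distribution
        "inverted": inverted,                             # rank-reversed
        "negated": -sims,                                 # flipped signs
    }

conditions = corrupted_means(np.array([0.64, 0.67, 0.64, 0.66, 0.72]),
                             np.random.default_rng(0))
```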
The Verdict
| Test | Embedding Priors | LLM-Elicited Priors |
|---|---|---|
| R² vs shuffled (81 grid points) | Mean Δ = +0.0002 (noise) | Beats embedding 81/81 |
| Sign accuracy (138 pairs) | 63.8% (0% on negatives) | 69.6% (52% on negatives) |
| Antonym test (10 pairs) | 0/10 correct | 9/10 correct |
| Spearman ρ with OLS | 0.187 | 0.562 |
| Small windows (≤60 months) | Destroys R² (−0.06 to −0.18) | Matches or beats OLS |
| NIG model evidence | ≈ shuffled | Orders of magnitude higher |
Embedding cosine similarities are a geometric property of text — they measure topical relatedness, not directional economic relationships. They're always positive, tightly clustered, and carry approximately zero information about regression coefficients beyond generic regularization. Asking the LLM directly is simpler, cheaper, and actually works.