<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Tim Lawrenz]]></title><description><![CDATA[Tim Lawrenz]]></description><link>https://blog.lawrenz.com</link><image><url>https://blog.lawrenz.com/img/substack.png</url><title>Tim Lawrenz</title><link>https://blog.lawrenz.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 20 Apr 2026 18:08:31 GMT</lastBuildDate><atom:link href="https://blog.lawrenz.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Tim Lawrenz]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[timlawrenz1@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[timlawrenz1@substack.com]]></itunes:email><itunes:name><![CDATA[Tim Lawrenz]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tim Lawrenz]]></itunes:author><googleplay:owner><![CDATA[timlawrenz1@substack.com]]></googleplay:owner><googleplay:email><![CDATA[timlawrenz1@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tim Lawrenz]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Literal Value Bottleneck]]></title><description><![CDATA[Why GNN Autoencoders Fail at Code Generation]]></description><link>https://blog.lawrenz.com/p/the-literal-value-bottleneck</link><guid isPermaLink="false">https://blog.lawrenz.com/p/the-literal-value-bottleneck</guid><dc:creator><![CDATA[Tim Lawrenz]]></dc:creator><pubDate>Mon, 20 Apr 2026 15:13:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ny0f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7b243a-b089-4f1d-a761-d15b59fe86b0_1220x446.png" length="0" 
type="image/png"/><content:encoded><![CDATA[<h1>Why GNN Autoencoders Fail at Code Generation</h1><p>We trained GNN autoencoders on 22,452 Ruby ASTs. The models achieved <strong>81% node type accuracy</strong> and <strong>99.5% type diversity</strong>, yet generated exactly <strong>0% syntactically valid code</strong>. Here is why.</p><p>Graph Neural Networks seem like a natural fit for code &#8212; after all, Abstract Syntax Trees <em>are</em> graphs. If GNNs can learn molecular structures well enough to generate valid drug candidates, surely they can learn code structure well enough to generate valid programs?</p><p>We ran 51 GPU experiments across five GNN architectures, three decoder strategies, four hidden dimensions, and three loss functions to find out. The answer is a definitive no &#8212; but the <em>reason</em> is not what we expected.</p><h2>The Setup: 22,452 Ruby ASTs with 74-Dimensional Features</h2><p>We parsed 22,452 Ruby methods from open-source repositories into AST graphs. Each node gets a <strong>74-dimensional one-hot feature vector</strong> encoding its AST type &#8212; one of 73 known types (<code>def</code>, <code>send</code>, <code>args</code>, <code>lvar</code>, <code>str</code>, ...)
plus a single <code>UNKNOWN</code> token. Literal values &#8212; method names, variable names, string contents, numeric values &#8212; are stripped of their content and mapped to <code>UNKNOWN</code>.</p><p>The dataset is split 85/15 into training (19,084) and validation (3,368) sets.</p><div class="callout-block" data-callout="true"><p><strong>Dataset</strong>: <a href="https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study">https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study</a></p></div><h2>GNNs <strong>Do</strong> Learn Code Structure</h2><p>Before diving into the failure, let&#8217;s establish that GNNs genuinely learn meaningful representations of code.</p><p>For <strong>cyclomatic complexity prediction</strong> (a graph-level regression task), we compared five architectures: GCN, GraphSAGE, GAT, GIN, and GraphConv. The results are clear:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/EXoXi/4/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e7b243a-b089-4f1d-a761-d15b59fe86b0_1220x446.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2358ffa-a2cc-45db-bb0f-aa0a502db513_1220x446.png&quot;,&quot;height&quot;:213,&quot;title&quot;:&quot;cyclomatic complexity prediction - The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/EXoXi/4/" width="730" height="213" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var 
r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>A 5-layer GraphSAGE achieves <strong>R&#178; = 0.71</strong>, explaining 71% of the variance in cyclomatic complexity. That&#8217;s a <strong>16% improvement</strong> over the 3-layer baseline &#8212; and it&#8217;s <strong>9.9&#963; significant</strong> based on 18 replicate runs (&#963; = 0.073).</p><p>Two patterns jump out:</p><ol><li><p><strong>Depth dominates width.</strong> Going from 3 to 5 layers improves MAE by 16%. Doubling the hidden dimension from 64 to 128? Zero improvement. For ASTs with depths of 10&#8211;30, deeper networks capture cross-branch dependencies that directly relate to complexity.</p></li><li><p><strong>GIN punches above its weight.</strong> GIN&#8217;s injective sum aggregation &#8212; which preserves the full multiset of neighbor features &#8212; gives it a 4% edge over SAGE at equal depth. This is the Weisfeiler-Leman advantage in practice.</p></li></ol><p>So the models clearly learn the graph structure. The question is: can they <em>reconstruct</em> it?</p><h2>The Generation Catastrophe: 0% Across the Board</h2><p>We trained graph autoencoders to encode an AST into a latent vector and decode it back. 
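</p><p>In outline, an autoencoder of this kind looks like the following. This is a deliberately minimal numpy sketch of the information flow, not the repo&#8217;s actual code (the real experiments used PyTorch Geometric, and every name below is illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TYPES = 74   # 73 known AST node types + UNKNOWN
HIDDEN = 16      # tiny for illustration; the experiments swept 128-512

def gnn_layer(a_hat, x, w):
    """One message-passing step: aggregate neighbor features, then a ReLU projection."""
    return np.maximum(a_hat @ x @ w, 0.0)

def encode(adj, x, w1, w2):
    """Encoder: two message-passing layers, then mean-pool to one graph latent."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    h = gnn_layer(a_hat, x, w1)
    h = gnn_layer(a_hat, h, w2)
    return h.mean(axis=0)               # single graph-level latent vector

def decode_teacher_forced(adj, z, w_out):
    """Decoder, teacher-forced variant: the ground-truth edges are handed to
    the decoder, which only has to predict a node-type distribution per node."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)
    h = a_hat @ np.tile(z, (n, 1))      # spread the latent along the true edges
    logits = h @ w_out
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # softmax over the 74 types

# Toy AST: 5 nodes with one-hot type features and parent->child edges.
n = 5
x = np.eye(NUM_TYPES)[rng.integers(0, NUM_TYPES, size=n)]
adj = np.zeros((n, n))
adj[0, 1] = adj[0, 2] = adj[2, 3] = adj[2, 4] = 1.0

w1 = 0.1 * rng.normal(size=(NUM_TYPES, HIDDEN))
w2 = 0.1 * rng.normal(size=(HIDDEN, HIDDEN))
w_out = 0.1 * rng.normal(size=(HIDDEN, NUM_TYPES))

z = encode(adj, x, w1, w2)
probs = decode_teacher_forced(adj, z, w_out)
print(z.shape, probs.shape)  # (16,) (5, 74)
```

<p>The real models replace these dense matrix products with GIN or SAGE convolutions and train end-to-end with a node-type cross-entropy loss, but the information flow is the same: one fixed-size latent has to carry the entire tree.</p><p>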
We tried everything:</p><ul><li><p><strong>5 architectures</strong>: GCN, SAGE, GAT, GIN, GraphConv</p></li><li><p><strong>3 loss functions</strong>: simple (node type CE), improved (+ parent prediction), comprehensive</p></li><li><p><strong>3 decoder edge modes</strong>: chain (sequential), teacher-forced (ground-truth edges), iterative (predicted edges)</p></li><li><p><strong>4 hidden dimensions</strong>: 128, 256, 512, and deep 5-layer variants</p></li></ul><div class="callout-block" data-callout="true"><p><strong>Every single configuration produces 0% syntactically valid Ruby.</strong></p></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/0HU11/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26000baa-76f0-474e-84ef-f56474752114_1220x514.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7eb6ffc-7d8f-4a5d-ad33-911c19a6d8f6_1220x514.png&quot;,&quot;height&quot;:247,&quot;title&quot;:&quot;The Generation Catastrophe - The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/0HU11/2/" width="730" height="247" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Validation loss converges (to ~3.8 with teacher forcing), so the models <em>are</em> learning something. 
But what?</p><h2>The Core Discovery: The Literal Value Bottleneck</h2><p>Here&#8217;s where it gets interesting. When we gave our best model &#8212; a 5-layer teacher-forced GIN decoder &#8212; the ground-truth tree structure and only asked it to predict node types, it achieved:</p><ul><li><p><strong>81% node type accuracy</strong></p></li><li><p><strong>99.5% type diversity</strong> (8.6 unique types per sample)</p></li><li><p><strong>0% syntax validity</strong></p></li></ul><p>How can a model be 81% accurate and produce 0% valid code?</p><p>Look at this Ruby method:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;ruby&quot;,&quot;nodeId&quot;:&quot;919e55af-eb23-4e9e-868e-a49d0ea7bc5a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-ruby">def call(storage)
  new(storage).call
end</code></pre></div><p>Its AST contains 12 elements:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/Obkfw/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b49b44c-e0ad-4f95-8f5c-c0686bdd69bc_1220x1172.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8b243e8-f79c-47c0-839e-cdce4accaaa5_1220x1172.png&quot;,&quot;height&quot;:554,&quot;title&quot;:&quot;The Core Discovery: The Literal Value Bottleneck - The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/Obkfw/2/" width="730" height="554" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p><strong>12/12 correct. 100% accuracy.</strong> The model perfectly reconstructs the AST skeleton. But 6 of those 12 nodes are <code>UNKNOWN</code> &#8212; they are literal values (method names <code>call</code> and <code>new</code>, variable name <code>storage</code>, and a <code>nil</code> sentinel) that were encoded as the undifferentiated <code>UNKNOWN</code> token. 
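</p><p>The collapse is easy to make concrete. Here is a minimal sketch of the feature encoding, using a toy type vocabulary (the real one has 73 known types plus <code>UNKNOWN</code>; see the linked dataset for the exact list):</p>

```python
import numpy as np

# Illustrative subset of the type vocabulary; the real one has 74 entries.
KNOWN_TYPES = ["def", "args", "arg", "send", "lvar", "str", "int"]
VOCAB = KNOWN_TYPES + ["UNKNOWN"]
TYPE_TO_IDX = {t: i for i, t in enumerate(VOCAB)}

def encode_node(token):
    """One-hot encode a token. Anything outside the type vocabulary --
    every method name, variable name, string, or number -- collapses
    into the single UNKNOWN bucket, and its content is gone."""
    vec = np.zeros(len(VOCAB))
    vec[TYPE_TO_IDX.get(token, TYPE_TO_IDX["UNKNOWN"])] = 1.0
    return vec

# Tokens from `def call(storage) ... end`: the structural types survive,
# but the identifiers `call`, `storage`, and `new` do not.
tokens = ["def", "call", "args", "arg", "storage", "send", "new", "lvar"]
encoded = [encode_node(t) for t in tokens]
n_unknown = sum(int(v[TYPE_TO_IDX["UNKNOWN"]]) for v in encoded)
print(f"{n_unknown}/{len(tokens)} tokens collapsed to UNKNOWN")  # 3/8
```

<p>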
The model correctly predicts <code>UNKNOWN</code> for all of them, which is technically right but utterly useless &#8212; the actual string content that makes the code meaningful is irrecoverable.</p><p>Across 500 validation samples, the numbers are stark:</p><ul><li><p>Typed AST nodes (<code>def</code>, <code>send</code>, <code>args</code>, ...): 53.2%</p></li><li><p>Literal values (identifiers, strings, numbers): <strong>46.8%</strong></p></li></ul><p><strong>Nearly half of every AST consists of literal values.</strong> Method names, variable names, string contents, numeric literals &#8212; all collapsed into a single <code>UNKNOWN</code> token. No amount of architectural tweaking, loss function engineering, or hidden dimension scaling can recover information that was never encoded.</p><p>This is the <strong>literal value bottleneck</strong>: the failure isn&#8217;t in model capacity or architecture. It&#8217;s in the input representation itself.</p><h2>Without Structure, Everything Collapses</h2><p>The bottleneck becomes even more apparent when we remove teacher forcing.
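</p><p>For reference, the three decoder edge modes differ only in where the decoder gets its edges. A schematic sketch, with hypothetical helper names rather than the repo&#8217;s API:</p>

```python
# Chain decoding wires node i to node i+1, discarding the tree topology;
# teacher forcing reuses the ground-truth parent edges during training.
def chain_edges(n):
    """Sequential edges: every node points to its successor."""
    return [(i, i + 1) for i in range(n - 1)]

def teacher_forced_edges(parents):
    """Ground-truth edges: parents[i] is node i's parent (-1 for the root)."""
    return [(p, i) for i, p in enumerate(parents) if p >= 0]

# A small example AST: root 0 with children 1 and 2; node 2 has children 3 and 4.
parents = [-1, 0, 0, 2, 2]
print(chain_edges(5))                 # [(0, 1), (1, 2), (2, 3), (3, 4)]
print(teacher_forced_edges(parents))  # [(0, 1), (0, 2), (2, 3), (2, 4)]
```

<p>(The third mode, iterative decoding, uses the decoder&#8217;s own predicted edges instead of either of these.)</p><p>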
Without ground-truth edges, the chain decoder (which connects nodes sequentially, destroying the tree topology) exhibits <strong>catastrophic mode collapse</strong>:</p><ul><li><p><strong>92.7%</strong> of all predicted tokens are <code>UNKNOWN</code> </p></li><li><p>Only <code>def</code> (3.6%) and <code>send</code> (3.0%) appear as alternatives</p></li><li><p>Average unique types per sample drops from 8.6 to <strong>1.6</strong></p></li><li><p>Type accuracy plummets from 81% to <strong>48%</strong></p></li></ul><p>The model learns a degenerate strategy: predict the most common token (which happens to be <code>UNKNOWN</code>, since 47% of the ground truth <em>is</em> <code>UNKNOWN</code>) and call it a day.</p><p>Teacher forcing fixes the <em>structural</em> component (restoring type accuracy to 81%), but the <em>lexical</em> component &#8212; the literal value bottleneck &#8212; remains.</p><h2>Dimension Doesn&#8217;t Matter, Depth Does (Slightly)</h2><p>One striking result: hidden dimensions of 128, 256, and 512 produce nearly <em>identical</em> outcomes:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/yuaJw/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2137e6ff-ce3d-423f-be3c-d9cc7573040e_1220x540.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/974686b3-9048-44de-9d3b-05b31d5f5d7a_1220x540.png&quot;,&quot;height&quot;:260,&quot;title&quot;:&quot;Dimension Doesn't Matter, Depth Does (Slightly) - The Literal Value Bottleneck: Why GNN Autoencoders Fail at Code Generation&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/yuaJw/2/" width="730" height="260" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use 
strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>More capacity doesn&#8217;t help. The bottleneck is <strong>information-theoretic</strong>, not computational. Going deeper (5 layers) nudges heuristic validity from 97% to 99.5%, consistent with the depth-over-width finding from complexity prediction &#8212; but the syntax validity needle doesn&#8217;t move from 0%.</p><h2>The Infrastructure: 51 Experiments for $4.32</h2><p>All of this &#8212; 51 GPU experiments across two hardware platforms &#8212; was orchestrated by <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a>, an autonomous LLM-driven research pipeline.</p><p>Ratiocinator handles the full lifecycle: it provisions Vast.ai RTX 4090 instances, deploys code via Git, installs dependencies, runs training, collects metrics, and tears down instances. The 18 baseline replicates that gave us our variance estimate? 
Those came from Ratiocinator accidentally running the same configuration three times due to an environment variable bug &#8212; which turned into a useful statistical gift.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/W1bv7/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c73f2aab-a6d5-44b3-8e1e-c081fb311ba2_1220x716.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1af0332-df48-4aca-a029-b821c037da37_1220x716.png&quot;,&quot;height&quot;:396,&quot;title&quot;:&quot;The Infrastructure: 51 Experiments for $4.32&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/W1bv7/3/" width="730" height="396" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>A 51-experiment GPU ablation study for the price of a latte. By treating the research process itself as a distributed systems problem, Ratiocinator proves that high-velocity architectural ablation doesn&#8217;t require a massive compute budget &#8212; just ruthless pipeline optimization.</p><h2>What Would It Take to Fix This?</h2><p>Our findings point to three concrete directions:</p><ol><li><p><strong>Literal value prediction heads.</strong> Add separate output heads for identifier names (via a vocabulary or copy mechanism), string contents, and numeric values.
The structural decoder already works &#8212; it&#8217;s the lexical reconstruction that&#8217;s missing.</p></li><li><p><strong>Hybrid architectures.</strong> Use GNN encoders for structural understanding, but pair them with autoregressive or grammar-constrained decoders for sequential output. The GNN captures the <em>shape</em> of the code; a sequential decoder fills in the <em>content</em>.</p></li><li><p><strong>Pointer networks / copy mechanisms.</strong> Let the decoder point back to nodes in the input graph to copy identifier names, rather than generating them from scratch. This is analogous to copy mechanisms in summarization models.</p></li></ol><p>The fact that GNNs achieve R&#178; = 0.71 for complexity prediction proves they learn meaningful code representations. The challenge is building decoders that can reconstruct the <strong>full richness</strong> of code &#8212; not just its structural skeleton &#8212; from those representations.</p><h2>Try It Yourself</h2><p>The full dataset, pre-trained models, and experiment configurations are available:</p><ul><li><p><strong>&#128202; Dataset</strong>: <a href="https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study">https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study</a> (22,452 Ruby methods with ASTs)</p></li><li><p><strong>&#128196; Paper</strong>: <a href="https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study/blob/main/paper.md">Full research paper</a> with all tables and appendices</p></li><li><p><strong>&#128187; Code</strong>: <a href="https://github.com/timlawrenz/jubilant-palm-tree">jubilant-palm-tree</a> (branch: <code>experiment/ratiocinator-gnn-study</code>)</p></li><li><p><strong>&#129302; Orchestrator</strong>: <a href="https://github.com/timlawrenz/ratiocinator">Ratiocinator</a> &#8212; the autonomous experiment runner</p></li></ul><div class="highlighted_code_block" 
data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;b67fda79-45f0-4382-99c4-cbcb074ab5c5&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Reproduce the key result
git clone -b experiment/ratiocinator-gnn-study https://github.com/timlawrenz/jubilant-palm-tree
cd jubilant-palm-tree
pip install torch torchvision torch_geometric

# Complexity prediction (the success story)
python train.py --conv_type SAGE --num_layers 5 --epochs 50

# Autoencoder with teacher-forced GIN (the 81%-accurate failure)
python train_autoencoder.py --decoder_conv_type GIN --decoder_edge_mode teacher_forced --epochs 30</code></pre></div><p>If you use this dataset or findings, please cite:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;6cbdf8a3-3177-43d5-bb2d-237b9769ddbd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">@misc{lawrenz2026gnnruby,
  title={Graph Neural Networks for Ruby Code Complexity Prediction and Generation: A Systematic Architecture Study},
  author={Tim Lawrenz},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/timlawrenz/gnn-ruby-code-study}}
}</code></pre></div>]]></content:encoded></item></channel></rss>