In recent years, we've seen an explosion of AI-powered coding assistants like GitHub Copilot and Cursor (followed by what seems like a new VSCode fork every other week). These tools rely heavily on neural code retrieval to provide context to large language models. While they have shown impressive capabilities, their fundamental approach to gathering context through embedding-based retrieval may be suboptimal. Instead, I argue that we should leverage either language-specific tooling or AST-based parsing combined with targeted heuristics to build more reliable, explainable, and effective coding assistants.

## The Current Landscape: Neural Code Retrieval

Current coding assistants typically use a combination of:

- Embedding models to encode code snippets and find similar pieces
- Heuristic-based approaches (open files, recently used files)
- Proximity-based context gathering

While this approach can work, it essentially treats code as unstructured text, ignoring the fact that we have much more powerful tools at our disposal for understanding code structure and relationships.

## Three Better Approaches to Context Retrieval

### 1. AST-Based Parsing with Heuristics

Tools like tree-sitter provide fast, reliable parsing of code into Abstract Syntax Trees (ASTs). While tree-sitter itself can't resolve symbols or determine types (as it's a parser, not a compiler), we can use it alongside smart heuristics to retrieve relevant context:

```go
func processData() {
    result := fetchUserData()
    processResult(result)
}
```

With this approach, we could:

1. Use tree-sitter to parse the code and identify function calls (`fetchUserData`, `processResult`)
2. Scan project files for these function definitions, again using tree-sitter to parse and identify function declarations
3. Follow import statements to scan dependent packages
4. Build a graph of function calls and definitions

This provides much more precise context than embedding-based similarity search, while remaining relatively language-agnostic.

### 2. Language Server Protocol (LSP) Integration

For editors like Cursor that are built on VSCode, the Language Server Protocol provides an elegant solution for context retrieval. LSP is a standardized protocol that allows any editor to communicate with language servers that provide rich code intelligence features.

This approach offers several advantages. First, LSP is already widely adopted, with servers available for most popular programming languages. These servers provide capabilities like go-to-definition, find-references, and type information - exactly what we need for context retrieval. Second, since many editors already integrate with LSP, this approach requires minimal additional infrastructure. Third, LSP servers are maintained by the language communities themselves, ensuring high-quality, up-to-date language support.

Using LSP, we can:

- Get precise symbol definitions and references
- Retrieve type information and documentation
- Find all implementations of interfaces
- Navigate through workspace symbols
- Access semantic tokens and syntax highlighting

### 3. Language-Specific Tooling

Another powerful approach is to leverage language-specific tooling that already exists for symbol resolution and type checking.
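To make this concrete, here is a minimal sketch (not production code) of programmatic symbol resolution with Go's own tooling: it loads a module through `golang.org/x/tools/go/packages` and prints, for every identifier, the exact type it resolves to and where it is defined. The load pattern (`./...`), the working directory, and the fields printed are illustrative assumptions rather than a fixed recipe; a real assistant would feed these resolved definitions into the prompt instead of similarity-matched snippets:

```go
package main

import (
    "fmt"
    "log"

    "golang.org/x/tools/go/packages"
)

func main() {
    cfg := &packages.Config{
        // Ask the loader for parsed syntax plus full type information.
        Mode: packages.NeedName | packages.NeedImports | packages.NeedDeps |
            packages.NeedTypes | packages.NeedSyntax | packages.NeedTypesInfo,
        Dir: ".", // illustrative: point this at the module you want to index
    }
    pkgs, err := packages.Load(cfg, "./...")
    if err != nil {
        log.Fatal(err)
    }

    for _, pkg := range pkgs {
        // TypesInfo.Uses maps every identifier to the object (function, type,
        // variable, ...) that the type checker resolved it to.
        for ident, obj := range pkg.TypesInfo.Uses {
            if obj.Pkg() == nil {
                continue // skip universe-scope names like len, error, true
            }
            fmt.Printf("%s: %s has type %s, defined at %s\n",
                pkg.Fset.Position(ident.Pos()), // where the symbol is used
                ident.Name,
                obj.Type(),                   // its exact type
                pkg.Fset.Position(obj.Pos())) // where it is defined
        }
    }
}
```

Every line of that output is a deterministic fact from the type checker, not a similarity guess.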
Comparable tooling exists across most mainstream languages, for example:

- Go's built-in `go/types` package and `go/ast` for complete symbol resolution
- Rust's `rust-analyzer` for detailed code analysis
- TypeScript's language service for type information and symbol resolution
- Java's JDT (Java Development Tools) for full semantic analysis

Here's the kind of information this makes available in Go:

```go
type UserService interface {
    GetUser(id string) (*User, error)
}

func processUserData(svc UserService) {
    user, err := svc.GetUser("123")
    // Go's type checker can tell us exactly:
    // - The types of user and err (*User and error)
    // - Where UserService is defined
    // - All implementations of UserService
}
```

Using Go's native tooling, we can:

- Resolve all type information precisely
- Find interface implementations
- Track cross-package dependencies
- Follow symbol definitions across the entire program

## Why Language-Specific Approaches Win Over Neural Retrieval

### Accuracy

Language-specific approaches provide precise symbol resolution instead of relying on similarity-based guessing. When working with code, we can obtain exact type information and guarantee that we find the relevant definitions and implementations. This stands in stark contrast to the probabilistic matching used in neural approaches, where there's always uncertainty about whether the retrieved context is truly relevant.

### Explainability

Every piece of context included through language-aware approaches has a clear reason for its inclusion. We can trace the exact path from where a symbol is used to where it's defined, and the results are deterministic. This makes it easier to debug issues and to understand why certain suggestions or completions are being made, unlike the black-box nature of neural retrieval systems.

### Feasibility

Implementing language-specific solutions is surprisingly practical. Most popular programming languages already have robust tooling that we can leverage, and supporting the top 10-15 languages would cover the vast majority of use cases. While there is an upfront cost to implementing support for each language, it's a one-time investment compared to the ongoing cost of training and maintaining neural models. The engineering effort required is well-defined and builds on decades of existing work in compiler technology and language tooling.

### Context Management and LLM Behavior

One of the most compelling arguments for structured retrieval lies in how it interacts with LLM behavior and context management. While modern LLMs can technically process massive context windows (some handle entire books), this capability comes with significant caveats that directly impact real-world performance.

First, there's the issue of context utilization. Even when an LLM can accept a large context window, it doesn't always use all of that context effectively. Embedding-based approaches often try to compensate for retrieval uncertainty by including more "top-k" results, hoping to catch all the relevant information. This leads to context bloat without guaranteeing better outcomes.

The cost implications are substantial. Every token in the context window adds computational cost and latency. When using embeddings with reranking to improve accuracy by including more potential matches, you're essentially paying for the LLM to process a lot of possibly irrelevant code. This affects both the financial cost per request and the time to first token - critical metrics for user-facing tools.

Most importantly, LLMs can actually perform worse when given irrelevant context.
It's not just a matter of wasted tokens; irrelevant information can actively distract the model and degrade the quality of its responses. This is where structured retrieval shines: by following actual code relationships through symbol resolution and dependency graphs, every piece of context included is guaranteed to be relevant by construction. We're not guessing at relationships through statistical similarity - we're following the exact links that make the code work.

This deterministic relevance has cascading benefits. We can be more selective about context inclusion without fear of missing critical information, leading to smaller, more focused context windows. That means faster responses, lower costs, and, most importantly, more accurate and reliable outputs from the LLM.

### Performance

Language-aware approaches also offer significant performance advantages over neural retrieval systems. From a computational perspective, structured approaches eliminate the need to maintain large vector indexes in memory or compute expensive similarity metrics like cosine distance over thousands or millions of vectors. Instead of running neural networks for embedding generation and approximate nearest neighbor (ANN) searches, we can simply traverse ASTs and symbol tables with deterministic algorithms.

This advantage shows up in both resource usage and speed. We avoid the high memory footprint of ANN indexes and the computational overhead of similarity searches, and direct lookups and graph traversals are often faster than vector similarity computations over large datasets. Additionally, these approaches are highly cacheable - parsed ASTs and symbol tables can be efficiently stored and reused.

When we retrieve context this way, we get exactly what we need without wasting valuable context window space on potentially irrelevant code snippets. This efficiency becomes particularly important in large codebases, where precise context retrieval is crucial for generating accurate completions.

## A Note on Quick Completions vs Agent Workflows

It's worth acknowledging that not all AI coding features have the same requirements. For quick completions and features like Cursor's Tab autocomplete, an embeddings-based approach combined with smart heuristics (like considering open files and recent edits) might actually be more suitable. These features prioritize speed and don't necessarily need perfect context - they just need to be good enough to help developers write their next line of code quickly.

However, for more complex scenarios like coding agents (think Cursor's Composer or Windsurf flows), where an AI is trying to understand and modify significant portions of a codebase, structured retrieval becomes crucial. These agents need a precise understanding of code relationships and dependencies to make informed decisions and generate reliable code changes.

## Conclusion

Code isn't just text - it's a graph of symbols, types, and dependencies that we can traverse deterministically. When we treat it as plain text and rely on embedding-based similarity to find relevant context, we're throwing away the precise relationships that make code meaningful in the first place. It's like having a map but choosing to navigate by looking at satellite photos and guessing which blurry patches might be roads.

For quick autocomplete features that suggest the next line of code, the fuzzy pattern matching of embedding approaches makes sense - developers can quickly reject incorrect suggestions, and the speed benefits outweigh the need for perfect accuracy. But for coding agents that need to understand and modify entire codebases, this approximation breaks down. An agent can't guess whether it's looking at the right implementation, or hope that it found all the relevant type definitions - it needs to know.

The tools to get this precise context already exist. Whether we use ASTs to follow function calls, LSP to resolve symbols, or language-specific tooling to trace types, we can gather exactly the context we need. Not only is this more reliable, it's also computationally cheaper than maintaining giant vector indexes and computing similarity scores. We don't need to approximate code relationships when we can just follow them.

*This post was crafted with a little help from Claude* 🤖