New Parallel Decoding Method Promises Faster LLM Responses Without Sacrificing Quality


As demand grows for faster, more capable large language models (LLMs), researchers have introduced a new approach that significantly reduces response times without compromising output quality. The method, developed by MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) in collaboration with Google, enables LLMs to self-manage how they generate responses, using a system called Parallel Structure Annotation, or PASTA.

At its core, PASTA is designed to overcome a key technical bottleneck in LLMs: the inherently sequential way they generate text. Traditional models produce responses one token at a time, with each step dependent on the last. That works well enough for short, simple outputs, but it leads to long delays when handling complex or lengthy prompts. Earlier attempts to speed things up, such as speculative decoding or syntactic parallel decoding, often relied on rigid rules and failed when outputs didn't follow predictable structures.
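To make the bottleneck concrete, here is a minimal greedy decoding loop: every new token requires its own forward pass that depends on all tokens generated so far, so latency grows with output length. This sketch uses the Hugging Face transformers library with GPT-2 purely for illustration; it is not the researchers' code.

```python
# Minimal illustration of standard sequential (autoregressive) decoding.
# The model choice (GPT-2) and greedy sampling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def decode_sequentially(prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits                           # one forward pass per token
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # greedy pick of the next token
        ids = torch.cat([ids, next_id], dim=-1)              # each step waits on the previous one
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(decode_sequentially("Parallel decoding matters because"))
```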

According to TechXplore, PASTA takes a different route by training the model to identify portions of a response that can be processed independently. These sections are marked using a new annotation language—PASTA-LANG—which serves as internal guidance for breaking down tasks during inference. An interpreter reads these tags and directs the model to generate the marked sections simultaneously, effectively parallelizing the response.
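As a rough mental model of what such an interpreter does, it can scan the annotated draft for independently marked spans, expand each one with a separate generation call, and stitch the results back together in order. The sketch below is an assumption-laden illustration only: the "<parallel>" tag and the generate() stub are hypothetical stand-ins, not the actual PASTA-LANG syntax or the paper's interpreter.

```python
# Hypothetical sketch of annotation-guided parallel decoding.
# The "<parallel>" tag and generate() stub are assumptions for illustration,
# not the real PASTA-LANG syntax or the researchers' interpreter.
import re
from concurrent.futures import ThreadPoolExecutor

TAG = re.compile(r"<parallel>(.*?)</parallel>", re.DOTALL)

def generate(sub_prompt: str) -> str:
    """Stand-in for an LLM call that expands one marked section."""
    return f"[expanded: {sub_prompt.strip()}]"

def interpret(skeleton: str) -> str:
    """Expand all tagged sections concurrently, then reassemble the response."""
    sections = TAG.findall(skeleton)
    with ThreadPoolExecutor() as pool:
        expanded = list(pool.map(generate, sections))  # marked sections run in parallel
    parts = iter(expanded)
    # Substitute each tagged span with its expanded text, preserving order.
    return TAG.sub(lambda _match: next(parts), skeleton)

skeleton = (
    "Answer outline: <parallel>explain the first step</parallel> "
    "and <parallel>explain the second step</parallel>."
)
print(interpret(skeleton))
```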

This approach, called “learned asynchronous decoding”, allows the model to orchestrate its own output strategy. During trials using the AlpacaEval benchmark, the system achieved nearly double the speed of standard decoding techniques, with only minimal variation in output quality—ranging from a 2% gain to a 7% drop.

The model was trained to produce these annotations through two stages of fine-tuning. The goal was not just speed but also maintaining, and in some cases improving, the coherence and relevance of the generated text.

By shifting parallelization decisions into the model itself, PASTA opens the door to faster, more efficient LLMs that better leverage modern computing hardware. It also lays the groundwork for reducing the computational demands of large-scale inference, which could help make these systems more accessible and cost-effective across commercial and research settings.

The research was published on arXiv.