New Parallel Decoding Method Promises Faster LLM Responses Without Sacrificing Quality


As demand grows for faster, more capable large language models (LLMs), researchers have introduced a new approach that significantly reduces response times without compromising output quality. The method, developed by MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) in collaboration with Google, enables LLMs to self-manage how they generate responses, using a system called Parallel Structure Annotation, or PASTA.

At its core, PASTA is designed to overcome a key technical bottleneck in LLMs: the inherently sequential way they generate text. Traditional models produce responses one token at a time, with each step dependent on the last. That works well enough for short, simple outputs, but it leads to long delays when handling complex or lengthy prompts. Earlier attempts to speed things up, such as speculative decoding or syntactic parallel decoding, often relied on rigid rules and failed when outputs didn't follow predictable structures.
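To make the bottleneck concrete, here is a minimal greedy decoding loop: every new token requires its own forward pass that depends on all tokens generated so far, so latency grows with output length. This sketch uses the Hugging Face transformers library with GPT-2 purely for illustration; it is not the researchers' code.

```python
# Minimal illustration of standard sequential (autoregressive) decoding.
# The model choice (GPT-2) and greedy sampling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def decode_sequentially(prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits                           # one forward pass per token
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # greedy pick of the next token
        ids = torch.cat([ids, next_id], dim=-1)              # each step waits on the previous one
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(decode_sequentially("Parallel decoding matters because"))
```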

According to TechXplore, PASTA takes a different route by training the model to identify portions of a response that can be processed independently. These sections are marked using a new annotation language—PASTA-LANG—which serves as internal guidance for breaking down tasks during inference. An interpreter reads these tags and directs the model to generate the marked sections simultaneously, effectively parallelizing the response.
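As a rough mental model of what such an interpreter does, it can scan the annotated draft for independently marked spans, expand each one with a separate generation call, and stitch the results back together in order. The sketch below is an assumption-laden illustration only: the "<parallel>" tag and the generate() stub are hypothetical stand-ins, not the actual PASTA-LANG syntax or the paper's interpreter.

```python
# Hypothetical sketch of annotation-guided parallel decoding.
# The "<parallel>" tag and generate() stub are assumptions for illustration,
# not the real PASTA-LANG syntax or the researchers' interpreter.
import re
from concurrent.futures import ThreadPoolExecutor

TAG = re.compile(r"<parallel>(.*?)</parallel>", re.DOTALL)

def generate(sub_prompt: str) -> str:
    """Stand-in for an LLM call that expands one marked section."""
    return f"[expanded: {sub_prompt.strip()}]"

def interpret(skeleton: str) -> str:
    """Expand all tagged sections concurrently, then reassemble the response."""
    sections = TAG.findall(skeleton)
    with ThreadPoolExecutor() as pool:
        expanded = list(pool.map(generate, sections))  # marked sections run in parallel
    parts = iter(expanded)
    # Substitute each tagged span with its expanded text, preserving order.
    return TAG.sub(lambda _match: next(parts), skeleton)

skeleton = (
    "Answer outline: <parallel>explain the first step</parallel> "
    "and <parallel>explain the second step</parallel>."
)
print(interpret(skeleton))
```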

This approach, called “learned asynchronous decoding”, allows the model to orchestrate its own output strategy. During trials using the AlpacaEval benchmark, the system achieved nearly double the speed of standard decoding techniques, with only minimal variation in output quality—ranging from a 2% gain to a 7% drop.

The model was trained to produce these annotations through two stages of fine-tuning. The goal was not just speed but also maintaining, and in some cases improving, the coherence and relevance of the generated text.

By shifting parallelization decisions into the model itself, PASTA opens the door to faster, more efficient LLMs that better leverage modern computing hardware. It also lays the groundwork for reducing the computational demands of large-scale inference, which could help make these systems more accessible and cost-effective across commercial and research settings.

The research was published on arXiv.