As demand for AI models continues to grow, so does the need for high-quality training data. While synthetic data offers a practical alternative to real-world datasets—especially in sensitive or hard-to-collect environments—it often lacks proper quality control. A new framework, recently introduced by researchers from the University of Pittsburgh and Peking University, addresses this gap with a method to assess and improve the usability of synthetic wireless data for AI training.
The research, presented at the MobiSys 2025 conference and recognized with a Best Paper Award, focuses on a challenge unique to wireless signal data. Unlike images or audio, wireless signals—often used in applications like home monitoring or virtual reality—are not easily interpretable by humans. This makes it difficult to evaluate whether synthetic examples accurately reflect real-world signal behavior.
To address this, the team developed a set of metrics to evaluate two critical aspects of synthetic data: affinity, which measures how closely the data aligns with real-world examples, and diversity, which ensures variation across samples. While many synthetic datasets succeed in generating diversity, the researchers found that synthetic wireless data often struggles with affinity, leading to mislabeled or misleading training samples, according to a report by TechXplore.
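The article does not give the paper's exact metric definitions, but both ideas can be illustrated with a common embedding-distance proxy: affinity as the average distance from each synthetic sample to its nearest real sample, and diversity as the average pairwise distance among synthetic samples. The sketch below is illustrative only, with all function names and data chosen for the example, not taken from the paper:

```python
import numpy as np

def affinity(synthetic, real):
    """Mean distance from each synthetic sample to its nearest real sample
    in feature space; lower values mean closer alignment with real data.
    (Illustrative proxy, not the paper's actual metric.)"""
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

def diversity(synthetic):
    """Mean pairwise distance among synthetic samples; higher values mean
    more variation across the generated set."""
    dists = np.linalg.norm(synthetic[:, None, :] - synthetic[None, :, :], axis=-1)
    n = len(synthetic)
    return dists.sum() / (n * (n - 1))  # exclude the zero self-distances

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(100, 8))   # stand-in for real signal features
close = rng.normal(0.0, 1.0, size=(50, 8))   # synthetic data matching the real distribution
far = rng.normal(5.0, 1.0, size=(50, 8))     # synthetic data drifted away from it

print(affinity(close, real) < affinity(far, real))  # the matching set has better affinity
```

A generator can score well on diversity while failing affinity, which is exactly the failure mode the researchers flagged for wireless data: the `far` set above is just as spread out as `close`, but its samples no longer resemble anything real.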
To improve model training outcomes, the team introduced a semi-supervised approach using a system called SynCheck. This framework filters out low-quality synthetic samples and reinforces the use of high-affinity data during model training. A small set of verified examples helps guide the model to recognize legitimate signal patterns.
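The article describes the filtering step only at a high level. One simple way to realize the idea of "a small verified set guiding which synthetic samples to trust" is to anchor per-class centroids on the verified examples and discard synthetic samples whose label disagrees with their nearest centroid. This is a minimal hypothetical sketch of that idea, not SynCheck's actual algorithm:

```python
import numpy as np

# Hypothetical sketch: a small set of verified examples defines per-class
# centroids; a synthetic sample is kept only if it lies closest to the
# centroid of its own claimed label. (Not the paper's actual method.)

def fit_centroids(features, labels):
    """One centroid per class, computed from the verified examples."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def filter_synthetic(centroids, features, labels):
    """Boolean mask: True where a sample's label matches its nearest centroid."""
    keep = []
    for x, y in zip(features, labels):
        d = {c: np.linalg.norm(x - mu) for c, mu in centroids.items()}
        keep.append(min(d, key=d.get) == y)
    return np.array(keep)

rng = np.random.default_rng(1)
# verified real examples for two "signal classes"
real_x = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(4, 1, (20, 4))])
real_y = np.array([0] * 20 + [1] * 20)
# synthetic examples: first half correctly labeled, second half mislabeled
syn_x = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(0, 1, (10, 4))])
syn_y = np.array([0] * 10 + [1] * 10)  # second half claims class 1 but looks like class 0

mask = filter_synthetic(fit_centroids(real_x, real_y), syn_x, syn_y)
print(mask[:10].mean(), mask[10:].mean())  # mislabeled samples are filtered out more often
```

The surviving samples would then be weighted into training alongside the verified set, mirroring the semi-supervised setup the article describes.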
SynCheck showed promising results: models trained with it achieved a 4.3% performance boost, while those trained on unfiltered synthetic data suffered a 13.4% performance drop.
The research highlights a key consideration in the use of synthetic data: quantity alone is not enough. For AI models to perform reliably, particularly in complex domains, careful evaluation and filtering of training data are essential. This work offers a practical path forward for improving AI systems that rely on signal-based inputs in both commercial and research applications.