Abstract: In recent years, the field of speech and language processing has made significant strides, yet challenges such as speech noise, limited high-quality data, and the lack of robustness in speech generation systems persist. Furthermore, evaluating speech comprehensively at scale remains a considerable obstacle. Concurrently, recent breakthroughs in Large Language Models (LLMs) have revolutionized text generation and natural language processing. However, the complexity of spoken language introduces unique hurdles, including managing long speech waveform sequences.