At the forefront of neuroscience, electroencephalography (EEG) is gradually uncovering how the brain processes language. While EEG datasets that use non-Chinese corpora as stimuli are relatively comprehensive, there is a notable absence of EEG datasets based on Chinese corpora. This gap constrains research on how the human brain represents language in the Chinese context and affects the accuracy of brain-computer interface (BCI) technologies based on the Chinese language.
Recently, a breakthrough study by Professor Quanying Liu from Southern University of Science and Technology and Professor Haiyan Wu from the University of Macau was published in Scientific Data, a Nature portfolio journal. This study introduced ChineseEEG, the first EEG dataset specifically designed for the Chinese language. It is also the first project funded by the MindD Program, a data support program initiated by the Tianqiao & Chrissy Chen Institute (TCCI®) in China.
Language is central to human communication. Relying on the brain's intricate processing mechanisms, humans can understand one another and express themselves in their mother tongue or, through study, in foreign languages. When receiving linguistic information, the brain activates a series of neural responses to decode it. By studying this neural activity, scientists can reveal how the brain processes and understands language.
In recent years, technologies including EEG, functional magnetic resonance imaging (fMRI), and electrocorticography (ECoG) have played a critical role in researching the language processing mechanism of the brain. Acquiring extensive neural signal data, however, remains challenging, especially for Chinese EEG datasets, which are scarce. The structural differences between languages indicate that they are processed differently in the brain, thus making it ever more important to create EEG datasets based on non-English stimuli.
To bridge this gap, Professor Liu and Professor Wu combined their research efforts. Their paradigm used the Chinese translation of The Little Prince and the Chinese classic Dream of the Wolf King as experimental materials. These texts are rich in common Chinese characters and expressions, thus providing diverse linguistic stimuli for the experiment. Each participant silently read these Chinese texts for up to 12 hours while their EEG and eye movement data were recorded. The reading process was divided into a practice reading phase and two formal reading phases, each comprising several rounds.
Experiment equipment and relevant data modality
The advantage of the ChineseEEG dataset is that it offers not only pre-processed sensor-level EEG data but also Chinese text embeddings generated by the BERT-base-Chinese model. This provides new perspectives for studying the correlation between text representations in natural language processing models and neural activity in the brain. Researchers can use this dataset to analyze how the brain processes the Chinese language, advancing cross-linguistic neuroscience research.
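One common way to compare model embeddings with neural recordings is representational similarity analysis (RSA). The sketch below is an illustration only and is not taken from the study: it assumes each text segment has one embedding vector and one EEG feature vector, and it uses small made-up numbers in place of real BERT-base-Chinese embeddings and real recordings. All function and variable names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Cosine distance for every unordered pair of segments."""
    return [
        cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rsa_score(text_vectors, eeg_vectors):
    """Correlate the pairwise-distance structure of the two spaces."""
    return pearson(
        pairwise_distances(text_vectors), pairwise_distances(eeg_vectors)
    )


# Toy stand-ins (assumption): four text segments with 3-dimensional vectors.
text_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.2, 1.0],
]
# Toy "EEG features" built to share the same geometry (a scaled copy).
eeg_vectors = [[2.0 * x for x in vec] for vec in text_vectors]
# Same features with segments 0 and 2 swapped, which breaks the alignment.
eeg_shuffled = [eeg_vectors[2], eeg_vectors[1], eeg_vectors[0], eeg_vectors[3]]

matched_score = rsa_score(text_vectors, eeg_vectors)
shuffled_score = rsa_score(text_vectors, eeg_shuffled)
```

In this toy setup the matched score is at its maximum because the EEG vectors were constructed from the text vectors; with real data the score would be far lower and would need a permutation test or similar to interpret.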
Potential applications of ChineseEEG
First, because each participant read Chinese texts rich in vocabulary and semantics for about 12 hours, the dataset is well suited to studying long-term changes in how the brain processes language. Second, with 128-channel high-density EEG recorded at a sampling rate of 1000 Hz, researchers can precisely track subtle variations in brain activity while a participant reads Chinese. More importantly, the pre-processed EEG data and text embeddings provided by the research team will enable those without a neuroscience or computer science background to apply the data directly in their research.
The ChineseEEG dataset can be used for several purposes: 1. Time-frequency analysis of EEG: extracting different frequency bands of neural oscillations; 2. Source reconstruction of EEG: revealing the sources of brain activity; 3. Text embeddings: using pre-trained models to calculate new embeddings and explore the correlation between EEG and text; 4. Data alignment: aligning EEG data with text content and eye-tracking data to aid interpretation.
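To make the first item concrete, here is a minimal sketch of band-power extraction using only the Python standard library. It is an illustration under stated assumptions, not the dataset's own pipeline: the signal is synthetic, the band edges are conventional values that vary between studies, and the function names are hypothetical. Only the 1000 Hz sampling rate comes from the dataset description.

```python
import cmath
import math

FS = 1000  # sampling rate in Hz, as reported for the dataset

# Conventional band edges in Hz (assumption; exact boundaries vary by study).
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}


def dft_power(signal, freq_hz, fs=FS):
    """Power of `signal` at one frequency, computed by a direct DFT sum."""
    n = len(signal)
    coeff = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * t / fs)
        for t, x in enumerate(signal)
    )
    return abs(coeff) ** 2 / n ** 2


def band_powers(signal, fs=FS, bands=BANDS):
    """Sum power over integer-Hz bins per band (low edge in, high edge out)."""
    return {
        name: sum(dft_power(signal, f, fs) for f in range(lo, hi))
        for name, (lo, hi) in bands.items()
    }


# Synthetic one-second "channel": a strong 10 Hz rhythm plus a weaker 20 Hz one.
synthetic = [
    math.sin(2 * math.pi * 10 * t / FS)
    + 0.3 * math.sin(2 * math.pi * 20 * t / FS)
    for t in range(FS)
]

powers = band_powers(synthetic)
dominant = max(powers, key=powers.get)  # band with the most power
```

The direct DFT sum keeps the example dependency-free but is slow; for real 128-channel recordings an FFT-based routine from a numerical or EEG library would be the practical choice.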
This dataset has significant implications for neuroscience, linguistics, and related fields, and broad applications in areas such as brain-computer interfaces (BCI) and semantic decoding. For instance, text conversion technology based on brain signals can help disabled individuals control computers or other devices directly through brain activity, enabling more convenient communication and lifestyles.
Professor Wu noted, “The collection, management, and analysis of vast amount of data collected from the brain is a well-recognized challenge that limits the application of next-generation AI technology, represented by large language models, in relevant fields. The MindD Program initiated by the Tianqiao and Chrissy Chen Institute is a timely solution that addresses the most urgent need of scientists and clinicians.”
The MindD Program aims to fund Chinese neuroscientists, cognitive scientists, psychologists, and doctors specializing in neurological and psychiatric diseases to facilitate the collection and analysis of data from the human brain, whole body and behaviors as well as the subsequent model training in compliance with safety regulations. The first phase of the program is expected to provide 100 million RMB (nearly US $14 million) in funding, along with free storage servers, computing power, innovative data collection technology, and AI and data expertise. The funding agreement between TCCI and the research team jointly led by Professor Wu and Professor Liu marks the first milestone of this program.
More established technology and further enriched datasets will foster more innovative research outcomes which will deepen our understanding of how the brain processes language and other complex tasks. The MindD Program will continue to help break data bottlenecks in related research fields to lay a solid foundation for the integration of “AI + brain science”. It will also strive to accelerate the actual use of AI technology in medical and health scenarios by empowering more international cooperation and cross-disciplinary research.