Abstract
Depression detection benefits from combining neurological and behavioral indicators, yet integrating heterogeneous modalities such as EEG and interview audio remains challenging. We propose a transformer-based multimodal framework that jointly models spectral, spatial, and temporal EEG features alongside linguistic and paralinguistic cues from interviews. By employing synchronized multi-head cross-attention and self-attention mechanisms, the model effectively captures both intra- and inter-modal correlations. In addition, a flexible temporal sequence matching strategy reduces the number of EEG channels required, improving device portability. Evaluated on the MODMA and DAIC-WOZ datasets, our approach outperforms state-of-the-art models, improving accuracy by 4.7% and precision by 10%. These results demonstrate the potential of the proposed framework for accurate, scalable, and cost-effective depression detection in both clinical and real-world settings.
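For intuition, the sketch below shows one way the synchronized self- and cross-attention fusion and the temporal sequence matching described above could look in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `match_length` and `CrossModalBlock`, the dimensions (`d_model=128`, `n_heads=4`), the linear-interpolation resampling, and the fusion order are all hypothetical choices for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def match_length(x, target_len):
    """Resample a (batch, T, d) sequence to target_len along time --
    a simple stand-in for the flexible temporal sequence matching step."""
    return F.interpolate(
        x.transpose(1, 2), size=target_len, mode="linear", align_corners=False
    ).transpose(1, 2)

class CrossModalBlock(nn.Module):
    """One fusion block: self-attention within each modality to capture
    intra-modal correlations, then cross-attention in both directions
    to capture inter-modal correlations."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.eeg_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aud_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.eeg_from_aud = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aud_from_eeg = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_eeg = nn.LayerNorm(d_model)
        self.norm_aud = nn.LayerNorm(d_model)

    def forward(self, eeg, aud):
        # Intra-modal: each stream attends to itself.
        eeg = self.norm_eeg(eeg + self.eeg_self(eeg, eeg, eeg)[0])
        aud = self.norm_aud(aud + self.aud_self(aud, aud, aud)[0])
        # Inter-modal: EEG queries audio while audio queries EEG,
        # both reading from the same post-self-attention states.
        eeg_fused = eeg + self.eeg_from_aud(eeg, aud, aud)[0]
        aud_fused = aud + self.aud_from_eeg(aud, eeg, eeg)[0]
        return eeg_fused, aud_fused

# Toy usage: bring both modalities to a shared length before fusion so
# the attention streams are synchronized position by position.
eeg = torch.randn(2, 80, 128)          # (batch, T_eeg, d) EEG features
aud = torch.randn(2, 50, 128)          # (batch, T_aud, d) interview audio features
eeg = match_length(eeg, aud.shape[1])  # align EEG to the audio timeline
fused_eeg, fused_aud = CrossModalBlock()(eeg, aud)
print(fused_eeg.shape, fused_aud.shape)  # torch.Size([2, 50, 128]) twice
```

Resampling one stream to the other's timeline is only one plausible reading of the matching strategy; the point of the sketch is that aligning sequence lengths lets cross-attention operate without requiring dense EEG channel coverage.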
| Original language | English |
|---|---|
| Article number | 109039 |
| Number of pages | 11 |
| Journal | Biomedical Signal Processing and Control |
| Volume | 113 |
| Issue number | B |
| Early online date | 5-Nov-2025 |
| DOIs | |
| Publication status | E-pub ahead of print - 5-Nov-2025 |
Keywords
- Depression detection
- EEG
- Flexible temporal sequence matching
- Modality synchronization
- Multimodal transformer