======
Scores
======

The scores define different metrics to evaluate control model performance in articulatory speech synthesis. Higher scores indicate better performance, with each subscore typically ranging between 0 and 100.

Groups of Scores
================

The benchmark distinguishes three main groups of scores:

1. **Articulatory Scores**

   - Measure the quality of articulatory movements
   - Evaluate the velocity and jerk distributions
   - Compare virtual tongue movement to EMA data
   - Compare virtual tongue height to ultrasound measurements

2. **Semantic Scores**

   - Evaluate how close the produced semantic vector embedding is to the target embedding
   - Assess the rank of the target in word classification

3. **Acoustic Scores**

   - Compare the synthesized audio with the target audio recording
   - Evaluate the loudness envelope and log-mel spectrograms
   - Future: f0 and formant transition scores


Score Calculation
=================

Scores are calculated as follows:

1. Calculate error per token
2. Average errors across dataset
3. Normalize by baseline model's average error
4. Subtract from 1 and multiply by 100

This ensures:

- No error = score of 100
- Baseline model error = score of 0
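
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the function name `normalized_score` and the example numbers are assumptions made for this sketch.

```python
from statistics import mean


def normalized_score(token_errors, baseline_error):
    """Map per-token errors to a score.

    Averages the per-token errors, divides by the baseline model's
    average error, subtracts the ratio from 1 and multiplies by 100.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (1.0 - mean(token_errors) / baseline_error)


# Hypothetical per-token errors and baseline error, for illustration only.
example_score = normalized_score([0.5, 1.5], baseline_error=2.0)
```

Under this formula, a model whose average error exceeds the baseline error receives a negative score.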

Total Score Formula
-------------------
.. math::

    S_\text{total} = S_\text{articulatory} + S_\text{semantic} + S_\text{acoustic}

1. **Articulatory Scoring**
----------------------------

The `score_articulatory` function calculates an overall articulatory score by combining the results from three sub-scores:

- **Tongue Height**
- **EMA (Electromagnetic Articulography)**
- **Velocity and Jerk**


1.1 `score_tongue_height(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_{\text{tongue\_height}} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{RMSE}(\text{height}_\text{synthesis}, \text{height}_\text{ultrasound}))}{\text{baseline model}} \right)


Calculates a score based on the mean RMSE between the synthesized tongue height and the reference tongue height from ultrasound measurements.


1.2 `score_ema(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_{\text{ema}} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{RMSE}_\text{TT,x}, \text{RMSE}_\text{TT,y}, \text{RMSE}_\text{TT,z}, \text{RMSE}_\text{TB,x}, \text{RMSE}_\text{TB,y}, \text{RMSE}_\text{TB,z})}{\text{baseline model}} \right)

.. math::
    
   \text{RMSE}_\text{TT,x} = \text{RMSE}(\text{tongue\_tip}_\text{synthesis, x}, \text{tongue\_tip}_\text{ema, x})


Calculates a score from EMA (Electromagnetic Articulography) data for the tongue tip (TT) and tongue body (TB), based on the mean RMSE between the synthesis and the reference EMA in the x, y, and z directions.
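
The per-token EMA error can be sketched as the mean of six channel-wise RMSE values. The channel keys, the dictionary layout, and the function names below are hypothetical and chosen only for this example; the benchmark's own data format may differ.

```python
import math

# Hypothetical channel names: tongue tip (TT) and tongue body (TB) in x, y, z.
CHANNELS = ("TT_x", "TT_y", "TT_z", "TB_x", "TB_y", "TB_z")


def rmse(pred, target):
    """Root mean squared error between two equally long sequences."""
    if len(pred) != len(target):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def ema_token_error(synthesis, ema):
    """Mean of the six per-channel RMSE values for a single token."""
    return sum(rmse(synthesis[c], ema[c]) for c in CHANNELS) / len(CHANNELS)


# Toy token: only the tongue tip x channel deviates from the reference.
synthesis = {c: [0.0, 0.0] for c in CHANNELS}
reference = {c: [0.0, 0.0] for c in CHANNELS}
reference["TT_x"] = [3.0, 3.0]
token_error = ema_token_error(synthesis, reference)
```

The resulting per-token errors would then be averaged over the dataset and normalized by the baseline model's error.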


1.3 `score_vel_jerk(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_\text{vel\_jerk} = 100 \cdot \left(2 - \frac{\text{mean}_\text{token}(\max(\text{velocity}_\text{synthesis}))}{\max(\text{velocity}_\text{GECO})} - \frac{\text{mean}_\text{token}(\max(\text{jerk}_\text{synthesis}))}{\max(\text{jerk}_\text{GECO})}\right)


Calculates a score based on the velocity and jerk of the control parameter (cp) trajectories. The score is computed on a logarithmic scale; to limit the influence of outliers, the calculation uses the 99.9% quantile.
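
As a rough sketch of the quantities involved, velocity and jerk can be approximated by first- and third-order finite differences of a trajectory, and the 99.9% quantile of their absolute values can serve as an outlier-robust peak. The helper names, the finite-difference approximation, and the toy trajectory are assumptions made for this sketch; the logarithmic scaling and the normalization by the GECO values are left out.

```python
from statistics import quantiles


def finite_difference(values, order=1):
    """Finite difference of the given order along a 1-D sequence."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def robust_peak(values):
    """99.9% quantile of the absolute values (last of 999 cut points)."""
    return quantiles([abs(v) for v in values], n=1000)[-1]


# Hypothetical trajectory of one control parameter moving at constant speed.
trajectory = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
velocity_peak = robust_peak(finite_difference(trajectory, order=1))
jerk_peak = robust_peak(finite_difference(trajectory, order=3))
```

For this constant-speed toy trajectory the velocity peak equals the step size and the jerk peak is zero.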


2. **Acoustic Scoring**
------------------------

The `score_acoustic` function evaluates the acoustic properties of the data by combining two sub-scores:

- **Loudness**
- **Spectrogram**


2.1 `score_loudness(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_\text{loudness} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{RMSE}(\text{loudness}_\text{synthesis}, \text{loudness}_\text{recording}))}{\text{baseline model}} \right)


Calculates a score based on the difference between the predicted loudness and the target loudness.
Loudness is calculated every 220 samples over a 1024-sample window by summing all log-mel spectrogram entries for each time slice.
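
The summation over each time slice can be sketched as follows. The sketch assumes the log-mel spectrogram is stored with one row per time slice and one column per mel band; this layout and the function name are assumptions, and the spectrogram computation itself is not shown.

```python
def loudness_envelope(log_mel):
    """Sum all mel-band entries of each time slice.

    Assumes ``log_mel`` is a sequence of time slices, each a sequence of
    log-mel values (rows = time, columns = mel bands).
    """
    return [sum(time_slice) for time_slice in log_mel]


# Toy log-mel spectrogram with three time slices and two mel bands.
toy_log_mel = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.5]]
envelope = loudness_envelope(toy_log_mel)
```

The result has one loudness value per time slice, so the envelopes of synthesis and recording can be compared with the RMSE.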


2.2 `score_spectrogram(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_\text{spectrogram} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{RMSE}(\text{spectrogram}_\text{synthesis}, \text{spectrogram}_\text{recording}))}{\text{baseline model}} \right)

Calculates a score based on the difference between the predicted log-mel spectrogram and the target log-mel spectrogram.
The Mel spectrogram uses 60 mel bands in the frequency range from 10 to 12000 Hz, a time shift of 110 samples, and a Fourier transform window of 1024 samples.

3. **Semantic Scoring**
------------------------

The `score_semantic` function evaluates the semantic properties of the data by combining two sub-scores:

- **Semantic Distance**
- **Semantic Rank**


3.1 `score_sem_dist(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_\text{sem\_dist} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{RMSE}(\text{semantic\_vector}_\text{synthesis}, \text{semantic\_vector}_\text{target}))}{\text{baseline model}} \right)


Calculates a score based on the semantic distance between the predicted semantic vector and the target semantic vector.


3.2 `score_sem_rank(data, task)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   S_\text{sem\_rank} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{rank}_\text{target} - 1)}{4311} \right)


Calculates a score based on the rank of the target among a set of 4311 reference vectors, which includes the target.
The reference vectors are ranked from least to most distant, based on the Euclidean distance between the produced semantic vector and each reference vector.
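
The ranking step can be sketched as follows. The function names, the tie handling (ties resolve in favor of the target), and the toy vectors are assumptions made for this illustration; in the benchmark the reference set contains 4311 vectors.

```python
import math


def target_rank(produced, references, target_index):
    """1-based rank of the target among the references.

    References are ordered from least to most distant to the produced
    vector, using the Euclidean distance.
    """
    distances = [math.dist(produced, ref) for ref in references]
    target_distance = distances[target_index]
    # Count references strictly closer than the target (assumed tie rule).
    return 1 + sum(d < target_distance for d in distances)


def sem_rank_score(ranks, n_references):
    """Score from per-token ranks, following the formula above."""
    mean_rank = sum(ranks) / len(ranks)
    return 100.0 * (1.0 - (mean_rank - 1.0) / n_references)


# Toy example with three reference vectors instead of 4311.
references = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
produced = [0.9, 0.1]
best_rank = target_rank(produced, references, target_index=1)
worst_rank = target_rank(produced, references, target_index=2)
```

A produced vector that is closest to its target yields rank 1 and therefore contributes no error to the score.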