Scores

The scores are a set of metrics for evaluating the performance of control models in articulatory speech synthesis. Higher scores indicate better performance, and each subscore typically lies between 0 and 100.

Groups of Scores

The benchmark distinguishes three main groups of scores:

  1. Articulatory Scores

  • Measure the quality of articulatory movements

  • Evaluate the velocity and jerk distribution

  • Compare virtual tongue movement to EMA data

  • Compare virtual tongue height to ultrasound measurements

  2. Semantic Scores

  • Evaluate how close the produced semantic vector embedding is to the target embedding

  • Assess the rank of the target in word classification

  3. Acoustic Scores

  • Compare the synthesized audio to the target audio recording

  • Evaluate the loudness envelope and log-mel spectrograms

  • Future: f0 and formant transition scores

Score Calculation

Scores are calculated as follows:

  1. Calculate error per token

  2. Average errors across dataset

  3. Normalize by baseline model’s average error

  4. Subtract from 1 and multiply by 100

This ensures:

  • No error = score of 100

  • Baseline model error = score of 0
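The four steps can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's implementation: the function name `normalized_score` and its arguments are hypothetical, and the per-token errors (step 1) are assumed to be computed already.

```python
from statistics import fmean


def normalized_score(token_errors, baseline_error):
    """Map per-token errors to a score: no error -> 100, baseline error -> 0.

    Hypothetical helper for illustration. `baseline_error` is assumed to be
    the baseline model's average error on the same dataset.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    mean_error = fmean(token_errors)              # step 2: average across the dataset
    relative_error = mean_error / baseline_error  # step 3: normalize by the baseline
    return 100.0 * (1.0 - relative_error)         # step 4: subtract from 1, scale by 100


print(normalized_score([0.0, 0.0, 0.0], baseline_error=2.0))  # 100.0
print(normalized_score([1.0, 3.0], baseline_error=2.0))       # 0.0
print(normalized_score([4.0, 4.0], baseline_error=2.0))       # -100.0
```

The last call shows that a model with a larger average error than the baseline receives a negative score, which is why the range of 0 to 100 is typical rather than guaranteed.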

Total Score Formula

\[S_\text{total} = S_\text{articulatory} + S_\text{semantic} + S_\text{acoustic}\]

1. Articulatory Scoring

The score_articulatory function calculates an overall articulatory score by combining the results from three sub-scores:

  • Tongue Height

  • EMA (Electromagnetic Articulography)

  • Velocity and Jerk

1.1 score_tongue_height(data, task)

\[S_{\text{tongue\_height}} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(RMSE(\text{height}_\text{synthesis}, \text{height}_\text{ultrasound}))}{\text{baseline model}} \right)\]

Calculates a score based on the mean RMSE between the tongue height of the synthesis and the reference tongue height measured with ultrasound.

1.2 score_ema(data, task)

\[S_{\text{ema}} = 100 \cdot \left( 1 - \frac{\text{mean}_\text{token}(\text{RMSE}_\text{TT,x}, \text{RMSE}_\text{TT,y}, \text{RMSE}_\text{TT,z}, \text{RMSE}_\text{TB,x}, \text{RMSE}_\text{TB,y}, \text{RMSE}_\text{TB,z})}{\text{baseline model}} \right)\]
\[\text{RMSE}_\text{TT,x} = RMSE(\text{tongue\_tip}_\text{synthesis, x}, \text{tongue\_tip}_\text{ema, x})\]

Calculates a score from the EMA (Electromagnetic Articulography) data for the tongue tip (TT) and tongue body (TB), based on the mean RMSE between the synthesis and the reference EMA trajectories in the x, y, and z directions. The other five RMSE terms are defined analogously to \(\text{RMSE}_\text{TT,x}\).
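As a rough sketch of the error term for a single token, the six RMSE values can be computed per sensor and axis and then averaged. The dictionary layout keyed by `(sensor, axis)` and the function names are assumptions made for this example only; the benchmark's actual data layout may differ.

```python
import math
from statistics import fmean

# Tongue tip (TT) and tongue body (TB), each in x, y, z: six components.
KEYS = [(sensor, axis) for sensor in ("TT", "TB") for axis in ("x", "y", "z")]


def rmse(predicted, reference):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    return math.sqrt(fmean([(p - r) ** 2 for p, r in zip(predicted, reference)]))


def ema_token_error(synthesis, reference):
    """Mean of the six RMSE values for one token (hypothetical helper)."""
    return fmean([rmse(synthesis[key], reference[key]) for key in KEYS])


reference = {key: [0.0, 0.0] for key in KEYS}
synthesis = dict(reference)
synthesis[("TT", "x")] = [3.0, 3.0]  # only the tongue tip deviates, and only in x

print(ema_token_error(synthesis, reference))  # 0.5
```

Averaging these per-token errors over the dataset and normalizing by the baseline model's error then yields the score.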

1.3 score_vel_jerk(data, task)

\[S_\text{vel\_jerk} = 100 \cdot \left(2 - \frac{mean_\text{token}(max(\text{velocity}_\text{synthesis}))}{max(\text{velocity}_\text{GECO})} - \frac{mean_\text{token}(max(\text{jerk}_\text{synthesis}))}{max(\text{jerk}_\text{GECO})}\right)\]

Calculates a score based on the velocity and jerk of the control parameter trajectories (cp-trajectories). The score is computed on a logarithmic scale and accounts for outliers by using the 99.9% quantile in the calculation.
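The sketch below follows the formula as written, using plain maxima. It deliberately omits the logarithmic scale and the 99.9% quantile mentioned in the description, since their exact form is not specified here. Velocity and jerk are approximated by first and third finite differences at a unit time step, and all function and parameter names are hypothetical.

```python
from statistics import fmean


def diff(values):
    """First-order finite difference of a sequence."""
    return [b - a for a, b in zip(values, values[1:])]


def peak_velocity_and_jerk(trajectory):
    """Peak absolute velocity (1st difference) and jerk (3rd difference).

    Assumes a single one-dimensional trajectory sampled at a unit time step.
    """
    if len(trajectory) < 4:
        raise ValueError("need at least 4 samples to compute jerk")
    velocity = diff(trajectory)
    jerk = diff(diff(velocity))
    return max(abs(v) for v in velocity), max(abs(j) for j in jerk)


def vel_jerk_score(trajectories, ref_peak_velocity, ref_peak_jerk):
    """Formula as written: 100 * (2 - velocity ratio - jerk ratio)."""
    peaks = [peak_velocity_and_jerk(t) for t in trajectories]
    mean_peak_velocity = fmean([v for v, _ in peaks])
    mean_peak_jerk = fmean([j for _, j in peaks])
    return 100.0 * (2.0 - mean_peak_velocity / ref_peak_velocity
                    - mean_peak_jerk / ref_peak_jerk)


# One token with a quadratic trajectory: velocity grows linearly, jerk is zero.
tokens = [[0.0, 1.0, 4.0, 9.0, 16.0]]
print(vel_jerk_score(tokens, ref_peak_velocity=14.0, ref_peak_jerk=1.0))  # 150.0
```

Here `ref_peak_velocity` and `ref_peak_jerk` stand in for the GECO reference maxima in the formula.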

2. Acoustic Scoring

The score_acoustic function evaluates the acoustic properties of the data by combining two sub-scores:

  • Loudness

  • Spectrogram

2.1 score_loudness(data, task)

\[S_\text{loudness} = 100 \cdot \left( 1 - \frac{mean_\text{token}( RMSE(\text{loudness}_\text{synthesis}, \text{loudness}_\text{recording}))}{\text{baseline model}} \right)\]

Calculates a score based on the difference between the predicted loudness and the target loudness. Loudness is calculated every 220 samples over a 1024-sample window by summing all log-mel spectrogram entries of each time slice.
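The summation over each time slice can be illustrated as follows. The sketch assumes the log-mel spectrogram is already available as a list of time slices (a time-major layout chosen for this example), and the toy values are not real spectrogram data.

```python
import math
from statistics import fmean


def loudness_envelope(log_mel_spectrogram):
    """Sum all mel-band entries of each time slice.

    `log_mel_spectrogram` is assumed to be a list of time slices, each a
    list of log-mel values (an assumption of this sketch).
    """
    return [sum(time_slice) for time_slice in log_mel_spectrogram]


def rmse(predicted, reference):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    return math.sqrt(fmean([(p - r) ** 2 for p, r in zip(predicted, reference)]))


synthesis = [[1.0, 2.0], [3.0, 4.0]]   # two time slices, two mel bands each
recording = [[2.0, 4.0], [5.0, 5.0]]

print(loudness_envelope(synthesis))  # [3.0, 7.0]
print(loudness_envelope(recording))  # [6.0, 10.0]
print(rmse(loudness_envelope(synthesis), loudness_envelope(recording)))  # 3.0
```

The resulting RMSE is the per-token error that enters the loudness score.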

2.2 score_spectrogram(data, task)

\[S_\text{spectrogram} = 100 \cdot \left( 1 - \frac{mean_\text{token}(RMSE(\text{spectrogram}_\text{synthesis}, \text{spectrogram}_\text{recording}))}{\text{baseline model}} \right)\]

Calculates a score based on the difference between the predicted and the target log-mel spectrogram. The Mel spectrogram uses 60 banks in the frequency range from 10 to 12000 Hz, a time shift of 110 samples, and a Fourier transform window of 1024 samples.

3. Semantic Scoring

The score_semantic function evaluates the semantic properties of the data by combining two sub-scores:

  • Semantic Distance

  • Semantic Rank

3.1 score_sem_dist(data, task)

\[S_\text{sem\_dist} = 100 \cdot \left( 1 - \frac{mean_\text{token}( RMSE(\text{semantic\_vector}_\text{synthesis}, \text{semantic\_vector}_\text{target}))}{\text{baseline model}} \right)\]

Calculates a score based on the semantic distance between the predicted semantic vector and the target semantic vector.

3.2 score_sem_rank(data, task)

\[S_\text{sem\_rank} = 100 \cdot \left( 1 - \frac{ mean_\text{token}(rank_\text{target} - 1)}{4311} \right)\]

Calculates a score based on the rank that the target receives when a set of 4311 reference vectors, which includes the target, is sorted by distance to the predicted semantic vector. The reference vectors are ranked from least to most distant using the Euclidean distance between the produced semantic vector and each reference vector.
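A minimal sketch of the ranking step, under the assumption that ties are resolved in favour of the target (rank = 1 + number of strictly closer references). The function names and the tiny three-vector reference set are hypothetical; the benchmark uses 4311 reference vectors.

```python
import math


def target_rank(produced, references, target_index):
    """1-based rank of the target among the references.

    References are ordered by Euclidean distance to the produced vector.
    """
    distances = [math.dist(produced, ref) for ref in references]
    target_distance = distances[target_index]
    # Rank = 1 + number of references strictly closer than the target.
    return 1 + sum(d < target_distance for d in distances)


def sem_rank_score(ranks, n_references):
    """Score from per-token target ranks, following the formula above."""
    mean_rank_offset = sum(r - 1 for r in ranks) / len(ranks)
    return 100.0 * (1.0 - mean_rank_offset / n_references)


produced = (0.0, 0.0)
references = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]  # distances: 5, 1, 2

print(target_rank(produced, references, target_index=0))  # 3
print(target_rank(produced, references, target_index=1))  # 1
print(sem_rank_score([1, 3], n_references=4))              # 75.0
```

A target ranked first for every token yields a score of 100, because the rank offset is then zero.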