Metrics Overview

Modified on Thu, Oct 19, 2023 at 7:46 AM

Metrics are essential for evaluating the quality and performance of AI models. They help users to compare different models, identify strengths and weaknesses, and optimize their solutions. However, there are many different types of metrics for different modalities and tasks, and it can be challenging to understand and use them correctly.

That's why aiXplain provides this guide to the evaluation metrics available for benchmarking AI models on our platform:

Translation Metrics

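BLEU - Measures n-gram precision against the reference translation, with a penalty for overly short output. Tends to reward fluency more than adequacy.
chrF - Measures character n-gram overlap with the reference. Less sensitive than BLEU to morphological variation.
METEOR - Matches unigrams, stems, and synonyms between the generated and reference translations.
TER - Counts the edits (insertions, deletions, substitutions, and shifts) required to turn the hypothesis into the reference. Lower = better.
COMET-DA - Neural metric that predicts human direct assessment (DA) scores from the source, hypothesis, and reference. Generally correlates well with human judgments.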
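
To make the idea of n-gram overlap concrete, here is a minimal Python sketch of clipped n-gram precision, the building block behind BLEU, applied to both words and characters. It is an illustration only, not the scoring code used on the platform: the function names are hypothetical, and full BLEU and chrF add further steps (combining several n-gram orders, a brevity penalty, recall) that are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference ("clipping"). Hypothetical helper for illustration only.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "the cat is on the mat"
hypothesis = "the cat sat on the mat"

# Word-level overlap, the unit BLEU works with.
word_p1 = clipped_precision(hypothesis.split(), reference.split(), 1)
word_p2 = clipped_precision(hypothesis.split(), reference.split(), 2)

# Character-level overlap, the unit chrF works with.
char_p3 = clipped_precision(list(hypothesis), list(reference), 3)

print(f"unigram precision: {word_p1:.3f}")
print(f"bigram precision:  {word_p2:.3f}")
print(f"char 3-gram precision: {char_p3:.3f}")
```

In this example a single wrong word costs one of six unigrams but two of five bigrams, which is why higher-order n-grams are more sensitive to local word choice and order.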

Transcription Metrics

WER - Word error rate. Counts word-level insertions, deletions, and substitutions relative to the reference transcript, divided by the number of reference words. Lower = better.
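
As an illustration, WER can be sketched in a few lines of Python using a word-level edit distance. This is a minimal sketch under simple assumptions (whitespace tokenization, no text normalization); the function name is hypothetical and this is not the platform's implementation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the count is divided by the reference length, WER can exceed 1.0 when the hypothesis contains many extra words.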

Speech Quality Metrics

PESQ - Reference-based. Predicts subjective opinion scores on a MOS-like scale (raw PESQ scores range from -0.5 to 4.5). Higher = better quality.
COMET-QE - Estimates translation quality (not speech quality) without a reference. Useful for model selection.
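NORESQA-MOS - Predicts human mean opinion scores (MOS) using non-matching clean references rather than a paired reference.
DNSMOS - Reference-free predictor of mean opinion scores, designed for evaluating noise suppression. Accounts for distortions.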
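ViSQOL - Reference-based. Estimates speech quality from the spectro-temporal similarity between the reference and degraded signals.
WARP-Q - Reference-based speech quality score that aligns the degraded and reference signals with dynamic time warping. Designed for speech processed by generative codecs.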
CLSSS - Estimates human scores without a reference. Generally less reliable than reference-based metrics.

I hope this article has proven to be both helpful and informative for you. We greatly appreciate your decision to select aiXplain as your AI creation and optimization partner. Should you have any inquiries or feedback, please don't hesitate to reach out to us at your convenience.