Create and Run a Benchmarking Job

Modified on Wed, 07 Jun 2023 at 06:51 AM

Benchmark is a tool provided by aiXplain to measure and compare the quality of the outputs generated by one or more selected models, using reference-based benchmarking metrics. Setting up a benchmarking job involves three steps:

  1. Selecting a dataset
  2. Choosing models to benchmark
  3. Selecting performance metrics


Selecting a dataset

To get started, select a dataset from aiXplain's catalog of open-source datasets or upload your own. A dataset is a large collection of data that will be used to evaluate the performance of the models you want to benchmark. To add a dataset, click "Find Datasets"; a popup opens with the available datasets shown as cards with details about each one. Use the search box at the top of the popup to search for a dataset, or filter by function and, if you choose translation, by the input and output languages you want. Click "Collect" to select a dataset, then click "Add to Benchmark".
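
If you prefer to work programmatically, the same step can be sketched with the aiXplain Python SDK. This is a minimal sketch only: it assumes the `aixplain` package is installed, a `TEAM_API_KEY` is set in your environment, and that `DatasetFactory.list` accepts `query`, `function`, and `page_size` filters; the exact argument names may differ in your SDK version.

```python
# Minimal sketch (assumption: aixplain SDK exposing DatasetFactory.list;
# argument names may differ in your installed version).
# Requires the TEAM_API_KEY environment variable to be set before import.
from aixplain.enums import Function
from aixplain.factories import DatasetFactory

# Search the dataset catalog, filtered by AI function (here: translation).
response = DatasetFactory.list(query="news", function=Function.TRANSLATION, page_size=5)
for dataset in response["results"]:
    print(dataset.id, dataset.name)

# "Collect" the dataset you want to use for the benchmark.
selected_dataset = response["results"][0]
```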


Choosing models to benchmark

Next, choose the machine learning models you want to benchmark. You can pick from a variety of pre-built models or upload your own custom models; benchmarking lets you compare their performance against each other, and the models run on the dataset you selected in the previous step. As with datasets, clicking "Find Models" opens a popup. Use the search bar at the top to find models, which appear as cards with brief details (e.g. translation from Spanish to Greek). The models are already pre-filtered based on the dataset you selected. Hover over a model card and click the three dots, then "Show Details" to view more information, or click "Collect" to choose that model for your benchmarking job. After selecting the models you want, click "Add to Benchmark" and they will appear on the benchmarking page.
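
The equivalent model search can also be sketched with the SDK. The filters below (`function`, `source_languages`, `target_languages`, `page_size`) and the `Language` enum values are assumptions and may differ in your SDK version; in the UI this filtering happens automatically based on the chosen dataset.

```python
# Minimal sketch (assumption: aixplain SDK exposing ModelFactory.list with
# function/language filters; names may differ in your installed version).
from aixplain.enums import Function, Language
from aixplain.factories import ModelFactory

# Browse translation models for a Spanish-to-Greek benchmark.
response = ModelFactory.list(
    function=Function.TRANSLATION,
    source_languages=Language.Spanish,
    target_languages=Language.Greek,
    page_size=10,
)
models = response["results"]
for model in models:
    print(model.id, model.name)

# "Collect" the models you want to benchmark against each other.
selected_models = models[:2]
```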


Selecting performance metrics

Select the performance metrics you want to evaluate. Every metric has a short description on its display card, which makes it easier to understand what each one measures and to choose the best ones for your needs. The scores of these metrics are used to benchmark the selected models against each other. Clicking "Find Metrics" opens a list of the available metrics; check the ones you would like to use, collect them, and add them to your benchmark. Feel free to read more about the metrics for benchmarking machine translation and speech recognition.
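
Metrics can likewise be listed through the SDK. The sketch below assumes a `MetricFactory.list` call that returns results in the same shape as the other factories; treat the call and the example metric names (BLEU, COMET) as placeholders to adapt to your use case.

```python
# Minimal sketch (assumption: aixplain SDK exposing MetricFactory.list;
# the call and its return shape may differ in your installed version).
from aixplain.factories import MetricFactory

# List the available reference-based metrics and read their names.
metrics = MetricFactory.list()["results"]
for metric in metrics:
    print(metric.id, metric.name)

# "Collect" the metrics to score the models with, e.g. BLEU and COMET for translation.
selected_metrics = [m for m in metrics if m.name.lower() in ("bleu", "comet")]
```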


Segmentation slider

Before starting the benchmarking process, you can adjust the segmentation slider to specify the number of segments taken from your dataset. Segments are the subsets of data the models are evaluated on. The number of segments you need depends on the dataset's characteristics; for example, a speech recognition dataset with background noise may call for more segments. It is essential to strike a balance between the number of segments and the dataset's size, so adjust the slider as needed to reach the optimal number.

Once you have added the Dataset, Models, and Performance Metrics, an estimate of the cost of your benchmarking job is displayed. If you do not have enough credits, a warning will prompt you to add credits to your wallet before proceeding. When your credit balance is sufficient, you are ready to go: click "Start" to run your benchmarking job. Depending on the size of your dataset, it can take several minutes to several hours to generate a report. When the report is ready, you will receive an email notification and can view it under the "Benchmarks" category of your "My Assets" section.
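
Continuing from the sketches above, assembling and starting the job programmatically might look like the following. `BenchmarkFactory.create`, `benchmark.start`, `check_status`, and the status strings are assumptions about the SDK and may differ in your version; the job name is hypothetical.

```python
# Minimal sketch (assumption: aixplain SDK exposing BenchmarkFactory.create and
# Benchmark.start; names and status strings may differ in your installed version).
import time
from aixplain.factories import BenchmarkFactory

# Assemble the benchmark from the dataset, models, and metrics collected above.
benchmark = BenchmarkFactory.create(
    name="es-el-translation-benchmark",  # hypothetical job name
    dataset_list=[selected_dataset],
    model_list=selected_models,
    metric_list=selected_metrics,
)

# Start the job and poll until it finishes (this can take minutes to hours).
benchmark_job = benchmark.start()
status = benchmark_job.check_status()
while status not in ("completed", "failed"):  # status values are assumptions
    time.sleep(60)
    status = benchmark_job.check_status()
print("Benchmark finished with status:", status)
```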

The report provides detailed performance metrics for the models you selected, including quality, cost, latency, and bias analysis. You can also see charts and graphs that visually represent the performance of each model.
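
Besides viewing the report in the platform, the scored results can be pulled down for local analysis. The sketch below assumes a `download_results_as_csv` method on the benchmark job object and uses pandas for inspection; both are assumptions to adapt to your setup.

```python
# Minimal sketch (assumption: a download_results_as_csv method on the benchmark
# job object; the method name and arguments may differ in your installed version).
import pandas as pd

# Download the per-model, per-metric scores and inspect them locally.
csv_path = benchmark_job.download_results_as_csv(save_path="benchmark_results.csv")
results = pd.read_csv(csv_path)
print(results.head())
```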


