Metrics Toolbox v0.1.1

Configurable ML evaluation toolkit with built-in cross-validation and metric aggregation.


Requirements

  • matplotlib~=3.6
  • numpy>=1.24
  • scikit-learn~=1.1

Features

  • Flexible Metric Configuration: Add and configure metrics using enums, strings, or dictionaries
  • Type-Safe: Leverage enums for type safety while maintaining flexibility with string names
  • Multiple Reducers: Track metrics over time with various aggregation strategies (mean, min, max, std, etc.)
  • Built-in Metrics: Pre-configured metrics for probability, label, and regression tasks on target, micro, and macro scopes
  • Chainable Builder Pattern: Intuitive API for constructing metric evaluators
  • Visualization Support: Generate ROC curves and other visualizations

Three reasons to use metrics-toolbox

1. You can build it

evaluator = (
    EvaluatorBuilder()
    .add_metric("roc_auc_target", target_name="true")
    .add_metric("accuracy", reducers=["mean", "std"])
    .add_metric("precision_target", target_name="true")
).build()

2. You can config it

config = {
    "roc_auc_macro": {
        "reducers": ["mean", "min"]
    },
    "accuracy"": {
        "reducers": ["mean", "std"],
    }
}
evaluator = EvaluatorBuilder().from_json(config).build()

3. You can cross validate

for X_fold, y_fold in kfolds:
    evaluator.add_model_evaluation(model, X_fold, y_fold)

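Where `kfolds` comes from is up to you; as a minimal NumPy-only sketch, producing `(X_fold, y_fold)` pairs could look like this (the `make_kfolds` helper is hypothetical and not part of metrics-toolbox; in practice you might use `sklearn.model_selection.KFold`):

```python
import numpy as np

def make_kfolds(X, y, n_splits=5):
    """Hypothetical helper: split the data into n_splits evaluation folds."""
    indices = np.arange(len(X))
    for fold_idx in np.array_split(indices, n_splits):
        yield X[fold_idx], y[fold_idx]

rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

folds = list(make_kfolds(X, y, n_splits=5))
# Each of the 5 folds holds 20 of the 100 samples:
# for X_fold, y_fold in folds:
#     evaluator.add_model_evaluation(model, X_fold, y_fold)
```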
Available Classification Metrics

Name               Figures            Settings
accuracy           Confusion matrix   opt_confusion_normalization
precision_micro    -                  -
precision_macro    -                  -
precision_target   -                  target_name
recall_micro       -                  -
recall_macro       -                  -
recall_target      -                  target_name
f1_score_micro     -                  -
f1_score_macro     -                  -
f1_score_target    -                  target_name

Available Probability Metrics

Name             Figures   Settings
roc_auc_micro    Traces    -
roc_auc_macro    Traces    -
roc_auc_target   Traces    target_name

Available Regression Metrics

Name          Figures             Settings
mse_target    True, Pred, Error   target_name, opt_metadata_series_length
mse_macro     -                   -
rmse_target   True, Pred, Error   target_name, opt_metadata_series_length
rmse_macro    -                   -

Available Reducers

Name     Explanation
latest   Returns the most recent metric value
mean     Calculates the average of all metric values
std      Computes the standard deviation of metric values
max      Returns the maximum metric value
min      Returns the minimum metric value
minmax   Returns the difference between min and max metric values
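As a rough illustration of what each reducer computes, here is a plain-Python sketch over a made-up metric history (an illustration only, not the library's implementation):

```python
# Metric values collected over several evaluation steps (made-up data)
history = [0.90, 0.85, 0.95, 0.80]

mean = sum(history) / len(history)
reduced = {
    "latest": history[-1],    # most recent value, 0.80
    "mean":   mean,           # average, ~0.875
    "std":    (sum((v - mean) ** 2 for v in history) / len(history)) ** 0.5,
    "max":    max(history),   # 0.95
    "min":    min(history),   # 0.80
    "minmax": max(history) - min(history),  # spread between extremes, ~0.15
}
```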

Installation

pip install metrics-toolbox

Quick Start

from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from metrics_toolbox import EvaluatorBuilder
import numpy as np

# 1. Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train a model
model = RandomForestClassifier(n_estimators=2, random_state=42, max_depth=3)
model.fit(X_train, y_train)

# 3. Build an evaluator with multiple metrics; you can mix and match classification and probability metrics
evaluator = (
    EvaluatorBuilder()
    .add_metric("roc_auc_target", target_name=1, reducers=["mean", "std"])
    .add_metric("accuracy", reducers=["mean", "std"])
    .add_metric("precision_target", target_name=1)
    .add_metric("recall_target", target_name=1)
    .add_metric("f1_score_target", target_name=1, reducers=["mean", "minmax"])
).build()

# 4. Evaluate model directly
evaluator.add_model_evaluation(model, X_test, y_test)

# 5. Add another evaluation on training set for comparison
evaluator.add_model_evaluation(model, X_train, y_train)

# 6. Get results (display() assumes a Jupyter notebook; use print() in plain scripts)
result = evaluator.get_results()
display(result['values'])
display(result['steps'])
display(result['figures'])

# 7. View figures
display(result['figures']['roc_auc_curves'])
display(result['figures']['confusion_matrices'])

Usage

To see examples of how to:

  • Get help, see the help notebook
  • Use the builder pattern, see the builder examples notebook
  • Evaluate a binary classification model, see the binary model notebook
  • Evaluate a multiclass classification model, see the multiclass model notebook
  • Evaluate a multivariate regression model, see the regression model notebook
  • Build a custom Evaluator for a custom model, see the custom notebook

Development

Setup

This project uses uv for dependency management.

# Clone the repository
git clone https://github.com/rasmushaa/metrics-toolbox.git
cd metrics-toolbox

# Install in editable mode with development dependencies
uv pip install -e ".[dev]"

# Set up pre-commit hooks
uv run pre-commit install

Testing

Run the test suite with coverage reporting:

uv run pytest

Coverage configuration is specified in pyproject.toml. The example notebooks are also run automatically as tests, to keep them up to date with the latest changes.

Code Quality

The project uses pre-commit hooks to maintain code quality:

# Run all hooks on all files
uv run pre-commit run --all-files

# Run hooks on staged files only
uv run pre-commit run

Test Deployment

Before publishing to the main PyPI repository, you can test the deployment process using TestPyPI:

bash scripts/publish_to_test_pypi.sh

This script will:

  1. Create/update pyproject.toml.dev with an auto-incremented version (e.g., 0.1.0.dev1, 0.1.0.dev2, etc.)
  2. Build the package distribution files using the dev version
  3. Load credentials from the .env file (PAT and twine username)
  4. Upload to TestPyPI (https://test.pypi.org/)
  5. Keep your main pyproject.toml unchanged

The script automatically increments the .dev<N> suffix on each run, allowing unlimited test uploads without manual version management.
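The increment logic can be pictured roughly like this (a hypothetical sketch of the versioning scheme, not the actual script):

```python
import re

def next_dev_version(version: str) -> str:
    """Append or bump a .dev<N> suffix, e.g. 0.1.0 -> 0.1.0.dev1 -> 0.1.0.dev2."""
    match = re.fullmatch(r"(.*)\.dev(\d+)", version)
    if match:
        base, n = match.group(1), int(match.group(2))
        return f"{base}.dev{n + 1}"
    return f"{version}.dev1"
```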

To install from TestPyPI for testing:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ metrics-toolbox

Deployment

The project uses automated CI/CD workflows:

  • Continuous Testing: Matrix testing across supported Python versions on main and feature/** branches
  • Python & Dependency Validation: All Python versions from the pyproject.toml classifiers are automatically validated in CI/CD. The oldest Python version (first in the matrix) is additionally tested with lowest-direct dependencies to ensure minimum-version compatibility, and the CI/CD workflow validates that this oldest version is correctly specified
  • Documentation: Automatically updates MkDocs API reference and deploys documentation on pushes to main
  • PyPI Publishing: Automated deployment triggered by version tags

To release a new version:

  1. Push a tag (e.g. v0.1.0) to main
  2. The pipeline validates that the tag matches the current project version and that the changelog has been updated
  3. The new version is published to PyPI using twine

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.