Metrics Toolbox v0.1.1
Configurable ML evaluation toolkit with built-in cross-validation and metric aggregation.
Requirements
- matplotlib~=3.6
- numpy>=1.24
- scikit-learn~=1.1
Features
- Flexible Metric Configuration: Add and configure metrics using enums, strings, or dictionaries
- Type-Safe: Leverage enums for type safety while maintaining flexibility with string names
- Multiple Reducers: Track metrics over time with various aggregation strategies (mean, min, max, std, etc.)
- Built-in Metrics: Pre-configured metrics for probability, label, and regression tasks on target, micro, and macro scopes
- Chainable Builder Pattern: Intuitive API for constructing metric evaluators
- Visualization Support: Generate ROC curves and other visualizations
Three reasons to use metrics-toolbox
1. You can build it
evaluator = (
    EvaluatorBuilder()
    .add_metric("roc_auc_target", target_name="true")
    .add_metric("accuracy", reducers=["mean", "std"])
    .add_metric("precision_target", target_name="true")
).build()
2. You can configure it
config = {
    "roc_auc_macro": {
        "reducers": ["mean", "min"]
    },
    "accuracy": {
        "reducers": ["mean", "std"],
    }
}
evaluator = EvaluatorBuilder().from_json(config).build()
3. You can cross validate
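A minimal sketch of per-fold evaluation, assuming scikit-learn's KFold and the add_model_evaluation/get_results calls shown in the Quick Start below (the toolkit may also offer its own cross-validation helpers not shown here):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from metrics_toolbox import EvaluatorBuilder

X, y = load_breast_cancer(return_X_y=True)

evaluator = (
    EvaluatorBuilder()
    .add_metric("accuracy", reducers=["mean", "std"])
).build()

# One evaluation per fold; the reducers (mean, std) aggregate over the folds
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = RandomForestClassifier(random_state=42).fit(X[train_idx], y[train_idx])
    evaluator.add_model_evaluation(model, X[test_idx], y[test_idx])

result = evaluator.get_results()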
Available Classification Metrics
| Name | Figures | Settings |
|---|---|---|
| accuracy | Confusion matrix | opt_confusion_normalization |
| precision_micro | - | - |
| precision_macro | - | - |
| precision_target | - | target_name |
| recall_micro | - | - |
| recall_macro | - | - |
| recall_target | - | target_name |
| f1_score_micro | - | - |
| f1_score_macro | - | - |
| f1_score_target | - | target_name |
Available Probability Metrics
| Name | Figures | Settings |
|---|---|---|
| roc_auc_micro | Traces | - |
| roc_auc_macro | Traces | - |
| roc_auc_target | Traces | target_name |
Available Regression Metrics
| Name | Figures | Settings |
|---|---|---|
| mse_target | True, Pred, Error | target_name, opt_metadata_series_length |
| mse_macro | - | - |
| rmse_target | True, Pred, Error | target_name, opt_metadata_series_length |
| rmse_macro | - | - |
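Regression metrics plug into the same builder; a short sketch (the "price" target name is hypothetical and stands for one output of a multivariate regression model):

evaluator = (
    EvaluatorBuilder()
    .add_metric("mse_target", target_name="price", reducers=["mean", "min"])
    .add_metric("rmse_macro")
).build()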
Available Reducers
| Name | Explanation |
|---|---|
| latest | Returns the most recent metric value |
| mean | Calculates the average of all metric values |
| std | Computes the standard deviation of metric values |
| max | Returns the maximum metric value |
| min | Returns the minimum metric value |
| minmax | Returns the difference between the maximum and minimum metric values |
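Reducers are set per metric in the builder, so each metric can aggregate its evaluation history differently, for example:

.add_metric("accuracy", reducers=["latest", "mean", "minmax"])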
Installation
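Install the package from PyPI:

pip install metrics-toolbox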
Quick Start
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from metrics_toolbox import EvaluatorBuilder
# 1. Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. Train a model
model = RandomForestClassifier(n_estimators=2, random_state=42, max_depth=3)
model.fit(X_train, y_train)
# 3. Build an evaluator with multiple metrics; classification and probability metrics can be mixed
evaluator = (
    EvaluatorBuilder()
    .add_metric("roc_auc_target", target_name=1, reducers=["mean", "std"])
    .add_metric("accuracy", reducers=["mean", "std"])
    .add_metric("precision_target", target_name=1)
    .add_metric("recall_target", target_name=1)
    .add_metric("f1_score_target", target_name=1, reducers=["mean", "minmax"])
).build()
# 4. Evaluate model directly
evaluator.add_model_evaluation(model, X_test, y_test)
# 5. Add another evaluation on training set for comparison
evaluator.add_model_evaluation(model, X_train, y_train)
# 6. Get results
result = evaluator.get_results()
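# Note: display() requires an IPython/Jupyter environment; use print() in a plain script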
display(result['values'])
display(result['steps'])
display(result['figures'])
# 7. View figures
display(result['figures']['roc_auc_curves'])
display(result['figures']['confusion_matrices'])
Usage
To see examples of how to:
- Get help, see the help notebook
- Use the builder pattern, see the builder examples notebook
- Evaluate a binary classification model, see the binary model notebook
- Evaluate a multiclass classification model, see the multiclass model notebook
- Evaluate a multivariate regression model, see the regression model notebook
- Build a custom evaluator for a custom model, see the custom notebook
Development
Setup
This project uses uv for dependency management.
# Clone the repository
git clone https://github.com/rasmushaa/metrics-toolbox.git
cd metrics-toolbox
# Install in editable mode with development dependencies
uv pip install -e ".[dev]"
# Set up pre-commit hooks
uv run pre-commit install
Testing
Run the test suite with coverage reporting:
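For example, assuming pytest with the pytest-cov plugin run through uv:

uv run pytest --cov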
Coverage configuration is specified in pyproject.toml.
The example notebooks are also run automatically as tests to keep them up to date with the latest changes.
Code Quality
The project uses pre-commit hooks to maintain code quality:
# Run all hooks on all files
uv run pre-commit run --all-files
# Run hooks on staged files only
uv run pre-commit run
Test Deployment
Before publishing to the main PyPI repository, you can test the deployment process using TestPyPI via the project's deployment script, which will:
1. Create/update pyproject.toml.dev with auto-incremented version (e.g., 0.1.0.dev1, 0.1.0.dev2, etc.)
2. Build the package distribution files using the dev version
3. Load credentials from .env file (PAT and twine username)
4. Upload to TestPyPI (https://test.pypi.org/)
5. Keep your main pyproject.toml unchanged
The script automatically increments the .dev<N> suffix on each run, allowing unlimited test uploads without manual version management.
To install from TestPyPI for testing:
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ metrics-toolbox
Deployment
The project uses automated CI/CD workflows:
- Continuous Testing: Matrix testing across supported Python versions on main and feature/** branches
- Python & Dependency Validation: All Python versions from the pyproject.toml classifiers are validated automatically in CI/CD. The oldest Python version (first in the matrix) is additionally tested with lowest-direct dependencies to ensure minimum-version compatibility, and the CI/CD workflow validates that this oldest version is correctly specified
- Documentation: Automatically updates the MkDocs API reference and deploys documentation on pushes to main
- PyPI Publishing: Automated deployment triggered by version tags
To release a new version:
- Push a version tag (e.g. v0.1.0) to main
- The pipeline validates that the tag matches the current project version and that the changelog is updated
- The new version is published to PyPI using twine
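For example, tagging and pushing a release could look like:

git tag v0.1.0
git push origin v0.1.0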
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.