
sleap_nn.evaluation

This module computes evaluation metrics for trained models.

Classes:

Evaluator: Compute standard evaluation metrics from predicted and ground-truth Labels.
MatchInstance: Lightweight wrapper around an sio.Instance that records its frame index and video path.

Functions:

compute_dists: Compute Euclidean distances between matched pairs of instances.
compute_instance_area: Compute the area of the bounding box of a set of keypoints.
compute_oks: Compute the object keypoints similarity between sets of points.
find_frame_pairs: Find corresponding frames across two sets of labels.
get_instances: Get a list of MatchInstance objects from a labeled frame.
load_metrics: Load the metrics for a given model and split.
match_frame_pairs: Match all ground truth and predicted instances within each pair of frames.
match_instances: Match pairs of instances between ground truth and predictions in a frame.
run_evaluation: Evaluate SLEAP-NN model predictions against ground truth labels.

Evaluator

Compute standard evaluation metrics from predicted and ground-truth Labels.

This class calculates the common metrics for pose estimation models, including VOC metrics (with OKS and PCK), mOKS, distance metrics, PCK metrics, and visibility metrics.

Parameters:

ground_truth_instances (Labels): The sio.Labels dataset object with ground truth labels. Required.
predicted_instances (Labels): The sio.Labels dataset object with predicted labels. Required.
oks_stddev (float): The standard deviation to use for calculating object keypoint similarity; see the compute_oks function for details. Default: 0.025.
oks_scale (Optional[float]): The scale to use for calculating object keypoint similarity; see the compute_oks function for details. Default: None.
match_threshold (float): The threshold to use on OKS scores when determining which instances match between ground truth and predicted frames. Default: 0.
user_labels_only (bool): If False, predicted instances in the ground truth frame may also be considered for matching. Default: True.

Methods:

__init__: Initialize the Evaluator class with ground-truth and predicted labels.
distance_metrics: Compute the Euclidean distance error at different percentiles using the pairwise distances.
evaluate: Return the evaluation metrics.
mOKS: Return the mean OKS value.
pck_metrics: Compute PCK across a range of thresholds using the pairwise distances.
visibility_metrics: Compute node visibility metrics for the matched pairs of instances.
voc_metrics: Compute VOC metrics from the matched positive pairs and false negatives.

Source code in sleap_nn/evaluation.py
class Evaluator:
    """Compute the standard evaluation metrics with the predicted and the ground-truth Labels.

    This class is used to calculate the common metrics for pose estimation models,
    which include voc metrics (with oks and pck), mOKS, distance metrics, pck metrics
    and visibility metrics.

    Args:
        ground_truth_instances: The `sio.Labels` dataset object with ground truth labels.
        predicted_instances: The `sio.Labels` dataset object with predicted labels.
        oks_stddev: The standard deviation to use for calculating object
            keypoint similarity; see `compute_oks` function for details.
        oks_scale: The scale to use for calculating object
            keypoint similarity; see `compute_oks` function for details.
        match_threshold: The threshold to use on oks scores when determining
            which instances match between ground truth and predicted frames.
        user_labels_only: If False, predicted instances in the ground truth frame may be
            considered for matching.

    """

    def __init__(
        self,
        ground_truth_instances: sio.Labels,
        predicted_instances: sio.Labels,
        oks_stddev: float = 0.025,
        oks_scale: Optional[float] = None,
        match_threshold: float = 0,
        user_labels_only: bool = True,
    ):
        """Initialize the Evaluator class with ground-truth and predicted labels."""
        self.ground_truth_instances = ground_truth_instances
        self.predicted_instances = predicted_instances
        self.match_threshold = match_threshold
        self.oks_stddev = oks_stddev
        self.oks_scale = oks_scale
        self.user_labels_only = user_labels_only

        self._process_frames()

    def _process_frames(self):
        self.frame_pairs = find_frame_pairs(
            self.ground_truth_instances, self.predicted_instances, self.user_labels_only
        )
        if not self.frame_pairs:
            message = "No matching frames found between the ground truth and predicted labels."
            logger.error(message)
            raise Exception(message)

        self.positive_pairs, self.false_negatives = match_frame_pairs(
            self.frame_pairs,
            stddev=self.oks_stddev,
            scale=self.oks_scale,
            threshold=self.match_threshold,
        )

        self.dists_dict = compute_dists(self.positive_pairs)

    def voc_metrics(
        self,
        match_score_by="oks",
        match_score_thresholds: np.ndarray = np.linspace(
            0.5, 0.95, 10
        ),  # 0.5:0.05:0.95
        recall_thresholds: np.ndarray = np.linspace(0, 1, 101),  # 0.0:0.01:1.00
    ):
        """Compute VOC metrics from the matched positive pairs and false negatives.

        Args:
            match_score_by: The score to be used for computing the metrics: "oks" or "pck".
            match_score_thresholds: Score thresholds at which to consider matches as a true
                positive match.
            recall_thresholds: Recall thresholds at which to evaluate Average Precision.

        Returns:
            A dictionary of VOC metrics.
        """
        if match_score_by == "oks":
            match_scores = np.array([oks for _, _, oks in self.positive_pairs])
            name = "oks_voc"
        elif match_score_by == "pck":
            pck_metrics = self.pck_metrics()
            match_scores = pck_metrics["pcks"].mean(axis=-1).mean(axis=-1)
            name = "pck_voc"
        else:
            message = "Invalid Option for match_score_by. Choose either `oks` or `pck`"
            logger.error(message)
            raise Exception(message)

        detection_scores = np.array(
            [pp[1].instance.score for pp in self.positive_pairs]
        )

        inds = np.argsort(-detection_scores, kind="mergesort")
        detection_scores = detection_scores[inds]
        match_scores = match_scores[inds]

        precisions = []
        recalls = []

        npig = len(self.positive_pairs) + len(
            self.false_negatives
        )  # total number of GT instances

        for match_score_threshold in match_score_thresholds:
            tp = np.cumsum(match_scores >= match_score_threshold)
            fp = np.cumsum(match_scores < match_score_threshold)

            if tp.size == 0:
                return {
                    name + ".match_score_thresholds": 0,
                    name + ".recall_thresholds": 0,
                    name + ".match_scores": 0,
                    name + ".precisions": 0,
                    name + ".recalls": 0,
                    name + ".AP": 0,
                    name + ".AR": 0,
                    name + ".mAP": 0,
                    name + ".mAR": 0,
                }

            rc = tp / npig
            pr = tp / (fp + tp + np.spacing(1))

            recall = rc[-1]  # best recall at this OKS threshold

            # Ensure monotonically non-increasing precisions.
            for i in range(len(pr) - 1, 0, -1):
                if pr[i] > pr[i - 1]:
                    pr[i - 1] = pr[i]

            # Find best precision at each recall threshold.
            rc_inds = np.searchsorted(rc, recall_thresholds, side="left")
            precision = np.zeros(rc_inds.shape)
            is_valid_rc_ind = rc_inds < len(pr)
            precision[is_valid_rc_ind] = pr[rc_inds[is_valid_rc_ind]]

            precisions.append(precision)
            recalls.append(recall)

        precisions = np.array(precisions)
        recalls = np.array(recalls)

        AP = precisions.mean(
            axis=1
        )  # AP = average precision over fixed set of recall thresholds
        AR = recalls  # AR = max recall given a fixed number of detections per image

        mAP = precisions.mean()  # mAP = mean over all OKS thresholds
        mAR = recalls.mean()  # mAR = mean over all OKS thresholds

        return {
            name + ".match_score_thresholds": match_score_thresholds,
            name + ".recall_thresholds": recall_thresholds,
            name + ".match_scores": match_scores,
            name + ".precisions": precisions,
            name + ".recalls": recalls,
            name + ".AP": AP,
            name + ".AR": AR,
            name + ".mAP": mAP,
            name + ".mAR": mAR,
        }

    def mOKS(self):
        """Return the meanOKS value."""
        pair_oks = np.array([oks for _, _, oks in self.positive_pairs])
        return {"mOKS": pair_oks.mean()}

    def distance_metrics(self):
        """Compute the Euclidean distance error at different percentiles using the pairwise distances.

        Returns:
            A dictionary of distance metrics.
        """
        dists = self.dists_dict["dists"]
        results = {
            "frame_idxs": self.dists_dict["frame_idxs"],
            "video_paths": self.dists_dict["video_paths"],
            "dists": dists,
            "avg": np.nanmean(dists),
            "p50": np.nan,
            "p75": np.nan,
            "p90": np.nan,
            "p95": np.nan,
            "p99": np.nan,
        }

        is_non_nan = ~np.isnan(dists)
        if np.any(is_non_nan):
            non_nans = dists[is_non_nan]
            for ptile in (50, 75, 90, 95, 99):
                results[f"p{ptile}"] = np.percentile(non_nans, ptile)

        return results

    def pck_metrics(self, thresholds: np.ndarray = np.linspace(1, 10, 10)):
        """Compute PCK across a range of thresholds using the pair-wise distances.

        Args:
            thresholds: A list of distance thresholds in pixels.

        Returns:
            A dictionary of PCK metrics evaluated at each threshold.
        """
        dists = self.dists_dict["dists"]
        dists = np.copy(dists)
        dists[np.isnan(dists)] = np.inf
        pcks = np.expand_dims(dists, -1) < np.reshape(thresholds, (1, 1, -1))
        mPCK_parts = pcks.mean(axis=0).mean(axis=-1)
        mPCK = mPCK_parts.mean()

        return {
            "thresholds": thresholds,
            "pcks": pcks,
            "mPCK_parts": mPCK_parts,
            "mPCK": mPCK,
        }

    def visibility_metrics(self):
        """Compute node visibility metrics for the matched pair of instances.

        Returns:
            A dictionary of visibility metrics, including the confusion matrix.
        """
        vis_tp = 0
        vis_fn = 0
        vis_fp = 0
        vis_tn = 0

        for instance_gt, instance_pr, _ in self.positive_pairs:
            missing_nodes_gt = np.isnan(instance_gt.instance.numpy()).any(axis=-1)
            missing_nodes_pr = np.isnan(instance_pr.instance.numpy()).any(axis=-1)

            vis_tn += ((missing_nodes_gt) & (missing_nodes_pr)).sum()
            vis_fn += ((~missing_nodes_gt) & (missing_nodes_pr)).sum()
            vis_fp += ((missing_nodes_gt) & (~missing_nodes_pr)).sum()
            vis_tp += ((~missing_nodes_gt) & (~missing_nodes_pr)).sum()

        return {
            "tp": vis_tp,
            "fp": vis_fp,
            "tn": vis_tn,
            "fn": vis_fn,
            "precision": vis_tp / (vis_tp + vis_fp) if (vis_tp + vis_fp) else np.nan,
            "recall": vis_tp / (vis_tp + vis_fn) if (vis_tp + vis_fn) else np.nan,
        }

    def evaluate(self):
        """Return the evaluation metrics."""
        metrics = {}
        metrics["voc_metrics"] = self.voc_metrics(match_score_by="oks")
        metrics["voc_metrics"].update(self.voc_metrics(match_score_by="pck"))
        metrics["mOKS"] = self.mOKS()
        metrics["distance_metrics"] = self.distance_metrics()
        metrics["pck_metrics"] = self.pck_metrics()
        metrics["visibility_metrics"] = self.visibility_metrics()

        return metrics
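
A minimal usage sketch (the .slp file paths here are hypothetical placeholders):

import sleap_io as sio
from sleap_nn.evaluation import Evaluator

# Load ground truth and predicted labels.
labels_gt = sio.load_slp("ground_truth.slp")
labels_pr = sio.load_slp("predictions.slp")

# Frame and instance matching happens during initialization.
evaluator = Evaluator(
    ground_truth_instances=labels_gt,
    predicted_instances=labels_pr,
    oks_stddev=0.025,    # per-keypoint localization spread
    match_threshold=0,   # minimum OKS for a GT/prediction pair to match
)

# Compute every metric family at once...
metrics = evaluator.evaluate()
print(metrics["mOKS"]["mOKS"])
print(metrics["voc_metrics"]["oks_voc.mAP"])

# ...or call individual metric methods.
dist = evaluator.distance_metrics()
print(dist["avg"], dist["p95"])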

__init__(ground_truth_instances, predicted_instances, oks_stddev=0.025, oks_scale=None, match_threshold=0, user_labels_only=True)

Initialize the Evaluator class with ground-truth and predicted labels.

Source code in sleap_nn/evaluation.py
def __init__(
    self,
    ground_truth_instances: sio.Labels,
    predicted_instances: sio.Labels,
    oks_stddev: float = 0.025,
    oks_scale: Optional[float] = None,
    match_threshold: float = 0,
    user_labels_only: bool = True,
):
    """Initialize the Evaluator class with ground-truth and predicted labels."""
    self.ground_truth_instances = ground_truth_instances
    self.predicted_instances = predicted_instances
    self.match_threshold = match_threshold
    self.oks_stddev = oks_stddev
    self.oks_scale = oks_scale
    self.user_labels_only = user_labels_only

    self._process_frames()

distance_metrics()

Compute the Euclidean distance error at different percentiles using the pairwise distances.

Returns:

A dictionary of distance metrics.

Source code in sleap_nn/evaluation.py
def distance_metrics(self):
    """Compute the Euclidean distance error at different percentiles using the pairwise distances.

    Returns:
        A dictionary of distance metrics.
    """
    dists = self.dists_dict["dists"]
    results = {
        "frame_idxs": self.dists_dict["frame_idxs"],
        "video_paths": self.dists_dict["video_paths"],
        "dists": dists,
        "avg": np.nanmean(dists),
        "p50": np.nan,
        "p75": np.nan,
        "p90": np.nan,
        "p95": np.nan,
        "p99": np.nan,
    }

    is_non_nan = ~np.isnan(dists)
    if np.any(is_non_nan):
        non_nans = dists[is_non_nan]
        for ptile in (50, 75, 90, 95, 99):
            results[f"p{ptile}"] = np.percentile(non_nans, ptile)

    return results

evaluate()

Return the evaluation metrics.

Source code in sleap_nn/evaluation.py
def evaluate(self):
    """Return the evaluation metrics."""
    metrics = {}
    metrics["voc_metrics"] = self.voc_metrics(match_score_by="oks")
    metrics["voc_metrics"].update(self.voc_metrics(match_score_by="pck"))
    metrics["mOKS"] = self.mOKS()
    metrics["distance_metrics"] = self.distance_metrics()
    metrics["pck_metrics"] = self.pck_metrics()
    metrics["visibility_metrics"] = self.visibility_metrics()

    return metrics

mOKS()

Return the mean OKS value.

Source code in sleap_nn/evaluation.py
def mOKS(self):
    """Return the meanOKS value."""
    pair_oks = np.array([oks for _, _, oks in self.positive_pairs])
    return {"mOKS": pair_oks.mean()}

pck_metrics(thresholds=np.linspace(1, 10, 10))

Compute PCK across a range of thresholds using the pairwise distances.

Parameters:

thresholds (ndarray): A list of distance thresholds in pixels. Default: np.linspace(1, 10, 10).

Returns:

A dictionary of PCK metrics evaluated at each threshold.

Source code in sleap_nn/evaluation.py
def pck_metrics(self, thresholds: np.ndarray = np.linspace(1, 10, 10)):
    """Compute PCK across a range of thresholds using the pair-wise distances.

    Args:
        thresholds: A list of distance thresholds in pixels.

    Returns:
        A dictionary of PCK metrics evaluated at each threshold.
    """
    dists = self.dists_dict["dists"]
    dists = np.copy(dists)
    dists[np.isnan(dists)] = np.inf
    pcks = np.expand_dims(dists, -1) < np.reshape(thresholds, (1, 1, -1))
    mPCK_parts = pcks.mean(axis=0).mean(axis=-1)
    mPCK = mPCK_parts.mean()

    return {
        "thresholds": thresholds,
        "pcks": pcks,
        "mPCK_parts": mPCK_parts,
        "mPCK": mPCK,
    }
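
A small numeric sketch of the PCK computation above (toy distances, not real data): every pairwise distance is compared against every threshold via broadcasting, then averaged over pairs and thresholds per node.

import numpy as np

# Toy pairwise distances for 2 matched pairs x 3 nodes (pixels); NaN = missing.
dists = np.array([
    [1.0, 4.0, np.nan],
    [2.0, 9.0, 3.0],
])
thresholds = np.linspace(1, 10, 10)  # 1, 2, ..., 10 px

dists = np.copy(dists)
dists[np.isnan(dists)] = np.inf  # missing nodes never count as correct

# (n_pairs, n_nodes, 1) < (1, 1, n_thresholds) -> (n_pairs, n_nodes, n_thresholds)
pcks = np.expand_dims(dists, -1) < np.reshape(thresholds, (1, 1, -1))

mPCK_parts = pcks.mean(axis=0).mean(axis=-1)  # per-node PCK, averaged over thresholds
mPCK = mPCK_parts.mean()
print(mPCK_parts, mPCK)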

visibility_metrics()

Compute node visibility metrics for the matched pairs of instances.

Returns:

A dictionary of visibility metrics, including the confusion matrix.

Source code in sleap_nn/evaluation.py
def visibility_metrics(self):
    """Compute node visibility metrics for the matched pair of instances.

    Returns:
        A dictionary of visibility metrics, including the confusion matrix.
    """
    vis_tp = 0
    vis_fn = 0
    vis_fp = 0
    vis_tn = 0

    for instance_gt, instance_pr, _ in self.positive_pairs:
        missing_nodes_gt = np.isnan(instance_gt.instance.numpy()).any(axis=-1)
        missing_nodes_pr = np.isnan(instance_pr.instance.numpy()).any(axis=-1)

        vis_tn += ((missing_nodes_gt) & (missing_nodes_pr)).sum()
        vis_fn += ((~missing_nodes_gt) & (missing_nodes_pr)).sum()
        vis_fp += ((missing_nodes_gt) & (~missing_nodes_pr)).sum()
        vis_tp += ((~missing_nodes_gt) & (~missing_nodes_pr)).sum()

    return {
        "tp": vis_tp,
        "fp": vis_fp,
        "tn": vis_tn,
        "fn": vis_fn,
        "precision": vis_tp / (vis_tp + vis_fp) if (vis_tp + vis_fp) else np.nan,
        "recall": vis_tp / (vis_tp + vis_fn) if (vis_tp + vis_fn) else np.nan,
    }

voc_metrics(match_score_by='oks', match_score_thresholds=np.linspace(0.5, 0.95, 10), recall_thresholds=np.linspace(0, 1, 101))

Compute VOC metrics from the matched positive pairs and false negatives.

Parameters:

match_score_by: The score to be used for computing the metrics: "oks" or "pck". Default: 'oks'.
match_score_thresholds (ndarray): Score thresholds at which to consider matches as a true positive match. Default: np.linspace(0.5, 0.95, 10).
recall_thresholds (ndarray): Recall thresholds at which to evaluate Average Precision. Default: np.linspace(0, 1, 101).

Returns:

A dictionary of VOC metrics.

Source code in sleap_nn/evaluation.py
def voc_metrics(
    self,
    match_score_by="oks",
    match_score_thresholds: np.ndarray = np.linspace(
        0.5, 0.95, 10
    ),  # 0.5:0.05:0.95
    recall_thresholds: np.ndarray = np.linspace(0, 1, 101),  # 0.0:0.01:1.00
):
    """Compute VOC metrics from the matched positive pairs and false negatives.

    Args:
        match_score_by: The score to be used for computing the metrics: "oks" or "pck".
        match_score_thresholds: Score thresholds at which to consider matches as a true
            positive match.
        recall_thresholds: Recall thresholds at which to evaluate Average Precision.

    Returns:
        A dictionary of VOC metrics.
    """
    if match_score_by == "oks":
        match_scores = np.array([oks for _, _, oks in self.positive_pairs])
        name = "oks_voc"
    elif match_score_by == "pck":
        pck_metrics = self.pck_metrics()
        match_scores = pck_metrics["pcks"].mean(axis=-1).mean(axis=-1)
        name = "pck_voc"
    else:
        message = "Invalid Option for match_score_by. Choose either `oks` or `pck`"
        logger.error(message)
        raise Exception(message)

    detection_scores = np.array(
        [pp[1].instance.score for pp in self.positive_pairs]
    )

    inds = np.argsort(-detection_scores, kind="mergesort")
    detection_scores = detection_scores[inds]
    match_scores = match_scores[inds]

    precisions = []
    recalls = []

    npig = len(self.positive_pairs) + len(
        self.false_negatives
    )  # total number of GT instances

    for match_score_threshold in match_score_thresholds:
        tp = np.cumsum(match_scores >= match_score_threshold)
        fp = np.cumsum(match_scores < match_score_threshold)

        if tp.size == 0:
            return {
                name + ".match_score_thresholds": 0,
                name + ".recall_thresholds": 0,
                name + ".match_scores": 0,
                name + ".precisions": 0,
                name + ".recalls": 0,
                name + ".AP": 0,
                name + ".AR": 0,
                name + ".mAP": 0,
                name + ".mAR": 0,
            }

        rc = tp / npig
        pr = tp / (fp + tp + np.spacing(1))

        recall = rc[-1]  # best recall at this OKS threshold

        # Ensure monotonically non-increasing precisions.
        for i in range(len(pr) - 1, 0, -1):
            if pr[i] > pr[i - 1]:
                pr[i - 1] = pr[i]

        # Find best precision at each recall threshold.
        rc_inds = np.searchsorted(rc, recall_thresholds, side="left")
        precision = np.zeros(rc_inds.shape)
        is_valid_rc_ind = rc_inds < len(pr)
        precision[is_valid_rc_ind] = pr[rc_inds[is_valid_rc_ind]]

        precisions.append(precision)
        recalls.append(recall)

    precisions = np.array(precisions)
    recalls = np.array(recalls)

    AP = precisions.mean(
        axis=1
    )  # AP = average precision over fixed set of recall thresholds
    AR = recalls  # AR = max recall given a fixed number of detections per image

    mAP = precisions.mean()  # mAP = mean over all OKS thresholds
    mAR = recalls.mean()  # mAR = mean over all OKS thresholds

    return {
        name + ".match_score_thresholds": match_score_thresholds,
        name + ".recall_thresholds": recall_thresholds,
        name + ".match_scores": match_scores,
        name + ".precisions": precisions,
        name + ".recalls": recalls,
        name + ".AP": AP,
        name + ".AR": AR,
        name + ".mAP": mAP,
        name + ".mAR": mAR,
    }
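
A toy walk-through of the precision-recall bookkeeping above (made-up scores, not real data): matches are sorted by detection score, cumulative TP/FP counts give a recall/precision curve, precision is made monotonically non-increasing, and AP is the mean precision sampled at the fixed recall grid.

import numpy as np

# Toy OKS match scores, already sorted by descending detection score.
match_scores = np.array([0.9, 0.8, 0.4, 0.7])
npig = 5  # total ground truth instances (positive pairs + false negatives)
recall_thresholds = np.linspace(0, 1, 101)

threshold = 0.5  # one OKS threshold from the 0.5:0.05:0.95 grid
tp = np.cumsum(match_scores >= threshold)   # [1, 2, 2, 3]
fp = np.cumsum(match_scores < threshold)    # [0, 0, 1, 1]

rc = tp / npig                        # recall after each detection
pr = tp / (fp + tp + np.spacing(1))   # precision after each detection

# Enforce a non-increasing precision envelope.
for i in range(len(pr) - 1, 0, -1):
    pr[i - 1] = max(pr[i - 1], pr[i])

# Sample the envelope at the fixed recall grid; zero beyond max recall.
rc_inds = np.searchsorted(rc, recall_thresholds, side="left")
precision = np.zeros(rc_inds.shape)
valid = rc_inds < len(pr)
precision[valid] = pr[rc_inds[valid]]

AP = precision.mean()  # average precision at this OKS threshold
print(AP)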

MatchInstance

Lightweight wrapper around an sio.Instance that records its frame index and video path.

Source code in sleap_nn/evaluation.py
@attrs.define(auto_attribs=True, slots=True)
class MatchInstance:
    """Lightweight wrapper around an `sio.Instance` that records its frame index and video path."""

    instance: sio.Instance
    frame_idx: int
    video_path: str

compute_dists(positive_pairs)

Compute Euclidean distances between matched pairs of instances.

Parameters:

positive_pairs (List[Tuple[Instance, PredictedInstance, Any]]): A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pairs of instances. Required.

Returns:

Dict[str, Union[ndarray, List[int], List[str]]]: A dictionary with the following keys:

    dists: An array of pairwise distances of shape (n_positive_pairs, n_nodes).
    frame_idxs: A list of frame indices corresponding to the dists.
    video_paths: A list of video paths corresponding to the dists.

Source code in sleap_nn/evaluation.py
def compute_dists(
    positive_pairs: List[Tuple[sio.Instance, sio.PredictedInstance, Any]],
) -> Dict[str, Union[np.ndarray, List[int], List[str]]]:
    """Compute Euclidean distances between matched pairs of instances.

    Args:
        positive_pairs: A list of tuples of the form `(instance_gt, instance_pr, _)`
            containing the matched pair of instances.

    Returns:
        A dictionary with the following keys:
            dists: An array of pairwise distances of shape `(n_positive_pairs, n_nodes)`
            frame_idxs: A list of frame indices corresponding to the `dists`
            video_paths: A list of video paths corresponding to the `dists`
    """
    dists = []
    frame_idxs = []
    video_paths = []
    for instance_gt, instance_pr, _ in positive_pairs:
        points_gt = instance_gt.instance.numpy()
        points_pr = instance_pr.instance.numpy()

        dists.append(np.linalg.norm(points_pr - points_gt, axis=-1))
        frame_idxs.append(instance_gt.frame_idx)
        video_paths.append(instance_gt.video_path)

    dists = np.array(dists)

    # Bundle everything into a dictionary
    dists_dict = {
        "dists": dists,
        "frame_idxs": frame_idxs,
        "video_paths": video_paths,
    }

    return dists_dict

compute_instance_area(points)

Compute the area of the bounding box of a set of keypoints.

Parameters:

points (ndarray): A numpy array of coordinates. Required.

Returns:

ndarray: The area of the bounding box of the points.

Source code in sleap_nn/evaluation.py
def compute_instance_area(points: np.ndarray) -> np.ndarray:
    """Compute the area of the bounding box of a set of keypoints.

    Args:
        points: A numpy array of coordinates.

    Returns:
        The area of the bounding box of the points.
    """
    if points.ndim == 2:
        points = np.expand_dims(points, axis=0)

    min_pt = np.nanmin(points, axis=-2)
    max_pt = np.nanmax(points, axis=-2)

    return np.prod(max_pt - min_pt, axis=-1)

compute_oks(points_gt, points_pr, scale=None, stddev=0.025, use_cocoeval=True)

Compute the object keypoints similarity between sets of points.

Parameters:

points_gt (ndarray): Ground truth instances of shape (n_gt, n_nodes, n_ed), where n_nodes is the number of body parts/keypoint types, and n_ed is the number of Euclidean dimensions (typically 2 or 3). Keypoints that are missing/not visible should be represented as NaNs. Required.
points_pr (ndarray): Predicted instances of shape (n_pr, n_nodes, n_ed). Required.
use_cocoeval (bool): Whether the OKS score is calculated as in cocoeval. True means the score is calculated using the widely used cocoeval method (code at https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L192C5-L233C20); False means the score is calculated exactly as given in the paper referenced in the Notes below. Default: True.
scale (Optional[float]): Size scaling factor to use when weighing the scores, typically the area of the bounding box of the instance (in pixels). This should be of length n_gt. If a scalar is provided, the same number is used for all ground truth instances. If set to None, the bounding box area of the ground truth instances will be calculated. Default: None.
stddev (float): The standard deviation associated with the spread in the localization accuracy of each node/keypoint type. This should be of length n_nodes. "Easier" keypoint types will have lower values to reflect the smaller spread expected in localizing them. Default: 0.025.

Returns:

ndarray: The object keypoints similarity between every pair of ground truth and predicted instances, a numpy array of shape (n_gt, n_pr) in the range [0, 1.0], with 1.0 denoting a perfect match.

Notes

It's important to set the stddev appropriately when accounting for the difficulty of each keypoint type. For reference, the median value for all keypoint types in COCO is 0.072. The "easiest" keypoint is the left eye, with stddev of 0.025, since it is easy to precisely locate the eyes when labeling. The "hardest" keypoint is the left hip, with stddev of 0.107, since it's hard to locate the left hip bone without external anatomical features and since it is often occluded by clothing.

The implementation here is based off of the descriptions in: Ronchi & Perona. "Benchmarking and Error Diagnosis in Multi-Instance Pose Estimation." ICCV (2017).
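
For reference, the cocoeval variant implemented below corresponds to the keypoint-similarity formula (a sketch read directly off the source, with d_i the GT-prediction distance at keypoint i, s the instance scale, sigma_i the per-keypoint stddev, and delta(v_i) indicating ground truth visibility):

$$ \mathrm{KS}_i = \exp\!\left(-\frac{d_i^2}{2\, s\, (2\sigma_i)^2}\right), \qquad \mathrm{OKS} = \frac{\sum_i \mathrm{KS}_i\, \delta(v_i)}{\sum_i \delta(v_i)} $$

With use_cocoeval=False, the normalization follows the paper instead: KS_i = exp(-d_i^2 / (2 s^2 sigma_i^2)).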

Source code in sleap_nn/evaluation.py
def compute_oks(
    points_gt: np.ndarray,
    points_pr: np.ndarray,
    scale: Optional[float] = None,
    stddev: float = 0.025,
    use_cocoeval: bool = True,
) -> np.ndarray:
    """Compute the object keypoints similarity between sets of points.

    Args:
        points_gt: Ground truth instances of shape (n_gt, n_nodes, n_ed),
            where n_nodes is the number of body parts/keypoint types, and n_ed
            is the number of Euclidean dimensions (typically 2 or 3). Keypoints
            that are missing/not visible should be represented as NaNs.
        points_pr: Predicted instances of shape (n_pr, n_nodes, n_ed).
        use_cocoeval: Indicates whether the OKS score is calculated like cocoeval
            method or not. True indicating the score is calculated using the
            cocoeval method (widely used and the code can be found here at
            https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L192C5-L233C20)
            and False indicating the score is calculated using the method exactly
            as given in the paper referenced in the Notes below.
        scale: Size scaling factor to use when weighing the scores, typically
            the area of the bounding box of the instance (in pixels). This
            should be of the length n_gt. If a scalar is provided, the same
            number is used for all ground truth instances. If set to None, the
            bounding box area of the ground truth instances will be calculated.
        stddev: The standard deviation associated with the spread in the
            localization accuracy of each node/keypoint type. This should be of
            the length n_nodes. "Easier" keypoint types will have lower values
            to reflect the smaller spread expected in localizing it.

    Returns:
        The object keypoints similarity between every pair of ground truth and
        predicted instances, a numpy array of shape (n_gt, n_pr) in the range
        of [0, 1.0], with 1.0 denoting a perfect match.

    Notes:
        It's important to set the stddev appropriately when accounting for the
        difficulty of each keypoint type. For reference, the median value for
        all keypoint types in COCO is 0.072. The "easiest" keypoint is the left
        eye, with stddev of 0.025, since it is easy to precisely locate the
        eyes when labeling. The "hardest" keypoint is the left hip, with stddev
        of 0.107, since it's hard to locate the left hip bone without external
        anatomical features and since it is often occluded by clothing.

        The implementation here is based off of the descriptions in:
        Ronchi & Perona. "Benchmarking and Error Diagnosis in Multi-Instance Pose
        Estimation." ICCV (2017).
    """
    if points_gt.ndim == 2:
        points_gt = np.expand_dims(points_gt, axis=0)
    if points_pr.ndim == 2:
        points_pr = np.expand_dims(points_pr, axis=0)

    if scale is None:
        scale = compute_instance_area(points_gt)

    n_gt, n_nodes, n_ed = points_gt.shape  # n_ed = 2 or 3 (euclidean dimensions)
    n_pr = points_pr.shape[0]

    # If scalar scale was provided, use the same for each ground truth instance.
    if np.isscalar(scale):
        scale = np.full(n_gt, scale)

    # If scalar standard deviation was provided, use the same for each node.
    if np.isscalar(stddev):
        stddev = np.full(n_nodes, stddev)

    # Compute displacement between each pair.
    displacement = np.reshape(points_gt, (n_gt, 1, n_nodes, n_ed)) - np.reshape(
        points_pr, (1, n_pr, n_nodes, n_ed)
    )
    assert displacement.shape == (n_gt, n_pr, n_nodes, n_ed)

    # Convert to pairwise Euclidean distances.
    distance = (displacement**2).sum(axis=-1)  # (n_gt, n_pr, n_nodes)
    assert distance.shape == (n_gt, n_pr, n_nodes)

    # Compute the normalization factor per keypoint.
    if use_cocoeval:
        # If use_cocoeval is True, then compute normalization factor according to cocoeval.
        spread_factor = (2 * stddev) ** 2
        scale_factor = 2 * (scale + np.spacing(1))
    else:
        # If use_cocoeval is False, then compute normalization factor according to the paper.
        spread_factor = stddev**2
        scale_factor = 2 * ((scale + np.spacing(1)) ** 2)
    normalization_factor = np.reshape(spread_factor, (1, 1, n_nodes)) * np.reshape(
        scale_factor, (n_gt, 1, 1)
    )
    assert normalization_factor.shape == (n_gt, 1, n_nodes)

    # Since a "miss" is considered as KS < 0.5, we'll set the
    # distances for predicted points that are missing to inf.
    missing_pr = np.any(np.isnan(points_pr), axis=-1)  # (n_pr, n_nodes)
    assert missing_pr.shape == (n_pr, n_nodes)
    distance[:, missing_pr] = np.inf

    # Compute the keypoint similarity as per the top of Eq. 1.
    ks = np.exp(-(distance / normalization_factor))  # (n_gt, n_pr, n_nodes)
    assert ks.shape == (n_gt, n_pr, n_nodes)

    # Set the KS for missing ground truth points to 0.
    # This is equivalent to the visibility delta function of the bottom
    # of Eq. 1.
    missing_gt = np.any(np.isnan(points_gt), axis=-1)  # (n_gt, n_nodes)
    assert missing_gt.shape == (n_gt, n_nodes)
    ks[np.expand_dims(missing_gt, axis=1)] = 0

    # Compute the OKS.
    n_visible_gt = np.sum(
        (~missing_gt).astype("float32"), axis=-1, keepdims=True
    )  # (n_gt, 1)
    oks = np.sum(ks, axis=-1) / n_visible_gt
    assert oks.shape == (n_gt, n_pr)

    return oks
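
A quick self-check of compute_oks with toy coordinates: an identical ground truth and prediction yield an OKS of exactly 1.0, and the score decays as the prediction drifts.

import numpy as np
from sleap_nn.evaluation import compute_oks

points_gt = np.array([[[10.0, 10.0], [20.0, 10.0], [15.0, 25.0]]])  # (1, 3, 2)
points_pr = np.concatenate([points_gt, points_gt + 1.5], axis=0)    # (2, 3, 2)

oks = compute_oks(points_gt, points_pr)
print(oks.shape)  # (1, 2): every GT paired with every prediction
print(oks[0, 0])  # 1.0 for the exact copy; oks[0, 1] is much lower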

find_frame_pairs(labels_gt, labels_pr, user_labels_only=True)

Find corresponding frames across two sets of labels.

Parameters:

labels_gt (Labels): A sio.Labels instance with ground truth instances. Required.
labels_pr (Labels): A sio.Labels instance with predicted instances. Required.
user_labels_only (bool): If False, frames with predicted instances in labels_gt will also be considered for matching. Default: True.

Returns:

List[Tuple[LabeledFrame, LabeledFrame]]: A list of pairs of sio.LabeledFrames in the form (frame_gt, frame_pr).

Source code in sleap_nn/evaluation.py
def find_frame_pairs(
    labels_gt: sio.Labels, labels_pr: sio.Labels, user_labels_only: bool = True
) -> List[Tuple[sio.LabeledFrame, sio.LabeledFrame]]:
    """Find corresponding frames across two sets of labels.

    Args:
        labels_gt: A `sio.Labels` instance with ground truth instances.
        labels_pr: A `sio.Labels` instance with predicted instances.
        user_labels_only: If False, frames with predicted instances in `labels_gt` will
            also be considered for matching.

    Returns:
        A list of pairs of `sio.LabeledFrame`s in the form `(frame_gt, frame_pr)`.
    """
    frame_pairs = []
    for video_gt in labels_gt.videos:
        # Find matching video instance in predictions.
        video_pr = None
        for video in labels_pr.videos:
            if (
                isinstance(video.backend, type(video_gt.backend))
                and video.filename == video_gt.filename
            ):
                same_dataset = (
                    (video.backend.dataset == video_gt.backend.dataset)
                    if hasattr(video.backend, "dataset")
                    else True
                )  # `dataset` attr exists only for hdf5 backend not for mediavideo
                if same_dataset:
                    video_pr = video
                    break

        if video_pr is None:
            continue

        # Find labeled frames in this video.
        labeled_frames_gt = labels_gt.find(video_gt)
        if user_labels_only:
            for lf in labeled_frames_gt:
                lf.instances = lf.user_instances
            labeled_frames_gt = [
                lf for lf in labeled_frames_gt if len(lf.user_instances) > 0
            ]

        # Attempt to match each labeled frame in the ground truth.
        for labeled_frame_gt in labeled_frames_gt:
            labeled_frames_pr = labels_pr.find(
                video_pr, frame_idx=labeled_frame_gt.frame_idx
            )

            if not labeled_frames_pr:
                # No match
                continue
            elif len(labeled_frames_pr) == 1:
                # Match!
                frame_pairs.append((labeled_frame_gt, labeled_frames_pr[0]))

    return frame_pairs

get_instances(labeled_frame)

Get a list of MatchInstance objects from a labeled frame.

Parameters:

labeled_frame (LabeledFrame): Input labeled frame of type sio.LabeledFrame. Required.

Returns:

List[MatchInstance]: List of MatchInstance objects for the given labeled frame.

Source code in sleap_nn/evaluation.py
def get_instances(labeled_frame: sio.LabeledFrame) -> List[MatchInstance]:
    """Get a list of instances of type MatchInstance from the Labeled Frame.

    Args:
        labeled_frame: Input Labeled frame of type sio.LabeledFrame.

    Returns:
        List of MatchInstance objects for the given labeled frame.
    """
    instance_list = []
    frame_idx = labeled_frame.frame_idx
    video_path = (
        labeled_frame.video.backend.source_filename
        if hasattr(labeled_frame.video.backend, "source_filename")
        else labeled_frame.video.backend.filename
    )
    for instance in labeled_frame.instances:
        match_instance = MatchInstance(
            instance=instance, frame_idx=frame_idx, video_path=video_path
        )
        instance_list.append(match_instance)
    return instance_list

load_metrics(model_path, split='val')

Load the metrics for a given model and split.

Parameters:

model_path (str): Path to a model folder or metrics file (.npz). Required.
split: Name of the split to load the metrics for. Must be "train", "val" or "test". Ignored if a path to a metrics NPZ file is provided. Default: 'val'.

Source code in sleap_nn/evaluation.py
def load_metrics(model_path: str, split="val"):
    """Load the metrics for a given model and split.

    Args:
        model_path: Path to a model folder or metrics file (.npz).
        split: Name of the split to load the metrics for. Must be `"train"`, `"val"` or
            `"test"` (default: `"val"`). Ignored if a path to a metrics NPZ file is
            provided.

    """
    if Path(model_path).suffix == ".npz":
        metrics_path = Path(model_path)
    else:
        metrics_path = Path(model_path) / f"{split}_0_pred_metrics.npz"
    if not metrics_path.exists():
        raise FileNotFoundError(f"Metrics file not found at {metrics_path}")
    with np.load(metrics_path, allow_pickle=True) as data:
        return data["metrics"].item()
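
A minimal usage sketch (the model folder path is hypothetical; per the source above, the folder must contain a {split}_0_pred_metrics.npz file):

from sleap_nn.evaluation import load_metrics

# Load validation-split metrics saved next to a trained model.
metrics = load_metrics("models/my_model", split="val")
print(metrics.keys())

# Or point directly at a metrics .npz file; `split` is then ignored.
metrics = load_metrics("models/my_model/val_0_pred_metrics.npz")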

match_frame_pairs(frame_pairs, stddev=0.025, scale=None, threshold=0)

Match all ground truth and predicted instances within each pair of frames.

This is a wrapper for match_instances() but operates on lists of frames.

Parameters:

frame_pairs (List[Tuple[LabeledFrame, LabeledFrame]]): A list of pairs of sio.LabeledFrames in the form (frame_gt, frame_pr). These can be obtained with find_frame_pairs(). Required.
stddev (float): The expected spread of coordinates for OKS computation. Default: 0.025.
scale (Optional[float]): The scale for normalizing the OKS. If not set, the bounding box area will be used. Default: None.
threshold (float): The minimum OKS between a candidate pair of instances to be considered a match. Default: 0.

Returns:

Tuple[List[Tuple[Instance, PredictedInstance, float]], List[Instance]]: A tuple of (positive_pairs, false_negatives).

positive_pairs is a list of 3-tuples of the form (instance_gt, instance_pr, oks) containing the matched pair of instances and their OKS.

false_negatives is a list of ground truth sio.Instances that could not be matched.

Source code in sleap_nn/evaluation.py
def match_frame_pairs(
    frame_pairs: List[Tuple[sio.LabeledFrame, sio.LabeledFrame]],
    stddev: float = 0.025,
    scale: Optional[float] = None,
    threshold: float = 0,
) -> Tuple[List[Tuple[sio.Instance, sio.PredictedInstance, float]], List[sio.Instance]]:
    """Match all ground truth and predicted instances within each pair of frames.

    This is a wrapper for `match_instances()` but operates on lists of frames.

    Args:
        frame_pairs: A list of pairs of `sio.LabeledFrame`s in the form
            `(frame_gt, frame_pr)`. These can be obtained with `find_frame_pairs()`.
        stddev: The expected spread of coordinates for OKS computation.
        scale: The scale for normalizing the OKS. If not set, the bounding box area will
            be used.
        threshold: The minimum OKS between a candidate pair of instances to be
            considered a match.

    Returns:
        A tuple of (`positive_pairs`, `false_negatives`).

        `positive_pairs` is a list of 3-tuples of the form
        `(instance_gt, instance_pr, oks)` containing the matched pair of instances and
        their OKS.

        `false_negatives` is a list of ground truth `sio.Instance`s that could not be
        matched.
    """
    positive_pairs = []
    false_negatives = []
    for frame_gt, frame_pr in frame_pairs:
        positive_pairs_frame, false_negatives_frame = match_instances(
            frame_gt,
            frame_pr,
            stddev=stddev,
            scale=scale,
            threshold=threshold,
        )
        positive_pairs.extend(positive_pairs_frame)
        false_negatives.extend(false_negatives_frame)

    return positive_pairs, false_negatives

match_instances(frame_gt, frame_pr, stddev=0.025, scale=None, threshold=0)

Match pairs of instances between ground truth and predictions in a frame.

Parameters:

frame_gt (LabeledFrame): A sio.LabeledFrame with ground truth instances. Required.
frame_pr (LabeledFrame): A sio.LabeledFrame with predicted instances. Required.
stddev (float): The expected spread of coordinates for OKS computation. Default: 0.025.
scale (Optional[float]): The scale for normalizing the OKS. If not set, the bounding box area will be used. Default: None.
threshold (float): The minimum OKS between a candidate pair of instances to be considered a match. Default: 0.

Returns:

Tuple[List[Tuple[Instance, PredictedInstance, float]], List[Instance]]: A tuple of (positive_pairs, false_negatives).

positive_pairs is a list of 3-tuples of the form (instance_gt, instance_pr, oks) containing the matched pair of instances and their OKS.

false_negatives is a list of ground truth sio.Instances that could not be matched.

Notes

This function uses the approach from the PASCAL VOC scoring procedure. Briefly, predictions are sorted descending by their instance-level prediction scores and greedily matched to ground truth instances which are then removed from the pool of available instances.

Ground truth instances that remain unmatched are considered false negatives.
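
A stripped-down numpy sketch of the greedy step described above (toy OKS matrix and scores; the real function operates on sio.LabeledFrames): predictions are visited in descending score order, each claims the highest-OKS remaining ground truth instance, and leftovers become false negatives.

import numpy as np

# Toy data: OKS between 3 GT instances (rows) and 2 predictions (columns),
# plus instance-level prediction scores.
oks_matrix = np.array([
    [0.9, 0.2],
    [0.3, 0.8],
    [0.1, 0.4],
])
pred_scores = np.array([0.7, 0.95])

available_gt = list(range(oks_matrix.shape[0]))
positive_pairs, threshold = [], 0.0

for idx_pr in np.argsort(-pred_scores, kind="mergesort"):  # best-scoring first
    oks = oks_matrix[available_gt, idx_pr].astype(float)
    oks[oks <= threshold] = np.nan
    best = np.argsort(-oks, kind="mergesort")[0]
    if np.isnan(oks[best]):
        continue
    idx_gt = available_gt.pop(best)  # remove matched GT from the pool
    positive_pairs.append((idx_gt, idx_pr, oks[best]))
    if not available_gt:
        break

false_negatives = available_gt  # unmatched GT instances
print(positive_pairs, false_negatives)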

Source code in sleap_nn/evaluation.py
def match_instances(
    frame_gt: sio.LabeledFrame,
    frame_pr: sio.LabeledFrame,
    stddev: float = 0.025,
    scale: Optional[float] = None,
    threshold: float = 0,
) -> Tuple[List[Tuple[sio.Instance, sio.PredictedInstance, float]], List[sio.Instance]]:
    """Match pairs of instances between ground truth and predictions in a frame.

    Args:
        frame_gt: A `sio.LabeledFrame` with ground truth instances.
        frame_pr: A `sio.LabeledFrame` with predicted instances.
        stddev: The expected spread of coordinates for OKS computation.
        scale: The scale for normalizing the OKS. If not set, the bounding box area will
            be used.
        threshold: The minimum OKS between a candidate pair of instances to be
            considered a match.

    Returns:
        A tuple of (`positive_pairs`, `false_negatives`).

        `positive_pairs` is a list of 3-tuples of the form
        `(instance_gt, instance_pr, oks)` containing the matched pair of instances and
        their OKS.

        `false_negatives` is a list of ground truth `sio.Instance`s that could not be
        matched.

    Notes:
        This function uses the approach from the PASCAL VOC scoring procedure. Briefly,
        predictions are sorted descending by their instance-level prediction scores and
        greedily matched to ground truth instances which are then removed from the pool
        of available instances.

        Ground truth instances that remain unmatched are considered false negatives.
    """
    # Sort predicted instances by score.
    frame_pr_match_instances = get_instances(frame_pr)

    scores_pr = np.array(
        [
            m.instance.score
            for m in frame_pr_match_instances
            if hasattr(m.instance, "score")
        ]
    )
    idxs_pr = np.argsort(-scores_pr, kind="mergesort")  # descending
    scores_pr = scores_pr[idxs_pr]

    available_instances_gt = get_instances(frame_gt)
    available_instances_gt_idxs = list(range(len(available_instances_gt)))

    positive_pairs = []
    for idx_pr in idxs_pr:
        # Pull out predicted instance.
        instance_pr = frame_pr_match_instances[idx_pr]

        # Convert instances to point arrays.
        points_pr = np.expand_dims(instance_pr.instance.numpy(), axis=0)
        points_gt = np.stack(
            [
                available_instances_gt[idx].instance.numpy()
                for idx in available_instances_gt_idxs
            ],
            axis=0,
        )

        # Find the best match by computing OKS.
        oks = compute_oks(points_gt, points_pr, stddev=stddev, scale=scale)
        oks = np.squeeze(oks, axis=1)
        assert oks.shape == (len(points_gt),)

        oks[oks <= threshold] = np.nan
        best_match_gt_idx = np.argsort(-oks, kind="mergesort")[0]
        best_match_oks = oks[best_match_gt_idx]
        if np.isnan(best_match_oks):
            continue

        # Remove matched ground truth instance and add as a positive pair.
        instance_gt_idx = available_instances_gt_idxs.pop(best_match_gt_idx)
        instance_gt = available_instances_gt[instance_gt_idx]
        positive_pairs.append((instance_gt, instance_pr, best_match_oks))

        # Stop matching lower scoring instances if we run out of candidates in the
        # ground truth.
        if not available_instances_gt_idxs:
            break

    # Any remaining ground truth instances are considered false negatives.
    false_negatives = [
        available_instances_gt[idx] for idx in available_instances_gt_idxs
    ]

    return positive_pairs, false_negatives

run_evaluation(ground_truth_path, predicted_path, oks_stddev=0.025, oks_scale=None, match_threshold=0, user_labels_only=True, save_metrics=None)

Evaluate SLEAP-NN model predictions against ground truth labels.

Source code in sleap_nn/evaluation.py
def run_evaluation(
    ground_truth_path: str,
    predicted_path: str,
    oks_stddev: float = 0.025,
    oks_scale: Optional[float] = None,
    match_threshold: float = 0,
    user_labels_only: bool = True,
    save_metrics: Optional[str] = None,
):
    """Evaluate SLEAP-NN model predictions against ground truth labels."""
    logger.info("Loading ground truth labels...")
    ground_truth_instances = sio.load_slp(ground_truth_path)

    logger.info("Loading predicted labels...")
    predicted_instances = sio.load_slp(predicted_path)

    logger.info("Creating evaluator...")
    evaluator = Evaluator(
        ground_truth_instances=ground_truth_instances,
        predicted_instances=predicted_instances,
        oks_stddev=oks_stddev,
        oks_scale=oks_scale,
        match_threshold=match_threshold,
        user_labels_only=user_labels_only,
    )

    logger.info("Computing evaluation metrics...")
    metrics = evaluator.evaluate()

    # Print key metrics
    logger.info("Evaluation Results:")
    logger.info(f"mOKS: {metrics['mOKS']['mOKS']:.4f}")
    logger.info(f"mAP (OKS VOC): {metrics['voc_metrics']['oks_voc.mAP']:.4f}")
    logger.info(f"mAR (OKS VOC): {metrics['voc_metrics']['oks_voc.mAR']:.4f}")
    logger.info(f"Average Distance: {metrics['distance_metrics']['avg']:.4f}")
    logger.info(f"mPCK: {metrics['pck_metrics']['mPCK']:.4f}")
    logger.info(
        f"Visibility Precision: {metrics['visibility_metrics']['precision']:.4f}"
    )
    logger.info(f"Visibility Recall: {metrics['visibility_metrics']['recall']:.4f}")

    # Save metrics if path provided
    if save_metrics:
        logger.info(f"Saving metrics to {save_metrics}...")
        save_path = Path(save_metrics)
        save_path.parent.mkdir(parents=True, exist_ok=True)

        # Convert metrics to numpy arrays for saving
        np.savez(
            save_path,
            mOKS=metrics["mOKS"]["mOKS"],
            mAP=metrics["voc_metrics"]["oks_voc.mAP"],
            mAR=metrics["voc_metrics"]["oks_voc.mAR"],
            avg_distance=metrics["distance_metrics"]["avg"],
            mPCK=metrics["pck_metrics"]["mPCK"],
            visibility_precision=metrics["visibility_metrics"]["precision"],
            visibility_recall=metrics["visibility_metrics"]["recall"],
            # Save full metrics dict as well
            voc_metrics=metrics["voc_metrics"],
            distance_metrics=metrics["distance_metrics"],
            pck_metrics=metrics["pck_metrics"],
            visibility_metrics=metrics["visibility_metrics"],
        )
        logger.info(f"Metrics saved successfully to {save_path}")

    return metrics
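
A minimal end-to-end sketch (all file paths here are hypothetical placeholders):

from sleap_nn.evaluation import run_evaluation

# Compare predictions against ground truth and save key metrics to an .npz file.
metrics = run_evaluation(
    ground_truth_path="labels/ground_truth.slp",
    predicted_path="labels/predictions.slp",
    oks_stddev=0.025,
    match_threshold=0,
    save_metrics="results/eval_metrics.npz",
)
print(metrics["pck_metrics"]["mPCK"])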