pytorchvideo.models.head

class pytorchvideo.models.head.SequencePool(mode)

Sequence pool produces a single embedding from a sequence of embeddings. Currently it supports “mean” and “cls”.

__init__(mode)

Parameters: mode (str) – Options include “cls” and “mean”. If set to “cls”, it assumes the first element in the input is the cls token and returns it. If set to “mean”, it returns the mean of the entire sequence.
Return type: None
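
The two modes can be illustrated with a minimal sketch in plain PyTorch. This is not the library's implementation: the function name sequence_pool is hypothetical, and the input layout (batch, sequence, channels) is an assumption.

```python
import torch


def sequence_pool(x: torch.Tensor, mode: str) -> torch.Tensor:
    """Hypothetical stand-in for SequencePool; assumes x is (batch, sequence, channels)."""
    if mode == "cls":
        # Treat the first element of the sequence as the cls token and return it.
        return x[:, 0]
    if mode == "mean":
        # Average over the whole sequence dimension.
        return x.mean(dim=1)
    raise ValueError(f"unsupported mode: {mode!r}")


# Two sequences, three tokens each, four channels per token.
tokens = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
cls_out = sequence_pool(tokens, "cls")    # shape (2, 4)
mean_out = sequence_pool(tokens, "mean")  # shape (2, 4)
```

Both modes reduce the sequence dimension, so the result has one embedding per batch element.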

pytorchvideo.models.head.create_res_basic_head(*, in_features, out_features, pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, output_size=(1, 1, 1), pool_kernel_size=(1, 7, 7), pool_stride=(1, 1, 1), pool_padding=(0, 0, 0), dropout_rate=0.5, activation=None, output_with_global_average=True)

Creates a ResNet basic head. This layer performs an optional pooling operation followed by an optional dropout, a fully-connected projection, an optional activation layer, and an optional global spatiotemporal averaging.

 Pooling
    ↓
 Dropout
    ↓
Projection
    ↓
Activation
    ↓
Averaging

Activation examples include: ReLU, Softmax, Sigmoid, and None. Pool3d examples include: AvgPool3d, MaxPool3d, AdaptiveAvgPool3d, and None.

Parameters

in_features (int) – input channel size of the resnet head.
out_features (int) – output channel size of the resnet head.
pool (callable) – a callable that constructs resnet head pooling layer, examples include: nn.AvgPool3d, nn.MaxPool3d, nn.AdaptiveAvgPool3d, and None (not applying pooling).
pool_kernel_size (tuple) – pooling kernel size(s) when not using adaptive pooling.
pool_stride (tuple) – pooling stride size(s) when not using adaptive pooling.
pool_padding (tuple) – pooling padding size(s) when not using adaptive pooling.
output_size (tuple) – spatial temporal output size when using adaptive pooling.
activation (callable) – a callable that constructs resnet head activation layer, examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not applying activation).
dropout_rate (float) – dropout rate.
output_with_global_average (bool) – if True, perform global averaging on temporal and spatial dimensions and reshape output to batch_size x out_features.

Return type

torch.nn.modules.module.Module
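
The pipeline above can be sketched with plain PyTorch modules to show how the tensor shape changes at each stage. This is an illustration rather than the library's code: the input shape, the channel-last permutation around the linear projection, and the use of AdaptiveAvgPool3d for the final averaging are assumptions.

```python
import torch
from torch import nn

in_features, out_features = 16, 10
# Assumed input layout: (batch, channels, time, height, width).
features = torch.randn(2, in_features, 4, 7, 7)

pool = nn.AvgPool3d(kernel_size=(1, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
dropout = nn.Dropout(p=0.5)
proj = nn.Linear(in_features, out_features)
output_pool = nn.AdaptiveAvgPool3d(1)

x = pool(features)  # (2, 16, 4, 1, 1): spatial dimensions pooled away
x = dropout(x)      # shape unchanged
# nn.Linear acts on the last dimension, so move channels last and back again.
x = proj(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)  # (2, 10, 4, 1, 1)
# activation=None in this sketch, so no activation is applied.
logits = output_pool(x).flatten(1)  # (2, 10): batch_size x out_features
```

With global averaging the remaining temporal positions are averaged, which yields one prediction vector per clip.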

pytorchvideo.models.head.create_vit_basic_head(*, in_features, out_features, seq_pool_type='cls', dropout_rate=0.5, activation=None)

Creates vision transformer basic head.

 Pooling
    ↓
 Dropout
    ↓
Projection
    ↓
Activation

Activation examples include: ReLU, Softmax, Sigmoid, and None. Pool type examples include: cls, mean and none.

Parameters

in_features (int) – input channel size of the vision transformer head.
out_features (int) – output channel size of the vision transformer head.
seq_pool_type (str) – Pooling type. It supports “cls”, “mean” and “none”. If set to “cls”, it assumes the first element in the input is the cls token and returns it. If set to “mean”, it returns the mean of the entire sequence. If set to “none”, no sequence pooling is applied.
activation (callable) – a callable that constructs vision transformer head activation layer, examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not applying activation).
dropout_rate (float) – dropout rate.

Return type

torch.nn.modules.module.Module
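
As a rough sketch of the pooling, dropout, and projection stages, the module below mimics the head in plain PyTorch. The class name TinyViTHead is hypothetical, the input layout (batch, sequence, channels) is an assumption, and the behavior for “none” (keeping the whole sequence) is inferred from the parameter description rather than confirmed here.

```python
import torch
from torch import nn


class TinyViTHead(nn.Module):
    """Hypothetical illustration of pooling -> dropout -> projection."""

    def __init__(self, in_features, out_features, seq_pool_type="cls", dropout_rate=0.5):
        super().__init__()
        if seq_pool_type not in ("cls", "mean", "none"):
            raise ValueError(f"unsupported seq_pool_type: {seq_pool_type!r}")
        self.seq_pool_type = seq_pool_type
        self.dropout = nn.Dropout(dropout_rate)
        self.proj = nn.Linear(in_features, out_features)

    def forward(self, x):
        # x is assumed to be (batch, sequence, channels).
        if self.seq_pool_type == "cls":
            x = x[:, 0]        # keep only the first (cls) token
        elif self.seq_pool_type == "mean":
            x = x.mean(dim=1)  # average over the sequence
        x = self.dropout(x)
        return self.proj(x)    # no activation, matching activation=None


tokens = torch.randn(2, 5, 8)
logits = TinyViTHead(8, 3, seq_pool_type="cls").eval()(tokens)       # (2, 3)
per_token = TinyViTHead(8, 3, seq_pool_type="none").eval()(tokens)   # (2, 5, 3)
```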

pytorchvideo.models.head.create_res_roi_pooling_head(*, in_features, out_features, resolution, spatial_scale, sampling_ratio=0, roi=<class 'torchvision.ops.roi_align.RoIAlign'>, pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, output_size=(1, 1, 1), pool_kernel_size=(1, 7, 7), pool_stride=(1, 1, 1), pool_padding=(0, 0, 0), pool_spatial=<class 'torch.nn.modules.pooling.MaxPool2d'>, dropout_rate=0.5, activation=None, output_with_global_average=True)

Creates ResNet RoI head. This layer performs an optional pooling operation followed by an RoI projection, an optional 2D spatial pool, an optional dropout, a fully-connected projection, an activation layer and a global spatiotemporal averaging.

 Pool3d
    ↓
RoI Align
    ↓
 Pool2d
    ↓
 Dropout
    ↓
Projection
    ↓
Activation
    ↓
Averaging

Activation examples include: ReLU, Softmax, Sigmoid, and None. Pool3d examples include: AvgPool3d, MaxPool3d, AdaptiveAvgPool3d, and None. RoI examples include: detectron2.layers.ROIAlign, detectron2.layers.ROIAlignRotated, torchvision.ops.RoIAlign, and None. Pool2d examples include: MaxPool2d, AvgPool2d, and None.

Parameters

in_features (int) – input channel size of the resnet head.
out_features (int) – output channel size of the resnet head.
resolution (tuple) – h, w sizes of the RoI interpolation.
spatial_scale (float) – scale the input boxes by this number.
sampling_ratio (int) – number of input samples to take for each output sample interpolation. 0 to take samples densely.
roi (callable) – a callable that constructs the roi interpolation layer, examples include detectron2.layers.ROIAlign, detectron2.layers.ROIAlignRotated, and None.
pool (callable) – a callable that constructs resnet head pooling layer, examples include: nn.AvgPool3d, nn.MaxPool3d, nn.AdaptiveAvgPool3d, and None (not applying pooling).
pool_kernel_size (tuple) – pooling kernel size(s) when not using adaptive pooling.
pool_stride (tuple) – pooling stride size(s) when not using adaptive pooling.
pool_padding (tuple) – pooling padding size(s) when not using adaptive pooling.
output_size (tuple) – spatial temporal output size when using adaptive pooling.
pool_spatial (callable) – a callable that constructs the 2d pooling layer which follows the RoI layer, examples include: nn.AvgPool2d, nn.MaxPool2d, and None (not applying spatial pooling).
activation (callable) – a callable that constructs resnet head activation layer, examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not applying activation).
dropout_rate (float) – dropout rate.
output_with_global_average (bool) – if True, perform global averaging on temporal and spatial dimensions and reshape output to batch_size x out_features.

Return type

torch.nn.modules.module.Module

class pytorchvideo.models.head.ResNetBasicHead(pool=None, dropout=None, proj=None, activation=None, output_pool=None)

ResNet basic head. This layer performs an optional pooling operation followed by an optional dropout, a fully-connected projection, an optional activation layer and a global spatiotemporal averaging.

 Pool3d
    ↓
 Dropout
    ↓
Projection
    ↓
Activation
    ↓
Averaging

The builder can be found in create_res_basic_head.

__init__(pool=None, dropout=None, proj=None, activation=None, output_pool=None)

Parameters

pool (torch.nn.Module) – pooling module.
dropout (torch.nn.Module) – dropout module.
proj (torch.nn.Module) – projection module.
activation (torch.nn.Module) – activation module.
output_pool (torch.nn.Module) – pooling module for output.

Return type

None

class pytorchvideo.models.head.ResNetRoIHead(pool=None, pool_spatial=None, roi_layer=None, dropout=None, proj=None, activation=None, output_pool=None)

ResNet RoI head. This layer performs an optional pooling operation followed by an RoI projection, an optional 2D spatial pool, an optional dropout, a fully-connected projection, an activation layer and a global spatiotemporal averaging.

 Pool3d
    ↓
RoI Align
    ↓
 Pool2d
    ↓
 Dropout
    ↓
Projection
    ↓
Activation
    ↓
Averaging

The builder can be found in create_res_roi_pooling_head.

__init__(pool=None, pool_spatial=None, roi_layer=None, dropout=None, proj=None, activation=None, output_pool=None)

Parameters

pool (torch.nn.Module) – pooling module.
pool_spatial (torch.nn.Module) – 2D spatial pooling module applied after the RoI layer.
roi_layer (torch.nn.Module) – RoI module (e.g., RoI align or RoI pool).
dropout (torch.nn.Module) – dropout module.
proj (torch.nn.Module) – projection module.
activation (torch.nn.Module) – activation module.
output_pool (torch.nn.Module) – pooling module for output.

Return type

None

forward(x, bboxes)

Parameters

x (torch.Tensor) – input tensor.
bboxes (torch.Tensor) – Associated bounding boxes. The format is N x 5 (index, x_1, y_1, x_2, y_2) if using RoIAlign and N x 6 (index, x_ctr, y_ctr, width, height, angle_degrees) if using RoIAlignRotated.

Return type

torch.Tensor
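
To make the N x 5 layout concrete, the sketch below assembles a bboxes tensor from per-clip box lists. The helper name to_roi_boxes is hypothetical, and reading the leading index as the position of the clip within the batch is an assumption based on the usual RoIAlign convention.

```python
import torch


def to_roi_boxes(boxes_per_clip):
    """Hypothetical helper: stack per-clip (x1, y1, x2, y2) boxes into one N x 5 tensor."""
    rows = []
    for clip_index, boxes in enumerate(boxes_per_clip):
        # Prepend a column holding the index of the clip each box belongs to.
        index_column = torch.full((boxes.shape[0], 1), float(clip_index))
        rows.append(torch.cat([index_column, boxes.float()], dim=1))
    return torch.cat(rows, dim=0)


boxes_per_clip = [
    torch.tensor([[10.0, 20.0, 110.0, 220.0]]),                        # clip 0: one box
    torch.tensor([[0.0, 0.0, 50.0, 50.0], [30.0, 40.0, 90.0, 120.0]]),  # clip 1: two boxes
]
bboxes = to_roi_boxes(boxes_per_clip)  # shape (3, 5)
```

The first column lets a single flat tensor carry a different number of boxes for each clip.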

class pytorchvideo.models.head.VisionTransformerBasicHead(sequence_pool=None, dropout=None, proj=None, activation=None)

Vision transformer basic head.

SequencePool
     ↓
  Dropout
     ↓
 Projection
     ↓
 Activation

The builder can be found in create_vit_basic_head.

__init__(sequence_pool=None, dropout=None, proj=None, activation=None)

Parameters

sequence_pool (torch.nn.Module) – sequence pooling module.
dropout (torch.nn.Module) – dropout module.
proj (torch.nn.Module) – projection module.
activation (torch.nn.Module) – activation module.

Return type

None