
pytorchvideo.layers.batch_norm

class pytorchvideo.layers.batch_norm.NaiveSyncBatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)[source]

An implementation of 1D naive sync batch normalization. See details in NaiveSyncBatchNorm2d below.

class pytorchvideo.layers.batch_norm.NaiveSyncBatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)[source]

An implementation of 2D naive sync batch normalization. In PyTorch <= 1.5, nn.SyncBatchNorm has incorrect gradients when the batch size on each worker is different (e.g., when scale augmentation is used, or when it is applied to a mask head).

This is a slower but correct alternative to nn.SyncBatchNorm.

Note

This module computes overall statistics by using statistics of each worker with equal weight. The result is true statistics of all samples (as if they are all on one worker) only when all workers have the same (N, H, W). This mode does not support inputs with zero batch size.

class pytorchvideo.layers.batch_norm.NaiveSyncBatchNorm3d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)[source]

An implementation of 3D naive sync batch normalization. See details in NaiveSyncBatchNorm2d above.
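
Example (a minimal sketch; shapes are illustrative). The layer is a drop-in replacement for nn.BatchNorm3d; the cross-worker synchronization only takes effect under torch.distributed, so in a single process it behaves like regular batch norm:

import torch
from pytorchvideo.layers.batch_norm import NaiveSyncBatchNorm3d

# Input format: (batch, channels, time, height, width).
bn = NaiveSyncBatchNorm3d(num_features=16)
x = torch.randn(2, 16, 4, 8, 8)
print(bn(x).shape)  # torch.Size([2, 16, 4, 8, 8])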

pytorchvideo.layers.convolutions

class pytorchvideo.layers.convolutions.ConvReduce3D(*, in_channels, out_channels, kernel_size, stride=None, padding=None, padding_mode=None, dilation=None, groups=None, bias=None, reduction_method='sum')[source]

Builds a list of convolutional operators and performs summation on the outputs.

Conv3d, Conv3d, ...,  Conv3d
               ↓
              Sum
__init__(*, in_channels, out_channels, kernel_size, stride=None, padding=None, padding_mode=None, dilation=None, groups=None, bias=None, reduction_method='sum')[source]
Parameters
  • in_channels (int) – number of input channels.

  • out_channels (int) – number of output channels produced by the convolution(s).

  • kernel_size (Tuple[Union[int, Tuple[int, int, int]]]) – tuple of sizes of the convolution kernels, one per convolution.

  • stride (Optional[Tuple[Union[int, Tuple[int, int, int]]]]) – tuple of strides of the convolutions.

  • padding (Optional[Tuple[Union[int, Tuple[int, int, int]]]]) – tuple of paddings added to all three sides of the input.

  • padding_mode (Optional[Tuple[str]]) – tuple of padding modes for each convolution. Options include zeros, reflect, replicate, or circular.

  • dilation (Optional[Tuple[Union[int, Tuple[int, int, int]]]]) – tuple of spacings between kernel elements.

  • groups (Optional[Tuple[int]]) – tuple of numbers of blocked connections from input channels to output channels.

  • bias (Optional[Tuple[bool]]) – tuple of booleans; if True, adds a learnable bias to the corresponding convolution.

  • reduction_method (str) – options include sum and cat.

Return type

None
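
Example (a minimal sketch; channel sizes and shapes are illustrative). Two parallel 3D convolutions with different kernel sizes are applied to the same input and their outputs are summed; the paddings are chosen so both branches produce the same output shape:

import torch
from pytorchvideo.layers.convolutions import ConvReduce3D

conv = ConvReduce3D(
    in_channels=16,
    out_channels=32,
    kernel_size=((1, 1, 1), (3, 3, 3)),
    padding=((0, 0, 0), (1, 1, 1)),  # keep both branch outputs the same shape
    reduction_method="sum",
)
x = torch.randn(2, 16, 4, 8, 8)  # (batch, channels, time, height, width)
print(conv(x).shape)  # torch.Size([2, 32, 4, 8, 8])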

pytorchvideo.layers.convolutions.create_conv_2plus1d(*, in_channels, out_channels, inner_channels=None, conv_xy_first=False, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1), bias=False, dilation=(1, 1, 1), groups=1, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)[source]

Create a 2plus1d conv layer: a factorized spatiotemporal convolution in which a temporal convolution and a spatial convolution are applied in sequence, with normalization and activation in between.

Conv_t (or Conv_xy if conv_xy_first = True)
                   ↓
             Normalization
                   ↓
               Activation
                   ↓
Conv_xy (or Conv_t if conv_xy_first = True)

Normalization options include: BatchNorm3d and None (no normalization). Activation options include: ReLU, Softmax, Sigmoid, and None (no activation).

Parameters
  • in_channels (int) – input channel size of the convolution.

  • out_channels (int) – output channel size of the convolution.

  • kernel_size (tuple) – convolutional kernel size(s).

  • stride (tuple) – convolutional stride size(s).

  • padding (tuple) – convolutional padding size(s).

  • bias (bool) – convolutional bias. If true, adds a learnable bias to the output.

  • groups (int) – number of groups in the convolution layers. Values >1 are unsupported.

  • dilation (tuple) – dilation value of the convolution layers. Values >1 are unsupported.

  • conv_xy_first (bool) – if True, the spatial convolution comes before the temporal convolution.

  • norm (callable) – a callable that constructs normalization layer, options include nn.BatchNorm3d, None (not performing normalization).

  • norm_eps (float) – normalization epsilon.

  • norm_momentum (float) – normalization momentum.

  • activation (callable) – a callable that constructs activation layer, options include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

  • inner_channels (int) – number of channels between the two factorized convolutions (defaults to out_channels when None).

Returns

(nn.Module) – 2plus1d conv layer.

Return type

torch.nn.modules.module.Module
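
Example (a minimal sketch; channel sizes and shapes are illustrative):

import torch
import torch.nn as nn
from pytorchvideo.layers.convolutions import create_conv_2plus1d

layer = create_conv_2plus1d(
    in_channels=16,
    out_channels=32,
    kernel_size=(3, 3, 3),
    stride=(1, 1, 1),  # default stride is (2, 2, 2)
    padding=(1, 1, 1),
    norm=nn.BatchNorm3d,
    activation=nn.ReLU,
)
x = torch.randn(2, 16, 8, 16, 16)  # (batch, channels, time, height, width)
print(layer(x).shape)  # torch.Size([2, 32, 8, 16, 16])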

class pytorchvideo.layers.convolutions.Conv2plus1d(*, conv_t=None, norm=None, activation=None, conv_xy=None, conv_xy_first=False)[source]

Implementation of 2+1d Convolution by factorizing a 3D convolution into a 1D temporal convolution and a 2D spatial convolution, with normalization and activation modules in between:

Conv_t (or Conv_xy if conv_xy_first = True)
                   ↓
             Normalization
                   ↓
               Activation
                   ↓
Conv_xy (or Conv_t if conv_xy_first = True)

The 2+1d Convolution is used to build the R(2+1)D network.

__init__(*, conv_t=None, norm=None, activation=None, conv_xy=None, conv_xy_first=False)[source]
Parameters
  • conv_t (torch.nn.modules) – temporal convolution module.

  • norm (torch.nn.modules) – normalization module.

  • activation (torch.nn.modules) – activation module.

  • conv_xy (torch.nn.modules) – spatial convolution module.

  • conv_xy_first (bool) – if True, the spatial convolution comes before the temporal convolution.

Return type

None
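
Example (a minimal sketch assembling the block from explicit submodules; channel sizes are illustrative). A 3x3x3 convolution is factorized into a 3x1x1 temporal convolution and a 1x3x3 spatial convolution:

import torch
import torch.nn as nn
from pytorchvideo.layers.convolutions import Conv2plus1d

layer = Conv2plus1d(
    conv_t=nn.Conv3d(16, 24, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
    norm=nn.BatchNorm3d(24),
    activation=nn.ReLU(),
    conv_xy=nn.Conv3d(24, 32, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
)
x = torch.randn(2, 16, 8, 16, 16)  # (batch, channels, time, height, width)
print(layer(x).shape)  # torch.Size([2, 32, 8, 16, 16])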

pytorchvideo.layers.fusion

pytorchvideo.layers.fusion.make_fusion_layer(method, feature_dims)[source]
Parameters
  • method (str) – the fusion method to be constructed. Options: ‘concat’, ‘temporal_concat’, ‘max’, ‘sum’, ‘prod’.

  • feature_dims (List[int]) – the first argument of all fusion layers. It holds a list of the required feature_dim for each tensor input (where the tensor inputs are of shape (batch_size, seq_len, feature_dim)). The list order must correspond to the tensor order passed to forward(…). See the sketch below.
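
Example (a minimal sketch; shapes are illustrative). ‘concat’ allows different feature_dims and concatenates along the last dimension:

import torch
from pytorchvideo.layers.fusion import make_fusion_layer

fusion = make_fusion_layer("concat", feature_dims=[64, 32])
a = torch.randn(2, 10, 64)  # (batch_size, seq_len, feature_dim)
b = torch.randn(2, 10, 32)
print(fusion([a, b]).shape)  # torch.Size([2, 10, 96])
print(fusion.output_dim)     # 96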

class pytorchvideo.layers.fusion.ConcatFusion(feature_dims)[source]

Concatenates all inputs along their last dimension. The last dimension of the resulting tensor is the sum of the last dimensions of all input tensors.

property output_dim

Last dimension size of forward(..) tensor output.

forward(input_list)[source]
Parameters

input_list (List[torch.Tensor]) – a list of tensors of shape (batch_size, seq_len, feature_dim).

Returns

Tensor of shape (batch_size, seq_len, sum(feature_dims)), where sum(feature_dims) is the sum of all input feature_dims.

Return type

torch.Tensor

class pytorchvideo.layers.fusion.TemporalConcatFusion(feature_dims)[source]

Concatenates all inputs by their temporal dimension which is assumed to be dim=1.

property output_dim

Last dimension size of forward(..) tensor output.

forward(input_list)[source]
Parameters

input_list (List[torch.Tensor]) – a list of tensors of shape (batch_size, seq_len, feature_dim)

Returns

Tensor of shape (batch_size, sum(seq_len), feature_dim), where sum(seq_len) is the sum of the seq_len of all input tensors.

Return type

torch.Tensor

class pytorchvideo.layers.fusion.ReduceFusion(feature_dims, reduce_fn)[source]

Generic fusion method that applies a callable to the list of input tensors and returns a single tensor. This class can be used to implement fusion methods like “sum”, “max”, and “prod”.

property output_dim

Last dimension size of forward(..) tensor output.

forward(input_list)[source]
Parameters

input_list (List[torch.Tensor]) – a list of tensors of shape (batch_size, seq_len, feature_dim).

Returns

Tensor of shape (batch_size, seq_len, feature_dim).

Return type

torch.Tensor
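
Example (a minimal sketch; shapes are illustrative). The “sum”, “max”, and “prod” options of make_fusion_layer are ReduceFusion instances, so all inputs must share the same shape:

import torch
from pytorchvideo.layers.fusion import make_fusion_layer

fusion = make_fusion_layer("sum", feature_dims=[64, 64])
a = torch.randn(2, 10, 64)
b = torch.randn(2, 10, 64)
print(fusion([a, b]).shape)  # torch.Size([2, 10, 64])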

pytorchvideo.layers.mlp

pytorchvideo.layers.mlp.make_multilayer_perceptron(fully_connected_dims, norm=None, mid_activation=<class 'torch.nn.modules.activation.ReLU'>, final_activation=<class 'torch.nn.modules.activation.ReLU'>, dropout_rate=0.0)[source]

Factory function for a Multi-Layer Perceptron. The MLP is constructed as repeated blocks of the following form, where fc[i] denotes the output dimension of block i (and the input dimension of block i+1).

       Linear (in=fc[i-1], out=fc[i])
                     ↓
           Normalization (norm)
                     ↓
         Activation (mid_activation)

After the repeated blocks, a final dropout and activation layer is applied:

          Dropout (p=dropout_rate)
                     ↓
          Activation (final_activation)
Parameters
  • fully_connected_dims (List[int]) –

  • norm (Optional[Callable]) –

  • mid_activation (Callable) –

  • final_activation (Optional[Callable]) –

  • dropout_rate (float) –

Return type

Tuple[torch.nn.modules.module.Module, int]
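
Example (a minimal sketch; dimensions are illustrative). The factory returns the module together with its output dimension:

import torch
from pytorchvideo.layers.mlp import make_multilayer_perceptron

mlp, out_dim = make_multilayer_perceptron(
    fully_connected_dims=[128, 64, 10],  # input dim, hidden dim, output dim
    final_activation=None,
    dropout_rate=0.5,
)
x = torch.randn(4, 128)
print(mlp(x).shape, out_dim)  # torch.Size([4, 10]) 10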

pytorchvideo.layers.nonlocal_net

class pytorchvideo.layers.nonlocal_net.NonLocal(*, conv_theta, conv_phi, conv_g, conv_out, pool=None, norm=None, instantiation='dot_product')[source]

Builds Non-local Neural Networks as a generic family of building blocks for capturing long-range dependencies. Non-local Network computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. More details in the paper: Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. “Non-local neural networks.” In Proceedings of the IEEE conference on CVPR, 2018.

pytorchvideo.layers.nonlocal_net.create_nonlocal(*, dim_in, dim_inner, pool_size=(1, 1, 1), instantiation='softmax', norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1)[source]

Builds Non-local Neural Networks as a generic family of building blocks for capturing long-range dependencies. Non-local Network computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. More details in the paper: https://arxiv.org/pdf/1711.07971

Parameters
  • dim_in (int) – number of dimensions of the input.

  • dim_inner (int) – number of dimensions inside the Non-local block.

  • pool_size (tuple[int]) – kernel size of the spatiotemporal pooling, in the order (temporal kernel size, spatial kernel size, spatial kernel size). If pool_size is None, no pooling is used.

  • instantiation (string) – supports two instantiation methods: “dot_product”, which normalizes the correlation matrix with L2, and “softmax”, which normalizes the correlation matrix with Softmax.

  • norm (nn.Module) – nn.Module for the normalization layer. The default is nn.BatchNorm3d.

  • norm_eps (float) – normalization epsilon.

  • norm_momentum (float) – normalization momentum.
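
Example (a minimal sketch; channel sizes and shapes are illustrative). The block has a residual connection, so the output shape matches the input shape:

import torch
from pytorchvideo.layers.nonlocal_net import create_nonlocal

block = create_nonlocal(dim_in=64, dim_inner=32, instantiation="softmax")
x = torch.randn(2, 64, 4, 14, 14)  # (batch, channels, time, height, width)
print(block(x).shape)  # torch.Size([2, 64, 4, 14, 14])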

pytorchvideo.layers.positional_encoding

class pytorchvideo.layers.positional_encoding.PositionalEncoding(embed_dim, seq_len=1024)[source]

Applies a positional encoding to a tensor with shape (batch_size x seq_len x embed_dim).

The positional encoding is computed as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position, pos ∈ [0, seq_len); d_model is the data embedding dimension (embed_dim); and i is the dimension index, i ∈ [0, embed_dim).

Reference: “Attention Is All You Need” https://arxiv.org/abs/1706.03762 Implementation Reference: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
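
Example (a minimal sketch; shapes are illustrative, and the input seq_len must not exceed the seq_len the module was constructed with):

import torch
from pytorchvideo.layers.positional_encoding import PositionalEncoding

pe = PositionalEncoding(embed_dim=64, seq_len=1024)
x = torch.randn(2, 100, 64)  # (batch_size, seq_len, embed_dim)
print(pe(x).shape)  # torch.Size([2, 100, 64])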

class pytorchvideo.layers.positional_encoding.SpatioTemporalClsPositionalEncoding(embed_dim, patch_embed_shape, sep_pos_embed=False, has_cls=True)[source]

Add a cls token and apply a spatiotemporal encoding to a tensor.

__init__(embed_dim, patch_embed_shape, sep_pos_embed=False, has_cls=True)[source]
Parameters
  • embed_dim (int) – Embedding dimension for input sequence.

  • patch_embed_shape (Tuple) – The number of patches in each dimension (T, H, W) after patch embedding.

  • sep_pos_embed (bool) – If set to true, one positional encoding is used for spatial patches and another positional encoding is used for temporal sequence. Otherwise, only one positional encoding is used for all the patches.

  • has_cls (bool) – If set to true, a cls token is added at the beginning of each input sequence.

Return type

None

forward(x)[source]
Parameters

x (torch.Tensor) – Input tensor.

Return type

torch.Tensor
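
Example (a minimal sketch; dimensions are illustrative). With patch_embed_shape (T, H, W) = (2, 4, 4) the input sequence has 2 * 4 * 4 = 32 patch tokens, and with has_cls=True a cls token is prepended:

import torch
from pytorchvideo.layers.positional_encoding import SpatioTemporalClsPositionalEncoding

enc = SpatioTemporalClsPositionalEncoding(
    embed_dim=96, patch_embed_shape=(2, 4, 4), has_cls=True
)
x = torch.randn(2, 32, 96)  # (batch_size, T*H*W, embed_dim)
print(enc(x).shape)  # torch.Size([2, 33, 96]) with the cls token prepended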

pytorchvideo.layers.swish

class pytorchvideo.layers.swish.Swish[source]

Wrapper for the Swish activation function.

class pytorchvideo.layers.swish.SwishFunction(*args, **kwargs)[source]

Implementation of the Swish activation function: x * sigmoid(x).

Reference: Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. “Searching for Activation Functions.” 2017.
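
Example (a minimal sketch verifying the definition x * sigmoid(x)):

import torch
from pytorchvideo.layers.swish import Swish

act = Swish()
x = torch.randn(4)
assert torch.allclose(act(x), x * torch.sigmoid(x))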

pytorchvideo.layers.squeeze_excitation

class pytorchvideo.layers.squeeze_excitation.SqueezeAndExcitationLayer2D(in_planes, reduction_ratio=16, reduced_planes=None)[source]

2D Squeeze and excitation layer, as per https://arxiv.org/pdf/1709.01507.pdf

__init__(in_planes, reduction_ratio=16, reduced_planes=None)[source]
Parameters
  • in_planes (int) – input channel dimension.

  • reduction_ratio (int) – factor by which in_planes should be reduced to get the output channel dimension.

  • reduced_planes (int) – Output channel dimension. Only one of reduction_ratio or reduced_planes should be defined.

forward(x)[source]
Parameters

x (tensor) – 2D image of format C * H * W

Return type

torch.Tensor
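
Example (a minimal sketch; channel sizes are illustrative, and the input is shown with a leading batch dimension). With in_planes=64 and reduction_ratio=16, the squeeze step reduces to 64 / 16 = 4 channels before re-expanding:

import torch
from pytorchvideo.layers.squeeze_excitation import SqueezeAndExcitationLayer2D

se = SqueezeAndExcitationLayer2D(in_planes=64, reduction_ratio=16)
x = torch.randn(2, 64, 8, 8)
print(se(x).shape)  # torch.Size([2, 64, 8, 8])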

pytorchvideo.layers.squeeze_excitation.create_audio_2d_squeeze_excitation_block(dim_in, dim_out, use_se=False, se_reduction_ratio=16, branch_fusion=<function <lambda>>, conv_a_kernel_size=3, conv_a_stride=1, conv_a_padding=1, conv_b_kernel_size=3, conv_b_stride=1, conv_b_padding=1, norm=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)[source]

2D residual block with squeeze excitation (SE2D). Performs a summation between an identity shortcut in branch1 and a main block in branch2. When the input and output dimensions are different, a convolution followed by a normalization will be performed.

  Input
    |-------+
    ↓       |
  conv2d    |
    ↓       |
   Norm     |
    ↓       |
Activation  |
    ↓       |
  conv2d    |
    ↓       |
   Norm     |
    ↓       |
   SE2D     |
    ↓       |
Summation ←-+
    ↓
Activation

Normalization examples include: BatchNorm2d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation).

Parameters
  • dim_in (int) – input channel size to the bottleneck block.

  • dim_out (int) – output channel size of the bottleneck.

  • use_se (bool) – if true, use squeeze excitation layer in the bottleneck.

  • se_reduction_ratio (int) – factor by which input channels should be reduced to get the output channel dimension in SE layer.

  • branch_fusion (callable) – a callable that constructs summation layer. Examples include: lambda x, y: x + y, OctaveSum.

  • conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.

  • conv_a_stride (tuple) – convolutional stride size(s) for conv_a.

  • conv_a_padding (tuple) – convolutional padding(s) for conv_a.

  • conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.

  • conv_b_stride (tuple) – convolutional stride size(s) for conv_b.

  • conv_b_padding (tuple) – convolutional padding(s) for conv_b.

  • norm (callable) – a callable that constructs the normalization layer. Examples include nn.BatchNorm2d and None (not performing normalization).

  • norm_eps (float) – normalization epsilon.

  • norm_momentum (float) – normalization momentum.

  • activation (callable) – a callable that constructs activation layer in bottleneck and block. Examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

Returns

(nn.Module) – 2D residual block with squeeze excitation.

Return type

torch.nn.modules.module.Module
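
Example (a minimal sketch; channel sizes and shapes are illustrative). Since dim_in != dim_out, the identity branch is projected by a convolution followed by a normalization:

import torch
import torch.nn as nn
from pytorchvideo.layers.squeeze_excitation import create_audio_2d_squeeze_excitation_block

block = create_audio_2d_squeeze_excitation_block(
    dim_in=32,
    dim_out=64,
    use_se=True,
    norm=nn.BatchNorm2d,
    activation=nn.ReLU,
)
x = torch.randn(2, 32, 16, 16)  # e.g., (batch, channels, frequency, time) for audio features
print(block(x).shape)  # torch.Size([2, 64, 16, 16])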
