pytorchvideo.layers.batch_norm¶
-
class
pytorchvideo.layers.batch_norm.
NaiveSyncBatchNorm1d
(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)[source]¶ An implementation of 1D naive sync batch normalization. See details in NaiveSyncBatchNorm2d below.
-
class
pytorchvideo.layers.batch_norm.
NaiveSyncBatchNorm2d
(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)[source]¶ An implementation of 2D naive sync batch normalization. In PyTorch<=1.5, nn.SyncBatchNorm has an incorrect gradient when the batch size on each worker is different (e.g., when scale augmentation is used, or when it is applied to a mask head). This is a slower but correct alternative to nn.SyncBatchNorm.
Note
This module computes overall statistics by using the statistics of each worker with equal weight. The result equals the true statistics of all samples (as if they were all on one worker) only when all workers have the same (N, H, W). This module does not support inputs with zero batch size.
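The equal-weight caveat in the note can be checked with plain arithmetic. The following is a minimal sketch, assuming two hypothetical workers holding scalar samples; the helper names are illustrative and are not part of pytorchvideo.

```python
# Illustrative only: equal-weight averaging of per-worker means matches the
# true mean over all samples only when every worker holds the same number
# of samples.

def mean(values):
    return sum(values) / len(values)


def equal_weight_mean(worker_batches):
    """Average the per-worker means, giving every worker the same weight."""
    return mean([mean(batch) for batch in worker_batches])


def true_mean(worker_batches):
    """Mean over all samples, as if they were all on one worker."""
    return mean([x for batch in worker_batches for x in batch])


# Two workers with the same batch size: both estimates agree.
balanced = [[1.0, 3.0], [5.0, 7.0]]
# Two workers with different batch sizes: the estimates diverge.
unbalanced = [[1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]

print(equal_weight_mean(balanced), true_mean(balanced))
print(equal_weight_mean(unbalanced), true_mean(unbalanced))
```

In the unbalanced case the smaller worker is over-weighted, which is the situation the note warns about.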
pytorchvideo.layers.convolutions¶
-
class
pytorchvideo.layers.convolutions.
ConvReduce3D
(*, in_channels, out_channels, kernel_size, stride=None, padding=None, padding_mode=None, dilation=None, groups=None, bias=None, reduction_method='sum')[source]¶ Builds a list of convolutional operators and performs summation on the outputs.
Conv3d, Conv3d, ..., Conv3d
↓
Sum
-
__init__
(*, in_channels, out_channels, kernel_size, stride=None, padding=None, padding_mode=None, dilation=None, groups=None, bias=None, reduction_method='sum')[source]¶ - Parameters
in_channels (int) – number of input channels.
out_channels (int) – number of output channels produced by the convolution(s).
kernel_size (tuple) – Tuple of sizes of the convolutional kernels.
stride (tuple) – Tuple of strides of the convolutions.
padding (tuple) – Tuple of paddings added to all three sides of the input.
padding_mode (tuple) – Tuple of padding modes, one per convolution. Options include zeros, reflect, replicate or circular.
dilation (tuple) – Tuple of spacings between kernel elements.
groups (tuple) – Tuple of numbers of blocked connections from input channels to output channels.
bias (tuple) – Tuple of flags. If True, adds a learnable bias to the output.
reduction_method (str) – Options include sum and cat.
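The pattern described for ConvReduce3D (several Conv3d operators applied to the same input, then reduced) can be sketched with plain PyTorch. This is a hedged illustration, not the library's implementation: the class name ConvReduceSketch is hypothetical, only kernel size and padding are made per-convolution, and concatenating along the channel dimension for cat is an assumption.

```python
import torch
import torch.nn as nn


class ConvReduceSketch(nn.Module):
    """Hypothetical stand-in for the pattern; not the pytorchvideo class."""

    def __init__(self, in_channels, out_channels, kernel_size, padding,
                 reduction_method="sum"):
        super().__init__()
        if reduction_method not in ("sum", "cat"):
            raise ValueError(f"unsupported reduction: {reduction_method}")
        self.reduction_method = reduction_method
        # One Conv3d per (kernel_size, padding) pair.
        self.convs = nn.ModuleList(
            [
                nn.Conv3d(in_channels, out_channels, kernel_size=k, padding=p)
                for k, p in zip(kernel_size, padding)
            ]
        )

    def forward(self, x):
        outputs = [conv(x) for conv in self.convs]
        if self.reduction_method == "sum":
            # Element-wise sum; all outputs must share one shape.
            return torch.stack(outputs, dim=0).sum(dim=0)
        # Assumption: "cat" concatenates along the channel dimension.
        return torch.cat(outputs, dim=1)


x = torch.randn(2, 3, 4, 8, 8)  # (batch, channels, T, H, W)
kernels = ((1, 1, 1), (3, 3, 3))
pads = ((0, 0, 0), (1, 1, 1))  # chosen so both outputs keep the input size
summed = ConvReduceSketch(3, 5, kernels, pads, "sum")(x)
concatenated = ConvReduceSketch(3, 5, kernels, pads, "cat")(x)
```

With sum the channel count stays at out_channels; with cat, under the assumption above, it grows with the number of convolutions.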
-
pytorchvideo.layers.convolutions.
create_conv_2plus1d
(*, in_channels, out_channels, inner_channels=None, conv_xy_first=False, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1), bias=False, dilation=(1, 1, 1), groups=1, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)[source]¶ Create a 2plus1d conv layer. It factorizes a spatiotemporal convolution into a temporal convolution, normalization, and activation, followed by a spatial convolution (the two convolutions swap places when conv_xy_first is True).
Conv_t (or Conv_xy if conv_xy_first = True)
↓
Normalization
↓
Activation
↓
Conv_xy (or Conv_t if conv_xy_first = True)
Normalization options include: BatchNorm3d and None (no normalization). Activation options include: ReLU, Softmax, Sigmoid, and None (no activation).
- Parameters
in_channels (int) – input channel size of the convolution.
out_channels (int) – output channel size of the convolution.
kernel_size (tuple) – convolutional kernel size(s).
stride (tuple) – convolutional stride size(s).
padding (tuple) – convolutional padding size(s).
bias (bool) – convolutional bias. If true, adds a learnable bias to the output.
groups (int) – number of groups in the convolution layers. Values greater than 1 are unsupported.
dilation (tuple) – dilation value in the convolution layers. Values greater than 1 are unsupported.
conv_xy_first (bool) – If True, the spatial convolution comes before the temporal convolution.
norm (callable) – a callable that constructs normalization layer, options include nn.BatchNorm3d, None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer, options include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).
inner_channels (int) –
- Returns
(nn.Module) – 2plus1d conv layer.
- Return type
torch.nn.modules.module.Module
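The factorization above can be sketched with standard torch.nn modules. This is a minimal illustration under stated assumptions, not the library's code: the function name conv_2plus1d_sketch is hypothetical, only the default Conv_t-first ordering is shown, and splitting kernel size, stride, and padding into a temporal part (t, 1, 1) and a spatial part (1, h, w) is an assumption about how the 3D arguments are divided.

```python
import torch
import torch.nn as nn


def conv_2plus1d_sketch(in_channels, out_channels, inner_channels=None,
                        kernel_size=(3, 3, 3), stride=(2, 2, 2),
                        padding=(1, 1, 1)):
    """Hypothetical sketch: Conv_t -> BatchNorm3d -> ReLU -> Conv_xy."""
    if inner_channels is None:
        inner_channels = out_channels
    kt, kh, kw = kernel_size
    st, sh, sw = stride
    pt, ph, pw = padding
    # Temporal convolution: acts along T only.
    conv_t = nn.Conv3d(in_channels, inner_channels, kernel_size=(kt, 1, 1),
                       stride=(st, 1, 1), padding=(pt, 0, 0), bias=False)
    # Spatial convolution: acts along H and W only.
    conv_xy = nn.Conv3d(inner_channels, out_channels, kernel_size=(1, kh, kw),
                        stride=(1, sh, sw), padding=(0, ph, pw), bias=False)
    return nn.Sequential(conv_t, nn.BatchNorm3d(inner_channels), nn.ReLU(),
                         conv_xy)


layer = conv_2plus1d_sketch(in_channels=3, out_channels=4)
video = torch.randn(1, 3, 8, 16, 16)  # (batch, channels, T, H, W)
out = layer(video)
```

With the default stride of (2, 2, 2), the temporal convolution halves T and the spatial convolution halves H and W.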
-
class
pytorchvideo.layers.convolutions.
Conv2plus1d
(*, conv_t=None, norm=None, activation=None, conv_xy=None, conv_xy_first=False)[source]¶ Implementation of 2+1d Convolution by factorizing a 3D Convolution into a 1D temporal Convolution and a 2D spatial Convolution, with Normalization and Activation modules in between:
Conv_t (or Conv_xy if conv_xy_first = True)
↓
Normalization
↓
Activation
↓
Conv_xy (or Conv_t if conv_xy_first = True)
The 2+1d Convolution is used to build the R(2+1)D network.
-
__init__
(*, conv_t=None, norm=None, activation=None, conv_xy=None, conv_xy_first=False)[source]¶ - Parameters
conv_t (torch.nn.modules) – temporal convolution module.
norm (torch.nn.modules) – normalization module.
activation (torch.nn.modules) – activation module.
conv_xy (torch.nn.modules) – spatial convolution module.
conv_xy_first (bool) – If True, the spatial convolution comes before the temporal convolution.
pytorchvideo.layers.fusion¶
-
pytorchvideo.layers.fusion.
make_fusion_layer
(method, feature_dims)[source]¶ - Parameters
method (str) – the fusion method to be constructed. Options: ‘concat’, ‘temporal_concat’, ‘max’, ‘sum’, ‘prod’.
feature_dims (List[int]) – the first argument of all fusion layers. It holds a list of required feature_dims for each tensor input (where the tensor inputs are of shape (batch_size, seq_len, feature_dim)). The list order must correspond to the tensor order passed to forward(…).
-
class
pytorchvideo.layers.fusion.
ConcatFusion
(feature_dims)[source]¶ Concatenates all inputs by their last dimension. The resulting tensor last dim will be the sum of the last dimension of all input tensors.
-
property
output_dim
¶ Last dimension size of forward(..) tensor output.
-
forward
(input_list)[source]¶ - Parameters
input_list (List[torch.Tensor]) – a list of tensors of shape (batch_size, seq_len, feature_dim).
- Returns
Tensor of shape (batch_size, seq_len, sum(feature_dims)), where sum(feature_dims) is the sum of all input feature_dims.
-
class
pytorchvideo.layers.fusion.
TemporalConcatFusion
(feature_dims)[source]¶ Concatenates all inputs by their temporal dimension which is assumed to be dim=1.
-
property
output_dim
¶ Last dimension size of forward(..) tensor output.
-
forward
(input_list)[source]¶ - Parameters
input_list (List[torch.Tensor]) – a list of tensors of shape (batch_size, seq_len, feature_dim)
- Returns
Tensor of shape (batch_size, sum(seq_len), feature_dim), where sum(seq_len) is the sum of the seq_len of all input tensors.
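The output shapes documented for ConcatFusion and TemporalConcatFusion follow directly from torch.cat along the last dimension and along dim=1 respectively. The snippet below is a small sketch using torch.cat only, with hypothetical tensors; it does not call the fusion classes themselves.

```python
import torch

# Inputs of shape (batch_size, seq_len, feature_dim).
a = torch.zeros(2, 5, 8)
b = torch.zeros(2, 5, 16)  # same seq_len as a, different feature_dim
c = torch.zeros(2, 7, 8)   # same feature_dim as a, different seq_len

# Concatenating by the last dimension sums the feature dims: 8 + 16.
feature_concat = torch.cat([a, b], dim=-1)

# Concatenating by the temporal dimension (dim=1) sums the seq_lens: 5 + 7.
temporal_concat = torch.cat([a, c], dim=1)

print(feature_concat.shape, temporal_concat.shape)
```

Every dimension other than the concatenated one must match across inputs, which is why b shares seq_len with a and c shares feature_dim with a.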
-
class
pytorchvideo.layers.fusion.
ReduceFusion
(feature_dims, reduce_fn)[source]¶ Generic fusion method that takes a callable. The callable receives the list of input tensors and is expected to return a single tensor. This class can be used to implement fusion methods like “sum”, “max” and “prod”.
-
property
output_dim
¶ Last dimension size of forward(..) tensor output.
-
forward
(input_list)[source]¶ - Parameters
input_list (List[torch.Tensor]) – a list of tensors of shape (batch_size, seq_len, feature_dim).
- Returns
Tensor of shape (batch_size, seq_len, feature_dim).
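To make the reduce_fn contract concrete, here is a hedged sketch of callables that map a list of equally shaped tensors to a single tensor. The helper names (sum_fn, max_fn, prod_fn, reduce_fusion_sketch) are hypothetical; the reduce functions the library actually uses for “sum”, “max” and “prod” may be written differently.

```python
import torch


def sum_fn(tensors):
    """Element-wise sum over the list of tensors."""
    return torch.stack(tensors, dim=0).sum(dim=0)


def max_fn(tensors):
    """Element-wise maximum over the list of tensors."""
    return torch.stack(tensors, dim=0).max(dim=0).values


def prod_fn(tensors):
    """Element-wise product over the list of tensors."""
    return torch.stack(tensors, dim=0).prod(dim=0)


def reduce_fusion_sketch(input_list, reduce_fn):
    """Hypothetical stand-in: hand the whole list to the callable."""
    return reduce_fn(input_list)


x = torch.full((2, 3, 4), 2.0)  # (batch_size, seq_len, feature_dim)
y = torch.full((2, 3, 4), 3.0)
fused_sum = reduce_fusion_sketch([x, y], sum_fn)
fused_max = reduce_fusion_sketch([x, y], max_fn)
fused_prod = reduce_fusion_sketch([x, y], prod_fn)
```

Each result keeps the input shape (batch_size, seq_len, feature_dim), matching the documented return shape.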
pytorchvideo.layers.mlp¶
-
pytorchvideo.layers.mlp.
make_multilayer_perceptron
(fully_connected_dims, norm=None, mid_activation=<class 'torch.nn.modules.activation.ReLU'>, final_activation=<class 'torch.nn.modules.activation.ReLU'>, dropout_rate=0.0)[source]¶ Factory function for Multi-Layer Perceptron. These are constructed as repeated blocks of the following format, where each fc entry gives a block's input/output dimension.
Linear (in=fc[i-1], out=fc[i])
↓
Normalization (norm)
↓
Activation (mid_activation)
↓
After the repeated Perceptron blocks, a final dropout and activation layer is applied:
↓
Dropout (p=dropout_rate)
↓
Activation (final_activation)
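The block format above can be sketched as a small builder. This is an illustration that follows the description literally, under stated assumptions: the name mlp_sketch is hypothetical, norm is assumed to be a callable taking the feature dimension, and the library's actual layer ordering for the last block and its return value may differ.

```python
import torch
import torch.nn as nn


def mlp_sketch(fully_connected_dims, norm=None, mid_activation=nn.ReLU,
               final_activation=nn.ReLU, dropout_rate=0.0):
    """Hypothetical builder that mirrors the documented block format."""
    layers = []
    dims = list(fully_connected_dims)
    # Repeated block: Linear -> (norm) -> mid_activation.
    for dim_in, dim_out in zip(dims[:-1], dims[1:]):
        layers.append(nn.Linear(dim_in, dim_out))
        if norm is not None:
            layers.append(norm(dim_out))  # assumption: norm(dim) builds the layer
        if mid_activation is not None:
            layers.append(mid_activation())
    # Final dropout and activation after the repeated blocks.
    if dropout_rate > 0:
        layers.append(nn.Dropout(p=dropout_rate))
    if final_activation is not None:
        layers.append(final_activation())
    return nn.Sequential(*layers)


mlp = mlp_sketch([16, 32, 8], dropout_rate=0.1)
features = mlp(torch.randn(4, 16))
```

Here fully_connected_dims=[16, 32, 8] yields two Linear layers (16 to 32, then 32 to 8), so consecutive entries act as the input and output dimensions of each block.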
pytorchvideo.layers.nonlocal_net¶
-
class
pytorchvideo.layers.nonlocal_net.
NonLocal
(*, conv_theta, conv_phi, conv_g, conv_out, pool=None, norm=None, instantiation='dot_product')[source]¶ Builds Non-local Neural Networks as a generic family of building blocks for capturing long-range dependencies. Non-local Network computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. More details in the paper: Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. “Non-local neural networks.” In Proceedings of the IEEE conference on CVPR, 2018.
-
pytorchvideo.layers.nonlocal_net.
create_nonlocal
(*, dim_in, dim_inner, pool_size=(1, 1, 1), instantiation='softmax', norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1)[source]¶ Builds Non-local Neural Networks as a generic family of building blocks for capturing long-range dependencies. Non-local Network computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. More details in the paper: https://arxiv.org/pdf/1711.07971
- Parameters
dim_in (int) – number of dimensions of the input.
dim_inner (int) – number of dimensions inside the Non-local block.
pool_size – the kernel size of the spatiotemporal pooling, given as temporal pool kernel size, spatial pool kernel size, spatial pool kernel size, in that order. If pool_size is None, no pooling is used.
instantiation (string) – supports two different instantiation methods: “dot_product”: normalizing correlation matrix with L2. “softmax”: normalizing correlation matrix with Softmax.
norm (nn.Module) – nn.Module for the normalization layer. The default is nn.BatchNorm3d.
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
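The core idea, a response at each position computed as a weighted sum of the features at all positions, can be sketched for the “softmax” instantiation. This is a simplified illustration, not the library's block: the function name is hypothetical, the theta, phi, and g inputs are assumed to be already projected and flattened to (batch, channels, positions), and pooling, the output convolution, normalization, and the residual connection are omitted.

```python
import torch


def nonlocal_softmax_sketch(theta, phi, g):
    """Hypothetical sketch. theta, phi, g: (batch, channels, positions)."""
    # Correlation between every pair of positions (i, j):
    # shape (batch, positions, positions).
    affinity = torch.einsum("bci,bcj->bij", theta, phi)
    # Softmax over j, so the weights for each position i sum to 1.
    weights = torch.softmax(affinity, dim=-1)
    # Response at position i: weighted sum of g over all positions j.
    response = torch.einsum("bij,bcj->bci", weights, g)
    return response, weights


torch.manual_seed(0)
theta = torch.randn(2, 4, 6)
phi = torch.randn(2, 4, 6)
g = torch.randn(2, 4, 6)
response, weights = nonlocal_softmax_sketch(theta, phi, g)
```

Because every position attends to every other position, the weight matrix has positions × positions entries per sample, which is one motivation for pooling before computing the correlation.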
pytorchvideo.layers.positional_encoding¶
-
class
pytorchvideo.layers.positional_encoding.
PositionalEncoding
(embed_dim, seq_len=1024)[source]¶ Applies a positional encoding to a tensor with shape (batch_size x seq_len x embed_dim).
The positional encoding is computed as follows:
PE(pos, 2i) = sin(pos / 10000^(2i / dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i / dmodel))
where
pos = position, pos in [0, seq_len)
dmodel = data embedding dimension = embed_dim
i = dimension index, i in [0, embed_dim)
Reference: “Attention Is All You Need” https://arxiv.org/abs/1706.03762
Implementation Reference: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
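The sine/cosine formula can be evaluated directly. The following is a minimal pure-Python sketch of the table of encodings; the function name positional_encoding is hypothetical, and the library module adds such a table to its input tensor rather than returning a nested list.

```python
import math


def positional_encoding(seq_len, embed_dim):
    """Return a seq_len x embed_dim table of sinusoidal encodings."""
    table = [[0.0] * embed_dim for _ in range(seq_len)]
    for pos in range(seq_len):
        # k plays the role of 2i in the formula: even indices get the sine,
        # the following odd index gets the cosine of the same angle.
        for k in range(0, embed_dim, 2):
            angle = pos / (10000 ** (k / embed_dim))
            table[pos][k] = math.sin(angle)
            if k + 1 < embed_dim:
                table[pos][k + 1] = math.cos(angle)
    return table


table = positional_encoding(seq_len=4, embed_dim=6)
```

At pos = 0 every angle is zero, so the even entries are 0 and the odd entries are 1; later positions rotate each sine/cosine pair at a frequency that decreases with the dimension index.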
-
class
pytorchvideo.layers.positional_encoding.
SpatioTemporalClsPositionalEncoding
(embed_dim, patch_embed_shape, sep_pos_embed=False, has_cls=True)[source]¶ Add a cls token and apply a spatiotemporal encoding to a tensor.
-
__init__
(embed_dim, patch_embed_shape, sep_pos_embed=False, has_cls=True)[source]¶ - Parameters
embed_dim (int) – Embedding dimension for input sequence.
patch_embed_shape (Tuple) – The number of patches in each dimension (T, H, W) after patch embedding.
sep_pos_embed (bool) – If set to true, one positional encoding is used for spatial patches and another positional encoding is used for temporal sequence. Otherwise, only one positional encoding is used for all the patches.
has_cls (bool) – If set to true, a cls token is added in the beginning of each input sequence.
-
forward
(x)[source]¶ - Parameters
x (torch.Tensor) – Input tensor.
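A rough sketch of the has_cls behaviour, covering only the sep_pos_embed=False case: a learnable cls token is prepended to the patch sequence and one learnable positional embedding is added to every token. The class name, the zero initialization, and the input layout (batch, T*H*W, embed_dim) are assumptions for illustration, not the library's implementation.

```python
import torch
import torch.nn as nn


class ClsPosEncodingSketch(nn.Module):
    """Hypothetical sketch with a single positional embedding for all patches."""

    def __init__(self, embed_dim, patch_embed_shape, has_cls=True):
        super().__init__()
        t, h, w = patch_embed_shape
        num_tokens = t * h * w
        self.has_cls = has_cls
        if has_cls:
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            num_tokens += 1  # one extra position for the cls token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

    def forward(self, x):
        # x: (batch, T*H*W, embed_dim), an assumed input layout.
        if self.has_cls:
            cls = self.cls_token.expand(x.shape[0], -1, -1)
            x = torch.cat([cls, x], dim=1)  # cls token goes first
        return x + self.pos_embed  # broadcast over the batch dimension


patches = torch.randn(2, 2 * 3 * 3, 8)
with_cls = ClsPosEncodingSketch(8, (2, 3, 3), has_cls=True)(patches)
without_cls = ClsPosEncodingSketch(8, (2, 3, 3), has_cls=False)(patches)
```

With has_cls=True the sequence grows by one token; with has_cls=False the shape is unchanged.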
pytorchvideo.layers.swish¶
pytorchvideo.layers.squeeze_excitation¶
-
class
pytorchvideo.layers.squeeze_excitation.
SqueezeAndExcitationLayer2D
(in_planes, reduction_ratio=16, reduced_planes=None)[source]¶ 2D Squeeze and excitation layer, as per https://arxiv.org/pdf/1709.01507.pdf
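For orientation, the squeeze-and-excitation mechanism from the referenced paper can be sketched in a few lines: global average pooling per channel, a bottleneck that produces one gate per channel, and a channel-wise rescaling of the input. The class below is a hedged illustration with a hypothetical name; how the library combines reduction_ratio and reduced_planes, and which layers it uses internally, are assumptions here.

```python
import torch
import torch.nn as nn


class SqueezeExcitation2DSketch(nn.Module):
    """Hypothetical sketch of a 2D squeeze-and-excitation gate."""

    def __init__(self, in_planes, reduction_ratio=16, reduced_planes=None):
        super().__init__()
        if reduced_planes is None:
            # Assumption: derive the bottleneck width from the ratio.
            reduced_planes = max(1, in_planes // reduction_ratio)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                              # squeeze: (N, C, 1, 1)
            nn.Conv2d(in_planes, reduced_planes, kernel_size=1),  # reduce channels
            nn.ReLU(),
            nn.Conv2d(reduced_planes, in_planes, kernel_size=1),  # restore channels
            nn.Sigmoid(),                                         # gates in (0, 1)
        )

    def forward(self, x):
        # Excitation: rescale each channel by its gate.
        return x * self.gate(x)


se = SqueezeExcitation2DSketch(in_planes=32)
images = torch.randn(2, 32, 8, 8)  # (batch, channels, H, W)
scaled = se(images)
```

The output has the same shape as the input; only the per-channel magnitudes change.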
-
pytorchvideo.layers.squeeze_excitation.
create_audio_2d_squeeze_excitation_block
(dim_in, dim_out, use_se=False, se_reduction_ratio=16, branch_fusion=<function <lambda>>, conv_a_kernel_size=3, conv_a_stride=1, conv_a_padding=1, conv_b_kernel_size=3, conv_b_stride=1, conv_b_padding=1, norm=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)[source]¶ 2D residual block with squeeze excitation (SE2D). Performs a summation between an identity shortcut in branch1 and a main block in branch2. When the input and output dimensions are different, a convolution followed by a normalization is applied on the shortcut.
     Input
       |-------+
       ↓       |
     conv2d    |
       ↓       |
      Norm     |
       ↓       |
   activation  |
       ↓       |
     conv2d    |
       ↓       |
      Norm     |
       ↓       |
      SE2D     |
       ↓       }
   Summation ←-+
       ↓
   Activation
Normalization examples include: BatchNorm2d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation). Transform examples include: BottleneckBlock.
- Parameters
dim_in (int) – input channel size to the bottleneck block.
dim_out (int) – output channel size of the bottleneck.
use_se (bool) – if true, use squeeze excitation layer in the bottleneck.
se_reduction_ratio (int) – factor by which input channels should be reduced to get the output channel dimension in SE layer.
branch_fusion (callable) – a callable that constructs summation layer. Examples include: lambda x, y: x + y, OctaveSum.
conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
conv_a_stride (tuple) – convolutional stride size(s) for conv_a.
conv_a_padding (tuple) – convolutional padding(s) for conv_a.
conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
conv_b_stride (tuple) – convolutional stride size(s) for conv_b.
conv_b_padding (tuple) – convolutional padding(s) for conv_b.
norm (callable) – a callable that constructs normalization layer. Examples include nn.BatchNorm2d, None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer in bottleneck and block. Examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).
- Returns
(nn.Module) – resnet basic block layer.
- Return type
torch.nn.modules.module.Module