pytorchvideo.models.r2plus1d¶

pytorchvideo.models.r2plus1d.create_2plus1d_bottleneck_block(*, dim_in, dim_inner, dim_out, conv_a_kernel_size=(1, 1, 1), conv_a_stride=(1, 1, 1), conv_a_padding=(0, 0, 0), conv_a=<class 'torch.nn.modules.conv.Conv3d'>, conv_b_kernel_size=(3, 3, 3), conv_b_stride=(2, 2, 2), conv_b_padding=(1, 1, 1), conv_b_num_groups=1, conv_b_dilation=(1, 1, 1), conv_b=<function create_conv_2plus1d>, conv_c=<class 'torch.nn.modules.conv.Conv3d'>, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)[source]¶

2plus1d bottleneck block: a sequence of spatiotemporal Convolution, Normalization, and Activations repeated in the following order:

   Conv3d (conv_a)
          ↓
Normalization (norm_a)
          ↓
  Activation (act_a)
          ↓
 Conv(2+1)d (conv_b)
          ↓
Normalization (norm_b)
          ↓
  Activation (act_b)
          ↓
   Conv3d (conv_c)
          ↓
Normalization (norm_c)

Normalization examples include: BatchNorm3d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation).

Parameters

dim_in (int) – input channel size to the bottleneck block.
dim_inner (int) – intermediate channel size of the bottleneck.
dim_out (int) – output channel size of the bottleneck.
conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
conv_a_stride (tuple) – convolutional stride size(s) for conv_a.
conv_a_padding (tuple) – convolutional padding(s) for conv_a.
conv_a (callable) – a callable that constructs the conv_a conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
conv_b_stride (tuple) – convolutional stride size(s) for conv_b.
conv_b_padding (tuple) – convolutional padding(s) for conv_b.
conv_b_num_groups (int) – number of groups for groupwise convolution for conv_b.
conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
conv_b (callable) – a callable that constructs the conv_b conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_c (callable) – a callable that constructs the conv_c conv layer, examples include nn.Conv3d, OctaveConv, etc
norm (callable) – a callable that constructs normalization layer, examples include nn.BatchNorm3d, None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer, examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

Returns

(nn.Module) – 2plus1d bottleneck block.

Return type

torch.nn.modules.module.Module

pytorchvideo.models.r2plus1d.create_r2plus1d(*, input_channel=3, model_depth=50, model_num_class=400, dropout_rate=0.0, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(1, 7, 7), stem_conv_stride=(1, 2, 2), stage_conv_a_kernel_size=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_conv_b_kernel_size=((3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3)), stage_conv_b_num_groups=(1, 1, 1, 1), stage_conv_b_dilation=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_spatial_stride=(2, 2, 2, 2), stage_temporal_stride=(1, 1, 2, 2), stage_bottleneck=(<function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>, <function create_2plus1d_bottleneck_block>), head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 7, 7), head_output_size=(1, 1, 1), head_activation=<class 'torch.nn.modules.activation.Softmax'>, head_output_with_global_average=True)[source]¶

Build the R(2+1)D network from:: A closer look at spatiotemporal convolutions for action recognition. Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri. CVPR 2018.

R(2+1)D follows the ResNet style architecture including three parts: Stem, Stages and Head. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters

input_channel (int) – number of channels for the input video clip.
model_depth (int) – the depth of the resnet.
model_num_class (int) – the number of classes for the video dataset.
dropout_rate (float) – dropout rate.
norm (callable) – a callable that constructs normalization layer.
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer.
stem_dim_out (int) – output channel size for stem.
stem_conv_kernel_size (tuple) – convolutional kernel size(s) of stem.
stem_conv_stride (tuple) – convolutional stride size(s) of stem.
stage_conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
stage_conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
stage_conv_b_num_groups (tuple) – number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
stage_conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
stage_spatial_stride (tuple) – the spatial stride for each stage.
stage_temporal_stride (tuple) – the temporal stride for each stage.
stage_bottleneck (tuple) – a callable that constructs bottleneck block layer for each stage. Examples include: create_bottleneck_block, create_2plus1d_bottleneck_block.
head_pool (callable) – a callable that constructs resnet head pooling layer.
head_pool_kernel_size (tuple) – the pooling kernel size.
head_output_size (tuple) – the size of output tensor for head.
head_activation (callable) – a callable that constructs activation layer.
head_output_with_global_average (bool) – if True, perform global averaging on the head output.

Returns

(nn.Module) – basic resnet.

Return type

torch.nn.modules.module.Module