pytorchvideo.models.resnet

Building blocks for ResNet and ResNet-like models.

pytorchvideo.models.resnet.create_bottleneck_block(*, dim_in, dim_inner, dim_out, conv_a_kernel_size=(3, 1, 1), conv_a_stride=(2, 1, 1), conv_a_padding=(1, 0, 0), conv_a=<class 'torch.nn.modules.conv.Conv3d'>, conv_b_kernel_size=(1, 3, 3), conv_b_stride=(1, 2, 2), conv_b_padding=(0, 1, 1), conv_b_num_groups=1, conv_b_dilation=(1, 1, 1), conv_b=<class 'torch.nn.modules.conv.Conv3d'>, conv_c=<class 'torch.nn.modules.conv.Conv3d'>, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)

Bottleneck block: a sequence of spatiotemporal Convolution, Normalization, and Activations repeated in the following order:

   Conv3d (conv_a)
          ↓
Normalization (norm_a)
          ↓
  Activation (act_a)
          ↓
   Conv3d (conv_b)
          ↓
Normalization (norm_b)
          ↓
  Activation (act_b)
          ↓
   Conv3d (conv_c)
          ↓
Normalization (norm_c)

Normalization examples include: BatchNorm3d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation).

Parameters

dim_in (int) – input channel size to the bottleneck block.
dim_inner (int) – intermediate channel size of the bottleneck.
dim_out (int) – output channel size of the bottleneck.
conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
conv_a_stride (tuple) – convolutional stride size(s) for conv_a.
conv_a_padding (tuple) – convolutional padding(s) for conv_a.
conv_a (callable) – a callable that constructs the conv_a conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
conv_b_stride (tuple) – convolutional stride size(s) for conv_b.
conv_b_padding (tuple) – convolutional padding(s) for conv_b.
conv_b_num_groups (int) – number of groups for groupwise convolution for conv_b.
conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
conv_b (callable) – a callable that constructs the conv_b conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_c (callable) – a callable that constructs the conv_c conv layer, examples include nn.Conv3d, OctaveConv, etc
norm (callable) – a callable that constructs normalization layer, examples include nn.BatchNorm3d, None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer, examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

Returns

(nn.Module) – resnet bottleneck block.

Return type

torch.nn.modules.module.Module

pytorchvideo.models.resnet.create_acoustic_bottleneck_block(*, dim_in, dim_inner, dim_out, conv_a_kernel_size=(3, 1, 1), conv_a_stride=(2, 1, 1), conv_a_padding=(1, 0, 0), conv_a=<class 'torch.nn.modules.conv.Conv3d'>, conv_b_kernel_size=(1, 1, 1), conv_b_stride=(1, 1, 1), conv_b_padding=(0, 0, 0), conv_b_num_groups=1, conv_b_dilation=(1, 1, 1), conv_b=<class 'torch.nn.modules.conv.Conv3d'>, conv_c=<class 'torch.nn.modules.conv.Conv3d'>, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)

Acoustic Bottleneck block: a sequence of spatiotemporal Convolution, Normalization, and Activations repeated in the following order:

                    Conv3d (conv_a)
                           ↓
                 Normalization (norm_a)
                           ↓
                   Activation (act_a)
                           ↓
           ---------------------------------
           ↓                               ↓
Temporal Conv3d (conv_b)        Spatial Conv3d (conv_b)
           ↓                               ↓
 Normalization (norm_b)         Normalization (norm_b)
           ↓                               ↓
   Activation (act_b)              Activation (act_b)
           ↓                               ↓
           ---------------------------------
                           ↓
                    Conv3d (conv_c)
                           ↓
                 Normalization (norm_c)

Normalization examples include: BatchNorm3d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation).

Parameters

dim_in (int) – input channel size to the bottleneck block.
dim_inner (int) – intermediate channel size of the bottleneck.
dim_out (int) – output channel size of the bottleneck.
conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
conv_a_stride (tuple) – convolutional stride size(s) for conv_a.
conv_a_padding (tuple) – convolutional padding(s) for conv_a.
conv_a (callable) – a callable that constructs the conv_a conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
conv_b_stride (tuple) – convolutional stride size(s) for conv_b.
conv_b_padding (tuple) – convolutional padding(s) for conv_b.
conv_b_num_groups (int) – number of groups for groupwise convolution for conv_b.
conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
conv_b (callable) – a callable that constructs the conv_b conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_c (callable) – a callable that constructs the conv_c conv layer, examples include nn.Conv3d, OctaveConv, etc
norm (callable) – a callable that constructs normalization layer, examples include nn.BatchNorm3d, None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer, examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

Returns

(nn.Module) – resnet acoustic bottleneck block.

Return type

torch.nn.modules.module.Module

pytorchvideo.models.resnet.create_res_block(*, dim_in, dim_inner, dim_out, bottleneck, use_shortcut=False, branch_fusion=<function <lambda>>, conv_a_kernel_size=(3, 1, 1), conv_a_stride=(2, 1, 1), conv_a_padding=(1, 0, 0), conv_a=<class 'torch.nn.modules.conv.Conv3d'>, conv_b_kernel_size=(1, 3, 3), conv_b_stride=(1, 2, 2), conv_b_padding=(0, 1, 1), conv_b_num_groups=1, conv_b_dilation=(1, 1, 1), conv_b=<class 'torch.nn.modules.conv.Conv3d'>, conv_c=<class 'torch.nn.modules.conv.Conv3d'>, conv_skip=<class 'torch.nn.modules.conv.Conv3d'>, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation_bottleneck=<class 'torch.nn.modules.activation.ReLU'>, activation_block=<class 'torch.nn.modules.activation.ReLU'>)

Residual block. Performs a summation between an identity shortcut in branch1 and a main block in branch2. When the input and output dimensions differ, a convolution followed by a normalization is applied to the shortcut in branch1 so that the two branches can be summed.

  Input
    |-------+
    ↓       |
  Block     |
    ↓       |
Summation ←-+
    ↓
Activation

Normalization examples include: BatchNorm3d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation). Transform examples include: BottleneckBlock.

Parameters

dim_in (int) – input channel size to the bottleneck block.
dim_inner (int) – intermediate channel size of the bottleneck.
dim_out (int) – output channel size of the bottleneck.
bottleneck (callable) – a callable that constructs bottleneck block layer. Examples include: create_bottleneck_block.
use_shortcut (bool) – If true, use conv and norm layers in skip connection.
branch_fusion (callable) – a callable that constructs summation layer. Examples include: lambda x, y: x + y, OctaveSum.
conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
conv_a_stride (tuple) – convolutional stride size(s) for conv_a.
conv_a_padding (tuple) – convolutional padding(s) for conv_a.
conv_a (callable) – a callable that constructs the conv_a conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
conv_b_stride (tuple) – convolutional stride size(s) for conv_b.
conv_b_padding (tuple) – convolutional padding(s) for conv_b.
conv_b_num_groups (int) – number of groups for groupwise convolution for conv_b.
conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
conv_b (callable) – a callable that constructs the conv_b conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_c (callable) – a callable that constructs the conv_c conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_skip (callable) – a callable that constructs the conv_skip conv layer, examples include nn.Conv3d, OctaveConv, etc
norm (callable) – a callable that constructs normalization layer. Examples include nn.BatchNorm3d, None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation_bottleneck (callable) – a callable that constructs activation layer in bottleneck. Examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).
activation_block (callable) – a callable that constructs activation layer used at the end of the block. Examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

Returns

(nn.Module) – resnet basic block layer.

Return type

torch.nn.modules.module.Module

pytorchvideo.models.resnet.create_res_stage(*, depth, dim_in, dim_inner, dim_out, bottleneck, conv_a_kernel_size=(3, 1, 1), conv_a_stride=(2, 1, 1), conv_a_padding=(1, 0, 0), conv_a=<class 'torch.nn.modules.conv.Conv3d'>, conv_b_kernel_size=(1, 3, 3), conv_b_stride=(1, 2, 2), conv_b_padding=(0, 1, 1), conv_b_num_groups=1, conv_b_dilation=(1, 1, 1), conv_b=<class 'torch.nn.modules.conv.Conv3d'>, conv_c=<class 'torch.nn.modules.conv.Conv3d'>, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, norm_eps=1e-05, norm_momentum=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>)

Create Residual Stage, which composes sequential blocks that make up a ResNet. These blocks could be, for example, Residual blocks, Non-Local layers, or Squeeze-Excitation layers.

 Input
    ↓
ResBlock
    ↓
    .
    .
    .
    ↓
ResBlock

Normalization examples include: BatchNorm3d and None (no normalization). Activation examples include: ReLU, Softmax, Sigmoid, and None (no activation). Bottleneck examples include: create_bottleneck_block.

Parameters

depth (int) – number of blocks to create.
dim_in (int) – input channel size to the bottleneck block.
dim_inner (int) – intermediate channel size of the bottleneck.
dim_out (int) – output channel size of the bottleneck.
bottleneck (callable) – a callable that constructs bottleneck block layer. Examples include: create_bottleneck_block.
conv_a_kernel_size (tuple or list of tuple) – convolutional kernel size(s) for conv_a. If conv_a_kernel_size is a tuple, use it for all blocks in the stage. If conv_a_kernel_size is a list of tuples, the kernel sizes are repeated until the list has the same length as the depth of the stage. For example, for conv_a_kernel_size = [(3, 1, 1), (1, 1, 1)], the kernel sizes for the first 6 blocks would be [(3, 1, 1), (1, 1, 1), (3, 1, 1), (1, 1, 1), (3, 1, 1), (1, 1, 1)].
conv_a_stride (tuple) – convolutional stride size(s) for conv_a.
conv_a_padding (tuple or list of tuple) – convolutional padding(s) for conv_a. If conv_a_padding is a tuple, use it for all blocks in the stage. If conv_a_padding is a list of tuples, the padding sizes are repeated until the list has the same length as the depth of the stage.
conv_a (callable) – a callable that constructs the conv_a conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
conv_b_stride (tuple) – convolutional stride size(s) for conv_b.
conv_b_padding (tuple) – convolutional padding(s) for conv_b.
conv_b_num_groups (int) – number of groups for groupwise convolution for conv_b.
conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
conv_b (callable) – a callable that constructs the conv_b conv layer, examples include nn.Conv3d, OctaveConv, etc
conv_c (callable) – a callable that constructs the conv_c conv layer, examples include nn.Conv3d, OctaveConv, etc
norm (callable) – a callable that constructs normalization layer. Examples include nn.BatchNorm3d, and None (not performing normalization).
norm_eps (float) – normalization epsilon.
norm_momentum (float) – normalization momentum.
activation (callable) – a callable that constructs activation layer. Examples include: nn.ReLU, nn.Softmax, nn.Sigmoid, and None (not performing activation).

Returns

(nn.Module) – resnet basic stage layer.

Return type

torch.nn.modules.module.Module
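
The repetition rule described for conv_a_kernel_size and conv_a_padding can be sketched in plain Python. The helper name expand_per_block is hypothetical and is not part of pytorchvideo; it only illustrates the documented behavior of cycling a list of tuples until it covers depth blocks.

```python
def expand_per_block(value, depth):
    """Repeat a per-block setting until it covers `depth` blocks.

    Hypothetical helper for illustration only, not a pytorchvideo function.
    """
    # A single tuple of ints applies to every block in the stage.
    if isinstance(value[0], int):
        value = [value]
    # Cycle through the list until `depth` entries have been produced.
    return [tuple(value[i % len(value)]) for i in range(depth)]


kernels = expand_per_block([(3, 1, 1), (1, 1, 1)], depth=6)
single = expand_per_block((3, 1, 1), depth=3)
print(kernels)
print(single)
```

Here kernels alternates between the two kernel sizes across six blocks, and single repeats the one tuple three times.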

pytorchvideo.models.resnet.create_resnet(*, input_channel=3, model_depth=50, model_num_class=400, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(3, 7, 7), stem_conv_stride=(1, 2, 2), stem_pool=<class 'torch.nn.modules.pooling.MaxPool3d'>, stem_pool_kernel_size=(1, 3, 3), stem_pool_stride=(1, 2, 2), stem=<function create_res_basic_stem>, stage1_pool=None, stage1_pool_kernel_size=(2, 1, 1), stage_conv_a_kernel_size=((1, 1, 1), (1, 1, 1), (3, 1, 1), (3, 1, 1)), stage_conv_b_kernel_size=((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)), stage_conv_b_num_groups=(1, 1, 1, 1), stage_conv_b_dilation=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), stage_spatial_h_stride=(1, 2, 2, 2), stage_spatial_w_stride=(1, 2, 2, 2), stage_temporal_stride=(1, 1, 1, 1), bottleneck=<function create_bottleneck_block>, head=<function create_res_basic_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 7, 7), head_output_size=(1, 1, 1), head_activation=None, head_output_with_global_average=True)

Build ResNet style models for video recognition. ResNet has three parts: Stem, Stages and Head. Stem is the first Convolution layer (Conv1) with an optional pooling layer. Stages are grouped residual blocks. There are usually multiple stages and each stage may include multiple residual blocks. Head may include pooling, dropout, a fully-connected layer and global spatial temporal averaging. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters

input_channel (int) – number of channels for the input video clip.
model_depth (int) – the depth of the resnet. Options include: 50, 101, 152.
model_num_class (int) – the number of classes for the video dataset.
dropout_rate (float) – dropout rate.
norm (callable) – a callable that constructs normalization layer.
activation (callable) – a callable that constructs activation layer.
stem_dim_out (int) – output channel size to stem.
stem_conv_kernel_size (tuple) – convolutional kernel size(s) of stem.
stem_conv_stride (tuple) – convolutional stride size(s) of stem.
stem_pool (callable) – a callable that constructs the stem pooling layer.
stem_pool_kernel_size (tuple) – pooling kernel size(s).
stem_pool_stride (tuple) – pooling stride size(s).
stem (callable) – a callable that constructs stem layer. Examples include: create_res_basic_stem.
stage_conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
stage_conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
stage_conv_b_num_groups (tuple) – number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
stage_conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
stage_spatial_h_stride (tuple) – the spatial height stride for each stage.
stage_spatial_w_stride (tuple) – the spatial width stride for each stage.
stage_temporal_stride (tuple) – the temporal stride for each stage.
bottleneck (callable) – a callable that constructs bottleneck block layer. Examples include: create_bottleneck_block.
head (callable) – a callable that constructs the resnet-style head. Ex: create_res_basic_head
head_pool (callable) – a callable that constructs resnet head pooling layer.
head_pool_kernel_size (tuple) – the pooling kernel size.
head_output_size (tuple) – the size of output tensor for head.
head_activation (callable) – a callable that constructs activation layer.
head_output_with_global_average (bool) – if True, perform global averaging on the head output.
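stage1_pool (Callable) – a callable that constructs the pooling layer for stage 1, or None (the default) for no pooling.
stage1_pool_kernel_size (Tuple[int]) – kernel size of the stage 1 pooling layer.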

Returns

(nn.Module) – basic resnet.

Return type

torch.nn.modules.module.Module

pytorchvideo.models.resnet.create_resnet_with_roi_head(*, input_channel=3, model_depth=50, model_num_class=80, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(1, 7, 7), stem_conv_stride=(1, 2, 2), stem_pool=<class 'torch.nn.modules.pooling.MaxPool3d'>, stem_pool_kernel_size=(1, 3, 3), stem_pool_stride=(1, 2, 2), stem=<function create_res_basic_stem>, stage1_pool=None, stage1_pool_kernel_size=(2, 1, 1), stage_conv_a_kernel_size=((1, 1, 1), (1, 1, 1), (3, 1, 1), (3, 1, 1)), stage_conv_b_kernel_size=((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)), stage_conv_b_num_groups=(1, 1, 1, 1), stage_conv_b_dilation=((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 2, 2)), stage_spatial_h_stride=(1, 2, 2, 1), stage_spatial_w_stride=(1, 2, 2, 1), stage_temporal_stride=(1, 1, 1, 1), bottleneck=<function create_bottleneck_block>, head=<function create_res_roi_pooling_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 1, 1), head_output_size=(1, 1, 1), head_activation=<class 'torch.nn.modules.activation.Sigmoid'>, head_output_with_global_average=False, head_spatial_resolution=(7, 7), head_spatial_scale=0.0625, head_sampling_ratio=0)

Build ResNet style models for video detection. ResNet has three parts: Stem, Stages and Head. Stem is the first Convolution layer (Conv1) with an optional pooling layer. Stages are grouped residual blocks. There are usually multiple stages and each stage may include multiple residual blocks. Head may include pooling, dropout, a fully-connected layer and global spatial temporal averaging. The three parts are assembled in the following order:

Input Clip    Input Bounding Boxes
  ↓                       ↓
Stem                      ↓
  ↓                       ↓
Stage 1                   ↓
  ↓                       ↓
  .                       ↓
  .                       ↓
  .                       ↓
  ↓                       ↓
Stage N                   ↓
  ↓--------> Head <-------↓

Parameters

input_channel (int) – number of channels for the input video clip.
model_depth (int) – the depth of the resnet. Options include: 50, 101, 152.
model_num_class (int) – the number of classes for the video dataset.
dropout_rate (float) – dropout rate.
norm (callable) – a callable that constructs normalization layer.
activation (callable) – a callable that constructs activation layer.
stem_dim_out (int) – output channel size to stem.
stem_conv_kernel_size (tuple) – convolutional kernel size(s) of stem.
stem_conv_stride (tuple) – convolutional stride size(s) of stem.
stem_pool (callable) – a callable that constructs the stem pooling layer.
stem_pool_kernel_size (tuple) – pooling kernel size(s).
stem_pool_stride (tuple) – pooling stride size(s).
stem (callable) – a callable that constructs stem layer. Examples include: create_res_basic_stem.
stage_conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
stage_conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
stage_conv_b_num_groups (tuple) – number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
stage_conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
stage_spatial_h_stride (tuple) – the spatial height stride for each stage.
stage_spatial_w_stride (tuple) – the spatial width stride for each stage.
stage_temporal_stride (tuple) – the temporal stride for each stage.
bottleneck (callable) – a callable that constructs bottleneck block layer. Examples include: create_bottleneck_block.
head (callable) – a callable that constructs the detection head which can take in the additional input of bounding boxes. Ex: create_res_roi_pooling_head
head_pool (callable) – a callable that constructs resnet head pooling layer.
head_pool_kernel_size (tuple) – the pooling kernel size.
head_output_size (tuple) – the size of output tensor for head.
head_activation (callable) – a callable that constructs activation layer.
head_output_with_global_average (bool) – if True, perform global averaging on the head output.
head_spatial_resolution (tuple) – h, w sizes of the RoI interpolation.
head_spatial_scale (float) – scale the input boxes by this number.
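head_sampling_ratio (int) – number of input samples to take for each output sample interpolation. 0 to take samples densely.
stage1_pool (Callable) – a callable that constructs the pooling layer for stage 1, or None (the default) for no pooling.
stage1_pool_kernel_size (Tuple[int]) – kernel size of the stage 1 pooling layer.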

Returns

(nn.Module) – basic resnet.

Return type

torch.nn.modules.module.Module
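
One way to read the default head_spatial_scale=0.0625 is as the inverse of the backbone's overall spatial stride under the default arguments. The arithmetic below is a sketch of that interpretation using only the defaults shown in the signature; it does not call the library.

```python
from math import prod

# Spatial (height/width) strides taken from the default arguments.
stem_conv_stride_hw = 2               # stem_conv_stride=(1, 2, 2)
stem_pool_stride_hw = 2               # stem_pool_stride=(1, 2, 2)
stage_spatial_stride = (1, 2, 2, 1)   # stage_spatial_h_stride / stage_spatial_w_stride

# Overall downsampling factor between input pixels and the final feature map.
total_stride = stem_conv_stride_hw * stem_pool_stride_hw * prod(stage_spatial_stride)

# Scale that maps box coordinates in input pixels onto the feature map.
spatial_scale = 1 / total_stride
print(total_stride, spatial_scale)
```

Under this reading, changing the stage strides or the stem pooling would call for a matching change to head_spatial_scale so that boxes still line up with the feature map.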

pytorchvideo.models.resnet.create_acoustic_resnet(*, input_channel=1, model_depth=50, model_num_class=400, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_dim_out=64, stem_conv_kernel_size=(9, 1, 9), stem_conv_stride=(1, 1, 3), stem_pool=None, stem_pool_kernel_size=(3, 1, 3), stem_pool_stride=(2, 1, 2), stem=<function create_acoustic_res_basic_stem>, stage1_pool=None, stage1_pool_kernel_size=(2, 1, 1), stage_conv_a_kernel_size=(3, 1, 1), stage_conv_b_kernel_size=(3, 1, 3), stage_conv_b_num_groups=(1, 1, 1, 1), stage_conv_b_dilation=(1, 1, 1), stage_spatial_h_stride=(1, 1, 1, 1), stage_spatial_w_stride=(1, 2, 2, 2), stage_temporal_stride=(1, 2, 2, 2), bottleneck=(<function create_acoustic_bottleneck_block>, <function create_acoustic_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>), head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_size=(4, 1, 2), head_output_size=(1, 1, 1), head_activation=None, head_output_with_global_average=True)

Build ResNet style models for acoustic recognition. ResNet has three parts: Stem, Stages and Head. Stem is the first Convolution layer (Conv1) with an optional pooling layer. Stages are grouped residual blocks. There are usually multiple stages and each stage may include multiple residual blocks. Head may include pooling, dropout, a fully-connected layer and global spatial temporal averaging. The three parts are assembled in the following order:

Input
  ↓
Stem
  ↓
Stage 1
  ↓
  .
  .
  .
  ↓
Stage N
  ↓
Head

Parameters

input_channel (int) – number of channels for the input spectrogram.
model_depth (int) – the depth of the resnet. Options include: 50, 101, 152.
model_num_class (int) – the number of classes for the audio dataset.
dropout_rate (float) – dropout rate.
norm (callable) – a callable that constructs normalization layer.
activation (callable) – a callable that constructs activation layer.
stem_dim_out (int) – output channel size to stem.
stem_conv_kernel_size (tuple) – convolutional kernel size(s) of stem.
stem_conv_stride (tuple) – convolutional stride size(s) of stem.
stem_pool (callable) – a callable that constructs the stem pooling layer.
stem_pool_kernel_size (tuple) – pooling kernel size(s).
stem_pool_stride (tuple) – pooling stride size(s).
stem (callable) – a callable that constructs stem layer. Examples include: create_acoustic_res_basic_stem.
stage_conv_a_kernel_size (tuple) – convolutional kernel size(s) for conv_a.
stage_conv_b_kernel_size (tuple) – convolutional kernel size(s) for conv_b.
stage_conv_b_num_groups (tuple) – number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
stage_conv_b_dilation (tuple) – dilation for 3D convolution for conv_b.
stage_spatial_h_stride (tuple) – the spatial height stride for each stage.
stage_spatial_w_stride (tuple) – the spatial width stride for each stage.
stage_temporal_stride (tuple) – the temporal stride for each stage.
bottleneck (callable) – a callable that constructs bottleneck block layer. Examples include: create_bottleneck_block.
head_pool (callable) – a callable that constructs resnet head pooling layer.
head_pool_kernel_size (tuple) – the pooling kernel size.
head_output_size (tuple) – the size of output tensor for head.
head_activation (callable) – a callable that constructs activation layer.
head_output_with_global_average (bool) – if True, perform global averaging on the head output.
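stage1_pool (Callable) – a callable that constructs the pooling layer for stage 1, or None (the default) for no pooling.
stage1_pool_kernel_size (Tuple[int]) – kernel size of the stage 1 pooling layer.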

Returns

(nn.Module) – audio resnet that takes spectrogram image input of shape (B, C, T, 1, F), where T is the time dimension and F is the frequency dimension.

Return type

torch.nn.modules.module.Module

class pytorchvideo.models.resnet.ResBlock(branch1_conv=None, branch1_norm=None, branch2=None, activation=None, branch_fusion=None)

Residual block. Performs a summation between an identity shortcut in branch1 and a main block in branch2. When the input and output dimensions differ, a convolution followed by a normalization is applied to the shortcut in branch1 so that the two branches can be summed.

  Input
    |-------+
    ↓       |
  Block     |
    ↓       |
Summation ←-+
    ↓
Activation

The builder can be found in create_res_block.

__init__(branch1_conv=None, branch1_norm=None, branch2=None, activation=None, branch_fusion=None)

Parameters

branch1_conv (torch.nn.modules) – convolutional module in branch1.
branch1_norm (torch.nn.modules) – normalization module in branch1.
branch2 (torch.nn.modules) – bottleneck block module in branch2.
activation (torch.nn.modules) – activation module.
branch_fusion (Callable) – a callable or layer that combines branch1 and branch2.

Return type

torch.nn.modules.module.Module

class pytorchvideo.models.resnet.SeparableBottleneckBlock(*, conv_a, norm_a, act_a, conv_b, norm_b, act_b, conv_c, norm_c, reduce_method='sum')

Separable Bottleneck block: a sequence of spatiotemporal Convolution, Normalization, and Activations. conv_b, norm_b, and act_b each require a tuple of modules, whose convolution, normalization, and activation steps run as separate parallel branches. The layers are applied in the following order:

      Conv3d (conv_a)
             ↓
   Normalization (norm_a)
             ↓
     Activation (act_a)
             ↓
   Conv3d(s) (conv_b), ...
           ↓ (↓)
Normalization(s) (norm_b), ...
           ↓ (↓)
   Activation(s) (act_b), ...
           ↓ (↓)
    Reduce (sum or cat)
             ↓
      Conv3d (conv_c)
             ↓
   Normalization (norm_c)

__init__(*, conv_a, norm_a, act_a, conv_b, norm_b, act_b, conv_c, norm_c, reduce_method='sum')

Parameters

conv_a (torch.nn.modules) – convolutional module.
norm_a (torch.nn.modules) – normalization module.
act_a (torch.nn.modules) – activation module.
conv_b (torch.nn.modules_list) – convolutional module(s).
norm_b (torch.nn.modules_list) – normalization module(s).
act_b (torch.nn.modules_list) – activation module(s).
conv_c (torch.nn.modules) – convolutional module.
norm_c (torch.nn.modules) – normalization module.
reduce_method (str) – if multiple conv_b modules are used, reduce their outputs with 'sum' or 'cat'.

Return type

None
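
The two reduce_method options differ in the channel count they hand to conv_c. The sketch below uses plain PyTorch, not the class itself, to illustrate the idea with one temporal and one spatial branch. The layer sizes are illustrative, and it assumes 'cat' concatenates along the channel dimension.

```python
import torch
from torch import nn

x = torch.randn(1, 16, 4, 8, 8)  # (batch, channels, time, height, width)

# Two parallel conv_b branches that preserve the input shape.
temporal = nn.Conv3d(16, 16, kernel_size=(3, 1, 1), padding=(1, 0, 0))
spatial = nn.Conv3d(16, 16, kernel_size=(1, 3, 3), padding=(0, 1, 1))

with torch.no_grad():
    outputs = [temporal(x), spatial(x)]
    # 'sum': element-wise addition, channel count unchanged.
    summed = torch.stack(outputs, dim=0).sum(dim=0)
    # 'cat': concatenation along channels, channel count doubles.
    concatenated = torch.cat(outputs, dim=1)

print(tuple(summed.shape), tuple(concatenated.shape))
```

With concatenation, the following conv_c would need its input channel count to equal the combined channels of all branches.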

class pytorchvideo.models.resnet.BottleneckBlock(*, conv_a=None, norm_a=None, act_a=None, conv_b=None, norm_b=None, act_b=None, conv_c=None, norm_c=None)

Bottleneck block: a sequence of spatiotemporal Convolution, Normalization, and Activations repeated in the following order:

   Conv3d (conv_a)
          ↓
Normalization (norm_a)
          ↓
  Activation (act_a)
          ↓
   Conv3d (conv_b)
          ↓
Normalization (norm_b)
          ↓
  Activation (act_b)
          ↓
   Conv3d (conv_c)
          ↓
Normalization (norm_c)

The builder can be found in create_bottleneck_block.

__init__(*, conv_a=None, norm_a=None, act_a=None, conv_b=None, norm_b=None, act_b=None, conv_c=None, norm_c=None)

Parameters

conv_a (torch.nn.modules) – convolutional module.
norm_a (torch.nn.modules) – normalization module.
act_a (torch.nn.modules) – activation module.
conv_b (torch.nn.modules) – convolutional module.
norm_b (torch.nn.modules) – normalization module.
act_b (torch.nn.modules) – activation module.
conv_c (torch.nn.modules) – convolutional module.
norm_c (torch.nn.modules) – normalization module.

Return type

None

class pytorchvideo.models.resnet.ResStage(res_blocks)

ResStage composes sequential blocks that make up a ResNet. These blocks could be, for example, Residual blocks, Non-Local layers, or Squeeze-Excitation layers.

 Input
    ↓
ResBlock
    ↓
    .
    .
    .
    ↓
ResBlock

The builder can be found in create_res_stage.

__init__(res_blocks)

Parameters

res_blocks (torch.nn.module_list) – ResBlock module(s).

Return type

torch.nn.modules.module.Module