pytorchvideo.models.slowfast

pytorchvideo.models.slowfast.create_slowfast(*, slowfast_channel_reduction_ratio=(8, ), slowfast_conv_channel_fusion_ratio=2, slowfast_fusion_conv_kernel_size=(7, 1, 1), slowfast_fusion_conv_stride=(4, 1, 1), fusion_builder=None, input_channels=(3, 3), model_depth=50, model_num_class=400, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_function=(<function create_res_basic_stem>, <function create_res_basic_stem>), stem_dim_outs=(64, 8), stem_conv_kernel_sizes=((1, 7, 7), (5, 7, 7)), stem_conv_strides=((1, 2, 2), (1, 2, 2)), stem_pool=(<class 'torch.nn.modules.pooling.MaxPool3d'>, <class 'torch.nn.modules.pooling.MaxPool3d'>), stem_pool_kernel_sizes=((1, 3, 3), (1, 3, 3)), stem_pool_strides=((1, 2, 2), (1, 2, 2)), stage_conv_a_kernel_sizes=(((1, 1, 1), (1, 1, 1), (3, 1, 1), (3, 1, 1)), ((3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 1, 1))), stage_conv_b_kernel_sizes=(((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)), ((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3))), stage_conv_b_num_groups=((1, 1, 1, 1), (1, 1, 1, 1)), stage_conv_b_dilations=(((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)), ((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1))), stage_spatial_strides=((1, 2, 2, 2), (1, 2, 2, 2)), stage_temporal_strides=((1, 1, 1, 1), (1, 1, 1, 1)), bottleneck=((<function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>), (<function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>)), head=<function create_res_basic_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_sizes=((8, 7, 7), (32, 7, 7)), head_output_size=(1, 1, 1), head_activation=None, head_output_with_global_average=True)

Build a SlowFast model for video recognition. SlowFast combines a Slow pathway, operating at a low frame rate to capture spatial semantics, with a Fast pathway, operating at a high frame rate to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can still learn useful temporal information for video recognition. Details can be found in the paper:

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. “SlowFast networks for video recognition.” https://arxiv.org/pdf/1812.03982.pdf

Slow Input  Fast Input
     ↓           ↓
    Stem       Stem
     ↓ ⭠ Fusion- ↓
  Stage 1     Stage 1
     ↓ ⭠ Fusion- ↓
     .           .
     ↓           ↓
  Stage N     Stage N
     ↓ ⭠ Fusion- ↓
            ↓
          Head
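The two inputs in the diagram are the same clip sampled at different frame rates. As a rough sketch only, assuming the Fast pathway receives 32 frames and the Slow pathway 8 of them (a ratio of 4, inferred here from the temporal extents 8 and 32 in the default head_pool_kernel_sizes), the Slow frame indices could be picked as follows. The helper name and the `alpha` parameter are hypothetical and not part of the library.

```python
from typing import List


def slow_pathway_indices(num_fast_frames: int, alpha: int) -> List[int]:
    """Pick evenly spaced frame indices for the Slow pathway.

    Hypothetical helper: `alpha` is the assumed Fast-to-Slow frame-rate ratio.
    """
    num_slow_frames = num_fast_frames // alpha
    if num_slow_frames < 1:
        raise ValueError("alpha must not exceed the number of Fast frames")
    if num_slow_frames == 1:
        return [0]
    # Integer arithmetic spreads the Slow frames from the first to the last
    # Fast frame without floating-point rounding surprises.
    last = num_fast_frames - 1
    return [(i * last) // (num_slow_frames - 1) for i in range(num_slow_frames)]


fast_indices = list(range(32))                     # Fast pathway: every frame
slow_indices = slow_pathway_indices(32, alpha=4)   # Slow pathway: 8 frames
print(slow_indices)
```

Both pathways therefore cover the same time span; only the sampling density differs.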

Parameters

slowfast_channel_reduction_ratio (int) – Corresponds to the inverse of the channel reduction ratio, $\beta$, between the Slow and Fast pathways.
slowfast_conv_channel_fusion_ratio (int) – Ratio of channel dimensions between the Slow and Fast pathways.
slowfast_fusion_conv_kernel_size (DEPRECATED) – the convolutional kernel size used for fusion.
slowfast_fusion_conv_stride (DEPRECATED) – the convolutional stride size used for fusion.
fusion_builder (Callable[[int, int], nn.Module]) – Builder function for generating the fusion modules based on stage dimension and index
input_channels (tuple) – number of channels for the input video clip.
model_depth (int) – the depth of the resnet.
model_num_class (int) – the number of classes for the video dataset.
dropout_rate (float) – dropout rate.
norm (callable) – a callable that constructs normalization layer.
activation (callable) – a callable that constructs activation layer.
stem_function (Tuple[Callable]) – callables that construct the stem layer, e.g. create_res_basic_stem. Indexed by pathway.
stem_dim_outs (tuple) – output channel size of the stem, per pathway.
stem_conv_kernel_sizes (tuple) – convolutional kernel size(s) of stem.
stem_conv_strides (tuple) – convolutional stride size(s) of stem.
stem_pool (Tuple[Callable]) – callables that construct the stem pooling layer. Indexed by pathway.
stem_pool_kernel_sizes (tuple) – pooling kernel size(s).
stem_pool_strides (tuple) – pooling stride size(s).
stage_conv_a_kernel_sizes (tuple) – convolutional kernel size(s) for conv_a.
stage_conv_b_kernel_sizes (tuple) – convolutional kernel size(s) for conv_b.
stage_conv_b_num_groups (tuple) – number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
stage_conv_b_dilations (tuple) – dilation for 3D convolution for conv_b.
stage_spatial_strides (tuple) – the spatial stride for each stage.
stage_temporal_strides (tuple) – the temporal stride for each stage.
bottleneck (Tuple[Tuple[Callable]]) – callables that construct the bottleneck block, e.g. create_bottleneck_block. Indexed by pathway and stage index.
head (callable) – a callable that constructs the resnet-style head. Ex: create_res_basic_head
head_pool (callable) – a callable that constructs resnet head pooling layer.
head_output_size (Tuple[int]) – the size of the output tensor for the head.
head_activation (callable) – a callable that constructs activation layer.
head_output_with_global_average (bool) – if True, perform global averaging on the head output.
head_pool_kernel_sizes (Tuple[Tuple[int]]) – kernel size(s) for head_pool, one per pathway.
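The channel-related defaults fit together arithmetically. The sketch below works through the stem widths, assuming that the fusion convolution outputs slowfast_conv_channel_fusion_ratio times the Fast channel count and that its output is concatenated onto the Slow features. That reading is an interpretation of the parameter descriptions above, not something this page states outright.

```python
# Defaults taken from the create_slowfast signature.
stem_dim_out_slow = 64
slowfast_channel_reduction_ratio = 8      # inverse of beta
slowfast_conv_channel_fusion_ratio = 2

# Fast pathway width: the Slow width divided by the reduction ratio.
stem_dim_out_fast = stem_dim_out_slow // slowfast_channel_reduction_ratio

# Assumption: the fusion convolution produces fusion_ratio x Fast channels,
# which are then concatenated onto the Slow pathway's features.
fusion_channels = stem_dim_out_fast * slowfast_conv_channel_fusion_ratio
slow_channels_after_fusion = stem_dim_out_slow + fusion_channels

print(stem_dim_out_fast, fusion_channels, slow_channels_after_fusion)
# prints: 8 16 80
```

The computed Fast width of 8 matches the second entry of the default stem_dim_outs=(64, 8).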

Returns

(nn.Module) – SlowFast model.

Return type

torch.nn.modules.module.Module

pytorchvideo.models.slowfast.create_slowfast_with_roi_head(*, slowfast_channel_reduction_ratio=(8, ), slowfast_conv_channel_fusion_ratio=2, slowfast_fusion_conv_kernel_size=(7, 1, 1), slowfast_fusion_conv_stride=(4, 1, 1), fusion_builder=None, input_channels=(3, 3), model_depth=50, model_num_class=80, dropout_rate=0.5, norm=<class 'torch.nn.modules.batchnorm.BatchNorm3d'>, activation=<class 'torch.nn.modules.activation.ReLU'>, stem_function=(<function create_res_basic_stem>, <function create_res_basic_stem>), stem_dim_outs=(64, 8), stem_conv_kernel_sizes=((1, 7, 7), (5, 7, 7)), stem_conv_strides=((1, 2, 2), (1, 2, 2)), stem_pool=(<class 'torch.nn.modules.pooling.MaxPool3d'>, <class 'torch.nn.modules.pooling.MaxPool3d'>), stem_pool_kernel_sizes=((1, 3, 3), (1, 3, 3)), stem_pool_strides=((1, 2, 2), (1, 2, 2)), stage_conv_a_kernel_sizes=(((1, 1, 1), (1, 1, 1), (3, 1, 1), (3, 1, 1)), ((3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 1, 1))), stage_conv_b_kernel_sizes=(((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)), ((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3))), stage_conv_b_num_groups=((1, 1, 1, 1), (1, 1, 1, 1)), stage_conv_b_dilations=(((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 2, 2)), ((1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 2, 2))), stage_spatial_strides=((1, 2, 2, 1), (1, 2, 2, 1)), stage_temporal_strides=((1, 1, 1, 1), (1, 1, 1, 1)), bottleneck=((<function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>), (<function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>, <function create_bottleneck_block>)), head=<function create_res_roi_pooling_head>, head_pool=<class 'torch.nn.modules.pooling.AvgPool3d'>, head_pool_kernel_sizes=((8, 1, 1), (32, 1, 1)), head_output_size=(1, 1, 1), head_activation=<class 'torch.nn.modules.activation.Sigmoid'>, head_output_with_global_average=False, head_spatial_resolution=(7, 7), head_spatial_scale=0.0625, 
head_sampling_ratio=0)

Build a SlowFast model for video detection. SlowFast combines a Slow pathway, operating at a low frame rate to capture spatial semantics, with a Fast pathway, operating at a high frame rate to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can still learn useful temporal information for video recognition. Details can be found in the paper:

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. “SlowFast networks for video recognition.” https://arxiv.org/pdf/1812.03982.pdf

Slow Input  Fast Input         Bounding Box Input
    ↓           ↓                      ↓
   Stem       Stem                     ↓
    ↓ ⭠ Fusion- ↓                     ↓
  Stage 1     Stage 1                  ↓
    ↓ ⭠ Fusion- ↓                     ↓
    .           .                      ↓
    ↓           ↓                      ↓
  Stage N     Stage N                  ↓
    ↓ ⭠ Fusion- ↓                     ↓
            ↓                          ↓
            ↓----------> Head <--------↓

Parameters

slowfast_channel_reduction_ratio (int) – Corresponds to the inverse of the channel reduction ratio, $\beta$, between the Slow and Fast pathways.
slowfast_conv_channel_fusion_ratio (int) – Ratio of channel dimensions between the Slow and Fast pathways.
slowfast_fusion_conv_kernel_size (DEPRECATED) – the convolutional kernel size used for fusion.
slowfast_fusion_conv_stride (DEPRECATED) – the convolutional stride size used for fusion.
fusion_builder (Callable[[int, int], nn.Module]) – Builder function for generating the fusion modules based on stage dimension and index
input_channels (tuple) – number of channels for the input video clip.
model_depth (int) – the depth of the resnet.
model_num_class (int) – the number of classes for the video dataset.
dropout_rate (float) – dropout rate.
norm (callable) – a callable that constructs normalization layer.
activation (callable) – a callable that constructs activation layer.
stem_function (Tuple[Callable]) – callables that construct the stem layer, e.g. create_res_basic_stem. Indexed by pathway.
stem_dim_outs (tuple) – output channel size of the stem, per pathway.
stem_conv_kernel_sizes (tuple) – convolutional kernel size(s) of stem.
stem_conv_strides (tuple) – convolutional stride size(s) of stem.
stem_pool (Tuple[Callable]) – callables that construct the stem pooling layer. Indexed by pathway.
stem_pool_kernel_sizes (tuple) – pooling kernel size(s).
stem_pool_strides (tuple) – pooling stride size(s).
stage_conv_a_kernel_sizes (tuple) – convolutional kernel size(s) for conv_a.
stage_conv_b_kernel_sizes (tuple) – convolutional kernel size(s) for conv_b.
stage_conv_b_num_groups (tuple) – number of groups for groupwise convolution for conv_b. 1 for ResNet, and larger than 1 for ResNeXt.
stage_conv_b_dilations (tuple) – dilation for 3D convolution for conv_b.
stage_spatial_strides (tuple) – the spatial stride for each stage.
stage_temporal_strides (tuple) – the temporal stride for each stage.
bottleneck (Tuple[Tuple[Callable]]) – callables that construct the bottleneck block, e.g. create_bottleneck_block. Indexed by pathway and stage index.
head (callable) – a callable that constructs the detection head, which can take bounding boxes as an additional input. Ex: create_res_roi_pooling_head
head_pool (callable) – a callable that constructs resnet head pooling layer.
head_output_size (Tuple[int]) – the size of the output tensor for the head.
head_activation (callable) – a callable that constructs activation layer.
head_output_with_global_average (bool) – if True, perform global averaging on the head output.
head_spatial_resolution (tuple) – h, w sizes of the RoI interpolation.
head_spatial_scale (float) – scale the input boxes by this number.
head_sampling_ratio (int) – number of input samples to take for each output sample interpolation. 0 to take samples densely.
head_pool_kernel_sizes (Tuple[Tuple[int]]) – kernel size(s) for head_pool, one per pathway.
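The default head_spatial_scale of 0.0625 appears consistent with the network's total spatial downsampling under the detection defaults. The sketch below multiplies the spatial strides from the signature and applies the resulting scale to one box. The box coordinates are made up for illustration, and reading the scale as the reciprocal of the total stride is an assumption rather than something this page states.

```python
from math import prod

stem_conv_stride = 2                    # spatial part of stem_conv_strides (1, 2, 2)
stem_pool_stride = 2                    # spatial part of stem_pool_strides (1, 2, 2)
stage_spatial_strides = (1, 2, 2, 1)    # detection default: last stage keeps resolution

# Total factor by which height and width shrink before the head.
total_stride = stem_conv_stride * stem_pool_stride * prod(stage_spatial_strides)
head_spatial_scale = 1.0 / total_stride

# Hypothetical box in input-pixel coordinates (x1, y1, x2, y2).
box = (32.0, 48.0, 160.0, 208.0)
box_on_features = tuple(v * head_spatial_scale for v in box)

print(total_stride, head_spatial_scale, box_on_features)
# prints: 16 0.0625 (2.0, 3.0, 10.0, 13.0)
```

If the strides are changed, head_spatial_scale likely needs to change with them so that boxes still land on the right feature-map cells.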

Returns

(nn.Module) – SlowFast model.

Return type

torch.nn.modules.module.Module

class pytorchvideo.models.slowfast.PoolConcatPathway(retain_list=False, pool=None, dim=1)

Given a list of tensors, optionally apply spatio-temporal pooling to each and concatenate the tensors along the channel dimension.

__init__(retain_list=False, pool=None, dim=1)

Parameters

retain_list (bool) – if True, return the concatenated tensor in a list.
pool (nn.ModuleList) – if not None, a list of pooling modules, one per pathway, applied before concatenation.
dim (int) – dimension along which to perform concatenation.

Return type

None
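The behaviour described for PoolConcatPathway can be approximated in plain PyTorch. The function below is an illustrative stand-in, not the library's implementation. Its name, the tensor shapes, and the pooling kernels are assumptions chosen so that both pathways collapse to a single spatio-temporal position before concatenation.

```python
import torch
import torch.nn as nn


def pool_and_concat(tensors, pools=None, dim=1, retain_list=False):
    """Pool each pathway (if pools are given), then concatenate along `dim`."""
    if pools is not None:
        if len(pools) != len(tensors):
            raise ValueError("expected one pooling module per pathway")
        tensors = [pool(t) for t, pool in zip(tensors, pools)]
    fused = torch.cat(tensors, dim=dim)
    # Mirror the retain_list option: optionally wrap the result in a list.
    return [fused] if retain_list else fused


# (batch, channels, frames, height, width) for each pathway.
slow = torch.randn(2, 64, 8, 7, 7)
fast = torch.randn(2, 8, 32, 7, 7)
pools = nn.ModuleList([nn.AvgPool3d((8, 7, 7)), nn.AvgPool3d((32, 7, 7))])

out = pool_and_concat([slow, fast], pools=pools, dim=1)
print(out.shape)
```

Pooling first removes the mismatch in frame counts between the pathways, which is what makes the channel-wise concatenation possible.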

class pytorchvideo.models.slowfast.FuseFastToSlow(conv_fast_to_slow, norm=None, activation=None)

Given a list of two tensors from the Slow and Fast pathways, fuse information from the Fast pathway into the Slow pathway through a convolution followed by a concatenation, then return the fused list of tensors from the Slow and Fast pathways, in that order.

__init__(conv_fast_to_slow, norm=None, activation=None)

Parameters

conv_fast_to_slow (nn.Module) – convolution to perform fusion.
norm (nn.Module) – normalization module.
activation (nn.Module) – activation module.

Return type

None
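To see why a strided temporal convolution lets the two pathways be concatenated, the sketch below reproduces the fusion step with plain PyTorch rather than the FuseFastToSlow class itself. The kernel size (7, 1, 1) and stride (4, 1, 1) mirror the deprecated slowfast_fusion_conv_* defaults. The padding, the channel counts, and the omission of norm and activation are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

# (batch, channels, frames, height, width) for each pathway.
slow = torch.randn(1, 64, 8, 14, 14)
fast = torch.randn(1, 8, 32, 14, 14)

# Assumed fusion convolution: temporal kernel 7 and temporal stride 4 reduce
# the Fast pathway's 32 frames to 8, matching the Slow pathway.
conv_fast_to_slow = nn.Conv3d(
    in_channels=8,
    out_channels=16,
    kernel_size=(7, 1, 1),
    stride=(4, 1, 1),
    padding=(3, 0, 0),
    bias=False,
)

with torch.no_grad():
    lateral = conv_fast_to_slow(fast)                # Fast features at Slow frame rate
    fused_slow = torch.cat([slow, lateral], dim=1)   # concatenate on channels
    outputs = [fused_slow, fast]                     # Slow first, Fast unchanged

print(lateral.shape, fused_slow.shape, outputs[1].shape)
```

Only the Slow tensor grows in channels; the Fast tensor is passed through untouched, so the connection is one-directional.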