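Real-Time Fall Detection / Exercise Analysis / Fighting and Other Abnormal-Behavior Recognition / Gesture-Control Recognition: An Open-Source Action-Recognition Toolkit with Principles, Code, Data, and Models!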

cv君


Published: March 22, 2021


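Hi everyone, I'm cv君. Many people working on undergraduate innovation projects, competitions, engineering projects, and academic research (fellow model-training "alchemists") have asked me how to build the kinds of recognition listed above and which framework to choose. Today I'll walk through some options: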

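With single-frame object detection, the semantic correlation between consecutive frames is poor (improved variants exist), so results fall short of real project requirements, especially when it comes to false positives. Object detection also needs a large amount of data to fit, which means a huge annotation effort.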

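Combining pose estimation with object detection works quite well, but such two-stage-style approaches tend to be slow (although many do run in real time), and they still cannot solve problems that require temporal context.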

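For example, a fall should only be recognized when there is an actual falling process; a person lying on the ground has not necessarily fallen (a weakness of pure object detection).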

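Consider exercise detection such as pull-ups, high-knee counting, and ball sports: what goes wrong if we use object detection? For pull-ups, it cannot judge whether the movement is performed correctly (you could post-process to check whether the chin clears the bar's box, but that is hardly "intelligent"). For high-knee counting, object detection cannot count at all. For recognizing ball sports, object detection produces many false positives: detecting the ball gives many false detections, and detecting ball-playing gestures fails as soon as the person occludes the ball. Both approaches also require a lot of annotated data.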

All this build-up is to recommend a brand-new method for recognizing actions in video sequences. The code is open-sourced on GitHub at https://github.com/CVUsers/CV-Action; stars are welcome!

Have a look. It fits with very little training data: a handful of training videos is enough. Don't believe it? Try it yourself!

The network is realtimenet, a real-time action-sequence classification network that was open-sourced within the past two months.

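My GitHub repo will collect video data for all the action sequences mentioned above and train models that are usable in practice. Gesture-control recognition already works, with more to come; follow my WeChat official account for updates.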

Getting started

For now I'll use gesture and exercise recognition as examples, since I don't have much data (haha).

Project demo:

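I didn't convert my own results to GIFs, so take a look at the other demo images instead; mine look almost the same, just trained on different data.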


I. Basic workflow and idea

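The basic idea is to convert the videos and class labels in the dataset into images (video frames) with their corresponding class labels. Frames do not have to be annotated one by one: you can simply label each short video with its class. A CNN is then trained and tested on these images, which turns video classification into image classification. The steps are: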

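(1) Extract frames (JPEGs) from every video (training and test) at a fixed FPS and save them as the training and test sets; the classification performance on the frames is taken as the classification performance of the corresponding videos.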

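(2) Use a two-model setup: a feature extractor for people and their movements, plus a classifier. The feature extractor covers generic human behavior, so you only need to train the classifier on your own classes.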

(3) After training, load the model and evaluate it on all video frames in the test set, reporting top-1 and top-5 accuracy over the whole test set.

(4) Run real-time detection.
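Steps (1) and (3) can be illustrated in plain Python. The snippet below is a minimal sketch, not code from the repository: `sample_frame_indices` picks which frames to keep when cutting a video at a lower FPS, and `top_k_accuracy` computes the top-1/top-5 metrics, assuming each test item comes with a list of per-class scores and an integer ground-truth label. Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def sample_frame_indices(num_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Indices of the frames to keep when resampling a clip from source_fps down to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("FPS values must be positive")
    if target_fps >= source_fps:
        # Frames cannot be created by dropping frames; keep every frame.
        return list(range(num_frames))
    step = source_fps / target_fps  # keep roughly one frame every `step` frames
    return [int(i * step) for i in range(math.ceil(num_frames / step))]


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)), key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    print(sample_frame_indices(30, source_fps=30, target_fps=10))
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```

For a video-level result as in step (1), the frame scores of one video could be averaged before calling `top_k_accuracy`; that aggregation choice is also an assumption.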

II. Other good video-understanding frameworks

First, there's my own GitHub repo. It's convenient, but I wouldn't rank it near the top because it doesn't integrate much yet.

Next is MMAction, a dedicated video-understanding framework; as everyone knows, their work is excellent.

There are also several frameworks from Facebook.

Beyond these, there aren't many comprehensive, easy-to-use real-time frameworks.

OK, let's start with how I use mine.

III. Trying it out

Try the official pretrained models (I've already put them in the repo).

pip install -r requirements.txt

Place the models here:

resources
├── backbone
│   ├── strided_inflated_efficientnet.ckpt
│   └── strided_inflated_mobilenet.ckpt
├── fitness_activity_recognition
│   └── ...
├── gesture_recognition
│   └── ...
└── ...

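First, try the provided demos. In the sense/examples directory you'll find three Python scripts: run_gesture_recognition.py (gesture recognition), run_fitness_tracker.py (fitness tracker), and run_calorie_estimation.py (calorie estimation). Launching each demo is as simple as running the script in a terminal, as described below.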

Gesture recognition:

python examples/run_gesture_recognition.py

Fitness tracker:

python examples/run_fitness_tracker.py --weight=65 --age=30 --height=170 --gender=female

Other useful flags:

--camera_id=CAMERA_ID    ID of the camera to stream from
--path_in=FILENAME       Video file to stream from. This assumes that the video was encoded at 16 fps.

Calorie estimation:

python examples/run_calorie_estimation.py --weight=65 --age=30 --height=170 --gender=female

IV. Training on your own dataset

First, clone my GitHub repo or the original author's.

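Next, record a few videos yourself. For example, I recorded several videos for a class called capture (either .mp4 or .avi works), then recorded more videos for a second class; each class is named after its folder.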

Then run:

cd tools/sense_studio
python sense_studio.py

This starts SenseStudio and shows the address of its web interface. Open that URL in a browser to reach the front-end page.

Click "start new project".

Fill in the form like this:

Then click "create project" to build the dataset.

However, the official way of creating the dataset has a serious bug. So what should we do?

Here is how to do it with my modified approach.

Read this part carefully:

Inside the sense_studio folder, create a new folder; I'll call it cvdemo1.

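Inside it, create two folders: videos_train and videos_valid. Each contains one sub-folder per class, named after the class: capture holds the training videos for the capture class, click holds the training videos for the click class, and videos_valid holds the validation videos in the same layout.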

Inside cvdemo1, create project_config.json. What goes in it? You can copy mine:

{
    "name": "cvdemo1",
    "date_created": "2021-02-03",
    "classes": {
        "capture": [
            "capture",
            "capture"
        ],
        "click": [
            "click",
            "click"
        ]
    }
}

Just change name to the name of your folder.
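If you prefer to script the manual steps above, here is a minimal sketch using only the standard library. It assumes the layout described in this section (videos_train/<class>/ and videos_valid/<class>/ inside the project folder) and reproduces the example config's shape, where each class maps to a two-item list. `build_config` and `create_project` are hypothetical helpers, not part of SenseStudio.

```python
import json
from pathlib import Path
from typing import Dict, Sequence


def build_config(name: str, classes: Sequence[str], date_created: str) -> Dict:
    """Return a project_config.json dict shaped like the example above."""
    return {
        "name": name,  # must match the project folder name
        "date_created": date_created,
        # As in the example config, each class maps to a list with its name twice.
        "classes": {class_name: [class_name, class_name] for class_name in classes},
    }


def create_project(root: str, name: str, classes: Sequence[str], date_created: str) -> Path:
    """Create <root>/<name>/videos_{train,valid}/<class>/ and write project_config.json."""
    project_dir = Path(root) / name
    for split in ("videos_train", "videos_valid"):
        for class_name in classes:
            # Put the recorded .mp4/.avi clips for this class in here.
            (project_dir / split / class_name).mkdir(parents=True, exist_ok=True)
    config_path = project_dir / "project_config.json"
    config_path.write_text(json.dumps(build_config(name, classes, date_created), indent=4))
    return config_path


if __name__ == "__main__":
    # Creates sense_studio/cvdemo1 relative to the current working directory.
    print(create_project("sense_studio", "cvdemo1", ["capture", "click"], "2021-02-03"))
```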

That's it!

Now you can train:

python train_classifier.py

Before running it, edit its main block: point path_in at our training-data folder. Little else needs to change; just follow my settings:

import os

import torch

# feature_extractors is imported from the repo's own package at the top of train_classifier.py

# Parse arguments
# args = docopt(__doc__)  # command-line parsing disabled; values are hard-coded below
path_in = './sense_studio/cvdemo1/'
path_out = path_in
os.makedirs(path_out, exist_ok=True)
use_gpu = True
path_annotations_train = None
path_annotations_valid = None
num_layers_to_finetune = 9
temporal_training = False

# Load feature extractor
feature_extractor = feature_extractors.StridedInflatedEfficientNet()
checkpoint = torch.load('../resources/backbone/strided_inflated_efficientnet.ckpt')
feature_extractor.load_state_dict(checkpoint)
feature_extractor.eval()

# Get the required temporal dimension of feature tensors in order to
# finetune the provided number of layers.
if num_layers_to_finetune > 0:
    num_timesteps = feature_extractor.num_required_frames_per_layer.get(-num_layers_to_finetune)
    if not num_timesteps:
        # Remove 1 because we added 0 to temporal_dependencies
        num_layers = len(feature_extractor.num_required_frames_per_layer) - 1
        raise IndexError(f'Num of layers to finetune not compatible. '
                         f'Must be an integer between 0 and {num_layers}')
else:
    num_timesteps = 1

Training is very fast; about 10 minutes is enough.

Then you can run run_custom_classifier.py:

import json
import os

import torch

# feature_extractors and LogisticRegression come from the repo's own imports at the top of run_custom_classifier.py

# Parse arguments
# args = docopt(__doc__)  # command-line parsing disabled; values are hard-coded below
camera_id = 0
path_in = None
path_out = None
custom_classifier = './sense_studio/cvdemo1/'
title = None
use_gpu = True

# Load original feature extractor
feature_extractor = feature_extractors.StridedInflatedEfficientNet()
feature_extractor.load_weights_from_resources('../resources/backbone/strided_inflated_efficientnet.ckpt')
# To use MobileNet instead (it must match the backbone used during training):
# feature_extractor = feature_extractors.StridedInflatedMobileNetV2()
# feature_extractor.load_weights_from_resources('../resources/backbone/strided_inflated_mobilenet.ckpt')
checkpoint = feature_extractor.state_dict()

# Load custom classifier
checkpoint_classifier = torch.load(os.path.join(custom_classifier, 'classifier.checkpoint'))

# Update original weights in case some intermediate layers have been finetuned
name_finetuned_layers = set(checkpoint.keys()).intersection(checkpoint_classifier.keys())
for key in name_finetuned_layers:
    checkpoint[key] = checkpoint_classifier.pop(key)
feature_extractor.load_state_dict(checkpoint)
feature_extractor.eval()
print('[debug] net:', feature_extractor)

# Map class indices back to class names
with open(os.path.join(custom_classifier, 'label2int.json')) as file:
    class2int = json.load(file)
INT2LAB = {value: key for key, value in class2int.items()}

# Classification head on top of the backbone features
gesture_classifier = LogisticRegression(num_in=feature_extractor.feature_dim,
                                        num_out=len(INT2LAB))
gesture_classifier.load_state_dict(checkpoint_classifier)
gesture_classifier.eval()
print(gesture_classifier)
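The weight-merging step above (keys present in both checkpoints are treated as fine-tuned backbone layers) can be illustrated on plain dictionaries. This is a minimal sketch, assuming state dicts behave like simple key/value mappings; the function name and the example key names are hypothetical, and real checkpoints hold tensors rather than strings.

```python
from typing import Dict, Tuple


def merge_finetuned_weights(backbone_state: Dict, classifier_state: Dict) -> Tuple[Dict, Dict]:
    """Return (backbone weights with fine-tuned layers swapped in, remaining head weights).

    Same key-intersection logic as the script above, but it copies its inputs
    instead of modifying them in place.
    """
    backbone = dict(backbone_state)
    head = dict(classifier_state)
    for key in set(backbone).intersection(head):
        # A key saved in both files belongs to a backbone layer that was fine-tuned:
        # use the fine-tuned value and drop it from the head's weights.
        backbone[key] = head.pop(key)
    return backbone, head


if __name__ == "__main__":
    backbone = {"cnn.0.weight": "pretrained-0", "cnn.1.weight": "pretrained-1"}
    classifier = {"cnn.1.weight": "finetuned-1", "head.weight": "W", "head.bias": "b"}
    merged, head = merge_finetuned_weights(backbone, classifier)
    print(merged)
    print(head)
```

After the split, the backbone loads the merged weights and only the head's own entries remain, which is why the script can pass the popped `checkpoint_classifier` straight to `gesture_classifier.load_state_dict`.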

Again, just adjust the paths.

And you get real-time detection.

Source code walkthrough

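As before, we use EfficientNet as the feature extractor. You can switch to MobileNet instead (example code is included): just use MobileNet both for training and for detection, which only takes changing a few lines.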

The EfficientNet feature-extraction code:

from torch import nn

# ConvReLU, InvertedResidual and StridedInflatedMobileNetV2 are defined in the repo's MobileNet module


class StridedInflatedEfficientNet(StridedInflatedMobileNetV2):

    def __init__(self):
        super().__init__()
        # Replace the MobileNetV2 body with an EfficientNet-style stack; layers with
        # temporal_shift / temporal_stride mix or downsample information along time.
        self.cnn = nn.Sequential(
            ConvReLU(3, 32, 3, stride=2),
            InvertedResidual(32, 24, 3, spatial_stride=1),
            InvertedResidual(24, 32, 3, spatial_stride=2, expand_ratio=6),
            InvertedResidual(32, 32, 3, spatial_stride=1, expand_ratio=6, temporal_shift=True),
            InvertedResidual(32, 32, 3, spatial_stride=1, expand_ratio=6),
            InvertedResidual(32, 56, 5, spatial_stride=2, expand_ratio=6),
            InvertedResidual(56, 56, 5, spatial_stride=1, expand_ratio=6, temporal_shift=True, temporal_stride=True),
            InvertedResidual(56, 56, 5, spatial_stride=1, expand_ratio=6),
            InvertedResidual(56, 112, 3, spatial_stride=2, expand_ratio=6),
            InvertedResidual(112, 112, 3, spatial_stride=1, expand_ratio=6, temporal_shift=True),
            InvertedResidual(112, 112, 3, spatial_stride=1, expand_ratio=6),
            InvertedResidual(112, 112, 3, spatial_stride=1, expand_ratio=6, temporal_shift=True, temporal_stride=True),
            InvertedResidual(112, 112, 3, spatial_stride=1, expand_ratio=6),
            InvertedResidual(112, 160, 5, spatial_stride=1, expand_ratio=6),
            InvertedResidual(160, 160, 5, spatial_stride=1, expand_ratio=6, temporal_shift=True),
            InvertedResidual(160, 160, 5, spatial_stride=1, expand_ratio=6),
            InvertedResidual(160, 160, 5, spatial_stride=1, expand_ratio=6, temporal_shift=True),
            InvertedResidual(160, 160, 5, spatial_stride=1, expand_ratio=6),
            InvertedResidual(160, 272, 5, spatial_stride=2, expand_ratio=6),
            InvertedResidual(272, 272, 5, spatial_stride=1, expand_ratio=6, temporal_shift=True),
            InvertedResidual(272, 272, 5, spatial_stride=1, expand_ratio=6),
            InvertedResidual(272, 272, 5, spatial_stride=1, expand_ratio=6, temporal_shift=True),
            InvertedResidual(272, 272, 5, spatial_stride=1, expand_ratio=6),
            InvertedResidual(272, 448, 3, spatial_stride=1, expand_ratio=6),
            ConvReLU(448, 1280, 1)
        )

And here is InvertedResidual:

from torch import nn

# ConvReLU, SteppableConv3dAs2d and SteppableSparseConv3dAs2d are defined elsewhere in the repo


class InvertedResidual(nn.Module):  # noqa: D101

    def __init__(self, in_planes, out_planes, spatial_kernel_size=3, spatial_stride=1, expand_ratio=1,
                 temporal_shift=False, temporal_stride=False, sparse_temporal_conv=False):
        super().__init__()
        assert spatial_stride in [1, 2]
        hidden_dim = round(in_planes * expand_ratio)
        # Residual connection only when shapes match: no spatial downsampling, same channels
        self.use_residual = spatial_stride == 1 and in_planes == out_planes
        self.temporal_shift = temporal_shift
        self.temporal_stride = temporal_stride

        layers = []
        if expand_ratio != 1:
            # Point-wise expansion
            stride = 1 if not temporal_stride else (2, 1, 1)
            if temporal_shift and sparse_temporal_conv:
                convlayer = SteppableSparseConv3dAs2d
                kernel_size = 1
            elif temporal_shift:
                convlayer = SteppableConv3dAs2d
                kernel_size = (3, 1, 1)
            else:
                convlayer = nn.Conv2d
                kernel_size = 1
            layers.append(ConvReLU(in_planes, hidden_dim, kernel_size=kernel_size, stride=stride,
                                   padding=0, convlayer=convlayer))

        layers.extend([
            # Depth-wise convolution
            ConvReLU(hidden_dim, hidden_dim, kernel_size=spatial_kernel_size, stride=spatial_stride,
                     groups=hidden_dim),
            # Point-wise mapping
            nn.Conv2d(hidden_dim, out_planes, 1, 1, 0),
            # nn.BatchNorm2d(out_planes)
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, input_):  # noqa: D102
        output_ = self.conv(input_)
        residual = self.realign(input_, output_)
        if self.use_residual:
            output_ += residual
        return output_

    def realign(self, input_, output_):  # noqa: D102
        # Select the input frames (time is dim 0) that line up with the output frames
        n_out = output_.shape[0]
        if self.temporal_stride:
            indices = [-1 - 2 * idx for idx in range(n_out)]
            return input_[indices[::-1]]
        else:
            return input_[-n_out:]
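To see which frames `realign` keeps for the residual connection, here is the same indexing on a plain Python list, one letter per frame (oldest first). This is an illustrative re-implementation, not repository code: the real method indexes a tensor along its first (time) dimension, and `realign_frames` is a hypothetical name.

```python
from typing import List, Sequence


def realign_frames(frames: Sequence[str], n_out: int, temporal_stride: bool) -> List[str]:
    """Pick the input frames matching n_out output frames (same indexing as realign; n_out >= 1)."""
    if temporal_stride:
        # Every second frame, counting back from the most recent one, then restored to time order.
        indices = [-1 - 2 * idx for idx in range(n_out)]
        return [frames[i] for i in indices[::-1]]
    # Without temporal stride: simply the last n_out frames.
    return list(frames[-n_out:])


if __name__ == "__main__":
    clip = list("abcdefgh")  # 8 frames, 'h' is the newest
    print(realign_frames(clip, 4, temporal_stride=True))   # ['b', 'd', 'f', 'h']
    print(realign_frames(clip, 3, temporal_stride=False))  # ['f', 'g', 'h']
```

Either way the residual ends up with as many frames as the output, so `output_ += residual` adds tensors whose time dimensions match.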

To fine-tune on our own dataset, features are first extracted from every training and validation video:

import glob
import os

# engine and compute_features come from the repo's own imports at the top of train_classifier.py


def extract_features(path_in, net, num_layers_finetune, use_gpu, num_timesteps=1):
    # Create inference engine
    inference_engine = engine.InferenceEngine(net, use_gpu=use_gpu)

    # Extract features
    for dataset in ["train", "valid"]:
        videos_dir = os.path.join(path_in, f"videos_{dataset}")
        features_dir = os.path.join(path_in, f"features_{dataset}_num_layers_to_finetune={num_layers_finetune}")
        # One sub-folder per class (videos_<dataset>/<class>/<clip>); clips may be .mp4 or .avi
        video_files = []
        for ext in (".mp4", ".avi"):
            video_files += glob.glob(os.path.join(videos_dir, "*", f"*{ext}"))

        print(f"\nFound {len(video_files)} videos to process in the {dataset} set")

        for video_index, video_path in enumerate(video_files):
            print(f"\rExtract features from video {video_index + 1} / {len(video_files)}",
                  end="")
            # Mirror the video path into features_dir and swap the extension for .npy
            # (replacing only ".mp4" would leave .avi paths unchanged)
            path_out = os.path.splitext(video_path.replace(videos_dir, features_dir))[0] + ".npy"

            if os.path.isfile(path_out):
                print("\n\tSkipped - feature was already precomputed.")
            else:
                # Read all frames and save the backbone features to path_out
                compute_features(video_path, path_out, inference_engine,
                                 num_timesteps=num_timesteps, path_frames=None, batch_size=16)

        print('\n')

Building the data loader:

import glob
import json
import os

import numpy as np
import torch

# FeaturesDataset and MODEL_TEMPORAL_DEPENDENCY come from the repo's own modules


def generate_data_loader(dataset_dir, features_dir, tags_dir, label_names, label2int,
                         label2int_temporal_annotation, num_timesteps=5, batch_size=16, shuffle=True,
                         stride=4, path_annotations=None, temporal_annotation_only=False,
                         full_network_minimum_frames=MODEL_TEMPORAL_DEPENDENCY):
    # Find pre-computed features and derive corresponding labels
    tags_dir = os.path.join(dataset_dir, tags_dir)
    features_dir = os.path.join(dataset_dir, features_dir)
    labels_string = []
    temporal_annotation = []
    if not path_annotations:
        # Use all pre-computed features; the class is the name of the sub-folder
        features = []
        labels = []
        for label in label_names:
            feature_temp = glob.glob(f'{features_dir}/{label}/*.npy')
            features += feature_temp
            labels += [label2int[label]] * len(feature_temp)
            labels_string += [label] * len(feature_temp)
    else:
        with open(path_annotations, 'r') as f:
            annotations = json.load(f)
        features = ['{}/{}/{}.npy'.format(features_dir, entry['label'],
                                          os.path.splitext(os.path.basename(entry['file']))[0])
                    for entry in annotations]
        labels = [label2int[entry['label']] for entry in annotations]
        labels_string = [entry['label'] for entry in annotations]

    # Check if a temporal annotation exists for each video
    for label, feature in zip(labels_string, features):
        classe_mapping = {0: "counting_background",
                          1: f'{label}_position_1',
                          2: f'{label}_position_2'}
        temporal_annotation_file = feature.replace(features_dir, tags_dir).replace(".npy", ".json")
        if os.path.isfile(temporal_annotation_file):
            with open(temporal_annotation_file) as tag_file:
                annotation = json.load(tag_file)["time_annotation"]
            annotation = np.array([label2int_temporal_annotation[classe_mapping[y]] for y in annotation])
            temporal_annotation.append(annotation)
        else:
            temporal_annotation.append(None)

    if temporal_annotation_only:
        # Keep only the videos that have a temporal annotation
        features = [x for x, y in zip(features, temporal_annotation) if y is not None]
        labels = [x for x, y in zip(labels, temporal_annotation) if y is not None]
        temporal_annotation = [x for x in temporal_annotation if x is not None]

    # Build dataloader
    dataset = FeaturesDataset(features, labels, temporal_annotation,
                              num_timesteps=num_timesteps, stride=stride,
                              full_network_minimum_frames=full_network_minimum_frames)
    data_loader = torch.utils.data.DataLoader(dataset, shuffle=shuffle, batch_size=batch_size)

    return data_loader
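The per-frame tag scheme used in the loop above can be shown in isolation. Each tag file is read as a `time_annotation` list with one value per frame (0 = background, 1 = first position, 2 = second position, following `classe_mapping`), and tags are mapped to integer ids through `label2int_temporal_annotation`. This is a minimal sketch: the function name, the class names, and the id table are made-up examples, and it returns a list where the original builds a NumPy array.

```python
from typing import Dict, List, Sequence


def encode_time_annotation(label: str, time_annotation: Sequence[int],
                           label2int_temporal_annotation: Dict[str, int]) -> List[int]:
    """Convert per-frame tags (0/1/2) for one video of class `label` into integer ids."""
    classe_mapping = {0: "counting_background",
                      1: f"{label}_position_1",
                      2: f"{label}_position_2"}
    return [label2int_temporal_annotation[classe_mapping[tag]] for tag in time_annotation]


if __name__ == "__main__":
    # Hypothetical id table for two classes sharing one background tag.
    label2int_temporal_annotation = {
        "counting_background": 0,
        "squat_position_1": 1,
        "squat_position_2": 2,
        "jump_position_1": 3,
        "jump_position_2": 4,
    }
    print(encode_time_annotation("jump", [0, 1, 1, 2, 0], label2int_temporal_annotation))
```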

How does real-time detection over video sequences work?

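It mainly works by grouping the frames from a span of time into a sequence and feeding that sequence to the network for classification. The related parameters appear in several places, for example in display.py: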

import time
from typing import Dict

import cv2
import numpy as np

# BaseDisplay, FONT and put_text are defined elsewhere in the repo's display.py


class DisplayClassnameOverlay(BaseDisplay):
    """
    Display recognized class name as a large video overlay. Once the probability for a class passes the threshold,
    the name is shown and stays visible for a certain duration.
    """

    def __init__(
            self,
            thresholds: Dict[str, float],
            duration: float = 2.,
            font_scale: float = 3.,
            thickness: int = 2,
            border_size: int = 50,
            **kwargs
    ):
        """
        :param thresholds:
            Dictionary of thresholds for all classes.
        :param duration:
            Duration in seconds how long the class name should be displayed after it has been recognized.
        :param font_scale:
            Font scale factor for modifying the font size.
        :param thickness:
            Thickness of the lines used to draw the text.
        :param border_size:
            Height of the border on top of the video display. Used for correctly centering the displayed class name
            on the video.
        """
        super().__init__(**kwargs)
        self.thresholds = thresholds
        self.duration = duration
        self.font_scale = font_scale
        self.thickness = thickness
        self.border_size = border_size

        self._current_class_name = None
        self._start_time = None

    def _get_center_coordinates(self, img: np.ndarray, text: str):
        # Size of the rendered text in pixels: (width, height)
        textsize = cv2.getTextSize(text, FONT, self.font_scale, self.thickness)[0]

        height, width, _ = img.shape
        height -= self.border_size  # center within the area below the top border

        x = int((width - textsize[0]) / 2)
        y = int((height + textsize[1]) / 2) + self.border_size

        return x, y

    def _display_class_name(self, img: np.ndarray, class_name: str):
        pos = self._get_center_coordinates(img, class_name)
        put_text(img, class_name, position=pos, font_scale=self.font_scale, thickness=self.thickness)

    def display(self, img: np.ndarray, display_data: dict):
        now = time.perf_counter()

        if self._current_class_name and now - self._start_time < self.duration:
            # Keep displaying the same class name
            self._display_class_name(img, self._current_class_name)
        else:
            self._current_class_name = None
            for class_name, proba in display_data['sorted_predictions']:
                if class_name in self.thresholds and proba > self.thresholds[class_name]:
                    # Display new class name
                    self._display_class_name(img, class_name)
                    self._current_class_name = class_name
                    self._start_time = now
                    break

        return img
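The display logic above boils down to two ideas: classify a running window of recent frames, and once a class passes its threshold, keep it on screen for `duration` seconds. Below is a simplified pure-Python sketch of both, assuming predictions arrive as `(class_name, probability)` pairs sorted by probability. `SlidingWindow` and `HoldingLabel` are hypothetical names. The repository streams frames through the steppable convolutions seen in InvertedResidual, so the explicit buffer here is only a conceptual stand-in, and `HoldingLabel` reproduces the threshold-and-hold rule of `display` without drawing anything. The injectable `clock` makes the timing testable.

```python
import time
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class SlidingWindow:
    """Keep the most recent `size` frames so they can be classified as one sequence."""

    def __init__(self, size: int):
        self.frames = deque(maxlen=size)  # the oldest frame is dropped automatically

    def push(self, frame) -> bool:
        """Add a frame; return True once the window holds `size` frames."""
        self.frames.append(frame)
        return len(self.frames) == self.frames.maxlen

    def sequence(self) -> List:
        return list(self.frames)


class HoldingLabel:
    """Threshold-and-hold rule of DisplayClassnameOverlay.display, without any drawing."""

    def __init__(self, thresholds: Dict[str, float], duration: float = 2.0,
                 clock: Callable[[], float] = time.perf_counter):
        self.thresholds = thresholds
        self.duration = duration
        self.clock = clock
        self.current_class_name: Optional[str] = None
        self.start_time: Optional[float] = None

    def update(self, sorted_predictions: Iterable[Tuple[str, float]]) -> Optional[str]:
        """Return the class name to show for this frame, or None."""
        now = self.clock()
        if self.current_class_name and now - self.start_time < self.duration:
            return self.current_class_name  # still inside the hold period
        self.current_class_name = None
        for class_name, proba in sorted_predictions:
            # Only classes with a configured threshold can be shown.
            if class_name in self.thresholds and proba > self.thresholds[class_name]:
                self.current_class_name = class_name
                self.start_time = now
                break
        return self.current_class_name


if __name__ == "__main__":
    window = SlidingWindow(size=4)
    for frame_id in range(6):
        if window.push(frame_id):
            print("classify frames", window.sequence())
    label = HoldingLabel({"click": 0.6}, duration=2.0)
    print(label.update([("click", 0.9), ("capture", 0.1)]))
```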

One more thing: about 5 videos per class are enough to get decent results.

Stars on the GitHub repo are welcome, since I'll keep adding models for everything listed in the title.


Original article: http://xie.infoq.cn/article/ca89527294edbc5b16c03d235. Please contact the author before reposting.
