Video captioning based on vision transformer and reinforcement learning

doi:10.7717/peerj-cs.916

	Zhao, Hong 1; Chen, Zhiwen 1; Guo, Lan 1; Han, Zeyu 2
	2022-03-16
Journal	PEERJ COMPUTER SCIENCE
Volume	8
Abstract	Global encoding of visual features is important for improving description accuracy in video captioning. In this paper, we propose a video captioning method that combines the Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoder block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves BLEU-4, METEOR, ROUGE-L and CIDEr-D by 2.9%, 1.4%, 0.9% and 4.8%, respectively.
Keywords	Video captioning ; Vision transformer ; Reinforcement learning ; Long short-term memory network ; Computer vision ; Natural language processing ; Attention mechanism ; Encode-decode ; Deep learning
DOI	10.7717/peerj-cs.916
Indexed In	SCIE ; EI
Language	English
WOS Research Area	Computer Science
WOS Category	Computer Science, Artificial Intelligence ; Computer Science, Information Systems ; Computer Science, Theory & Methods
WOS Accession Number	WOS:000773302200003
Publisher	PEERJ INC
EI Accession Number	20221611987362
EI Main Heading	Long short-term memory
EI Classification Code	461.1 Biomedical Engineering ; 716.1 Information Theory and Signal Processing ; 716.4 Television Systems and Equipment ; 723.2 Data Processing and Image Processing ; 723.4 Artificial Intelligence ; 723.5 Computer Applications ; 741.2 Vision
Source Database	WOS
Citation Statistics	Times cited: 4 [WOS]
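Document Type	Journal article
Item Identifier	https://ir.lut.edu.cn/handle/2XXMBERH/158092
Collection	Lanzhou University of Technology
Corresponding Author	Chen, Zhiwen
Author Affiliations	1.Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou, Gansu, Peoples R China; 2.Lanzhou Univ Technol, Network & Informat Ctr, Lanzhou, Gansu, Peoples R China
First Author Affiliation	Lanzhou University of Technology
Corresponding Author Affiliation	Lanzhou University of Technology
First Author's First Affiliation	Lanzhou University of Technology
Recommended Citation GB/T 7714	Zhao, Hong,Chen, Zhiwen,Guo, Lan,et al. Video captioning based on vision transformer and reinforcement learning[J]. PEERJ COMPUTER SCIENCE,2022,8.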
APA	Zhao, Hong,Chen, Zhiwen,Guo, Lan,&Han, Zeyu.(2022).Video captioning based on vision transformer and reinforcement learning.PEERJ COMPUTER SCIENCE,8.
MLA	Zhao, Hong,et al."Video captioning based on vision transformer and reinforcement learning".PEERJ COMPUTER SCIENCE 8(2022).