Video captioning based on vision transformer and reinforcement learning
Zhao, Hong1; Chen, Zhiwen1; Guo, Lan1; Han, Zeyu2
2022-03-16
Journal: PEERJ COMPUTER SCIENCE
Volume: 8
Abstract: Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. Firstly, ResNet-152 and ResNeXt-101 are used to extract features from videos. Secondly, the encoding block of the ViT network is applied to encode the video features. Thirdly, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content description. Finally, the accuracy of the video content description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves by 2.9%, 1.4%, 0.9% and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.
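The abstract does not say which reinforcement learning algorithm is used for the fine-tuning step. One common choice in captioning work is a policy-gradient loss with a baseline, and the sketch below illustrates that idea only; it is not the authors' implementation. The function name, the use of a sentence-level metric such as CIDEr-D as the reward, and the numeric values are all assumptions made for illustration.

```python
from typing import Sequence


def policy_gradient_loss(
    token_log_probs: Sequence[float],
    sampled_reward: float,
    baseline_reward: float,
) -> float:
    """Policy-gradient (REINFORCE with baseline) loss for one sampled caption.

    token_log_probs: log-probability the decoder gave to each sampled word.
    sampled_reward: sentence-level score of the sampled caption
        (assumed here to be a metric such as CIDEr-D).
    baseline_reward: score of a baseline caption, e.g. a greedy decode
        (an assumption; the abstract does not specify a baseline).
    """
    advantage = sampled_reward - baseline_reward
    # Minimizing this value raises the likelihood of captions that beat
    # the baseline and lowers it for captions that score worse.
    return -advantage * sum(token_log_probs)


# Hypothetical values for a three-word sampled caption.
log_probs = [-0.5, -1.0, -0.25]
loss = policy_gradient_loss(log_probs, sampled_reward=0.75, baseline_reward=0.5)
```

In a real training loop the log-probabilities would come from the LSTM decoder and the loss would be backpropagated through an autodiff framework; plain floats are used here only to keep the arithmetic visible.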
Keywords: Video captioning; Vision transformer; Reinforcement learning; Long short-term memory network; Computer vision; Natural language processing; Attention mechanism; Encode-decode; Deep learning
DOI: 10.7717/peerj-cs.916
Indexed in: SCIE; EI
Language: English
WOS research area: Computer Science
WOS categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Theory & Methods
WOS accession number: WOS:000773302200003
Publisher: PEERJ INC
EI accession number: 20221611987362
EI main heading: Long short-term memory
EI classification codes: 461.1 Biomedical Engineering; 716.1 Information Theory and Signal Processing; 716.4 Television Systems and Equipment; 723.2 Data Processing and Image Processing; 723.4 Artificial Intelligence; 723.5 Computer Applications; 741.2 Vision
Source database: WOS
Citation statistics
Times cited: 4 [WOS]
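Document type: Journal article
Item identifier: https://ir.lut.edu.cn/handle/2XXMBERH/158092
Collection: Lanzhou University of Technology
Corresponding author: Chen, Zhiwen
Author affiliations: 1. Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou, Gansu, Peoples R China;
2. Lanzhou Univ Technol, Network & Informat Ctr, Lanzhou, Gansu, Peoples R China
First author affiliation: Lanzhou University of Technology
Corresponding author affiliation: Lanzhou University of Technology
First affiliation of the first author: Lanzhou University of Technology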
Recommended citation
GB/T 7714
Zhao, Hong, Chen, Zhiwen, Guo, Lan, et al. Video captioning based on vision transformer and reinforcement learning[J]. PEERJ COMPUTER SCIENCE, 2022, 8.
APA: Zhao, Hong, Chen, Zhiwen, Guo, Lan, & Han, Zeyu. (2022). Video captioning based on vision transformer and reinforcement learning. PEERJ COMPUTER SCIENCE, 8.
MLA: Zhao, Hong, et al. "Video captioning based on vision transformer and reinforcement learning". PEERJ COMPUTER SCIENCE 8 (2022).
 

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.