Lanzhou University of Technology Institutional Repository (LUT_IR)
Title | Video captioning based on vision transformer and reinforcement learning |
Authors | Zhao, Hong1; Chen, Zhiwen1; Guo, Lan1; Han, Zeyu2 |
Date | 2022-03-16 |
Journal | PEERJ COMPUTER SCIENCE |
Volume | 8 |
Abstract | Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines the Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves by 2.9%, 1.4%, 0.9% and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively. |
Keywords | Video captioning ; Vision transformer ; Reinforcement learning ; Long short-term memory network ; Computer vision ; Natural language processing ; Attention mechanism ; Encode-decode ; Deep learning |
DOI | 10.7717/peerj-cs.916 |
Indexed in | SCIE ; EI |
Language | English |
WOS research area | Computer Science |
WOS category | Computer Science, Artificial Intelligence ; Computer Science, Information Systems ; Computer Science, Theory & Methods |
WOS accession number | WOS:000773302200003 |
Publisher | PEERJ INC |
EI accession number | 20221611987362 |
EI main heading | Long short-term memory |
EI classification code | 461.1 Biomedical Engineering ; 716.1 Information Theory and Signal Processing ; 716.4 Television Systems and Equipment ; 723.2 Data Processing and Image Processing ; 723.4 Artificial Intelligence ; 723.5 Computer Applications ; 741.2 Vision |
Source database | WOS |
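Document type | Journal article |
Item identifier | https://ir.lut.edu.cn/handle/2XXMBERH/158092 |
Collection | Lanzhou University of Technology |
Corresponding author | Chen, Zhiwen |
Author affiliations | 1. Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou, Gansu, Peoples R China; 2. Lanzhou Univ Technol, Network & Informat Ctr, Lanzhou, Gansu, Peoples R China |
First author affiliation | Lanzhou University of Technology |
Corresponding author affiliation | Lanzhou University of Technology |
First affiliation of the first author | Lanzhou University of Technology |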
Recommended citation (GB/T 7714) | Zhao, Hong, Chen, Zhiwen, Guo, Lan, et al. Video captioning based on vision transformer and reinforcement learning[J]. PEERJ COMPUTER SCIENCE, 2022, 8. |
APA | Zhao, Hong, Chen, Zhiwen, Guo, Lan, & Han, Zeyu. (2022). Video captioning based on vision transformer and reinforcement learning. PEERJ COMPUTER SCIENCE, 8. |
MLA | Zhao, Hong, et al. "Video captioning based on vision transformer and reinforcement learning". PEERJ COMPUTER SCIENCE 8 (2022). |