Paper Title
Visual Tracking with Multi-Scale Fusion Transformers

Abstract
Visual object tracking is the task of locating a target in subsequent frames given its region in the first frame. The appearance of the target may change dramatically due to partial occlusion, illumination change, object deformation, and similar factors, which makes tracking quite challenging. In this paper, we propose an object tracking framework called MSFormer. The proposed method consists of two major parts: a backbone and a head. In contrast to existing object tracking frameworks, we integrate the feature extraction and relation modules into the backbone, which makes the framework simpler and more efficient. The backbone is based on the multi-scale feature fusion module (MSFM), which fuses the features of the template and the search region at various scales. It can therefore handle scale variations of the target during tracking. The head consists of a corner head and a score head. The corner head is responsible for determining the top-left and bottom-right corners of the bounding box. We have conducted experiments on two benchmark datasets, GOT-10k and LaSOT. MSFormer achieves performance comparable to state-of-the-art approaches.

Keywords - Object Tracking, Computer Vision, Deep Learning.