<strong>Paper Title</strong><br>
Visual Tracking with Multi-Scale Fusion Transformers<br>
<br>

<strong>Abstract</strong><br>
Visual object tracking is the task of automatically locating a target in subsequent frames, given the target region
in the first frame. The appearance of the target might change dramatically due to partial occlusion,
illumination change, object deformation, etc., which makes the tracking task quite challenging. In this paper, we propose an
object tracking framework called MSFormer. Our proposed method consists of two major parts: a backbone and a head. In
contrast to existing object tracking frameworks, we integrate feature extraction and the relation module into the backbone.
This makes our proposed framework simpler and more efficient. The backbone is based on the multi-scale feature fusion
module (MSFM), which fuses the features of the template and the search region at various scales. Thus, it can deal with scale
variations of the target during the tracking process. The head consists of a corner head and a score head. The corner head is
responsible for determining the top-left and bottom-right corners of the bounding box. We have conducted experiments on
two benchmark datasets, i.e., GOT-10k and LaSOT. Our MSFormer achieves performance comparable to
state-of-the-art approaches.
Keywords - Object Tracking, Computer Vision, Deep Learning.
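
As an illustration of what a corner head's output could look like, the following is a minimal sketch that decodes a bounding box from two score maps, one for the top-left corner and one for the bottom-right corner. The abstract does not specify how MSFormer decodes corners; the soft-argmax decoding used here, and the names `soft_argmax_2d` and `decode_box`, are assumptions made purely for illustration.

```python
import math


def soft_argmax_2d(score_map):
    """Expected (x, y) position under a softmax over a 2D score map.

    score_map is a list of rows; x indexes columns, y indexes rows.
    Assumption: soft-argmax decoding, which the abstract does not confirm.
    """
    peak = max(v for row in score_map for v in row)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


def decode_box(top_left_map, bottom_right_map):
    """Combine the two corner estimates into (x1, y1, x2, y2)."""
    x1, y1 = soft_argmax_2d(top_left_map)
    x2, y2 = soft_argmax_2d(bottom_right_map)
    return x1, y1, x2, y2
```

Given two maps of the same size as the search-region feature grid, `decode_box` returns corner coordinates in grid units; scaling them back to image pixels would depend on the backbone stride, which the abstract does not state.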