Video super-resolution (VSR) aims to restore a high-resolution (HR) image from multiple low-resolution (LR) frames. Previous works handle the input LR frames by stacking or warping them and use only single-scale features for reconstruction. Most of them do not fuse multi-scale spatial and inter-frame temporal information, which can result in a loss of detail. In this paper, we propose a novel architecture named Wave-shape network, which treats each frame as a separate source of information and fuses different temporal frames through a multi-scale structure. This fusion strategy enables us to capture more complete structural and contextual information, improving the quality of the reconstructed HR image. We evaluate the model on the Vid4 dataset; the results show that the Wave-shape network not only achieves significant visual improvement but also obtains higher PSNR and SSIM than most previous VSR methods.
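To make the fusion idea concrete, the following is a minimal, hypothetical sketch of per-frame multi-scale fusion in plain Python, not the paper's actual Wave-shape network: each LR frame is decomposed into a pyramid by 2x2 average pooling, the pyramids are averaged across frames at every scale (temporal fusion), and the coarse levels are upsampled and summed into one map (multi-scale spatial fusion). All function names, the pooling/upsampling operators, and the pyramid depth are illustrative assumptions.

```python
def downsample(img):
    """2x2 average pooling; assumes even height and width."""
    h, w = len(img), len(img[0])
    return [[(img[2*i][2*j] + img[2*i][2*j+1] +
              img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w // 2)] for i in range(h // 2)]

def upsample(img):
    """Nearest-neighbor 2x upsampling."""
    return [[img[i // 2][j // 2] for j in range(2 * len(img[0]))]
            for i in range(2 * len(img))]

def multi_scale_fuse(frames, levels=2):
    """Treat each frame as a separate source: build a pyramid per frame,
    average across frames at each scale, then merge scales to full size.
    (Illustrative stand-in for a learned multi-scale fusion module.)"""
    # Per-frame multi-scale decomposition.
    pyramids = []
    for f in frames:
        pyr, cur = [f], f
        for _ in range(levels - 1):
            cur = downsample(cur)
            pyr.append(cur)
        pyramids.append(pyr)
    # Temporal fusion: mean over frames at each scale.
    fused = []
    for lvl in range(levels):
        maps = [p[lvl] for p in pyramids]
        n = len(maps)
        fused.append([[sum(m[i][j] for m in maps) / n
                       for j in range(len(maps[0][0]))]
                      for i in range(len(maps[0]))])
    # Spatial fusion: upsample coarse levels and sum into the finest map.
    out = fused[-1]
    for lvl in range(levels - 2, -1, -1):
        out = upsample(out)
        out = [[out[i][j] + fused[lvl][i][j] for j in range(len(out[0]))]
               for i in range(len(out))]
    return out
```

In a real VSR network the pooling, upsampling, and cross-frame averaging here would be replaced by learned convolutions, but the data flow (per-frame pyramids fused across time at every scale) is the point being illustrated.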