In this paper, we propose a spatio-temporal predictive network that applies attention weighting to multiple physical Deep Learning (DL) models, targeting videos with various physical properties. Previous approaches use models with multiple branches for different properties in videos, but the outputs of the branches are simply summed, even when the properties change in time and space. In addition, it is difficult to train previous models to learn sufficient representations of the physical properties in videos. We therefore propose a design for the spatio-temporal prediction network and a training method for videos with multiple physical properties, motivated by the Mixture of Experts framework. Specifically, we propose multiple spatio-temporal DL branches (experts) for multiple physical properties, together with a pixel-wise and expert-wise attention mechanism, i.e., Spatial-Temporal Gating Networks (STGNs), that adaptively integrates the outputs of the experts. The experts are trained with a vast amount of synthetic image sequences generated from physical equations and noise models. In contrast, the whole network, including the STGNs, can be trained with only a limited number of real datasets. Experiments on various videos, i.e., traffic, pedestrian, and Dynamic Texture videos, and radar images, show that the proposed approach outperforms previous approaches.
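To make the contrast between simple summation and pixel-wise, expert-wise weighting concrete, the following is a minimal sketch and not the architecture of the paper. It assumes that each of K experts outputs a predicted sequence of shape (T, H, W), that a gating network produces one logit per expert, frame, and pixel, and that the logits are normalized with a softmax over the expert axis. The function names `softmax` and `gated_mixture`, the tensor shapes, and the softmax normalization are all illustrative assumptions; random arrays stand in for the outputs of real experts and gating networks.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def gated_mixture(expert_outputs, gate_logits):
    """Combine expert predictions with pixel-wise, expert-wise weights.

    expert_outputs: array of shape (K, T, H, W), one prediction per expert.
    gate_logits:    array of shape (K, T, H, W), hypothetical gating scores.
    Returns the weighted prediction (T, H, W) and the weights (K, T, H, W).
    """
    # Assumption: weights are normalized over the expert axis with a softmax,
    # so that at every pixel and frame they are positive and sum to one.
    weights = softmax(gate_logits, axis=0)
    prediction = (weights * expert_outputs).sum(axis=0)
    return prediction, weights


# Random stand-ins for expert outputs and gating scores (illustrative only).
rng = np.random.default_rng(0)
K, T, H, W = 3, 4, 8, 8
expert_outputs = rng.random((K, T, H, W))
gate_logits = rng.normal(size=(K, T, H, W))

prediction, weights = gated_mixture(expert_outputs, gate_logits)

# Baseline for comparison: branches summed with no spatial or temporal weighting.
plain_sum = expert_outputs.sum(axis=0)
```

Under these assumptions, the prediction at each pixel is a convex combination of the expert outputs, so the contribution of each expert can vary over space and time, whereas `plain_sum` gives every expert the same weight everywhere.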