Deep reinforcement learning algorithms have made great progress in video games, yet they still suffer from problems such as sample inefficiency and poor generalization. In this paper, we argue that these problems stem partly from the inability of convolutional neural networks (CNNs) to reason about the underlying relations between the objects in image observations. Motivated by this observation, we alleviate these problems in a more efficient and interpretable way: learning representations of objects and reasoning about the relations between them with a relation network (RN). Each pixel in the feature maps is treated as an object, and our model explicitly learns the relations between object pairs. The relations are summarized through an attention mechanism and then fed into the downstream fully-connected layers. In the experiments, our model is compared with baseline models on three typical object-based Atari games. Under the same hyperparameter settings, our model achieves better sample efficiency and stronger generalization. Further studies shed light on the impact of hyperparameters and verify the interpretability of the model.
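The pipeline described above (pixels as objects, pairwise relations via an RN, attention-weighted summarization) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual implementation: the feature map, MLP weights, and attention vector are random stand-ins for quantities a CNN and training would produce, and all dimensions are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a CNN's output: a (C, H, W) feature map (random here).
C, H, W = 8, 4, 4
feature_map = rng.standard_normal((C, H, W))

# Treat each of the H*W spatial positions as an "object" vector of length C.
objects = feature_map.reshape(C, H * W).T          # (N, C), N = H * W
N = objects.shape[0]

# Relation function g: a small MLP applied to every ordered object pair.
# Weights are random stand-ins for learned parameters.
d_rel = 16
W1 = rng.standard_normal((2 * C, 32)) * 0.1
W2 = rng.standard_normal((32, d_rel)) * 0.1

# Build all N*N ordered pairs by concatenating object vectors.
pairs = np.concatenate(
    [np.repeat(objects, N, axis=0), np.tile(objects, (N, 1))], axis=1
)                                                   # (N*N, 2C)
relations = np.maximum(pairs @ W1, 0.0) @ W2        # ReLU MLP -> (N*N, d_rel)

# Attention: a learned score per relation, softmax-normalized; the weighted
# sum replaces the plain sum used by the original relation network.
w_att = rng.standard_normal(d_rel) * 0.1
scores = relations @ w_att
weights = np.exp(scores - scores.max())
weights /= weights.sum()
summary = weights @ relations                       # (d_rel,), fed to FC layers
```

With attention, the fixed-size `summary` vector is independent of the number of objects, so the downstream fully-connected layers see the same input dimensionality regardless of the feature-map resolution.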