We propose a motion generation model that follows language instructions and achieves robust behavior under environmental changes at low cost. Conventional robots that communicate with humans build a mapping between language and motion within a restricted environment and language, and therefore require a huge training set to achieve versatility. Our method learns from paired language, visual, and motor information of the robot, and generates motions in real time based on 'attention' guided by the language instruction. Specifically, when multiple objects are in the field of view, the robot generates motions while focusing on the object indicated by the human. In addition, because position recognition of the indicated object and motion generation are performed in real time, motion generation is robust to changes in object position and lighting conditions. We found that features related to the object name and its location are self-organized in the latent (PB: Parametric Bias) space through end-to-end learning of robot motion and sentences. These observations suggest the importance of integrated learning of robot motion and sentences, since such feature representations cannot be obtained by learning motions alone.