We propose a model of evolutionary interaction between two robots in which the signs used for communication emerge through mutual adaptation. Signs used in human interaction, such as language, gestures, and eye contact, change and evolve in form and meaning through repeated use. To create flexible, human-like interaction systems, signs must be treated as a dynamic property, and a framework is needed in which signs emerge from mutual adaptation between agents. Our target is multi-modal interaction between two robots using voice and motion, where a voice pattern serves as a sign referring to a motion pattern, and vice versa. To enable the robots to recognize and generate evolving signs (voice and motion patterns), we used a dynamics model, the Multiple Timescale Recurrent Neural Network (MTRNN). To enable the robots to interpret signs, we used hierarchical neural networks that transform the dynamics-model parameters of voice into those of motion, and vice versa. In our experiment, the two robots took turns responding to each other's voice with motion, and each continually modified its own interpretation of signs through this mutual adaptation. We found that the interaction kept evolving through the robots' repeated, alternating miscommunications and readaptations, and that this induced the emergence of diverse new signs, which depended on the robots' body dynamics, through the generalization capability of the MTRNN.
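As a rough illustration of the kind of dynamics model named above, the following is a minimal sketch of the leaky-integrator update commonly used in multiple-timescale recurrent networks, in which units with a small time constant change quickly and units with a large time constant change slowly. It is not the implementation used in this work: the function names, network sizes, time constants, and random weights are all assumptions made for illustration, and input/output units and training are omitted.

```python
import math
import random


def mtrnn_step(u, weights, taus):
    """One leaky-integrator update for a group of context units.

    u       : list of internal states, one per unit
    weights : weights[i][j] is the connection from unit j to unit i
    taus    : per-unit time constants (>= 1); a large tau means a slow unit
    """
    y = [math.tanh(v) for v in u]  # unit activations
    new_u = []
    for i, tau in enumerate(taus):
        net = sum(w * yj for w, yj in zip(weights[i], y))
        # Leaky integration: a large tau keeps most of the old state.
        new_u.append((1.0 - 1.0 / tau) * u[i] + net / tau)
    return new_u


def make_network(n_fast=4, n_slow=2, tau_fast=2.0, tau_slow=50.0, seed=0):
    """Build a toy network (all sizes and constants are hypothetical)."""
    rng = random.Random(seed)
    n = n_fast + n_slow
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    taus = [tau_fast] * n_fast + [tau_slow] * n_slow
    u0 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return u0, weights, taus


def rollout(u0, weights, taus, steps=20):
    """Iterate the update and return the list of visited states."""
    states = [u0]
    for _ in range(steps):
        states.append(mtrnn_step(states[-1], weights, taus))
    return states


if __name__ == "__main__":
    u0, weights, taus = make_network()
    states = rollout(u0, weights, taus)
    print(len(states), len(states[-1]))
```

In this sketch the only difference between fast and slow units is the value of `tau`; the separation of timescales comes entirely from how strongly each unit retains its previous state.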