Robots must process multimodal information because real-world input arrives through multiple modalities; however, few existing robots integrate multimodal information. Humans recognize their environment effectively through cross-modal processing. We focus on modeling synaesthesia, a phenomenon known as a form of cross-modal perception in humans. Recently, deep neural networks (DNNs) have gained attention and have been successfully applied to high-dimensional data comprising not only a single modality but also multimodal information. We introduce DNNs to construct a multimodal association model that can reconstruct one modality from another. Our model consists of two DNNs: one for image compression and the other for audio-visual sequential learning. We attempted to reproduce the synaesthesia phenomenon by training our model on multimodal data acquired from a psychological experiment. A cross-modal association experiment showed that our model can reconstruct from sound the same or similar images as synaesthetes, i.e., people who experience synaesthesia. Analysis of the middle layers of the DNNs, which represent multimodal features, suggested that the DNNs self-organized the perceptual differences between individual synaesthetes.
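The abstract describes a two-DNN pipeline (image compression plus audio-visual sequential learning) without implementation details. The following is a minimal illustrative sketch only, under the assumption that the compression DNN is an autoencoder and the sequential model is a simple recurrent network mapping audio features to the compressed image code; all dimensions, names, and network choices here are hypothetical, not the authors' actual architecture.

```python
import numpy as np

# Illustrative sketch, not the paper's implementation.
# Assumption 1: the image-compression DNN is an autoencoder producing a
#               low-dimensional code for each image.
# Assumption 2: the audio-visual sequential DNN is a small recurrent net that
#               maps an audio feature sequence to that image code, enabling
#               sound -> image association (cross-modal reconstruction).

rng = np.random.default_rng(0)
IMG_DIM, CODE_DIM, AUDIO_DIM, HID_DIM = 64, 8, 12, 16  # toy sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ImageAutoencoder:
    """Compresses an image vector to a code and reconstructs it."""
    def __init__(self):
        self.W_enc = rng.normal(0, 0.1, (CODE_DIM, IMG_DIM))
        self.W_dec = rng.normal(0, 0.1, (IMG_DIM, CODE_DIM))

    def encode(self, img):
        return sigmoid(self.W_enc @ img)

    def decode(self, code):
        return sigmoid(self.W_dec @ code)

class AudioToCodeRNN:
    """Maps an audio feature sequence to a predicted image code."""
    def __init__(self):
        self.W_in = rng.normal(0, 0.1, (HID_DIM, AUDIO_DIM))
        self.W_h = rng.normal(0, 0.1, (HID_DIM, HID_DIM))
        self.W_out = rng.normal(0, 0.1, (CODE_DIM, HID_DIM))

    def forward(self, audio_seq):
        h = np.zeros(HID_DIM)
        for frame in audio_seq:          # one audio feature frame per step
            h = np.tanh(self.W_in @ frame + self.W_h @ h)
        return sigmoid(self.W_out @ h)   # predicted image code

# Cross-modal association: sound -> predicted code -> reconstructed image.
ae = ImageAutoencoder()
rnn = AudioToCodeRNN()
audio_seq = rng.normal(0, 1, (10, AUDIO_DIM))  # 10 frames of toy audio
code = rnn.forward(audio_seq)
image = ae.decode(code)
```

In this sketch the middle-layer activations (`h` and `code`) are the "multimodal features" whose per-individual differences the abstract says the DNNs self-organized; training (omitted here) would fit both networks to image-sound pairs from the psychological experiment.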