The Waseda Flutist Robot can play the flute at the level of an intermediate human player. This ability opens a wide field of possibilities for research on human-robot musical interaction. Our research focuses on enabling the flutist robot to interact more naturally with musical partners in the context of a jazz band. For this purpose, a Musical-Based Interaction System (MbIS) has been proposed that enables the robot to process both visual and aural cues arising during the interaction with musicians. In a previous publication, we concentrated on the implementation of visual communication techniques: we created an interaction interface that enabled the robot to detect the instrument gestures of partner musicians during a musical performance, implementing two computer vision approaches to build a two-skill-level interface for visual human-robot interaction in a musical context. In this paper we focus on the robot's aural perception system. The method introduced here enables the robot, provided a suitable acoustic environment, to detect the tempo and harmony of a partner musician's playing, with a specific focus on improvisation. We achieve this by examining the rhythmic and harmonic characteristics of the recorded sound, applying the same analysis approach to both the amplitude envelope and the frequency spectrum: in the former case we track amplitude transients, and in the latter, since we focus on communication with monophonic woodwind instruments, we follow the most prominent peak in the frequency spectrum. The audio analysis deliberately uses a technique similar to the one applied to motion tracking in our previous research. The experimental results show that, with our algorithm implemented, the robot correctly recognizes a number of rhythms and harmonies and can engage in a simple form of stimulus-and-reaction play with a human musician.
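The two analyses described here (amplitude-transient tracking for rhythm, spectral-peak following for pitch) can be sketched roughly as follows. This is a minimal illustration in Python with NumPy; the function names, frame size, sample rate, and threshold are assumptions for the sketch, not details of the MbIS implementation:

```python
# Illustrative sketch only: names and parameters are assumed, not taken
# from the MbIS implementation described in the paper.
import numpy as np

SR = 8000  # assumed sample rate in Hz

def detect_onsets(signal, frame=256, threshold=2.0):
    """Track amplitude transients: flag frames whose RMS energy jumps
    by more than `threshold` times the previous frame's energy."""
    n = len(signal) // frame
    rms = np.array([np.sqrt(np.mean(signal[i*frame:(i+1)*frame]**2))
                    for i in range(n)])
    onsets = []
    for i in range(1, n):
        if rms[i] > threshold * max(rms[i-1], 1e-6):
            onsets.append(i * frame / SR)  # onset time in seconds
    return onsets

def dominant_pitch(signal):
    """Follow the most prominent spectral peak: return the frequency of
    the largest magnitude bin (adequate for monophonic instruments)."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / SR)
    return freqs[np.argmax(spectrum)]

# Synthetic check: half a second of silence, then a 440 Hz tone.
t = np.arange(SR) / SR
tone = np.concatenate([np.zeros(SR // 2),
                       np.sin(2 * np.pi * 440 * t[:SR // 2])])

print(detect_onsets(tone))                   # one onset near t = 0.5 s
print(round(dominant_pitch(tone[SR // 2:])))  # → 440
```

In a real interactive setting the onset times would feed a tempo estimator and the peak frequencies a harmony classifier; the single-peak approach holds up only for monophonic input such as the woodwind instruments targeted here.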