We propose a new neuro-robotics network architecture that can generate goal-oriented behavior for visually guided multiple-object manipulation tasks performed by a humanoid robot. For example, in a "sequential hit" task involving multiple objects, the proposed network is able to modulate the humanoid robot's behavior by choosing suitable timing for gazing at, approaching, and hitting one object, and then repeating the sequence for the other object. To solve a multiple-object manipulation task via learning from examples, the current study considers two important mechanisms: (1) stereo visual attention with depth estimation for movement generation, together with dynamic neural networks for behavior generation, and (2) their adaptive coordination. Stereo visual attention provides a goal-directed sequence of shifts along a visual scan path, and it can guide the generation of a behavior plan that takes depth information into account for robot movement. The proposed model can simultaneously generate the corresponding sequences of goal-directed visual attention shifts and robot movement timing with regard to the current sensory states, including visual stimuli and body postures. The experiments show that the proposed network can solve a multiple-object manipulation task through learning, and that it can successfully generate some novel behaviors that were not learned beforehand.