In this paper, we present a speech enhancement method using two microphones in underdetermined situations, in which the sound sources outnumber the microphones. Time-frequency (TF) binary masking is a conventional method of enhancing speech in underdetermined situations by appropriately multiplying each TF component by zero or one. Extending this method, we previously proposed the time-frequency-bin-wise switching (TFS) beamformer. In this method, we switch among multiple preconstructed beamformers in each TF bin, each of which suppresses a particular interferer. However, this method requires the beamformer filter coefficients to be estimated in advance, using the target-active period and the interferer-wise-active periods as prior information. To overcome this limitation, we formulate the switching and construction of spatial filters as a joint optimization problem, which can be understood from two viewpoints: the clustering of TF bins according to the most dominant interferer signal in each bin, and the construction of a minimum variance distortionless response (MVDR) beamformer from the bins of each cluster. In an experiment, we confirmed that the proposed method outperformed conventional TF masking and fixed beamforming in speech enhancement regardless of the direction of the interferers.
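
To make the two viewpoints concrete, the following is a minimal NumPy sketch of one plausible alternating scheme, not the exact algorithm of this paper. It assumes that the target steering vector `a` is known for every frequency, that each cluster's spatial covariance is estimated from the observed mixture in the bins currently assigned to it, and that each TF bin is assigned to the beamformer with the smallest output power. The function names `mvdr_weights` and `switching_mvdr`, the diagonal loading term, and the random initialization are illustrative choices.

```python
import numpy as np

def mvdr_weights(R, a, eps=1e-6):
    """MVDR filter w = R^{-1} a / (a^H R^{-1} a) for one frequency bin."""
    M = R.shape[0]
    # Diagonal loading (an assumption of this sketch) keeps R invertible.
    R = R + eps * (np.trace(R).real / M + 1e-12) * np.eye(M)
    Ri_a = np.linalg.solve(R, a)
    return Ri_a / (a.conj() @ Ri_a)

def switching_mvdr(X, a, K=2, n_iter=10, seed=0):
    """Alternate between bin clustering and MVDR construction.

    X: (F, T, M) complex STFT of the microphone signals.
    a: (F, M) target steering vectors (assumed known here).
    K: number of switched beamformers per frequency.
    n_iter: number of alternations (at least 1).
    """
    F, T, M = X.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=(F, T))  # random initial clustering
    W = np.zeros((F, K, M), dtype=complex)
    for _ in range(n_iter):
        # Step 1: build one MVDR beamformer per cluster and frequency.
        for f in range(F):
            for k in range(K):
                Xk = X[f, labels[f] == k]        # frames assigned to k
                if len(Xk) == 0:
                    Xk = X[f]                    # empty cluster: use all frames
                R = Xk.T @ Xk.conj() / len(Xk)   # mean of x x^H
                W[f, k] = mvdr_weights(R, a[f])
        # Step 2: reassign each bin to the beamformer with minimum output power.
        Y = np.einsum('fkm,ftm->ftk', W.conj(), X)  # y = w_k^H x
        labels = np.argmin(np.abs(Y) ** 2, axis=2)
    # Enhanced signal: output of the selected beamformer in each TF bin.
    out = np.take_along_axis(Y, labels[..., None], axis=2)[..., 0]
    return out, labels, W
```

In this sketch, setting `K=1` reduces the procedure to a single fixed beamformer built from all frames, whereas larger `K` lets each TF bin select the filter that best suppresses its dominant interferer while every filter keeps a unit response toward the target.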