Conditional human image generation, the generation of human images in a specified pose from one or more reference images, is an inherently ill-posed problem, because parts that are occluded in the reference can have multiple plausible appearances. Using multiple reference images mitigates this ambiguity and improves performance. In this work, we introduce a differentiable vertex and edge renderer that incorporates pose information to realize human image generation conditioned on multiple reference images. The parameters of the differentiable renderer can be optimized jointly with the other parts of the system, which yields better results by learning a more meaningful shape representation of human pose. We evaluate our method on the Market-1501 and DeepFashion datasets, and comparisons with existing approaches validate its effectiveness.
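
To make the idea of a vertex and edge renderer concrete, the following is a minimal sketch and not the renderer used in this work. It assumes that each pose keypoint (vertex) is drawn as a Gaussian blob and each limb (edge) as a soft line segment, with hypothetical width parameters `sigma_vertex` and `sigma_edge` standing in for the learnable parameters. All function names are illustrative. Every pixel value is a smooth function of the keypoint coordinates and the widths (piecewise smooth for edges, because of the endpoint clamp), so the same arithmetic written in an automatic differentiation framework would allow these parameters to be optimized jointly with a generator.

```python
import math


def segment_distance_sq(px, py, ax, ay, bx, by):
    """Squared distance from pixel (px, py) to the segment from a to b."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        t = 0.0  # degenerate edge: both endpoints coincide
    else:
        # Project the pixel onto the line, then clamp to the segment.
        t = ((px - ax) * dx + (py - ay) * dy) / length_sq
        t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return (px - cx) ** 2 + (py - cy) ** 2


def render_vertex(height, width, vertex, sigma):
    """One channel with a Gaussian blob centred on a keypoint (x, y)."""
    vx, vy = vertex
    return [
        [math.exp(-((x - vx) ** 2 + (y - vy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def render_edge(height, width, a, b, sigma):
    """One channel with a soft line between keypoints a and b."""
    return [
        [math.exp(-segment_distance_sq(x, y, a[0], a[1], b[0], b[1])
                  / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def render_pose(height, width, vertices, edges, sigma_vertex, sigma_edge):
    """Stack one channel per vertex, followed by one channel per edge.

    sigma_vertex and sigma_edge are assumed stand-ins for learnable
    renderer parameters; they are not taken from the paper.
    """
    channels = [render_vertex(height, width, v, sigma_vertex) for v in vertices]
    channels += [
        render_edge(height, width, vertices[i], vertices[j], sigma_edge)
        for i, j in edges
    ]
    return channels


if __name__ == "__main__":
    # Two keypoints joined by one limb on a small 5 x 9 canvas.
    pose = render_pose(5, 9, [(2.0, 2.0), (6.0, 2.0)], [(0, 1)], 1.0, 0.5)
    print(len(pose), "channels of size", len(pose[0]), "x", len(pose[0][0]))
```

The resulting channels could be concatenated with the reference images as input to a generator. Because the widths enter only through smooth functions, a training loss can propagate gradients back to them, which is the property the differentiable renderer relies on.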