Creative natural language generation, such as poetry generation, writing lyrics, and storytelling, is appealing but difficult to evaluate. We take the application of image-inspired poetry generation as a showcase and investigate two problems in evaluation: (1) how to evaluate the generated text when there are no ground truths, and (2) how to evaluate nondeterministic systems that output different texts given the same input image. Regarding the first problem, we first design a judgment tool to collect ratings of a few poems for comparison with the inspiring image shown to assessors. We then propose a novelty measurement that quantifies how different a generated text is compared to a known corpus. Regarding the second problem, we experiment with different strategies to approximate evaluating multiple trials of output poems. We also use a measure for quantifying the diversity of different texts generated in response to the same input image, and discuss their merits.