This study uses 36 conceptual hydrologic models, calibrated to streamflow observations in 559 catchments across the United States, to investigate differences and similarities in model performance. Central to this talk is the common approach to model setup that relies on separate calibration and evaluation periods and a single objective function to quantify performance. We investigate this topic from multiple angles and show that several common, and sometimes implicit, assumptions underlying this approach are not supported by our results. This large-sample evidence indicates that the traditional approach to calibrating and evaluating conceptual models is not sufficient to ensure that a model produces “the right results for the right reasons” and that more thoughtful model evaluation is needed.
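For readers unfamiliar with the workflow under scrutiny, the sketch below illustrates the traditional split-sample setup: a model is calibrated by optimizing a single objective function (here Nash-Sutcliffe efficiency) on one period and then scored on a held-out evaluation period. The one-parameter toy model and synthetic data are hypothetical stand-ins for illustration only; they are not any of the study's 36 models or its catchment data.

```python
# Minimal sketch of split-sample calibration with a single objective function.
# The toy linear-reservoir model and synthetic forcing are assumptions made
# for illustration, not part of the study described in the abstract.
import numpy as np
from scipy.optimize import minimize_scalar

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is perfect, <= 0 is no better than the mean."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def toy_model(precip, k):
    """Hypothetical one-parameter linear reservoir: outflow = k * storage."""
    storage, q = 0.0, np.empty_like(precip)
    for t, p in enumerate(precip):
        storage += p
        q[t] = k * storage
        storage -= q[t]
    return q

rng = np.random.default_rng(0)
precip = rng.gamma(shape=0.5, scale=4.0, size=730)           # two years of daily forcing
obs = toy_model(precip, 0.3) * rng.lognormal(0, 0.1, 730)    # "observations" with noise

cal, evl = slice(0, 365), slice(365, 730)                    # calibration / evaluation split

# Calibrate by maximizing NSE on the calibration period only.
res = minimize_scalar(lambda k: -nse(toy_model(precip, k)[cal], obs[cal]),
                      bounds=(0.01, 1.0), method="bounded")

sim = toy_model(precip, res.x)
print(f"k = {res.x:.3f}, "
      f"NSE (calibration) = {nse(sim[cal], obs[cal]):.3f}, "
      f"NSE (evaluation)  = {nse(sim[evl], obs[evl]):.3f}")
```

The implicit assumptions the abstract questions live in this setup: that the evaluation-period score is a fair summary of model adequacy, and that one objective function captures "performance".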