The human ability to learn and compose conceptual operations is foundational to flexible generalization, such as creating new dishes from known cooking processes. Beyond naive chaining of functions, evidence from the linguistics literature suggests that people can learn and apply context-sensitive, interactive rules, such that the output of a composition depends on context changes induced by different function orderings. Extending this investigation to the visual domain, we developed a function learning paradigm to explore the capacity of humans and neural network models to learn and reason with compositional functions under varied interaction conditions. After brief training on individual functions, human participants were assessed on composing two learned functions across four main interaction types, including cases in which applying the first function creates or removes the context needed to apply the second. Our findings indicate that humans can make zero-shot generalizations on novel visual function compositions across interaction conditions, demonstrating sensitivity to contextual changes. A comparison with a neural network model on the same task reveals that, through the meta-learning for compositionality (MLC) approach, a standard sequence-to-sequence Transformer can approximate a strong function learner and, with additional fine-tuning, can also mimic human error patterns.