This is an old problem, and it seems to me that it has little to do with the dichotomy between interpretable and black-box predictive models. It’s more a fundamental problem with measurement and incentives, captured by the cliché “what you measure is what you get.” See, e.g., standardized testing, employee performance evaluations, academics churning out the “least publishable unit,” and so on, where the metric for success becomes the entire thing that people pursue.
Seen that way, the only difference between the problems you’re describing and the classic one is that your metrics would be slightly better in the absence of incentive effects, because they’ve got better math under the hood.
But maybe that actually makes the difference. Hmm. Here’s a slightly weird idea: presumably a gameable model will become less predictive over time. If math grades predict graduation at time T, and teachers inflate everyone’s grades at T+1 (including the grades of students who aren’t close to graduating), then at T+2 math grades don’t predict as well. But doesn’t that mean we can evaluate the gameability of a model in part by how stable its predictive power is over time, all else being equal… without knowing the details of the underlying behavior?
If that’s right, then maybe this is even a self-correcting problem, since one ought to be continually re-evaluating one’s models anyway. If the AUC of the high school graduation model takes a nosedive within two years, that’s a good reason to throw the model out, and doing so might incidentally prevent gaming
… or at least prevent long-term gaming. Thinking in this direction starts to have an arms race kind of feel. You get a model, it works for a while until people learn to game it, then it stops working, then you make a new one. Intermittent success. Is that adequate?
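To make the “continually re-evaluating” part concrete, here’s a minimal sketch of what that monitoring could look like. This is my own illustration, not anything from the original post: it assumes you already have each cohort’s true outcomes and model scores in hand, and the 0.05 drop threshold is an arbitrary placeholder.

```python
# Minimal sketch (an assumption of mine, not the author's method): score each
# cohort with the same frozen model, compute AUC per cohort, and flag cohorts
# where AUC has fallen well below the best value seen so far. A drop like that
# is consistent with the model being gamed (or with plain old drift).
from sklearn.metrics import roc_auc_score

def auc_by_cohort(cohorts):
    """cohorts: dict mapping a period label (e.g. a year) -> (y_true, y_score)."""
    return {label: roc_auc_score(y_true, y_score)
            for label, (y_true, y_score) in sorted(cohorts.items())}

def flag_degradation(auc_history, max_drop=0.05):
    """Flag periods whose AUC sits more than max_drop below the best AUC so far.

    max_drop is an arbitrary threshold; choose it based on how noisy your
    cohort-to-cohort AUC estimates are.
    """
    flags, best = {}, None
    for label, auc in auc_history.items():
        best = auc if best is None else max(best, auc)
        flags[label] = (best - auc) > max_drop
    return flags
```

Calling `auc_by_cohort({2018: (y18, s18), 2019: (y19, s19), 2020: (y20, s20)})` (hypothetical variable names) would give per-year AUCs, and `flag_degradation` would then tell you which years look like “nosedive” years in the sense above.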