[microblog] One book is worth ‘0.06%’ benchmark points to AI; is ‘no different from noise’. What gives?

Commenting on recent coverage of, and discussion around, Meta’s arguments about quantifying the value of training data.

Published: April 21, 2025

Original Substack post: https://dataleverage.substack.com/p/microblog-one-book-is-worth-006-benchmark

This will be a “contextualized microblog”. I saw several posts in quick succession that I thought could be worth commenting on as a group.

First, a set of relevant links that discuss training data value in the context of lawsuits, compensation and consent:

I think Beyer’s points are critical to account for in data value discussions. Robust data valuation remains expensive, and much more context is needed before throwing around numbers like this (as he notes: model scale, training duration, exact benchmark details, and so on).

The Short Argument

What I want to add (echoing the September post) is this: in any data valuation experiment, another factor that must be accounted for is whether and how data points will be grouped. In pure “leave-one-out” experiments, there are no groups; it’s every data point for itself. There are technical arguments for grouping data points: we might want to compute data value along natural groupings that emerge from the data (e.g., demographic groups), compute Shapley values that account for the various “coalitions” that might exist, or compute data values in a way that gives more weight to either smaller or larger groups.1

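To make the individual-vs.-group contrast concrete, here is a minimal sketch (my own toy setup with a synthetic classifier; this is not the experiment behind the 0.06% figure) showing how the same “remove data, measure the performance drop” logic gives very different numbers depending on the grouping choice:

```python
# Toy sketch: individual leave-one-out (LOO) value vs. group-level value.
# All names and parameters here are illustrative, not from any cited paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def utility(X_train, y_train, X_val, y_val):
    """Validation accuracy of a model trained on the given subset."""
    if len(np.unique(y_train)) < 2:  # degenerate subset: cannot fit a classifier
        return 0.0
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_val, y_val)

X, y = make_classification(n_samples=220, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=20, random_state=0)
full = utility(X_tr, y_tr, X_val, y_val)

# Individual leave-one-out: drop one point at a time. These values are
# typically tiny and noisy -- the "one book is noise" regime.
loo = np.array([
    full - utility(np.delete(X_tr, i, axis=0), np.delete(y_tr, i), X_val, y_val)
    for i in range(len(X_tr))
])

# Group-level leave-one-out: drop whole blocks of 50 points, standing in
# for e.g. a publisher's catalog or an economic sector.
groups = np.arange(len(X_tr)) // 50
for g in np.unique(groups):
    v = full - utility(X_tr[groups != g], y_tr[groups != g], X_val, y_val)
    print(f"group {g}: value = {v:+.4f}")
print(f"mean |individual value| = {np.abs(loo).mean():.4f}")
```

The point of the sketch is that neither number is the “true” value of the data; the grouping is a modeling choice made before any measurement happens.
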
Different approaches can be empirically tested for effectiveness on tasks like data selection or mislabeled-example detection. However, in the context of training data and consent/compensation/law (topics that are typically the forte of economics, philosophy, human-computer interaction, etc.), no purely technical approach can determine the appropriate group size to test. Rather, for practical purposes (e.g., for a market or for a lawsuit), the appropriate group size depends on the coordination and collective bargaining capabilities of the training data creators.

The 0.06% number (again, with all its many caveats) is really only relevant if we accept an implicit assumption that all authors would act as individual agents in some hypothetical data market. The other extreme (also unrealistic) would be to assume perfect cooperation among all data creators in the world, and then to attribute 100% of model performance to that single coalition. In my opinion, the approach that would most benefit these discussions is a middle ground: we should calculate and openly discuss data values at the level of economic sectors, groups of firms, individual firms, and perhaps specific interest groups, but probably not at the level of individual people or books.

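For readers who want the middle-ground idea in code form, here is a rough sketch (the function name and setup are mine; permutation sampling is a standard Monte Carlo estimator for Shapley values, as in the Data Shapley line of work cited below) that treats coalitions, rather than individual data points, as the players:

```python
# Rough sketch: Shapley values with groups (sectors, firms, interest groups)
# as the players. The permutation-sampling estimator is standard; the
# grouping itself is the normative choice discussed above.
import numpy as np

rng = np.random.default_rng(0)

def group_shapley(group_ids, utility_fn, n_permutations=200):
    """Estimate one Shapley value per group.

    group_ids: array assigning each training point to a group ("player").
    utility_fn: maps a boolean mask over training points to model performance.
    """
    players = np.unique(group_ids)
    values = {g: 0.0 for g in players}
    for _ in range(n_permutations):
        order = rng.permutation(players)
        mask = np.zeros(len(group_ids), dtype=bool)
        prev = utility_fn(mask)  # utility of the empty coalition
        for g in order:
            mask = mask | (group_ids == g)  # add group g to the coalition
            cur = utility_fn(mask)
            values[g] += (cur - prev) / n_permutations  # marginal contribution
            prev = cur
    return values
```

By the efficiency axiom, these group-level values sum (up to Monte Carlo error) to the full-dataset utility, so 100% of measured performance gets attributed somewhere; the argument above is about which level of aggregation makes that attribution meaningful for a market or a lawsuit.
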
1. To briefly list just a few papers:

  • Koh, P. W. W., Ang, K. S., Teo, H., & Liang, P. S. (2019). On the accuracy of influence functions for measuring group effects. Advances in Neural Information Processing Systems, 32. [link]
  • Ghorbani, A., & Zou, J. (2019). Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning (pp. 2242–2251). PMLR. [link]
  • Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., … & Spanos, C. J. (2019). Towards efficient data valuation based on the Shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1167–1176). PMLR. [link]
  • Kwon, Y., & Zou, J. (2021). Beta Shapley: A unified and noise-reduced data valuation framework for machine learning. arXiv preprint arXiv:2110.14049. [link]