Keystone Strategy

Laying the foundations for competition in AI? Some reflections on the CMA’s Foundation Models Initial Report

By Stefan Hunt, Nitika Bagaria, Emily Chissell and Wen Jian
September 20, 2023   /   5 Minute Read

The CMA published its initial report on AI foundation models on Monday (18th September). It's a solid report, packed with useful facts, explanations and descriptions of the rapidly evolving Gen AI market, based on independent research and detailed discussions with market participants. And it has some interesting takes on hot topics such as open-source versus closed-source models, the use of synthetic data and the watermarking of AI outputs.

But the report could perhaps have gone further and given some views on how to start tackling some of the AI competition and consumer protection issues. This initial report mainly focuses on principles (access, diversity, choice, flexibility, fair dealing and transparency), which are hard to disagree with.

So, in which areas could the report have gone further, and where would we call for the CMA to focus during the next phase of this project? To name a few: helping data input markets function better, addressing data feedback loops, ensuring effective consumer protection, and democratising responsible AI (i.e., making it easier for all AI systems to operate responsibly). We explore each of these in this blog.

What does the CMA report cover?

Figure from the report: Overview of generative AI markets

As shown in the figure, the CMA identifies several distinct stages in the foundation model value chain. The chain begins with AI infrastructure – compute, expertise and data – the raw inputs for foundation models – and then moves through three further stages (sketched in code after the list):

  • upstream foundation models, trained on a large corpus of data to generate text, images and other outputs;
  • fine-tuned models, trained on additional domain-specific data to perform domain-specific tasks; and
  • downstream user-facing apps that are based on fine-tuned models and offer interactive interfaces for end-users.
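To make the three stages concrete, here is a minimal, illustrative sketch in Python using the Hugging Face transformers library. The choice of GPT-2, the toy domain data and the hyperparameters are our own assumptions for illustration – none of them come from the report.

```python
# Illustrative sketch of the three value-chain stages (our own toy
# example; model, data and hyperparameters are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage 1: an upstream foundation model, pre-trained on a large corpus.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Stage 2: fine-tuning on (toy) domain-specific data.
domain_texts = [
    "Q: What is our refund policy? A: Full refunds within 30 days.",
    "Q: How do I reset my password? A: Use the account settings page.",
]
batch = tokenizer(domain_texts, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few gradient steps stand in for a real fine-tune
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 3: a downstream user-facing app offering an interactive interface.
model.eval()
prompt = "Q: What is our refund policy? A:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Each stage can, in principle, be carried out by a different firm – which is why the CMA analyses competition at each layer separately.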

For more on the topic of the structure of AI markets, see Keystone’s recent paper on data in the AI value chain.

In the AI infrastructure layer, the CMA focuses on providing equal access to inputs (i.e. compute, expertise and data) and on ensuring that first movers do not entrench their advantages and that partnerships do not reduce others' ability to compete.

In the AI development layer, the CMA's main competition concern is ensuring enough choice and diversity for developers to access the latest technology, in terms of both closed- and open-source foundation models. A related concern is whether developers have the flexibility to switch from one foundation model to another without being locked into any single ecosystem or infrastructure.

In the AI deployment layer, the CMA's competition concern is that vertically integrated companies could potentially foreclose competition downstream by limiting access upstream. The report calls for fair dealing and refers to concerns about possible anti-competitive behaviours such as self-preferencing, tying or bundling – all well-known digital 'red flags' in the competition world. It also calls for transparency from foundation model (FM) developers about risks and limitations so that deployers can, in turn, manage their responsibilities to consumers.

Some of the other interesting issues in AI that the CMA touches on include:

  • Whether synthetic data – data generated from AI systems – can be used as an input to FMs to train the models. While the CMA discusses the view that synthetic data does not yet lend itself to being used in training foundation models and that there is a risk of 'model collapse' (see the toy simulation after this list), it is open-minded about the potential for pre-training using synthetic data (counter to our view in the Keystone paper on data in AI, mentioned earlier).
  • Whether watermarking techniques, similar to existing copyright protection, can work. The CMA report explains that these techniques could potentially be applied to AI-generated audio and images, but watermarking text remains an open challenge.
  • Whether open-source models could compete with closed models in the long run, and whether downstream applications built on open-source models could sustain commercial viability. The CMA seems sceptical on both counts.
  • Whether YouTube could give Google an advantage as a repository of conversational-style video data and accompanying text. The CMA's preliminary view is that it might.
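On the model collapse point, a toy simulation shows the intuition. We repeatedly fit a simple model – here a one-dimensional Gaussian, our own illustrative choice rather than an experiment from the report – to samples drawn from the previous generation's fitted model, and watch the fitted distribution lose the diversity of the original data.

```python
# Toy illustration of 'model collapse': each generation is trained
# only on the previous generation's synthetic output. The setup
# (Gaussian model, sample size, generation count) is an assumption
# of ours for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_generations = 100, 300

data = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # 'human' data
for gen in range(n_generations):
    mu, sigma = data.mean(), data.std()           # 'train' the model
    data = rng.normal(mu, sigma, size=n_samples)  # next gen sees only synthetic data
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.3f}")

# The fitted standard deviation drifts towards zero: each generation
# loses a little of the original distribution's diversity, which is
# the intuition behind the 'model collapse' concern.
```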

What more might the CMA have said?

Need for a well-functioning data market

The CMA rightly highlights the importance of data as an input at each stage. It questions whether proprietary data could be a source of competitive advantage in the future and whether user feedback (and the ability to access it) could be a source of unfair advantage for large vertically integrated firms. However, it could have gone further…

The CMA could have discussed the need for well-functioning proprietary data markets. Most of the key FM developers now do not systematically reveal their data sources. Without this disclosure, it is challenging (and sometimes impossible) to identify the data used in training; reverse engineering it, especially for language data, can be extremely difficult.

But why is that a problem? Without knowing the “value” of their data for FM training or fine-tuning, data providers suffer from stark information asymmetry. They have few means of assessing how important their data is and what its price should be. This could reduce incentives to produce or release proprietary data for models. If high-quality data becomes relatively scarce (and recent research suggests that models are close to exhausting available sources), then this could hamper the training and fine-tuning of models, including dealing with consumer protection issues such as hallucinations from models.
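One way to see what better information could enable: a simple leave-one-source-out comparison, in which a data source's value is estimated as the change in model quality when that source is excluded from training. The toy dataset and model below are illustrative stand-ins of our own, not a methodology from the report.

```python
# Leave-one-source-out data valuation (illustrative sketch with a
# synthetic dataset and a simple classifier; all choices are ours).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_source(n, signal):
    """Generate a labelled 'data source'; signal controls its quality."""
    X = rng.normal(size=(n, 5))
    y = (signal * X[:, 0] + rng.normal(size=n) > 0).astype(int)
    return X, y

X_a, y_a = make_source(500, signal=2.0)    # high-quality source
X_b, y_b = make_source(500, signal=0.1)    # low-quality (noisy) source
X_val, y_val = make_source(500, signal=2.0)

def score(X, y):
    return LogisticRegression().fit(X, y).score(X_val, y_val)

full = score(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))
print(f"value of source A ≈ {full - score(X_b, y_b):.3f}")
print(f"value of source B ≈ {full - score(X_a, y_a):.3f}")
```

More sophisticated approaches (for example, Shapley-value-based data valuation) build on the same idea – but all of them require knowing which data went into the model in the first place, which is precisely the disclosure that is missing.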

Addressing feedback loops

The CMA also acknowledges that user data can be relevant downstream for fine-tuning and user-facing applications. In many cases, the more users interact with a model, the better the application gets – which, of course, can lead to market tipping. The significance of this feedback effect depends on the circumstances, though: where users' needs are very specific, for example, feedback data may be less important.

The CMA ultimately sidesteps what could be done to minimise market distortions from feedback loops before they 'tip' competition, simply noting that it is too early to call. But it could usefully have elaborated on the warning signs to look out for, and on what it could do if these issues materialise (and there is plenty it could do – e.g., data access remedies; think Google click-and-query).
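To illustrate why feedback loops worry competition authorities, here is a stylised simulation of our own devising: two rival apps start symmetric, each interaction adds to the chosen app's stock of feedback data, and users are more likely to pick the app with more data behind it. Under these assumed dynamics, the market almost always tips to one app.

```python
# Stylised simulation of a data feedback loop 'tipping' a market.
# All parameters are illustrative assumptions, not from the report.
import numpy as np

rng = np.random.default_rng(1)
data = np.array([1.0, 1.0])   # accumulated user data per app
beta = 2.0                    # how strongly data-driven quality steers choice

for _ in range(10_000):
    quality = np.log(data)    # diminishing returns to data
    p = np.exp(beta * quality)
    p /= p.sum()              # choice probabilities over the two apps
    choice = rng.choice(2, p=p)
    data[choice] += 1.0       # the chosen app gets the feedback data

shares = data / data.sum()
print(f"final usage shares: {shares.round(3)}")  # typically close to [1, 0] or [0, 1]
```

Which app wins is essentially random in this toy model – the point is that an early, possibly accidental, lead becomes self-reinforcing, which is why early warning signs matter.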

Incentives (and disincentives) for firms to protect consumers

The CMA recognises that consumers might face significant risks from generative AI applications, including deep fakes, hidden advertising, phishing and fake reviews. These concerns could be amplified if the cost of launching large-scale attacks plummets as Gen AI starts automatically creating fake (but believable) content at scale.

Though the CMA has put an early stake in the ground, it does not suggest or explore potential technical remedies (of which there are many) for grappling with these issues. One example follows.
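As one example of the kind of technical remedy available, AI providers could cryptographically attest the provenance of generated content so that platforms can later verify whether a piece of content came from a given provider and is unaltered. The key handling and workflow below are our own simplified assumptions; real schemes (such as C2PA-style content credentials) use public-key signatures and certified metadata.

```python
# Simplified content-provenance sketch using an HMAC tag (illustrative
# only; PROVIDER_KEY and the workflow are hypothetical assumptions).
import hmac
import hashlib

PROVIDER_KEY = b"provider-secret-key"  # hypothetical provider-held key

def attest(content: str) -> str:
    """Provider tags content it generated."""
    return hmac.new(PROVIDER_KEY, content.encode(), hashlib.sha256).hexdigest()

def verify(content: str, tag: str) -> bool:
    """A platform holding the key checks the tag against the content."""
    return hmac.compare_digest(attest(content), tag)

ad_copy = "Limited offer! Click here to claim your prize."
tag = attest(ad_copy)
print(verify(ad_copy, tag))        # True: provenance checks out
print(verify(ad_copy + "!", tag))  # False: content was altered
```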

Democratising Responsible AI

Responsible AI refers to the practice of designing, developing, deploying and monitoring AI systems to ensure they operate ethically, transparently and fairly, and do not cause harm. One way to achieve this is through the use of human-generated data to "align" model outputs.
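For readers unfamiliar with how such alignment data is used: a common first step in RLHF-style pipelines is to fit a reward model on pairs of human-preferred and rejected responses, using a Bradley-Terry style loss. The tiny feature-based setup below is our own simplification for illustration, not a method described in the report.

```python
# Minimal reward-model sketch on pairwise human preferences
# (synthetic stand-in data; the whole setup is an illustrative assumption).
import torch
import torch.nn as nn

torch.manual_seed(0)

reward_model = nn.Linear(8, 1)  # stand-in for a scoring head over response features
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Each pair: features of a human-preferred response vs. a rejected one.
preferred = torch.randn(64, 8) + 0.5
rejected = torch.randn(64, 8) - 0.5

for step in range(200):
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()  # Bradley-Terry loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.3f}")
```

The fitted reward model is then used to steer a foundation model's outputs. The expensive ingredient is not the code but the human-labelled preference pairs – which is exactly why the cost of alignment data matters for market entry.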

The CMA sits on the fence about how difficult it is to obtain alignment data and whether its cost could be a barrier to market entry – in our minds, this is a significant gap. The CMA notes the existence of open-source alignment data but is sceptical about the long-term viability of these data sets. We think this type of data is an important element in fine-tuning model performance and something that deserves a closer look during the next phase of the project.

Those are a few initial thoughts. We’d be interested to hear yours.