Environmental impacts come in many forms, but here we focus primarily on carbon dioxide equivalent (CO2e) emissions, since CO2 is the main greenhouse gas (GHG) causing global warming and the biggest threat to the environment.
The carbon emissions from a large language model (LLM) primarily come from two phases: (1) the up-front cost to build the model (the training cost) and (2) the cost to operate the model on an ongoing basis (the inference cost).
The up-front costs include the emissions generated to manufacture the relevant hardware (embodied carbon) and the cost to run that hardware during the training procedure, both while the machines are operating at full capacity (dynamic computing) and while they are not (idle computing). The best estimate of the dynamic computing cost in the case of GPT-3, the model behind the original ChatGPT, is approximately 1,287,000 kWh (kilowatt-hours), or 552 tonnes (metric tons) of CO2e.
This is roughly equivalent to the emissions of two or three full Boeing 767s flying round-trip from New York City to San Francisco. Figures for the training of Llama 2 are similar: 1,273,000 kWh and 539 tonnes of CO2e. Analysis of the open-source model BLOOM suggests that accounting for idle computing and embodied carbon could roughly double these figures.
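As a rough consistency check on the figures above, the short Python sketch below derives the carbon intensity (kg of CO2e per kWh) implied by each energy and emissions pair, and applies the BLOOM-based doubling as a simple multiplier. The function names and the flat 2.0 multiplier are illustrative assumptions, not part of any of the cited studies.

```python
# Training figures as quoted in the text (dynamic computing only).
TRAINING_RUNS = {
    "GPT-3": {"energy_kwh": 1_287_000, "emissions_tonnes": 552},
    "Llama 2": {"energy_kwh": 1_273_000, "emissions_tonnes": 539},
}


def implied_intensity_kg_per_kwh(energy_kwh, emissions_tonnes):
    """Return the kg of CO2e per kWh implied by an energy/emissions pair."""
    return emissions_tonnes * 1000 / energy_kwh


def with_overheads(emissions_tonnes, multiplier=2.0):
    """Scale dynamic-computing emissions to cover idle computing and
    embodied carbon. The flat multiplier of 2.0 is an assumption based on
    the 'could double' remark about BLOOM, not a measured value."""
    return emissions_tonnes * multiplier


if __name__ == "__main__":
    for name, run in TRAINING_RUNS.items():
        intensity = implied_intensity_kg_per_kwh(**run)
        total = with_overheads(run["emissions_tonnes"])
        print(f"{name}: {intensity:.3f} kg CO2e/kWh, "
              f"about {total:.0f} tonnes with overheads")
```

Both models come out at a little over 0.42 kg of CO2e per kWh, which indicates that the two sets of figures rest on similar assumptions about the electricity mix.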
The ongoing usage costs do not include any additional embodied carbon (e.g., from manufacturing the computers, which has already been accounted for in the building cost) and are very small per query, but multiplied across billions of monthly visits they produce an aggregate impact likely far greater than the training costs.
Estimates from one study for the aggregate cost of inferences for ChatGPT over a monthly period were between 1 and 23 million kWh considering a range of scenarios, with the top end corresponding to the emissions of 175,000 residents of the author’s home country of Denmark. Another pair of authors arrived at 4 million kWh via a different methodology, suggesting these estimates are probably in the right ballpark.
We note that in any event, the electricity usage of ChatGPT in inference likely surpasses the electricity usage of its training within weeks or even days. This aligns with claims from AWS and Nvidia that inference accounts for as much as 90% of the cost of large-scale AI workloads.
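The "weeks or even days" claim can be checked with simple arithmetic. The sketch below, a minimal illustration rather than a reproduction of either study, divides the GPT-3 training estimate by the daily inference usage implied by each monthly estimate. The 30-day month and the scenario labels are assumptions made here for convenience.

```python
TRAINING_KWH = 1_287_000  # GPT-3 dynamic computing estimate quoted earlier

# Monthly inference estimates quoted in the text; the labels are ours.
MONTHLY_INFERENCE_KWH = {
    "low scenario": 1_000_000,
    "second study": 4_000_000,
    "high scenario": 23_000_000,
}

DAYS_PER_MONTH = 30  # assumption: treat a month as 30 days


def days_to_match_training(monthly_kwh, training_kwh=TRAINING_KWH,
                           days_per_month=DAYS_PER_MONTH):
    """Days of inference needed to use as much electricity as training."""
    daily_kwh = monthly_kwh / days_per_month
    return training_kwh / daily_kwh


if __name__ == "__main__":
    for label, monthly in MONTHLY_INFERENCE_KWH.items():
        days = days_to_match_training(monthly)
        print(f"{label}: about {days:.1f} days")
```

Under these assumptions the break-even point ranges from roughly 39 days at the low end to under 2 days at the high end, with the 4 million kWh estimate giving a little under 10 days.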
One comment about efficiency is in order. Continuing our earlier analogy, instead of two or three full Boeing 767s flying round-trip from New York to San Francisco, current provision of consumer LLMs may be more like a Boeing 767 carrying one passenger at a time on that same journey. For all their power, the largest LLMs are often used for relatively trivial interactions that could be handled by a smaller model or another sort of application, such as a search engine, or for interactions that arguably need not happen at all. Indeed, some not-exactly-necessary uses of ChatGPT, such as “write a biblical verse in the style of the King James Bible explaining how to remove a peanut butter sandwich from a VCR,” bear more resemblance to a single-passenger flight from New York to Cancún than from New York to San Francisco.
Excitement around generative artificial intelligence (GAI) has produced an “arms race” between major providers like OpenAI and Google, with the goal of producing the model that can handle the widest range of possible use cases to the highest standard possible for the largest number of users. The result is overcapacity for the sake of market dominance by a single flagship model, not unlike airlines flying empty planes between pairs of airports to maintain claims on key routes in a larger network. The high levels of venture capital funding currently on offer in the GAI space enable providers to tolerate overcapacity for the sake of performance and growth. Business models that are much more energy- and cost-efficient are available.
We must emphasize the huge uncertainty surrounding the estimates on which this analysis is based, which stems from both the lack of a standard methodology and the lack of transparency in the construction of LLMs. ChatGPT maker OpenAI has publicly announced neither the data used to train its latest model, GPT-4, nor the number of parameters it contains. Speculation and leaks about GPT-4 put the figure at approximately 10 times the number of parameters in GPT-3, the model powering the original ChatGPT. Google has not released full details about the LaMDA model powering its chatbot, Bard. DeepMind, Baidu, and Anthropic have similarly declined to release full details about the training of their flagship LLMs.
Uncertainty remains even for open-source models, since the true impact of a model involves accounting for the cost of deploying the model to an unknown and varying number of users, as well as the emissions generated in producing the hardware that serves these models to end users. Still greater complexity derives from the precise mix of fossil fuels and renewable energy used where the models are trained and deployed.
Finally, we mention briefly that the water consumption of ChatGPT has been estimated at 500 milliliters for a session of 20-50 queries. Aggregating this over the billions of visits ChatGPT has received since its launch in November 2022 amounts to billions of liters of water, spent directly on cooling computers and indirectly on electricity generation.
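To make the water figures concrete, the following sketch converts the per-session estimate into a per-query range and scales it to an aggregate total. The session count passed to `aggregate_liters` is a purely hypothetical input, since the text gives no exact number of sessions, and the function names are ours.

```python
SESSION_WATER_ML = 500          # estimate quoted in the text
QUERIES_PER_SESSION = (20, 50)  # session length range quoted in the text


def water_per_query_ml(session_ml=SESSION_WATER_ML,
                       queries=QUERIES_PER_SESSION):
    """Return (min, max) milliliters of water per query."""
    fewest, most = queries
    # More queries per session means less water attributed to each query.
    return session_ml / most, session_ml / fewest


def aggregate_liters(sessions, session_ml=SESSION_WATER_ML):
    """Total liters for a given number of sessions (hypothetical input)."""
    return sessions * session_ml / 1000


if __name__ == "__main__":
    low, high = water_per_query_ml()
    print(f"Per query: {low:.0f} to {high:.0f} mL")
    # Hypothetical: two billion sessions of 20-50 queries each.
    print(f"Aggregate: {aggregate_liters(2_000_000_000):,.0f} liters")
```

On these numbers each query accounts for 10 to 25 milliliters, and every two billion sessions add up to about one billion liters.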
[For more from the authors on this topic, see: “Environmental Impact of Large Language Models.”]