Bias is one of the biggest ethical concerns with generative AI models, and it is not a problem likely to be solved in the near future. Because of the way Large Language Models (LLMs) work - they are trained to identify patterns in large amounts of data, then produce outputs by making predictions based on those patterns - any biases present in the data will be recognised as a pattern, then replicated in the textual/visual/audio output.
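To make that mechanism concrete, here is a minimal, purely illustrative Python sketch. The toy corpus and word counts are invented for this example and are not drawn from any real model: a toy "model" that predicts the next word from frequency counts will faithfully reproduce whatever imbalance its training text contains.

```python
from collections import Counter, defaultdict

# A deliberately skewed toy corpus: after "said", "he" appears four times
# as often as "she". Real training data can encode similar imbalances.
corpus = (
    "the doctor said he would call " * 8 +
    "the doctor said she would call " * 2
).split()

# Count which word follows each word: a toy stand-in for learning patterns.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

# "Prediction" = the most common continuation seen in training.
print(following["said"].most_common(1))  # [('he', 8)] -> the skew is reproduced
```

Real LLMs are vastly more sophisticated, but the underlying point holds: a model can only reflect the data it was trained on.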
Below is a breakdown of the kinds of bias that may affect LLMs.
1. Training data bias - The data used to train the model may itself be biased, reflecting the biases and perspectives present in the original data sources. If the training data over-represents certain groups or viewpoints, the model can learn and amplify those biases.
2. Algorithmic bias - The machine learning algorithms used to train the model may have inherent biases built into them, which then get reflected in the model's outputs.
3. Societal biases - The biases and stereotypes present in society can get encoded into the model if it is not carefully designed to counteract those biases.
4. Lack of diversity - If the teams developing the AI models lack diversity in terms of backgrounds, experiences, and perspectives, they may not be as aware of or able to identify the biases present in the models they create.
Example of bias
Vincent, J. (2020, June 24). What a machine learning tool that turns Obama white can (and can’t) tell us about AI bias. The Verge. https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-obama-bias
The article above describes the output of a depixelation tool called PULSE. When given a pixelated picture of Barack Obama, America's first Black president, its algorithms produced an image of a white man.
One of the main concerns around data bias is its potential to perpetuate and reinforce harmful stereotypes, discrimination, and exclusion of marginalized groups.
Created with the assistance of GenAI: Anthropic. (2024). Claude 3 Haiku (March 4 version) [Large language model]. https://claude.ai/
Privacy is an issue that anyone using Large Language Models needs to be aware of and consider carefully before choosing to use them.
Lack of transparency: The inner workings of these complex models are often opaque, making it difficult to know exactly what information they have learned and how it might be used in ways that compromise privacy. Most LLM providers also use the prompts users enter to further train and refine their models.
Potential for re-identification: Even if models are trained on anonymized data, there are concerns that the outputs could potentially be used to re-identify individuals, especially for high-profile people.
Monetization of personal information: Large tech companies that develop these models may be tempted to monetize user data by selling it to advertisers, data brokers, or other third parties, prioritizing profits over privacy.
Created with the assistance of GenAI: Anthropic. (2024). Claude 3 Haiku (March 4 version) [Large language model]. https://claude.ai/
The unreliability of generative AI as an information retrieval tool is a significant problem. Misinformation and disinformation lie at its core:
1. Difficulty in content moderation: The sheer volume of content that language models can generate poses significant challenges for human-based content moderation efforts, allowing erroneous information and disinformation to slip through the cracks.
2. Inability to distinguish truth from fiction: Language models are designed to generate plausible-sounding text, but they lack the contextual understanding and reasoning capabilities to reliably distinguish between true and false information.
3. Ability to generate plausible-sounding content: The advanced text generation capabilities of large language models allow them to produce content that can appear highly convincing and authentic, even if it is factually inaccurate or misleading.
4. Scalability of mis- and disinformation: Language models can rapidly generate large volumes of text, enabling the widespread dissemination of misinformation at scale, far beyond what individual human actors could achieve.
5. Difficulty in detection: It can be challenging for average users to distinguish AI-generated content from human-written text, making it harder to identify and fact-check misinformation.
6. Amplification of existing biases and prejudices: If language models are trained on data that contains biases or prejudices, they may generate content that reinforces and amplifies those harmful perspectives.
7. Deepfakes: These models can generate highly realistic and convincing content, whether it's text, images, audio or video, which raises significant concerns about the potential for malicious actors to create "deepfakes" - content that appears authentic but is fabricated.
Created with the assistance of GenAI: Anthropic. (2024). Claude 3 Haiku (March 4 version) [Large language model]. https://claude.ai/
'Hallucinations' - what are those?
'Hallucination' is the term used to describe false information, presented as fact, that GenAI produces. Unfortunately, because of the way this technology works, it is not an easily solvable problem. According to this July 2024 study from Cornell, Washington and Waterloo Universities, minimal progress in this area has been made with new version releases of the most popular LLMs, despite the claims of the companies responsible for them. One of the authors, Wenting Zhao, says, "At present, even the best models can generate hallucination-free text only about 35% of the time" (Wiggers, 2024, para. 3). A May 2024 study from Yale and Stanford Universities indicates that AI tools coupling an information retrieval system (RAG - retrieval-augmented generation) with an LLM reduce the incidence of hallucinations but do not eliminate the problem: the researchers studied legal research AI tools and found their rates of hallucination to be between 17% and 33% (Magesh et al., 2024). So, yes, the evidence suggests hallucinations are a major issue, and one not likely to be solved any time soon.
Sources: Wiggers, K. (2024, August 14). Study suggests that even the best AI models hallucinate a bunch. TechCrunch. https://techcrunch.com/2024/08/14/study-suggests-that-even-the-best-ai-models-hallucinate-a-bunch/
Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024, May 30). Hallucination-free? Assessing the reliability of leading AI legal research tools. arXiv. https://arxiv.org/abs/2405.20362
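For readers unfamiliar with the RAG approach mentioned above, the sketch below shows the basic idea in Python. The `retrieve` and `generate` names are hypothetical placeholders, not calls to any real library: relevant documents are fetched first, then the model is instructed to answer only from them.

```python
def retrieve(query, documents, top_k=3):
    """Hypothetical retriever: rank stored documents by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def answer_with_rag(query, documents, generate):
    """Retrieval-augmented generation: ground the prompt in retrieved text.

    `generate` stands in for a call to an LLM. Grounding reduces, but does
    not eliminate, hallucinations, as the studies above report.
    """
    context = "\n".join(retrieve(query, documents))
    prompt = (
        "Answer using ONLY the sources below. If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```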
Some key ethical considerations around equity for generative AI tools in education:
Accessibility: Ensuring that generative AI tools are accessible and usable for students with disabilities or from diverse backgrounds, so that they can equitably benefit from the technology.
Bias and fairness: Mitigating biases in the training data and algorithms of generative AI tools to prevent unfair or discriminatory outputs that could disadvantage certain groups of students.
Transparency and explainability: Making the inner workings and decision-making processes of generative AI tools transparent, so students and educators can understand how the technology is arriving at its outputs.
Equitable access: Ensuring that all students, regardless of socioeconomic status, geographic location, or other factors, have equal opportunities to access and benefit from generative AI tools for learning.
Student privacy and data rights: Protecting the personal data and privacy of students using generative AI tools, and giving them control over how their information is used.
Created with the assistance of GenAI: Anthropic. (2024). Claude 3 Haiku (March 4 version) [Large language model]. https://claude.ai/
Copyright is a significant ethical issue for large language models due to the way they are typically trained on vast amounts of online data, which often includes copyrighted material without permission. Here are a few key reasons why this raises ethical concerns:
Unauthorized use of copyrighted content: By training on copyrighted text, images, code, and other materials without the explicit consent of the copyright holders, language model developers are effectively appropriating that content for their own commercial gain.
Potential for infringement: The outputs generated by language models, such as text, code, or images, may inadvertently reproduce or resemble copyrighted works, leading to potential copyright infringement claims.
Undermining creative industries: If language models are able to generate content that substitutes for or competes with the work of human creators, it could undermine the livelihoods of authors, artists, software developers, and others in creative industries.
Lack of attribution and compensation: When language models are trained on copyrighted works, the original creators often receive no attribution or compensation, which is a violation of their intellectual property rights.
Exacerbating existing inequities: The unauthorized use of copyrighted material from marginalized communities or developing countries could further entrench global inequities in the creative economy.
Created with the assistance of GenAI: Anthropic. (2024). Claude 3 Haiku (March 4 version) [Large language model]. https://claude.ai/
What is the argument for fair use of copyrighted material for training LLMs?
Tech companies argue that the use of copyrighted material for training large language models qualifies as fair use because it is transformative in nature, generating new insights and functionalities rather than simply reproducing the original works. They emphasize that the models do not retain or reproduce significant portions of copyrighted content and assert that their use does not harm the market for the original works, as the models do not act as direct substitutes.
Created with the assistance of GenAI: OpenAI. (2024). ChatGPT 4o mini (July 18 version) [Large language model]. https://openai.com/
But is it fair use?
Copyright holders and intellectual property owners say it isn't. See the copyright legal news section on the right for updates in this space.
Two recent studies support the argument against fair use. The first is out of Yale University's Law School:
Despite wide employment of anthropomorphic terms to describe their behavior, AI machines do not learn or reason as humans do. They do not “know” anything independently of the works on which they are trained, so their output is a function of the copied materials. Large language models, or LLMs, are trained by breaking textual works down into small segments, or “tokens” (typically individual words or parts of words) and converting the tokens into vectors—numerical representations of the tokens and where they appear in relation to other tokens in the text. The training works thus do not disappear, as claimed, but are encoded, token by token, into the model and relied upon to generate output.
Source: Charlesworth, J. (2024, August 13). Generative AI's illusory case for fair use. SSRN. https://ssrn.com/abstract=4924997
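The tokenization and vector encoding the quotation refers to can be illustrated with a few lines of Python. Everything below is invented for the example: real models use subword tokenizers and high-dimensional vectors learned during training, not hand-written values.

```python
# Illustrative only: a real model learns its vectors from the training works.
text = "The quick brown fox"

# 1. Break the text into tokens (here, naively, one word = one token).
tokens = text.lower().split()            # ['the', 'quick', 'brown', 'fox']

# 2. Map each token to an ID in a vocabulary...
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]   # e.g. [3, 2, 0, 1]

# 3. ...and each ID to a vector (an "embedding") that the model adjusts
#    during training so related tokens end up close together.
embeddings = {idx: [0.1 * idx, 0.2 * idx] for idx in vocab.values()}
print([embeddings[i] for i in token_ids])
```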
The second study was commissioned by Germany's Copyright Initiative and was undertaken by a legal expert and a computer scientist (Universities of Hannover and Magdeburg). It found that copyright is breached under European copyright law at four stages of the AI training process.
For example:
during pre-training and fine-tuning, copyright-relevant reproductions of copyrighted works materialize “inside” the AI model. This...constitutes a copy and replication in the legal sense. Furthermore, during the application of generative AI models, particularly by the end users of the fully trained AI systems (e.g., ChatGPT via the OpenAI website), works that have been used for training the AI model may be copied and replicated as part of the systems’ output.
Source: Dornis, T. W., & Stober, S. (2024, August 29). Copyright and training of generative AI models - technological and legal foundations. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214
AI chatbots, like many other digital technologies, have a significant environmental impact due to their energy and water consumption.
The training and operation of these models require substantial computational power, which in turn demands large amounts of electricity. Data centers housing the servers that run AI models often rely on fossil fuels, contributing to greenhouse gas emissions. As the demand for AI services grows, so does the energy required to support them, raising concerns about the sustainability of such technologies in the face of climate change.
In addition to energy consumption, the water usage associated with AI chatbots is another environmental concern. Data centers not only consume electricity but also require large amounts of water for cooling systems to maintain optimal operating temperatures. This water is often sourced from local supplies, which can strain regional water resources, especially in areas facing drought or water scarcity.
Created with the assistance of GenAI: OpenAI. (2024). ChatGPT 4o mini (July 18 version) [Large language model]. https://openai.com/
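To give a sense of scale, here is a back-of-the-envelope calculation in Python. All of the per-query figures are hypothetical placeholders chosen only to show how small per-use costs add up across millions of queries; see The Washington Post piece below for reported estimates.

```python
# Hypothetical, illustrative figures only -- not measurements.
energy_per_query_wh = 3.0        # assumed electricity per chatbot query (watt-hours)
water_per_query_ml = 30.0        # assumed cooling water per query (millilitres)
queries_per_day = 100_000_000    # assumed daily queries across a large service

daily_energy_mwh = energy_per_query_wh * queries_per_day / 1_000_000
daily_water_litres = water_per_query_ml * queries_per_day / 1_000

print(f"{daily_energy_mwh:,.0f} MWh of electricity per day")  # 300 MWh
print(f"{daily_water_litres:,.0f} litres of water per day")   # 3,000,000 litres
```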
This study from the University of California, Riverside, takes a closer look at the scale and implications of the water footprint of AI. It has informed this précis of the issue from The Washington Post:
Source: Verma, P., & Tan, S. (2024, September 18). A bottle of water per email: The hidden environmental costs of using AI chatbots. The Washington Post. https://www.washingtonpost.com/technology/2024/09/18/energy-ai-use-electricity-water-data-centers/
The UNESCO document, based on their 2022 framework, Recommendation on the Ethics of Artificial Intelligence, identifies a range of issues for GenAI in education and research.
It sets out steps for the regulation of GenAI in education, a policy framework based on a human-centred approach, and guidance on how best to facilitate the use of GenAI in education and research.
A comprehensive database of risks from AI systems.
Why AI Can't Be Ethical - Yet: a 7-minute presentation on the ethical implications of no one really understanding how Deep Learning works (including the AI engineers responsible for it).
TEDx Talks. (2024, April 17). Why AI can't be ethical - yet | Eleanor Manley | TEDxDaltVila [Video]. YouTube. https://www.youtube.com/watch?v=9DXm54ZkSiU
AI is Dangerous, But Not for the Reasons You Think: a 10-minute presentation on the environmental impacts of AI, and issues surrounding copyright infringement and bias.
TED. (2023, November 7). AI Is dangerous, but not for the reasons you think | Sasha Luccioni | TED [Video]. YouTube. https://www.youtube.com/watch?v=eXdVDhOGqoE
Computer scientist Paulo Shakarian explains why large language models used in AI tools are likely to continue producing incorrect and strange "hallucinations" as outputs.
Neuro Symbolic. (2024, June 4). Google vs. hallucinations in "AI overviews" [Video]. YouTube. https://www.youtube.com/watch?v=bGsq0kX4apg&t=651s
May 2024: The new EU AI Act will allow copyright holders to reserve the right to opt out of text and data mining, and general-purpose AI companies must provide transparency about where they have sourced their training data.
16 May 2024: Sony issues letters to over 700 AI companies demanding they cease using Sony music to train their models, with the promise of lawsuits for non-compliance.
31 May 2024: The US Department of Justice's antitrust chief warns AI companies that they must fairly compensate artists.
12 July 2024: New US Senate bill seeks to protect artists’ and journalists’ content from AI use.
16 July 2024: A new company, Created by Humans, launches to license creative intellectual property for AI training.
12 August 2024: A high-profile class-action copyright infringement lawsuit progresses to discovery - a big step towards it actually going to trial, and the closest anyone has got in the US. Defendants include giants like Midjourney and Stability AI.
Generative AI Has a Visual Plagiarism Problem: Experiments with Midjourney and DALL-E 3 show a copyright minefield (6 January 2024) - do generative models only produce new, transformative work? Well, no. Here's the proof that they can reproduce copyrighted imagery almost exactly.
Fair use in the US Redux: Reformed or Still Deformed? (March 2024) - an in-depth analysis of the legal precedent that surrounds the idea of scraping copyrighted content to train artificial intelligence models.
Unveiling security, privacy, and ethical concerns of ChatGPT (March 2024) - a study that aims to shed light on the potential risks of integrating ChatGPT into our daily lives.
Bias and Fairness in Large Language Models: A Survey (12 March 2024) - a survey of how intrinsic bias is to LLMs and how it can be measured and mitigated.
AI Chatbots Will Never Stop Hallucinating (5 April 2024) - Some amount of chatbot hallucination is inevitable. But there are ways to minimize it.
Two AI Truths and a Lie (24 May 2024) - Industry will take everything it can in developing Artificial Intelligence (AI) systems. We will get used to it. This will be done for our benefit. Two of these things are true and one of them is a lie.
The Impossibility of Fair LLMs (28 May 2024) - a review of the technical frameworks machine learning researchers have used to evaluate fairness, which finds they have inherent limitations.
New UNESCO report warns that Generative AI threatens Holocaust memory (18 June 2024) - the report warns that unless decisive action is taken to integrate ethical principles, AI could distort the historical record of the Holocaust and fuel antisemitism.
The Backlash Against AI Scraping Is Real and Measurable (23 July 2024) - 2024 is seeing more web content being protected against web scraper bots, which may further compromise the quality of already questionable data being used by large language models.
Senators introduce bill to protect individuals' voices and likenesses from AI-generated replicas (31 July 2024) - known as the NO FAKES Act, it is set to make individuals and companies liable for damages for producing digital replicas of people who have not consented to them.
AI generates covertly racist decisions about people based on their dialect (28 August 2024) - tested on African American English (AAE), the study shows the danger of using AI to make decisions in areas such as healthcare and law enforcement, where it is already being used.
Senator Markey Introduces AI Civil Rights Act to Eliminate AI Bias, Enact Guardrails on Use of Algorithms in Decisions Impacting People’s Rights, Civil Liberties, Livelihoods (24 September 2024) - described as “perhaps the most robust protections for people in the age of AI of any bill introduced in Congress in this session.”