Greetings,
Welcome to the 11th edition of the State of AI. In this issue, we take a deep dive into the intersection of AI and the financial industry with FinGPT, examine the widespread use of large language models by crowd workers on text production tasks, and discuss breakthroughs in universal speech generation with Voicebox.
Further, we explore the progress in creating generalist agents for the web and probe into the capabilities of large language models in discerning causation from mere correlation. These topics offer a captivating exploration of the growing versatility and increasing complexities of AI applications.
Best regards,
Contents
- FinGPT: Open-Source Financial Large Language Models 
- Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks 
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale 
- Mind2Web: Towards a Generalist Agent for the Web 
- Can Large Language Models Infer Causation from Correlation? 
FinGPT: Open-Source Financial Large Language Models 
Authors: Hongyang Yang, Xiao-Yang Liu, Christina Dan Wang
Source & References: https://arxiv.org/abs/2306.06031v1
Introduction
In recent years, large language models (LLMs) have revolutionized natural language processing and proven to be promising tools across various domains. Finance is one area where LLMs have garnered significant interest, particularly with the rise of proprietary models like BloombergGPT. However, this paper highlights the need for open-source financial LLMs (FinLLMs) and introduces FinGPT, an open-source large language model for the finance sector.
The Need for Open-Source FinLLMs
Proprietary models, such as BloombergGPT, have leveraged their exclusive access to specialized data to train finance-specific LLMs that excel at financial tasks. However, they are neither accessible nor transparent, which creates the need for an open-source alternative that democratizes financial data. Open-source FinLLMs address issues of data accessibility and quality and foster collaborative innovation within the financial domain. By leveraging the open-source AI4Finance community, FinGPT aims to unlock new opportunities in open finance.
FinGPT's Data-Centric Approach
Data plays a crucial role in FinLLM development. FinGPT adopts a data-centric approach by prioritizing the collection, preparation, and processing of high-quality financial data. The authors outline the unique characteristics of financial data sources such as financial news, filings, social media discussions, and trends, and they address the challenges of handling and preprocessing this data: high temporal sensitivity, high dynamism, and a low signal-to-noise ratio. By integrating and managing these diverse data types, FinGPT aims to provide a comprehensive understanding of financial markets and to support effective financial decision-making.
Framework Overview
FinGPT encompasses four fundamental components: Data Source, Data Engineering, LLMs, and Applications. Each component addresses specific challenges associated with financial data and market conditions:
- Data Source layer: Acquires financial data from a variety of online sources, including financial news websites, social media platforms, filings, trends, and academic datasets. 
- Data Engineering layer: Focuses on real-time NLP data processing to address high temporal sensitivity and low signal-to-noise ratio. This layer includes data cleaning, tokenization, stop word removal, stemming/lemmatization, feature extraction, sentiment analysis, and prompt engineering. 
- LLMs layer: Implements various fine-tuning methodologies, prioritizing lightweight adaptation to keep the model updated and relevant. 
- Application layer: Showcases practical applications of FinGPT, such as robo-advising, algorithmic trading, and low-code development. 
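To make "lightweight adaptation" in the LLMs layer more concrete, here is a minimal sketch of why such methods are cheap. This summary does not name a specific technique, so the sketch assumes low-rank adaptation (LoRA-style updates) purely as a representative example. The layer size and rank are hypothetical values chosen for illustration.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, not taken from the paper.
d_out, d_in, rank = 4096, 4096, 8

full = full_finetune_params(d_out, d_in)
low = low_rank_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"low-rank update:  {low:,} trainable parameters")
print(f"fraction trained: {low / full:.4%}")
```

Because only a small fraction of the weights is trained per layer, this style of adaptation can be rerun frequently as new financial data arrives, which is the property the LLMs layer prioritizes.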
Real-Time Data Engineering Pipeline for Financial NLP
Financial NLP requires real-time data processing. The paper outlines steps for setting up a real-time data ingestion system and performing data cleaning, tokenization, stop word removal, stemming/lemmatization, feature extraction, sentiment analysis, prompt engineering, and decision making/alerts. By doing so, FinGPT can better understand and adapt to individual preferences, ultimately paving the way for more personalized financial assistants.
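The steps listed above can be sketched as a small pipeline. This is an illustrative toy and not the FinGPT implementation: the word lists, the crude suffix-stripping stemmer, the alert threshold, and all function names are assumptions, and an in-memory list stands in for a real-time ingestion system.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

# Tiny hypothetical word lists, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and", "to", "is", "after", "for"}
POSITIVE_WORDS = {"beat", "surge", "gain", "upgrade", "strong"}
NEGATIVE_WORDS = {"miss", "plunge", "loss", "downgrade", "weak"}


def clean(text: str) -> str:
    """Data cleaning: strip HTML tags and URLs, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return text.lower()


def tokenize(text: str) -> List[str]:
    """Tokenization: keep runs of lowercase letters."""
    return re.findall(r"[a-z]+", text)


def stem(token: str) -> str:
    """Crude suffix stripping; a real system would use a proper stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            if suffix == "s" and token.endswith("ss"):
                continue  # keep words such as "miss" and "loss" intact
            return token[: -len(suffix)]
    return token


# Stem the lexicons with the same function so they match stemmed tokens.
POSITIVE_STEMS = {stem(w) for w in POSITIVE_WORDS}
NEGATIVE_STEMS = {stem(w) for w in NEGATIVE_WORDS}


@dataclass
class Signal:
    stems: List[str]
    score: float          # lexicon sentiment in [-1, 1]
    prompt: str           # text that could be passed on to an LLM
    alert: Optional[str]  # "positive", "negative", or None


def process(raw: str, threshold: float = 0.5) -> Signal:
    """Run one raw text item through every step of the sketch."""
    tokens = [t for t in tokenize(clean(raw)) if t not in STOP_WORDS]
    stems = [stem(t) for t in tokens]
    # Feature extraction: counts of positive and negative stems.
    pos = sum(s in POSITIVE_STEMS for s in stems)
    neg = sum(s in NEGATIVE_STEMS for s in stems)
    # Sentiment analysis: normalized difference of the two counts.
    score = (pos - neg) / max(1, pos + neg)
    # Prompt engineering: wrap the cleaned text and the score in a template.
    prompt = (
        f"News: {' '.join(tokens)}\n"
        f"Lexicon sentiment: {score:+.2f}\n"
        "Question: is this news positive, negative, or neutral for the company?"
    )
    # Decision making/alerts: flag only clearly one-sided items.
    alert = None
    if score >= threshold:
        alert = "positive"
    elif score <= -threshold:
        alert = "negative"
    return Signal(stems, score, prompt, alert)


def run_stream(items: Iterable[str]) -> Iterator[Signal]:
    """Process items one at a time, as a stand-in for a live feed."""
    for raw in items:
        yield process(raw)


feed = [
    "<p>Acme shares surged after earnings beat estimates https://example.com/news</p>",
    "Acme misses targets, analysts downgrade stock",
    "Acme schedules annual meeting",
]
for signal in run_stream(feed):
    print(signal.score, signal.alert)
```

Processing one item at a time through a generator keeps the sketch close to how a streaming system consumes a feed, while every stage remains a plain function that can be swapped for a stronger component.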
Potential Applications and Collaborations
FinGPT, through its open-source framework, seeks to stimulate innovation and collaboration within the finance domain by providing accessible resources for developing and fine-tuning FinLLMs. The FinGPT project aims to support a wide range of use cases, including robo-advisory services, quantitative trading, and low-code development. The ultimate goal is to democratize FinLLMs and uncover untapped potential in open finance.
A Catalyst for Change
The vision for FinGPT is to serve as a catalyst for change within the financial landscape, driving research, innovation, and collaboration. By nurturing a robust collaboration ecosystem within the AI4Finance community, FinGPT has the potential to reshape our understanding and application of FinLLMs and help unlock new possibilities in the world of open finance. The authors encourage further contributions and collaboration to help move the FinGPT project forward and make it an essential tool for financial analysis, decision-making, and overall growth within the industry.
Conclusion
FinGPT's open-source framework provides a data-centric approach and a full-stack implementation for FinLLMs. Through collaborative efforts within the AI4Finance community, FinGPT aims to democratize financial data and FinLLMs, ultimately leading to increased innovation and development of open finance applications. With its real-time data engineering pipeline and dedication to addressing the challenges associated with financial data, FinGPT is a promising tool with the potential to unlock new opportunities in the financial industry.
Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks
Authors: Veniamin Veselovsky, Manoel Horta Ribeiro, Robert West
Source & References: https://arxiv.org/abs/2306.07899v1
Introduction
In a world where large language models (LLMs) like GPT-3 and ChatGPT can generate high-quality text, researchers are starting to question the authenticity of supposedly human-generated data. This paper investigates the use of LLMs by crowd workers on Amazon Mechanical Turk (MTurk), a platform often used to gather human annotations and survey responses. If crowd workers are indeed using LLMs, it becomes essential to find ways to ensure that human data remains human, because LLM-generated data can differ significantly from data produced by people.

