Meta recently unveiled its latest multimodal language model, Spirit LM, an open-source model that freely mixes text and speech. According to the official release, it is based on a 7B pre-trained text language model that is extended to the speech modality by continually training it on text and speech units.
This article will look into the Spirit LM model, its features, capabilities, how to access it, and how it compares to other models. Let’s begin.
What is the Spirit LM model?
Spirit LM is a foundation multimodal language model developed by Meta. It is designed to work with both text and speech, allowing these two modalities to integrate with ease. The model builds on a 7B pre-trained text language model, which is further trained with both text and speech units.
The result is a model that not only understands and generates text but can also handle spoken language in a highly natural and expressive manner.
Features
These are some of the most prominent features of Spirit LM:
- Multimodal Integration
Spirit LM goes beyond just processing text. Because it is trained on both speech and text data, it can work across both forms of communication, which makes it more versatile than traditional language models that focus on text alone.
- Word-Level Interleaving Training Method
Spirit LM uses a word-level interleaving method during training: speech and text are merged into a single stream of tokens, so the model learns to switch seamlessly between the two modalities. The approach relies on a small, automatically curated speech-text parallel corpus to align the two modalities during training (see the illustrative sketch after this list).
- Two Versions: Base and Expressive
Spirit LM comes in two different versions:
- Spirit LM Base: This version uses phonetic units (HuBERT), which help the model represent speech accurately.
- Spirit LM Expressive: In addition to phonetic units, this version includes pitch and style units to model the expressive qualities of speech, such as tone, pitch, and emotions. This version allows the model to not only generate speech but also convey emotions and feelings more effectively.
- BPE Tokenization for Text
Both versions of Spirit LM use subword Byte Pair Encoding (BPE) tokens to encode the text. This tokenization helps the model capture language at a finer granularity and improves its performance across various tasks.
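To make the word-level interleaving idea concrete, here is a purely illustrative Python sketch. The words, unit IDs, and marker names are invented for illustration; they are not Meta's actual tokenizer output or training data. It only shows the general pattern of merging text tokens and speech units from the same utterance into one sequence, with a switch at a word boundary:

# Purely illustrative: hypothetical words and speech-unit IDs for one utterance.
text_words = ["the", "cat", "sat", "on", "the", "mat"]
speech_units = {"on": [71, 12, 12], "the": [9, 130], "mat": [88, 88, 3]}

# Keep the first half as text, then switch to speech units at a word boundary.
interleaved = ["[TEXT]"] + text_words[:3] + ["[SPEECH]"]
for word in text_words[3:]:
    interleaved += [f"[Hu{u}]" for u in speech_units[word]]

print(interleaved)
# ['[TEXT]', 'the', 'cat', 'sat', '[SPEECH]', '[Hu71]', '[Hu12]', '[Hu12]', '[Hu9]', ...]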
You can read the research paper here to get more insight into Meta's Spirit LM.
How to Access Spirit LM?
Meta’s Spirit LM is an open-source model, meaning it is freely available for use by the research community and developers.
To access Spirit LM, follow these steps:
- Get it from HuggingFace or GitHub: Developers can access the Spirit LM model from both HuggingFace and GitHub.
- Download the Model: Spirit LM is available for download in both its Base and Expressive versions. Developers can choose the version that best fits their needs.
- Setup and Usage: Once downloaded, developers can set up the model in their environments and begin experimenting with multimodal tasks, such as generating speech with expressiveness or transcribing speech to text.
In case you need further help, here is a step-by-step guide on how to use Spirit LM on Windows and Linux:
1. Set Up Your System
- System Requirements: First, make sure your computer meets the minimum hardware requirements. To run the model with 200 tokens of output, you will need at least 15.5GB of VRAM. For 800 tokens, you will need 19GB.
- Installing Necessary Software: You will need to install some Python libraries. Run the following commands to get everything you need:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install gradio transformers numpy
Note that tempfile is part of Python's standard library, so it does not need to be installed with pip.
- You will also need the spiritlm package itself, which provides the model code; the model is listed on Hugging Face, and the code is published on GitHub. A typical source installation is sketched below.
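If you install the package from source, the steps are usually along these lines (this assumes Meta's facebookresearch/spiritlm repository; check its README for the exact, up-to-date instructions):

git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e .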
2. Cloning the Repository
Next, you need to get the Spirit LM demo files from GitHub. You can do this by cloning the repository with the following command:
git clone https://github.com/remghoost/meta-spirit-frontend
3. Preparing the Model Files
- Download the required model files, called “checkpoints,” and place them into a folder called checkpoints/ within the repository you just cloned; an illustrative layout is shown below. These files are necessary for the model to work properly.
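As a rough illustration only (the exact subfolder names depend on the checkpoints you download, so verify them against the repository's documentation), the folder often ends up looking something like this:

checkpoints/
    speech_tokenizer/          (HuBERT and, for the Expressive version, pitch/style tokenizers)
    spiritlm_model/
        spirit-lm-base-7b/
        spirit-lm-expressive-7b/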
4. Setting Up the Gradio Interface
- The demo uses Gradio to make it easy to test the model. You will need to run a Python script that connects the model to Gradio. This will allow you to input text or audio and test the model.
Here’s an example of how to set up the Gradio interface in Python:
import gradio as gr
from spiritlm.model.spiritlm_model import Spiritlm, OutputModality, GenerationInput, ContentType
from transformers import GenerationConfig
# torch/torchaudio, tempfile/os, and numpy are used for loading, buffering, and converting audio.
import torchaudio
import torch
import tempfile
import os
import numpy as np
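Below is a minimal sketch of how these imports can be wired into a Gradio interface. It is not the official demo script: it assumes the model-loading and generation API shown in the spiritlm repository README (the Spiritlm("spirit-lm-base-7b") constructor, OutputModality, GenerationInput, and ContentType), and the way the outputs are turned into a string is illustrative, so adapt it to the objects your installed version returns:

# Load the base model once; use "spirit-lm-expressive-7b" for the Expressive version.
spirit_lm = Spiritlm("spirit-lm-base-7b")

def generate_text(prompt):
    # Ask for text output from a single text prompt.
    outputs = spirit_lm.generate(
        output_modality=OutputModality.TEXT,
        interleaved_inputs=[
            GenerationInput(content=prompt, content_type=ContentType.TEXT)
        ],
        generation_config=GenerationConfig(
            temperature=0.9, top_p=0.95, max_new_tokens=200, do_sample=True
        ),
    )
    # Illustrative only: join whatever content the generation objects carry.
    return " ".join(str(out.content) for out in outputs)

# A simple text-in, text-out interface; audio input/output would need extra handling
# (for example with torchaudio and tempfile, as hinted by the imports above).
iface = gr.Interface(fn=generate_text, inputs="text", outputs="text")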
5. Running the Model
- After everything is set up, you can start the demo by running the script. The Gradio interface will let you input text or audio and get output in return.
To launch the interface, add this line at the end of the script and run it:
iface.launch()
6. Different Versions of the Model
- Spirit LM comes in two versions:
- Spirit LM Base: This version uses phonetic units extracted with a HuBERT model.
- Spirit LM Expressive: This version adds features that allow the model to capture emotions and other expressions in speech, like pitch and style.
7. Audio-to-Audio Inference Issues
- Currently, the audio-to-audio function might not work perfectly. There may be issues with how the model processes the audio.
- The model works best for text-to-speech (TTS) and speech-to-text (ASR) tasks; a sketch of how these two modes can be requested is shown after this list.
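As a rough, non-authoritative sketch, the two modes can be requested as follows. It reuses the spirit_lm object and imports from the Gradio example above and again assumes the generation API shown in the spiritlm repository README; the audio file path is a placeholder:

# Text-to-speech: text prompt in, generated speech out.
tts_out = spirit_lm.generate(
    output_modality=OutputModality.SPEECH,
    interleaved_inputs=[
        GenerationInput(content="Hello from Spirit LM.", content_type=ContentType.TEXT)
    ],
    generation_config=GenerationConfig(max_new_tokens=200, do_sample=True),
)

# Speech-to-text: an audio file in, generated text out.
asr_out = spirit_lm.generate(
    output_modality=OutputModality.TEXT,
    interleaved_inputs=[
        GenerationInput(content="sample.wav", content_type=ContentType.SPEECH)  # placeholder path
    ],
    generation_config=GenerationConfig(max_new_tokens=200, do_sample=True),
)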
Comparison to Other Models
Meta’s Spirit LM is not the only multimodal model on the market. Google, OpenAI, and other companies have also been working on similar models with capabilities that blend speech and text. Let’s take a look:
- Google’s NotebookLM: This tool can convert text into podcasts, using AI-generated voices to discuss articles or documents. It is powered by Google’s Gemini 1.5 model and supports lifelike audio outputs.
- OpenAI’s ChatGPT with Advanced Voice Mode: OpenAI’s model now offers advanced voice features that allow for dramatic reenactments and interactive voice engagements.
- Hume AI’s EVI 2: This model focuses on voice-to-voice interactions and can adapt to various personalities and accents.
- Amazon Alexa: In collaboration with Anthropic, Alexa is improving its conversational abilities to sound more natural and human-like.
The Bottom Line
Meta’s Spirit LM is a useful multimodal language tool that offers advanced capabilities for handling both speech and text. With its ability to express emotions, integrate text and speech, and perform a variety of tasks, the model is opening up new possibilities in human-computer interaction, speech synthesis, and natural language processing.