Mistral 7B Tutorial: A Step-by-Step Guide on How to Use Mistral LLM

This Mistral 7B tutorial explains how to use and fine-tune the Mistral 7B model to enhance your natural language processing projects. It covers accessing, quantizing, fine-tuning, merging, and saving this powerful 7.3 billion-parameter open-source language model.

You will learn to load the model in Kaggle, run inference, quantize it, fine-tune it, merge it, and push it to the Hugging Face Hub.

What is Mistral 7B? Understanding the Mistral-7B and Mixtral-8X7B Models

Mistral 7B is a 7.3 billion-parameter language model that represents a major advance in large language model (LLM) capabilities. It outperforms the 13 billion-parameter Llama 2 on all benchmarks and the 34 billion-parameter Llama 1 on many benchmarks.

Mistral 7B approaches the performance of CodeLlama 7B on code tasks while remaining highly capable at English-language tasks. This balanced performance is achieved through two key mechanisms. First, Mistral 7B uses grouped-query attention (GQA), which allows for faster inference than standard full attention. Second, sliding window attention (SWA) lets Mistral 7B handle longer text sequences at low cost.

Both the code and the various versions of the models are released under the Apache 2.0 license, allowing them to be used without restriction. You can learn more about the model architecture, performance, and instruction fine-tuning by reading the Mistral 7B research paper (arxiv.org) and the official Mistral-7B and Mixtral-8X7B documentation.

What is the platform used for the Mistral 7B LLM?

Mistral provides API endpoints for both open-weight and optimized models. The endpoints can be used with the client packages or accessed directly through an API. A detailed description of endpoint performance is available on the endpoints page: https://docs.mistral.ai/platform/overview

How to use the Mistral 7B LLM Client Code:

Mistral 7B client code is available in both Python and JavaScript. For installation, follow the repository for the Python client or the JavaScript client. The chat completion API lets you chat with a model fine-tuned to follow instructions. Mistral allows users to provide a custom system prompt (see the API reference), and a convenient safe_mode flag allows chat completions to be moderated against sensitive content (see Guardrailing below). For more information, see the official Mistral client documentation.
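
Here is a minimal sketch of a chat completion call. It assumes the mistralai Python client of this period; the import paths and the exact flag name (safe_mode vs. safe_prompt) may differ between client versions, so check the API reference before use.

    # Minimal sketch -- assumes the mistralai Python client of this period;
    # import paths and the safe_mode flag name may differ in newer versions.
    from mistralai.client import MistralClient
    from mistralai.models.chat_completion import ChatMessage

    client = MistralClient(api_key="YOUR_API_KEY")

    response = client.chat(
        model="mistral-tiny",
        messages=[ChatMessage(role="user", content="What is the best French cheese?")],
        safe_mode=True,  # moderate the completion against sensitive content
    )
    print(response.choices[0].message.content)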

How do the Mistral 7B LLM Client Code Embeddings work?

The embeddings API allows you to embed sentences, as in the sketch below.
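
A minimal sketch, under the same client-version assumption as the chat example above:

    # Minimal sketch -- same mistralai client assumption as the chat example.
    embeddings_response = client.embeddings(
        model="mistral-embed",
        input=["Embed this sentence.", "As well as this one."],
    )
    # Each result holds a 1024-dimensional vector (see the embedding endpoint below).
    print(len(embeddings_response.data[0].embedding))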

What are Mistral 7B LLM Client Code Endpoints?

Mistral provides different endpoints with different price/performance tradeoffs. The Mistral endpoints depend on internal models. Some of them are open-weight, which allows users to deploy them on their own, on arbitrary infrastructure. See Self-deployment for details.

Mistral Generative Endpoints: All Mistral generative endpoints can reason over contexts of up to 32k tokens and follow fine-grained instructions. Benchmarks for each endpoint are gathered in the official documentation. Mistral generative endpoints are provided through the chat completion API, and users can access the underlying base models for endpoints relying on open-weight models.

What are the types of Mistral Generative Endpoints?

Mistral Generative Endpoints Tiny: This endpoint is best used for large batch processing tasks where cost is a significant factor but reasoning capabilities are not crucial. It is currently powered by Mistral-7B-v0.2, an improved fine-tune of the initial Mistral-7B release, inspired by the fantastic work of the community. API name: mistral-tiny

Mistral Generative Endpoints Small: This endpoint supports English, French, German, Italian, and Spanish and can produce and reason about code. It is currently powered by Mixtral-8X7B-v0.1, a sparse mixture-of-experts model with 12B active parameters. API name: mistral-small

Mistral Generative Endpoints Medium: This endpoint currently relies on an internal prototype model. API name: mistral-medium

Mistral Generative Endpoints Embedding Models: Embedding models enable retrieval and retrieval-augmented generation applications. This endpoint outputs 1024-dimensional vectors and achieves a retrieval score of 55.26 on MTEB. API name: mistral-embed

Mistral 7B LLM Guardrailing: System prompts to enforce guardrails

The ability to enforce guardrails in chat generations is crucial for front-facing applications. Mistral has introduced an optional system prompt to enforce guardrails on top of the models. Developers can activate this prompt through a safe_mode binary flag in API calls, as illustrated below. See the official Mistral 7B LLM guardrailing documentation for details.
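
As an illustration, the flag can be set in a raw API call roughly as follows. The safe_mode field name follows this article; the current API reference may expose it under another name (e.g. safe_prompt), so verify before use.

    # Illustrative request -- the safe_mode field name follows this article
    # and may appear under another name in the current API reference.
    curl https://api.mistral.ai/v1/chat/completions \
      -H "Authorization: Bearer $MISTRAL_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistral-tiny",
        "messages": [{"role": "user", "content": "How do I pick a lock?"}],
        "safe_mode": true
      }'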

Mistral 7B LLM Pricing: Per Million Tokens

Mistral offers pay-as-you-go pricing: you choose the API options that match your requirements and pay only for what you use. The prices listed below are exclusive of VAT.

Mistral 7B LLM Chat API Pricing: Per 1M Tokens

Model             Input               Output
mistral-tiny      0.14€ / 1M tokens   0.42€ / 1M tokens
mistral-small     0.6€ / 1M tokens    1.8€ / 1M tokens
mistral-medium    2.5€ / 1M tokens    7.5€ / 1M tokens

Mistral 7B LLM Embedding API Pricing: Per 1M Tokens

Model             Input
mistral-embed     0.1€ / 1M tokens

Mistral 7B LLM Open-weight models: 8X7B | Downloading | Chat template | Size

Mistral open-sources both pre-trained and fine-tuned models. These models are not tuned for safety, because Mistral wants to empower users to test and refine moderation for their own use cases. For safer models, follow the guardrailing tutorial above.

Mistral 7B: Mistral 7B is the first dense model released by Mistral AI. At the time of its release, it matched the capabilities of models up to 30B parameters. Learn more in the Mistral AI blog post.

Mixtral 8X7B: Mixtral 8X7B is a sparse mixture-of-experts model. As such, it leverages up to 45B parameters but uses only about 12B during inference, leading to better inference throughput at the cost of more VRAM. See the official open-weight models documentation for details.

Mistral Open-weight Models Download Links:

  • Mistral-7B-v0.1: Hugging Face // raw_weights (md5sum: 37dab53973db2d56b2da0a033a15307f).
  • Mistral-7B-Instruct-v0.2: Hugging Face // raw_weights (md5sum: fbae55bc038f12f010b4251326e73d39).
  • Mixtral-8x7B-v0.1: Hugging Face.
  • Mixtral-8x7B-Instruct-v0.1: Hugging Face // raw_weights (md5sum: 8e2d3930145dc43d3084396f49d38a3f).

Mistral Open-weight Models Sizes:

Name                 Number of parameters   Number of active parameters   Min. GPU RAM for inference (GB)
Mistral-7B-v0.2      7.3B                   7.3B                          16
Mixtral-8X7B-v0.1    46.7B                  12.9B                         100

Mistral Open-weight Models Chat Template: 

The template used to build a prompt for the Instruct model is defined as follows:
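
As widely documented for the Instruct models, the format is:

    <s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

Here <s> and </s> are the BOS and EOS special tokens, while [INST] and [/INST] are regular strings; the template should therefore be applied at the token level rather than by naive string concatenation.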

Note: The prompt-building function should never generate the EOS token itself. However, FastChat (used in vLLM) sends the full prompt as a string, which might lead to incorrect tokenization of the EOS token and prompt injection. Users are encouraged to send tokens instead, as described above.

Mistral 7B Self-Deployment: Process and Serving Frameworks

Mistral AI provides ready-to-use Docker images on the GitHub registry. The weights are distributed separately.

To run these images, you need a cloud virtual machine matching the requirements for a given model. These requirements can be found in the model description. Mistral recommends two different serving frameworks for the models.

vLLM: A Python-only serving framework that deploys an API matching OpenAI’s specification. vLLM provides a paged attention kernel to improve serving throughput. See the official Mistral vLLM deployment documentation for details.

NVIDIA’s TensorRT-LLM served with NVIDIA’s Triton Inference Server: TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.

These images can be run locally or on your favourite cloud provider using SkyPilot.

Mistral 7B TensorRT-LLM // Triton: How to Build the Engine

To build the Mistral 7B TensorRT-LLM // Triton engine, follow the official TensorRT-LLM documentation.

For Mistral-7B, you can use the LLaMA example, as sketched below.

For Mixtral-8X7B, official documentation coming soon…
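
A hedged sketch of a build invocation via that LLaMA example; the script path and flags follow the TensorRT-LLM examples of this period and are assumptions that may change between releases:

    # Illustrative only -- script path and flags follow the TensorRT-LLM
    # LLaMA example of this period and may differ across releases.
    python examples/llama/build.py \
        --model_dir ./Mistral-7B-v0.2-hf \
        --dtype float16 \
        --output_dir ./mistral-7b-trt-engine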

Deploying the engine

Once the engine is built, it can be deployed using the Triton Inference Server and its TensorRT-LLM backend. See the official Mistral 7B TensorRT-LLM // Triton documentation for details.
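
As a sketch, once the engine files are placed in a Triton model repository laid out per the TensorRT-LLM backend documentation, the server can be started with the standard Triton CLI:

    # Start Triton pointing at a model repository containing the built engine.
    tritonserver --model-repository=/path/to/model_repository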

Mistral 7B Self-Deployment With Docker or Without Docker:

vLLM can be deployed using the provided Docker image or directly from the Python package. For more about Mistral 7B self-deployment with or without Docker, see the official documentation.

Mistral 7B Self-Deployment With Docker: Process and Code

On a GPU-enabled host, you can run the Mistral AI LLM Inference image with the following command to download the model from Hugging Face:

Mistral 7B Self-Deployment 
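
A sketch of the command, reconstructed from the deployment docs of this period; the image name and flags are assumptions to verify against the current documentation:

    # Sketch -- image name and flags reconstructed from the deployment docs
    # of this period; verify against the current documentation.
    docker run --gpus all \
        -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
        ghcr.io/mistralai/mistral-src/vllm:latest \
        --host 0.0.0.0 \
        --model mistralai/Mistral-7B-Instruct-v0.2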

Mixtral-8X7B Self-Deployment
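
For Mixtral, the same image can be used, adding tensor parallelism so the weights fit across several GPUs (same assumptions as above):

    # Same assumptions as above; --tensor-parallel-size spreads the ~100 GB
    # of Mixtral weights across two GPUs.
    docker run --gpus all \
        -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
        ghcr.io/mistralai/mistral-src/vllm:latest \
        --host 0.0.0.0 \
        --model mistralai/Mixtral-8X7B-Instruct-v0.1 \
        --tensor-parallel-size 2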

Here, HF_TOKEN is an environment variable containing your Hugging Face user access token. This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the API section.

INFO: If your GPU has CUDA capabilities below 8.0, you will see the error "ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0." In that case, pass the parameter --dtype half on the Docker command line. The Dockerfile for this image can be found in the reference implementation on GitHub.

Mistral 7B Self-Deployment Without Docker: Process and Code

Alternatively, you can spawn a vLLM server directly on a GPU-enabled host with CUDA 11.8.

Install vLLM: First, install vLLM (or use conda to add vllm if you are using Anaconda):
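
For example:

    pip install vllm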

Mistral 7B Self-Deployment Without Docker:
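
A sketch of launching vLLM's OpenAI-compatible server; the entrypoint and flags match the vLLM releases of this period but may vary by version:

    # Sketch -- vLLM's OpenAI-compatible entrypoint; flags may vary by version.
    python -u -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --model mistralai/Mistral-7B-Instruct-v0.2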

Mixtral-8X7B Self-Deployment Without Docker:
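
The same sketch for Mixtral, with tensor parallelism across two GPUs:

    # Same sketch for Mixtral; tensor parallelism spreads the model over 2 GPUs.
    python -u -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --model mistralai/Mixtral-8X7B-Instruct-v0.1 \
        --tensor-parallel-size 2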

Mistral 7B Deploy with SkyPilot: Process and Code

SkyPilot is a framework for running LLMs, AI workloads, and batch jobs on any cloud, offering maximum cost savings, the highest GPU availability, and managed execution. An example SkyPilot config that deploys the models is provided below. For more on deploying Mistral 7B with SkyPilot, see the official documentation.

Mistral SkyPilot Configuration Process: 

After installing SkyPilot, create a configuration file that tells SkyPilot how and where to deploy your inference server, using the pre-built Docker container:

Mistral-7B SkyPilot Configuration Process Code: 
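
A hedged sketch of such a config; the accelerator choice and container reference are assumptions to adapt to your own cloud account:

    # mistral-7b.yaml -- sketch only; accelerator choice and image reference
    # are assumptions to adapt to your own cloud setup.
    envs:
      HF_TOKEN: ""   # your Hugging Face user access token

    resources:
      accelerators: A10G:1   # any GPU with >= 16 GB of RAM
      ports:
        - 8000

    run: |
      docker run --gpus all -p 8000:8000 \
        -e HF_TOKEN=$HF_TOKEN \
        ghcr.io/mistralai/mistral-src/vllm:latest \
        --host 0.0.0.0 \
        --model mistralai/Mistral-7B-Instruct-v0.2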

Mixtral-8X7B SkyPilot Configuration Process Code: 
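
For Mixtral, only the accelerator sizing, model name, and parallelism change (same assumptions as above):

    # Same sketch with Mixtral-scale resources (assumptions as above).
    envs:
      HF_TOKEN: ""

    resources:
      accelerators: A100-80GB:2   # Mixtral needs roughly 100 GB of GPU RAM
      ports:
        - 8000

    run: |
      docker run --gpus all -p 8000:8000 \
        -e HF_TOKEN=$HF_TOKEN \
        ghcr.io/mistralai/mistral-src/vllm:latest \
        --host 0.0.0.0 \
        --model mistralai/Mixtral-8X7B-Instruct-v0.1 \
        --tensor-parallel-size 2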

Test it out: To easily retrieve the IP address of the deployed mistral-7b cluster you can use:
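
A sketch, assuming the cluster was launched from the config above under the name mistral-7b:

    # Assumes the cluster was launched as "mistral-7b", e.g.:
    #   sky launch -c mistral-7b mistral-7b.yaml --env HF_TOKEN=<your token>
    IP=$(sky status --ip mistral-7b)

    # Query the OpenAI-compatible endpoint (model name as assumed above):
    curl http://$IP:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "My favourite condiment is",
        "max_tokens": 25
      }'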
