Installing LM Studio

In this lesson, you will:

1. Use LM Studio to install an LLM locally

2. Learn about different kinds of LLMs and their uses

3. (Optional) Learn important terms related to the operation of LLMs

Your computer needs to meet the following requirements to run LM Studio smoothly:

  • Apple Silicon Mac (M1/M2/M3) with macOS 13.6 or newer
  • Windows / Linux PC with a processor that supports AVX2 (typically newer PCs)
  • 16GB+ of RAM is recommended; for PCs, 6GB+ of VRAM is recommended
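As a rough back-of-the-envelope check of whether a model will fit in memory, you can multiply the parameter count by the bytes stored per parameter. The figures below are illustrative assumptions, not exact requirements, and ignore runtime overhead:

```python
def model_memory_gb(num_params_billion, bytes_per_param):
    # Rough estimate: parameters * bytes each. Real usage is somewhat
    # higher because of activations and other runtime overhead.
    return num_params_billion * 1e9 * bytes_per_param / 1e9

# An 8-billion-parameter model (e.g. Llama 3 8B) at different precisions:
print(model_memory_gb(8, 2))    # 16-bit weights: ~16 GB
print(model_memory_gb(8, 0.5))  # 4-bit quantized weights: ~4 GB
```

This is why quantized versions of large models (covered in the glossary below) are the usual choice for laptops with 16GB of RAM.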
  • Download the installer suitable for your computer from LM Studio, then run the set-up application.
  • LM Studio may ask you to make changes to the computer. Allow any permissions that LM Studio needs to ensure a smooth experience.
  • Once installation finishes, you should see the following interface.
  • There are a ton of AI models available for you to choose from and experiment with. You may pick the ones most popular with the community, or search for a specific model. We are going to install the Llama 3 8B Instruct model.
Step 1: After installing LM Studio, install the Llama 3 8B Instruct model
  • After the model is downloaded, click on the AI chat button on the top left.
  • Select a model to load at the top.
  • On the right is the system prompt. You can instruct the AI to behave in certain ways or give the AI certain roles.

    The prompting strategy is the same as Poe, except that you can directly change the model's behaviour in the system prompt. For example, the system prompt can be “you are a cheerful cat” or “you are a physics teacher, explain and clarify physics concepts to me. Explain as if I am 12 years old”. You may also input your prompt directly at the bottom.
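The same system-prompt idea carries over to code. LM Studio can also expose a local server with an OpenAI-compatible API; the endpoint URL and model name below are assumptions that depend on your setup, and this sketch only builds the request body a client would send (actually sending it requires the server to be running with a model loaded):

```python
import json

# Assumed defaults: LM Studio's local server commonly listens on
# http://localhost:1234/v1; the model name depends on what you loaded.
url = "http://localhost:1234/v1/chat/completions"  # assumed endpoint
payload = {
    "model": "llama-3-8b-instruct",  # placeholder model identifier
    "messages": [
        # The system message plays the same role as the system prompt box.
        {"role": "system",
         "content": "You are a physics teacher. Explain as if I am 12 years old."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
}
# Show the JSON body that would be POSTed to the server.
print(json.dumps(payload, indent=2))
```

Sending it with a library such as `urllib` or `requests` would return the model's reply in the same JSON format used by hosted chat APIs.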
Step 2: Build your own chat model by starting a new chat (top left), selecting a model (top middle) and customizing the system prompt (right/bottom)

There are different LLM models you can experiment with; the following table explains the differences.

Llama: Developed and trained by Meta, the Llama family is aimed at research and commercial use in English. These models can handle multi-step tasks and code generation.
Code Llama: Part of the Llama family, but designed specifically to support software engineers. It can handle complex coding tasks more accurately than general-purpose models.
Phi: Developed and trained by Microsoft, the Phi family consists of cost-effective small language models that can match the performance of much larger models on language, Maths, reasoning and coding tasks. They are designed to be integrated into small devices while remaining quick and accurate.
Qwen (通义千问): Developed and trained by Alibaba Cloud, the Qwen family includes Mixture-of-Experts models and is aimed at research and commercial purposes. It can handle output in different languages such as Chinese and French, as well as Maths and reasoning tasks.

(Optional) Glossary

  • Tokens: can be thought of as pieces of words. Our prompt is divided into tokens before being passed to the AI.
  • Context window: the number of tokens the model can take as input when generating responses. That is why you may see different versions of an AI, such as GPT-4 and GPT-4 128k. The larger the context window, the more tokens a model can handle.
  • 7B, 13B, 70B: You may see AI models with different suffixes, such as Llama 7B, Llama 13B and Llama 70B. This is the number of parameters in the model, in billions (7B = 7 billion). A model with more parameters might be more accurate, but it also uses more resources such as RAM, storage and time. Therefore, when running LLMs locally, you can choose a model size that suits your computer.
  • System message: a message written by the developer to tell the bot how to interpret the conversation. It gives instructions that take precedence over the rest of the conversation. For example, with the system message “you are a cat, respond in the form of Meow”, the bot will only respond using “Meow”.
  • Parameters: the variables that the model learns during training. They are the internal values the model uses to make predictions or decisions, and they determine the output of the model for a given input. In general, models with more parameters can be more accurate, but at greater cost.
  • Temperature: a parameter that determines whether the output is more random and creative or more predictable. A higher temperature produces more creative outputs, while a temperature of 0 makes the model generate the same output every time.
  • Quantization: a technique that compresses LLMs. Larger LLMs take more storage but tend to be more accurate. Quantizing an LLM decreases its size with only a slight reduction in accuracy, while also increasing speed and reducing energy consumption.
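To make the temperature definition above concrete, here is a small sketch of how temperature rescales a model's raw scores (logits) before they are turned into probabilities. The logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before normalizing:
    # a low temperature sharpens the distribution (more predictable),
    # a high temperature flattens it (more random/creative).
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.2]                  # made-up scores for three tokens
print(softmax_with_temperature(logits, 1.0))  # moderate spread
print(softmax_with_temperature(logits, 0.1))  # almost all mass on one token
```

As the temperature approaches 0, nearly all of the probability lands on the single highest-scoring token, which is why a temperature of 0 behaves deterministically.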

Further Reading:

Running LLM’s Locally Using LM Studio | by Gene Bernardin | Medium

Best Open Source LLMs of 2024

Open LLM Leaderboard: a Hugging Face Space by open-llm-leaderboard