Ollama -
Ollama is how you can run an LLM at home, without the internet. Trust me, it is fun.
- Installing Ollama
- Hardware!
- Pick the model that you want to run. The bigger the model, the more memory you will need. Model size is measured in parameters, and a useful rule of thumb for the quantized models Ollama typically serves (around one byte per parameter at 8-bit) is one gigabyte of RAM per billion parameters: 16B parameters wants about 16GB of RAM, 500M parameters about 500MB. At full 16-bit precision, double that.
- Then you will want an NVIDIA GPU with a "Compute Capability" of 5.0 or greater (Ollama can fall back to the CPU, just much more slowly).
- Why? The embedding space is where each word and syllable is mapped as a point using a large number of dimensions. Imagine each word as a point floating in space; from the origin, each point is the tip of a vector with a length and various angles. An easy way to compute how close each point is to its neighbouring points is cosine similarity, simple multiply-and-add mathematics that an NVIDIA GPU's instruction set can run in parallel across hundreds of cores. Your more complex CPU, with its handful of cores, can do these maths instructions and much, much more, because the CPU has to manage your whole computer.
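To make that concrete, here is a minimal sketch of cosine similarity in Go (the language Ollama itself is written in). The vectors and their values are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns a value in [-1, 1]: 1 means the vectors point
// the same way (similar words), 0 means unrelated, -1 means opposite.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Toy 4-dimensional "embeddings"; the numbers are invented.
	king := []float64{0.9, 0.8, 0.1, 0.3}
	queen := []float64{0.85, 0.75, 0.2, 0.35}
	banana := []float64{0.1, 0.2, 0.9, 0.7}

	fmt.Printf("king vs queen:  %.3f\n", cosineSimilarity(king, queen))
	fmt.Printf("king vs banana: %.3f\n", cosineSimilarity(king, banana))
}
```

Each comparison is just one dot product and two norms; a GPU can run thousands of these at once, which is the whole point.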
- Software!
- Well, www.ollama.com asks you to install a piece of software: the Ollama package, which is a client+server bundle. The client is how the user talks to the LLM. The server manages the LLM: it loads the model, does the inference, and returns the answer.
- Ollama is also written in "Go", a compiled language created by Google, at www.go.dev. So, if you want to build Ollama from source or write your own client in Go, download the Go package, which installs its compiler and tooling. (To just run Ollama, the installer alone is enough.)
- To test that you installed Ollama, open your console and try "ollama --help"
- To test that you installed Go, open your console and try "go version" (note: no dash)
- Downloading a model
- Find the list of models at www.ollama.com/library
- Pick a model based on your needs. There are models for the linguistic arts, models for image recognition, and models for coding and maths.
- In your console, try "ollama run modelname" (see the example below)
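If you just want to fetch the weights without starting a chat, "ollama pull" does the download step on its own. The model name llama3.2 below is just one example from the library:

```
ollama pull llama3.2    # download the model only
ollama run llama3.2     # download if missing, then start chatting
```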
- Running your model
- The server is an HTTP server that starts when your computer starts up and waits quietly in the background... like a trojan (a benign one: by default it only listens locally, on http://localhost:11434).
- When you call "ollama ps" in the command line (e.g., PowerShell), ollama acts as an HTTP client, like a browser, that contacts that server and lists the models currently loaded.
- You can always type "ollama run modelname", because once you have downloaded a model it stays cached on your disk. This starts a chat client in your command-line interface.
- Once you see the ">>>" prompt, you can type "/help" (and Enter) to explore what options you have.
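Because the server is plain HTTP, you can also poke it directly. For example, the /api/tags endpoint returns a JSON list of the models you have downloaded (assuming the default port):

```
curl http://localhost:11434/api/tags
```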
- Your client -
- Because you are using HTTP, and HTTP is very human-friendly, you can create your own client to chat with an Ollama model (see the sketch after this section). The reference Go client is at https://github.com/ollama/ollama/blob/main/api/client.go
- The word "Chat" here is different from "Generate". "Chat" bundles up your past request-response pairs as history of the Chat. Generate doesn't bother to fill your 'messages' with history. You can read the types.h file to see the different structures, for which Chat Response and Chat Request are two different structures. https://github.com/ollama/ollama/blob/main/api/types.go
- If you want to use "tools" in Ollama, then you need to build your own client interface, because the model will respond with a request for a tool, and it is your client that runs the function and replies with the result, as an intermediate step between your request and the model's final response.
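As a concrete starting point, here is a minimal sketch of a chat client in Go that talks to the local server's /api/chat endpoint. The struct fields mirror the documented request/response JSON; "llama3.2" is just a placeholder for any model you have pulled, and error handling is kept minimal.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Message is one turn of the conversation; "Chat" keeps the whole history.
type Message struct {
	Role    string `json:"role"` // "system", "user", "assistant", or "tool"
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"` // false = one complete JSON answer
}

type ChatResponse struct {
	Message Message `json:"message"`
}

func main() {
	history := []Message{
		{Role: "user", Content: "Why is the sky blue? One sentence."},
	}

	body, _ := json.Marshal(ChatRequest{
		Model:    "llama3.2", // placeholder: any model you have pulled
		Messages: history,
		Stream:   false,
	})

	resp, err := http.Post("http://localhost:11434/api/chat",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out ChatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}

	// To continue the chat, append out.Message plus your next user
	// message to history and POST again -- that is all "history" is.
	fmt.Println(out.Message.Content)
}
```

Run the server, pull a model, then "go run" this file.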
- Prompts
- Properly formatting your prompt can help. There are seven categories in this "cognitive prompting" structure, which is probably more help than hint. https://arxiv.org/html/2410.02953v2
- Tools
- Models with tool support are trained with special tokens that mark a tool call: instead of guessing at something it doesn't know, the model can emit a structured request for your client to run a function and report back.
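On the wire, the round trip looks roughly like this. The field names follow Ollama's documented chat API; the get_weather function is a made-up example. You send tool definitions along with your request:

```json
{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "What is the weather in Perth?" }],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}
```

If the model decides to use the tool, the response's message carries a "tool_calls" list instead of plain text; your client runs get_weather, appends a "tool"-role message containing the result, and POSTs the whole history again so the model can write its final answer.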
- Creating a new model
- Templates
- Templates use Go's text/template package, which works a little like Perl or PHP as a text-manipulation tool. The template can place the .Prompt (the user's input) and the .Response (the LLM's output) wherever the model expects them. The template is also where tool definitions and tool calls get formatted into the prompt; the tools themselves (an external program in Python, C++, whatever) are run by your client, not by the template.
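Here is a sketch of what a TEMPLATE stanza can look like in a Modelfile. The .System, .Prompt, and .Response variables are Ollama's; the <|...|> marker tokens vary from model to model, so these particular ones are illustrative only:

```
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
{{ .Response }}"""
```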
- Modelfile
- The Modelfile allows you to create your own model, using an encoded model as the base, and lets you tweak the sampling parameters (temperature, top_k, top_p, and friends). The neighbourhood of a word or token can be imagined like a flower, with the centre of the flower, where the stem is, being the word in question, and all the other candidate words being the tips of the petals. The parameters tell the machine how open the flower can be, from fully outstretched and wonky (high temperature) to tightly curled up and repetitive (low temperature).
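A minimal sketch of a Modelfile, assuming you have already pulled llama3.2 as the base; the parameter values here are just examples:

```
FROM llama3.2

# Higher temperature = a more open flower (wilder word choices).
PARAMETER temperature 0.8
# top_p trims the petals down to the most likely candidates.
PARAMETER top_p 0.9

SYSTEM "You are a terse assistant that answers in one sentence."
```

Build and run it with "ollama create myterse -f Modelfile" and then "ollama run myterse".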
- The source.
- The source, according to ollama.com, is at github.com/ollama/ollama. Within the source, there is:
- the client
- the server
- the llama.cpp inference engine