Why you should host your own LLM: an LLMOps perspective
I recently read an article advocating for self-hosting LLMs [1]. Drawing on my own experience with an LLM-based project, I'm writing this post to discuss further reasons why you should seriously consider hosting your own LLM from an LLMOps perspective.
In today's landscape, using an LLM usually means relying on APIs provided by companies like OpenAI or Anthropic. The process is straightforward: you send a prompt and receive the generated output.
If you're using an LLM for prototyping or toy projects, these APIs are undoubtedly the optimal choice. They are cheap, quick to deploy, and deliver high-quality results, especially with GPT-4 being the strongest model currently available.
However, if your use of an LLM goes beyond a mere front-end wrapper, and you're building products that leverage LLMs extensively, you should seriously consider self-hosting your own model. Beyond well-discussed reasons such as data ownership, the operational reasons include:
1. You're entirely reliant on OpenAI, and unfortunately, OpenAI's support is almost non-existent.
2. GPT models exhibit substantial non-deterministic behavior.
3. Concerns about political correctness and censorship.
4. Managing version control becomes significantly more challenging.
5. Is having the model at this cost truly necessary?
These considerations highlight the importance of evaluating the trade-offs between convenience and control when deciding to host your own LLM. Your unique project requirements and long-term goals should drive this decision-making process.
1. You're entirely reliant on OpenAI, and heck, OpenAI support is non-existent
It's obvious that you depend completely on OpenAI when using their solutions. However, the OpenAI API isn't truly stable, and even more concerning, there is almost no support available for customers, especially smaller ones. Essentially, you have no reliable way to reach OpenAI's support team. Imagine OpenAI having an outage and your product losing API access on launch day: that would be a disaster.
Using OpenAI's ticket system, I received only a single response after more than two months. This is a significant drawback.
Beyond major incidents like server downtime, API problems are also quite frequent. With such rapid growth, OpenAI's infrastructure hasn't had time to mature.
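One operational mitigation, whether or not you self-host, is to avoid calling a single provider directly and instead route requests through a wrapper that retries transient failures and falls back to a secondary backend (for example, a self-hosted model). The sketch below is illustrative: the backend callables are hypothetical stand-ins for your actual API clients, and the error handling is deliberately generic.

```python
import time
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    backends: Sequence[Callable[[str], str]],
    retries: int = 3,
    backoff_s: float = 1.0,
) -> str:
    """Try each backend in order, retrying transient failures with backoff.

    `backends` might be [call_openai, call_self_hosted] -- both hypothetical
    wrappers around your real clients. In production you would catch the
    provider's specific exception types instead of bare Exception.
    """
    last_error = None
    for backend in backends:
        for attempt in range(retries):
            try:
                return backend(prompt)
            except Exception as exc:
                last_error = exc
                # exponential backoff before retrying this backend
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all LLM backends failed") from last_error
```

The key design point is that the fallback path exists at all: if your only backend is a third-party API with no support channel, every outage is your outage.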
2. GPT models are wildly non-deterministic
This is not a new issue. According to this article [2], the rate of divergent responses (even with temperature set to 0) falls in the range of 10-35%, a significant margin.
This becomes even more concerning when models transition between versions. A prompt that works smoothly in one version but breaks in the next is a challenge I've had to face myself.
Deterministic models are undoubtedly easier to operate. In my own testing, models like Llama 2 7B and 13B are entirely deterministic.
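You can quantify this yourself with a simple harness: call a zero-temperature completion function repeatedly on the same prompt and measure how often outputs agree. The sketch below is model-agnostic; `generate` is any hypothetical wrapper around an API call or a local Llama 2 pipeline.

```python
from collections import Counter
from typing import Callable


def measure_determinism(generate: Callable[[str], str],
                        prompt: str, n: int = 20) -> float:
    """Return the fraction of runs matching the most common output.

    1.0 means fully deterministic. `generate` is a stand-in for your
    actual zero-temperature completion call (hypothetical name).
    """
    outputs = Counter(generate(prompt) for _ in range(n))
    # share of runs that produced the modal output
    return outputs.most_common(1)[0][1] / n
```

Running this against a self-hosted model and a hosted API side by side makes the operational difference concrete: a score below 1.0 means your regression tests and cached outputs can silently drift.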
3. Political correctness and censorship
Recently, Baldur's Gate 3 became one of the most successful launches on Steam. What intrigued me and fellow gamers is the game's edgy details, even including explicit content. These distinctly indie characteristics played a small part in winning players over.
If you aim to create products with a unique signature, or even explore themes intended for mature audiences, it's advisable to explore building your own LLMs for such purposes.
4. Version control will be much more painful
OpenAI continually releases new model versions and eventually drops support for older ones. Given that a model version can only be used for about six months, which is rather limited, your team has to constantly monitor how these models behave.
If you're merely using LLMs for standard question-and-answer tasks, this might not affect you much. However, if you're building a custom agent for specific tasks (like generating queries), it will demand additional resources that might not be worth the investment.
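In practice this monitoring usually takes the form of a "golden prompt" regression suite: pin a set of prompts with the outputs recorded under the current model version, and rerun them whenever the provider announces a new version. A minimal sketch, with all names illustrative:

```python
from typing import Callable, Dict


def run_prompt_regression(
    generate: Callable[[str], str],
    golden: Dict[str, str],
) -> Dict[str, bool]:
    """Compare a model's outputs against pinned 'golden' answers.

    `golden` maps prompt -> expected output recorded under the model
    version you currently depend on. A False entry flags a prompt whose
    behavior changed and needs re-engineering.
    """
    return {
        prompt: generate(prompt) == expected
        for prompt, expected in golden.items()
    }
```

Exact string comparison is the simplest check; for non-deterministic hosted models you would have to relax it (e.g. semantic similarity), which is itself extra operational cost that a deterministic self-hosted model avoids.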
5. Do you really need a model at this cost?
If you intend to host a model that can compete with GPT-3.5, you'll likely need Llama 2 70B, and at that scale there's almost no cost advantage for inference [3].
However, do you truly require such a large model? It heavily depends on the problem you aim to solve. There are indeed smaller models that might suit your needs.
Recently, Stability AI introduced an LLM for code with just 3 billion parameters [4]. Similarly, Replit has its replit-code model, also with 3 billion parameters. These models can be trained and served with significantly lower VRAM requirements. Consequently, the cost of inference drops by orders of magnitude, giving you room for a broader range of monetization plans.
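The VRAM gap is easy to estimate with back-of-the-envelope arithmetic: weights take parameter count times bytes per parameter (2 for fp16, 1 for int8), plus some headroom for activations and KV cache. The 1.2 overhead factor below is an assumption, not a measured figure:

```python
def inference_vram_gb(n_params: float, bytes_per_param: float,
                      overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model: weights plus a fudge factor
    for activations and KV cache (the 1.2 overhead is an assumption)."""
    return n_params * bytes_per_param * overhead / 1e9


# a 3B code model vs. Llama 2 70B, both in fp16
small = inference_vram_gb(3e9, 2)   # ~7.2 GB: fits on one consumer GPU
large = inference_vram_gb(70e9, 2)  # ~168 GB: needs multiple datacenter GPUs
```

This rough factor-of-20 difference in memory footprint is what drives the orders-of-magnitude difference in serving cost: the small model runs on a single cheap GPU while the large one requires a multi-GPU node.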
Final thoughts
I still believe it's best to start prototyping, and even build an MVP, using APIs from OpenAI or Anthropic. Nevertheless, you can also consider self-hosting LLMs to establish a "moat" for your product. This approach can also deepen your understanding of the technology, helping you adapt in this rapidly evolving field.
References
[1] http://marble.onl/posts/why_host_your_own_llm.html
[2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
[3] https://www.cursor.so/blog/llama-inference
[4] https://stability.ai/blog/stablecode-llm-generative-ai-coding