Salesforce Einstein offers businesses an open platform to integrate a vast array of large language models (LLMs). Whether it’s leveraging Salesforce’s bespoke models or models from leading providers like Amazon and Google, Model Builder ensures businesses can harness the most effective AI tools for their unique needs. But with all these options, how do you find the perfect AI model for your specific business needs?
The secret lies in knowing how to evaluate how well these powerful tools do the job you need done. In the last article, we talked about the big picture – the key factors to consider when picking an AI model. Remember the five: functional fit, non-functional fit, pricing, trust, and company policies.
Today, we’re diving deeper into the first one – functional fit. Think of it like picking the right tool for the job. You wouldn’t use a hammer to tighten a screw, would you? Same goes for AI models! Functional fit measures how well a model can do the job you have in mind.
At AAXIS, we have a step-by-step process to check the “functional fit” of any model, whether it’s off-the-shelf or custom-trained. This helps us see how well the training worked and how much bang we’re getting for the buck (ROI) – important stuff! Now, let’s get down to the details…
Understanding Modality: The model’s superpower.
Not all AI models are created equal – they each have their own specialty! Some are text whizzes, perfect for analyzing customer reviews. Others are image ninjas, ideal for revolutionizing quality control in manufacturing. And some are even multilingual voice masters, transcribing call center conversations like a champ. This “modality,” or the type of data a model works with (text, images, voice, etc.), is a key consideration in picking the AI model for your job.
Model Size: How big is it?
Like tools in a shed, some AI models are complex, while others are simple but effective. Complex models can deliver impressive results, but they need a lot of data to “learn” properly if you are custom training them. This can be expensive and time-consuming.
For off-the-shelf models, simpler models might not be as fancy, but they’re faster, cheaper, and have a smaller environmental footprint. This can be perfect for tasks like categorizing products your vendors submit to the marketplace (think PaLM 2 or GPT-3.5, whose exact sizes have not been officially published but are generally estimated at 100B parameters or more). However, for more delicate tasks like personalizing marketing emails, a more powerful model might be better suited (like GPT-4, whose parameter count OpenAI has not disclosed but which is widely believed to be substantially larger).
Context Window: How much can it remember?
Especially with tasks involving language, AI models need to remember things! This “memory” is called context window size, which refers to how much information it considers at once. Choosing the right context size that fits your task is key.
For example, if you are extracting key product attributes from an unstructured description, the model only needs to remember one product at a time – no need for big context. On the other hand, if the model is searching for the correct order the customer is referring to in an email transaction, several orders and their details need to be considered – this will require a larger memory or “context window size”.
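To make this concrete, here is a minimal sketch of a quick “will it fit?” check. It assumes a rough rule of thumb of about four characters per token for English text; real tokenizers vary by model and language, so treat the numbers as ballpark figures. The function names and the 500-token output reserve are illustrative assumptions, not part of any vendor’s API.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(parts: list[str], context_window: int,
                 reserved_for_output: int = 500) -> bool:
    """True if all prompt parts plus room for the answer fit in the window."""
    needed = sum(estimate_tokens(p) for p in parts) + reserved_for_output
    return needed <= context_window


# One product description at a time: a small prompt is enough.
one_product = ["a" * 400]            # about 100 tokens
# An email plus ten candidate orders: a much bigger prompt.
email_and_orders = ["a" * 4000] * 10  # about 10,000 tokens

small_fits = fits_context(one_product, context_window=1000)
large_fits = fits_context(email_and_orders, context_window=4096)
```

With these placeholder inputs, the single product fits comfortably in a small window, while the order lookup does not fit in a 4,096-token window and would call for a model with a larger context.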
Benchmarks: Standardized Tests for AI models.
Standardized benchmarks act as your guide. These are tests that measure how well AI models perform specific tasks. By comparing benchmark results for different models, you can get a clear picture of which one tackles your business challenge most effectively. For example, if you need a model to answer customer questions based on product descriptions, you might look at benchmarks like SQuADv2, which tests a model’s ability to read and answer questions about a given passage. On the other hand, if you’re looking for a model to classify images, you might consider ImageNet, a benchmark that tests how well a model can identify objects in images. Other benchmarks like SuperGLUE, WinoGrande, and GPQA can also be used. These benchmarks give each model a score, so you can use them as a starting point for your comparison-shopping spree.
Custom Scoring: The extra credit.
Standardized benchmarks are great, but your task and data are unique. Create your own mini-test with real-world examples relevant to your specific needs.
For instance, let’s imagine you want to tag customer service emails with urgency, category, and sentiment. You could grab 100-500 typical emails and have your sales team label them. Then, run each AI model you’re considering against this test set. See which model comes closest to matching your sales team’s labels! This custom score gives you a clearer picture of how well each model performs for your specific task. But you don’t want to rely on this score alone. Combine it with a generic benchmark to ensure your model can also deal with the unexpected.
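The email-tagging test described above can be sketched in a few lines. This is one simple way to score it, assuming each email’s labels are stored as a dictionary and that every field counts equally; the field names and the four sample emails are made up for illustration, and your own test set would have hundreds of rows.

```python
FIELDS = ("urgency", "category", "sentiment")


def field_accuracy(human_labels, model_labels, fields=FIELDS):
    """Per-field share of emails where the model matches the human label."""
    if not human_labels or len(human_labels) != len(model_labels):
        raise ValueError("Need two non-empty label lists of equal length")
    scores = {}
    for field in fields:
        matches = sum(
            1 for human, model in zip(human_labels, model_labels)
            if human[field] == model.get(field)
        )
        scores[field] = matches / len(human_labels)
    return scores


def custom_score(human_labels, model_labels, fields=FIELDS):
    """Unweighted mean of the per-field accuracies (an assumption)."""
    per_field = field_accuracy(human_labels, model_labels, fields)
    return sum(per_field.values()) / len(per_field)


# Hypothetical labels from the team and from one candidate model.
human = [
    {"urgency": "high", "category": "billing", "sentiment": "negative"},
    {"urgency": "low", "category": "shipping", "sentiment": "neutral"},
    {"urgency": "high", "category": "returns", "sentiment": "negative"},
    {"urgency": "low", "category": "billing", "sentiment": "positive"},
]
model = [
    {"urgency": "high", "category": "billing", "sentiment": "negative"},
    {"urgency": "low", "category": "shipping", "sentiment": "positive"},
    {"urgency": "low", "category": "returns", "sentiment": "negative"},
    {"urgency": "low", "category": "billing", "sentiment": "positive"},
]

per_field = field_accuracy(human, model)
overall = custom_score(human, model)
```

Looking at the per-field numbers as well as the overall score is useful: a model can be strong on category but weak on sentiment, and that may matter more or less depending on your task.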
Functional Fit Score: The final grade.
By considering these factors quantitatively – modality, model size, context size, benchmarks, and your own custom tests – you can compute a combined weighted score for each candidate. You can then make informed decisions about which AI models are likely to offer the best functional fit for your tasks and ensure your AI investment delivers real results. Remember that for the final selection of the model to use, you must also evaluate the non-functional fit, pricing, trust, and company policies.
So, there you have it! Choosing the right AI model isn’t about finding the flashiest tool; it’s about finding the perfect fit for your business challenge. Remember, AI is a powerful ally in today’s business world, and with the right evaluation strategy, it can propel your company towards innovation and success!
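As a rough illustration, a combined score can be a simple weighted average. The sketch below assumes each factor has already been scored between 0 and 1; the weights, model names, and scores are placeholders chosen for the example, not recommended values or real measurements.

```python
def functional_fit_score(factor_scores: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Weighted average of factor scores, each expected in the 0-1 range."""
    missing = set(weights) - set(factor_scores)
    if missing:
        raise ValueError(f"Missing factor scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    weighted = sum(weights[name] * factor_scores[name] for name in weights)
    return weighted / total_weight


# Hypothetical weights: here the custom test counts the most.
weights = {
    "modality": 0.2,
    "model_size": 0.1,
    "context_window": 0.1,
    "benchmarks": 0.2,
    "custom_test": 0.4,
}

# Hypothetical candidates with made-up factor scores.
candidates = {
    "model_a": {"modality": 1.0, "model_size": 0.5, "context_window": 1.0,
                "benchmarks": 0.8, "custom_test": 0.9},
    "model_b": {"modality": 1.0, "model_size": 1.0, "context_window": 0.5,
                "benchmarks": 0.6, "custom_test": 0.7},
}

scores = {name: functional_fit_score(factors, weights)
          for name, factors in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The weights are where your business priorities show up: a team that cares most about its own data might weight the custom test heavily, as here, while another might lean on public benchmarks.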