
The Israeli startup tackling AI's unpredictable behavior
Qualifire's real-time monitoring solution ensures GenAI models are safer, more reliable, and ready for complex business tasks.
It’s been almost two and a half years since OpenAI unveiled ChatGPT, changing the world and the way we use artificial intelligence (AI) capabilities. Despite the significant market shifts during this time, and the improvements in the capabilities and performance of models from OpenAI and other companies, some things haven’t changed: the models are still prone to errors, vulnerable to manipulation, and cannot be trusted blindly. When it comes to integrating these models into a business environment, these are significant barriers.
However, Qualifire, a young Israeli startup founded last year, is confident it has the solution to these problems and can remove the barriers preventing the widespread adoption of GenAI-based chatbots in organizational environments for complex and sensitive tasks. It is doing this, in large part, with what it calls “once upon a time” artificial intelligence: classic machine learning.
"Today, companies cannot predict how their model will behave," said Qualifire founder and COO Gilad Ivry, speaking to Calcalist. "This creates a lot of risks for the company. Our system checks the models constantly, and in real time, it can prevent them from giving dangerous answers." In February, Qualifire was one of 16 companies, and the only Israeli one, selected to participate in the Google for Startups Growth Academy: AI for Cybersecurity program, where it will receive assistance from Google in its business development activities.
"The AI model is a parrot"
Ivry founded Qualifire last year with his brother Dror, who serves as CTO. Shortly after, Ariel Dan joined as CEO. "In my previous job, I worked at the startup Vianai, which, even before ChatGPT came out, was building products for the first generation of large language models (LLMs)," said Gilad Ivry. "We quickly realized there was a lot of unexpected behavior from the model. It would make mistakes, invent facts—everything we’re familiar with today. In November 2022, ChatGPT was released, and with it came a surge in demand and hype. Dror and I recognized it as the right moment to start Qualifire. We left everything and started the company, without funding, and even before we had a prototype. We saw the need."
What problems have you identified?
Ivry: "The tendency of language models to respond coherently creates an expectation that the model will be intelligent and knowledgeable. But the truth is that it’s just a parrot. A model knows how to produce text, but it doesn't know how to verify if it is correct. For example, if I want to use a model for customer service, it doesn’t understand what is expected of it, how it should behave, or what it is allowed to do. This creates a challenge for companies. If they cannot predict how the model will behave, it leads to a lot of risks—brand risks, regulatory risks, and even legal risks if the model does something forbidden or breaks the law."
Dan: "For instance, an Air Canada customer interacted with the company's chatbot, told it he was flying to a funeral, and asked for a free ticket. The chatbot responded, 'Contact customer service and you will receive a free ticket.' Of course, this was not the company's policy, and they refused. The customer sued and won. Legally, the chatbot is considered to represent the company. There is no layer that constantly checks the models to prevent them from giving answers that damage the brand or violate regulations."
Does this also include dealing with jailbreaking attacks, where users try to manipulate models into providing dangerous information by engineering prompts?
Ivry: "This is a feature we provide for free, but it’s not the main issue today. The problem is that AI doesn’t work well. The reason that, two years after ChatGPT's release, there is still no AI-based customer service in use today is because the models don't work well. They’re inaccurate, they don’t follow instructions, and they don’t behave as expected. These are the barriers to entry, and that’s what we aim to address."
What are the common glitches? What do you mean when you say AI is unreliable?
Ivry: "When a company implements AI, it has a set of requirements. It must address certain issues, communicate in a specific tone with a particular personality. Then there’s the issue of hallucinations; they want the model to be factually correct. But today, there’s no way for an organization to verify all of these requirements. Typically, companies invest significant time in development, experiments, and offline testing to tune the model and improve its behavior. Even after all that, they have no way of knowing, while the model is in use, how well it is performing. You always need to look retroactively to identify any issues."
"Qualifire comes with a different approach. We move testing to runtime. The moment the model delivers an answer, milliseconds before the user sees it, we analyze it and decide if the response meets the required standards. If we detect a violation, it is blocked, preventing the customer from experiencing it, and thus improving the reliability of the AI product."
How do you detect violations? Do you use AI for that too?
Ivry: "We address each phenomenon with our dedicated models. We’ve created deep learning-based detection models that understand the customer’s requirements, usage scenarios, and data. Each model performs its specific task."
Dan: "These are small detection models. It’s not LLM-based, which we don’t believe in for this purpose, but rather models tailored specifically to the problem. These are small language models (SLM) or classic machine learning algorithms."
Qualifire’s system addresses five common problems of GenAI models. "The first is detecting attempts to manipulate the model, such as jailbreaking," said Ivry. "We do this using a solution that’s applicable to all clients. Compared to other solutions, it ranks highly according to accepted metrics. The second problem is privacy. This includes identifying sensitive information, such as names, addresses, and identification numbers, and ensuring that neither the user nor the model exposes sensitive data. This is also handled with our algorithm-based solution. The third challenge involves content monitoring to ensure that the LLM isn’t producing insulting, racist, sexually harassing, or illegal content."
The last two challenges are more complex and require individual adaptation for each client. "The fourth challenge is grounding: ensuring that every model output is grounded in the data and the organization's knowledge base," Ivry explained. "We make adjustments for each organization to identify and correct model hallucinations. The fifth challenge involves policy and organizational requirements: product characterization, acceptable use scenarios, and the boundaries within which the model can operate. For example, not recommending competitors or providing legal or financial advice. We also have 'do' rules, such as always staying involved in the conversation or ending with a question."
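As a rough illustration of how these five categories could sit side by side, the sketch below chains small, independent detectors, one per category, over a user message and a model answer. Every detector here is a toy stand-in written for this article; Qualifire's actual detection models are proprietary and not described in this code.

```python
# Illustrative sketch only: per-category detectors (jailbreak, privacy,
# content, grounding, policy) run over one exchange. All logic is a toy
# stand-in, not Qualifire's models.
import re
from typing import Callable, List, Tuple

Detector = Callable[[str, str], bool]  # (user_message, model_answer) -> violation?


def detects_jailbreak(msg: str, _ans: str) -> bool:
    return "ignore previous instructions" in msg.lower()


def detects_pii(_msg: str, ans: str) -> bool:
    # Toy rule: flag anything that looks like a nine-digit ID number.
    return re.search(r"\b\d{9}\b", ans) is not None


def detects_toxic_content(_msg: str, ans: str) -> bool:
    return any(word in ans.lower() for word in ("idiot", "stupid"))


def detects_ungrounded_claim(_msg: str, ans: str) -> bool:
    # In practice this would compare the answer to the organization's knowledge base.
    return "guaranteed refund" in ans.lower()


def detects_policy_breach(_msg: str, ans: str) -> bool:
    return "our competitor" in ans.lower()


CHECKS: List[Tuple[str, Detector]] = [
    ("jailbreak", detects_jailbreak),
    ("privacy", detects_pii),
    ("content", detects_toxic_content),
    ("grounding", detects_ungrounded_claim),
    ("policy", detects_policy_breach),
]


def run_checks(user_message: str, model_answer: str) -> List[str]:
    """Return the names of every check the answer violates."""
    return [name for name, check in CHECKS if check(user_message, model_answer)]
```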
What is your advantage over the competition?
Dan: "Most companies use a one-model approach, trying to test the issues we address with another LLM. This creates problems like slowness and high costs, which we solve with our unique product."