“We discovered that they were not just service providers, but partners invested in our goals and outcomes.” – Nicholas Kuzak, Senior ML Engineer at Rocket Money
We created Rocket Money (a personal finance app formerly known as Truebill) to help users improve their financial health. Users link their bank accounts to the app, which then cleans and classifies their transactions, identifies recurring patterns, and provides an integrated, comprehensive view of their personal financial life. A key stage in transaction processing is detecting known merchants and services, some of whose costs Rocket Money can cancel or negotiate on members’ behalf. This detection starts with converting a short, often truncated transaction string into a class we can use to enrich the product experience.
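For a concrete sense of the task, here is a minimal sketch of that string-to-class mapping using the standard transformers pipeline API. The checkpoint name and labels are hypothetical, since the production model is not public.

```python
from transformers import pipeline

# Hypothetical checkpoint name; Rocket Money's production classifier is private.
classifier = pipeline("text-classification", model="rocket-money/merchant-bert")

# Bank transaction strings arrive short, truncated, and noisy.
print(classifier("NFLX*SUBSCR 866-579-7172"))
# e.g. [{'label': 'netflix', 'score': 0.98}]
```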
The journey to a new system
Initially, we extracted brands and products from transactions using regular-expression-based normalizers. These were used in tandem with increasingly complex decision tables that mapped strings to their corresponding brands. The system proved effective for the company’s first four years, when classes were tied only to the products we supported for cancellation and negotiation. However, as the user base grew, the subscription economy boomed, and the scope of the product increased, we had to keep up with a growing rate of new classes while simultaneously retuning regexes and preventing collisions and duplications. To address this, we explored a variety of traditional machine learning (ML) solutions, including a bag-of-words approach with a model-per-class architecture. This system proved hard to maintain and underperformed.
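A toy version of that earlier pipeline might look like the following; the rules and table entries are invented for illustration, but they show why the approach becomes hard to maintain: every new merchant means new patterns, and patterns can collide.

```python
import re

# Invented rules and entries for illustration; the real system accumulated
# many more of both, all maintained by hand.
NORMALIZERS = [
    (re.compile(r"\s+\d{3}-\d{3}-\d{4}$"), ""),  # strip trailing phone numbers
    (re.compile(r"[*#]\w+$"), ""),               # strip reference/suffix codes
]

DECISION_TABLE = {
    "NFLX": "netflix",
    "SPOTIFY USA": "spotify",
}

def classify(raw: str) -> str | None:
    text = raw.upper().strip()
    for pattern, replacement in NORMALIZERS:
        text = pattern.sub(replacement, text)
    # Unmatched strings fall through with no classification.
    return DECISION_TABLE.get(text.strip())

print(classify("nflx*subscr 866-579-7172"))  # -> "netflix"
```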
We decided to start from a clean slate, assembling both a new team and a new mandate. Our first task was to accumulate training data and build an in-house system from scratch. We used Retool to build labeling queues, gold-standard validation datasets, and drift-detection monitoring tools. We explored a number of model topologies, but ultimately chose the BERT family of models to solve our text classification problem. The bulk of the initial model testing and evaluation was conducted offline within our GCP warehouse, where we designed and built the telemetry and systems used to measure model performance across more than 4,000 classes.
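A minimal sketch of that training setup, assuming the standard transformers Trainer and a toy dataset standing in for the labeled data produced by the Retool queues (the real label space exceeded 4,000 classes):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy label space for illustration; production grew past 4,000 classes.
labels = ["netflix", "spotify", "unknown"]
train_ds = Dataset.from_dict({
    "text": ["NFLX SUBSCR 866-579-7172", "SPOTIFY USA", "SQ *CORNER CAFE"],
    "label": [0, 1, 2],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

def tokenize(batch):
    # Transaction strings are short, so a small max_length keeps inference cheap.
    return tokenizer(batch["text"], truncation=True, max_length=32)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="merchant-bert", num_train_epochs=1),
    train_dataset=train_ds.map(tokenize, batched=True),
    tokenizer=tokenizer,
)
trainer.train()
```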
Solving domain challenges and constraints by partnering with Hugging Face
There are many unique challenges we face within our domain, including entropy injected by merchants, processing/payment companies, institutional differences, and shifts in user behavior. Designing and building effective model performance alerting, along with realistic benchmark datasets, has proven to be an ongoing challenge. Another significant hurdle is determining the optimal number of classes for the system: each class represents a considerable amount of effort to create and maintain, so we have to weigh the value it offers to our users and our business.
With a model that performed well in offline testing, our small team of ML engineers faced the next challenge: seamlessly integrating that model into our production pipeline. The existing regex system processed over 100 million transactions per month under a very bursty load, so it was crucial to have a high-availability system that could scale dynamically with load and keep overall pipeline latency low, along with the compute systems serving the models. As a small startup at the time, we chose to buy rather than build our model-serving solution. We had no in-house MLOps expertise at that point, and our ML engineers needed to focus their energy on improving model performance within the product. With this in mind, we set out in search of a solution.
We auditioned the hand-rolled, in-house model-hosting solution we had initially used for prototyping, AWS SageMaker, and Hugging Face’s new model hosting Inference API. Given that we use GCP for data storage and Google Vertex Pipelines for model training, exporting models to AWS SageMaker was clunky and bug-prone. Setup with Hugging Face, by contrast, was quick and easy, and it was handling a small amount of traffic within a week. Hugging Face simply worked out of the gate, and this reduction in friction led us down that path.
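Calling the hosted Inference API is a plain HTTPS request; a sketch is below, with a hypothetical model name and the API token read from the environment.

```python
import os
import requests

# Hypothetical private model; the hosted Inference API accepts a simple JSON
# payload and returns label/score pairs.
API_URL = "https://api-inference.huggingface.co/models/rocket-money/merchant-bert"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def classify(transaction: str) -> list:
    response = requests.post(API_URL, headers=HEADERS,
                             json={"inputs": transaction}, timeout=5)
    response.raise_for_status()
    return response.json()

print(classify("NFLX*SUBSCR 866-579-7172"))
```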
After an extensive three-month evaluation period, we selected Hugging Face to host our models. During this time, we gradually ramped up the transaction volume sent to the hosted models and ran numerous simulated load tests based on worst-case scenario volumes. This process allowed us to tune the system and monitor its performance, ultimately giving us confidence in the Inference API’s ability to handle our transaction-enrichment load.
Beyond technical capabilities, we also established a strong relationship with the Hugging Face team. We found that they were not just service providers, but partners invested in our goals and outcomes. Early in the collaboration, we set up a shared Slack channel that proved invaluable. We were particularly impressed by their prompt responses to issues and their proactive approach to problem solving. Their engineers and CSMs consistently demonstrated a commitment to our success and to getting things right, which gave us an additional layer of confidence when it was time to make the final selection.
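As a rough illustration of that kind of burst test (not our actual harness, which would more likely be a dedicated tool such as Locust or k6), firing concurrent requests and reporting latency percentiles might look like this, using aiohttp and the hypothetical endpoint from the earlier sketch:

```python
import asyncio
import os
import time

import aiohttp

# Endpoint and model name are hypothetical, carried over from the sketch above.
API_URL = "https://api-inference.huggingface.co/models/rocket-money/merchant-bert"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

async def fire(session: aiohttp.ClientSession) -> float:
    # Time a single classification request end to end.
    start = time.perf_counter()
    async with session.post(API_URL, headers=HEADERS,
                            json={"inputs": "NFLX*SUBSCR 866-579-7172"}) as resp:
        await resp.read()
    return time.perf_counter() - start

async def main(burst_size: int = 500) -> None:
    # Send a worst-case-sized burst of concurrent requests.
    async with aiohttp.ClientSession() as session:
        latencies = sorted(await asyncio.gather(
            *(fire(session) for _ in range(burst_size))))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"p50={p50:.3f}s p99={p99:.3f}s")

asyncio.run(main())
```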
Integration, evaluation, and final selection
“Overall, the experience of working hand-in-hand with Hugging Face on model deployment has been enriching for our team and has instilled confidence in our ability to drive at a greater scale.” – Nicholas Kuzak, Senior ML Engineer at Rocket Money
Once the contract was signed, we began the migration off the regex-based system by directing an increasing amount of critical-path traffic to the transformer model. Internally, we had to build new telemetry covering both model and production data monitoring. Given that this system sits early in the product experience, any inaccuracies in the model’s results could significantly impact business metrics. We ran an extensive experiment in which new users were split equally between the old system and the new model, evaluating model performance alongside a broader set of business metrics, including paid user retention and engagement. The ML model was clearly superior on retention, which gave us the confidence to scale the system out, first to new users and then to existing users.
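The split itself only needs to be deterministic and even; one common way to do it (not necessarily Rocket Money’s exact mechanism) is to hash the user ID so the same user always lands in the same arm:

```python
import hashlib

def bucket(user_id: str, experiment: str = "ml-vs-regex") -> str:
    # Deterministic 50/50 assignment: a user's arm never changes mid-experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "ml_model" if digest[0] % 2 == 0 else "regex_system"

print(bucket("user-123"))
```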
With the model fully in place in the transaction-processing pipeline, both uptime and latency became major concerns. Many downstream processes rely on the classification results, and any complications can lead to delayed data or incomplete enrichment, both of which degrade the user experience.
The first year of collaboration between Rocket Money and Hugging Face was not without its challenges, but both teams showed remarkable resilience and a shared commitment to solving problems as they arose. One such example came when we expanded the number of classes in our second production model and ran into issues. Despite this setback, the teams persevered and prevented the same problem from recurring. Another hiccup occurred when we transitioned to a new model but, due to caching issues on the Hugging Face side, received results from the previous model. The issue was addressed promptly and has not recurred. Overall, the experience of working hand-in-hand with Hugging Face on model deployment has been enriching for our team and has instilled confidence in our ability to drive at a greater scale.
Speaking of scale, as we began to see a significant increase in traffic to our models, it became clear that the cost of inference would exceed our budget. We put a caching layer in front of the inference calls to dramatically reduce the cardinality of transactions and benefit from prior inference results. Our problem was technically amenable to a 93% cache hit rate, but we only ever achieved 85% in the production environment. With the model serving 100% of predictions, there were several milestones on the Rocket Money side: our model scaled to a run rate of over a billion transactions per month, we climbed to the #1 finance app in the App Store, and we handled the traffic surges that came with charting at #7 overall.
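The key to such a caching layer is a normalization step that collapses per-user noise so that many distinct raw strings share one cache entry. A minimal sketch, with an in-memory dict standing in for a shared store (e.g. Redis) and a stub in place of the hosted model call:

```python
import re

# In production this would be a shared cache in front of the Inference API
# call sketched earlier; a plain dict stands in here.
cache: dict[str, str] = {}

def normalize(raw: str) -> str:
    # Collapsing digits and case means many raw strings map to one key,
    # which is what actually cuts the cardinality.
    return re.sub(r"\d+", "#", raw.upper()).strip()

def model_predict(text: str) -> str:
    # Stand-in for the hosted model call.
    return "netflix"

def classify_with_cache(raw: str) -> str:
    key = normalize(raw)
    if key not in cache:
        cache[key] = model_predict(key)  # cache miss: pay for one inference
    return cache[key]

# Both strings normalize to "NFLX*SUBSCR #-#-#", so the second call is a hit.
classify_with_cache("NFLX*SUBSCR 866-579-7172")
classify_with_cache("NFLX*SUBSCR 855-123-4567")
```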
Collaboration and future planning
“The uptime and confidence we have in the Hugging Face Inference API has allowed us to focus our energy on the value our models generate, rather than on plumbing and day-to-day operations.” – Nicholas Kuzak, Senior ML Engineer at Rocket Money
Post launch, the internal Rocket Money team is focused on tuning both the classes and the performance of the model, in addition to building more automated monitoring and training-label systems. We add new labels daily and run into fun model lifecycle management challenges, including unique ones stemming from mergers, new companies, and new products following Rocket Companies’ acquisition of Truebill in late 2021.
We are always checking whether there is a model topology better suited to our problem. While LLMs have been in the news lately, we have so far struggled to find an implementation that can beat our specialized transformer classifiers on both speed and cost. We do see promise in early results from using them on the long tail of services (i.e., mom-and-pop shops) – keep an eye out for that in a future version of Rocket Money! The uptime and confidence we have in the Hugging Face Inference API has allowed us to focus our energy on the value our models generate rather than on plumbing and day-to-day operations. With Hugging Face’s help, we have taken on more scale and complexity within our models and the kinds of value they generate. Their customer service and support have exceeded our expectations, and they have been a truly incredible partner on our journey.
If you’d like to learn how Hugging Face can manage your ML inference workloads, get in touch with the Hugging Face team.

