
We are extremely excited to officially announce that we have acquired XetHub 🔥
XetHub is a Seattle-based company founded by Yucheng Low, Ajit Banerjee, and Rajat Arya, who previously worked at Apple, where they built and scaled Apple’s internal ML infrastructure. XetHub’s mission is to enable best practices in software engineering for AI development. XetHub has developed technology that enables Git to scale to terabyte-sized repositories and lets teams collaborate on, understand, and work with large, evolving datasets and models. The founders were soon joined by a talented team of 12. You should follow them on their new org page: hf.co/xet-team
Our common goals with HF
The XetHub team will help us unlock the next five years of growth of HF datasets and models by switching to our own, better version of LFS as the storage backend for the Hub’s repositories.
– Julien Chaumond, HF CTO
When I built the first version of the HF Hub in 2020, I decided to build it on top of Git LFS because it was decently well known and a reasonable option for bootstrapping the Hub’s usage.
But I knew that at some point we would want to switch to a more optimized storage and versioning backend. Git LFS – even though it stands for Large File Storage – was never really meant for the type of large files AI handles.
Examples of future use cases 🔥 – what will this unlock on the Hub?
Let’s say you have a 10GB Parquet file and you add a single row. Today, you need to re-upload all 10GB. With chunked files and deduplication from XetHub, you will only need to re-upload the few chunks containing the new row.
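The idea of chunk-level deduplication can be illustrated with a toy sketch. This is not XetHub’s actual design – the fixed chunk size and SHA-256 hashing here are assumptions for illustration only:

```python
import hashlib

# Toy sketch of chunk-level deduplication (illustrative only; the chunk
# size and hashing scheme are assumptions, not XetHub's actual design).
CHUNK_SIZE = 64 * 1024  # fixed 64KB chunks for simplicity


def chunk_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def chunks_to_upload(old: bytes, new: bytes) -> int:
    """Count chunks of `new` whose hash is not already stored for `old`."""
    stored = set(chunk_hashes(old))
    return sum(1 for h in chunk_hashes(new) if h not in stored)


old_file = b"x" * (10 * CHUNK_SIZE)     # stand-in for the big Parquet file
new_file = old_file + b"one new row\n"  # append a single row

# Only the final chunk is new; everything else deduplicates away.
print(chunks_to_upload(old_file, new_file))  # → 1
```

Note that fixed-size chunking only deduplicates well for appends; real systems typically use content-defined chunking so that insertions in the middle of a file don’t shift every subsequent chunk boundary.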
Another example with a GGUF model file: suppose @Bartowski wants to update a single metadata value in the GGUF header of a Llama 3.1 405B repo. In the future, Bartowski will only need to re-upload a single chunk of a few kilobytes, making the process far more efficient.
As the field moves to ever-larger models in the coming months (thanks to Maxime Labonne for the new BigLlama-3.1-1T), this new technology will unlock new scale both within the community and within enterprises.
Finally, large datasets and large models pose challenges to collaboration. How do teams work together on large data, models, and code? How do users understand how their data and models evolve? We are working on finding better solutions to these questions.
Fun current stats on Hub repos 🤯🤯
Number of repos: 1.3M models, 450K datasets, 680K Spaces
Total cumulative size: 12PB stored in LFS (280M files) / 7.3TB stored in Git (non-LFS)
Daily requests: 1B
Daily CloudFront bandwidth: 6PB 🤯
Personal words from @ylow
I have been part of the AI/ML world for over 15 years and have seen deep learning slowly take over vision, speech, and text, gradually absorbing every data domain.
What I consistently underestimated was the power of data. What seemed impossible just a few years ago (e.g., image generation) turned out to be achievable with bigger models and the data to feed them. In hindsight, this is a repeated lesson in ML history.
I have been working in the data domain since my PhD, first with a startup (GraphLab/Dato/Turi) that made structured data and ML algorithms scale on a single machine. After it was acquired by Apple, I worked on scaling AI data management to over 100PB, supporting more than 10 internal teams that shipped 100+ features annually. In 2021, I started XetHub with my co-founders, backed by Madrona and other angel investors, to bring those learnings to large-scale collaboration around the world.
The goal of XetHub is to enable ML teams to operate like software teams, by scaling Git file storage to terabytes, enabling seamless experimentation and reproducibility, and providing visualization capabilities to understand how datasets and models evolve.
I am extremely excited to join Hugging Face along with the entire XetHub team, to integrate our technology into the Hub to facilitate AI collaboration and development, and to release these capabilities to the world’s largest ML community.
Finally, our infrastructure team is hiring.
If you like these topics and would like to help build and scale the collaboration platform for the open-source AI movement, please reach out!