Announcing Magika 1.0: now faster, smarter, and rebuilt in Rust

Early last year, we open-sourced Magika, Google's AI-powered file type detection system. Magika has seen great adoption by open source communities since that alpha release, with over one million monthly downloads. Today, we are happy to announce the release of Magika 1.0, the first stable version, which introduces new features and a host of major improvements since the last announcement. Here are the highlights:

Smarter Detection: Doubling Down on File Types

Magika 1.0 now identifies more than 200 content types, double the number of file types supported in the initial release. This isn't just about a bigger number; it unlocks far more granular and useful identification, especially for specialized, modern file types.

Some of the notable new file types detected include:

Expanding Magika's detection capabilities introduced two significant technical hurdles: data volume and data scarcity.

First, the scale of the data required for training was a key consideration. Our training dataset grew to over 3TB when uncompressed, which required an efficient processing pipeline. To handle this, we leveraged our recently released SedPack dataset library. This tool allows us to stream and decompress this large dataset directly to memory during training, bypassing potential I/O bottlenecks and making the process feasible.
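To make the idea of streaming and decompressing directly to memory concrete, here is a minimal sketch using only the Python standard library. It is not SedPack's API: the `stream_samples` function, the gzip shard format, and the fixed `sample_size` are all assumptions chosen for illustration.

```python
import gzip
import io
from typing import Iterable, Iterator


def stream_samples(shards: Iterable[bytes], sample_size: int) -> Iterator[bytes]:
    """Yield fixed-size samples from gzip-compressed shards, one shard at a time.

    Hypothetical helper: the shard layout is an assumption, not SedPack's format.
    """
    for shard in shards:
        # Decompress as an in-memory stream; nothing is written back to disk.
        with gzip.GzipFile(fileobj=io.BytesIO(shard)) as stream:
            while True:
                sample = stream.read(sample_size)
                if not sample:
                    break  # end of this shard, move on to the next one
                yield sample
```

Because the function is a generator, only the shard currently being read needs to be held in memory, and a training loop can start consuming samples before later shards are touched.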

Second, while common file types are plentiful, many of the new, specialized, or legacy formats presented a data scarcity challenge. It is often not feasible to find thousands of real-world samples for every file type. To overcome this, we turned to generative AI. We leveraged Gemini to create a high-quality, synthetic training set by translating existing code and other structured files from one format to another. This technique, combined with advanced data augmentation, allowed us to build a robust training set, ensuring Magika performs reliably even on file types for which public samples are not readily available.

The complete list of all 200+ supported file types is available in our revamped documentation.

Under the Hood: A High-Performance Rust Engine

We completely rewrote Magika's core in Rust to provide native, fast, and memory-safe content identification. This engine is at the heart of the new Magika native command line tool that can safely scan hundreds of files per second.

Output of the new Magika Rust-based command line tool

Magika can identify hundreds of files per second on a single core and scales easily to thousands per second on modern multi-core CPUs, thanks to the high-performance ONNX Runtime for model inference and Tokio for asynchronous parallel processing. For example, as visible in the chart below, on a MacBook Pro (M4), Magika processes nearly 1,000 files per second.
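The general pattern of fanning identification out across workers and reporting files per second can be sketched in a few lines of Python. This is an illustration only, not Magika's implementation (which is written in Rust on top of Tokio and ONNX Runtime): `toy_detect` is a hypothetical stand-in that checks a few well-known magic bytes, and `identify_all` is a name introduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def toy_detect(data: bytes) -> str:
    """Hypothetical stand-in for model inference: a few magic-byte checks."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"


def identify_all(
    blobs: Sequence[bytes],
    detect: Callable[[bytes], str] = toy_detect,
    workers: int = 4,
) -> Tuple[List[str], float]:
    """Run `detect` over all blobs in a worker pool; return labels and files/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = list(pool.map(detect, blobs))  # map preserves input order
    elapsed = time.perf_counter() - start
    rate = len(blobs) / elapsed if elapsed > 0 else float("inf")
    return labels, rate
```

Note that Python threads do not run pure-Python code in parallel on the standard interpreter, so this sketch shows the structure of the measurement rather than the speedups a native multi-core engine can reach.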

Getting Started

Ready to try it out? Getting started with the native command-line client is as simple as running a single command:

Alternatively, the new Rust command-line client is also included in the magika Python package, which you can install with: pipx install magika.

For developers looking to integrate Magika as a library into their own applications in Python, JavaScript/TypeScript, Rust, or other languages, head over to our comprehensive developer documentation to get started.
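For the Python route, a minimal sketch might look like the following. It assumes the `magika` package is installed and that its `Magika` class exposes an `identify_bytes` method whose result carries an `output.label` field, as in the project's published examples. The `detect_label` helper and its `identify` parameter are hypothetical names introduced here, so treat the developer documentation as the authoritative reference.

```python
from typing import Any, Callable, Optional


def detect_label(
    data: bytes,
    identify: Optional[Callable[[bytes], Any]] = None,
) -> str:
    """Return the content-type label for a byte string.

    `identify` is a hypothetical injection point (useful in tests); by default
    it is assumed to be the `identify_bytes` method of a `Magika` instance.
    """
    if identify is None:
        # Imported lazily so this module loads even without the package installed.
        from magika import Magika  # assumption: installed via `pip install magika`

        identify = Magika().identify_bytes
    result = identify(data)
    return result.output.label
```

With the package installed, `detect_label(b"function log(msg) { console.log(msg); }")` would return whatever label the model assigns to that snippet. In a real application it is probably better to create the detector once and pass its method in as `identify`, rather than constructing a new instance on every call.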

What's next

We're incredibly excited to see what you will build using Magika's enhanced file detection capabilities.

We invite you to join the community:

Thank you to everyone who has contributed, provided feedback, and used Magika over the past year. We can't wait to see what the future holds.

Acknowledgements

Magika's continued success was made possible by the help and support of many people, including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Alex Petit-Bianco, Kurt Thomas, Luca Invernizzi, Lenin Simicich, and Amanda Walker.
