Hugging Face – Posts


posted an update 1 day ago

Post

7866

A new model is coming!
It's going to take a long time on my 5070 Ti, so expect a release in ~1 month.
We think this model is going to be SOTA for its size.
The Mini version will have 25M parameters and the Pro version 140M.
The Pro version has a 3072-token context window (extensible up to 6K with RoPE), and the Mini version has a 4096-token context window (up to 8K with RoPE).
Meanwhile, we are working on an Instruct version of our BananaMind 1.5 Base.

The training will start this weekend.

We are very excited to release it when it's done!
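The post does not say which RoPE scaling method will be used for the extension. As a minimal sketch, assuming linear position interpolation (positions are divided by a scale factor so a longer sequence maps back into the trained range), and assuming "6K" and "8K" mean 6144 and 8192 tokens, the idea looks roughly like this. The function name and `head_dim=64` are illustrative, not taken from the model.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    scale > 1 applies linear position interpolation. This is an assumption:
    the post does not name the scaling method it will use.
    """
    return [
        (position / scale) * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]

# Assumed readings of the post: 3072 -> 6144 (Pro) and 4096 -> 8192 (Mini).
pro_scale = 6144 / 3072
mini_scale = 8192 / 4096

# With scale 2, position 6144 receives the angles that position 3072
# had during training, so the model never sees out-of-range rotations.
extended = rope_angles(6144, head_dim=64, scale=pro_scale)
trained = rope_angles(3072, head_dim=64)
print(all(math.isclose(a, b) for a, b in zip(extended, trained)))
```

Both versions end up with the same scale factor of 2 under these assumptions; other schemes (for example NTK-style base rescaling) change `base` instead of the positions.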

8 replies

ginigen-ai

posted an update 2 days ago

Post

10200

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen), the strongest on multiple-choice traps (trap_rate 0.005, about 2 misses in 400), is blind to its own free-form mistakes (self-confidence AUROC = 0.5, no better than random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row):
① trap_rate: does it fall for tempting trap options? (lower = stronger)
② adapter gain Δ: how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)
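To make the two numbers quoted above concrete, here is a small sketch of how a trap rate and an AUROC can be computed. It is not the benchmark's scoring code; the function names and the toy data are illustrative assumptions.

```python
def trap_rate(fell_for_trap):
    """Fraction of problems where the model picked the trap option."""
    return sum(fell_for_trap) / len(fell_for_trap)

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win, so a constant score yields exactly 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 2 misses in 400 problems gives the trap_rate quoted in the post.
misses = [True] * 2 + [False] * 398
print(trap_rate(misses))  # 0.005

# Toy data: a signal that is identical for right and wrong answers
# carries no information about errors.
is_wrong = [True, False, True, False]
flat_signal = [0.9, 0.9, 0.9, 0.9]
print(auroc(flat_signal, is_wrong))  # 0.5

# A signal that is higher exactly when the answer is wrong separates perfectly.
sharp_signal = [0.8, 0.1, 0.7, 0.2]
print(auroc(sharp_signal, is_wrong))  # 1.0
```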

What's open:
📊 300+100 trap problems (each with a hidden trap + TICOS type)
🏆 24-model leaderboard
🧩 11 per-model adapters: adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))
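The post does not describe the adapter architecture. As a rough illustration of "base stays frozen, adapter reads the hidden state → P(wrong)", the sketch below assumes the simplest possible form, a logistic probe trained on cached hidden states. The two-dimensional states and labels are made up for the example.

```python
import math

def p_wrong(hidden_state, weights, bias):
    """Logistic probe: frozen hidden state -> probability the answer is wrong."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def train_probe(states, wrong_labels, lr=0.5, epochs=200):
    """Fit only the probe's parameters; the base model is never updated."""
    weights, bias = [0.0] * len(states[0]), 0.0
    for _ in range(epochs):
        for h, y in zip(states, wrong_labels):
            err = p_wrong(h, weights, bias) - y  # gradient of log loss w.r.t. the logit
            weights = [w - lr * err * x for w, x in zip(weights, h)]
            bias -= lr * err
    return weights, bias

# Hypothetical hidden states cached from a frozen model (1 = answer was wrong).
states = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
wrong = [1, 1, 0, 0]
weights, bias = train_probe(states, wrong)

print(p_wrong([1.0, 0.2], weights, bias) > 0.5)  # flagged as likely wrong
print(p_wrong([0.1, 0.8], weights, bias) < 0.5)  # flagged as likely right
```

Because only `weights` and `bias` are trained, the base model's outputs are unchanged, which is the distinction the post draws between an adapter and a fine-tune.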

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).

9 replies

stas

posted an update 3 days ago

Post

3563

After many months of intense work, the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
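As a quick sanity check on the figures in the list above, the arithmetic can be reproduced directly. The only assumption is that "~5 days" is read as 120 hours.

```python
# Wall-clock figures from the post (both are approximate).
baseline_hours = 5 * 24  # "~5 days", read here as 120 hours
zorro_hours = 36         # "~36 hours"
speedup = baseline_hours / zorro_hours
print(f"{speedup:.2f}x end-to-end")  # 3.33x end-to-end

# Accuracy gains of the two recipes, in percentage points.
bird_gain = 70.35 - 59.92
qa_gain = 72.3 - 69.6
print(f"BIRD dev: +{bird_gain:.2f} points")     # BIRD dev: +10.43 points
print(f"Multi-hop QA: +{qa_gain:.1f} points")   # Multi-hop QA: +2.7 points
```

The rounded durations give roughly 3.3x, which is in line with the quoted 3.5x given that both durations are approximate.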

4 replies

ginigen-ai

posted an update 4 days ago

Post

5135

🍳 The RoboCasa Kitchen Leaderboard
What does it take for a robot to handle kitchen chores the way a person does? It has to see (Vision), understand instructions (Language), and actually act (Action) — and VLA (Vision-Language-Action) models are emerging as the answer. They're the bridge between large multimodal models and real-world embodied control.

RoboCasa Kitchen is a leading robot-learning benchmark in which a single-arm robot (Franka Panda) performs 24 atomic manipulation tasks — picking up cups and bowls, opening drawers and doors, turning faucets, pressing buttons, and more — inside a photorealistic simulated kitchen. Because the layout and object placement are randomized every episode, it tests genuine generalization rather than memorized motions. The score (success rate, SR) is the average fraction of the 24 tasks completed as instructed, measured over multiple seeds so results aren't down to luck.
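A minimal sketch of the success-rate computation described above, assuming the aggregation is a per-task success fraction, averaged over tasks and then over seeds. The exact protocol varies by paper, and the task names and episode outcomes here are made up.

```python
def success_rate(results):
    """results[seed][task] is a list of per-episode booleans.

    Returns the mean over seeds of the mean per-task success fraction.
    """
    per_seed = []
    for tasks in results.values():
        per_task = [sum(episodes) / len(episodes) for episodes in tasks.values()]
        per_seed.append(sum(per_task) / len(per_task))
    return sum(per_seed) / len(per_seed)

# Toy example with 2 tasks instead of 24, and 2 seeds.
results = {
    0: {"open_drawer": [True, True, False, True],
        "press_button": [True, False, False, False]},
    1: {"open_drawer": [True, True, True, True],
        "press_button": [False, False, True, True]},
}
print(success_rate(results))  # 0.625
```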

The catch: this benchmark has no official leaderboard, and protocols (number of demonstrations, evaluation setup) differ from paper to paper, leaving scores scattered. Lining the numbers up naively quickly turns into an apples-to-oranges comparison.

This leaderboard fixes that by collecting published scores with their sources and comparing only what is genuinely comparable. It's split into three tables:

🏆 Kitchen 24-task (matched) — head-to-head under identical conditions (per the RLDX-1 Technical Report). This is the core ranking you can actually trust.
➕ Other protocols — self-reported under different setups (e.g. fewer demos). Not directly comparable, so kept separate.
🤖 GR1-Tabletop — a different, humanoid-based variant suite, separated to avoid confusion.

Any researcher can submit their own model's score directly, and submissions are reviewed before they appear on the board. Every number links to its source paper, so you can verify it yourself.

👉 ginigen-ai/robocasa-kitchen-leaderboard

stas

posted an update about 13 hours ago

Post

I present to you a new experimental open book.

https://github.com/stas00/python-cookbook

I took my dense Python cheatsheet, which I have been honing for many years and use daily, and turned it into a book of recipes.

Is this useful?

This is, of course, free, like other open books.

salma-remyx

posted an update about 14 hours ago

Post

What's holding your code back?
Outrider finds, implements, and validates methods for your repo.

While testing Outrider on a fork of huggingface/peft, I discovered "Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models" (arXiv:2402.02347).

The work offers improved stability and faster convergence in LoRA finetuning by adjusting updates for curvature that LoRA optimizers typically ignore.
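The sketch below shows one reading of the curvature adjustment mentioned above. It assumes each LoRA factor's gradient is scaled by the inverse Gram matrix of the other factor. It is not the code from the peft PR, it is restricted to rank 2 so the inverse can be written in closed form, and all names are illustrative.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def inv2x2(M, damping=0.0):
    """Closed-form inverse of a 2x2 matrix, with optional diagonal damping."""
    a, b = M[0][0] + damping, M[0][1]
    c, d = M[1][0], M[1][1] + damping
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def scaled(X, factor):
    return [[factor * v for v in row] for row in X]

def preconditioned_steps(B, A, G, lr, damping=0.0):
    """Steps (dB, dA) for W = B @ A, given G = dL/dW. Rank 2 only."""
    grad_A = matmul(transpose(B), G)  # shape r x n
    grad_B = matmul(G, transpose(A))  # shape m x r
    dA = matmul(inv2x2(matmul(transpose(B), B), damping), grad_A)
    dB = matmul(grad_B, inv2x2(matmul(A, transpose(A)), damping))
    return scaled(dB, -lr), scaled(dA, -lr)

def first_order_dW(B, A, dB, dA):
    """First-order change in the product: B @ dA + dB @ A."""
    return [[x + y for x, y in zip(r1, r2)]
            for r1, r2 in zip(matmul(B, dA), matmul(dB, A))]

# Hypothetical factors and gradient.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
G = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.4], [1.0, 0.0, 0.7]]

dW = first_order_dW(B, A, *preconditioned_steps(B, A, G, lr=0.1))

# Rescale the factors without changing the product: (2B)(A/2) = BA.
B2, A2 = scaled(B, 2.0), scaled(A, 0.5)
dW2 = first_order_dW(B2, A2, *preconditioned_steps(B2, A2, G, lr=0.1))

same = all(abs(x - y) < 1e-9 for r1, r2 in zip(dW, dW2) for x, y in zip(r1, r2))
print(same)  # True
```

With damping set to zero, the first-order update to the product is the same for both factorizations, whereas an unpreconditioned gradient step would scale differently for each factor. That invariance is one way to see why this kind of preconditioning can stabilize training.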

Not the most recent paper, so I was pleasantly surprised my action surfaced this method as a candidate before implementing a PR. Even more surprised this method had not already been merged upstream.

Turns out, the author did try contributing to peft a couple years ago, but people get busy and the PR was closed after going stale.

So I decided to revive it! I opened an issue and soon after the author engaged to help land the feature. Now huggingface/peft #3382 is open, a joint effort with the paper's author.

This whole episode has me thinking about the future of OSS maintenance with AI coding. The software projects which endure will be well-shaped to quickly land and help test new ideas.

Across 30 forks, I've seen several papers land as clean PRs for multiple repos, which offers a perspective on how methods impact applications. Recent methods matching multiple frameworks: STARE, Entity Binding, BINEVAL

Get Outrider: https://github.com/remyxai/outrider

aufklarer

posted an update about 19 hours ago

Post

Voice cloning models measured across five languages: OmniVoice, Chatterbox, VoxCPM2, Fish Audio

I published a new Soniqo benchmark post for local voice cloning models across five languages:

https://www.soniqo.audio/blog/voice-cloning-benchmarks

Models:

- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16

Languages:

- English
- German
- Modern Standard Arabic
- Spanish
- Mandarin Chinese

The benchmark uses Google FLEURS test clips as dataset references. Each row includes the reference audio, generated audio, speaker similarity, WER/CER, generated audio length, and RTF.

Main result in this run: OmniVoice was the strongest all-around row set, with 0.707 mean speaker cosine across all five languages, 0.0% ASR error, and mean RTF 0.45. VoxCPM2 bf16 was especially strong on Arabic speaker match. Fish Audio S2 Pro showed strong German/Arabic similarity but slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.

This is an engineering benchmark, not a human MOS study. The speaker-similarity values should be compared within this table because every row uses the same local speaker-embedding pipeline.
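For readers unfamiliar with the three metrics reported per row, here is a minimal sketch using their conventional definitions. The benchmark's own pipeline is not shown in the post, so the embeddings, transcripts, and timings below are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Below 1.0 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_seconds

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))                       # about 0.707
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1 error in 6 words
print(real_time_factor(4.5, 10.0))                                     # 0.45
```

Cosine values depend on the embedding model that produced the vectors, which is why the post asks readers to compare similarity scores only within its own table.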

Try the stack locally with Speech Studio:

https://www.soniqo.audio/speech-studio
https://github.com/soniqo/speech-studio

Underlying Swift library/CLI:

https://github.com/soniqo/speech-swift

Soniqo models and exports:

soniqo @aufklarer

What model or language should I add next?

breitburg

posted an update about 19 hours ago

Post

I've been experimenting with "pure" model alignment.

The core idea is to train only a verifiable version of a capacity until the model generalizes it to the non-verifiable version. For example, training the model on factual self-knowledge, like the model's scale, architecture, runtime situation, and the ability to predict its own behavior, betting this generalizes to real introspection about states that are not verifiable.

The same principle applies to general instruction following -- no training on subjective judgement, only verifiable claims and inferences, betting the skill generalizes to instructions where correctness is a matter of judgment.

The primary alignment claim is that an identity and taste that will emerge this way will be much more robust and honest than hand-scripted ones (e.g. "As an AI language model...").

During training, we should never teach it to make subjective claims or invent experiences that we assume it has, like "I don't have taste" or "I'm not self-aware in the way you think", and never teach it to narrate internal states like "I'm curious now".

The main threat, of course, is that we'll simply inherit the training distribution of all the things like "taste", and we'll get an average. However, given the recent research on models' introspection abilities, it might just as well be the case that we'll get something more honest than a model that tries to adhere to a specific spec file.

I'm posting new experimental models trained that way in this collection: https://huggingface.co/collections/breitburg/neue

3 replies

kanaria007

posted an update about 21 hours ago

Post

✅ Article highlight: *Mega-Parse Bridge: Large Context Compression Without Losing Governance Semantics* (art-60-190, v0.1)

TL;DR:
This article argues that summarizing a huge input is not the same as parsing it.

Large documents, evidence bundles, long histories, multimodal case packets, and world-state slices cannot be treated as one vague “context.” art-60-190 turns large-input handling into a governed mega-parse: shard, parse, retain semantics, declare loss, preserve re-expandability, and decide what the compressed artifact can honestly support.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents “I read the whole thing” from becoming an overclaim
• keeps shard-level provenance instead of trusting a summary blob
• makes compression loss explicit and reviewable
• protects contradictions, authority-sensitive clauses, and protected-subject distinctions
• lets reviewers re-expand compressed claims back to source structure

What’s inside:
• mega-parse intake envelopes for large text, multimodal batches, and long-running packets
• shard-parse receipts for local grounded structure
• semantic-retention policies for what must survive compression
• compression artifacts with declared retention and bounded loss
• loss-declaration receipts for dropped, blurred, or unavailable surfaces
• re-expandability maps linking compressed claims back to recoverable shards
• admissibility and reentry artifacts for deciding where compressed outputs may be used

Key idea:
Do not say:

*“the system summarized the context.”*

Say:

*“this large input was sharded, locally parsed, compressed under this retention policy, loss-declared, re-expandable through these refs, and admitted only for these effect surfaces.”*

Compression is allowed.

Unreceipted semantic loss is not.
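The shard, receipt, loss-declaration, and re-expansion steps described above could be sketched as follows. The article's actual schemas are not shown in the post, so every class, field, and function name here is hypothetical, and the "retention policy" is reduced to a simple predicate.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ShardReceipt:
    shard_id: str
    sha256: str  # fingerprint so a reviewer can verify the source shard
    text: str

@dataclass
class CompressedArtifact:
    claims: dict         # retained claim -> shard_ids it re-expands to
    declared_loss: list  # shard_ids whose content was dropped

def shard(text):
    """Split the input into line-level shards, each with a receipt."""
    return [
        ShardReceipt(f"shard-{n}", hashlib.sha256(line.encode()).hexdigest(), line)
        for n, line in enumerate(text.split("\n"))
    ]

def compress(receipts, keep):
    """Retain shards matching the policy; declare everything else as loss."""
    claims, lost = {}, []
    for r in receipts:
        if keep(r.text):
            claims.setdefault(r.text, []).append(r.shard_id)
        else:
            lost.append(r.shard_id)
    return CompressedArtifact(claims, lost)

def re_expand(artifact, receipts, claim):
    """Map a compressed claim back to the source shards that support it."""
    by_id = {r.shard_id: r for r in receipts}
    return [by_id[s].text for s in artifact.claims[claim]]

source = "MUST: keep audit log\nnote: greeting\nMUST NOT: delete records"
receipts = shard(source)
artifact = compress(receipts, keep=lambda t: t.startswith("MUST"))

print(artifact.declared_loss)                                   # ['shard-1']
print(re_expand(artifact, receipts, "MUST: keep audit log"))    # ['MUST: keep audit log']
```

The point of the sketch is that the dropped shard is listed explicitly rather than silently omitted, and every retained claim keeps a reference back to its source.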

fffiloni

posted an update 1 day ago

Post

160

⏱️ Built a small Space for Visual Chronometer / Pulse of Motion.

Upload a video and estimate its Physical FPS: the frame rate implied by visual motion, independent of metadata.
Useful to inspect “chronometric hallucination” in generated videos: clips that look smooth, but move with the wrong physical time scale.

Try it here: fffiloni/Pulse-of-Motion
