I don’t often use this blog to talk about personal matters, but this time professional and personal are tightly intertwined. So here it is: I’ve joined the early-stage startup Neuralk-AI. And I want to share why, what we’ve already built, and what’s coming next.

Neuralk in a few words

I won’t go too deep into startup lore; you’ve heard it all before. Early-stage companies are all about testing hypotheses, pivoting, sometimes dancing around a bit until a clear path emerges. At Neuralk-AI, we’ve found a promising direction: AI agents for commerce data, grounded in deep learning on tabular data.

Our guiding principle is simple: get the most out of structured, real-world data. And for now, our first playground is e-commerce.

What have we achieved so far?

Working at a young startup is as intense as it is energizing. We’re juggling internal product development, open-source contributions, and ongoing market exploration. It’s chaotic by nature: some promising branches must be pruned early, and sometimes weeks of work are shelved. But we’re moving forward fast—and that’s the essence of the startup mindset.

A tabular foundation model: NICL

One of our proudest milestones is NICL, our proprietary tabular foundation model designed for in-context learning—an emerging paradigm inspired by LLMs but applied to tables.

While I’m not directly involved in NICL’s architecture, I’ve been close enough to leave a few fingerprints. It’s exciting to be part of a team pushing the limits of what large models can do with structured data.

Due to its proprietary nature, NICL isn’t open-source—but stay tuned. We’re exploring ways to responsibly share insights, if not the model itself. In the meantime, if you’re curious about SOTA tabular models, you might want to check out TabPFN, TabTransformer, SAINT, or FT-Transformer for excellent baselines and inspiration.
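To make the in-context learning paradigm concrete, here is a minimal sketch using the open-source TabPFN (not NICL): the pretrained model conditions on the labelled rows you pass to fit() and predicts new rows in a single forward pass, with no per-dataset training loop. Constructor arguments vary across TabPFN versions, so treat this as illustrative rather than a recipe.

```python
# In-context learning on a table, illustrated with the open-source TabPFN
# (not NICL). fit() essentially stores the labelled context rows; predict()
# runs one forward pass of the pretrained model, with no gradient updates.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()   # pretrained prior; defaults may differ by version
clf.fit(X_train, y_train)  # conditions on the context rows
accuracy = (clf.predict(X_test) == y_test).mean()
print(f"test accuracy: {accuracy:.3f}")
```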

See our results on OpenML datasets: OpenML benchmark

TabBench: real-world benchmarking for tabular ML

My main contribution so far has been TabBench, our new benchmark suite designed to evaluate models on real industrial tasks, not just sanitized public datasets.

Why another benchmark? Because most existing ones either rely on overused datasets (think Titanic, Criteo, or UCI classics), or require significant setup costs to evaluate your model (PyTabKit and LAMDA TALENT).

TabBench strikes a different balance:

  • Lightweight integration (just plug in your model; see the sketch after this list),
  • Realistic, proprietary tasks where possible (with strict anonymization),
  • Tasks that highlight generalization, robustness, and data issues common in the wild.
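To give a sense of what “lightweight integration” means in practice, here is a hypothetical sketch. None of the names below come from the real TabBench API; it only shows the shape of the contract, where the benchmark owns the splits and metrics and your model only has to expose fit/predict.

```python
# Hypothetical sketch of "plug in your model": the runner owns the data
# splits and the metric, the model is any object with fit/predict.
# run_task is a stand-in, not the real TabBench API.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def run_task(model, X, y, seed=0):
    """Stand-in benchmark runner: fixed split, fixed metric."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed, stratify=y)
    model.fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te), average="macro")


# Any estimator exposing fit/predict can be dropped in.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print("macro-F1:", run_task(HistGradientBoostingClassifier(), X, y))
```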

Our first industrial task is product categorization: the pragmatic cousin of academic classification. Unlike cleanly labeled datasets, categorization in the wild deals with incomplete descriptions, ambiguous product names, and taxonomies that grow and shift over time. It’s the kind of challenge that forces models to be robust and generalize beyond textbook definitions.
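As a toy illustration (entirely made-up data, not one of our tasks), here is how quickly a textbook text-classification pipeline meets the realities of a product feed, starting with descriptions that are simply missing:

```python
# Toy product catalog: short titles, missing descriptions, hierarchical labels.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

catalog = pd.DataFrame({
    "title": ["USB-C cable 1m", "cable usb c", "Running shoes 42", "shoes"],
    "description": ["Braided charging cable", None, None, "Lightweight trainers"],
    "category": ["electronics>cables", "electronics>cables",
                 "sport>footwear", "sport>footwear"],
})

# Missing descriptions are the norm, not the exception: fill and concatenate.
text = catalog["title"] + " " + catalog["description"].fillna("")

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(text, catalog["category"])
print(clf.predict(pd.Series(["usb c charging cable 2m"])))
```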

We hope TabBench becomes a place where researchers and practitioners can meet halfway—where new models can be tested not only on Kaggle-like data, but on the messier, richer datasets businesses actually deal with.

If you’ve been paying close attention to our core package tutorial, you may have caught a glimpse of what’s brewing behind the scenes. A subtle clue here, a stray comment there… and yes, a few of you sharp-eyed readers picked it up.

Record linkage is coming.

And that’s just the beginning. We won’t spoil the rest—but let’s just say more industrial tasks are quietly lining up behind the curtain. Stay curious. Stay tuned.

Foundry: our ML pipeline system, built like Lego

From benchmark to production and back again. To support rapid experimentation and deployment, we also released Foundry, the ML pipeline framework that supports TabBench.

Inspired by internal needs but refined for the community, Foundry prioritizes modularity and ease of use:

  • Want to add your model? Just swap a single component.
  • Need extra features? Inject them with minimal code into the pipeline.
  • Nested cross-validation, lazy loading, reproducible seeds: it’s all baked in (nested CV is sketched below).
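If nested cross-validation is new to you, this is what it looks like when wired by hand in scikit-learn: the inner loop tunes hyperparameters, the outer loop estimates generalization on folds the tuning never saw. Foundry bakes this wiring in, so the snippet is only an illustration of the concept, not Foundry code.

```python
# Nested cross-validation by hand: GridSearchCV tunes on inner folds,
# cross_val_score measures the tuned model on outer folds it never touched.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 6, None]},
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
outer_scores = cross_val_score(
    inner, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)
print(f"nested CV accuracy: {outer_scores.mean():.3f}")
```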

While there are excellent tools out there, we often found the barrier to entry surprisingly steep, especially when all you want is to get started quickly, whether you’re prototyping a new idea or benchmarking hundreds of models. That’s why we built Foundry to feel intuitive from day one.

Interestingly, we only realized later that our design philosophy shares quite a bit with ZenML. I’m not a ZenML expert myself, but the parallels are striking. If you’ve built a pipeline with Foundry, chances are it could translate quite naturally to a more MLOps-oriented stack like ZenML—with very little friction.
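For reference, here is a minimal ZenML-style pipeline, based on the decorator API documented in recent ZenML releases. I haven’t actually ported a Foundry pipeline to it, so treat this purely as a sketch of the shape such a translation would target, not a vetted example.

```python
# Minimal ZenML-style pipeline: steps are decorated functions, the pipeline
# wires them together, and ZenML tracks each run and its artifacts.
# Assumes a ZenML environment is already initialized.
from zenml import pipeline, step


@step
def load_data() -> list[float]:
    """Stand-in data-loading step."""
    return [1.0, 2.0, 3.0]


@step
def train_model(data: list[float]) -> float:
    """Stand-in training step: returns a trivial 'model' (the mean)."""
    return sum(data) / len(data)


@pipeline
def toy_pipeline():
    data = load_data()
    train_model(data)


if __name__ == "__main__":
    toy_pipeline()
```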

If you do try bridging the two, let me know how it goes—I’d love to hear what transfers well (and what doesn’t).

What’s next?

The early reception to TabBench has been encouraging. So we’ll keep going—more tasks, better datasets, and broader community involvement.

One thing we’re especially committed to is pushing beyond traditional benchmarks: fraud detection, duplicate detection, table augmentation, robust tabular transformers… all of them rarely seen in academic benchmarks, yet central to real-world systems.

We also hope to open more collaborative challenges based on these benchmarks. If you’re working on tabular models, or just want to see how your architecture fares in practice, we’d love to hear from you.

Conclusion

Joining Neuralk-AI was a leap, but a good one. There’s something deeply rewarding about building tools you wish you had in past jobs—and watching others use them to go further, faster.

If you’re as curious as we are about structured data, practical ML, or foundation models outside of NLP, I think you’ll like where we’re heading.

Stay tuned. Or better, join us!