San Francisco Bay Area
3K followers
500+ connections
Activity
-
Akshay Malik reposted this
Today we are excited to announce, in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud, a major milestone in Ray Serve LLM’s throughput and latency characteristics: Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router in benchmarks across a variety of workloads and deployment patterns. In our new blog, we cover three major optimizations to the Ray Serve LLM + vLLM stack that made this possible: direct streaming, a new vLLM Ray executor backend, and HAProxy integration. As a result, we see up to 4.4x higher request throughput than previous versions on prefill-heavy workloads, and up to 24x higher request throughput on decode-heavy workloads. Ray is a popular choice for complex, Python-native distributed computing batch inference pipelines with heterogeneous hardware. And now, we believe that Ray’s powerful primitives for fault tolerance, observability, and flexibility across Kubernetes and VMs will enable the next generation of optimizations as LLM inference deployments become increasingly complex. Thanks to Spencer Peterson, Andrew Sy Kim, Kourosh Hakhamaneshi, Jeffrey (Yu-Che) Wang, Richard Liaw, Akshay Malik, Abrar Sheikh, and Alex Yang, whose contributions to Ray Serve and Ray Serve LLM made this possible. A special thanks to the vllm-router (vLLM) and SGLang Model Gateway (SGLang) teams for great engineering on their respective projects. Read the full writeup here: https://lnkd.in/guHrz_FA
-
Akshay Malik reposted this
Wrote a technical deep dive on Satya Nadella's point that the moat is the learning loop you build on a model (not the model you rent). A top-down walkthrough, with diagrams and code, of why companies across industries (finance, robotics, autonomy, ecommerce, and biology) are already doing it and how they're winning.
A technical guide to building your own learning loop
Goku Mohandas
-
Akshay Malik reposted this
One pattern we keep seeing with teams serving LLMs at scale: prefill-decode (PD) disaggregation is often treated like a magic wand. Turn it on, and latency/throughput should get better. But the reality is more nuanced. PD can deliver major gains (in our experiments, up to 2.7x better goodput and up to 67% compute cost reduction on AMD MI325X with Ray Serve + vLLM), but only when the workload and SLA are a good fit. That is why we wrote this post: to share the core insights for when PD helps, when it does not, and how to reason about it in practice.
A few takeaways:
1. PD does not make prefill faster. It adds a KV transfer step, so TTFT can get worse. If your SLA is strictly TTFT-bound, aggregated serving is often simpler and better.
2. PD’s real win is TPOT. By separating prefill and decode onto dedicated GPUs, decode avoids prefill interruptions and TPOT stays much flatter under load.
3. TPOT savings compound over generation length. A few milliseconds per token may look small, but over hundreds or thousands of generated tokens, it can become a meaningful E2E latency and throughput improvement.
4. The P ratio is workload-dependent. Input/output length, KV cache hit rate, target QPS, and latency SLA all affect the optimal split. A bad ratio can make PD worse than aggregated.
We also validated this on AMD + vLLM, where the path for prefill-decode disaggregation has been much less paved. Full post with intuition, benchmarks, and reproducible AMD + Ray + vLLM setup: https://lnkd.in/gnrmmSrK
Achieving Up to 67% Cost Savings with Prefill-Decode Disaggregation Using Ray + vLLM on AMD MI325X | Anyscale
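Takeaways 1 and 3 in the post above can be made concrete with a little arithmetic. The sketch below assumes the common approximation that end-to-end latency is TTFT plus one TPOT for each token after the first. The latency figures are made up for illustration and are not taken from the post's benchmarks.

```python
import math


def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then one TPOT per remaining token."""
    return ttft_ms + tpot_ms * max(output_tokens - 1, 0)


# Hypothetical numbers, for illustration only (not from the post).
AGGREGATED = {"ttft_ms": 300.0, "tpot_ms": 40.0}
# PD pays extra TTFT for the KV transfer but keeps TPOT lower under load.
DISAGGREGATED = {"ttft_ms": 400.0, "tpot_ms": 30.0}


def break_even_tokens(agg: dict, pd: dict):
    """Smallest output length at which PD's E2E latency is no worse than aggregated.

    Returns None when PD has no TPOT advantage, so it can never catch up.
    """
    extra_ttft = pd["ttft_ms"] - agg["ttft_ms"]
    tpot_saving = agg["tpot_ms"] - pd["tpot_ms"]
    if tpot_saving <= 0:
        return None
    return max(math.ceil(extra_ttft / tpot_saving) + 1, 1)


if __name__ == "__main__":
    print("break-even output tokens:", break_even_tokens(AGGREGATED, DISAGGREGATED))
    for n in (1, 100, 1000):
        agg = e2e_latency_ms(output_tokens=n, **AGGREGATED)
        pd = e2e_latency_ms(output_tokens=n, **DISAGGREGATED)
        print(f"{n:>5} tokens: aggregated={agg:.0f} ms, disaggregated={pd:.0f} ms")
```

With these illustrative inputs, PD is slower for very short generations (the TTFT penalty dominates), breaks even at 11 output tokens, and is roughly 25% faster end to end at 1000 tokens. Real break-even points depend on the measured TTFT and TPOT of a given deployment.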
-
Akshay Malik reposted this
Check out our work on native RL APIs for vLLM! Blog: https://lnkd.in/giq22pwP

Excited to share some of our work on improving vLLM for RL! A number of RL frameworks, including SkyRL, use vLLM for inference, and we’ve noticed some common problems:
- Weight syncing between training and inference is implemented in an ad-hoc fashion and duplicated across frameworks.
- Asynchronous RL is prone to break at scale, especially in P/D and DPEP deployments.
We’ve been working on improving both! For more details check out: https://lnkd.in/geWNvSav

𝗪𝗲𝗶𝗴𝗵𝘁 𝘀𝘆𝗻𝗰𝗶𝗻𝗴 𝗶𝗻 𝘃𝗟𝗟𝗠
Weight syncing with vLLM has typically been implemented with ad-hoc worker extensions and RPC endpoints. While this works, it leads to a few issues. Most frameworks care mainly about specifying the transport logic for how vLLM receives weights, but they also end up dealing with ad-hoc pre/postprocessing. Many frameworks also duplicate transport logic for popular strategies like NCCL and CUDA IPC, and reimplement the same optimizations (e.g. packed tensors). This also leads to version-locked implementations, because they reach into vLLM internals. We introduce native APIs for weight transfer: /init_weight_transfer_engine, /start_weight_update, /update_weights, and /finish_weight_update for different stages of weight transfer. Along with these endpoints comes a WeightTransferEngine abstraction that lets users specify custom transport logic for receiving weights. We provide NCCL and CUDA IPC implementations out of the box, but framework developers can bring their own. The APIs are simple, yet still allow for advanced use cases like sharded weight transfer from M trainer ranks to N inference ranks. See my prototype here: https://lnkd.in/gbZAhNik

𝗔𝘀𝘆𝗻𝗰𝗵𝗿𝗼𝗻𝗼𝘂𝘀 𝗥𝗟 𝘄𝗶𝘁𝗵 𝘃𝗟𝗟𝗠
Async RL is the default for reasoning and agentic RL, to maximize utilization with long-tailed trajectories. We’ve worked on upgrading async RL with vLLM:
- New pause mode to preserve requests in the scheduler: users don’t need to manually bookkeep requests.
- Deadlock fixes for DPEP: this one took many iterations! DPEP requires careful coordination between vLLM engines for generation, and we ensure the same with weight syncing!
The fixes have been tested at scale: Prime Intellect has validated async RL training for zai-org/GLM-5.1-FP8 in a P/D, DPEP32 deployment across 16 8xH200 nodes. It’s been great working with Sumanth R Hegde on this effort!
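The four weight-transfer endpoints named in the post suggest a staged lifecycle. The sketch below only illustrates one plausible call order (initialize once, then start, update, and finish for each sync). The endpoint paths come from the post. The payload fields, the `post` callable, and the function names `init_engine` and `sync_weights` are hypothetical and are not vLLM's actual request schema; consult the linked blog for the real one.

```python
from typing import Callable, Iterable, List

# Endpoint paths as named in the post. Everything else below is an assumption.
INIT = "/init_weight_transfer_engine"
START = "/start_weight_update"
UPDATE = "/update_weights"
FINISH = "/finish_weight_update"

# A caller-supplied function that sends one request, e.g. a thin HTTP wrapper.
Post = Callable[[str, dict], None]


def init_engine(post: Post, backend: str) -> None:
    """One-time setup of the transfer engine (the payload shape is hypothetical)."""
    post(INIT, {"backend": backend})


def sync_weights(post: Post, chunks: Iterable[List[str]]) -> None:
    """One weight sync: start, send each chunk of tensors, then finish.

    `chunks` is a hypothetical batching of parameter names; the actual tensor
    data would move over the configured transport, not through this call.
    """
    post(START, {})
    for names in chunks:
        post(UPDATE, {"names": names})
    post(FINISH, {})


if __name__ == "__main__":
    # A recording stub stands in for a real HTTP client so the sketch runs offline.
    log = []
    init_engine(lambda path, body: log.append(path), backend="nccl")
    sync_weights(lambda path, body: log.append(path), [["w0"], ["w1", "w2"]])
    print(log)
```

Passing the transport as a plain callable keeps the ordering logic testable without a running server, which is the only thing this sketch is meant to show.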
-
Akshay Malik shared this
The Ray Core and Ray Data teams at Anyscale are actively hiring systems engineers in Bengaluru! Ray Core serves as the cornerstone of the entire Ray ecosystem and powers libraries like Ray Train, Ray Data, and Ray Serve, which have been quickly adopted by companies like OpenAI, DeepSeek, Spotify, Uber, DoorDash, Pinterest, Apple, and many more. Ray Data takes this further as the scalable data processing engine behind the training and inference pipelines of some of the largest AI models in the world. In this role, you will play a pivotal part in shaping the future of Ray and Anyscale. Particularly in the context of the growing importance of open source and LLMs, you will be a crucial contributor to our strategic goal of establishing ourselves as the compute substrate of this unprecedented AI wave. If the prospect of tackling challenges like:
→ Scaling clusters to 10K+ nodes
→ Optimizing network transfer speed for petabyte-scale workloads
→ Building the data and execution infrastructure behind large multi-modal models
→ Pushing the limits of distributed training and inference
…excites you, I encourage you to apply or message me directly if you have any questions.
📍 Bengaluru
🔗 Open roles: https://lnkd.in/ga6R6bvJ
#Hiring #Bengaluru #DistributedSystems #RayCore #RayData #OpenSource #LLM #MLInfra
-
Akshay Malik reposted this
Today, we are officially making Anyscale Agent Skills Generally Available! Over the past year at Anyscale, I've watched the same pattern repeat: teams adopt AI coding agents, point them at Ray workloads, and hit the same walls, such as wrong GPU configs, stale APIs, and broken deploys. The agent writes confident code that fails at runtime. My team and I wanted to fix that at the source. Not by building another chatbot, but by encoding what our field engineering team has learned across hundreds of production Ray deployments directly into the tools developers already use. The part I'm most proud of: agents with these skills don't just generate code. They ask the right questions before writing a single line. They validate GPU memory constraints, use current Ray APIs, and produce configs pulled from tested templates, not hallucinated ones. Now, your agent actually understands how to build and operate Ray:
🔹 Workload Skills: Turn a single prompt into production-ready configs using current Ray APIs and validated templates.
🔹 Platform Skills: Read live logs and metrics from the Anyscale API to diagnose failures, patch code, and redeploy in the exact same conversation.
🔹 Infra Skills: Get guided, step-by-step help deploying Anyscale on Kubernetes or cloud VMs tailored to your specific environment.
In addition, Anyscale is launching a limited-access Optimization Services Program where agents paired with our engineers analyze throughput bottlenecks and GPU waste to generate tuning recommendations that help optimize the cost and performance of production AI workloads. Read the launch blog: https://lnkd.in/gDki6PMm
-
Akshay Malik reposted this
AI is not only shifting to bigger models. It is also shifting toward more complex data pipelines. To build AI models or search services with multimodal capabilities (e.g. vision language models, or VLMs, and search over both images and text), teams need to combine a mix of Python libraries and frameworks, along with a combination of CPU and GPU resources, to complete one end-to-end pipeline. While CPUs still play a critical role, GPU demand is increasing for data preprocessing. Preparing data now also requires GPUs, since embedding generation steps and tasks like image captioning and text summarization rely on AI models. This increased demand for GPUs makes workload orchestration more complex, as GPU capacity is limited and not always available in the same cloud or region when you need it. The industry most clearly driving the demands of this modern AI stack is physical AI, and it's cool to see it all happening in real time in teams like Multiply Labs, Bonsai Robotics, Physical Intelligence, and others pushing the boundaries of what infrastructure needs to look like for this new world. Link to their stories in the comments.
-
Akshay Malik reposted this
I recently had the opportunity to join a panel with Nebius and Anyscale to discuss a challenge every robotics team is facing: 𝗱𝗲𝘃𝗲𝗹𝗼𝗽𝗶𝗻𝗴 𝗿𝗲𝘀𝗶𝗹𝗶𝗲𝗻𝘁, 𝘀𝗰𝗮𝗹𝗮𝗯𝗹𝗲 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 𝗳𝗼𝗿 𝘁𝗵𝗲 𝗲𝗿𝗮 𝗼𝗳 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗺𝗼𝗱𝗲𝗹𝘀 𝗮𝗻𝗱 𝗽𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗔𝗜. Brittle infrastructure is one of the biggest bottlenecks to R&D velocity today, but at Multiply Labs, we’ve turned that hurdle into a competitive advantage. I supported the shift to a "one-click" deployment model, building the infrastructure-as-code pipelines that allow our team to deploy seamlessly across AWS, GCP, and Nebius. By partnering with Anyscale to create a fully portable, multi-cloud Ray environment, we’ve been able to drastically reduce the time it takes for a developer to spin up training software. Our training is now cloud-agnostic, letting us follow GPU capacity across clouds without any operational burden. If you’re scaling distributed training and thinking about how your infrastructure needs to evolve as robotics models get more capable, check out the full case study on how Multiply Labs built our multi-cloud foundation using Anyscale: https://lnkd.in/gS28uAXy
Multiply Labs Advances AI for Biologics Robotics on Anyscale
-
Akshay Malik reposted this
WideEP has become the industry standard for serving large MoE models like DeepSeek-V3. By distributing experts across a large number of GPUs, WideEP expands effective GPU memory for KV caches, enabling larger batch sizes and higher throughput. But WideEP introduces a critical production challenge: fault blast radius. Because the dispatch-combine collective requires all ranks to participate together, a single GPU failure can take down the entire DP/EP group. At a typical WideEP width of 32 GPUs, one failed GPU means 32 GPUs go dark, and so does your serving availability. Ray Serve LLM solves this with DP Group Fault Tolerance. Here's how it works 👉

WideEP Refresher
In DeepSeek-style MoE, the attention layer uses Multi-head Latent Attention (MLA). Unlike standard multi-head attention, where tensor parallelism can be applied across KV heads, MLA compresses the KV cache into a shared latent representation, making KV-head sharding incompatible. The common WideEP strategy is therefore to replicate the MLA layer across all participating ranks and to apply data parallelism (DP) at the request level. In sparse MoE LLMs, the linear layers consist of a collection of smaller linear layers, each representing an expert. For example, DeepSeek-V3 has 256 experts per MoE layer, of which only 8 are activated at a time. In WideEP deployments, the experts are spread across the participating ranks; this is known as expert parallelism (EP). Together, the replicated DP attention layer and the sharded EP MoEs form a DP/EP group.

Why DP Group is the Right Unit for Fault Tolerance
In WideEP deployments, partial DP/EP groups are not functional. Tokens processed on a certain DP attention rank may be routed to an expert living on a different EP rank. The control plane should never expose partial groups to traffic. We apply gang scheduling to meet this atomicity requirement within a DP group. Ray Serve, acting as the control plane, implements gang scheduling to ensure all ranks in a gang are scheduled, health-checked, torn down, and recovered together. This gives Ray Serve the right orchestration semantics for WideEP:
• All ranks in a DP group are scheduled together.
• A rank failure invalidates the DP group.
• The faulty DP group is torn down and recreated atomically.
• Ray Serve router continues sending traffic to other healthy DP groups.
This enables minimal downtime and an effective recovery mechanism for serving WideEP deployments in production. Shout out to the team who made this happen: Kourosh Hakhamaneshi, Abrar Sheikh, and Seiji Eicher! Check out the full writeup here: https://lnkd.in/gMqkmxxE.
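The group-level semantics listed in the post above (one rank failure invalidates the whole DP group, the group is recreated as a unit, and traffic keeps flowing to healthy groups) can be modeled in a few lines. This is a toy illustration only: the class and method names are hypothetical and it is not Ray Serve's implementation or API. The `experts_per_rank` helper additionally assumes experts are spread evenly across ranks, which the post does not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DPGroup:
    """A gang of ranks that is only usable when every rank is healthy."""
    group_id: int
    width: int
    healthy: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.healthy:
            self.healthy = [True] * self.width

    def is_servable(self) -> bool:
        # Partial groups are never exposed to traffic.
        return all(self.healthy)


class ToyControlPlane:
    """Hypothetical control plane tracking DP groups as atomic units."""

    def __init__(self, num_groups: int, width: int) -> None:
        self.groups = [DPGroup(i, width) for i in range(num_groups)]

    def report_rank_failure(self, group_id: int, rank: int) -> None:
        # A single failed rank invalidates its whole group.
        self.groups[group_id].healthy[rank] = False

    def routable_groups(self) -> List[int]:
        return [g.group_id for g in self.groups if g.is_servable()]

    def recover(self, group_id: int) -> None:
        # Tear down and recreate the entire group rather than patching one rank.
        width = self.groups[group_id].width
        self.groups[group_id] = DPGroup(group_id, width)


def experts_per_rank(num_experts: int, ep_width: int) -> int:
    """Experts hosted per rank, assuming an even split (an assumption)."""
    return num_experts // ep_width


if __name__ == "__main__":
    cp = ToyControlPlane(num_groups=3, width=32)
    cp.report_rank_failure(group_id=1, rank=5)
    print("routable after failure:", cp.routable_groups())
    cp.recover(group_id=1)
    print("routable after recovery:", cp.routable_groups())
    print("experts per rank:", experts_per_rank(256, 32))
```

Under the even-split assumption, the post's figures (256 experts, a width of 32) put 8 experts on each rank, which is why losing one rank leaves the rest of its group unable to serve tokens routed to those experts.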
-
Akshay Malik liked this
Introducing JetSpec: we find that speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting. JetSpec reaches up to 9.64× end-to-end speedup on MATH-500 and 4.58× on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200 GPU. ⚡️
Prior SD faces a dilemma:
1. AR-style draft heads preserve causality for quality, but drafting cost grows with tree depth.
2. Block-diffusion style heads draft cheaply in one pass, but branches are often scored independently, so deeper paths can become mutually inconsistent.
JetSpec achieves this speed by drafting a causality-preserving tree in a single pass. 🚀🌳
Check out our project page for demos and how we built it 👇 https://lnkd.in/gjg9rutV
📖 Paper: https://lnkd.in/gcGFWMTp
💻 Code: https://lnkd.in/gywd9vu2
-
Akshay Malik liked this
Promoted to Full Professor at UC San Diego. 😊 13 years ago, I almost dropped out of my PhD midway, believing I was not cut out for research. My family and my then advisors/mentors convinced me to continue, via their faith in me and by giving me the freedom to pursue my research interests while staying engaged with their insightful feedback. It took me 7 years (!) to finish my MS + PhD at University of Wisconsin-Madison. Longer than most students around me, but I had internalized a key message in Prof. David Patterson’s famous advice deck: “Concentrate on graduating as fast as possible? … To a person in their 40s or 50s, 1 or 2 more years is roundoff error (27 = 29)” So, in 2015 I convinced my advisors to fund me for an extra year so that I could finish one last paper for my thesis. That paper almost got rejected by SIGMOD’16, with sharply polarized reviews from 6 (!) reviewers (likely a record for DB venues). But it got accepted after a revision and ended up being a big part of my job talk, spurring new links between the DB and ML/AI worlds. Looking back now as a full professor at 38, perhaps that extra year turned out to be a reasonable trade after all. 😄 I am grateful for the last decade at UCSD being an absolute blast: fantastic students, amazing colleagues/mentors/friends, a thriving and inclusive campus community, and all that in an incredibly beautiful city! I’ve been enjoying working with and helping various parts of UCSD (CSE, HDSI, SDSC, QI/CalIT2, Public Health, Extension, LGBT Resource Center, oSTEM, STARS, MAP) via my service, teaching, mentoring, and/or research, by establishing new bridges with industry/startups/OSS communities, and via outreach to the wider SD/SoCal ecosystem. Looking forward to amping up on all fronts in the coming decades! 🥂
Finally, if you are reading this as a grad student or junior faculty doubting yourself, I have this to say: hold that doubt gently, balanced just enough to perhaps propel your self-growth but without crushing your self-confidence. You have my best wishes!
-
Akshay Malik liked this
We closed our Series F today at a $13B valuation. Our inference business grew 20x in the last year. I want to explain why: The growth comes from a shift I think is permanent: companies want to own their intelligence layer. Instead of relying exclusively on closed models, teams are post-training open models for their specific use cases. Customers like Abridge, Cursor, Decagon, Harvey, HubSpot, Lovable, Notion, OpenEvidence, and Parallel are building this way. But post-training is still more of an art than a science. That’s why we’ve been working hands-on with customers to build specialized models that match or exceed closed models on the tasks they care about. We provide not just the weights, but also the training recipes and tooling so that they're in charge of the continual learning process. I think more companies, both AI-natives and enterprises, will own their intelligence layer. And I’m excited to help build that future.
-
Akshay Malik liked this
Our tech report "Dissecting Model Behavior Using Agent Trajectories" on the work behind building the coding agent harness SSA is out at https://lnkd.in/gs2a2Pc7 We discuss how we designed SSA to be minimal and still work well across different frontier model families (Claude, GPT, Gemini, Qwen). We then analyzed over 100k agent trajectories to see how different models, even when they are neck-and-neck in accuracy, go about solving problems differently from each other. SSA is fully open-sourced at https://lnkd.in/gvuHAM3c, is simple to use, and comes with pre-packaged configs for benchmarking 21 models from different providers on SWE-Bench-Pro, SWE-Bench-Verified and Terminal-Bench-2. Was awesome working with Gaurav Gupta and big thanks to WEI XIA, Jun (Luke) Huan and Anoop Deoras!

Akshay Malik liked this
🚀 Excited to share Part II of our work: “Dissecting Model Behavior Using Agent Trajectories” https://lnkd.in/guXicF4N As we studied what makes agents perform well in real environments, one thing became clear: success rates alone do not tell the full story. In Part I, we showed that a simple intent–execution gap was enough to reach state-of-the-art results on popular agentic benchmarks. For Part II, we went deeper. We conducted a large-scale study across 21 models from diverse model-provider families and collected 138K high-quality agent trajectories. 🧭 By mapping these trajectories into code state spaces and extracting transient behavioral metrics, we found that models with similar pass@1 scores can behave very differently internally. In other words: two models may solve the same task at similar rates, but take fundamentally different paths to get there. These differences are often invisible in aggregate benchmark scores, but become measurable through trajectory-level analysis. This is joint work with amazing collaborators: Vatshank Chaturvedi, Jun (Luke) Huan, Anoop Deoras
-
Akshay Malik liked this
Together with the Google Kubernetes Engine (GKE) team at Google Cloud, we're announcing a major throughput and latency milestone for Ray Serve LLM. With architecture changes across the whole stack, Ray Serve is able to achieve up to 4.4x higher request throughput on prefill-heavy workloads and up to 24.8x on decode-heavy workloads vs. the pre-optimized baseline. Ray Serve LLM now matches vllm-router, a high-performance Rust-based routing framework, while keeping Ray's primitives for fault tolerance, observability, and portability across Kubernetes and VMs for distributed inference. Read more about the three optimizations we made in this blog: https://lnkd.in/gfaBesSn
-
Akshay Malik liked this
Very exciting work! Ray Serve LLM, which helps scale vLLM for distributed, disaggregated, and multi-replica inference, now offers 4.4x higher request throughput for prefill-heavy workloads compared to previous versions of Ray. Combined with its flexibility to easily support inference disaggregation, Ray Serve LLM is now a very competitive offering for large scale distributed inference. The team has implemented a ton of performance optimizations on the stack -- take a look at the blog and give it a try!
-
Akshay Malik liked this
Today, we enable AutoResearch in the physical world for the first time! Introducing ENPIRE: we give 8 Codex agents a fleet of robots, an allocation of GPUs, and a generous token budget. We set them free with a simple goal: solve the task as quickly as possible, keep the robots busy but stay safe, don't waste precious compute. Make no mistake. Then humans step aside and our watch begins. The robot fleet starts to come alive: they learn to look for visual clues, reset the scene, practice novel skills, tinker with the control stack, read papers online, debate, reflect, get stuck, and try again directly on the hardware. All we did was give Codex an API to the world of atoms, and the rest is emergence. ENPIRE is able to solve high-precision tasks like tying zip-ties, organizing fine pins, and installing GPUs all by itself. We also discovered a new type of "physical scaling": 8 robots exploring in parallel solve the task significantly faster than fewer ones. A part of our NVIDIA GEAR lab now self-improves tirelessly overnight. We just read the reports in the morning. /goal: we all take a holiday and Jensen wouldn't even notice ;) We will be open-sourcing everything, so you can host your self-running robot lab at home too! Project site and paper: https://lnkd.in/g3t4qS8Y
Experience & Education
-
Anyscale
Explore more posts
-
Bhuvan Purohit
1K followers
A 280-fold drop in AI inference cost isn't just a headline - it's a line in the sand for anyone building or deploying large language models. Last quarter, our team ran LLMs on $10k/month hardware. Today? That same workload costs less than $40. No tricks - just leveraging the latest quantization and hardware-optimized runtimes. The numbers still surprise me. What changed? Not the demand for accuracy or speed. But relentless optimization - in model architecture, in serving stacks, in chip design. And now, inference is no longer the bottleneck it was even six months ago. Here’s what this means if you’re working with AI/ML: - You can ship models with billions more parameters, at a fraction of yesterday’s price. - Experimentation cycles shrink. You test, learn, and iterate faster - before competitors can even provision their clusters. - Cost is no longer the gatekeeper for deploying language AI at scale. Creativity is. One client moved from proof-of-concept to 100k+ daily users. Their infra bill? Down 97% after refactoring for modern inference. Suddenly, features we shelved as "too expensive" are back on the roadmap. But here’s the tension: Massive cost drops mean every team has access to the same raw power. It’s not about who can afford to play - it’s about who can build something people actually want at scale. Is your team rethinking what’s possible now that inference isn’t the limiter? Would love to hear how these cost shifts are changing your roadmap or business model. #AI #MachineLearning #LargeLanguageModels #StartupLessons #TechLeadership
5
-
Pranabh Kumar Thaduri
Qualcomm • 13K followers
🧠 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘃𝘀 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 — 𝗪𝗵𝗮𝘁’𝘀 𝘁𝗵𝗲 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝗰𝗲? A lot of people say LLMs “reason.” But most of the time, they just 𝐢𝐧𝐟𝐞𝐫. 🔹 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 is like pattern solving. The model says: “I’ve seen this before — here’s what comes next.” It works well when things are clear and the task repeats. 🔹 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 goes deeper. The model connects ideas and picks a good answer. It’s needed when something new or tricky comes up. 💡 𝗪𝗵𝗲𝗻 𝘁𝗼 𝘂𝘀𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲: • Auto-completing responses (“Reset your password…”) • Summarizing meeting notes • Sorting emails into folders 💡 𝗪𝗵𝗲𝗻 𝘁𝗼 𝘂𝘀𝗲 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴: • Figuring out why a supply chain failed • Creating a new plan from lots of feedback • Complex coding with many steps 𝗧𝗵𝗶𝗻𝗸 𝗼𝗳 𝗶𝘁 𝗹𝗶𝗸𝗲 𝘁𝗵𝗶𝘀 👇 Inference = remembering answers you’ve seen Reasoning = solving new problems As AI grows, both skills matter. Inference gives answers. Reasoning brings judgment. #ArtificialIntelligence #MachineLearning #LLM #AIReasoning #AIforStudents #TechLeadership #DeepLearning #GenAI #DataScience
3
-
Supreet Deshpande
SynthioLabs • 6K followers
🚨𝗧𝗵𝗲 𝗥𝗲𝗮𝗹 𝗦𝘁𝗼𝗿𝘆 𝗕𝗲𝗵𝗶𝗻𝗱 “𝟭𝟬𝟬% 𝗼𝗻 𝗨𝗦𝗠𝗟𝗘”🚨 This weekend, our engineering team at SynthioLabs got curious. OpenEvidence announced they were the first AI to score 100% on the USMLE — big headline So we asked ourselves: could we do the same? 50+ runs. $1000+ in compute and a few coffees later — we finally got that one run that hit 100%. ✅ But that led to the obvious question: how did OpenEvidence do it? One of our engineers said, “𝘐 𝘣𝘦𝘵 𝘵𝘩𝘦𝘺 𝘤𝘢𝘯’𝘵 𝘳𝘦𝘱𝘳𝘰𝘥𝘶𝘤𝘦 𝘵𝘩𝘦 100% 𝘵𝘩𝘦𝘮𝘴𝘦𝘭𝘷𝘦𝘴” So we put it to the test. We went to OpenEvidence, asked it one of the very same USMLE questions… ❌ And it got it wrong; and that’s the real story. In healthcare, you don’t get to claim perfection when lives are at stake. Because here’s the risk: physicians see “100% on USMLE” and start trusting AI blindly for diagnosis. That’s dangerous. 👉 (Link in comments: proof with the tested question + wrong OE output) At SynthioLabs, our thesis is clear: AI for Life Sciences isn’t about acing exams — it’s about trust. 🔒 Safety & compliance — every response grounded in credible, validated, regulatory-approved sources 🩺 Medical Info — supporting healthcare stakeholders with on-demand, evidence-based answers when it matters most ⚠️ No diagnosis — because in medicine, responsibility matters more than buzz. 100% on USMLE is a parlor trick. The real work is building AI that’s safe, credible, and usable in practice. That’s the standard we hold ourselves to. 🙏 We greatly respect OpenEvidence and the mission they’re on. At the same time, it’s important to clarify these discrepancies. Would welcome a conversation with their team.
85
14 Comments -
Everett Berry
Clay • 17K followers
GTM Engineering has arrived. As covered in the NYT, Clay is investing in GTME, the first AI-native career, with $100m of new funding at a $3.1B valuation from Google's venture arm, CapitalG. Today there are: ▪️ 400+ open GTME roles ▪️ $160K median salaries ▪️ Clay Agencies earning millions in revenue ▪️ $50M+ in projected revenue for our data partners in 2025 Right now every company is rethinking their go to market, because GTM is one of the areas where AI is working the best. To help them do this, a huge ecosystem of GTMEs, data providers, Clay agencies, and LinkedIn posters 😉 has emerged. Instead of offering prescriptive, out of the box software, Clay is succeeding by giving users power and flexibility to design unique datapoints and plays that we call "GTM Alpha". This approach has appealed to me from the start, when I became a Clay user and customer in 2022. By treating GTM like a holistic product that gets designed and built versus a set of business processes, our customers are programming their GTM for the AI era. Links! ▪️ New York Times article - https://lnkd.in/eCH3bpQK ▪️ Clay announcement - https://lnkd.in/eNp5pS6j ▪️ GTM Engineering substack - https://thegtme.com/ Looking to hire a GTME or find a GTME job? We've got you covered ^^ Special shoutout to the GTMEs at Clay who have made this happen Osman S. Manny Adelstein Andrew Malacrea Ashley Artrip Alex Lindahl Sabrina Glaser Marat T. Daniel Johnson Tong-Tong Li and so many more. And of course shoutout to Kareem Amin and Varun Anand
309
28 Comments -
David Vydra
SAP • 3K followers
As we enter the agentic era of software development, I am now tracking the ratio of PMs to traditional SWEs and to Forward Deployed Engineers. SWE roles seem to be the most impacted with some teams considering 1-1 PM/SWE ratio in the next 12 months. FDE seems to be trending up. What trends do you see?
4
2 Comments