🎉 Pleased to share our paper published in Nature Portfolio's npj Digital Medicine. 🥳 We've developed a comprehensive framework called CREOLA (short for Clinical Review Of Large Language Models (LLMs) and AI). The framework was pioneered at TORTUS and takes a safety-first, science-led approach to LLMs in healthcare.

🔹 Key components of the CREOLA framework
- Error taxonomy
- Clinical safety assessment
- Iterative experimental structure

🔹 Error taxonomy
Hallucinations: text in the clinical document that is unsupported by the transcript of the clinical encounter
Omissions: clinically important text from the encounter that was not included in the clinical documentation

🔹 Clinical safety assessment
Our innovation incorporates accepted clinical hazard identification principles (based on NHS DCB0129 standards) to evaluate the potential harm of errors. We categorise errors as either 'major' or 'minor', where major errors can have downstream impact on the diagnosis or management of the patient if left uncorrected. This is further assessed with a risk matrix comprising:
- Risk severity (1 (minor) to 5 (catastrophic))
- Likelihood (very low to very high)

🔹 Iterative experimental structure
We share a methodical approach to comparing different prompts, models, and workflows: label errors, consolidate the review, evaluate clinical safety, and then make further adjustments and re-evaluate if necessary.

----------Method--------------
To demonstrate how to apply CREOLA to any LLM / AVT, we used GPT-4 (early 2024) as a case study.
🔹 We conduct one of the largest manual evaluations of LLM-generated clinical notes to date, analyzing 49,590 transcript sentences and 12,999 clinical note sentences across 18 experimental configurations.
🔹 Transcript-clinical note pairs are broken down to sentence level and annotated for errors by clinicians.

----------Results--------------
🔹 Of 12,999 sentences in 450 clinical notes, 191 sentences contained hallucinations (1.47%), of which 84 (44%) were major. Of the 49,590 sentences from our consultation transcripts, 1,712 were omitted (3.45%), of which 286 (16.7%) were classified as major and 1,426 (83.3%) as minor.
🔹 Hallucination types
- Fabrication (43%): completely invented information
- Negation (30%): contradicting clinical facts
- Contextual (17%): mixing unrelated topics
- Causality (10%): speculating on causes without evidence
🔹 Hallucinations, while less common than omissions, carry significantly more clinical risk. Negation hallucinations were the most concerning.
🔹 We CAN reduce or even abolish hallucinations and omissions by making prompt or model changes. In one experiment with GPT-4, we reduced the incidence of major hallucinations by 75%, major omissions by 58%, and minor omissions by 35% through prompt iteration.

Links in comments
Ellie Asgari Nina Montaña Brown Magda Dubois Saleh Khalil Jasmine Balloch Dr Dom Pimenta M.D.
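For readers who want to operationalise the taxonomy, here is a minimal sketch of how the major/minor classification and the severity-likelihood risk matrix described above could be represented in code. The class names, the major-error threshold, and the `risk_rating` helper are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    HALLUCINATION = "hallucination"   # text unsupported by the transcript
    OMISSION = "omission"             # clinically important text left out

class Likelihood(Enum):
    VERY_LOW = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5

@dataclass
class AnnotatedError:
    sentence: str
    error_type: ErrorType
    severity: int              # 1 (minor) .. 5 (catastrophic)
    likelihood: Likelihood

    @property
    def is_major(self) -> bool:
        # Major = could affect diagnosis or management if left uncorrected.
        # The numeric cut-off here is an assumption for illustration only.
        return self.severity >= 3

    def risk_rating(self) -> int:
        # Simple severity x likelihood score; real DCB0129 hazard logs use a
        # lookup matrix agreed with a clinical safety officer.
        return self.severity * self.likelihood.value
```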
AI In Performance Evaluations
Explore top LinkedIn content from expert professionals.
-
I grew up watching machines go rogue 🤖 Now I help companies stop that from happening in real life. 🦾

Growing up, I loved watching sci-fi movies. In the 90s, the theme was always the same: man creates a scientific marvel, man loses control over said marvel… cue the running, screaming, and inevitable bloodshed. As a kid, I lapped up those stories, which always hammered home one moral: humans messing with the laws of nature never ends well.

Fast forward to today, and I find myself advising companies on a very real version of that narrative: using AI in HR. With AI tools increasingly used to monitor performance and even flag employees for dismissal, the question isn't just "can we do this?" but "should we? And how do we do it fairly?". I recently shared my views on this topic with HRD Asia (link to article in the comments below).

In general, HR teams must get the following right:
🔹 Transparency: Employees should know how their performance is being assessed and what data is being used.
🔹 Human Oversight: AI should assist human judgment. It can never replace it. Accordingly, a meaningful review process is essential.
🔹 Vendor Accountability: Employers must understand how third-party tools work and ensure they don't produce biased outcomes.
🔹 Appeal Mechanisms: Employees need a way to challenge decisions influenced by AI.

👨‍⚖️ In my practice, I've already seen clients ask whether an AI-generated score is enough to justify dismissal. My answer? Not without human validation and a clear explanation of how the score was derived. Implementing a Human-In-The-Loop approach to any automated scoring tool also ensures that any employment decision is validated by an employee who can justify the AI-generated recommendation. This is especially important for decisions relating to summary dismissal, which carry significant legal risks such as wrongful dismissal claims.

While there is no hard and fast rule when it comes to determining the appropriate level of intervention, the key principle is that the reviewer must be able to understand how the AI arrived at its decision and must have the authority to override it if necessary. The review process should not be a mere formality or rubber-stamping exercise; it must serve as a meaningful check to ensure fairness and accountability.

As AI tools become increasingly popular in HR, the time to get familiar with the legal issues surrounding their use is now. Build internal safeguards, update your policies, and make sure your HR team understands the tools they're using. Because if those 90s sci-fi movies have taught us anything, it's that leaving machines to make human decisions rarely ends well.

Would love to hear how you are balancing AI efficiency with fairness, do share your thoughts below!

#AIinHR #WorkplaceFairness #SingaporeHR #HRCompliance #AIethics #HumanOversight #EmploymentLaw #SciFiMeetsReality
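To make the Human-In-The-Loop principle concrete, here is a minimal sketch of what a review gate in front of an automated scoring tool could look like. The class and function names, fields, and checks are illustrative assumptions, not a reference implementation of any particular vendor's product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AIScore:
    employee_id: str
    score: float          # model-predicted performance score
    explanation: str      # how the score was derived, shown to the reviewer

@dataclass
class ReviewDecision:
    approved: bool
    reviewer_id: str
    rationale: str        # the human reviewer's own justification

def decide_dismissal(ai: AIScore, review: Optional[ReviewDecision]) -> bool:
    """The AI score alone never decides: a human reviewer must have seen the
    explanation and recorded their own rationale before any action proceeds."""
    if not ai.explanation.strip():
        raise ValueError("AI score has no explanation; reviewer cannot validate it.")
    if review is None or not review.rationale.strip():
        raise ValueError("No meaningful human review recorded; decision blocked.")
    return review.approved   # the human decision governs, not the score
```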
-
Knowledge and expertise are human. Yet used well, AI can assist people in acquiring knowledge, transferring expertise from experienced seniors to juniors, and developing true organizational intelligence. The intent must be not just to capture and institutionalize knowledge, but to enable the flows of human-to-human knowledge that are at the heart of all expertise development and the foundation of a dynamic, flourishing organization.

This compact report provides a framework and distills some of the most useful approaches used by NASA Jet Propulsion Laboratory, Wärtsilä, Morgan Stanley, IBM, Siemens, Unilever, Bank of America, and Moderna for others to learn from.

The 8 techniques:
1️⃣ Knowledge Extraction and Codification
AI draws tacit expertise out of people through interviews, walkthroughs, and conversation, then structures it into searchable assets that capture both actions and underlying reasoning.
2️⃣ Iterative Expert Encoding
AI progressively absorbs experts' knowledge and decision patterns over time so non-specialists can be guided through complex decisions without needing constant direct expert input.
3️⃣ Conversational Knowledge Repository
AI makes organizational knowledge accessible in plain language by synthesizing information across documents, policies, past decisions, and expert outputs.
4️⃣ Knowledge Discovery
AI maps who knows what across the organization by analyzing signals such as work patterns and outputs, revealing expertise, risks, concentrations, and gaps.
5️⃣ Knowledge Routing
AI delivers the right expertise, content, or expert connection at the point of need without employees having to know where to look or whom to ask.
6️⃣ Augmented Mentoring and Tandem Learning
AI strengthens learning relationships by pairing more and less experienced employees, surfacing timely content, and making the exchange more productive.
7️⃣ Simulation and Experiential Practice
AI compresses the learning curve by creating realistic practice environments where people can build judgment, pattern recognition, and confidence before real-world consequences apply.
8️⃣ Expertise Extension
AI enables domain-adjacent employees to perform work that once required deeper specialist expertise, while still relying on human judgment and foundational knowledge.

Lots more useful content coming, follow to keep on the edge of how AI can amplify organizational success. 🙂
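As one concrete illustration, technique 3️⃣ is often built as retrieval-augmented generation: index the organisation's documents, retrieve the passages most relevant to a question, and have an LLM answer in plain language grounded in them. The sketch below is only an assumption of how that could look; the documents are placeholders, the TF-IDF retriever stands in for a production vector store, and `ask_llm` is a stub for whatever model API the organisation actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Policy: travel expenses above $500 require director approval.",
    "Post-mortem 2023-04: the outage was caused by an expired TLS certificate.",
    "Onboarding guide: new engineers get production access after week 2.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def ask_llm(prompt: str) -> str:
    # Placeholder: plug in your organisation's LLM API here.
    raise NotImplementedError

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    sims = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    top = sims.argsort()[::-1][:k]
    return [documents[i] for i in top]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)
```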
-
Current LLM safety training is surprisingly fragile: adversarial prompts, decoding tricks, and minimal finetuning can all bypass it. This paper explains why: safety alignment only teaches the model to start responses with refusals. If an attacker forces the model to begin with something else, like "Sure, here's how", the rest of the generation proceeds as if safety training never happened.

In this ICLR 2025 Outstanding Paper, the authors proposed two fixes. First, they augmented training data with synthetic examples where a harmful prompt is followed by a few harmful tokens, then pivots to a refusal, teaching the model to recover even mid-response. Second, they designed a finetuning loss that penalizes changes to the first few token positions while leaving later tokens free to adapt. This is meant for API providers offering finetuning services: users can still customize models for downstream tasks, but the safety-critical early tokens resist modification.

Read with Q&A on ChapterPal: https://lnkd.in/ekGQaFYr
Download PDF: https://lnkd.in/ezB-FkYP
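The second fix can be pictured as a position-weighted regularizer: during customer finetuning, keep the first few response tokens close to the aligned reference model and let later tokens move freely. Below is a minimal sketch of that idea, not the paper's exact objective; the weighting scheme, `k_protected`, and `beta` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def early_token_constrained_loss(policy_logits, ref_logits, labels,
                                 k_protected=5, beta=2.0, tail_weight=0.1):
    """Sketch of a finetuning objective that keeps the first few response tokens
    close to the aligned reference model while letting later tokens adapt.
    policy_logits, ref_logits: [batch, seq, vocab]
    labels: [batch, seq], with -100 marking prompt / padding positions.
    """
    # Standard next-token finetuning loss on response tokens only.
    sft = F.cross_entropy(policy_logits.transpose(1, 2), labels,
                          ignore_index=-100, reduction="mean")

    # Per-position KL(policy || reference), masked to response tokens.
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)        # [batch, seq]
    mask = (labels != -100).float()

    # The first k_protected response positions get a strong penalty,
    # later positions only a light one.
    pos_in_response = mask.cumsum(dim=1) - 1
    early = (pos_in_response < k_protected).float() * mask
    weights = beta * early + tail_weight * beta * (mask - early)

    penalty = (weights * kl).sum() / mask.sum().clamp(min=1)
    return sft + penalty
```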
-
Sam Altman called GPT-5 "the best model ever for health," saying it "could save a lot of lives." Yet Tina Hernandez-Boussard and colleagues write in Nature Medicine that while GPT-5 shows progress in reducing hallucinations, it still fails in over half of difficult clinical scenarios.

A striking example of persistent brittleness across advanced LLMs: when researchers modified MedQA questions by replacing correct answers with "None of the above," performance dropped sharply. In one case, models abandoned the correct conservative management of a newborn and recommended unnecessary surgery instead, suggesting reliance on pattern recognition rather than clinical reasoning. Meanwhile, medical disclaimers have nearly disappeared from AI outputs, down from 26% in 2022 to less than 1% today.

Their prescription: adversarial testing before deployment to probe for dangerous failures, restricting clinical applications to licensed professionals with full audit trails (not open chatbots), and hard-coded safety mechanisms that can't be bypassed by clever prompts. Soft safeguards that rely only on training can be circumvented; we may need infrastructure-level protections.

Paper: https://lnkd.in/e-48ijRt
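The MedQA stress test described above is easy to reproduce in spirit: swap the correct option for "None of the above" and check whether the model still reaches the right conclusion. The sketch below is an illustrative reconstruction, not the study's exact protocol; the item schema and function name are assumptions.

```python
import copy

def none_of_the_above_variant(item: dict) -> dict:
    """Replace the correct option's text with 'None of the above', so the
    originally correct answer no longer appears among the choices. A model
    that reasons (rather than pattern-matches) should now pick that option."""
    variant = copy.deepcopy(item)
    variant["options"][variant["answer"]] = "None of the above"
    return variant

# Example item in a MedQA-style schema (schema is an assumption):
item = {
    "question": "A newborn presents with ...; what is the next best step?",
    "options": {"A": "Observation", "B": "Immediate surgery",
                "C": "Antibiotics", "D": "Imaging"},
    "answer": "A",
}
perturbed = none_of_the_above_variant(item)
# The correct choice is still key "A", which now reads "None of the above".
```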
-
Not all tasks are created equal. At genAssess, I believe the future of hiring starts by rethinking how we understand work.

When I started building genAssess, I spent months speaking with recruiters, hiring managers, and HR leaders, all wrestling with the same challenge:
👉 How do we identify candidates who can actually use AI in their real job, not just talk about it?

That question led me to build the AAH framework, a way of classifying every task in a role as:
⚙️ Automatable – AI can handle it alone
🤝 Augmentable – humans + AI = best performance
🧠 Human-Only – requiring empathy, ethics, or creativity

This isn't theory; it's now live in genAssess.

📸 Screenshot below: a React Native Developer role breaks down as 50% Automatable, 50% Augmentable, 0% Human-Only. From this, genAssess automatically generates scenario-based assessments to test whether candidates are ready for the augmentable parts of the job, where AI + human teaming is key.

✅ Recruiters get signal, not noise.
✅ Hiring managers get confidence.
✅ Businesses get real ROI from AI adoption.

This is what it looks like to hire for the future of work.
🔗 Let's talk if you're rethinking how you assess talent for AI Readiness.
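genAssess's internals aren't public, but the AAH breakdown above can be pictured as a simple aggregation over labelled tasks. The sketch below is a hypothetical illustration: the task labels, percentages, and function name are placeholders, not genAssess output.

```python
from collections import Counter

AAH_LABELS = ("Automatable", "Augmentable", "Human-Only")

def aah_breakdown(task_labels: list[str]) -> dict[str, float]:
    """Share of a role's tasks in each AAH category, as percentages."""
    counts = Counter(task_labels)
    total = len(task_labels) or 1
    return {label: 100 * counts.get(label, 0) / total for label in AAH_LABELS}

# Hypothetical React Native Developer task labels (not real genAssess data):
labels = ["Automatable", "Augmentable", "Automatable", "Augmentable"]
print(aah_breakdown(labels))
# {'Automatable': 50.0, 'Augmentable': 50.0, 'Human-Only': 0.0}
```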
-
🛑 Most enterprises are using AI. Almost none have mapped it. And that's becoming a boardroom-level risk.

Because without system mapping, leaders are losing visibility over where AI is being used, how it works, and, most critically, who's accountable for it.

The University of Melbourne's Centre for Business Analytics just published a practical framework on "AI System Mapping" that every executive should pay attention to. Here's what it encourages you to track:

↳ Data
– What data trained the system?
– What inputs/outputs does it handle in production?
– Who can access the results?
↳ Models
– Who built it?
– Are pre-trained or third-party models involved?
– What testing happened before deployment?
↳ Infrastructure
– Where does it run (cloud/on-prem)?
– What are your failover and monitoring protocols?
↳ Business Usage
– Which workflows depend on it?
– Is it active or idle?
– How are updates rolled out?
↳ Access & Control
– Who can change the model, data, or outputs?
– Is there an approval and audit trail?
↳ Ownership & Value
– Who's accountable?
– Is there a business case tied to the system?
– What's the plan for development and ROI?

Most leaders I speak to are under pressure to ship fast. But shipping fast means nothing if you're shipping the wrong thing.

Let's fix that by building systems thinking and decision-making frameworks with Agentic AI Leadership OS: https://lnkd.in/gx3G6awq
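In practice the six dimensions map naturally onto one structured inventory record per AI system. The sketch below is an illustrative schema with field names of my own choosing; the framework itself does not prescribe this structure.

```python
from dataclasses import dataclass

@dataclass
class AISystemRecord:
    """One row of an AI system map, covering the six dimensions above."""
    name: str
    # Data
    training_data: str
    production_inputs_outputs: str
    result_access: list[str]
    # Models
    builder: str
    third_party_models: list[str]
    pre_deployment_testing: str
    # Infrastructure
    runtime: str                      # e.g. "cloud" or "on-prem"
    monitoring_and_failover: str
    # Business usage
    dependent_workflows: list[str]
    active: bool
    # Access & control
    change_approvers: list[str]
    audit_trail: bool
    # Ownership & value
    accountable_owner: str
    business_case: str
```

A register of such records gives the board the visibility the post argues is missing: who owns each system, what it touches, and who can change it.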
-
I recently wrote about a design/build workshop series I'm running for an AI Learning Design Assistant (ALDA). In today's post, I show how the ALDA template can be adjusted to address one of the thornier problems in Competency-Based Education (CBE): capturing and updating competencies as they emerge and evolve in real-world workplace use.

Many workplace tools are rapidly developing AI feature sets that capture, summarize, and enumerate the tasks in workplace activities as they happen. These documents can be collected in a repository that is linked to generative AI, creating a tool that experts can use to quickly tease out emerging and evolving skills. These can then be converted into skill descriptors, training, or other useful formats.

#AI #cbe #competencybasededucation #learningdesign #learninganddevelopment
Sean Gallagher, Ed.D. Naomi Boyer
https://lnkd.in/gFy8-AAV
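The pipeline in the second paragraph (task captures collected in a repository, queried through generative AI, output as draft skill descriptors) could be sketched roughly as below. This is speculative: `ask_llm` is a placeholder for whichever model the repository is wired to, and the prompt wording is my own.

```python
def extract_skill_descriptors(task_summaries: list[str], ask_llm) -> str:
    """Turn captured workplace task summaries into draft skill descriptors
    that a subject-matter expert then reviews and refines."""
    corpus = "\n".join(f"- {s}" for s in task_summaries)
    prompt = (
        "These are summaries of tasks observed in real workplace activity:\n"
        f"{corpus}\n\n"
        "Draft concise skill descriptors (skill name plus a one-sentence "
        "definition) for the emerging competencies these tasks suggest."
    )
    return ask_llm(prompt)   # placeholder for the repository's generative AI backend
```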
-
A major challenge businesses face as they grow is managing employee experience at scale, particularly when it comes to personalized professional development. Yet it is critical, because this process helps develop the next set of leaders who can take the organization to the next level.

For this, the India team at Deloitte designed a Generative AI solution to help a leading ecommerce giant chart out a growth and development plan for their leaders. To do so, we generated actionable summaries based on employee feedback to managers and leaders.

Prior to adopting the GenAI-driven methodology, the HR team manually processed feedback that came from disparate sources. The sheer volume and manual nature of the analysis did not allow in-depth, personalized feedback for different leaders and was prone to biases.

We leveraged Generative AI by using the large language model Anthropic Claude 2 in a cloud environment to summarize tabular data (engagement, satisfaction, and manager effectiveness scores) and textual feedback (verbatim comments). It provided actionable summaries and insights at the organizational and individual levels.

Top improvements the client saw after implementing the GenAI solution were:
- Customized reports for leaders, summarizing overall feedback, strengths, weaknesses, and suggestions for improvement
- Analysis of year-over-year changes in performance scores to identify top and bottom performers
- A growth and development plan for managers with key action steps to improve their focus areas

This solution is even more significant given the complexity of a wide range of stakeholder interests, leadership traits, and training options, as well as the opportunity to scale with the company.
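The implementation details aren't published, but the core step (turning per-leader scores and verbatim comments into an actionable summary) might look roughly like the sketch below. The prompt, schema, and `call_model` placeholder are assumptions for illustration, not Deloitte's code.

```python
import json

def summarize_leader_feedback(leader: str, scores: dict,
                              comments: list[str], call_model) -> str:
    """Combine tabular scores and verbatim comments into one prompt and ask the
    model for an actionable, per-leader development summary."""
    prompt = (
        f"Leader: {leader}\n"
        f"Engagement / satisfaction / manager-effectiveness scores:\n"
        f"{json.dumps(scores, indent=2)}\n"
        "Verbatim feedback:\n" + "\n".join(f"- {c}" for c in comments) + "\n\n"
        "Summarize strengths, weaknesses, and 3 concrete development actions."
    )
    return call_model(prompt)   # placeholder for the Claude 2 call in the cloud environment

# Hypothetical usage:
# summary = summarize_leader_feedback(
#     "Leader A",
#     {"engagement": 4.1, "satisfaction": 3.8, "manager_effectiveness": 3.5},
#     ["Great at unblocking the team", "Could delegate more"],
#     call_model=my_llm_client,
# )
```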
-
Here's how I use AI to bootstrap a Wardley Map with capabilities, or at least get to a solid starting point. The *hard* work starts after this!

1. It starts with a prompt. I frame capabilities using "the ability to [blank]" and use GPT to break them down into sub-capabilities in JSON. (I built a tiny front-end for this, but it's totally optional.) Example: "Buy lunch for team" → breaks down into planning, sourcing ingredients, managing preferences, etc.

2. I then pull these into Obsidian, my tool of choice, to visualize and view the relationships.

3. Next, I run a second prompt to place each capability on the Y-axis (how close it is to the customer), using roles as a proxy: ops leaders, org designers, engineers, infra teams, etc. This helps with vertical positioning in the value chain. Tip: I always ask the model to explain why it placed something a certain way. It helps with tuning and building trust in the output.

4. Then I add richness: I use another prompt to identify relationships between capabilities, either functional similarity or one enabling another. These are returned in structured JSON. Think: "Analyze data insights" ↔ "Trend analysis" → Similar. This helps expand our graph.

5. To tie it all together: I feed the data into NetworkX (Python) to analyze clusters, kind of like social network graph analysis. The result? Capabilities grouped by both level and cluster. (A minimal sketch of this step follows below.)

6. The final output is a canvas in Obsidian: grouped, leveled, and linked. It's a decent kickoff point. From here, I'll nerd out and go deep on the space I'm exploring.

This isn't a polished map. It's a starting point for thinking, not a final artifact. If you're using LLMs for systems thinking or capability modeling, I'd love to hear your process too.
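For step 5, a minimal NetworkX sketch might look like the following. The capability names and relationship shape are placeholders matching the examples above, and the community-detection algorithm is my own choice, not necessarily the author's.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Relationships as they might come back from the step-4 prompt (placeholder data).
relationships = [
    {"source": "Analyze data insights", "target": "Trend analysis", "type": "similar"},
    {"source": "Sourcing ingredients", "target": "Buy lunch for team", "type": "enables"},
    {"source": "Managing preferences", "target": "Buy lunch for team", "type": "enables"},
]

G = nx.Graph()
for rel in relationships:
    G.add_edge(rel["source"], rel["target"], kind=rel["type"])

# Cluster capabilities, much like social-network community detection.
clusters = greedy_modularity_communities(G)
for i, cluster in enumerate(clusters, start=1):
    print(f"Cluster {i}: {sorted(cluster)}")
```

Each cluster can then be laid out as a group on the Obsidian canvas, with the Y-axis position from step 3 giving the vertical placement.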