Performance Optimization Techniques

Explore top LinkedIn content from expert professionals.

Andrew Ng

DeepLearning.AI, AI Fund and AI Aspire

2,581,420 followers 2y
Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains.

You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

Here’s code intended for task X: [previously generated code]
Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it.

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions.

And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about reflection, I recommend:
- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
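The generate, critique, rewrite cycle described in this post can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: `LLM` stands for any hypothetical function that maps a prompt to a reply, `reflect` and `fake_llm` are made-up names, and the rewrite prompt wording is an assumption (only the critique prompt is taken from the post).

```python
from typing import Callable

# Hypothetical stand-in: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]

# Critique prompt, adapted from the wording quoted in the post.
CRITIQUE_PROMPT = (
    "Here's code intended for task {task}:\n{code}\n"
    "Check the code carefully for correctness, style, and efficiency, "
    "and give constructive criticism for how to improve it."
)
# Rewrite prompt: wording is an assumption made for this sketch.
REWRITE_PROMPT = (
    "Here's code intended for task {task}:\n{code}\n"
    "Here is feedback on that code:\n{feedback}\n"
    "Use the feedback to rewrite the code."
)


def reflect(llm: LLM, task: str, rounds: int = 2) -> str:
    """Generate code for `task`, then run `rounds` critique/rewrite passes."""
    code = llm(f"Write code to carry out this task: {task}")
    for _ in range(rounds):
        feedback = llm(CRITIQUE_PROMPT.format(task=task, code=code))
        code = llm(REWRITE_PROMPT.format(task=task, code=code, feedback=feedback))
    return code


def fake_llm(prompt: str) -> str:
    """Toy model so the sketch runs without an API key; replies are canned."""
    if prompt.startswith("Write code"):
        return "def add(a, b): return a - b"  # deliberately buggy first draft
    if "give constructive criticism" in prompt:
        return "The function subtracts; it should add."
    return "def add(a, b): return a + b"


if __name__ == "__main__":
    print(reflect(fake_llm, "add two numbers"))
```

Swapping `fake_llm` for a real client call is the only change needed to try this against an actual model; the two-agent variant mentioned above would simply pass different callables for the generation and critique steps.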

One Agent For Many Worlds, Cross-Species Cell Embeddings, and more deeplearning.ai

Sebastian Raschka, PhD

ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

255,038 followers 1y
I just published a new tutorial article explaining how KV caching works in LLMs, both conceptually and in code, with a clean, from-scratch implementation. It's one of the key techniques for efficient LLM inference.

While recovering from an injury and taking a break from more research-heavy writing in the last few weeks, I wanted to share this practical guide on a topic many readers asked about (and one I deliberately left out of the Build a Large Language Model From Scratch book due to its added complexity).

In this tutorial, I walk through:
1. Why LLMs recompute attention weights inefficiently during generation
2. How a KV cache avoids that by storing key/value vectors for reuse
3. A side-by-side walkthrough of inference with and without caching
4. Step-by-step code changes to implement caching in a readable way
5. Performance comparison and key optimizations (like preallocation and sliding windows)

Even with a tiny 124M parameter model, enabling KV caching led to a substantial speed-up in generation.

🔗 Full tutorial: https://lnkd.in/g-vYFVTa

Happy reading, and as always, feel free to share feedback or questions!
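To make the idea of reusing key/value vectors concrete, here is a toy single-head attention step in plain Python. It is a sketch under stated assumptions and is not taken from the linked tutorial: the 2x2 projection matrices `W_Q`, `W_K`, `W_V` hold made-up values, and `KVCache` and `last_token_no_cache` are hypothetical names. It shows that projecting only the newest token and appending to a cache gives the same output as recomputing keys and values for the whole prefix.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


# Toy projection matrices (made-up values, for illustration only).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, 1.0]]


def last_token_no_cache(tokens):
    """Without a cache: re-project every token in the prefix on each call."""
    keys = [matvec(W_K, t) for t in tokens]
    values = [matvec(W_V, t) for t in tokens]
    return attend(matvec(W_Q, tokens[-1]), keys, values)


class KVCache:
    """With a cache: project only the new token and reuse earlier keys/values."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        self.keys.append(matvec(W_K, token))
        self.values.append(matvec(W_V, token))
        return attend(matvec(W_Q, token), self.keys, self.values)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    cache = KVCache()
    for i, tok in enumerate(tokens):
        print(cache.step(tok), last_token_no_cache(tokens[: i + 1]))
```

The uncached path performs a number of key/value projections that grows with the prefix length at every step, while the cached path performs one per step; the price is the memory held in `keys` and `values`, which is what optimizations such as preallocation and sliding windows address.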

Understanding and Coding the KV Cache in LLMs from Scratch magazine.sebastianraschka.com

Ethan Evans

Former Amazon VP, sharing how I succeeded so that you can too. Outperform, out-compete, and still get time off for yourself.

175,054 followers 2y
I've recently suffered a major career setback. Since I teach about high performance and career growth, I want to share how I am addressing it. One day you will need this recipe yourself! My goal in my current "career" is to reach as many people as I can, and to help them achieve career success and satisfaction. For the last three years, the way to do this has been through LinkedIn. Unfortunately, LinkedIn recently made some unknown changes to their algorithm. Other Top Voices and I have noticed a drop of 70% to 80% in the reach of our posts. Since my goal is to share my knowledge with more people, that means my goal just took an 80% hit. In general, setbacks in performance are either due to: A) Something we did Or B) Something external, outside our direct control Mistakes, poor decisions, and missed deadlines are examples of A. They are in our control. Things like Covid, high interest rates, and reorganizations at work are examples of B, outside our control. LinkedIn's change is also case B, outside my control. When a setback comes from something in your control, you know clearly what you did wrong and what you need to change to restore your performance and progress. Fixing your own issues may take time and be difficult, but you know what to do. When the setback is due to something outside your control, you do not know how to fix the issue. So, how can we react when our performance is shattered and we do not know why? Here is my recipe: 1. Allow yourself a fixed amount of time to grieve (and complain if you wish). Emotions are real, and before you can move on you will need to sit with those emotions. But, do not get stuck in them. Curse your bad luck, pout for a minute, etc. Then, move to the next step. 2. Refocus on your core value. Whatever happened, go back to how you define high performance to ensure it is still relevant. I admit, I slipped into defining my own performance by how many people viewed my LinkedIn posts. This was a mistake. 
My mission is to help others, so getting views is a proxy, not a result. And, using LinkedIn is just a method for the mission, not the mission itself. 3. Adapt your core value if you must (if its value has decreased). In my case, the value of what I offer hasn't changed, the external delivery system has. 4. Once you adapt and/or increase your value, find new ways to deliver it if necessary. Luckily, I have other options for reaching people: my Substack newsletter, YouTube, etc. Since Substack has been such a good partner recently, I will start there. I have also refocused how I write on LinkedIn to make every post focused on my goal. 5. Test, measure, adapt, repeat! Really, this step is everything. Once you get past the grief, jump into action in this loop. Nothing can stop you if you keep working to refine, deliver, and showcase your core value. Comments? Here's my newsletter, which is my next area of investment: https://lnkd.in/gXh2pdK2

Nishant Kumar

Data Engineer @ IBM | Data & AI | Python | SQL | PySpark | Apache Spark | Apache Kafka | AWS | Delta Lake | Airflow | Amazon Bedrock | LangChain | GenAI | RAG

118,937 followers 1y
This PySpark job was running for 2 hours. I brought it down to 15 minutes. And no, I didn’t just throw more clusters at it. Here’s what really made the difference.

Context: We had a pipeline processing millions of rows: complex joins, multiple transformations, and writing to S3. Every day, it was eating up ~2 hours and slowing down downstream processes.

What I did:
𝐒𝐭𝐞𝐩 1: Avoided shuffles wherever possible → Rewrote wide transformations like groupBy and join using efficient partitioning strategies
𝐒𝐭𝐞𝐩 2: Broadcast Joins → Replaced regular joins with broadcast joins for smaller dimension tables. Saved huge shuffle time
𝐒𝐭𝐞𝐩 3: Used .select() smartly → Trimmed down the DataFrame early. No need to carry unused columns throughout
𝐒𝐭𝐞𝐩 4: Cached intermediate DataFrames → Especially after expensive operations used multiple times
𝐒𝐭𝐞𝐩 5: Repartitioned before write → Controlled file sizes for optimized parallel writes to S3

Result?
- From 2 hours → 15 minutes
- Same data, same cluster, smarter code
- That’s the power of PySpark when used right

Have you faced performance issues in Spark jobs too? Drop a “Yes” and I’ll share my performance tuning checklist.

💡 𝐏𝐫𝐞𝐩𝐚𝐫𝐞 𝐟𝐨𝐫 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰: https://lnkd.in/gUEVYCGy
𝐉𝐨𝐢𝐧 𝐦𝐞: https://lnkd.in/giE3e9yH

#DataEngineering #PySpark #PerformanceTuning #AWS
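Why a broadcast join saves shuffle time (Step 2) is easier to see outside Spark. The plain-Python sketch below is an illustration of the idea only, not PySpark code and not the author's pipeline: `broadcast_join`, the `(customer_id, amount)` row shape, and the sample data are all hypothetical. Each partition of the large table is joined locally against a full copy of the small table, so no large-table row has to move between partitions.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, float]  # hypothetical fact-table row: (customer_id, amount)
Joined = Tuple[int, float, str]


def broadcast_join(partitions: List[List[Row]], dim: Dict[int, str]) -> List[List[Joined]]:
    """Inner-join every partition against a small lookup table, partition by partition.

    `dim` plays the role of the broadcast copy: every partition reads the same
    dict, so fact rows stay where they are and nothing is shuffled by key.
    """
    return [
        [(cid, amount, dim[cid]) for cid, amount in part if cid in dim]
        for part in partitions
    ]


if __name__ == "__main__":
    facts = [[(1, 10.0), (2, 5.0)], [(3, 7.5), (1, 2.5)]]  # two partitions
    regions = {1: "EU", 2: "US"}  # small dimension table
    print(broadcast_join(facts, regions))
```

In PySpark itself the same intent is usually expressed with the `broadcast` hint from `pyspark.sql.functions`, for example `facts.join(broadcast(dims), "customer_id")`. The approach only pays off while the dimension table is small enough to fit in each executor's memory.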
Brij Kishore Pandey

AI Architect & AI Engineer | Building Agentic Systems & Scalable AI Solutions

735,832 followers 1y
A sluggish API isn't just a technical hiccup – it's the difference between retaining and losing users to competitors. Let me share some battle-tested strategies that have helped many achieve 10x performance improvements:

1. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗖𝗮𝗰𝗵𝗶𝗻𝗴 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆
Not just any caching – but strategic implementation. Think Redis or Memcached for frequently accessed data. The key is identifying what to cache and for how long. We've seen response times drop from seconds to milliseconds by implementing smart cache invalidation patterns and cache-aside strategies.

2. 𝗦𝗺𝗮𝗿𝘁 𝗣𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻
Large datasets need careful handling. Whether you're using cursor-based or offset pagination, the secret lies in optimizing page sizes and implementing infinite scroll efficiently. Pro tip: Always include total count and metadata in your pagination response for better frontend handling.

3. 𝗝𝗦𝗢𝗡 𝗦𝗲𝗿𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻
This is often overlooked, but crucial. Using efficient serializers (like MessagePack or Protocol Buffers as alternatives), removing unnecessary fields, and implementing partial response patterns can significantly reduce payload size. I've seen API response sizes shrink by 60% through careful serialization optimization.

4. 𝗧𝗵𝗲 𝗡+𝟭 𝗤𝘂𝗲𝗿𝘆 𝗞𝗶𝗹𝗹𝗲𝗿
This is the silent performance killer in many APIs. Using eager loading, implementing GraphQL for flexible data fetching, or utilizing batch loading techniques (like DataLoader pattern) can transform your API's database interaction patterns.

5. 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲𝘀
GZIP or Brotli compression isn't just about smaller payloads – it's about finding the right balance between CPU usage and transfer size. Modern compression algorithms can reduce payload size by up to 70% with minimal CPU overhead.

6. 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻 𝗣𝗼𝗼𝗹
A well-configured connection pool is your API's best friend. Whether it's database connections or HTTP clients, maintaining an optimal pool size based on your infrastructure capabilities can prevent connection bottlenecks and reduce latency spikes.

7. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗟𝗼𝗮𝗱 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻
Beyond simple round-robin – implement adaptive load balancing that considers server health, current load, and geographical proximity. Tools like Kubernetes horizontal pod autoscaling can help automatically adjust resources based on real-time demand.

In my experience, implementing these techniques reduces average response times from 800ms to under 100ms and helps handle 10x more traffic with the same infrastructure.

Which of these techniques made the most significant impact on your API optimization journey?
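The cache-aside strategy from point 1 can be sketched as follows. This is a minimal in-process illustration under stated assumptions, not a Redis or Memcached client: `CacheAside`, `loader`, and `ttl_seconds` are hypothetical names, and a plain dict stands in for the cache server. The caller asks the cache first, falls back to the slow source on a miss or an expired entry, and stores the result with an expiry time.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Cache-aside with a time-to-live: check the cache, else load and store."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # the slow source of truth, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: skip the slow source entirely
        value = self._loader(key)  # miss or expired: go to the source
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry explicitly, e.g. right after the underlying row is updated."""
        self._store.pop(key, None)


if __name__ == "__main__":
    cache = CacheAside(loader=lambda k: f"profile for {k}", ttl_seconds=30)
    print(cache.get("user:42"))  # loads from the source
    print(cache.get("user:42"))  # served from the cache
```

Choosing `ttl_seconds` is the "for how long" decision mentioned above: a short TTL limits stale reads, while explicit `invalidate` calls on writes allow a longer TTL without serving outdated data.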
Abdulrahman Albusis

Executive Manager | Expert in Concrete Repair, Structural Strengthening, Coring & Injection | Driving Excellence at Petra Scan | Qualified from Dubai Municipality.

2,089 followers 1y
🔧 Slab Strengthening Using Thickening (Top-Up & Drop Panels) In structural retrofitting, slab thickening—whether by top-up slabs or drop panels—is a highly effective method to enhance load capacity. I’ve led multiple projects where this technique played a key role in restoring and upgrading structural performance. ✅ Critical success factors in implementation: 📐 Design compliance: Ensure the additional concrete layer is engineered for the required structural behavior—especially in shear and flexure. 🧱 Drilling & Anchoring: Follow exact drilling depths and spacing per design, and use the correct epoxy or mechanical anchors to ensure a reliable connection between the old and new slab. 🧽 Surface Preparation: The substrate must be roughened, cleaned of dust and laitance, and properly primed with bonding agents to guarantee monolithic behavior. 🚧 Casting & Pumping Techniques: Form-and-Pour: Ideal when you have clear access to the casting area. After setting up formwork under the slab, concrete is poured by gravity from above. This is straightforward but requires enough workspace and headroom. Form-and-Pump: Used when access is limited or when pouring from the top isn’t possible. Concrete is pumped into the formwork under pressure—especially suitable for drop panels or soffit strengthening. Requires skilled coordination to avoid segregation or voids. 🛠️ As a Project Manager, I’ve successfully delivered a wide range of strengthening projects using both techniques. The difference always comes down to detailed planning, execution discipline, and clear understanding of site constraints. #SlabStrengthening #ConcreteRepair #StructuralRehabilitation #FormAndPump #FormAndPour #ProjectManagement #EngineeringExecution #RetrofitSolutions
Rahul Agarwal

Staff ML Engineer | Meta, Roku, Walmart | 1:1 @ topmate.io/MLwhiz

46,118 followers 1y
A Few Lessons from Deploying and Using LLMs in Production

Deploying LLMs can feel like hiring a hyperactive genius intern: they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered:

1. “Cheap” is a Lie You Tell Yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
- Cache repetitive queries: Users ask the same thing at least 100x/day
- Gatekeep: Use cheap classifiers (BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%.
- Quantize your models: Shrink LLMs to run on cheaper hardware without massive accuracy drops
- Asynchronously build your caches: Pre-generate common responses before they’re requested, or gracefully fail the first time a query comes in and cache the response for the next time.

2. Guard Against Model Hallucinations: Sometimes, models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
- Use RAG: Just a fancy way of saying to provide your model the knowledge it requires in the prompt itself by querying some database based on semantic matches with the query.
- Guardrails: Validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response.

3. The best LLM is often a discriminative model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data and then train a smaller, discriminative model that performs similarly at a much lower cost.

4. It's not about the model, it is about the data on which it is trained: A smaller LLM might struggle with specialized domain data, and that’s normal. Fine-tune your model on your specific data set by starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training.

5. Prompts are the new Features: Prompts are the new features in your system. Version them, run A/B tests, and continuously refine using online experiments. Consider bandit algorithms to automatically promote the best-performing variants.

What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!
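The first two cost fixes (cache repetitive queries, gatekeep with a cheap classifier) compose naturally into a small router. The sketch below is one possible arrangement, not the author's system: `Router`, `is_easy`, `cheap_system`, and `llm` are hypothetical names, and in practice `is_easy` would be a trained classifier such as the BERT model mentioned above rather than the word-count rule used in the demo.

```python
from typing import Callable, Dict


class Router:
    """Answer from cache if possible, else from a cheap system, else from the LLM."""

    def __init__(
        self,
        is_easy: Callable[[str], bool],      # cheap classifier (hypothetical)
        cheap_system: Callable[[str], str],  # existing non-LLM path (hypothetical)
        llm: Callable[[str], str],           # expensive model call (hypothetical)
    ) -> None:
        self.is_easy = is_easy
        self.cheap_system = cheap_system
        self.llm = llm
        self.cache: Dict[str, str] = {}
        self.llm_calls = 0  # tracked so the cost effect is visible

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different repeats share an entry.
        return " ".join(query.lower().split())

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self.cache:
            return self.cache[key]
        if self.is_easy(query):
            reply = self.cheap_system(query)
        else:
            reply = self.llm(query)
            self.llm_calls += 1
        self.cache[key] = reply
        return reply


if __name__ == "__main__":
    router = Router(
        is_easy=lambda q: len(q.split()) <= 3,  # toy rule standing in for a classifier
        cheap_system=lambda q: "See the FAQ.",
        llm=lambda q: "Detailed LLM answer.",
    )
    for q in ["Reset password", "Why does my export fail on Tuesdays?", "reset  PASSWORD"]:
        print(q, "->", router.answer(q))
    print("LLM calls:", router.llm_calls)
```

Exact-match caching on a normalized string only catches near-identical repeats; matching paraphrases would need a semantic lookup, which is outside this sketch.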

Sahar Mor

I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

42,514 followers 2y
In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques. Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y In this AI Tidbits Deep Dive, I outline six of the best and recent prompting methods: (1) EmotionPrompt - inspired by human psychology, this method utilizes emotional stimuli in prompts to gain performance enhancements (2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction that improved LLMs’ performance by 9%. (3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy (4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details prior to querying the LLM (5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning (6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential. Full blog post https://lnkd.in/g7_6eP6y
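As one concrete example, the Rephrase and Respond idea in item (6) can be approximated with two model calls: one to rephrase the query and one to answer the rephrased version. The sketch below is a loose illustration under stated assumptions; the prompt wording is invented here and is not quoted from the paper, and `llm` is any hypothetical function that maps a prompt to a reply.

```python
from typing import Callable, Tuple


def rephrase_and_respond(llm: Callable[[str], str], question: str) -> Tuple[str, str]:
    """Two-step prompting: rephrase the question, then answer the rephrased form."""
    # Step 1: let the model restate the question in its own, less ambiguous words.
    # (Prompt wording is an assumption made for this sketch.)
    rephrased = llm(
        "Rephrase and expand the following question so that it is unambiguous. "
        "Return only the rephrased question.\n\n" + question
    )
    # Step 2: answer, with both versions of the question in the context.
    answer = llm(
        f"Original question: {question}\n"
        f"Rephrased question: {rephrased}\n"
        "Answer the rephrased question."
    )
    return rephrased, answer


if __name__ == "__main__":
    def echo_llm(prompt: str) -> str:
        """Toy model so the sketch runs offline; it just reports the prompt length."""
        return f"(reply to a {len(prompt)}-character prompt)"

    print(rephrase_and_respond(echo_llm, "Was the meeting moved?"))
```

Whether the extra call is worth its latency and cost depends on the application, so it is worth measuring against a single-call baseline before adopting it.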
Nikki Siapno

Eng Manager | ex-Canva | 450k+ audience | Helping you become a great engineer and leader

231,278 followers 1y
10 Must-know best practices for optimizing API endpoints:

Optimizing API endpoints is critical for achieving optimal performance in robust, scalable, and user-friendly applications. By following best practices, we can significantly enhance performance, strengthen security, and improve user and developer experience of APIs.

Let's look at 10 core best practices for optimizing API endpoints:

𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗯𝗲𝘀𝘁 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀:

🔹 Optimize SQL queries
Ensure your queries are performant. Use query execution plans to identify bottlenecks. Optimize and implement caching for frequent queries to minimize database load.

🔹 Cache
Store frequently requested data at the client, server, or CDN level using caching headers or tools like Redis. This reduces response time and lightens backend load. Be mindful of stale data and implement cache invalidation strategies.

🔹 Payload optimization
Compress large responses with Gzip, remove unnecessary fields from payloads, and use efficient formats like JSON for faster data transmission. Keep payloads lightweight, but don’t compromise on essential details for the client.

🔹 Pagination
Break large datasets into smaller chunks with tools like limit and offset parameters. This improves performance and avoids crashing clients with oversized responses. Combine with cursors for better consistency in real-time data.

🔹 Asynchronous processing
For time-intensive operations like file uploads or report generation, use background jobs with tools like RabbitMQ or Celery to keep APIs responsive. Return task IDs so clients can check the operation's status.

𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆 𝗯𝗲𝘀𝘁 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀:

🔹 Rate limiting and throttling
Set limits on requests per user or client to prevent abuse, avoid server overload, and ensure consistent performance during traffic spikes. Customize thresholds based on endpoint sensitivity.

🔹 Input validation and sanitization
Validate and sanitize all user inputs to protect against injection attacks (e.g., SQL injection, XSS) and ensure data integrity.

🔹 Monitoring and logging
Track API metrics like response times, error rates, and usage patterns using tools like Datadog or New Relic. Comprehensive logs simplify debugging and help predict scaling needs. Regularly review logs to identify trends or anomalies. This is also important to identify performance bottlenecks.

🔹 Authentication and authorization
Implement robust mechanisms like OAuth2, API keys, or JWT to ensure secure access and restrict resource usage to authorized users.

🔹 Encrypting data in transit
Use HTTPS to secure data exchanges between clients and servers, ensuring sensitive information remains protected from interception.

💬 What’s your favorite API optimization tip? 💭

~~
P.S. If you like this post, then you'll love our newsletter. Subscribe here: https://lnkd.in/giQj3Z44
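The pagination advice above (a limit, plus cursors for consistency) can be sketched with an in-memory list standing in for a table. Everything here is hypothetical and for illustration only: `list_items`, the `data` and `next_cursor` response fields, and the sample `ITEMS`. The cursor is the id of the last item returned, so a page is defined by "items after id N" rather than by a row offset that shifts when rows are inserted or deleted.

```python
from typing import Any, Dict, List, Optional

# Sample data sorted by id; a real endpoint would query a database ordered by id.
ITEMS: List[Dict[str, Any]] = [{"id": i, "name": f"item-{i}"} for i in range(1, 11)]


def list_items(
    items: List[Dict[str, Any]],
    limit: int = 3,
    cursor: Optional[int] = None,
) -> Dict[str, Any]:
    """Return up to `limit` items whose id is greater than `cursor`."""
    after = 0 if cursor is None else cursor
    page = [item for item in items if item["id"] > after][:limit]
    # There is more to fetch only if the page's last id is not the overall last id.
    has_more = bool(page) and page[-1]["id"] < items[-1]["id"]
    return {"data": page, "next_cursor": page[-1]["id"] if has_more else None}


if __name__ == "__main__":
    cursor: Optional[int] = None
    while True:
        response = list_items(ITEMS, limit=3, cursor=cursor)
        print([item["id"] for item in response["data"]], response["next_cursor"])
        cursor = response["next_cursor"]
        if cursor is None:
            break
```

Against a SQL database the same filter becomes a `WHERE id > :cursor ORDER BY id LIMIT :limit` query, which can use the primary-key index instead of scanning and discarding the skipped rows as large offsets tend to do.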