🧙🏼 This model hallucinates the least

Also: YouTube's AI detection tools

Happy Monday, wizards.

I’m giving all my subscribers free access to the list I made of AI influencers. It’s based on Time’s 100 AI lists from this year and last. You can browse, search and filter 188 people, their roles and the companies they’re associated with.

Enjoy!

Dario’s Picks

The most important news stories in AI this week

  1. The LLM hallucination index. Galileo, a platform for evaluating AI models, has made an index that scores popular large language models on how much they hallucinate (i.e. make things up). The models were tested with Retrieval Augmented Generation (RAG), in which they were given information from documents of different lengths: short (<5k tokens), medium (5-25k tokens) and long (40-100k tokens).

    Out of the 22 open and closed models tested:

    • Claude 3.5 Sonnet came out on top as the best performer for both short and long context lengths.

    • Gemini Flash 1.5 scored best on medium context length, and was highlighted for its overall high accuracy vs cost.

    • Models generally performed better on medium context length than short and long context.

    • In the same study for 2023, GPT-4 was the top performer with a score of 0.76, while this year’s top performer Claude 3.5 Sonnet scored 0.97.

    Why it matters: Top-tier models hallucinate less than they did a year ago. That’s great news! But while the best-performing ones are getting far more accurate, the hallucination level that remains still makes them less than helpful in settings where accuracy is critical.

    Practical tip: Medium-length contexts generally yield the fewest hallucinations. Giving ChatGPT or another AI more information can actually make it hallucinate less, up to a point. The “sweet spot” here was 5-25k tokens, which is roughly the length of a typical chapter in a book.
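    If you want to check where your own material lands, here is a minimal Python sketch. It rests on an assumption that is not from Galileo: the common rule of thumb of roughly 4 characters per token for English text. The function names are made up for illustration, and the buckets simply mirror the ranges quoted above, including the untested gap between 25k and 40k tokens.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token for English)."""
    return round(len(text) / chars_per_token)


def context_bucket(n_tokens: int) -> str:
    """Map a token count onto the context-length ranges quoted in the index."""
    if n_tokens < 5_000:
        return "short"
    if n_tokens <= 25_000:
        return "medium"
    if 40_000 <= n_tokens <= 100_000:
        return "long"
    # 25k-40k and anything above 100k fall outside the quoted ranges.
    return "outside tested ranges"


if __name__ == "__main__":
    document = "word " * 12_000  # stand-in for your own text
    tokens = estimate_tokens(document)
    print(tokens, context_bucket(tokens))
```

    For an exact count you would use the tokenizer of the model you are actually working with; the estimate above is only meant for a quick sanity check.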

Continued after the ad…

This issue is brought to you by

Superfilter: Your AI-Powered Email Assistant

Superfilter isn't just another inbox tool - it's your AI executive assistant, dedicated to managing your email. Unlike traditional email clients, Superfilter learns about your email habits and actively manages your inbox. Some of the features users really like:

  • It briefs you on critical inbox activity, so you don't miss what's important.

  • It tracks unfinished tasks, follow-ups, and invitations, keeping you on top of your commitments.

  • It handles repetitive work like scheduling meetings and drafting responses, freeing up your time.

Stop drowning in emails. Let Superfilter handle your inbox, so you can focus on work that matters. Whether you're a founder, manager, or executive, Superfilter adapts to your needs. Reclaim your time and boost your productivity.

  1. YouTube’s upcoming AI detection tools. YouTube will soon introduce new AI detection tools to protect creators from having their faces and voices copied and used in AI-generated videos.

    Here’s what they are working on:

    • Expanding its Content ID system, which identifies copyrighted material, to include synthetic singing (coming in January 2025).

    • Developing content detectors for when someone’s face is simulated with AI.

    • A solution to address the use of its content to train popular AI models like those from OpenAI, Anthropic and Google – it’s become a common complaint from creators that these companies train on their content without giving any compensation.

    • For high-profile people and celebrities, they’re creating a way to detect and manage AI-generated work showing their faces on YouTube.

    Why it matters: Happy to see YouTube making strides on this. Creating robust and generalizable deepfake detectors is a key research challenge these days for mitigating the downsides of AI in society. For creators on YouTube, and for the platforms that depend on them, AI-generated content presents several challenges (misinformation, negative impact on branding, lost revenue, and more), especially as this content becomes more and more realistic.

From our partners

Doing the same boring work again and again is exhausting.

What if you had a personal AI assistant who could do the job for you?

  1. LMSYS is creating a leaderboard for coding assistants. The organisation behind the popular Chatbot Arena leaderboard, which ranks chatbots based on users’ experiences, is now building an equivalent focused solely on coding. Copilot Arena, as it will be called, will collect user votes and rank popular LLMs like GPT-4o, Claude 3.5 Sonnet and Llama 3.1 on how helpful they are at coding.

    They’re currently seeking beta testers who already code regularly with AI and can commit to setting aside a couple of hours to evaluate the different models.

    Why it matters: Coding might be the most useful and directly applicable use case for GenAI so far. However, evaluating coding has its own unique challenges, such as how a model performs across programming languages and at debugging, and it also requires qualified testers to evaluate the output. Given the success of the existing leaderboard for chatbots, this might become a go-to resource for developers deciding which models to work with, once it’s up and running.
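    For the curious, here is a minimal Python sketch of how head-to-head user votes can be turned into a ranking using an Elo-style rating update. This is an illustration only, not LMSYS’s actual method or code: the function name, the K-factor of 32, the starting rating of 1000 and the placeholder model names are all assumptions.

```python
from collections import defaultdict


def elo_rankings(votes, k=32.0, base=1000.0):
    """Rank models from (winner, loser) votes with a simple Elo-style update.

    Every model starts at `base`; each vote moves rating points from the
    loser to the winner, with a bigger shift for a more surprising result.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes between placeholder models, not real results.
    votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
    for name, rating in elo_rankings(votes):
        print(f"{name}: {rating:.1f}")
```

    Sequential updates like this depend on the order in which votes arrive, which is one reason a production leaderboard would likely use a more careful statistical fit.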

  1. TLDR covers the most interesting tech, science and coding news in just 5 minutes. TLDR covers the best tech, startup, and coding stories in a quick email that takes 5 minutes to read.

    No politics, sports, or weather, we promise. And it's read by over 1,250,000 people. Subscribe for free (sponsored)

Was this email forwarded to you? Sign up here.

Want to get in front of 13,000 AI enthusiasts? Work with me.

This newsletter is written & curated by Dario Chincha.

Affiliate disclosure: To cover the cost of my email software and the time I spend writing this newsletter, I sometimes link to products and other newsletters. Please assume these are affiliate links. If you choose to subscribe to a newsletter or buy a product through any of my links then THANK YOU – it will make it possible for me to continue to do this.