• millie@beehaw.org
    English
    arrow-up
    5
    ·
    19 days ago

    Given the responses in this thread, it seems that the same bias exists even in ostensibly leftist spaces. Yikes.

    Y’all need to get out more.

  • sparky@lemmy.federate.cc
    arrow-up
    3
    ·
    edit-2
    21 days ago

    This kind of seems like a non-article to me. LLMs are trained on the corpus of written text that exists out in the world, which is overwhelmingly standard English. American dialects effectively only exist while spoken, be it a regional or city dialect, the Black or Chicano dialect, etc. So how would LLMs learn them? Seems like not a bias by AI models themselves, rather a reflection of the source material.

    • lily33@lemm.ee
      arrow-up
      4
      ·
      edit-2
      21 days ago

      It’s not an article about LLMs not using dialects. In fact, they have learned said dialects and will use them if asked.

      What they did was ask the LLM to suggest adjectives associated with sentences, and it would associate more aggressive or negative adjectives with the African American dialect.

      “Seems like not a bias by AI models themselves, rather a reflection of the source material.”

      All (racial) bias in AI models is actually a reflection of the training data, not of the modelling.

    • Toribor@corndog.social
      English
      arrow-up
      1
      ·
      18 days ago

      I’m from the Midwest US and I know there are words and sounds I pronounce with a Midwestern accent but I can still type and spell them correctly.

      If’n I typ lik dis den o’course people gonna think I hev the big dumb or that I’m a mole from a Redwall book.

    • BlackEco@lemmy.blackeco.com (OP)
      arrow-up
      0
      ·
      edit-2
      21 days ago

      “Seems like not a bias by AI models themselves, rather a reflection of the source material.”

      That’s what is usually meant by AI bias: a bias in the material used to train the model that is reflected in its behavior.

      • Lucy :3@feddit.org
        arrow-up
        1
        ·
        21 days ago

        But why is it even mentioned then? It’s removedING OBVIOUS. It’s like saying “AIs are biased towards English and neglect Latin” or smth ffs

        • BlackEco@lemmy.blackeco.com (OP)
          arrow-up
          2
          ·
          21 days ago

          I feel like not everyone is conscious of these biases, and we need to raise awareness and try to prevent, for example, HR people from buying AI-based screening software with a strong bias that its vendors don’t disclose (because why would you advertise that?).

          • NaN@lemmy.sdf.org
            English
            arrow-up
            2
            ·
            21 days ago

            I was confused how a resume or application would be largely affected, but the article points out that software is often used to look over social media now as part of hiring (which is awful).

            The bias when it determined guilt or considered consequences for a crime is concerning as more law enforcement agencies integrate black box algorithms into investigative work.

        • Gaywallet (they/it)@beehaw.org
          arrow-up
          2
          ·
          20 days ago

          “It’s removedING OBVIOUS”

          What is obvious to you is not always obvious to others. There are already countless examples of AI being used to do things like sort through applicants for jobs, who gets audited for child protective services, and who can get a visa for a country.

          But it’s also more insidious than that, because the far-reaching implications of this bias often cannot be predicted. For example, excluding all gender data from training ended up making sexism worse in this real-world example of financial lending assisted by AI, and the same was true for Apple’s credit card. We even have full-blown articles showing how the removal of data can actually reinforce bias, indicating that it’s not just what material is used to train the model that matters, but also what data is not used or is explicitly removed.

          This is so much more complicated than “this is obvious” and there’s a lot of signs pointing towards the need for regulation around AI and ML models being used in places where it really matters, such as decision making, until we understand them a lot better.

        • n2burns@lemmy.ca
          arrow-up
          1
          ·
          21 days ago

          Great comparison, a dialect used by millions of people to a dead language. It really shows how much you care about the people who speak that dialect…

          • Lucy :3@feddit.org
            arrow-up
            0
            ·
            21 days ago

            AIs are trained on what is written on the internet. Latin is not spoken, it’s written. But even then, it’s rarely used. African American English is a dialect, it’s only present in speech.

            • MostlyBlindGamer@rblind.com
              English
              arrow-up
              2
              ·
              20 days ago

              You need to get out more. I totally get that you would think that’s the case, but only if you’re not exploring parts of the internet outside your bubble. It’s absolutely written.

            • curbstickle@lemmy.dbzer0.com
              arrow-up
              1
              ·
              20 days ago

              There are actually quite a few books written in AAVE… The earliest I’m aware of is Their Eyes Were Watching God, from the 1930s. The Color Purple, Beloved, The Sellout, the books of Chester Himes…

  • gnu@lemmy.zip
    arrow-up
    1
    ·
    edit-2
    21 days ago

    It’d be interesting to see how much this changes if you were to restrict the training dataset to books written in the last twenty years. I suspect the model would be a lot less negative. Older books tend to include stuff that does not fit with modern ideals, and it’d be a real struggle to avoid this if such texts are used for training.

    For example I was recently reading a couple of the sequels to The Thirty-Nine Steps (written during WW1) and they include multiple instances that really date them to an earlier era with the main character casually throwing out jarringly racist stuff about black South Africans, Germans, the Irish, and basically anyone else who wasn’t properly English. Train an AI on that and you’re introducing the chance for problematic output - and chances are most LLMs have been trained on this series since they’re now public domain and easily available.

  • Moonrise2473@feddit.it
    arrow-up
    1
    ·
    21 days ago

    The problem is that they trained the models using millions of pirated books in standard English.

    AAE is mostly used when spoken: they also pirated millions of TV series and YouTube videos that can contain it, but as of now, those were mostly used for training voice recognition models.

    (proof that they pirated television content and YouTube videos to train Whisper: https://community.openai.com/t/subtitles-created-by-amara-org-qtss-etc/462561 - https://gist.github.com/riotbib/3b3c5f817b55b68801d14b8bdb02df09)

  • davehtaylor@beehaw.org
    arrow-up
    1
    ·
    20 days ago

    All the people here saying “well of course because they weren’t trained on AAVE”:

    THAT’S THE WHOLE POINT

    It’s the same reason facial recognition and voice recognition software have a difficult time with anyone who isn’t white or a speaker of perfect, uninflected standard English. The bias is created by the developers, conscious or not, because they only train it on what’s in their own bubble. If you don’t have diverse teams behind the development and training, you will create this bias, whether you want to or not. This is well known.