A machine learning algorithm is only as good as the data it’s trained on. Unfortunately, a massive and popular training dataset from MIT taught a bunch of algorithms to use racist and misogynistic slurs.
MIT just took down the offending database, 80 Million Tiny Images, for some much-needed sanitization, The Register reports. The dataset had been used to train image-recognition AI since 2008, but it was never audited for racist or offensive content, meaning a major source of algorithmic bias was flying under the radar.
AI learns to interpret and identify objects in pictures after poring over thousands of images that were already labeled. In MIT’s dataset, thousands of pictures of Black people — and monkeys — were labeled with the N-word. Pictures of women were labeled with misogynistic slurs. After being trained on that data, AI can perpetuate those prejudices in the real world.
“It is clear that we should have manually screened them,” MIT computer scientist and electrical engineer Antonio Torralba told The Register. “For this, we sincerely apologize. Indeed, we have taken the dataset offline so that the offending images and categories can be removed.”
But MIT then clarified that the dataset is gone forever. After attempting to screen out the offending images, the team concluded that the job was beyond human reviewers: with tens of millions of pictures, there were simply too many to check by hand.
“Therefore, manual inspection, even if feasible, will not guarantee that offensive images can be completely removed,” reads an MIT statement.
More on AI bias: Robot Journalist Accused of Racism