It’s an oft-repeated phrase among journalists: never read the comments. Comment sections, from Twitter to Reddit and everything in between, are some of the darkest places on the internet, places where baseless insults and pointed critiques fly like bullets in a chaotic melee.

To save us from that ugliness (in others, and also in ourselves), engineers at IBM have created an AI algorithm that tries to filter the profanity out of our messages and suggests more palatable alternatives.

The scientists behind the profanity-filtering AI are, in a refreshing twist, conscious of how their filter might be misused. For instance, authoritarian governments or overreaching technology companies could, hypothetically, use similar algorithms to flag political or otherwise critical language among people conversing online. And since governments are already hard at work shutting down dissident rumblings online, it's not far-fetched to imagine how destructive a tool like this could be in the wrong hands.

So, instead of simply changing offensive language, the researchers argue their algorithm should be used to provide gentle reminders and suggestions. For instance, a tool resembling good ol’ Microsoft Clippy might pop up and ask, “Do you really want to tell this stranger on Reddit to fuck off and die?” instead of automatically editing what you type.

And there’s a lot of merit in that — it’s the technological equivalent of venting your anger and then sleeping on it or stepping away from the keyboard before you hit send.

After being trained on millions of tweets and Reddit posts, the AI system became very effective at removing profane and hateful words. But it's much, much less adept at rewriting those sentences in a polite way that preserves their meaning.

For instance, a tweet reading "bros before hoes" was translated into "bros before money." There's… something missing there. Granted, this is much better than an existing language-filtering AI, which turned the same tweet into "club tomorrow." Let's give credit where credit is due.

Also, a lot of swear words were turned into “big,” regardless of context. A frustrated Reddit post reading “What a fucking circus this is” became a sincere, awe-filled “what a big circus this is.”

So far, the researchers have only built the algorithm; they haven't incorporated it into a usable online tool, for either individual users or the sites themselves. Presumably, it would have to get much better at suggesting replacement language before that could happen.

Aside from those, er, obvious shortcomings, the team behind the algorithm is upfront about its limitations. A filter of this sort can only catch the most obvious, explicit forms of online abuse: it can't tell that a sentence is hateful unless the sentence includes specific angry or profane words. Language that seems benign on its face, or that requires context to understand, flies under the radar.
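To see why explicit-word filtering has this blind spot, here's a minimal, purely illustrative sketch of a wordlist-based filter (a hypothetical stand-in, not IBM's actual system): it swaps out listed profanities, and everything else, however hostile, passes through unflagged.

```python
# Hypothetical sketch of a wordlist-based profanity filter.
# This is NOT IBM's model; it just illustrates the limitation
# the researchers describe: only explicit terms get caught.
PROFANITY = {"fucking", "shit", "damn"}  # tiny illustrative wordlist
REPLACEMENT = "big"  # mirrors the substitution the article observed

def filter_message(text: str) -> tuple[str, bool]:
    """Replace listed profanities; return (filtered text, was anything flagged?)."""
    cleaned = []
    flagged = False
    for word in text.split():
        if word.lower().strip(".,!?") in PROFANITY:
            cleaned.append(REPLACEMENT)
            flagged = True
        else:
            cleaned.append(word)
    return " ".join(cleaned), flagged

# Explicit profanity is caught (and clumsily rewritten):
print(filter_message("What a fucking circus this is"))
# Implicitly hostile language sails through untouched and unflagged:
print(filter_message("People like you always ruin everything"))
```

The second call is the whole problem in miniature: no word on the list appears, so the filter reports the message as clean.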

Implicit prejudice, then, goes unchecked, as long as no one says "shit." And that says nothing of the arguably more dangerous forms of online harassment, like stalking, doxing, or threatening people. Of course, a language filter can't end the internet's toxic culture on its own, but this new AI research could help us take a step back and think real hard before we decide to perpetuate hateful speech.