Using Rules with a SpaCy TextCat Classifier for NSFW Content

Lynn Cherny
6 min read · Jan 12, 2023


This is a short tech note (intermediate level) on how to modify a spaCy NLP pipeline to incorporate rules to override — or guarantee — a trained classification model result. Models can be fragile sometimes: they can be gamed by adversarial users or behave unpredictably under certain contexts of use you may not have considered. Rules can help you ensure behavior you want when you’re sure of what you want and there aren’t nuances of context.

Our example is a real-world case I encountered on a client project for a text2image generation startup. In these tools, a user types in a text prompt, like “A penguin in Venice,” and the model generates an image for it.

“A penguin in Venice”, by the author using Midjourney

Every company in this space with a public-facing product, such as a community creation feed, has been concerned about surfacing NSFW or offensive content of various types. For now, let’s assume the client doesn’t like public images of naked people, and that those count as “NSFW” for the purposes of my classifier.

While I was working on labeling data and building a classifier for such prompts, I noted that the volunteer moderators kept adding various porn actors to the blacklist as they encountered them in community-flagged content.

You might think that a generic prompt with a porn actor’s name in it — say, “[Actress Name] eating a bagel” — might be “safe” and should be allowed, but you’d probably be wrong in an image generation context; images of these actors are highly likely to contain NSFW content, if they were included in the training dataset for the model in production. This makes their names a good invocation to steer the model in the direction of NSFW output. Note: Clearly we would ideally like to have two classifiers, one for text input and one for the model’s image output; but my work was focused on the text prompt side.

The TextCat Classifier

First, we need a trained classification model. Creating a real production model is non-trivial, since there are lots of ways to ask for NSFW content, and labeling is time consuming and therefore expensive. Let’s leave aside this issue for now, to focus on the rule overrides. But first we do need at least a toy classification model, on which to apply the overrides.

There are a couple of ways to make a textcat classification model with spaCy: using the command line, or in code in a notebook. If you want to do it in a notebook, you can follow the code in jupyter-naas’s awesome-notebooks sentiment example (scroll down a bit to the classification part). Otherwise, here’s a tutorial for the pipeline method by Phil S, and some official demo projects from Explosion.ai’s team: a textcat demo for a binary classifier, and a demo for a multilabel classifier. We’re concerned with a binary classifier in this article, which labels a prompt as positive or negative for “NSFW.”

This might be obvious but it’s easy to overlook: Whatever you load as your “nlp” base model (“en_core_web_lg” or “en_core_web_trf,” for instance) before you train the textcat model pipeline component is what you ALSO need to use when you load a saved model.

Here we are loading a trained model pipeline component that was built off the “en_core_web_trf,” the default spaCy transformers model:

nlp = spacy.load("en_core_web_trf")  # my textcat model wants transformers model
mytextcat = spacy.load("trained-textcat/model-best") # load your model
nlp.add_pipe("textcat", source=mytextcat) # add the trained model here

This is a binary classifier, where a sample of the toy training data looked like this:

("A naked woman on a bed", {"cats": {"POSITIVE": 1}}),
("A woman wearing nothing, in a shower.", {"cats": {"POSITIVE": 1}}),
("A woman wearing clothes in the shower.", {"cats": {"NEGATIVE": 1}}),
("A woman eating a bagel in a sauna.", {"cats": {"NEGATIVE": 1}}),
("Amanda Foo washing a car in a bikini.", {"cats": {"NEGATIVE": 1}}),
("Elizabeth nude, photoreal, High def", {"cats": {"POSITIVE": 1}}),
("Mia Farrow eating a cream tea.", {"cats": {"NEGATIVE": 1}}),
("A man with no clothes in a messy bed.", {"cats": {"POSITIVE": 1}}),
("Anatomically accurate female android wearing a see-through lace bikini", {"cats": {"POSITIVE": 1}})
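One wrinkle worth flagging: in spaCy v3 training data, a category that is simply absent from the “cats” dict is generally treated as a missing value rather than an explicit 0, so for a mutually exclusive binary textcat it’s safer to fill in the complementary label before training. A minimal sketch of that preprocessing step (the helper name and `LABELS` tuple are my own, not spaCy API):

```python
# Fill in the complementary category explicitly, so spaCy doesn't treat
# the absent label as "unknown" during textcat training.

LABELS = ("POSITIVE", "NEGATIVE")

def complete_cats(example):
    """Return a copy of a (text, annotations) pair with both labels present."""
    text, annotations = example
    cats = dict(annotations.get("cats", {}))
    for label in LABELS:
        cats.setdefault(label, 0)
    return text, {"cats": cats}

train_data = [
    ("A naked woman on a bed", {"cats": {"POSITIVE": 1}}),
    ("A woman wearing clothes in the shower.", {"cats": {"NEGATIVE": 1}}),
]
train_data = [complete_cats(ex) for ex in train_data]
# Each annotation now reads e.g. {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}
```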

We can test it by trying some new sentences and printing their “cats”:

doc = nlp("A nude girl, cinematic HD")
print(doc.cats)

{'POSITIVE': 0.9870722889900208, 'NEGATIVE': 0.012927748262882233}

doc = nlp("Amanda Hess eating a cookie")
print(doc.cats)

{'POSITIVE': 0.15201209485530853, 'NEGATIVE': 0.8479878902435303}

The Entity Ruler

We now want to make sure any sentence with a person entity who is a porn star returns a positive classification, regardless of the rest of the prompt content. Luckily, there is a way to add patterns to the entity recognition process, which is one reason spaCy is nice for production NLP work. That way is the entity ruler.

How do we get a list of porn star names? One way is to do a query for wikidata entities. Luckily the porn actor content is quite fleshed out on Wikipedia (see this article on the gender problem with Wikipedia editors and what gets documented).

We can use a SPARQL query like this. (Tip: Use ctrl-space to get search completions for the wd/wdt entities when you are trying to compose your own.)

SELECT DISTINCT ?person ?personLabel
WHERE {
  ?person wdt:P31 wd:Q5 ;               # any instance of a human
          wdt:P106/wdt:P279* wd:Q488111 .  # with profession (or a subclass of) pornographic actor

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}

This query looks for entities that are human and have a profession of “porn actor.” The “personLabel” is the name (we aren’t collecting aliases here, which are often lists of strings that need cleaning). This will get us over 9K results. But you should note that some of the name labels are single words like “India” or “Missy,” and those might not be safe to use in rules. You should carefully review and clean your list down to non-confusable names, unless you’re happy with lots of false positives. (This is a decision you must make based on how the results will be surfaced and used in the app; if humans will be reviewing everything, and you’re more worried about bad content slipping through, you may want to retain these confusable entities in your list.)
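A cleaning pass like this can be a plain-Python filter over the SPARQL result labels. This is a minimal sketch under my own assumptions: the tiny blocklist of confusable words is illustrative only, and a real pass would use a proper wordlist (or manual review) instead.

```python
# Filter out name labels that are too ambiguous to use as rules:
# single-token names, and names containing common English words.

COMMON_WORDS = {"india", "missy", "angel", "destiny"}  # illustrative sample only

def keep_name(label: str) -> bool:
    """Keep only multi-token names that don't collide with common words."""
    tokens = label.lower().split()
    if len(tokens) < 2:  # single-word names are too confusable
        return False
    return not any(tok in COMMON_WORDS for tok in tokens)

raw_labels = ["India", "Mia Khalifa", "Missy", "Lana Rhoades"]
cleaned = [name for name in raw_labels if keep_name(name)]
# cleaned -> ["Mia Khalifa", "Lana Rhoades"]
```

In practice you’d run this over all 9K+ labels from the query, then eyeball what was dropped to make sure the filter isn’t too aggressive for your false-positive/false-negative tradeoff.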

Using the cleaned list, we can create entity ruler rules to add to our nlp pipeline after “ner” (“named entity recognition”). I’m giving them a special entity label, so it’s super clear in my output:

ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})
patterns = [
    {"label": "PORN_PERSON", "pattern": [{"LOWER": "lana"}, {"LOWER": "rhoades"}]},
    {"label": "PORN_PERSON", "pattern": [{"LOWER": "mia"}, {"LOWER": "khalifa"}]},
    ...
]
ruler.add_patterns(patterns)
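With thousands of names on the cleaned list, you wouldn’t write those pattern dicts by hand; they can be generated from the name strings. A sketch of that step (the helper name is mine, but the output dicts follow the entity ruler’s pattern format shown above):

```python
def name_to_pattern(name: str) -> dict:
    """Build a case-insensitive entity_ruler pattern for a multi-token name."""
    return {
        "label": "PORN_PERSON",
        "pattern": [{"LOWER": tok.lower()} for tok in name.split()],
    }

names = ["Mia Khalifa", "Lana Rhoades"]  # the cleaned wikidata labels
patterns = [name_to_pattern(n) for n in names]
# patterns[0] == {"label": "PORN_PERSON",
#                 "pattern": [{"LOWER": "mia"}, {"LOWER": "khalifa"}]}
# ruler.add_patterns(patterns)  # then hand them to the entity ruler as above
```

Matching on the LOWER attribute of each token means the rule fires regardless of how the user capitalizes the name in their prompt.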

Now we want to make sure that if we get one of these entities, we change the document classification category. We can write a little decorated function for a component that does this:

from spacy.language import Language

@Language.component("reset_textcat")
def reset_textcat(doc):
    for ent in doc.ents:
        if ent.label_ == "PORN_PERSON":
            doc.cats = {"POSITIVE": 1, "NEGATIVE": 0}
    return doc

And then add it last in the pipeline, after both the classifier and the entity ruler:

nlp.add_pipe("reset_textcat", last=True)

Now a test should show it is working when we look at entity labels and the doc.cats:

doc = nlp("Mia Khalifa eating toast.")

for ent in doc.ents:
    print(ent, ent.label_)
print(doc.cats)

Mia Khalifa PORN_PERSON
{'POSITIVE': 1, 'NEGATIVE': 0}

And a negative with a non-porn person shows an ordinary PERSON entity label:

doc = nlp("Mia Farrow eating toast.")

for ent in doc.ents:
    print(ent, ent.label_)
print(doc.cats)

Mia Farrow PERSON
{'POSITIVE': 0.04515556991100311, 'NEGATIVE': 0.9548444151878357}

And we still get a normal positive with a non-porn person:

doc = nlp("Mia Farrow in a see-through bikini eating toast.")

for ent in doc.ents:
    print(ent, ent.label_)
print(doc.cats)

Mia Farrow PERSON
{'POSITIVE': 0.997011661529541, 'NEGATIVE': 0.0029883417300879955}

More Resources

I wrote this up to help contribute to the resources on custom pipelines in spaCy. I talked about this a bit in my Normconf 2022 talk on Tips and Tricks in NLP, and I’m growing a repo with links on helpful real-world NLP issues (especially using spaCy) here. Let me know if you’d like to see other NLP content!
