Hi, I'm James

LLMs for NLP Drudgery

LLMs are a big productivity boost in NLP workflows, especially where human interpretation is an input.

Data Engineering, Data Science

In retrospect, it’s kinda obvious, but I had to see it to understand it. LLMs are now my preferred tools for tasks where I would previously have reached for spaCy, NLTK, Gensim, etc.

A real example

I was recently developing an unsupervised topic model for a very large data set of call transcripts. As is often the case, the model was picking up the names of call representatives as topics, since they appear consistently in the transcripts. E.g.:

Thank you for calling. My name is James. How can I help you today?

Pre-LLM, I would spend time manually reviewing topics and adding to my list of stop words to be removed from the dataset before fitting the model. That approach works, but it can take hours of iteration to get past the noise and into meaningful topics.
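A minimal sketch of that manual loop, assuming a plain regex tokenizer and a hand-maintained stop-word set. The names CUSTOM_STOP_WORDS, tokenize, and remove_stop_words are hypothetical and only for illustration; a real pipeline would more likely lean on spaCy or NLTK for tokenizing.

```python
import re

# Hypothetical hand-maintained list: it grows each time a fitted topic
# turns out to be a name rather than a real subject.
CUSTOM_STOP_WORDS = {"james", "maria"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(text, stop_words):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [tok for tok in tokenize(text) if tok not in stop_words]

transcript = "Thank you for calling. My name is James. How can I help you today?"
tokens = remove_stop_words(transcript, CUSTOM_STOP_WORDS)
```

The slow part is not this code but the loop around it: fit the model, spot another name among the topics, add it to the set, and fit again.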

On a whim, I gave the problem to Llama (see note 1). The results were as good as what I could do in a few hours:

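-- Keep the prompt in a one-row CTE so it is defined once.
WITH prompts AS (
    SELECT 'Remove names from this call transcript...' AS remove_names_prompt
)

-- Send prompt + transcript to the model endpoint for every call.
SELECT
    call_transcripts.call_id,
    ai_query(
        'Meta-Llama-3.1-70B-Instruct',
        concat(prompts.remove_names_prompt, call_transcripts.transcript)
    ) AS cleaned_transcript
FROM call_transcripts
-- The CROSS JOIN attaches the single prompt row to every transcript row.
CROSS JOIN prompts;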

I’m still tokenizing, stemming, etc. downstream from this query. But I’ve eliminated one of the most time-consuming steps: removing unwanted topics.
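For anyone working outside SQL, the same per-transcript pattern can be sketched in Python. This is a sketch under assumptions, not the setup used above: `complete` stands in for whatever client calls your model, `fake_complete` is a made-up stub so the example runs without a network call, and the prompt string is left truncated exactly as it is in the query.

```python
from typing import Callable, Dict

# Truncated on purpose, as in the query; the full prompt is in the notes.
REMOVE_NAMES_PROMPT = "Remove names from this call transcript..."

def clean_transcripts(
    transcripts: Dict[str, str],
    complete: Callable[[str], str],
) -> Dict[str, str]:
    """Map call_id -> cleaned transcript by sending prompt + transcript to `complete`."""
    return {
        call_id: complete(REMOVE_NAMES_PROMPT + text)
        for call_id, text in transcripts.items()
    }

def fake_complete(request: str) -> str:
    """Hypothetical stand-in for a model call: strips the prompt and swaps one name."""
    return request[len(REMOVE_NAMES_PROMPT):].replace("James", "representative")

cleaned = clean_transcripts({"c1": "My name is James."}, fake_complete)
```

Passing the model call in as a function keeps the cleaning step separate from the tokenizing and stemming that follow, and makes it easy to swap models.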

Repetitive NLP-ish stuff as a sweet spot for LLMs

This got me thinking: what other natural-language gruntwork do I often do that could be offloaded to an LLM?

  • Removing unwanted keywords like names, products, locations, etc., as in the example above.
  • Explaining topics after a model is fitted. E.g., “Given these examples and keywords, describe the detected topic in a few words.”
  • Building test-cases for classification models. E.g., “Generate 10 hypothetical SMS messages that illustrate the given topic and sentiment.”
  • Fetching the specific statements from a document. E.g., “What are the 1-2 statements that most relate to this topic?”
  • Upgrading documents from lesser-quality transcribers / processes. E.g., “Edit this transcript for grammar and clarity, while retaining the original meaning.”
  • And a bunch of stuff I’m not clever enough to think of
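The example prompts in this list can be kept as reusable templates. A minimal sketch, assuming one instruction string per task; the instruction text is quoted from the list above, while the dictionary keys, TASK_PROMPTS, and build_prompt are hypothetical names chosen for illustration.

```python
# Instruction text comes from the list above; the keys are made up.
TASK_PROMPTS = {
    "explain_topic": "Given these examples and keywords, describe the detected topic in a few words.",
    "generate_tests": "Generate 10 hypothetical SMS messages that illustrate the given topic and sentiment.",
    "fetch_statements": "What are the 1-2 statements that most relate to this topic?",
    "upgrade_transcript": "Edit this transcript for grammar and clarity, while retaining the original meaning.",
}

def build_prompt(task: str, payload: str) -> str:
    """Join a task's instruction and its input text into one request string."""
    if task not in TASK_PROMPTS:
        raise KeyError(f"unknown task: {task}")
    return f"{TASK_PROMPTS[task]}\n\n{payload}"

request = build_prompt("upgrade_transcript", "thank u for calling my name james")
```

The resulting string can be handed to whichever model endpoint you use; keeping the instructions in one place makes it easier to compare wordings across runs.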

Use cases aside, any time I find myself writing “from spacy import ...”, I now ask whether I could accomplish all or part of the task with an LLM.

Notes

  1. Full prompt: “Remove specific names of people and places from the following call transcript and provide back the edited transcript. Your edits should be made to retain as much of the original transcript as possible. Replace the names of callers and representatives with the word caller and representative respectively. Replace the names of store locations with the word location. Replace the names of specific products with the word product. Return only the edited transcript. Do not summarize your edits.”