Extract Structured Data From Text: Expert Mode (Using Kor)#

For complicated data extraction, you need a robust library. The Kor library (created by Eugene Yurtsev) is an awesome tool for exactly this.

We are going to explore using Kor with a practical use case.

Why is this important? LLMs are great at text output, but they need extra help outputting information in the structure we want. A common request from developers is to get JSON data back from our LLMs.

Spoiler: Jump down to the bottom to see a bona fide business idea that you can start and manage today.

# Unzip data folder

import zipfile
with zipfile.ZipFile('../../data.zip', 'r') as zip_ref:
    zip_ref.extractall('..')
# Kor!
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number

# LangChain Models
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

# Standard Helpers
import pandas as pd
import requests
import time
import json
from datetime import datetime

# Text Helpers
from bs4 import BeautifulSoup
from markdownify import markdownify as md

# For token counting
from langchain.callbacks import get_openai_callback

def printOutput(output):
    print(json.dumps(output, sort_keys=True, indent=3))
# It's better to store this in an environment variable, but we're putting it in plain text for clarity
openai_api_key = 'your_api_key'
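
If you'd rather keep the key out of your code, here's a minimal sketch of the environment-variable approach (assuming you've exported OPENAI_API_KEY in your shell):

import os

# Read the key from an environment variable instead of hardcoding it
openai_api_key = os.environ.get("OPENAI_API_KEY", "your_api_key")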

Let’s start off by creating our LLM. We’re using gpt-4 to take advantage of its increased ability to follow instructions.

llm = ChatOpenAI(
#     model_name="gpt-3.5-turbo", # Cheaper but less reliable
    model_name="gpt-4",
    temperature=0,
    max_tokens=2000,
    openai_api_key=openai_api_key
)

Kor Hello World Example#

Create an object that holds information about the fields you’d like to extract.

person_schema = Object(
    # This is what will appear in your output. It's what the fields below will be nested under.
    # It should be the parent of the fields below. Usually it's singular (not plural)
    id="person",
    
    # Natural language description about your object
    description="Personal information about a person",
    
    # Fields you'd like to capture from a piece of text about your object.
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    
    # Examples go a long way in telling the LLM what you need
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ]
)

Create a chain that will extract the information and then parse it. This uses LangChain under the hood.

chain = create_extraction_chain(llm, person_schema)
text = """
    My name is Bobby.
    My sister's name is Rachel.
    My brother's name is Joe. My dog's name is Spot.
"""
output = chain.predict_and_parse(text=(text))["data"]

printOutput(output)
# Notice how "Spot" isn't in the results list, because it's the name of a dog, not a person.
{
   "person": [
      {
         "first_name": "Bobby"
      },
      {
         "first_name": "Rachel"
      },
      {
         "first_name": "Joe"
      }
   ]
}

Kor also handles the case where the LLM doesn’t find what you’re looking for, returning an empty result instead.

output = chain.predict_and_parse(text=("The dog went to the park"))["data"]
printOutput(output)
{
   "person": []
}
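
Since an empty list is a perfectly valid answer, it's worth handling that case explicitly before using the results. A small defensive sketch:

# Guard against the empty case before doing anything with the results
people = output.get("person", [])
if not people:
    print("No people found in this text")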

Multiple Fields#

You can pass multiple fields if you’re looking for more information.

plant_schema = Object(
    id="plant",
    description="Information about a plant",
    
    # Notice I put multiple fields to pull out different attributes
    attributes=[
        Text(
            id="plant_type",
            description="The common name of the plant."
        ),
        Text(
            id="color",
            description="The color of the plant"
        ),
        Number(
            id="rating",
            description="The rating of the plant."
        )
    ],
    examples=[
        (
            "Roses are red, lilies are white and a 8 out of 10.",
            [
                {"plant_type": "Roses", "color": "red"},
                {"plant_type": "Lily", "color": "white", "rating" : 8},
            ],
        )
    ]
)
text="Palm trees are brown with a 6 rating. Sequoia trees are green"

chain = create_extraction_chain(llm, plant_schema)
output = chain.predict_and_parse(text=text)['data']

printOutput(output)
{
   "plant": [
      {
         "color": "brown",
         "plant_type": "Palm tree",
         "rating": "6.0"
      },
      {
         "color": "green",
         "plant_type": "Sequoia tree",
         "rating": ""
      }
   ]
}
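
Notice the rating comes back as the string "6.0" (and an empty string when it's missing). A hypothetical post-processing helper to coerce those back into Python numbers might look like:

# Hypothetical helper: coerce Kor's string numbers into floats (None if missing)
def coerce_ratings(records):
    for r in records:
        r["rating"] = float(r["rating"]) if r.get("rating") else None
    return records

plants = coerce_ratings(output["plant"])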

Working With Lists#

You can extract lists as well.

Note: Check out how I have a nested object: the ‘parts’ object is nested inside ‘cars_schema’.

parts = Object(
    id="parts",
    description="A single part of a car",
    attributes=[
        Text(id="part", description="The name of the part")
    ],
    examples=[
        (
            "the jeep has wheels and windows",
            [
                {"part": "wheel"},
                {"part": "window"}
            ],
        )
    ]
)

cars_schema = Object(
    id="car",
    description="Information about a car",
    examples=[
        (
            "the bmw is red and has an engine and steering wheel",
            [
                {"type": "BMW", "color": "red", "parts" : ["engine", "steering wheel"]}
            ],
        )
    ],
    attributes=[
        Text(
            id="type",
            description="The make or brand of the car"
        ),
        Text(
            id="color",
            description="The color of the car"
        ),
        parts
    ]
)
# To do nested objects you need to specify encoder_or_encoder_class="json"
text = "The blue jeep has rear view mirror, roof, windshield"

# Changed the encoder to json
chain = create_extraction_chain(llm, cars_schema, encoder_or_encoder_class="json")
output = chain.predict_and_parse(text=text)['data']

printOutput(output)
{
   "car": {
      "color": "blue",
      "parts": [
         {
            "part": "rear view mirror"
         },
         {
            "part": "roof"
         },
         {
            "part": "windshield"
         }
      ],
      "type": "jeep"
   }
}
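
Once you have the nested structure, flattening it back into plain Python is straightforward. A short sketch using the output above:

# Flatten the nested parts into a plain list of strings
car = output["car"]
part_names = [p["part"] for p in car.get("parts", [])]
print(part_names)  # ['rear view mirror', 'roof', 'windshield']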

View the prompt that was sent over.

prompt = chain.prompt.format_prompt(text=text).to_string()

print(prompt)
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

car: { // Information about a car
 type: string // The make or brand of the car
 color: string // The color of the car
 parts: { // A single part of a car
  part: string // The name of the part
 }
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.

Input: the bmw is red and has an engine and steering wheel
Output: <json>{"car": [{"type": "BMW", "color": "red", "parts": ["engine", "steering wheel"]}]}</json>
Input: the jeep has wheels and windows
Output: <json>{"car": {"parts": [{"part": "wheel"}, {"part": "window"}]}}</json>
Input: The blue jeep has rear view mirror, roof, windshield
Output:

Kor is also a really great way to extract actions from a user.

schema = Object(
  id="forecaster",
  description=(
      "User is controling an app that makes financial forecasts. "
      "They will give a command to update a forecast in the future"
  ),
  attributes=[
      Text(
          id="year",
          description="Year the user wants to update",
          examples=[("please increase 2014's customers by 15%", "2014")],
          many=True,
      ),
      Text(
          id="metric",
          description="The unit or metric a user would like to influence",
          examples=[("please increase 2014's customers by 15%", "customers")],
          many=True,
      ),
      Text(
          id="amount",
          description="The quantity of a forecast adjustment",
          examples=[("please increase 2014's customers by 15%", ".15")],
          many=True,
      )
    ],
  many=False,
)
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
output = chain.predict_and_parse(text="please add 15 more units sold to 2023")['data']

printOutput(output)
{
   "forecaster": {
      "amount": [
         "15"
      ],
      "metric": [
         "units sold"
      ],
      "year": [
         "2023"
      ]
   }
}
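
To show why this is useful, here's a hypothetical sketch that applies the parsed command to an in-memory forecast (the forecast data and update rule are made up for illustration):

# Hypothetical: apply the parsed command to a toy forecast keyed by (year, metric)
forecast = {("2023", "units sold"): 100}

cmd = output["forecaster"]
for year, metric, amount in zip(cmd["year"], cmd["metric"], cmd["amount"]):
    key = (year, metric)
    forecast[key] = forecast.get(key, 0) + float(amount)

print(forecast)  # {('2023', 'units sold'): 115.0}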

Opening Attributes - Real World Example#

Opening Attributes (my sample project for this application)

If anyone wants to strategize on this project, DM me on Twitter.

llm = ChatOpenAI(
    # model_name="gpt-3.5-turbo",
    model_name="gpt-4",
    temperature=0,
    max_tokens=2000,
    openai_api_key=openai_api_key
)

We are going to be pulling jobs from Greenhouse. No API key is needed.

def pull_from_greenhouse(board_token):
    # If doing this in production, make sure you add retries and backoffs
    
    # Get your URL ready to accept a parameter
    url = f'https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true'
    
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException as e:
        # In case the request fails
        print(f"Whoops, error: {e}")
        return
        
    status_code = response.status_code
    
    jobs = response.json()['jobs']
    
    print(f"{board_token}: {status_code}, Found {len(jobs)} jobs")
    
    return jobs
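
As the comment above notes, production code should retry with backoff. A minimal sketch of what that could look like (the retry count and sleep times are arbitrary choices):

def pull_with_retries(board_token, max_retries=3):
    # Hypothetical wrapper: simple exponential backoff around the request
    url = f'https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true'
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()['jobs']
        except requests.exceptions.RequestException:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    return None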

Let’s try it out for Okta.

jobs = pull_from_greenhouse("okta")
okta: 200, Found 142 jobs

Let’s look at a sample job with its raw dictionary.

# Keep in mind that my job_ids will likely change when you run this depending on the postings of the company
job_index = 0
print ("Preview:\n")
print (json.dumps(jobs[job_index])[:400])
Preview:

{"absolute_url": "https://www.okta.com/company/careers/opportunity/4977199?gh_jid=4977199", "data_compliance": [{"type": "gdpr", "requires_consent": false, "requires_processing_consent": false, "requires_retention_consent": false, "retention_period": null}], "internal_job_id": 2518868, "location": {"name": "Melbourne "}, "metadata": null, "id": 4977199, "updated_at": "2023-04-05T22:41:12-04:00", "

Let’s clean this up a bit.

# I parsed through an output to create the function below
def describeJob(job_description):
    print(f"Job ID: {job_description['id']}")
    print(f"Link: {job_description['absolute_url']}")
    # Note: the %-d directive (day without a leading zero) works on Linux/macOS but not on Windows
    print(f"Updated At: {datetime.fromisoformat(job_description['updated_at']).strftime('%B %-d, %Y')}")
    print(f"Title: {job_description['title']}\n")
    print(f"Content:\n{job_description['content'][:550]}")

We’ll look at another job. This job_id may or may not work for you depending on whether the position is still active.

# Note: I'm using a hard coded job id below. You'll need to switch this if this job ever changes
# and it most definitely will!
job_id = 4982726

job_description = [item for item in jobs if item['id'] == job_id][0]
    
describeJob(job_description)
Job ID: 4982726
Link: https://www.okta.com/company/careers/opportunity/4982726?gh_jid=4982726
Updated At: April 11, 2023
Title: Staff Software Engineer 

Content:
&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;strong&gt;Get to know Okta&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;br&gt;&lt;/span&gt;Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at t

I want to convert the HTML to text; we’ll use BeautifulSoup to do this. There are multiple methods you could choose from, so pick what’s best for you.

# The job content comes back HTML-escaped, so get_text() here unescapes the
# entities and leaves us with the raw HTML markup
soup = BeautifulSoup(job_description['content'], 'html.parser')
text = soup.get_text()

# Convert your html to markdown. This reduces tokens and noise
text = md(text)

print(text[:600])
**Get to know Okta**


  
Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.   
  
At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box, we’re looking for lifelong learners and people who can make us better with their unique experien
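
As an alternative to the BeautifulSoup round-trip above, you could unescape the entities with the standard library and hand the raw HTML straight to markdownify. A sketch that should be equivalent in spirit:

import html

# Unescape the HTML entities, then convert the raw HTML to markdown
raw_html = html.unescape(job_description['content'])
text = md(raw_html)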

Let’s create a Kor object that will look for tools. This is the meat and potatoes of the application.

tools = Object(
    id="tools",
    description="""
        A tool, application, or other company that is listed in a job description.
        Analytics, eCommerce and GTM are not tools
    """,
    attributes=[
        Text(
            id="tool",
            description="The name of a tool or company"
        )
    ],
    examples=[
        (
            "Experience in working with Netsuite, or Looker a plus.",
            [
                {"tool": "Netsuite"},
                {"tool": "Looker"},
            ],
        ),
        (
           "Experience with Microsoft Excel",
            [
               {"tool": "Microsoft Excel"}
            ] 
        ),
        (
           "You must know AWS to do well in the job",
            [
               {"tool": "AWS"}
            ] 
        ),
        (
           "Troubleshooting customer issues and debugging from logs (Splunk, Syslogs, etc.) ",
            [
               {"tool": "Splunk"},
            ] 
        )
    ],
    many=True,
)
chain = create_extraction_chain(llm, tools, input_formatter="triple_quotes")
output = chain.predict_and_parse(text=text)["data"]

printOutput(output)
{
   "tools": [
      {
         "tool": "Okta"
      },
      {
         "tool": "Java"
      },
      {
         "tool": "Hibernate"
      },
      {
         "tool": "Spring Boot"
      },
      {
         "tool": "AWS"
      },
      {
         "tool": "GCP"
      },
      {
         "tool": "SQL"
      },
      {
         "tool": "ElasticSearch"
      },
      {
         "tool": "Docker"
      },
      {
         "tool": "Kubernetes"
      }
   ]
}
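
Depending on the job description, the model can repeat itself, so a small post-processing step to dedupe and normalize the names is worthwhile. A sketch:

# Dedupe and alphabetize the extracted tool names
tool_names = sorted({t["tool"].strip() for t in output["tools"]})
print(tool_names)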

Salary#

Let’s grab salary information while we are at it.

Not all jobs will list this information, and when they do, it’s rarely consistent across postings. That makes it a great use case for an LLM to catch!

salary_range = Object(
    id="salary_range",
    description="""
        The range of salary offered for a job mentioned in a job description
    """,
    attributes=[
        Number(
            id="low_end",
            description="The low end of a salary range"
        ),
        Number(
            id="high_end",
            description="The high end of a salary range"
        )
    ],
    examples=[
        (
            "This position will make between $140 thousand and $230,000.00",
            [
                {"low_end": 140000, "high_end": 230000},
            ]
        )
    ]
)
jobs = pull_from_greenhouse("cruise")
cruise: 200, Found 219 jobs
# This job id may not work for you; pick another one from the list if it doesn't.
job_id = 4858414

job_description = [item for item in jobs if item['id'] == job_id][0]
    
describeJob(job_description)

soup = BeautifulSoup(job_description['content'], 'html.parser')
text = soup.get_text()

# Convert your html to markdown. This reduces tokens and noise
text = md(text)

print(text[:600])
Job ID: 4858414
Link: https://boards.greenhouse.io/cruise/jobs/4858414?gh_jid=4858414
Updated At: April 12, 2023
Title: Senior Data Center Technician

Content:
&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;span style=&quot;font-weight: 400;&quot;&gt;We&#39;re Cruise, a self-driving service designed for the cities we love.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;font-weight: 400;&quot;&gt;We’re building the world’s most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.&lt;/span&gt;
We're Cruise, a self-driving service designed for the cities we love.


We’re building the world’s most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.


In our cars, you’re free to be yourself. It’s the same here at Cruise. We’re creating a culture that values the experiences and contributions of all of the unique individuals who collectively make up Cruise, so that every employee can do thei
chain = create_extraction_chain(llm, salary_range)
output = chain.predict_and_parse(text=text)["data"]

printOutput(output)
{
   "salary_range": [
      {
         "high_end": "165000",
         "low_end": "112300"
      }
   ]
}

For comparison, the job description itself says: “The salary range for this position is $112,300 - 165,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus, restricted stock units, and benefits. These ranges are subject to change.”

Awesome!

OpenAI GPT-4 Pricing

with get_openai_callback() as cb:
    result = chain.predict_and_parse(text=text)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 1768
Prompt Tokens: 1757
Completion Tokens: 11
Successful Requests: 1
Total Cost (USD): $0.053369999999999994
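
To budget a larger run, you can extrapolate from this single call. The figures below reuse the cost reported above; the jobs-per-company number is a rough guess, and real per-job token counts will vary:

# Back-of-the-envelope cost estimate for a larger scrape (all numbers illustrative)
cost_per_job = 0.0534        # from the callback above
num_companies = 1000
jobs_per_company = 15        # rough guess
estimated_cost = cost_per_job * num_companies * jobs_per_company
print(f"Estimated cost: ${estimated_cost:,.2f}")  # Estimated cost: $801.00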

Suggested To Do if you want to build this out:

  • Reduce amount of HTML and low-signal text that gets put into the prompt

  • Gather list of 1000s of companies

  • Run through most jobs (You’ll likely start to see duplicate information after the first 10-15 jobs per department)

  • Store results (see the pandas sketch after this list)

  • Snapshot daily as you look for new jobs

  • Follow Greg on Twitter for more tools or if you want to chat about this project

  • Read the user feedback below for what else to build out with this project (I reached out to everyone who signed up on Twitter)
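
For the “store results” step, here's a minimal sketch using pandas (the records and column names are hypothetical; shape them however fits your pipeline):

# Hypothetical flattened records from the extraction runs
all_results = [
    {"company": "okta", "job_id": 4982726, "tool": "AWS"},
    {"company": "okta", "job_id": 4982726, "tool": "Java"},
]

df = pd.DataFrame(all_results)
df.to_csv(f"tools_{datetime.now():%Y-%m-%d}.csv", index=False)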

Business idea: Job Data As A Service#

Start a data service that collects information about companies’ jobs. This can be sold to investors looking for an edge.

After posting this tweet, 80 people signed up for the trial. I emailed all of them, and most were job seekers looking for companies that used the tech they specialized in.

The more interesting use cases were sales teams and investors.

Interesting User Feedback (Persona: Investor):#

Hey Gregory, thanks for reaching out.

I always thought that job posts were a gold mine of information, and often suggest identifying targets based on these (go look at relevant job posts for companies that might want to work with you). Secondly, I also automatically ping BuiltWith from our CRM and send that to OpenAI and have a summarized tech stack created - so I see the benefit of having this as an investor.

For me personally, I like to get as much data as possible about a company. Would love to see job post cadence, type of jobs they post and when, notable keywords/phrases used, tech stack (which you have), and any other information we can glean from the job posts (sometimes they have the title of who you’ll report to, etc.).

For sales people, I think finer searches, maybe even in natural language if possible - such as “search for companies who posted a data science related job for the first time” - would be powerful.

If you do this, let me know! I’d love to hear how it goes.