How To Format Named Entities Using spaCy and Python

This post describes how to format named entities using spaCy and Python. We will look at a standalone Python application that takes a piece of text and produces HTML in which each named entity is highlighted with a background color and labeled with its entity type. The application provides a graphical user interface built with Tkinter (included with Python) and the third-party tkinterweb package. The code is walked through step by step, in more detail than many posts on the subject, and the little app also illustrates the basics of spaCy.

If you are interested in an application that provides lots of spaCy functionality, I recommend the following two posts.

Named Entity Sentence Formatting Using spaCy

The following is a screenshot from the application…

WinEntityDisplay Application

The sentence “Leonardo di ser Piero da Vinci[b] (15 April 1452 – 2 May 1519) was an Italian polymath of the High Renaissance who was active as a painter, draughtsman, engineer, scientist, theorist, sculptor, and architect.” is hardcoded into the application. You can easily change the text and rerun the program to see different formatted output.

Python Code

The following is the complete code for the application.

'''
Copyright 2022 SingerLinks Consulting

This file is part of spaCyWorkbench.
spaCyWorkbench is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
spaCyWorkbench is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
You should have received a copy of the GNU General Public License
    along with NLPSpacy. If not, see <https://www.gnu.org/licenses/>.
'''
'''
WinEntityDisplay is a standalone application that demonstrates how spaCy can format text with named entities and entity types.
'''
try:
    from tkinter import *
    from tkinter import ttk
    from tkinter.ttk import *
except ImportError:
    from Tkinter import *
    from Tkinter import ttk
    from Tkinter.ttk import *

from tkinterweb import HtmlFrame 
import spacy
from spacy import displacy
 
class WinEntityDisplay(Tk):
     
    def __init__(self):
         
        super().__init__()
        self.title("Named Entities")
        
        self.frmMain = Frame(self)
        self.HTMLText = HtmlFrame(self.frmMain, messages_enabled = False)
        
        # grid the widgets
        self.frmMain.grid(row = 0, column = 0, padx = 5, pady = 5, sticky=NSEW)
        self.HTMLText.grid(row = 0, column = 0, padx = 5, pady = 5, sticky=NSEW)

        # handle resizing
        self.grid_columnconfigure(0, weight=1)
        self.grid_rowconfigure(0, weight=1)
        self.frmMain.grid_columnconfigure(0, weight=1)
        self.frmMain.grid_rowconfigure(0, weight=1)    
        self.HTMLText.grid_columnconfigure(0, weight=1)
        self.HTMLText.grid_rowconfigure(0, weight=1)            

        # sentence text
        self.text = 'Leonardo di ser Piero da Vinci[b] (15 April 1452 – 2 May 1519) was an Italian polymath of the High Renaissance who was active as a painter, draughtsman, engineer, scientist, theorist, sculptor, and architect.'
        
        # load spaCy model
        self.loadSpacyModel()
        
        # create the doc object
        self.createDoc()
        
        # display formatted text
        self.generateDiagram()
        
        # size the window
        self.geometry("{}x{}".format(int(self.winfo_screenwidth()*.8), int(self.winfo_screenheight()*.7)))
        
    def loadSpacyModel(self):
        'load the SpaCy Model'
        self.NLP = None
        try:
            modelName = "C:\\Users\\jsing\\pyenv\\NLP\\Lib\\site-packages\\en_core_web_sm\\en_core_web_sm-3.2.0"
            self.NLP = spacy.load(modelName)
        except Exception as e:
            print("Error loading SpaCy pipeline:{} - {}".format(modelName, e))  
        
    def createDoc(self):
        '''
        Create a spacy document object from the selected text in the raw text area.
        '''
        self.doc = None
        try:
            # create the document object
            self.doc = self.NLP(self.text)
        except Exception as e:
            print("Error Creating Doc Object - {}".format(e))        
            
    def generateDiagram(self):
        # convert the iterator to a list and get the first sentence span in the document object
        self.sentence = list(self.doc.sents)[0]
        # generate html
        myHTML = displacy.render(self.sentence, style="ent")
        # replace mark tags with span tags so htmllabel will work.  
        fix1 = myHTML.replace("<mark", "<span")
        fix2 = fix1.replace("</mark","</span")
        # display the text
        self.HTMLText.load_html(fix2) 


'''start the app running'''
if __name__ == "__main__":
    app = WinEntityDisplay()
    app.mainloop()

You can copy the code above into IDLE and run it. But first you need to do two things.

  • Set up a Python virtual environment with the appropriate libraries installed. You can find the instructions to do this here.
  • Change the modelName variable to point to the spaCy model you installed (described in the post linked above).

Code Details

Let’s go through the code in detail.

Imports

try:
    from tkinter import *
    from tkinter import ttk
    from tkinter.ttk import *
except ImportError:
    from Tkinter import *
    from Tkinter import ttk
    from Tkinter.ttk import *

This is the import for Tkinter, the graphical user interface toolkit that ships with Python. I’m not going to spend any time on the Tkinter code, as there are many good tutorials on the web and I don’t really want to write a Tkinter tutorial. The try: block imports the module under its Python 3 name (tkinter); if that fails because you are running Python 2, the except: block imports it under its older name (Tkinter). In practice the application needs Python 3 anyway, because it calls super().__init__() without arguments and spaCy 3 does not run on Python 2.

from tkinterweb import HtmlFrame 
import spacy
from spacy import displacy

Here we have the imports needed for the HTML widget, spaCy, and spaCy’s visualization utility, “displaCy”.

  • from tkinterweb import HtmlFrame – HtmlFrame provides the ability to display HTML in a window of a Tkinter application. Tkinter doesn’t do this natively, so we need to install and use a third-party UI widget.
  • spacy – this is the spaCy module that does all the heavy lifting.
  • displacy – this is the spaCy module that generates the formatted HTML.

WinEntityDisplay Class

class WinEntityDisplay(Tk):

    def __init__(self):
        # lots of tkinter code
        # ... (widget setup omitted)
        # sentence text
        self.text = 'Leonardo di ser Piero da Vinci[b] (15 April 1452 – 2 May 1519) was an Italian polymath of the High Renaissance who was active as a painter, draughtsman, engineer, scientist, theorist, sculptor, and architect.'
        
        # load spaCy model
        self.loadSpacyModel()
        
        # create the doc object
        self.createDoc()
        
        # display formatted text
        self.generateDiagram()

Without going into a lot of detail, the WinEntityDisplay class is the top-level Tk object that represents the main window of the UI. The __init__ method is called to initialize the object. I left out the first part, which is a bunch of Tkinter code that defines all the window widgets.

Next you see the “self.text” variable set to a sentence text. You can replace this sentence with any text you want. This is what will be formatted.

Next “self.loadSpacyModel()” calls a method to load the spaCy language model.

Next “self.createDoc()” calls a method to create the spaCy document. This is what analyzes the text.

Next “self.generateDiagram()” calls a method to actually generate the formatted HTML from the analyzed text.

Load The spaCy Model

    def loadSpacyModel(self):
        'load the SpaCy Model'
        self.NLP = None
        try:
            modelName = "C:\\Users\\jsing\\pyenv\\NLP\\Lib\\site-packages\\en_core_web_sm\\en_core_web_sm-3.2.0"
            self.NLP = spacy.load(modelName)
        except Exception as e:
            print("Error loading SpaCy pipeline:{} - {}".format(modelName, e))  

This code loads the spaCy language model you installed as a part of the setup process.

The “modelName” variable is set to the folder that contains the language model.

The “self.NLP” variable is set to the loaded language model using the spacy.load method.

If any error occurs, the except block prints a message and “self.NLP” remains None.

Now we can analyze the text in “self.text” using the spaCy pipeline (language model) stored in “self.NLP”.
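Hard-coding an absolute path ties the script to one machine. As a minimal sketch (not the code this application uses), the helper below tries a list of candidate names or paths in order and returns the first pipeline that loads. The function name “load_first_available” and its “loader” parameter are hypothetical, introduced only for illustration. By default it calls spacy.load, which accepts either an installed package name such as en_core_web_sm or a path to a model directory, and raises OSError when the model cannot be found.

```python
def load_first_available(candidates, loader=None):
    """Return the first pipeline that loads, or None if none of them do.

    `candidates` is a list of package names or directory paths.
    `loader` is a hypothetical hook for illustration; it defaults to spacy.load.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised without spaCy installed.
        import spacy
        loader = spacy.load
    for name in candidates:
        try:
            return loader(name)
        except OSError as e:
            # spacy.load raises OSError when a model cannot be found.
            print("Could not load {}: {}".format(name, e))
    return None
```

Inside loadSpacyModel this could be used as self.NLP = load_first_available(["en_core_web_sm", modelName]), so the installed package is tried first and the explicit path is only a fallback.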

Create The spaCy Document

    def createDoc(self):
        '''
        Create a spacy document object from the selected text in the raw text area.
        '''
        self.doc = None
        try:
            # create the document object
            self.doc = self.NLP(self.text)
        except Exception as e:
            print("Error Creating Doc Object - {}".format(e))

Now we will create the spaCy Document object, which causes spaCy to perform its analysis of the text (self.NLP(self.text)), with all the results saved in the document object (“self.doc”).

An exception block will display a message if any error occurs.
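One thing to be aware of: if the model failed to load, “self.NLP” is None, the call raises a TypeError that is caught and printed, and “self.doc” stays None, so the later access to self.doc.sents fails. A minimal sketch of an explicit guard, assuming a hypothetical helper named “make_doc” that receives the pipeline as an argument instead of reading it from self:

```python
def make_doc(nlp, text):
    """Return nlp(text), or None when there is no pipeline or no text to analyze.

    `make_doc` is a hypothetical helper, not part of spaCy or the application.
    """
    if nlp is None:
        print("No spaCy pipeline loaded; skipping analysis")
        return None
    if not text or not text.strip():
        print("No text to analyze")
        return None
    # A loaded spaCy pipeline is callable and returns a Doc object.
    return nlp(text)
```

A caller would then check the result for None before trying to render anything.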

Generate The Formatted HTML Using displaCy

    def generateDiagram(self):
        # convert the iterator to a list and get the first sentence span in the document object
        self.sentence = list(self.doc.sents)[0]
        # generate html
        myHTML = displacy.render(self.sentence, style="ent")
        # replace mark tags with span tags so htmllabel will work.  
        fix1 = myHTML.replace("<mark", "<span")
        fix2 = fix1.replace("</mark","</span")
        # display the text
        self.HTMLText.load_html(fix2) 

Finally we get to the part where we generate the HTML for the sentence stored in self.text.

self.sentence = list(self.doc.sents)[0]

This line of code retrieves the first sentence in the document object. Because “self.doc.sents” is an iterator we convert it to a list. The “[0]” gets the first item in the list.
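Converting to a list works well for a single sentence, but list(...)[0] raises an IndexError when there are no sentences at all (for example, with empty text). An alternative sketch takes the first item directly from the iterator and falls back to a default. The helper name “first_item” is hypothetical, and a plain generator stands in for “self.doc.sents” here:

```python
def first_item(iterable, default=None):
    """Return the first item of any iterable without building a full list."""
    return next(iter(iterable), default)


def sentences():
    # Stand-in generator that mimics an iterator of sentences.
    yield "First sentence."
    yield "Second sentence."


first = first_item(sentences())   # takes only the first item
nothing = first_item([])          # empty input gives the default (None)
```

In the application the equivalent call would be first_item(self.doc.sents), followed by a check for None before rendering.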

myHTML = displacy.render(self.sentence, style="ent")

This line of code calls the “render” function of the displacy module and passes it the sentence we obtained from the sentence list in the previous line. The style="ent" argument selects the named entity visualizer. displaCy returns a string containing the formatted HTML.
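To make the shape of that output concrete, here is a rough sketch of the underlying idea using only the standard library: wrap each entity’s character span in a styled tag and append its label. This is not displaCy’s implementation, and its real markup and colors differ. The function name “highlight_entities”, the color table, and the (start, end, label) tuples are all assumptions made for illustration; the offsets play the role of the start_char and end_char attributes on spaCy’s entity spans. The sketch also assumes the entities do not overlap.

```python
import html

# Hypothetical colors for illustration; displaCy ships its own palette.
COLORS = {"PERSON": "#aa9cfc", "DATE": "#bfe1d9", "NORP": "#c887fb"}


def highlight_entities(text, entities, default_color="#ddd"):
    """Wrap each (start, end, label) character span in a colored <span>."""
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        # Plain text between the previous entity and this one.
        parts.append(html.escape(text[cursor:start]))
        color = COLORS.get(label, default_color)
        parts.append(
            '<span style="background: {}; padding: 0.2em;">{} <b>{}</b></span>'.format(
                color, html.escape(text[start:end]), html.escape(label)
            )
        )
        cursor = end
    # Remaining text after the last entity.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


sample = "Leonardo was an Italian polymath."
markup = highlight_entities(sample, [(0, 8, "PERSON"), (16, 23, "NORP")])
```

The resulting string could be passed to an HTML widget in the same way as the displaCy output.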

fix1 = myHTML.replace("<mark", "<span")

fix2 = fix1.replace("</mark","</span")

Now we have to “fix” the HTML generated by displaCy. The problem is that displaCy wraps each named entity in a “mark” tag, but the HtmlFrame rendering widget we are using doesn’t understand it (“mark” is an HTML5 element and the widget doesn’t support it). The solution is simply to replace the “mark” tags with “span” tags, as the two lines above do.
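The two chained replace calls can also be written as a single regular-expression substitution. The sketch below assumes a hypothetical helper named “mark_to_span” and uses a hand-written sample string rather than real displaCy output:

```python
import re


def mark_to_span(markup):
    """Rename <mark ...> and </mark> tags to <span ...> and </span>."""
    # \b leaves tags that merely start with "mark" (e.g. <marker>) untouched.
    return re.sub(r"(</?)mark\b", r"\1span", markup)


sample = '<div><mark class="entity" style="background: #ddd;">Leonardo</mark> <marker>x</marker></div>'
fixed = mark_to_span(sample)
```

The word-boundary check is the main difference from plain str.replace, which would also rewrite any tag name that happens to begin with “mark”.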

self.HTMLText.load_html(fix2)

This line of code puts the formatted and fixed HTML into the HtmlFrame widget “self.HTMLText” using the “load_html” method.

And Finally…

This post looked at code that illustrates how to format named entities using spaCy and Python. We also looked at a Python library for displaying HTML in a frame. This post is part of a series of spaCy how-to posts that I encourage you to look at.
