Text Extraction from HTML by Keyword using Python

Mateusz Wiza
5 min read · Jun 19, 2021

Recently I worked on a quick and simple, yet quite interesting project. I was given a few hundred HTML files and needed to extract all the text from each document and put it in a CSV file. It sounds easy, but there were a few caveats that required some clever solutions, so in this article I want to show you how I approached the problem.

Task

There were actually two different tasks, but let’s talk about the data first. As I mentioned, I was given hundreds of similar-looking HTML files that consisted mostly of legal texts and contracts, clearly taken from an old government site, in this case the U.S. Securities and Exchange Commission. I was also given an Excel spreadsheet with the names of all the files, which helped a lot with organizing the workflow in the script.

Figure: Part of an example HTML file as rendered by a web browser (top) and its actual source code (bottom).

The first task was fairly straightforward: get all the text from the files and export it as CSV in such a way that each row represents one file, the first column holds its name, and the columns right next to it hold the whole text extracted from the HTML file. The not-so-obvious problem that arises here is that an Excel cell can store at most 32,767 characters. The CSV format itself has no such limit, but a CSV file opened in Excel is subject to it all the same. This is not enough to store the entirety of the text from a single file in one cell, so the idea was to divide the text into strings with a maximum length of 32,767 characters and put them in consecutive columns, still in the same row.

A quick side note: I actually started wondering where the number 32,767 came from. This is the answer I found: While 32767 may seem like an arbitrary number, it’s actually the upper limit of a 16-bit signed integer (called a short in C). The range of a short goes from -32768 to 32767.

The other task appeared a bit more complicated at first but actually ended up being the easier one. As mentioned, the HTML files I was dealing with were legal texts and contracts, and each of them had a ‘definitions’ section. We wanted to compare how a certain term is defined across all the documents. So the task was to find a certain keyword and return its definition, again to a CSV file (luckily the definitions were rather short, so here there’s no need to worry about the magical number 32,767).

HTML Text Extraction to CSV

Text extraction from HTML files isn’t particularly complicated. We could obviously open the file in a web browser, which would render the text and other elements (images, embeds) according to the styling defined in the source code. From the browser, we can just copy the text and paste it wherever we need. This solution is fine for one file but not for hundreds or thousands. To scale this up, we can use a script that loads the HTML file and gets the text. The issue is that when we read the HTML file into, let’s say, Python, we load the source code with all the tags, comments, etc. The mission then is to remove the tags and other unnecessary things and leave only the text.

As I mentioned, this isn’t really complicated and involves just a few lines of code and two libraries: codecs, which is part of the Python standard library, and BeautifulSoup, which comes pre-installed with, for example, the Anaconda distribution. We can open the file with codecs and then use BeautifulSoup to divide the code into tags and content.
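A minimal sketch of this loading step might look as follows. The helper name load_soup, the UTF-8 encoding, the errors="replace" setting and the "html.parser" backend are all assumptions made for illustration, and the demo writes its own tiny sample file so that it can run on its own.

```python
import codecs
import os
import tempfile

from bs4 import BeautifulSoup


def load_soup(path, encoding="utf-8"):
    """Read an HTML file from disk and parse it with BeautifulSoup.

    The encoding and errors="replace" are assumptions; old files may
    use a different encoding, so adjust this to your data.
    """
    with codecs.open(path, "r", encoding=encoding, errors="replace") as f:
        html = f.read()
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    return BeautifulSoup(html, "html.parser")


# Demo: write a small sample file (hypothetical content), then load it back.
sample_html = "<html><body><p> Very <b> important </b> piece of code </p></body></html>"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.html")
    with codecs.open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample_html)
    soup = load_soup(sample_path)

# The parsed tree can now be navigated by tag name.
bold_text = soup.p.b.get_text(strip=True)
print(bold_text)
```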

Once we have the soup version of the original HTML file, which is now decomposed into tags and content, we can easily extract the latter into a list called strips. The list contains all the text strings from the original file, in such a format that each element of the list is a string found between two HTML tags. To make it clearer, let’s assume that we have the HTML code <p> Very <b> important </b> piece of code </p>. The resulting list of text strips would then have 3 elements and would look like this: ['Very', 'important', 'piece of code']. Such a format is useful in some cases (remember our second task?), but if we’re interested in the entire text from the HTML file, we can just concatenate all these strips of text into a single, super-long string.
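One way to get such a list is BeautifulSoup's stripped_strings generator, which yields every text string with the surrounding whitespace removed. The sketch below runs it on the example snippet; joining the strips with a single space is an assumption about how the pieces should be glued back together.

```python
from bs4 import BeautifulSoup

html = "<p> Very <b> important </b> piece of code </p>"
soup = BeautifulSoup(html, "html.parser")

# Each element is the whitespace-stripped text found between two tags.
strips = list(soup.stripped_strings)
print(strips)  # ['Very', 'important', 'piece of code']

# Concatenate the strips into one long string (space separator is an assumption).
full_text = " ".join(strips)
print(full_text)  # Very important piece of code
```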

The final step is to divide this super-long string into pieces of at most 32,767 characters, which is also quite straightforward. We can use the fact that in Python each character in a string has its own index, so a string can be sliced by position. Let’s create an empty list in which we’ll store the pieces, then take the first 32,767 characters and put them in the list as the first element. The next 32,767 characters become the second element, and we repeat this until there’s nothing left to slice.
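The slicing described above can be sketched in a few lines. The names MAX_CELL_CHARS and split_into_chunks are made up for this example, and a list comprehension stands in for the explicit loop, but the result is the same list of pieces.

```python
MAX_CELL_CHARS = 32767  # the per-cell limit discussed above


def split_into_chunks(text, size=MAX_CELL_CHARS):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[start:start + size] for start in range(0, len(text), size)]


# Demo with a dummy 70,000-character string.
long_text = "x" * 70000
list_of_strings = split_into_chunks(long_text)
chunk_lengths = [len(chunk) for chunk in list_of_strings]
print(chunk_lengths)  # [32767, 32767, 4466]
```

Joining the pieces back together gives the original string, so nothing is lost at the cut points.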

If we want to export the results as a CSV or Excel file, the best approach is to convert the list to a Pandas DataFrame. If we have several HTML files, we can append all the individual list_of_string lists to a larger list and then create a Pandas DataFrame out of this larger list. And the best thing about this is that even if we only need 10 cells to store the text of one HTML file and 20 for another, Pandas will still create the DataFrame for us without any issues, because it doesn’t require all sublists of a list to have the same length: shorter rows are simply padded with missing values.
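Here is a small sketch of that behaviour. The file names and chunk contents are placeholders, and putting the file name first in each row is an assumption that follows the layout described for the first task.

```python
import pandas as pd

# One row per HTML file: the file name followed by its text chunks.
# The rows deliberately have different lengths.
rows = [
    ["file_a.htm", "chunk 1", "chunk 2"],
    ["file_b.htm", "chunk 1", "chunk 2", "chunk 3", "chunk 4"],
]

df = pd.DataFrame(rows)
df = df.rename(columns={0: "file_name"})
print(df.shape)  # (2, 5)

# Without a path, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path, for example df.to_csv("texts.csv", index=False), where texts.csv is just a placeholder name.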

Text Extraction by Keyword

When we have all of the above, changing this code to extract the definitions by keyword is fairly simple. However, it’s important to first investigate how the keywords and definitions appear in the HTML files. In my case, the keyword is always the only word between two HTML tags, although sometimes it’s in quotation marks and sometimes not. Then, right after the closing HTML tag, there’s the definition for this keyword.

This means that when we have the list strips like in the previous task, we can just look for an element equal to either the keyword or the keyword in quotation marks. We shouldn’t use a substring check such as contains() or the construct if keyword in string, because the keyword appears in the text in multiple places but is defined only once, and that one place is what interests us. Then, once we find the correct list element, we need to get the definition, which is stored in the next list element.
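The lookup can be sketched roughly like this. The function name find_definition and the sample strips are invented for illustration, and treating curly quotes as a third accepted variant is an assumption about what the files may contain.

```python
def find_definition(strips, keyword):
    """Return the strip that follows an exact match of the keyword, or None."""
    # Exact-match variants: bare, straight quotes, curly quotes (assumption).
    candidates = {keyword, f'"{keyword}"', f"\u201c{keyword}\u201d"}
    # Stop one element early so strips[i + 1] always exists.
    for i, strip in enumerate(strips[:-1]):
        if strip in candidates:  # whole-element equality, not a substring test
            return strips[i + 1]
    return None


# Hypothetical strips, as they might come out of a definitions section.
strips = [
    "Definitions",
    '"Affiliate"',
    "means any entity controlling the Company.",
    "Each Affiliate shall comply.",
    "Borrower",
    "means the party named above.",
]

affiliate_definition = find_definition(strips, "Affiliate")
borrower_definition = find_definition(strips, "Borrower")
missing_definition = find_definition(strips, "Lender")
print(affiliate_definition)
```

Note that the strip "Each Affiliate shall comply." contains the keyword but is skipped, because only an element equal to the keyword counts as a match.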

The code in this case is rather simple. I also added a little counter to check how many times a certain definition appears in a document (it should appear either 0 or 1 times in each file). Once we have the frequency and the definition itself for each file, we can again append them to a larger list and create a Pandas DataFrame out of it. From there, exporting as either CSV or Excel is really straightforward.
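Putting the counter and the export together, a rough sketch could look like the following. The helper count_and_define, the file names, the sample strips and the column names are all hypothetical, and keeping only the first definition found is an assumption.

```python
import pandas as pd


def count_and_define(strips, keyword):
    """Count exact keyword matches and return (count, first definition or None)."""
    candidates = (keyword, f'"{keyword}"')
    count = 0
    definition = None
    for i, strip in enumerate(strips):
        if strip in candidates:
            count += 1
            # Keep the first definition only, if a following element exists.
            if definition is None and i + 1 < len(strips):
                definition = strips[i + 1]
    return count, definition


# Hypothetical per-file strips; the second file has no matching definition.
strips_by_file = {
    "contract_a.htm": [
        '"Affiliate"',
        "means any entity under common control.",
        "No Affiliate may assign.",
    ],
    "contract_b.htm": ["Preamble", "This agreement has no definitions section."],
}

results = []
for file_name, strips in strips_by_file.items():
    count, definition = count_and_define(strips, "Affiliate")
    results.append([file_name, count, definition])

df = pd.DataFrame(results, columns=["file_name", "count", "definition"])
csv_text = df.to_csv(index=False)
print(csv_text)
```

A count above 1 in the resulting table is a quick signal that a file needs a manual look.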

You can find the complete code I used for this task, including an example HTML file and results for it, in this GitHub repository: https://github.com/mateuszwiza/html-text-extraction
