by Kryztof Urban, Lead Data Scientist

Showing an ad on a website is easy. The tricky part is displaying an ad that the reader may actually find useful because it fits the context of the page. Here at GumGum we go to great lengths to figure out the visual and textual context of a website.

With respect to text, we perform Natural Language Processing (NLP) tasks like topic detection, Named Entity Recognition, Sentiment Analysis, and more. But before we can apply our higher-level analysis, we need to prepare the text. For website analysis, the first step is finding the main text and ignoring the useless bits such as HTML, CSS, code, ads, and links to other articles (also called boilerplate).

The motivation

In the past, we used the Boilerpipe library for text extraction. Boilerpipe looks at shallow text features (like link density and word count) per HTML block to determine whether that block belongs to proper content (check out the paper it’s based on to learn more). It works reasonably well, and heck, there’s now even a Python wrapper for it.
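
For reference, here's roughly what calling Boilerpipe from Python looks like; the `boilerpipe` package name and the `Extractor` interface below reflect the common python-boilerpipe wrapper and may differ slightly from the version you install:

```python
# Minimal Boilerpipe usage via the python-boilerpipe wrapper (assumed API).
from boilerpipe.extract import Extractor

html = """
<html><body>
  <div class="nav">Home | Sports | Opinion</div>
  <p>This is the main article text we actually want to keep.</p>
  <div class="footer">Copyright notice and other boilerplate.</div>
</body></html>
"""

# ArticleExtractor is the default strategy tuned for news-style article pages.
extractor = Extractor(extractor='ArticleExtractor', html=html)
print(extractor.getText())
```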
Boilerpipe is designed to work well for an “average” page. The problem is, page layout varies greatly between websites. We work with thousands of website publishers, and for too many of them Boilerpipe failed to produce acceptable results. So we did what every respectable researcher does at this point: hack the tool to buy some time. We injected domain-specific text extraction rules into Boilerpipe, but that, of course, became a pain to maintain. It was time to work on a solution that is automated, scales, performs well for any page layout, and doesn’t break the bank.


The approach

After a bit of research and testing, we decided on Dragnet. For one, it does a decent job out of the box. It is also Boilerpipe-and-then-some: it considers the same shallow text features but also looks at semantic features (read the article it’s based on for more info). Most importantly, though, it allows us to train machine learning models for text extraction for each and every one of our publishers.
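
Out of the box, extraction is essentially a one-liner. The sketch below uses the `extract_content` function from Dragnet's newer (2.x) API purely as an illustration of basic usage; training and plugging in per-publisher models is a separate step.

```python
# Basic Dragnet usage with its bundled pretrained model (dragnet 2.x style API).
import requests
from dragnet import extract_content

html = requests.get("https://example.com/some-article").text  # placeholder URL

content = extract_content(html)  # main article text, boilerplate stripped
print(content[:500])
```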

Here’s an example of Dragnet’s performance when trained on a corpus of 790 manually annotated Japanese documents. We used the Extremely Randomized Trees classifier, a more randomized variant of Random Forests. In theory, by randomizing the feature split order and thresholds, the trees are less correlated with each other, which reduces the risk of overfitting (at the cost of possibly increased variance).
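
For concreteness, here is a hedged sketch of that classifier in scikit-learn; the features, labels, and train/test split are placeholders rather than our actual Dragnet feature matrices:

```python
# Extremely Randomized Trees on placeholder block-level features.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)              # placeholder: one row of features per text block
y = rng.randint(0, 2, 1000)         # placeholder labels: 1 = content, 0 = boilerplate

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Extra Trees randomizes split thresholds (and feature order) per node,
# decorrelating the individual trees compared to a standard Random Forest.
clf = ExtraTreesClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```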

The figure above shows precision, recall, and F1-score from a per-token analysis. The x-axis shows the performance (for example, a precision of 1 means that no boilerplate was picked up), while the y-axis depicts the number of documents for which that performance was achieved.

An F1-score of 0.85 on non-English documents told us we might be on the right track (other comparisons report similar performance, though of course on different data).

The data

Machine learning – check, performance – check. But what about scalability? GumGum works with thousands of publishers. Getting, say, 1,000 pages manually annotated per subdomain is expensive. But what do you do if you’re too cheap (ahem, economically responsible) to spend big bucks on annotation and yet worried about getting good training data for your classifier? The answer (in our case) is: use Diffbot.

Diffbot offers a number of web services based on NLP and Visual Intelligence analysis. For our purposes we use their Article API, which returns JSON containing the extracted text plus some additional page information.
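
As an illustration of how the API is queried and which fields we care about, here is a minimal sketch based on Diffbot's public v3 Article API; the token and URL are placeholders, and the exact response fields may differ from what our production pipeline consumes.

```python
# Hedged sketch of a Diffbot Article API request (endpoint and fields per the public v3 docs).
import requests

DIFFBOT_TOKEN = "YOUR_TOKEN"                      # placeholder
page_url = "https://example.com/some-article"     # placeholder

resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": DIFFBOT_TOKEN, "url": page_url},
)
data = resp.json()

article = data["objects"][0]        # first extracted article object
print(article["title"])
print(article["text"][:500])        # the extracted main text we use as training labels
```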

One aspect we particularly like about Diffbot’s approach to text extraction is that it is computer-vision based. In theory, this should mirror a human annotator’s strategy. And indeed, our tests confirm that annotation quality is on par with that of third-party annotation services.

Now we were ready to put together a pipeline that continuously creates, updates, and deploys text extraction models for all our subdomains. Training data consists of Diffbot results, manual annotations, or (in many cases) a mix of both.

Our system resides in Amazon’s cloud (AWS). We use S3, a scalable storage service, to store our text extraction models and, at the same time, make them available to all of our NLP production instances.
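
A hedged sketch of that flow with boto3 (the bucket and key names are invented for illustration):

```python
# Push a freshly trained per-subdomain model to S3 and pull it from a production instance.
import boto3

s3 = boto3.client("s3")
BUCKET = "nlp-text-extraction-models"   # hypothetical bucket name

# After (re)training, upload the serialized model for a given subdomain.
s3.upload_file("models/example.com.pkl", BUCKET, "dragnet/example.com.pkl")

# Any NLP production instance can fetch the latest model at startup or on refresh.
s3.download_file(BUCKET, "dragnet/example.com.pkl", "/tmp/example.com.pkl")
```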

The results

Dragnet measures the performance of its models in terms of HTML blocks. For example, if an HTML page consisted of 20 blocks, and 15 of those were correctly classified as “boilerplate” or “not boilerplate”, then the accuracy would be 0.75. Our block accuracy averages 0.9 for any given subdomain, with little variance.
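
In code, the metric is simply per-block label agreement; a minimal illustration:

```python
# Block accuracy: fraction of HTML blocks whose boilerplate / content label is correct.
def block_accuracy(predicted, gold):
    """predicted and gold are lists of 0/1 labels, one entry per HTML block."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# 20 blocks, 15 classified correctly -> 0.75, as in the example above.
gold      = [1] * 10 + [0] * 10
predicted = [1] * 10 + [0] * 5 + [1] * 5
print(block_accuracy(predicted, gold))  # 0.75
```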

We usually use 300 files to train and test the models. Our tests show that any corpus size north of 200 files (using an ensemble of 10 trees) leads to stable classification performance.

The table below shows a direct comparison between our Boilerpipe and Dragnet approaches. Using Levenshtein distance to measure text extraction accuracy on a per-character basis, we can see that while Dragnet picks up more unwanted text (i.e., more deletions needed), it does a much better job of finding the correct text (i.e., fewer insertions needed) than Boilerpipe.
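
For concreteness, here is a sketch of such a cost-weighted Levenshtein computation (our own illustrative implementation, not Dragnet's evaluation code); setting one of the two costs to 0 isolates the other error type, as in the first two rows of the table below.

```python
# Weighted Levenshtein distance with independent insertion and deletion costs.
# "Insertion" = a gold character the extractor missed; "deletion" = an extracted
# character that is not in the gold text (unwanted boilerplate).
def weighted_levenshtein(extracted, gold, ins_cost=1, del_cost=1):
    m, n = len(extracted), len(gold)
    # d[i][j] = cost of turning extracted[:i] into gold[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if extracted[i - 1] == gold[j - 1]:
                diag = d[i - 1][j - 1]                        # characters match
            else:
                diag = d[i - 1][j - 1] + ins_cost + del_cost  # substitution = delete + insert
            d[i][j] = min(d[i - 1][j] + del_cost,             # drop an unwanted extracted char
                          d[i][j - 1] + ins_cost,             # add a missed gold char
                          diag)
    return d[m][n]

print(weighted_levenshtein("main text plus an ad", "main text", ins_cost=0, del_cost=1))  # 11 extra chars
print(weighted_levenshtein("main", "main text", ins_cost=1, del_cost=0))                  # 5 missing chars
```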


Cost of INSERTION   Cost of DELETION   Total Cost Boilerpipe (TWS)   Total Cost Dragnet
1                   0                  332,001                       24,919
0                   1                  655                           52,766
1                   1                  332,656                       77,685


Digging deeper into the relatively large percentage of unwanted text extracted by Dragnet, we realized that about half of the extra text consists of image captions (which may or may not be considered part of the good text).

And thus we have a scalable process to automatically create domain-specific text extraction models with state-of-the-art performance. Looking forward to your comments and suggestions!
