
Fixing SEO Site Migrations With Machine Learning [Intersport Search Awards Case Study]

Site migrations are a fact of life — sooner or later, most websites go through the process. Whether big or small, migrations tend to give SEOs plenty of work to ensure that not just users, but ranking signals as well, are properly transferred over to the new site.

For Intersport Romania, one of the largest retailers in the Romanian sportswear niche, the migration to a new platform was supposed to bring the site in line with newer technologies and a better user experience.

Unfortunately, the process didn’t take SEO into account from the beginning, which, coupled with a complete change of URL structure for the site, led to a massive drop in organic rankings and traffic.

With thousands of pages to redirect and very little available data, we decided to forgo doing everything manually and instead bring one of Google’s very own tools into the mix: the BERT machine learning model for language processing.

NOTE: You can find an updated version of the script used below over here (make sure to create a copy to use it in your own project): Vertify – String Similarity with BERT.ipynb (GitHub version)

Pre-migration status

We’ve been working with Intersport Romania since July 2019, when we embarked on a steady process of identifying the site’s technical issues and optimizing its landing pages.

Despite the site running on an old custom CMS, fixing a couple of indexing problems and optimizing its landing pages brought significant results, with traffic and rankings slowly moving up over the first 12 months.

Visibility score / Impression share (~1,400 keywords) for July 2019 – June 2020

Fast forward to June 2020, when we were excited to find out that the site would finally be migrated to a newer, better CMS that Intersport was already using in other European countries.

Unfortunately, due to certain time constraints, the site was fully migrated on July 6th, 2020, without any SEO input. Additionally, since the new platform had an entirely different URL structure, the vast majority of category and product pages from the old site were redirected to the homepage.

As one could expect, the results were devastating. Within a few days of the migration, we had lost almost all the visibility we had gained during the past year:

Visibility score / Impression share (~1,400 keywords) for June 2020 – July 2020

Because most of the old categories and products were now redirected to the homepage, Google had basically started to deindex those URLs, which led to significant ranking drops:

| Keyword | Search Volume | Rank (migration day) | Rank (~7 days post-migration) |
| --- | --- | --- | --- |
| adidas | 151K | 56 | 100 ↓ |
| adidasi nike (en. ‘nike sport shoes’) | 60.1K | 17 | 27 ↓ |
| trening dama (en. ‘women’s sweatpants’) | 33.1K | 8 | 26 ↓ |
| salomon | 15.4K | 9 | 19 ↓ |
| hanorace (en. ‘hoodies’) | 14.8K | 19 | 44 ↓ |

Ranking drops after migration

It was clear that we needed to reverse this situation as fast as possible and redo all of the site’s redirects to properly transfer ranking signals from the old site to the new one, before Google had a chance to ‘cement’ the post-migration rankings.

Our main objective was thus to reverse the loss in rankings by implementing a proper redirect strategy within two months at most (to avoid permanent loss of ranking signals from the old site’s pages).

To make matters worse, we had no access to the previous site — the old platform had been completely deleted from the server, with no backups, database exports or anything else that would have allowed us to automate at least some redirects using ID or SKU matching.

Given this, as well as the time-critical nature of the task, we realized there was no way we could do it completely manually, at least not in a way that would cover a significant part of the site.

There was, however, something else that could help us.

Introducing NLP and BERT into our toolset

In the months prior, we had experimented with machine learning (NLP) models to automate some of our keyword research tasks by comparing the ‘similarity’ (basically the ‘meaning’) between keywords and landing page titles. Among the options that can be used for this purpose is BERT, an open-source language model made available by Google and used by it to better understand what content is actually about.

The strategy was this:

If we could leverage NLP to ‘understand’ what the pages from the old site were about (based on data we had from our old crawls, like the title or H1 tags), perhaps we could automatically predict what would be the best page on the new site to redirect them to.

That would severely cut down on man-hours, since our manual efforts would then be limited to simply verifying whether each prediction was correct, and adjusting it accordingly if not.

As mentioned earlier, one of the main reasons we decided to try out NLP for this project was the lack of data from the old website. All we had left was an older Screaming Frog crawl from a few months prior to the migration, which limited us to basically only the URL, meta title and H1 tags to understand what the old pages were about:

Crawl data for the old site

This, plus some Search Console data regarding traffic, was all we had left.

Using BERT cosine similarity scores to associate old and new URLs

For this project we used Google Colab, a fantastic Google product that allows you to use a browser-based Python ‘notebook’ that leverages Google’s resources (processor, memory and GPU power — all great for machine learning projects)… for FREE!

Moving forward, our steps were:

Step 1. Find a pre-trained BERT model for the Romanian language

This was the easy part — for BERT, as for other NLP techniques, there are generally pre-trained models (including ones provided by the ‘official’ BERT team) that have already learned the associations between words and phrases, and are freely available for anyone to use.

We found a pre-trained BERT model created specifically for the Romanian language, which meant our accuracy would be better than with a more generic multi-language model.

This is the generic code needed to load a pre-trained model into our Google Colab notebook:

Code for loading a BERT model into Python / Google Colab

That’s it: with just four lines of code, we now had the model loaded up and ready to be used.

Step 2. Fetch and ‘clean’ the old and the new data into simple word strings

Next, we needed to get the names of both the new and the old pages from our Google Sheets spreadsheet (we used the H1 tag in most cases), after which we would remove stopwords, numbers and symbols, and make everything lowercase:

Code for retrieving data from a spreadsheet and cleaning it up

Here’s how the H1 tags ended up looking after the clean-up:

How the cleaned up data looks compared to the original data from the spreadsheet
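In rough terms, the clean-up logic looked like the sketch below (the Romanian stopword list here is a small illustrative sample, not the full list we used, and fetching the titles from Google Sheets is omitted):

```python
import re

# Illustrative sample of Romanian stopwords (the real list was longer)
RO_STOPWORDS = {"si", "de", "cu", "la", "pentru", "din", "pe"}

def clean_title(title: str) -> str:
    """Lowercase a page title, strip numbers/symbols, and drop stopwords."""
    text = title.lower()
    # Keep only letters (including Romanian diacritics) and spaces
    text = re.sub(r"[^a-zăâîșşțţ ]+", " ", text)
    words = [w for w in text.split() if w not in RO_STOPWORDS]
    return " ".join(words)

print(clean_title("Pantaloni Scurti de Barbati (2020)"))  # → pantaloni scurti barbati
```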

Step 3. Compare the data in terms of similarity using BERT

And now, for the actual ‘magic’. This is the point where we use BERT in order to compare the similarity between each of our old page titles and the set of new ones in order to figure out which are closer in meaning.

To provide a bit of context, BERT (as well as other NLP techniques such as word2vec) does this by transforming words and strings into numbers (actually vectors of numbers, aka ‘embeddings’), and then calculating the ‘distance’ between them. The closer these embeddings are, the more similar they are.

You’ve probably heard something about this before, with the very common example of kings and queens:

Basically, the distance between the vector that represents ‘king’ and the one that represents ‘queen’ is the same as the distance between ‘man’ and ‘woman’ — it’s a sort of representation of the relationship between words, which also allows you to do neat math like:

king − man + woman = queen

BERT simply takes this to the next level, being able to better create these embeddings at sentence and phrase level, not just for simple words.
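As a toy illustration with hand-made two-dimensional vectors (real embeddings have hundreds of dimensions and are learned, not hand-crafted), the analogy math looks like this:

```python
import numpy as np

# Toy embeddings: one axis roughly encodes 'royalty', the other 'gender'
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to 'queen'
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # → queen
```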

Getting back to our project, we simply converted our H1 tags from text to embeddings and calculated the distances between them with just a couple of lines of code (literally most of the code here is for printing the results):

Code for comparing the old titles to the new ones for similarity
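Under the hood, the comparison boils down to ranking the new titles by cosine similarity against each old title’s embedding. A condensed sketch, assuming the titles have already been encoded into vectors (the 3-dimensional arrays below are stand-ins; in the project the embeddings came from the Romanian BERT model):

```python
import numpy as np

def top_matches(old_emb, new_embs, new_titles, k=5):
    """Return the k most similar new titles for one old-title embedding,
    ranked by cosine similarity (1 = closest in meaning, 0 = unrelated)."""
    # Normalize so a plain dot product equals cosine similarity
    old_norm = old_emb / np.linalg.norm(old_emb)
    new_norm = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    scores = new_norm @ old_norm
    order = np.argsort(scores)[::-1][:k]
    return [(new_titles[i], float(scores[i])) for i in order]

# Stand-in embeddings for three new page titles
new_titles = ["accesorii schi", "sandale copii", "pantaloni scurti barbati"]
new_embs = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.9]])
old_emb = np.array([0.8, 0.2, 0.1])  # pretend this encodes 'echipament ski'

print(top_matches(old_emb, new_embs, new_titles, k=2))
```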

And now, let’s check what the results looked like. For each old page title, here are the top 5 new page titles ordered by their similarity (1 is the maximum, 0 the minimum):

Output from the BERT model

You can see that for the first two examples (en. “Men’s Shorts” and en. “Kids’ Sandals”), the score is high because it’s either literally the same title on both the old and the new site (first example) or there’s a good partial match (second example).

The third example (en. “Ski Equipment”) is where you can best see BERT working its magic. There’s no page named “Ski Equipment” on the new site, but there is one for “Ski Accessories”, though it uses the Romanian spelling of “ski” (“schi”). So we don’t even have a partial match here.

This is no problem for BERT, which knows to match these two strings properly, just as Google understands that, in Romanian, “schi” = “ski”, and that the word for “accessories” is very close in meaning to “equipment”. Thus, BERT correctly predicts that this would be the most relevant page on the new site to redirect the old one to.

Pretty neat, right? Oh, and this took less than 5 seconds to run 🙂

Step 4. Pick the most similar new pages for our redirect list and add them to our Google spreadsheet

We used the same code as above, except we now took only the top 2 new page titles and stored them in a pandas dataframe (which you can think of as a table), to make it easy to upload them to Google Sheets afterwards:

Code for putting the old and new page titles into a dataframe/table
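In pandas terms, this step amounts to something like the sketch below (the column names and example rows are illustrative, not necessarily the ones from our notebook):

```python
import pandas as pd

# For each old page title, its two best-scoring new-title predictions
rows = [
    {"old_title": "Pantaloni Scurti Barbati",
     "prediction_1": "Pantaloni Scurti Barbati",
     "prediction_2": "Pantaloni Barbati"},
    {"old_title": "Echipament Ski",
     "prediction_1": "Accesorii Schi",
     "prediction_2": "Clapari Schi"},
]
predictions = pd.DataFrame(rows, columns=["old_title", "prediction_1", "prediction_2"])
print(predictions)
```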

And now, all that was left was to write the ‘prediction’ columns to our spreadsheet, making sure we sorted them to match how our spreadsheet was sorted:

Code for writing the dataframe/table into Google Sheets

The last four lines of code simply write the two prediction columns from our dataframe to our working spreadsheet:

How Google Sheets looks when the script is running

Step 5: Manually check how correct the predictions are

With this data, all our team had to do was look over these columns, choose the most appropriate prediction (some will inevitably be wrong, depending on how well the H1 tags had been optimized), and replace it manually where necessary.

Then, a quick VLOOKUP to get the URLs based on the new page titles we ultimately approved, and that was that: we now had our final redirect list for the development team to implement:

Sheet that contains the current list of redirects and our optimized options
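Incidentally, if you prefer to stay in Python rather than Sheets, the VLOOKUP step has a straightforward pandas equivalent (a sketch with hypothetical URLs and column names):

```python
import pandas as pd

# Approved new title for each old URL (what the team validated manually)
approved = pd.DataFrame({
    "old_url": ["/echipament-ski", "/sandale-copii"],
    "approved_title": ["Accesorii Schi", "Sandale Copii"],
})

# Crawl of the new site: title → URL
new_site = pd.DataFrame({
    "approved_title": ["Accesorii Schi", "Sandale Copii"],
    "new_url": ["/accesorii-schi", "/copii-sandale"],
})

# Left join on the title column — the pandas version of a VLOOKUP
redirects = approved.merge(new_site, on="approved_title", how="left")
print(redirects[["old_url", "new_url"]])
```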

Sure, there was still manual work involved in reviewing everything and making the appropriate changes, but overall our time was reduced by almost 75% compared to doing it the ‘classic’ (fully manual) way!

Aftermath

Within one month of the migration, we managed to complete and implement the first half of our redirects (mostly category pages), and 20 days later we finished the second half (mostly product pages).

One month later, we had almost completely recovered our rankings:

SEOmonitor visibility score (~1,400 keywords) for July 2020 – September 2020

This was clearly visible in the ranking changes for most of our competitive keywords, which in certain cases even reached better positions than before the migration:

| Keyword | Search Volume | Rank (migration day) | Rank (~7 days post-migration) | Rank (~60 days post-redirects) |
| --- | --- | --- | --- | --- |
| adidas | 151K | 56 | 100 ↓ | 32 ↑ |
| adidasi nike (en. ‘nike sport shoes’) | 60.1K | 17 | 27 ↓ | 17 ↑ |
| trening dama (en. ‘women’s sweatpants’) | 33.1K | 8 | 26 ↓ | 10 ↑ |
| salomon | 15.4K | 9 | 19 ↓ | 12 ↑ |
| hanorace (en. ‘hoodies’) | 14.8K | 19 | 44 ↓ | 10 ↑ |

Ranking recoveries after implementing redirects

As such, by leveraging a pre-trained BERT model to automate a large part of the work, we managed to complete our objective of reversing our post-migration rankings loss with minimal manual effort.

Using the NLP approach described above, our project managed to snag no fewer than two European Search Awards and two Global Search Awards at the 2021 editions!

Global Search Awards: Vertify & Intersport
Global Search Awards Winner

If you’d like to use the script in your project, feel free to create a copy of it from here.