I once worked at a Belgium-based educational NGO that cooperated with high schools all around the country. For outreach purposes, we had always wished for a list of contact information for all the educational institutions in Belgium, but such a list seemed too perfect to exist. Then again, it’s public information that surely exists somewhere. So let’s scrape it!
Firstly, a little bit of context about how the educational system in Belgium (and the country itself) works, because it’s a tad complicated. As you may know, Belgium is divided into 3 semi-autonomous regions: Flanders, Wallonia and Brussels. What you may not know is that it’s also divided into 3 communities, based on language. So there’s a Flemish community that spans all of Flanders and parts of Brussels, where they speak Dutch (Flemish), a French community covering most of Wallonia and the remaining parts of Brussels, and there’s also a small German-speaking community comprising several towns and villages at the eastern end of Wallonia.
Simply speaking, Belgium has several governments at the federal (country), regional and community levels, each with varying competencies, and it’s the three community governments that are responsible for education. Therefore, we effectively have 3 different educational systems, each with its own set of schools and its own way of storing contact information.
It quickly turned out that acquiring the contact information for schools in the Flemish and German-speaking communities is quite simple and straightforward. The most complicated case was the French community, which required coding a web scraper. So let me walk you through all three stories.
Finding the contact information for Flemish (Dutch-speaking) schools was the easiest because the Flemish community government has some quite advanced solutions already in place. First of all, the education ministry has a proper API Portal from where you can get some useful data. But more importantly, they have lists of all the schools and their contact information ready to be downloaded as CSV!
It’s as easy as going to this website, where you can select the type of schools, and once you have it, you’ll see a list like the one below. I’m selecting ‘Scholen voltijds gewoon secundair onderwijs’, which translates to ‘full-time, ordinary secondary education schools’, because this category covers the high schools I’m interested in. From there, I’m presented with the table and a ‘Download’ button that allows me to export the entire dataset to a CSV file — not only the Name and Address that are in the table, but also other details such as phone numbers and email addresses!
This community is the smallest one in Belgium, comprising only 9 municipalities and approximately 78,000 inhabitants. This small size has a certain disadvantage: the community government’s IT systems aren’t as advanced as, for example, those in Flanders. But it also has one major advantage — there are only a few schools there, so it’s feasible to find their websites and manually copy the contact details.
It required a little knowledge of the German language, but I finally managed to find a list of schools on the community’s website. It’s only a subpage that lists the names of the schools, nothing more. I counted 7 high schools, simply googled each of them, and copied the contact information from their websites.
Since the government of the French community isn’t known for very sophisticated IT systems, I expected this part of the data acquisition to be the most complicated, and to a certain extent, I was right. To say the least, taking the screenshots you see below took me literally hours today, because the official website kept crashing every couple of seconds for no apparent reason.
It’s not all negative, though: I was positively surprised to see that they actually have a neat list of all the schools in the community, together with all the necessary contact information. It is available on this site, and you can see the top of the table below.
This is a good thing, because if they only gave us the names of schools, we’d still need to crawl all around the web to find the necessary details. However, there’s also a catch. The table on the Flemish site looked very similar, but it also had a Download button that exported the entire dataset. Here, such a button is missing. One good thing is that the table is displayed all at once, i.e. it’s not divided into pages. Because of that, and also given that it’s a standard HTML table without anything fancy, you can just copy all the text from the table, paste it into Excel, and you have a nice dataset with names, addresses and phone numbers for all the schools.
The thing is that we’re not so much interested in phone numbers as in email addresses. And this is where things get complicated, because even though details like the email address are there, in order to see them you need to click the magnifying glass icon at the right end of each row, which takes you to a page like the one below. There you can see more details, including an email address, but only for one school at a time.
There’s no way I’m going to click through all 206 rows to get each email address manually. Fortunately, we have a blessing in the form of the Python programming language, which we can easily use to create a web scraper. In this case, I went with the most obvious choice: the Selenium library.
The first step is quite obvious: explore the structure of the initial website, the one with the table:
Luckily, everything here is quite standard, so we can write a script that, given the name of a school (and its address, just to be sure), finds the link to its subpage. To achieve this, we can follow these steps:
- Initialize and open the web driver (in this case I’m using Chrome);
- Go to the website with the table;
- Find the table, knowing that its ID is ‘liste_etablissements’;
- Iterate through all the rows (<tr>) in the table;
- For each row, extract the cells (<td>);
- If the first cell is equal to our name and the second cell to our address, we have found our school!
- In this row, find the only link (<a>) and save it.
And here’s how to implement it in Python:
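A rough sketch of those steps, in modern Selenium (4.x) syntax, could look like this. The URL is a placeholder (the post links the real page above), and everything except the table id ‘liste_etablissements’ is an assumption; the Selenium import sits inside the function so that the small matching helper stays usable even without a browser installed.

```python
LIST_URL = "https://example.org/annuaire"  # placeholder — the real URL is linked above


def row_matches(cells, name, address):
    """True when a row's first two cells equal the school's name and
    address (compared after stripping surrounding whitespace)."""
    return (len(cells) >= 2
            and cells[0].strip() == name.strip()
            and cells[1].strip() == address.strip())


def find_school_link(url, name, address):
    """Open the page with the table and return the detail-page URL for the
    school with the given name and address, or None if it isn't found."""
    # Imported here (requires `pip install selenium` plus a ChromeDriver)
    # so that row_matches above works without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                                    # step 1: open the driver
    try:
        driver.get(url)                                            # step 2: go to the page
        table = driver.find_element(By.ID, "liste_etablissements")  # step 3: find the table
        for row in table.find_elements(By.TAG_NAME, "tr"):         # step 4: iterate rows
            cells = [td.text for td in row.find_elements(By.TAG_NAME, "td")]  # step 5: cells
            if row_matches(cells, name, address):                  # step 6: our school?
                # step 7: the row's only link points at the subpage
                return row.find_element(By.TAG_NAME, "a").get_attribute("href")
        return None
    finally:
        driver.quit()
```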
Now that we have the link to our school, we can go to its subpage and get its email address. But there’s another catch. While the structure of the initial website was simple, that’s not the case with the subpages:
Can you see what’s happening here? The data are stored in a table, but to make it more difficult, there are actually 2 tables, one embedded in the other. To make things worse, both of them have the same class ‘justatable’ and no individual IDs! That’s why simply using find_element_by_class_name() in Selenium won’t help. Luckily, I discovered that when I look for an element by its tag, so in this case <table>, it actually finds the embedded table that we’re interested in. From there, we can perform a similar procedure as before: navigate to the proper row and cell, and get its text. In this case we don’t even need loops to search for the right cell, because we know where the email address is (and fortunately, it’s always in the same place!).
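To illustrate the nesting problem without a live browser, here is a browser-free sketch using only the standard library’s html.parser: it tracks how deeply tables are nested, so the embedded table can be picked out even though both share the class ‘justatable’. The row and column of the email cell are placeholders — read the real positions off the subpage’s markup. In Selenium, the equivalent is the tag-based lookup described above.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects (nesting_depth, rows) for every table on a page, where
    rows is a list of lists of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # (depth, rows) for every table encountered
        self._stack = []   # one rows-list per currently open table
        self._cell = None  # text of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            rows = []
            self.tables.append((len(self._stack) + 1, rows))
            self._stack.append(rows)
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack:
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self._stack.pop()
        elif tag in ("td", "th") and self._cell is not None:
            self._stack[-1][-1].append(self._cell)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data


def innermost_table(html):
    """Return the rows of the most deeply nested table on the page."""
    parser = TableParser()
    parser.feed(html)
    return max(parser.tables, key=lambda t: t[0])[1]


def email_from_detail_html(html, row, col):
    """Read the email cell at (row, col) of the embedded table.
    The coordinates are placeholders — check them against the real page."""
    return innermost_table(html)[row][col].strip()
```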
As I mentioned, I decided not to iterate through the entire table on the initial page and click the link in each row, even though in theory we could. I quickly realized that the website is quite fragile — it crashes easily, and once it crashes, the scraper crashes with it. I could handle that case and just wait, but for that I’d need to know how long it takes to start working again, which I haven’t been able to determine.
Instead, I use the fact that we can easily copy the initial table and paste it into Excel to create a CSV. Then I can feed the CSV into my script, iterate through the rows, and for each one add the email address using the procedure I described. And if it crashes in the middle, the addresses we’ve got so far are saved (either directly in the DataFrame, or we can store them in a separate list, print them if necessary, and manually paste them into the CSV).
One more note: it turned out that while each French-speaking school has its own subpage with more details, the email address is often missing. Namely, only 96 out of 206 schools have an email address listed. Fortunately, if it’s missing, the field is still there — it just says ‘aucun’ (French for ‘none’) instead of an address — and thanks to this, the scraper doesn’t crash.
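Putting the pieces together, the CSV-driven loop might look like the sketch below. Here fetch_email stands for the Selenium routine that opens a school’s subpage and reads the email cell; the ‘aucun’ normalization and the row-by-row writing (so a crash loses at most the current school) are the substantive parts. The column layout of the copied CSV (name first, address second) is an assumption.

```python
import csv


def normalize_email(raw):
    """The detail page shows 'aucun' (French for 'none') when no email is
    listed; map that, and empty cells, to an empty string."""
    raw = (raw or "").strip()
    return "" if raw.lower() == "aucun" else raw


def add_emails(in_path, out_path, fetch_email):
    """Read the CSV copied from the site, look up each school's email with
    fetch_email(name, address), and write a new CSV with an extra email
    column. Rows are flushed one by one, so the addresses collected so
    far survive a crash."""
    with open(in_path, newline="", encoding="utf-8") as f_in, \
         open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            name, address = row[0], row[1]  # assumed column order
            writer.writerow(row + [normalize_email(fetch_email(name, address))])
            f_out.flush()
```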
You can find the entire code I used for this scraping task in this GitHub repository. I also included all my result files there, i.e. the complete contact details for all high schools in all of Belgium. I reckon it’s legal for me to do this, given that it’s public information you can download from the internet anyway.