Extracting the web data from multiple pages and tables

@Tanzeel,

Unfortunately, LinkedIn is not a good choice, because Microsoft employs a number of measures to actively block web scraping of the site unless you are a Company Administrator. However, I did find a good example site (Harvard Medical School) that is ill-behaved enough to be interesting, but well-behaved enough to be scrapeable with some fairly straightforward adaptations to the standard approach.

I built a scraper for the site and attached a copy at the end of this post. Here’s the basic information I scraped. The unfiltered table was over 15,000 rows, but it included admin staff, lecturers, students, etc. The final filtered table contains only professors, but as you can see, many professors hold multiple appointments, so the number of unique individuals is far fewer than the 5,000+ rows of the filtered table:

Here are the general steps I took in creating this:

  1. First I tried to scrape the department URLs, in order to form the table over which we would iterate to pull together information for all departments. Here’s the departments page: https://hms.harvard.edu/departments

This initial scrape did not work; however, using the “Add Table Using Examples” feature, I was able to get Power Query to scrape that table from the page properly.
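
For reference, the query that “Add Table Using Examples” generates for a page like this is typically built on `Html.Table`. A minimal sketch in M, assuming illustrative CSS selectors (the actual selectors Power Query generated will differ):

```
let
    Source = Web.BrowserContents("https://hms.harvard.edu/departments"),
    Departments = Html.Table(
        Source,
        {
            // Department name from the link text, URL from its href attribute
            {"Department", "A"},
            {"URL", "A", each [Attributes]?[href]?}
        },
        // RowSelector is an assumed placeholder for the repeating list element
        [RowSelector = "DIV.department-listing LI"]
    )
in
    Departments
```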

  2. Then I went to the “People” section of the first department and tried to pull the faculty information off that page. That partially worked: I was able to get about half the information on the page. Again using the “Add Table Using Examples” function, I was able to pull the remaining information into a second table.
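
A hedged sketch of what those two passes over the same page can look like, and one way to stitch them back together column-wise. The URL, selectors, and field names here are all illustrative assumptions:

```
let
    // Hypothetical People-page URL for the first department
    Source = Web.BrowserContents("https://hms.harvard.edu/genetics/people"),
    // First pass: the fields the straight scrape picked up
    Basics = Html.Table(
        Source,
        {{"Name", "H3"}, {"Title", "P.title"}},
        [RowSelector = "DIV.person"]),
    // Second pass, built with "Add Table Using Examples": the rest
    Extras = Html.Table(
        Source,
        {{"MailTo", "A.mail", each [Attributes]?[href]?}},
        [RowSelector = "DIV.person"]),
    // Stitch the two passes together column-wise
    Combined = Table.FromColumns(
        Table.ToColumns(Basics) & Table.ToColumns(Extras),
        Table.ColumnNames(Basics) & Table.ColumnNames(Extras))
in
    Combined
```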

  3. I then took the two tables created in step 2 and turned them into functions that iterate over the table created in step 1. If you’re not sure how to do this, I’ve got a video coming out on YouTube next Tuesday that explains the process in detail, and/or you can examine the M code in the file below for the two functions created.
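
The convert-to-function-and-invoke pattern looks roughly like this in M. This is a sketch, not the exact code in the attached file: `fnGetFaculty`, the selectors, and the column names are assumptions, and `Departments` stands for the table scraped from the departments page:

```
let
    // Parameterize the page scrape on the department's People-page URL
    fnGetFaculty = (PeopleUrl as text) as table =>
        let
            Page = Web.BrowserContents(PeopleUrl),
            Faculty = Html.Table(
                Page,
                {{"Name", "H3"},
                 {"Title", "P.title"},
                 {"MailTo", "A.mail", each [Attributes]?[href]?}},
                [RowSelector = "DIV.person"])
        in
            Faculty,
    // Invoke the function once per department row, then expand the
    // nested tables into a single combined faculty table
    WithFaculty = Table.AddColumn(Departments, "Faculty",
        each fnGetFaculty([URL])),
    Combined = Table.ExpandTableColumn(WithFaculty, "Faculty",
        {"Name", "Title", "MailTo"})
in
    Combined
```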

  4. After executing both functions, I referenced the resulting scraped table, renamed it Professors, and filtered the Title field down to only those records that contain “Professor” somewhere in the title.
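
That filter is a one-step `Table.SelectRows`; a sketch, assuming the combined scrape is named `Combined` and the title column is `Title`:

```
let
    // Reference the combined scraped query, then keep only rows whose
    // Title contains "Professor" anywhere in the text
    Source = Combined,
    Professors = Table.SelectRows(Source,
        each Text.Contains([Title], "Professor"))
in
    Professors
```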

  5. Then, through a combination of transformations on the URL and mailto: fields, I created two new fields in the Professors table – one for department, to allow filtering/slicing the data by department, and one for email, which just cleans up the formatting of the scraped mailto: field.
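
Those two derived fields can be sketched as follows. The delimiter pattern for extracting the department from the URL and the `MailTo` column name are assumptions about the shape of the scraped data:

```
let
    // Derive Department from the page URL; the delimiter pattern is an
    // assumption about how the HMS URLs are structured
    AddDept = Table.AddColumn(Professors, "Department",
        each Text.BetweenDelimiters([URL], "hms.harvard.edu/", "/")),
    // Strip the "mailto:" prefix to leave a plain email address
    AddEmail = Table.AddColumn(AddDept, "Email",
        each Text.Replace([MailTo], "mailto:", ""))
in
    AddEmail
```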

I hope this is helpful. Please let me know if you have any questions and how you’d like to proceed from here.
