Teach Guide

Digital Teaching Platform
Tech

Overview

The main requirement was to gather data about online teaching platforms: which courses they offer, how many authors are on each e-learning platform and who they are, and the metadata for each course, such as categories, course details, co-authors, and schedule.

We did this by storing the sitemap data of all these sites in the cloud in a customized JSON format and then triggering a SQL stored procedure that compares the already existing data with the new files, updates existing courses if anything has changed, and adds any new data.
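
As a rough illustration, a single course record in the customized JSON format might look like the sketch below. The field names are assumptions for illustration, not the exact schema used; the real records carried the same kinds of data (platform, course details, categories, authors and co-authors, schedule).

```python
import json

# Illustrative only: a single course record in the customized JSON format.
# All field names here are hypothetical.
course_record = {
    "platform": "Coursera",
    "course_url": "https://www.coursera.org/learn/example-course",
    "title": "Example Course",
    "categories": ["Data Science", "Python"],
    "authors": ["Jane Doe"],
    "co_authors": ["John Smith"],
    "schedule": {"start_date": "2021-01-04", "duration_weeks": 6},
    "scraped_at": "2021-01-01T00:00:00Z",
}

print(json.dumps(course_record, indent=2))
```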

→ Creating a common architecture that generalizes scraping all sitemaps and gathers common data from different sites such as Udemy, Udacity, Coursera, etc. (see the sketch after this list).
→ Uploading large volumes of scraped data to the Azure cloud and dumping the data into the warehouse in Azure.
→ SQL procedures to check for and insert only data that does not already exist in the warehouse.
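
A minimal sketch, under stated assumptions, of how a generalized sitemap spider and the Azure upload step could fit together. The sitemap URLs, CSS selectors, container name, and environment variable are placeholders; in practice each platform supplies its own parsing rules, and the pipeline must be registered in the project's ITEM_PIPELINES setting.

```python
import json
import os

from azure.storage.blob import BlobServiceClient
from scrapy.spiders import SitemapSpider


class CourseSitemapSpider(SitemapSpider):
    """Generic sitemap spider; per-site subclasses override the parse rules."""
    name = "course_sitemap"
    sitemap_urls = ["https://www.example-elearning.com/sitemap.xml"]  # placeholder
    sitemap_rules = [("/course/", "parse_course")]  # only follow course pages

    def parse_course(self, response):
        # Selectors are illustrative; each platform needs its own mapping.
        yield {
            "platform": self.name,
            "course_url": response.url,
            "title": response.css("h1::text").get(),
            "authors": response.css(".instructor::text").getall(),
        }


class AzureBlobPipeline:
    """Collects scraped items and dumps them to Azure Blob Storage as JSON."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Connection string and container name are assumptions.
        client = BlobServiceClient.from_connection_string(
            os.environ["AZURE_STORAGE_CONNECTION_STRING"]
        )
        blob = client.get_blob_client("sitemap-dumps", f"{spider.name}.json")
        blob.upload_blob(json.dumps(self.items), overwrite=True)
```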

Technology

  • Python
  • Scrapy
  • Microsoft Azure Cloud

Key Technical Challenges:

  • The most challenging part was that each site has thousands of pages, and there is no guarantee that all of the sitemaps and HTML pages are structured correctly. Getting accurate data in this situation is difficult because you cannot manually examine every page on every site. Storing this information in the cloud and managing space and memory is also hard because the data keeps growing, so we had to come up with a solution that stores only the required data we do not already have; once that data lands in the database, cleanup procedures need to run.
    Another challenge was completing the scraping and crawling process in a timely manner: with data this large, from 11 e-learning platforms, a naive crawl can take hours, which is not efficient scraping. So we built several spiders to run in parallel (see the sketch after this list), and Scrapy's asynchronous behavior also helped a lot.
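
A minimal sketch of running several spiders in parallel inside a single Scrapy process. The per-platform spider classes and their import path are placeholders for illustration.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Placeholder import: one spider class per e-learning platform.
from project.spiders import UdemySpider, UdacitySpider, CourseraSpider

process = CrawlerProcess(get_project_settings())

# Scheduling all spiders before process.start() lets Scrapy's Twisted reactor
# run their requests concurrently instead of crawling one site after another.
for spider_cls in (UdemySpider, UdacitySpider, CourseraSpider):
    process.crawl(spider_cls)

process.start()  # blocks until every spider has finished
```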

Business + Technical Points:

  • The data warehouse in Microsoft Azure must receive only new data, and those checks must happen at the SQL procedure level, not in Python (see the sketch after this list).
  • All sitemaps must be executed in the minimum possible time, even though sites may have thousands of pages; we executed them in parallel.
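
A rough sketch of keeping the new-data check at the SQL level: Python only stages the scraped rows and calls a stored procedure, and the logic inside that procedure (for example a MERGE keyed on the course URL) decides what to insert or update. The connection string, table, column, and procedure names below are assumptions for illustration.

```python
import pyodbc

# All names here (server, database, staging table, procedure) are assumptions.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example.database.windows.net;DATABASE=warehouse;"
    "UID=loader;PWD=secret"
)
cursor = conn.cursor()

# Example scraped rows; in practice these come from the JSON dumps in Azure.
scraped_rows = [
    ("https://www.udemy.com/course/example/", "Example Course", "Udemy"),
]

# Python only stages the data; it performs no existence checks itself.
cursor.executemany(
    "INSERT INTO staging_courses (course_url, title, platform) VALUES (?, ?, ?)",
    scraped_rows,
)

# The stored procedure owns the new-vs-existing logic, e.g. a MERGE on
# course_url that updates changed rows and inserts rows not yet in the warehouse.
cursor.execute("{CALL upsert_courses_from_staging}")

conn.commit()
cursor.close()
conn.close()
```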