The main requirement was to gather data from online teaching platforms: which courses each one offers, how many authors each e-learning platform has and who they are, and the metadata for each course, such as categories, course details, co-authors, and schedule.
We did this by storing the sitemaps of all these sites in the cloud in a customized JSON format and then triggering an SQL stored procedure that compares the existing data with the new files, updates existing courses where anything has changed, and adds new data.
→ Creating a common architecture that generalizes the scraping of all sitemaps and gathers common data from different sites such as Udemy, Udacity, and Coursera.
→ Uploading large volumes of scraped data to the Azure cloud and loading it into the warehouse in Azure.
→ SQL procedures that check the warehouse and add or update only the data that is missing or has changed.
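
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the record shape (platform, url, lastmod) stands in for the customized JSON format, which is not specified here, and the hypothetical functions sitemap_to_records and plan_upsert mimic in memory the comparison that the SQL stored procedure performed in the warehouse. The sample sitemap and the example.com URLs are made up.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap used only for this illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/course/a</loc><lastmod>2020-01-02</lastmod></url>
  <url><loc>https://example.com/course/b</loc></url>
</urlset>"""


def sitemap_to_records(xml_text, platform):
    """Turn one platform's sitemap XML into records with a shared shape.

    The shape (platform, url, lastmod) is an assumption for this sketch.
    """
    root = ET.fromstring(xml_text)
    records = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:  # skip entries without a URL
            records.append({"platform": platform, "url": loc, "lastmod": lastmod})
    return records


def plan_upsert(existing, incoming):
    """Split incoming records into new rows and changed rows.

    Records are matched on (platform, url); unchanged records are ignored.
    """
    current = {(r["platform"], r["url"]): r for r in existing}
    to_insert, to_update = [], []
    for rec in incoming:
        key = (rec["platform"], rec["url"])
        if key not in current:
            to_insert.append(rec)
        elif current[key] != rec:
            to_update.append(rec)
    return to_insert, to_update


if __name__ == "__main__":
    incoming = sitemap_to_records(SAMPLE, "example")
    # Pretend the warehouse already holds an older version of course "a".
    existing = [{"platform": "example",
                 "url": "https://example.com/course/a",
                 "lastmod": "2019-12-01"}]
    inserts, updates = plan_upsert(existing, incoming)
    print(json.dumps({"insert": inserts, "update": updates}, indent=2))
```

Because every platform's sitemap is reduced to the same record shape, a single comparison step can serve all sites, which is the point of the common architecture.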