TDSM 3.19

From The Data Science Design Manual Wikia
  • First, we need to identify the sources on the web from which the data can be acquired.
  • A wise first step is to check whether the website we wish to scrape provides an API, or whether a free, open-source crawler already exists that scrapes the pages or data we require.
  • Next, we can use web scraping tools. Many Python libraries (such as Requests, Beautiful Soup 4, and Scrapy) exist that help to fetch, parse, or scrape data. However, these libraries might not provide all the data we want, or might not provide it in the required format.
  • For more customization, the next option is to build a crawler ourselves by finding patterns in the HTML tags and following the hyperlinks available on the seed page.
  • To save ourselves the hassle of building a crawler, and if our budget permits, we can use the services of data provider agencies that specialize in building crawlers and scraping data.
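
The "build a crawler ourselves" option can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a production crawler: it pulls the href of every <a> tag out of a page and follows those links breadth-first from a seed page. The names `LinkExtractor`, `extract_links`, `crawl`, and the in-memory `PAGES` dictionary are all hypothetical and chosen for this example; the `fetch` argument is an assumed stand-in for whatever actually downloads a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs for all links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl from seed_url.

    fetch(url) is assumed to return the page HTML, or None on failure.
    """
    seen = {seed_url}
    queue = deque([seed_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/c#top">C</a>',
    "http://example.test/b.html": "<p>no links</p>",
    "http://example.test/c": '<a href="/a">A</a>',
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

In real use, `fetch` might be a small wrapper around the Requests library (for example, returning `requests.get(url).text`), and the `max_pages` limit keeps the crawl from wandering indefinitely. Extracting the actual data of interest from each page would still require site-specific patterns on the HTML tags.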