Design Web Crawler
Functional Requirements:
- Download all webpages addressed by the URLs.
- Generate a reverse index of words to pages for the search engine.
- Generate a title and snippet for each page.
- Ignore pages with duplicate content.
- Prioritise URLs.

Non-Functional Requirements:
- High availability.
- Scalability through parallelisation.
- Robustness: handle unresponsive servers, crashes, malicious links, and bad HTML.
- Politeness: the crawler should not make too many requests to a website within a short period of time.

Details:
- The HTML Parser and Content Parser services are workers with a thread per URL.
- The Content Parser passes a Redis key on the queue; the next service fetches the content from Redis and processes it further.
- The Duplicate Content service checks whether the same content is already present in content storage. It may compare hash values, or use SimHash as Google does.
- The Reverse Index and Document services perform additional processing on the content.
- The URL Extractor extracts URLs from the page.
- The URL Filter excludes certain types, file extensio...
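The duplicate-content check mentioned above can be sketched with SimHash: unlike an exact hash, near-duplicate pages yield fingerprints that differ in only a few bits, so comparing Hamming distance against a small threshold catches lightly edited copies. This is a minimal illustration (token hashing via MD5, 64-bit fingerprints, threshold 3 are all assumptions, not a production tuning):

```python
import hashlib
import re

def simhash(text, bits=64):
    """Compute a SimHash fingerprint of the page text.

    Each token votes +1/-1 on every bit position according to its own
    hash; the sign of the running total decides the fingerprint bit.
    """
    v = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(text_a, text_b, threshold=3):
    """threshold is an assumed cut-off; real systems tune it empirically."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

An exact hash (e.g. SHA-256 of the normalised body) is cheaper when only byte-identical duplicates matter; SimHash earns its cost when boilerplate or timestamps vary between otherwise identical pages.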
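The reverse index requirement maps each word to the set of pages containing it. A toy in-memory version (real systems shard this and store postings on disk) could look like:

```python
import re
from collections import defaultdict

def build_reverse_index(pages):
    """pages: dict of page_id -> page text.

    Returns word -> sorted list of page ids containing that word.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_id)
    # Sorted posting lists make intersection (multi-word queries) easy.
    return {word: sorted(ids) for word, ids in index.items()}
```

A search service answers a query by intersecting the posting lists of the query terms.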
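The politeness requirement is commonly enforced with a per-host rate limit: before fetching a URL, the worker checks when that host was last hit and waits out the remainder of a minimum delay. A small sketch (the 1-second default and the class name are assumptions; real crawlers also honour robots.txt crawl-delay):

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_hit = {}  # host -> timestamp of last request

    def wait_time(self, url, now=None):
        """Seconds the crawler should still wait before fetching url."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, url, now=None):
        """Call after issuing the request, so the next check sees it."""
        host = urlparse(url).netloc
        self.last_hit[host] = time.monotonic() if now is None else now
```

In a distributed crawler this state lives in a shared store (e.g. Redis) so that all workers respect the same per-host budget.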
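URL prioritisation is typically handled by a priority-queue frontier: important URLs (fresh news sites, high-PageRank hosts) dequeue before the rest. A minimal heap-based sketch, assuming lower numbers mean higher priority:

```python
import heapq
import itertools

class UrlFrontier:
    """Priority queue of URLs; lower priority number is fetched sooner."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        priority, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

Production frontiers (e.g. the Mercator design) split this into front queues for priority and back queues for per-host politeness, but the ordering idea is the same.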
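The URL Filter's extension check can be as simple as a blocklist test on the URL path. The particular extensions below are an illustrative assumption, since the original list is truncated:

```python
from urllib.parse import urlparse

# Hypothetical blocklist: binary/media types the crawler should skip.
BLOCKED_EXTENSIONS = {".jpg", ".png", ".gif", ".zip", ".exe", ".pdf"}

def url_allowed(url):
    """Return False if the URL path ends in a blocked file extension."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in BLOCKED_EXTENSIONS)
```

The same hook is a natural place for scheme checks (http/https only) and domain allow/deny rules.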