Web crawling for business intelligence
The solution is a web crawling technology that gathers and extracts large amounts of data from a variety of online sources. The crawling process can be started either automatically or manually; it emulates human user behavior and includes auto-filling of contact forms. End users have controls for actively managing the crawling process, and they work with the application through convenient interfaces for data search and analytics.
The client is in the business of analyzing global web traffic trends and identifying cyber-security threats in real time. They needed a tool to gather and extract large amounts of information from multiple online sources.
- The crawler should go through all the pages between the initial link and the pages that meet specific requirements, automatically filling out any contact forms it encounters.
- Create a page map according to the specific rules provided by the client.
- Provide two ways to start the crawling process: manually, and automatically on a user-defined schedule.
- Analyze the content of the pages according to the client's rules.
- Provide interfaces for operations monitoring, search and analytics.
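The first two requirements (walking from the initial link to the pages that meet the client's criteria, and building a page map along the way) can be illustrated with a minimal sketch. It is not the project's actual code: the class name `PageMapCrawler`, the `fetchLinks` and `isTarget` parameters and the `maxPages` limit are all hypothetical, and link extraction is passed in as a function so the example runs without network access.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the actual product.
final class PageMapCrawler {
    private PageMapCrawler() {}

    /**
     * Walks breadth-first from startUrl and records every visited page
     * together with its outgoing links (the "page map").
     * Pages accepted by isTarget are recorded but not expanded further.
     * fetchLinks stands in for the real browsing engine.
     */
    static Map<String, List<String>> crawl(
            String startUrl,
            Function<String, List<String>> fetchLinks,
            Predicate<String> isTarget,
            int maxPages) {
        Map<String, List<String>> pageMap = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        seen.add(startUrl);
        frontier.add(startUrl);

        while (!frontier.isEmpty() && pageMap.size() < maxPages) {
            String url = frontier.remove();
            if (isTarget.test(url)) {
                // Target page reached: keep it in the map, stop descending here.
                pageMap.put(url, List.of());
                continue;
            }
            List<String> links = fetchLinks.apply(url);
            pageMap.put(url, List.copyOf(links));
            for (String link : links) {
                // Set.add returns false for already-seen URLs, which avoids loops.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return pageMap;
    }
}
```

In a real deployment `fetchLinks` would be backed by the browser-driven navigation, and `isTarget` would encode the client-specific page requirements; both are assumptions here.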
Actimind developed a solution to meet all of the client's needs and requirements using the following technologies:
- A Firefox browser add-on for logging in to the crawler.
- A Java-based crawl engine that controls the execution of crawling processes on the server, manages tasks and interacts with the database.
- A browsing engine built around the Firefox browser that navigates through pages (emulating human-like behavior), fills in contact forms and saves the collected data.
- A C++ component that monitors the system resources available on the central server and passes this information to the crawl engine so that it can adjust its operations.
- A Java-based engine that analyzes the collected data.
- A web application that provides controls for managing active crawl processes, a search interface and an interface for analytics.
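To make the interaction between the resource monitor and the crawl engine more concrete, here is a rough sketch of how reported resource figures could be turned into a limit on concurrent crawl tasks. The source does not describe the actual adjustment logic, so the `CrawlThrottle` class, its inputs (CPU load and free memory as fractions) and the scaling rule are purely illustrative assumptions.

```java
// Hypothetical sketch: the real adjustment policy is not described in the source.
final class CrawlThrottle {
    private final int maxWorkers;

    CrawlThrottle(int maxWorkers) {
        if (maxWorkers < 1) {
            throw new IllegalArgumentException("maxWorkers must be >= 1");
        }
        this.maxWorkers = maxWorkers;
    }

    /**
     * Returns how many crawl tasks may run at once.
     * cpuLoad and freeMemoryFraction are assumed to be in the range [0, 1],
     * as they might be reported by a resource monitor.
     */
    int allowedWorkers(double cpuLoad, double freeMemoryFraction) {
        double cpuHeadroom = clamp(1.0 - cpuLoad);
        double memoryHeadroom = clamp(freeMemoryFraction);
        // The scarcer resource decides how far the crawl is scaled back.
        double headroom = Math.min(cpuHeadroom, memoryHeadroom);
        int workers = (int) Math.floor(maxWorkers * headroom);
        // Always keep one worker so the crawl can still make progress.
        return Math.max(1, workers);
    }

    private static double clamp(double value) {
        return Math.max(0.0, Math.min(1.0, value));
    }
}
```

With a ceiling of 8 workers, a CPU load of 0.5 and 75% free memory, this rule would allow 4 concurrent tasks; under full load it falls back to a single task.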
Besides the key functional features outlined above, the application was specifically developed to meet certain security, performance and stability requirements.
The client received a unique technology that provided a scalable, high-quality, high-performance solution for their business.