Web Crawling for Business Intelligence


Real-Time Identification of Cyber Security Threats



Web crawling technologies help collect and extract large amounts of data from a variety of online sources. They are crucial in web traffic analysis, cyber security, and in any area where data needs to be collected from multiple sources and analyzed in real time.


Solutions provided by Actimind have featured both automatic and manual modes, emulation of human user behavior, and auto-filling of contact forms, with convenient controls for end users.


One such solution was an application for data search and analysis built around a sophisticated crawling mechanism.



Challenges we faced

  • Traverse every page between the initial link and the pages that meet specific requirements, automatically filling out any contact forms encountered along the way.
  • Create a page map according to the specific rules provided by the client.
  • Gather all assets related to the pages in the map, including HTTP protocol information (headers, cookies, etc.), HTML pages, images, video files, style sheets, and JavaScript files. All assets are stored in the format in which they were received from the Web. Additional metadata (crawl process ID, time, and links processed) is stored in a MySQL database.
  • Provide two crawling modes: manual, and automatic according to a user-defined schedule.
  • Analyze the content of the pages according to the client's rules.
  • Provide interfaces for operations monitoring, search, and analytics.
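
The first two requirements can be illustrated with a minimal sketch. It assumes a breadth-first traversal, which the case study does not confirm, and every name in it (PageMapSketch, buildPageMap, pathTo, the linkExtractor and isTarget parameters) is hypothetical. Link extraction is passed in as a function so that the example runs on an in-memory site; form filling and asset storage are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the original project.
class PageMapSketch {

    /**
     * Walks the site breadth-first from startUrl and returns a page map:
     * each discovered page mapped to the page it was first reached from
     * (null for the start page). Target pages are recorded but not expanded.
     */
    static Map<String, String> buildPageMap(String startUrl,
                                            Function<String, List<String>> linkExtractor,
                                            Predicate<String> isTarget,
                                            int maxPages) {
        Map<String, String> parentOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parentOf.put(startUrl, null);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (isTarget.test(page)) {
                continue; // the crawl stops at pages that meet the requirements
            }
            for (String link : linkExtractor.apply(page)) {
                if (parentOf.size() >= maxPages) {
                    return parentOf; // safety cap on the size of the map
                }
                if (!parentOf.containsKey(link)) {
                    parentOf.put(link, page);
                    queue.add(link);
                }
            }
        }
        return parentOf;
    }

    /** Rebuilds the path from the start page to the given page; empty if the page is not in the map. */
    static List<String> pathTo(Map<String, String> parentOf, String page) {
        Deque<String> path = new ArrayDeque<>();
        if (!parentOf.containsKey(page)) {
            return new ArrayList<>(path);
        }
        for (String p = page; p != null; p = parentOf.get(p)) {
            path.addFirst(p);
        }
        return new ArrayList<>(path);
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" standing in for real HTTP fetches.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/about", "/products"),
            "/about", List.of("/contact"),
            "/products", List.of("/contact", "/"),
            "/contact", List.<String>of());
        Map<String, String> map = buildPageMap(
            "/", u -> site.getOrDefault(u, List.<String>of()), u -> u.equals("/contact"), 100);
        System.out.println(pathTo(map, "/contact")); // prints [/, /about, /contact]
    }
}
```

Storing the parent of each page keeps the map compact while still allowing the route from the initial link to any target page to be reconstructed.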

Results we delivered

Actimind developed a solution that met all of the client's needs and requirements, built from the following components:

  • A Firefox browser add-on for logging in to the crawler.
  • A Java-based crawl engine that controls the execution of crawling processes on the server, manages tasks, and interacts with the database.
  • A browsing engine built around the Firefox browser that navigates through pages (including emulation of human-like behavior), fills in contact forms, and saves the collected data.
  • A C++ component that monitors available system resources on the central server and provides this information to the crawl engine so it can adjust its operations.
  • A Java-based engine that analyzes the collected data.
  • A web application that provides controls for managing active crawl processes, a search interface, and an analytics interface.
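
As an illustration of how a crawl engine might adjust its operations based on resource readings, here is a minimal sketch. It is not the delivered implementation: the class and method names, the thresholds (85% and 50% CPU load, 512 MB and 2048 MB of free memory), and the halve-then-increment policy are all assumptions. In the project the readings came from the C++ monitoring component; here they are passed in as plain values.

```java
// Hypothetical sketch: the thresholds and names below are illustrative assumptions.
class CrawlThrottleSketch {

    /** A snapshot of resource readings, as a monitoring component might report them. */
    static final class ResourceSnapshot {
        final double cpuLoad;     // 0.0 (idle) to 1.0 (fully busy)
        final long freeMemoryMb;  // free memory in megabytes

        ResourceSnapshot(double cpuLoad, long freeMemoryMb) {
            this.cpuLoad = cpuLoad;
            this.freeMemoryMb = freeMemoryMb;
        }
    }

    /**
     * Returns the number of concurrent crawl workers to run next,
     * clamped to the range [min, max].
     */
    static int adjustWorkers(int current, int min, int max, ResourceSnapshot s) {
        int next = current;
        if (s.cpuLoad > 0.85 || s.freeMemoryMb < 512) {
            next = current / 2;   // back off quickly under pressure
        } else if (s.cpuLoad < 0.50 && s.freeMemoryMb > 2048) {
            next = current + 1;   // recover slowly when there is headroom
        }
        return Math.max(min, Math.min(max, next));
    }

    public static void main(String[] args) {
        int workers = 8;
        workers = adjustWorkers(workers, 1, 16, new ResourceSnapshot(0.95, 4096)); // overloaded: 8 -> 4
        workers = adjustWorkers(workers, 1, 16, new ResourceSnapshot(0.30, 4096)); // headroom: 4 -> 5
        System.out.println(workers); // prints 5
    }
}
```

Halving under load and adding one worker at a time when resources free up is a common way to avoid oscillation; other policies would fit the same interface.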

Areas of Application


Web traffic analysis, real-time identification of cyber security threats, business intelligence

Technologies Used


Java RMI, Hibernate, Java Persistence API, Struts, Win API, C++, XPCOM, XUL

Benefits


The client received a unique technology that gave their business a scalable, high-quality, high-performance solution.
