Web crawling for business intelligence
Client
The company is in the business of analyzing global web traffic trends and identifying cyber-security threats in real time.
Industry
Business intelligence & information security
Technologies Used
Java RMI, Hibernate, Java Persistence API, Struts, Win API, C++, XPCOM, XUL

Project Summary

The solution is a web crawling technology for gathering and extracting large amounts of data from a variety of online sources. The crawling process can be started either manually or automatically, emulates human-user behavior, and includes auto-filling of contact forms. End users can actively manage crawling processes and work with the application through convenient interfaces for data search and analytics.

The client is in the business of analyzing global web traffic trends and identifying cyber-security threats in real time. They needed a tool to gather and extract large amounts of information from multiple online sources.

Challenges:

  • Traverse all pages between the initial link and the pages that meet specific requirements, automatically filling out any contact forms encountered along the way.
  • Create a page map according to specific rules provided by the client.
  • Gather all assets related to the pages in the map, including HTTP protocol information (headers, cookies, etc.), HTML pages, images, video files, style sheets, and JavaScript files. All assets are stored in the format in which they were received from the Web; additional metadata (crawl process ID, timestamp, and processed links) is stored in a MySQL database.
  • Support two modes of crawling: manual, and automatic according to a user-defined schedule.
  • Analyze the content of the pages according to the client's rules.
  • Provide interfaces for operations monitoring, search, and analytics.
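The first challenge above amounts to a breadth-first traversal from the initial link, building the page map and flagging pages that satisfy a client rule. A minimal sketch follows, with a hard-coded link graph standing in for real HTTP fetches (all class names, URLs, and the matching rule here are illustrative, not the actual crawler's API):

```java
import java.util.*;

// Sketch of the breadth-first traversal from an initial link.
// The link graph is hard-coded for illustration; the real crawler
// fetched pages over HTTP through its embedded browsing engine.
public class CrawlSketch {
    // Hypothetical link structure: page -> outgoing links.
    static final Map<String, List<String>> LINKS = Map.of(
        "/start", List.of("/about", "/products"),
        "/about", List.of("/contact"),
        "/products", List.of("/products/item1"),
        "/contact", List.of(),
        "/products/item1", List.of()
    );

    // Visit every page reachable from the initial link (the "page map"),
    // collecting the pages that match a client-defined rule.
    static List<String> crawl(String start, String rulePattern) {
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        Set<String> visited = new LinkedHashSet<>();
        List<String> matches = new ArrayList<>();
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (!visited.add(page)) continue;   // skip already-mapped pages
            if (page.contains(rulePattern)) matches.add(page);
            for (String next : LINKS.getOrDefault(page, List.of()))
                queue.add(next);
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/start", "contact"));
    }
}
```

In the real system each visit would also trigger asset capture and, where a contact form is detected, the form auto-filling logic.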

Results:

Actimind developed a solution that meets all of the client's needs and requirements, using the following technologies:

  • A Firefox browser add-on for logging in to the crawler.
  • A Java-based crawl engine that controls the execution of crawling processes on the server, manages tasks, and interacts with the database.
  • A browsing engine built around the Firefox browser for navigating pages (including emulation of human-like behavior), filling out contact forms, and saving data.
  • A C++ component that monitors available system resources on the central server and supplies this information to the crawl engine so it can adjust its operations.
  • A Java-based engine that analyzes the collected data.
  • A web application that provides controls for managing active crawl processes, a search interface, and an analytics interface.
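The crawl engine's two launch modes (manual, and automatic on a user-defined schedule) can be sketched with a standard `ScheduledExecutorService`. Class and method names below are illustrative assumptions, not the actual engine's API:

```java
import java.util.concurrent.*;

// Sketch of the two crawl launch modes described above.
public class CrawlScheduler {
    private final ScheduledExecutorService executor =
        Executors.newScheduledThreadPool(2);

    // Manual mode: an operator triggers a crawl run immediately.
    public Future<?> startManual(Runnable crawlTask) {
        return executor.submit(crawlTask);
    }

    // Automatic mode: repeat the crawl at a user-defined interval.
    public ScheduledFuture<?> startScheduled(Runnable crawlTask, long periodMinutes) {
        return executor.scheduleAtFixedRate(crawlTask, 0, periodMinutes, TimeUnit.MINUTES);
    }

    public void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        CrawlScheduler scheduler = new CrawlScheduler();
        Future<?> run = scheduler.startManual(() -> System.out.println("crawl started"));
        run.get(); // wait for the manual run to finish
        scheduler.shutdown();
    }
}
```

The real engine additionally persisted task state to the database so that schedules survived server restarts.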

Besides the key functional features outlined above, the application was specifically developed to meet certain security, performance, and stability requirements.
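The idea behind the resource-monitoring component can be illustrated in Java using the JVM's own memory figures; the real component was written in C++ against the Win API and reported server-wide resources, and the throttling threshold below is an assumed example policy:

```java
// Sketch of the resource check feeding the crawl engine. The production
// component was a C++/Win API monitor; this Java version uses JVM heap
// figures only, and the 0.2 threshold is an illustrative assumption.
public class ResourceMonitor {
    // Fraction of the maximum heap currently free.
    public static double freeHeapFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return 1.0 - (double) used / rt.maxMemory();
    }

    public static void main(String[] args) {
        double free = freeHeapFraction();
        System.out.printf("free heap fraction: %.2f%n", free);
        // Example adjustment policy: throttle parallel crawls under pressure.
        int parallelCrawls = free < 0.2 ? 1 : 4;
        System.out.println("parallel crawls: " + parallelCrawls);
    }
}
```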

Client Benefit:

The client received a unique technology: a scalable, high-quality, high-performance solution for their business.