About the Job: We are growing fast, particularly internationally, and are looking for new collaborators to join our team: a young but experienced, dynamic, and complementary group with a resolutely start-up spirit. We offer real job and career opportunities, a friendly atmosphere, and a climate of trust that promotes autonomy and challenge.
Responsibilities:
- Responsible for large-scale data capture from web and mobile sources, and for the design of processing architectures covering extraction, deduplication, classification, clustering, and filtering;
- Responsible for the design and development of distributed web crawlers, independently solving the problems encountered during development;
- Responsible for the research and development of web-page information extraction algorithms to improve the efficiency and quality of data capture;
- Responsible for the analysis and warehousing of crawled data, and for monitoring the crawler system and raising anomaly alerts;
- Responsible for designing and developing data collection strategies and anti-blocking rules to improve the efficiency and quality of data collection;
- Responsible for the design and development of core algorithms according to the system's data-processing flow and business requirements.
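The deduplication step mentioned above is commonly implemented by fingerprinting page content. The following is not part of the posting, just a minimal stdlib-only sketch of one such approach, assuming SHA-256 fingerprints of whitespace- and case-normalized text:

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable SHA-256 fingerprint of normalized page text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct page body."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        fp = fingerprint(page)
        if fp not in seen:
            seen.add(fp)
            unique.append(page)
    return unique


# Near-identical pages (differing only in whitespace/case) collapse to one:
# deduplicate(["Hello  World", "hello world", "Other"])
# → ["Hello  World", "Other"]
```

In a production crawler this set would typically live in shared storage (e.g. Redis or a Bloom filter) so that distributed workers can deduplicate against each other.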
Qualifications:
- Proficient in Python; familiar with one or more common crawler frameworks, such as Scrapy or other web scraping frameworks, with independent development experience.
- 1+ years of relevant experience.
- Familiar with vertical-search and distributed web crawlers; deep understanding of web crawler principles; rich project experience in data crawling, parsing, cleaning, and storage; command of anti-crawler techniques and countermeasures.
- Command of basic Linux operations.
- Experience with distributed crawler architecture design, IP pools, and proxies is preferred.
- A solid foundation in data structures and algorithms is preferred.
- Familiarity with common data storage and data processing technologies is preferred.
- Familiar with commonly used frameworks such as SSH, and with multi-threading and network communication programming.
- Familiar with at least one RDBMS and one non-relational (NoSQL) database technology.
- Hands-on experience crawling any eCommerce platform is a big plus.
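To give a flavor of the web-page information extraction work described above, here is a small stdlib-only sketch (not part of the posting, and simpler than a Scrapy spider) that extracts absolute link targets from an HTML page, assuming the base URL is known:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute href targets from <a> tags in an HTML document."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Example: relative links are resolved, absolute links pass through.
# extract_links('<a href="/a">A</a><a href="https://x.example/b">B</a>',
#               "https://site.example/")
# → ["https://site.example/a", "https://x.example/b"]
```

In practice a candidate would use Scrapy's `LinkExtractor` or an HTML library such as lxml for this; the point is only to illustrate the parse-and-extract step of the crawl loop.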