[ GrabLab ]

Web Scraping and Data Processing Services
-----------------------------------------

Hey! My name is Gregory. I do web scraping.

github   : github.com/lorien
email    : lorien.name
telegram : @madspectator

Examples of Databases I've Collected
------------------------------------

* Crunchbase Database
* Owler Database
* Google Play Database
* Linkedin Database
* Etoro Database
* Gitlab Database
* Strava Database
* Medium Database
* Telegram Database
* Producthunt Database
* Angellist Database
* Vkontakte Database

Interesting Projects I've Worked on
-----------------------------------

* A crawler which downloads the content of 500M+ HTML documents. First, tasks are loaded into a Google PubSub queue. Then downloader instances running on multiple machines download the HTML documents and save the content into Google Storage. Downloader instances are deployed with ansible as Docker containers. Statistics from all crawlers are saved into InfluxDB on a central node and rendered with Grafana. The system handles about 3000 HTML documents per second. (A minimal sketch of the downloader loop follows this list.)
* A Twitter crawler. First, a few million usernames collected from other datasets are loaded into the task queue. For each username in the task queue, the user crawler checks whether such a Twitter user exists and saves the profile data. For each existing user, the follower crawler fetches the first page of followers and saves those usernames into the task queue. 500M+ Twitter users found so far.
* A Telegram users crawler. First, a few million usernames collected from other datasets are loaded into the task queue. For each username in the task queue, the user crawler checks the "t.me/{username}" page and saves the profile data if such a username is registered in Telegram. (A sketch of the t.me check follows this list.)
* A GitHub commits stream crawler. The event crawler uses the GitHub API to save every push and commit event into the task queue. The downloader crawler fetches each commit's diff data into the database. The analyzer module processes each commit's diff data and saves the extracted data into the database. The system handles every commit published on GitHub, about 5-20 commits per second. (A sketch of the event crawler follows this list.)
* A MongoDB explorer. The masscan tool is used to find all MongoDB instances open to public access. The explorer crawler connects to each MongoDB server and collects information about each database and each collection in the database. (A sketch of the explorer follows this list.)
* Telegram bots. I've developed a number of Telegram bots which help chat administrators fight spam and handle other administrative tasks. My most popular bots are installed in thousands of chats and process millions of messages per day.
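Below is a minimal sketch of one downloader worker from the first project, assuming the google-cloud-pubsub, google-cloud-storage and urllib3 packages; the project, subscription and bucket names are placeholders, not the real ones.

    # Minimal sketch of a downloader worker: pull URLs from a Google PubSub
    # subscription, fetch the HTML with urllib3 and store the body in Google
    # Cloud Storage. Project, subscription and bucket names are placeholders.
    import hashlib

    import urllib3
    from google.cloud import pubsub_v1, storage

    http = urllib3.PoolManager(timeout=10.0)
    bucket = storage.Client().bucket("html-archive")
    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path("my-project", "download-tasks")

    def handle_task(message):
        url = message.data.decode("utf-8")
        try:
            response = http.request("GET", url, retries=2)
            # Key each document by a hash of its URL to keep object names flat
            key = hashlib.sha1(url.encode("utf-8")).hexdigest()
            bucket.blob(key).upload_from_string(response.data)
            message.ack()
        except Exception:
            message.nack()  # let PubSub redeliver the task later

    # Block forever, processing tasks as the PubSub client delivers them
    subscriber.subscribe(subscription, callback=handle_task).result()

In the real system many such workers run in parallel as Docker containers, and per-worker statistics are also pushed to InfluxDB.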
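A rough sketch of the t.me check used by the Telegram users crawler. The HTML marker that decides whether a username is registered is my assumption; the real page structure may differ and change over time.

    # Rough sketch of the t.me username probe. The HTML marker used below is
    # an assumption about the t.me page layout, not a documented API.
    import urllib3

    http = urllib3.PoolManager(timeout=10.0)

    def probe_username(username):
        """Return profile page HTML if the username looks registered, else None."""
        response = http.request("GET", "https://t.me/" + username, redirect=True)
        if response.status != 200:
            return None
        html = response.data.decode("utf-8", errors="replace")
        # Assumed marker: pages of registered usernames show an action button
        # block, pages of unregistered usernames do not.
        return html if "tgme_action" in html else None

    if __name__ == "__main__":
        print(bool(probe_username("telegram")))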
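The event crawler from the GitHub commits stream project could look roughly like this: poll the public events API, keep PushEvents and queue every commit reference in MongoDB for the downloader crawler. The queue collection and the polling interval are assumptions; the real crawler may authenticate and page through events differently.

    # Sketch of the event crawler: poll the public GitHub events feed, keep
    # PushEvents and queue each commit reference for the downloader crawler.
    # The MongoDB collection used as the task queue is an assumption.
    import json
    import time

    import urllib3
    from pymongo import MongoClient

    http = urllib3.PoolManager()
    queue = MongoClient()["crawler"]["commit_tasks"]

    while True:
        response = http.request(
            "GET",
            "https://api.github.com/events",
            headers={"User-Agent": "commits-crawler"},  # GitHub requires a UA
        )
        for event in json.loads(response.data.decode("utf-8")):
            if event["type"] != "PushEvent":
                continue
            repo = event["repo"]["name"]
            for commit in event["payload"]["commits"]:
                # Upsert so the same commit is not queued twice
                queue.update_one(
                    {"_id": repo + ":" + commit["sha"]},
                    {"$setOnInsert": {"repo": repo, "sha": commit["sha"],
                                      "status": "new"}},
                    upsert=True,
                )
        time.sleep(1)  # unauthenticated requests are heavily rate limited

The downloader crawler would then pop queued items and fetch "https://api.github.com/repos/{repo}/commits/{sha}" to get the diff data for the analyzer module.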
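A sketch of the explorer crawler from the MongoDB explorer project, assuming masscan has already produced a list of hosts with an open 27017 port (e.g. with -oL output); pymongo is then used to enumerate databases and collections. The output file name and the format parsing are assumptions.

    # Sketch of the MongoDB explorer. It assumes masscan was already run
    # (e.g. "masscan -p27017 ... -oL mongo.lst") and parses its list output
    # for candidate hosts; adjust the parsing for your masscan version.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    def iter_hosts(path):
        # masscan -oL lines look like: "open tcp 27017 1.2.3.4 1617181920"
        with open(path) as inp:
            for line in inp:
                parts = line.split()
                if len(parts) >= 4 and parts[0] == "open":
                    yield parts[3]

    def explore(host):
        client = MongoClient(host, 27017, serverSelectionTimeoutMS=3000)
        report = {}
        for db_name in client.list_database_names():
            db = client[db_name]
            report[db_name] = {
                name: db[name].estimated_document_count()
                for name in db.list_collection_names()
            }
        return report

    if __name__ == "__main__":
        for host in iter_hosts("mongo.lst"):
            try:
                print(host, explore(host))
            except PyMongoError:
                pass  # not reachable or access is restricted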
Tools I Use to Build Things
---------------------------

* Debian - I use Debian on my laptop and on every server where I deploy my software
* vim - my text editor
* Python - all my software is powered by Python
* MongoDB - I store 99% of my data in MongoDB; I also use it as a task queue (see the sketch after this list)
* scaleway, oneprovider - dedicated server providers
* bash - I use it to write scripts that control software instances and process data
* InfluxDB - I collect statistics from servers and crawlers into InfluxDB
* Grafana - I use it to display statistics
* Django, bottle - I use these Python web frameworks to build backends for APIs and web UIs
* ioweb, urllib3 - I use these Python libraries to build my crawlers
* twitter bootstrap - to build HTML/CSS for web UI
* ansible - to set up servers and deploy software
* telegram, gmail - to read spam and communicate with robots
* xiaomi air - this is my laptop
* paper - I write every task to do on a piece of paper
* git, mercurial - to manage versions of my source code
* github - to store open source and private software projects
* supervisor - to daemonize and manage crawler instances on servers
* nginx - a webserver
* letsencrypt - SSL certificates
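Since MongoDB doubles as the task queue in most of the projects above, here is a minimal sketch of that pattern, with illustrative collection and field names: find_one_and_update atomically claims a pending task, so several crawler instances can consume from one collection without stepping on each other.

    # Minimal sketch of using a MongoDB collection as a task queue.
    # Collection and field names are illustrative, not the real schema.
    from datetime import datetime, timezone

    from pymongo import MongoClient, ReturnDocument

    tasks = MongoClient()["crawler"]["tasks"]

    def put_task(payload):
        tasks.insert_one({"payload": payload, "status": "pending"})

    def claim_task():
        """Atomically mark one pending task as taken and return it (or None)."""
        return tasks.find_one_and_update(
            {"status": "pending"},
            {"$set": {"status": "taken",
                      "taken_at": datetime.now(timezone.utc)}},
            return_document=ReturnDocument.AFTER,
        )

    def finish_task(task_id):
        tasks.update_one({"_id": task_id}, {"$set": {"status": "done"}})

    if __name__ == "__main__":
        put_task("https://example.com/")
        task = claim_task()
        if task is not None:
            # ... crawl, then mark the task as finished ...
            finish_task(task["_id"])

An index on the status field keeps claim_task fast once the collection grows.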