Code repository: https://github.com/hhqcontinue/zhihuSpider
This crawl collected data on 1.1 million users. The analysis of the data is as follows:
Install a Linux system (Ubuntu 14.04); here Ubuntu runs inside a VMware virtual machine;
Install PHP 5.6 or later;
Install the curl and pcntl extensions.
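Before running anything, it can help to confirm that the interpreter actually meets these requirements. The following is a minimal sketch, not part of the original project: the helper name `missingExtensions` is hypothetical, and it only relies on PHP's built-in `extension_loaded` and `version_compare`.

```php
<?php
// Return the names from $required that are NOT loaded in this PHP build.
// (Hypothetical helper, written to stay compatible with PHP 5.6.)
function missingExtensions(array $required)
{
    $missing = [];
    foreach ($required as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Check the PHP version listed in the requirements above.
if (version_compare(PHP_VERSION, '5.6.0', '<')) {
    echo "PHP 5.6 or later is required, found " . PHP_VERSION . "\n";
}

// Check the two extensions listed in the requirements above.
$missing = missingExtensions(['curl', 'pcntl']);
if ($missing) {
    echo "Missing extensions: " . implode(', ', $missing) . "\n";
} else {
    echo "All required extensions are loaded.\n";
}
```

Run it from the command line with the same PHP binary that will run the crawler, since the CLI and web server builds may load different extensions.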
Using PHP's curl extension to fetch page data
PHP's curl extension is a library supported by PHP that lets you connect to and communicate with many kinds of servers over a variety of protocols.
This program crawls Zhihu user data, so it has to visit each user's profile page, and those pages can only be viewed after logging in. When you click a user's avatar link in the browser and land on that user's profile page, you can see the user's information because the browser automatically sends your local cookies along with the request for the new page. The crawler therefore needs valid cookie information before it can fetch profile pages, and it must send that cookie with every curl request. For the cookie I used my own, which can be viewed in the browser on the page:
Copy the values one by one and join them into a cookie string of the form "__utma=?; __utmb=?;", and so on. This cookie string can then be sent with each request.
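The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the repository: the function names `buildCookieString` and `fetchWithCookie` are hypothetical, the cookie values are placeholders to be replaced with the ones copied from your own logged-in session, and the profile URL in the last comment is only an example. The curl calls themselves (`curl_init`, `curl_setopt` with `CURLOPT_COOKIE`, `curl_exec`) are standard functions of the curl extension.

```php
<?php
// Join cookie name/value pairs into one "name=value; name=value" string,
// which is the format expected in an HTTP Cookie header.
function buildCookieString(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Fetch a page with curl, sending the cookie string on the request.
// Returns the response body as a string, or false on failure.
function fetchWithCookie($url, $cookieString)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_COOKIE, $cookieString);  // contents of the Cookie header
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // give up after 10 seconds
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// Placeholder values: replace with the ones copied from your own browser.
$cookie = buildCookieString([
    '__utma' => 'YOUR_UTMA_VALUE',
    '__utmb' => 'YOUR_UTMB_VALUE',
]);
echo $cookie . "\n";

// Example call (hypothetical URL, left commented out so the sketch runs offline):
// $html = fetchWithCookie('https://www.zhihu.com/people/SOME_USER', $cookie);
```

Building the string from an array keeps each cookie on its own line, which makes it easier to update a single value when the session expires.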