Incomplete data when scraping web pages with PHP (and how to fix it)

Published: 2022-11-15

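Contents:

  1. Page content fetched with PHP is incomplete
  2. Content fetched with PHP is incomplete: how can I get the full page?
  3. Why does cURL return incomplete data in PHP?
  4. Why does cURL return incomplete data in PHP when file_get_contents returns all of it?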

Page content fetched with PHP is incomplete

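cURL is able to fetch this page. Your connection is probably slow enough that the request times out before the transfer finishes, which is why the content comes back incomplete. CURLOPT_TIMEOUT is the maximum number of seconds the whole transfer may take, so try raising it: curl_setopt($ch, CURLOPT_TIMEOUT, 360)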
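To check whether a timeout really is the cause, it helps to look at cURL's error code after the transfer. The following is a minimal sketch, not a definitive implementation: the function names `timeout_options()` and `fetch_with_timeout()`, the 10-second connect limit, and the example URL are assumptions made for illustration.

```php
<?php
// Build cURL options with explicit time limits (values are illustrative).
function timeout_options(int $totalSeconds = 360, int $connectSeconds = 10): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // limit for establishing the connection
        CURLOPT_TIMEOUT        => $totalSeconds,   // limit for the whole transfer
    ];
}

// Fetch a URL and report whether the transfer ended because of a timeout.
function fetch_with_timeout(string $url, int $totalSeconds = 360, int $connectSeconds = 10): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, timeout_options($totalSeconds, $connectSeconds));

    $body  = curl_exec($ch);   // false on failure
    $errno = curl_errno($ch);  // 0 when the transfer succeeded
    $error = curl_error($ch);

    return [
        'body'      => $body === false ? '' : $body,
        'timed_out' => $errno === CURLE_OPERATION_TIMEDOUT,
        'errno'     => $errno,
        'error'     => $error,
    ];
}

// Usage (hypothetical URL):
// $result = fetch_with_timeout('http://example.com/');
// if ($result['timed_out']) { echo "Timed out: {$result['error']}\n"; }
```

If `timed_out` is false and the page is still incomplete, the timeout is not the problem and one of the other causes on this page is more likely.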

Content fetched with PHP is incomplete: how can I get the full page?

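cURL retrieves exactly what the server sends back: the raw page source. It cannot parse or execute JavaScript, so the raw source is all you will ever get from it. A browser, by contrast, runs the page's JavaScript once it has received that source, and the scripts add, modify, or remove nodes in the DOM. What you see in the browser's element inspector is the HTML after all of that processing, not the original. So with cURL alone you cannot capture the page "as you see it".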
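The difference can be seen without any network access. In the sketch below, the string `$raw` stands in for what cURL would receive from a hypothetical page whose list is filled in by a script; the markup and the helper name `count_nodes()` are assumptions made for illustration.

```php
<?php
// Hypothetical page source as cURL would receive it: the list is empty,
// and a script is supposed to fill it in once a browser runs it.
$raw = '<html><body><ul id="items"></ul>'
     . '<script>var li = document.createElement("li");'
     . 'li.textContent = "A";'
     . 'document.getElementById("items").appendChild(li);</script>'
     . '</body></html>';

// Count the nodes matching an XPath expression in an HTML string.
function count_nodes(string $html, string $xpathExpr): int
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    return $xpath->query($xpathExpr)->length;
}

echo count_nodes($raw, "//ul[@id='items']/li"), "\n"; // prints 0: the script never ran
echo count_nodes($raw, '//script'), "\n";             // prints 1: the script is still just text
```

PHP's DOM parser, like cURL, only sees the source text, so the list item that a browser would show is absent.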

Why does cURL return incomplete data in PHP?

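The likely cause is how cURL handles the Expect header. When a POST body is larger than a threshold (1024 bytes in older libcurl versions), libcurl sends Expect: 100-continue and waits for the server's go-ahead before it sends the body. A server that mishandles this can leave you with a missing or incomplete response. Add one option to your code to suppress the header: curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:')); Here is a more complete wrapper: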

// $status is passed by reference so the caller receives the HTTP status code;
// pass a variable for it, not a literal.
function req_curl($url, &$status = null, $options = array())
{
    // Defaults; any key can be overridden through $options.
    $options = array_merge(array(
        'follow_local' => true,
        'timeout' => 30,
        'max_redirects' => 4,
        'binary_transfer' => false, // not used below
        'include_header' => false,
        'no_body' => false,
        'cookie_location' => __DIR__ . '/cookie',
        'useragent' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'post' => array(),
        'referer' => null,
        // Insecure: 0 turns off TLS certificate checks. Use true and 2 outside of testing.
        'ssl_verifypeer' => 0,
        'ssl_verifyhost' => 0,
        'headers' => array(
            'Expect:' // suppress "Expect: 100-continue"
        ),
        'auth_name' => '',
        'auth_pass' => '',
        'session' => false
    ), $options);
    $options['url'] = $url;
    $s = curl_init();
    if (!$s) return false;
    curl_setopt($s, CURLOPT_URL, $options['url']);
    curl_setopt($s, CURLOPT_HTTPHEADER, $options['headers']);
    curl_setopt($s, CURLOPT_SSL_VERIFYPEER, $options['ssl_verifypeer']);
    curl_setopt($s, CURLOPT_SSL_VERIFYHOST, $options['ssl_verifyhost']);
    curl_setopt($s, CURLOPT_TIMEOUT, $options['timeout']);
    curl_setopt($s, CURLOPT_MAXREDIRS, $options['max_redirects']);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($s, CURLOPT_FOLLOWLOCATION, $options['follow_local']);
    // Read cookies from, and write them back to, the same file.
    curl_setopt($s, CURLOPT_COOKIEJAR, $options['cookie_location']);
    curl_setopt($s, CURLOPT_COOKIEFILE, $options['cookie_location']);
    // HTTP authentication, only when a user name is given.
    if (!empty($options['auth_name']) && is_string($options['auth_name']))
    {
        curl_setopt($s, CURLOPT_USERPWD, $options['auth_name'] . ':' . $options['auth_pass']);
    }
    // A non-empty 'post' array turns the request into a POST,
    // e.g. array('username' => 'aeon', 'password' => '111111').
    if (!empty($options['post']))
    {
        curl_setopt($s, CURLOPT_POST, true);
        curl_setopt($s, CURLOPT_POSTFIELDS, $options['post']);
    }
    if ($options['include_header'])
    {
        curl_setopt($s, CURLOPT_HEADER, true);
    }
    if ($options['no_body'])
    {
        curl_setopt($s, CURLOPT_NOBODY, true);
    }
    // 'session' is a cookie string in "name=value" form.
    if ($options['session'])
    {
        curl_setopt($s, CURLOPT_COOKIESESSION, true);
        curl_setopt($s, CURLOPT_COOKIE, $options['session']);
    }
    curl_setopt($s, CURLOPT_USERAGENT, $options['useragent']);
    // Only send a Referer header when one was supplied.
    if ($options['referer'] !== null)
    {
        curl_setopt($s, CURLOPT_REFERER, $options['referer']);
    }
    // curl_exec() returns false on failure, for example when the timeout expires.
    $res = curl_exec($s);
    $status = curl_getinfo($s, CURLINFO_HTTP_CODE);
    curl_close($s);
    return $res;
}
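One detail matters when applying the Expect fix by hand: each call that sets CURLOPT_HTTPHEADER replaces the whole header list, so 'Expect:' has to be in the same array as any other headers you send. The sketch below shows one way to handle that. The helper names `with_expect_disabled()` and `post_form()` are assumptions made for illustration, not part of cURL.

```php
<?php
// Return the header list with any existing Expect header removed
// and an empty "Expect:" appended, which tells cURL not to send one.
function with_expect_disabled(array $headers): array
{
    $kept = array_values(array_filter($headers, function (string $h): bool {
        // Keep every header that does not start with "Expect:" (case-insensitive).
        return stripos(ltrim($h), 'expect:') !== 0;
    }));
    $kept[] = 'Expect:';
    return $kept;
}

// POST form fields without "Expect: 100-continue". Returns the body, or false on failure.
function post_form(string $url, array $fields, array $headers = [])
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_HTTPHEADER     => with_expect_disabled($headers),
    ]);
    return curl_exec($ch);
}

print_r(with_expect_disabled(['Accept: text/html', 'Expect: 100-continue']));
// Array
// (
//     [0] => Accept: text/html
//     [1] => Expect:
// )
```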

Why does cURL return incomplete data in PHP when file_get_contents returns all of it?

The cause and the fix are the same as in the previous question: cURL's Expect: 100-continue handling on larger POST bodies, which you suppress with curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:')); file_get_contents is presumably unaffected because PHP's HTTP stream wrapper does not add an Expect header on its own. The req_curl() wrapper given under the previous question applies here unchanged.
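For comparison, this is roughly what an equivalent form POST looks like with file_get_contents and a stream context. It is a minimal sketch under stated assumptions: the helper name `post_context()`, the field `q`, and the commented-out URL are hypothetical.

```php
<?php
// Build an HTTP stream context for a form POST.
function post_context(array $fields, int $timeoutSeconds = 30)
{
    return stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
            'timeout' => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

$ctx = post_context(['q' => 'php']);

// Hypothetical URL; uncomment to send the request:
// $html = file_get_contents('http://example.com/search', false, $ctx);

// Show the body that would be sent.
echo stream_context_get_options($ctx)['http']['content'], "\n"; // prints q=php
```

Only the headers listed under 'header' are added by your code here, which makes it easy to compare the two requests side by side when one returns more data than the other.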