Python crawler: scraping multiple pages
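This post was last edited by 2022@lif on 2022-1-25 08:30.
I want to scrape the information for all products at https://list.szlcsc.com/catalog/439.html, but the URL does not change when I page through the list. Looking at the network requests, the page number appears to be stored in the form data shown under Payload.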
However, when I send a POST request carrying these parameters, I do not get the source of that page.
data = {
'catalogNodeId': 439,
'pageNumber': 2,
'querySortBySign': 0,
'showOutSockProduct': 1,
'showDiscountProduct': 1,
'queryBeginPrice':'',
'queryEndPrice':'' ,
'queryProductArrange':'',
'queryProductGradePlateId': '',
'queryProductTypeCode': '',
'queryParameterValue': '',
'queryProductStandard': '',
'querySmtLabel': '',
'queryReferenceEncap': '',
'queryProductLabel':'',
'lastParamName': '',
'baseParameterCondition': '',
'parameterCondition': ''
}
session = requests.session()
response = session.post(lit_url,data = json.dumps(data),headers = headers).content.decode('utf-8')
The full code is as follows:
import requests
from lxml import etree
import json
import pandas as pd
import sys
url = 'https://list.szlcsc.com/'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34'
}
# Fetch the page source
def get_page(url):
    session = requests.session()
    # A session stores the cookies it receives and sends them with its later requests;
    # .content.decode('utf-8') avoids garbled Chinese text in the scraped data
    # NOTE: proxy is not defined anywhere in the posted code
    page_text = session.get(url = url,headers = headers,proxies=proxy).content.decode('utf-8')
    return page_text
# Get the content at the given XPath
def get_content(text,path):
    tree = etree.HTML(text)
    content_list = tree.xpath(path)
    return content_list
resp = get_page(url)
for num in range(1,19):
    Manul_list = get_content(resp,f'/html/body/div/div/div/ul/li[{num}]/div/dl/dt/a/text()')
    Manul = '/'.join(Manul_list)
    print(str(num)+' '+Manul)
# Choose the top-level category to query
choice = input('please select a number you would like to choice:')
lit_list = get_content(resp,f'/html/body/div/div/div/ul/li[{choice}]/div/dl/dd/a/text()')
lit_url_list = get_content(resp,f'/html/body/div/div/div/ul/li[{choice}]/div/dl/dd/a/@href')
lit_dict = {}
for i in range(0,len(lit_list)):
    lit_dict.update({lit_list[i]:lit_url_list[i]})
name_dict = {}
for n in range(1,len(lit_list)+1):
    print(str(n)+' '+lit_list[n-1])
    name_dict.update({n:lit_list[n-1]})
# Choose the specific type to query
cho = int(input('please select a number you would like to choice:'))
lit_url = lit_dict[name_dict[cho]]
data = {
'catalogNodeId': 439,
After running the code, select 1 and then select 18.
But what I get back is not the source of the page I want. What went wrong?
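One detail in the snippet above that may be worth checking is `data = json.dumps(data)`. When `requests` is given a dict as `data`, it form-encodes it (equivalent to `urllib.parse.urlencode`), whereas a string produced by `json.dumps` is sent as the body unchanged. Below is a minimal sketch of the difference, using only the standard library and a shortened, illustrative payload. Whether this server accepts only one of the two forms is an assumption to verify in the browser's network panel.

```python
import json
from urllib.parse import urlencode

# Shortened, illustrative payload (the real form has more fields)
payload = {'catalogNodeId': 439, 'pageNumber': 2, 'queryBeginPrice': ''}

# Body sent for requests.post(url, data=payload): form-encoded key=value pairs
form_body = urlencode(payload)

# Body sent for requests.post(url, data=json.dumps(payload)): one JSON string
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

If the Payload tab in the browser shows "Form Data", the form-encoded body is the one that mirrors what the browser sends.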
See whether this is what you are after. The page is too specialized for me, so I can't quite follow it.
import requests,json
data = {
'catalogNodeId': 439,
'pageNumber': 2,
'querySortBySign': 0,
'showOutSockProduct': 1,
'showDiscountProduct': 1,
'queryBeginPrice':'',
'queryEndPrice':'' ,
'queryProductArrange':'',
'queryProductGradePlateId': '',
'queryProductTypeCode': '',
'queryParameterValue': '',
'queryProductStandard': '',
'querySmtLabel': '',
'queryReferenceEncap': '',
'queryProductLabel':'',
'lastParamName': '',
'baseParameterCondition': '',
'parameterCondition': ''
}
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
"cookie": "acw_tc=da3dd31c16430728279798128e080b8460c74973023561a97ab6c006d3; SID=471d893b-2380-43e5-ba2f-7c8cf55e7592; SID.sig=bPEGcnKdLpFDiL5G5xuKHDFqRnS7DjKR0uoun7RX0Cg; Qs_lvt_290854=1643072833; Qs_pv_290854=4487896694418019300; cpx=1; guidePage=true; noLoginCustomerFlag=929acea54e81e00b7866; noLoginCustomerFlag2=a0c9da953af694050bd3; PRO_NEW_SID=8920f7cd-dc56-4f97-bd00-f3cb9b98596b; computerKey=d87e0a87824fc7096b6a; AGL_USER_ID=85617b33-f9fb-4d6e-9d65-1b1878037a9d; Hm_lvt_e2986f4b6753d376004696a1628713d2=1643072840; Hm_lpvt_e2986f4b6753d376004696a1628713d2=1643072840; show_out_sock_product=1",
"origin":"https://list.szlcsc.com",
"referer": "https://list.szlcsc.com/catalog/439.html"
}
url="https://list.szlcsc.com/products/list"
txt=requests.post(url,headers=headers,data=data).text
js=json.loads(txt)
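print(js["productRecordList"])
wp231957 posted on 2022-1-25 09:26
See whether this is what you are after. The page is too specialized for me, so I can't quite follow it.
So the product information is all in the returned JSON string, right?
Are 'origin' and 'referer' really required?
Thank you very much!
2022@lif posted on 2022-1-25 10:21
So the product information is all in the returned JSON string, right?
Are 'origin' and 'referer' really required?
Thank you very ...
Are 'origin' and 'referer' really required?
You would have to test them one at a time. I put them all in for convenience, so I don't know whether every one of them is needed. My feeling is that the cookie is required.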
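The one-at-a-time test described above can be automated. The sketch below drops each header in turn and records the ones whose removal makes the request fail. `send` is a hypothetical callable (not part of `requests`) that performs the request with the given headers and returns True when the response looks right; `fake_send` only stands in for a real server and says nothing about what this site actually checks.

```python
def find_required_headers(send, headers):
    """Return the header names whose removal makes send() fail."""
    required = []
    for name in headers:
        # Copy of the headers with exactly one entry left out
        trimmed = {k: v for k, v in headers.items() if k != name}
        if not send(trimmed):
            required.append(name)
    return required

# Stand-in for a real request: pretends the server only checks the cookie
def fake_send(headers):
    return 'cookie' in headers

headers = {
    'User-Agent': 'demo',
    'cookie': 'SID=demo',
    'origin': 'https://list.szlcsc.com',
    'referer': 'https://list.szlcsc.com/catalog/439.html',
}
required = find_required_headers(fake_send, headers)
print(required)
```

With a real request, `send` could wrap `requests.post(...)` and check that the parsed JSON contains `productRecordList`.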
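So the product information is all in the returned JSON string, right?
Yes, this request returns a JSON string, not a page.
wp231957 posted on 2022-1-25 10:26
Are 'origin' and 'referer' really required?
You would have to test them one at a time. I put them all in for convenience, so I don't know ...
So do I need to handle the JSON string as dictionaries and lists to get the data I want?
2022@lif posted on 2022-1-25 10:32
So do I need to handle the JSON string as dictionaries and lists to get the data I want?
Yes, it depends on the type of the returned data: sometimes it is a list (called an array in JS) and sometimes a dict (called a JSON string in JS).
wp231957 posted on 2022-1-25 10:35
Yes, it depends on the type of the returned data: sometimes it is a list (called an array in JS) and sometimes a dict (called a JSON string in JS).
txt=requests.post(url,headers=headers,data=data,proxies = proxy).text
Here, doesn't data=data need to be data=json.dumps(data)? I saw that online; it is said to encode a Python object into a JSON string. I'm not too clear on it either.
2022@lif posted on 2022-1-25 10:39
txt=requests.post(url,headers=headers,data=data,proxies = proxy).text
Here, doesn't data=data need to be da ...
Usually you don't need it, I think; a dict and JSON are basically the same thing anyway.
wp231957 posted on 2022-1-25 09:26
See whether this is what you are after. The page is too specialized for me, so I can't quite follow it.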
import requests,json
import pandas as pd
productinformation_list = []
#for i in range(1,7):
data = {
'catalogNodeId': 11182,
'pageNumber': 1,
'querySortBySign': 0,
'showOutSockProduct': 1,
'showDiscountProduct': 1,
'queryBeginPrice':'',
'queryEndPrice':'' ,
'queryProductArrange':'',
'queryProductGradePlateId': '',
'queryProductTypeCode': '',
'queryParameterValue': '',
'queryProductStandard': '',
'querySmtLabel': '',
'queryReferenceEncap': '',
'queryProductLabel':'',
'lastParamName': '',
'baseParameterCondition': '',
'parameterCondition': ''
}
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
"cookie": "acw_tc=da3dd31c16430728279798128e080b8460c74973023561a97ab6c006d3; SID=471d893b-2380-43e5-ba2f-7c8cf55e7592; SID.sig=bPEGcnKdLpFDiL5G5xuKHDFqRnS7DjKR0uoun7RX0Cg; Qs_lvt_290854=1643072833; Qs_pv_290854=4487896694418019300; cpx=1; guidePage=true; noLoginCustomerFlag=929acea54e81e00b7866; noLoginCustomerFlag2=a0c9da953af694050bd3; PRO_NEW_SID=8920f7cd-dc56-4f97-bd00-f3cb9b98596b; computerKey=d87e0a87824fc7096b6a; AGL_USER_ID=85617b33-f9fb-4d6e-9d65-1b1878037a9d; Hm_lvt_e2986f4b6753d376004696a1628713d2=1643072840; Hm_lpvt_e2986f4b6753d376004696a1628713d2=1643072840; show_out_sock_product=1",
"origin":"https://list.szlcsc.com",
"referer": "https://list.szlcsc.com/catalog/11182.html"
}
url="https://list.szlcsc.com/products/list"
txt=requests.post(url,headers=headers,data=data).text
js=json.loads(txt)
# Process the data for the detail page
product_list = js["productRecordList"]
print(product_list)
for n in range(0,len(product_list)):
    product_dict = product_list[n]
    productinformation = []
    productName = product_dict['productName'] # product name
    productinformation.append(productName)
    productCode = product_dict['productCode'] # product code
    productinformation.append(productCode)
    productModel = product_dict["productModel"] # product model
    productinformation.append(productModel)
    package = product_dict['encapsulationModel'] # package (footprint)
    productinformation.append(package)
    brandname = product_dict['lightBrandName'] # brand
    productinformation.append(brandname)
    packageArrange = product_dict['productArrange'] # packaging
    productinformation.append(packageArrange)
    gdWarehouseStockNumber = product_dict['gdWarehouseStockNumber']
    productinformation.append(gdWarehouseStockNumber)
    jsWarehouseStockNumber = product_dict['jsWarehouseStockNumber']
    productinformation.append(jsWarehouseStockNumber)
    # Process the product price data
    # Mainland China (RMB)
    productprice_list = product_dict["productPriceList"]
    Num1 = ''
    for price in productprice_list:
        Num1 = Num1+str(price['spNumber'])+'->'+str(price['epNumber'])+':¥'+str(price['thePrice'])+'\n'
    productinformation.append(Num1)
    # Hong Kong (USD)
    productprice_list = product_dict["productHkDollerPriceList"]
    Num2 = ''
    for price in productprice_list:
        Num2 = Num2+str(price['spNumber'])+'->'+str(price['epNumber'])+':$'+str(price['thePrice'])+'\n'
    productinformation.append(Num2)
    productinformation_list.append(productinformation)
lable = ['Product name','Product code','Product model','Package','Brand','Packaging','Guangdong stock','Jiangsu stock','Mainland (RMB)/unit','Hong Kong (USD)/unit']
text = pd.DataFrame(productinformation_list,columns = lable)
text.to_csv('./1-25.csv',encoding='utf-8')
print('OK')
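When pageNumber is 1, running the code raises an error at line 68: KeyError: 'productHkDollerPriceList'. But when I print the product_list I got and search it, the 'productHkDollerPriceList' field is there.
Why is that? The 'productPriceList' field does not raise an error.
2022@lif posted on 2022-1-26 08:46
When pageNumber is 1, running the code raises an error at line 68: KeyError: 'productHkDollerPriceList'. But ...
I really don't see this productHkDollerPriceList thing.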
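One possible explanation, offered only as a guess about this response: if some records on the page lack the key, `product_dict['productHkDollerPriceList']` raises KeyError for those records even though a search of the printed list still finds the field in others. Below is a minimal sketch that lists the records missing the key and reads the price tiers with `dict.get`, so a missing key yields an empty string. The sample records are made up and `format_price_tiers` is a hypothetical helper, not part of the original code.

```python
def format_price_tiers(record, key, symbol):
    """Join price tiers as 'start->end:<symbol>price' lines; '' if key is absent."""
    tiers = record.get(key) or []   # missing key or None -> no tiers
    return ''.join(
        f"{t['spNumber']}->{t['epNumber']}:{symbol}{t['thePrice']}\n" for t in tiers
    )

# Made-up records: the second one has no Hong Kong price list
records = [
    {'productHkDollerPriceList': [{'spNumber': 1, 'epNumber': 9, 'thePrice': 0.5}]},
    {'productPriceList': []},
]

# Indexes of the records that would trigger the KeyError
missing = [i for i, r in enumerate(records) if 'productHkDollerPriceList' not in r]
print(missing)

lines = [format_price_tiers(r, 'productHkDollerPriceList', '$') for r in records]
print(lines)
```

Printing `missing` for the real `product_list` would show whether the error comes from particular records rather than from the whole page.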