zchb258 posted on 2024-4-20 22:00:08

Scraping all player data from https://sofifa.com/ with Python


I was born in the '80s and still play Winning Eleven 10 (实况10) with some fellow fans. Because the players and their ratings keep getting updated, maintaining everything by hand with copy-and-paste is a huge amount of work. The article below looks great, but I honestly can't follow it and got dizzy trying. Could one of the experts here put together a complete version based on the article, or post a tutorial? Thanks!

Here is the original link: https://mp.weixin.qq.com/s/1KX3dw3cfX1J7vk0andL8A
Or see the screenshot version of the article.



FishC posted on 2024-4-20 22:01:17

Hi! I understand your problem. Scraping web data is a common need, and it can be done with Python's requests library. Below is a simple example that scrapes player data from https://sofifa.com/:

import requests
from bs4 import BeautifulSoup

url = "https://sofifa.com/"

# send the HTTP request
response = requests.get(url)

# parse the returned HTML
soup = BeautifulSoup(response.text, "html.parser")

# find the tags that hold the player data
player_divs = soup.find_all("div", class_="col-name")

# walk each player tag and pull out the data
for player_div in player_divs:
    # extract the player's name and overall rating
    name = player_div.a.text.strip()
    rating = player_div.parent.find("div", class_="col col-oa").text.strip()

    # print the result
    print(name, rating)

In this example we use the requests library to send the HTTP request, then parse the returned HTML document with BeautifulSoup. By searching for specific tags and class names, we can extract each player's name and overall rating.

Note that the site's structure may change, so the code may need adjusting to match the live page. Also, scraping should respect the site's rules and policies: don't send requests too frequently, so you don't put undue load on the site.
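For example, here is a minimal polite-fetching sketch. The User-Agent string, delay, and retry counts are illustrative choices, not anything sofifa prescribes:

import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def polite_get(url, delay=2.0, retries=3):
    # fetch with a browser-like User-Agent, pause between calls, back off on errors
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.status_code == 200:
            time.sleep(delay)               # throttle before the caller's next request
            return resp
        time.sleep(delay * (attempt + 2))   # back off, then retry
    return None

Wrapping every page fetch in something like this keeps the request rate predictable and survives the occasional failed response.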

Hope this simple example helps! If you have more questions, feel free to ask.

The reply above is from the FishC AI assistant. If it did not correctly answer your question, please follow up.

小甲鱼的二师兄 posted on 2024-4-23 04:39:03

I've put the code together for you:

import requests
from bs4 import BeautifulSoup
import re
import csv
import time   # used for the batch sleep in the main loop

def fetchData(id):
    url = f'https://sofifa.com/player/{str(id)}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
        'Referer': 'https://www.google.com/'
    }
    myRequest = requests.get(url, headers=headers)   # pass the headers; the original defined them but never used them
    soup = BeautifulSoup(myRequest.text, 'lxml')
    print(myRequest.status_code)                     # quick progress/health check
    meta = soup.find(attrs={'name':'description'})
    if meta is None:
        return None

    meta = meta['content']
    # keep only FIFA 23 player pages: a player page's description starts with the
    # player's name rather than "FIFA", and the version button must read "FIFA 23"
    years = soup.find(name='span', attrs={'class':'bp3-button-text'})
    if meta[:4] == 'FIFA' or years is None or str(years.string) != "FIFA 23":
        return None
   
    info = soup.find(name='div', attrs={'class':'info'})
    playerName = info.h1.string
    # the list literal was stripped by the forum's BBCode; 'id' and 'name'
    # lead the CSV header, so the row starts with them
    myList = [id, playerName]
   
    # player card text: positions, birth date, height and weight
    rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > div > div")[0]
    offset = rawdata.find_all("span")
    offset = len(offset) - 1
    temp = rawdata.text
    temp = re.split(r'\s+', temp)
    if offset > 0:
        for i in range(offset):
            temp.pop(0)   # drop the extra position tokens from the front

    # NOTE: the index subscripts below were eaten by the forum's BBCode; the values
    # assume the card text splits like ['...', 'Jun', '24,', '1987', '(35)', '170cm', '72kg']
    # and may need adjusting against the live page
    month = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
    mon = temp[1]
    mon = month.index(mon) + 1
    day = temp[2][:-1]    # strip the trailing comma: "24," -> "24"
    year = temp[3]
    birthday = f"{year}/{mon}/{day}"   # the original literal was lost; any unambiguous format works
    myList.append(birthday)

    height = int(temp[5][:-2])   # "170cm" -> 170
    myList.append(height)
    weight = int(temp[6][:-2])   # "72kg" -> 72
    myList.append(weight)
   
    # profile list: preferred foot, skill moves, international reputation, work rate
    rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div:nth-child(2) > div > ul")[0]
    temp = rawdata.find_all('li', class_="ellipsis")

    # NOTE: the item/content subscripts below were also eaten by the forum; they
    # assume the list order Preferred Foot, Skill Moves, Reputation, Work Rate
    preferred_foot = temp[0].contents[-1]
    preferred_foot = 1 if (preferred_foot == 'Left') else 2
    myList.append(preferred_foot)

    skill_move_level = temp[1].contents[-1]
    myList.append(int(skill_move_level))

    reputation = temp[2].contents[-1]
    myList.append(int(reputation))

    todostr = temp[3].text
    workrateString = re.split(r'\s+', todostr)
    wrList = ['Low', "Medium", "High"]
    wr_att = workrateString[-2].rstrip('/')   # assumed "High/ Medium" style text
    wr_def = workrateString[-1]
    wr_att = wrList.index(wr_att) + 1
    wr_def = wrList.index(wr_def) + 1
    myList.append(wr_att)
    myList.append(wr_def)
   
    # download the player's portrait
    rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > img")[0]
    img_url = rawdata.get("data-src")
    img_r = requests.get(img_url, stream=True)
    img_name = f"{id}_{playerName}.png"
    with open(f"./{img_name}", "wb") as fi:
        for chunk in img_r.iter_content(chunk_size=120):
            fi.write(chunk)

    # all listed positions
    rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > div > div > span")
    allPos = ''.join(f"{p.text} " for p in rawdata)
    myList.append(allPos)
   
    rawdata = soup.select("#body > div:nth-child(6) > div > div.col.col-4 > ul > li:nth-child(1) > span")[0]
    bestPos = rawdata.text
    myList.append(bestPos)
   
    rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div:nth-child(4) > div > h5 > a")
    club = rawdata[0].text if len(rawdata) > 0 else "no club"
    myList.append(club)

    rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > div > div > a")
    nation = rawdata[0].get("title") if len(rawdata) > 0 else "other nation"
    myList.append(nation)
   
    # all numeric attributes (Crossing, Finishing, ..., GK Reflexes)
    rawdata = soup.select('#body > div:nth-child(6) > div > div.col.col-12')[0]
    data = rawdata.find_all(class_=re.compile('bp3-tag p'))
    myList.extend(allatt.text for allatt in data)
   
    return myList

def dealWithData(dataToWrite):
    header_list = ['id','name','birthday','height','weight','preferred_foot',"skill_move_level","reputation","wr_att","wr_def",'Positions','Best Position','Club',"nation",'Crossing','Finishing','Heading Accuracy', 'Short Passing','Volleys','Dribbling','Curve', 'FK Accuracy','Long Passing','Ball Control','Acceleration','Sprint Speed','Agility','Reactions','Balance','Shot Power','Jumping','Stamina','Strength','Long Shots','Aggression','Interceptions','Positioning','Vision','Penalties','Composure','Defensive Awareness','Standing Tackle','Sliding Tackle','GK Diving','GK Handling','GK Kicking','GK Positioning','GK Reflexes']
    with open('./output.csv', 'a+', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        if f.tell() == 0:   # only write the header when the file is new/empty
            writer.writerow(header_list)
        writer.writerows(dataToWrite)

def getPlayerID(key):
    url = f"https://sofifa.com/players?keyword={str(key)}"
    myRequest = requests.get(url)
    soup = BeautifulSoup(myRequest.text,'lxml')
    playerTable = soup.select("#body > div.center > div > div.col.col-12 > div > table > tbody")[0]
    data = playerTable.contents
    playersCandicate = []
   
    if len(data) > 0:
        for p in data:
            id = p.find("img")["id"]
            name = p.find("a")["aria-label"]
            ovr = p.find(attrs={"data-col":"oa"}).get_text()
            # the appended list literal was stripped by the forum; restore id/name/rating
            playersCandicate.append([id, name, ovr])
    else:
        print("not found")
        playersCandicate.append(["not found", "the name you're searching is >>", key])
      
    return playersCandicate

if __name__ == "__main__":
    # sweep player IDs in batches
    for start in range(20000, 40000, 1000):   # fetch 1000 players per batch
        soData = []
        for s in range(start, start + 1000):
            l = fetchData(s)
            if l is not None:
                soData.append(l)
        dealWithData(soData)
        time.sleep(60)   # sleep 60 seconds after each batch
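One note: getPlayerID is defined but never called in the main block. If you would rather look a player up by name than sweep IDs, a hypothetical usage would be:

for pid, name, ovr in getPlayerID("messi"):
    print(pid, name, ovr)

Each entry comes back as [id, name, rating], so the id can be fed straight into fetchData.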
