Automatic URL Submission Script for the Bing Search Engine

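Bing gives each site a daily URL submission quota of 10,000 links. For webmasters who depend on search engine traffic, it makes sense to submit as many links as possible every day.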

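Version 1: every day, pull 10,000 rows from the database, build the URLs, and submit them in batches of 500 (Bing's limit). Walking the table in ascending order with an offset is simple and guarantees that no URL is submitted twice. The drawback is that the oldest data is submitted first: with several million rows in the database, the program would need about a year to work through them all.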
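As a rough illustration of the Version 1 scheme, the sketch below lists the row offsets for one day's batches and estimates how many days a full pass would take. The function names and the 3,000,000-row figure are hypothetical, chosen only to make the arithmetic concrete.

```python
import math


def batch_offsets(daily_quota=10000, batch_size=500, start_offset=0):
    """Return the row offsets for one day's worth of batches."""
    return [start_offset + i * batch_size for i in range(daily_quota // batch_size)]


def days_needed(total_rows, daily_quota=10000):
    """Days required to submit every row once at the given daily quota."""
    return math.ceil(total_rows / daily_quota)


offsets = batch_offsets()
print(len(offsets), offsets[:3], offsets[-1])
print(days_needed(3_000_000))  # hypothetical table size
```

With these numbers a day consists of 20 batches, and a table of three million rows takes 300 days, which is the order of magnitude the paragraph above describes.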

Version 2: submit the data updated yesterday first, then spend whatever quota is left on the older links in ascending order.

First, here is the JSON submission example from the official documentation:

POST /webmaster/api.svc/json/SubmitUrlbatch?apikey=sampleapikeyedecc1ea4ae341cc8b6 HTTP/1.1
Content-Type: application/json; charset=utf-8
Host: ssl.bing.com
{
"siteUrl":"http://example.com",
"urlList":[
"http://example.com/url1","http://example.com/url2"
]
}

HTTP/1.1 200 OK
Content-Length: 10
Content-Type: application/json; charset=utf-8

{
"d":null
}
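The request above could be assembled in Python roughly as follows. This is only a sketch: build_submit_request is a hypothetical helper, not part of any Bing library, and it only builds the pieces of the request without sending anything.

```python
import json


def build_submit_request(api_key, site_url, urls):
    """Assemble the URL, headers and JSON body for a SubmitUrlbatch call."""
    endpoint = "https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlbatch"
    return {
        "url": f"{endpoint}?apikey={api_key}",
        "headers": {"Content-Type": "application/json; charset=utf-8"},
        "body": json.dumps({"siteUrl": site_url, "urlList": list(urls)}),
    }


req = build_submit_request(
    "sampleapikey",  # placeholder key
    "http://example.com",
    ["http://example.com/url1", "http://example.com/url2"],
)
print(req["url"])
print(req["body"])
```

The result can then be sent with requests.post(req["url"], data=req["body"], headers=req["headers"]). Going by the sample response above, a body of {"d":null} indicates success.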

The whole script is written in a procedural style.

Import the required libraries

import pymysql
import requests
import datetime
import math

Define the site host registered in the Bing backend and the API key generated there

my_host = "domain.com"
api_key = "api_key"

Batch size and offset

# Fetch 500 rows per batch and advance the offset by 500 each time
limit_amount = 500
offset_amount = 500

Query the remaining URL submission quota for today. This function comes from Jaynet.

# Query today's remaining quota before submitting
def search_bing_quota():
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Host": "ssl.bing.com",
    }
    quota_url = f"https://ssl.bing.com/webmaster/api.svc/json/GetUrlSubmissionQuota?siteUrl=http://{my_host}&apikey={api_key}"
    try:
        quota_res = requests.get(quota_url, headers=headers)
        quota_json = quota_res.json()
        if quota_res.status_code == 200:
            print("Search bing quota Success")
            DailyQuota = quota_json['d']['DailyQuota']
            MonthlyQuota = quota_json['d']['MonthlyQuota']
            print(f"Daily quota:{DailyQuota}, Monthly quota:{MonthlyQuota}.")
            return int(DailyQuota)
        else:
            print("Search bing quota Failed")
            print(quota_json)
    except Exception as err:
        print("Search bing quota Failed")
        print(err)
    # Every failure path ends up here: report zero quota so callers can still do arithmetic
    return 0

The URL submission endpoint

# bing SubmitUrl
url = f'https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlbatch?apikey={api_key}'
headers = {'content-type': 'application/json'}

# URL prefixes used to assemble the links
data_host = f"https://{my_host}"
data_prefix_url = f"https://www.{my_host}/number/"

Connect to the database and count the rows updated yesterday

# First, fetch the rows updated yesterday
# Keyword arguments are used because recent PyMySQL releases no longer accept these positionally
conn = pymysql.connect(host='rds', user='zxf', password='password', database='database')
table_name = "table_name"
cursor = conn.cursor()

def getYesterday():
    today = datetime.date.today()
    oneday = datetime.timedelta(days=1)
    yesterday = today - oneday
    return yesterday

yesterday_number_count_sql = f"SELECT number FROM `{table_name}` WHERE `up_date` LIKE '{getYesterday()}%'"
cursor.execute(yesterday_number_count_sql)
res = cursor.fetchall()
print(f"Rows updated yesterday: {len(res)}")

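The clause WHERE `up_date` LIKE '{getYesterday()}%' is written this way to take advantage of the index on the update time. For example, LIKE "2020-03-01%" finds every row whose update date starts with 2020-03-01, and a prefix match of this kind can use the index.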
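Whether the prefix match can really use the index depends on how up_date is stored. If the column is a DATE or DATETIME rather than a string, which is an assumption since the schema is not shown here, a half-open range comparison is a common alternative. The sketch below only builds the query and its parameters; yesterday_bounds is a hypothetical helper.

```python
import datetime


def yesterday_bounds(today=None):
    """Return (start, end) dates covering yesterday as a half-open range."""
    today = today or datetime.date.today()
    start = today - datetime.timedelta(days=1)
    return start, today


# A fixed date is used here so that the example is reproducible
start, end = yesterday_bounds(datetime.date(2020, 3, 2))
sql = "SELECT number FROM `table_name` WHERE `up_date` >= %s AND `up_date` < %s"
params = (start.isoformat(), end.isoformat())
print(sql, params)
```

With PyMySQL the pair would be passed as cursor.execute(sql, params), which also keeps the date values out of the SQL string itself.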

Next, provided yesterday's update count is not zero, there are three cases: more than 500 but fewer than 10,000; 10,000 or more; and 500 or fewer.

if len(res) != 0:
    # More than 500 rows were updated yesterday, so submit them in batches
    if 10000> len(res) > 500:
        fre_times = len(res)/500
        fre_times_ceil = math.ceil(fre_times) # 9.044 >>10
        for i in range(fre_times_ceil):
            yesterday_url_list =[] # start each batch with an empty list
            number_search_sql = f"SELECT number FROM `{table_name}` WHERE `up_date` LIKE '{getYesterday()}%' LIMIT {limit_amount} OFFSET {offset_amount*i}"
            cursor.execute(number_search_sql)
            res = cursor.fetchall()
            for number in res:
                yesterday_url_list.append(data_prefix_url + number[0])
            # Assemble the JSON payload
            data = {
                'siteUrl': data_host,
                "urlList":yesterday_url_list
            }
            # Submit inside the loop; the list is reset on the next iteration
            r = requests.post(url, json=data, headers=headers)
            print(r.text) # {"d":null} is successful!
    # 10,000 rows or more: the daily quota only covers 20 batches of 500
    elif len(res) >=10000:
        for i in range(20):
            yesterday_url_list =[] # start each batch with an empty list
            number_search_sql = f"SELECT number FROM `{table_name}` WHERE `up_date` LIKE '{getYesterday()}%' LIMIT {limit_amount} OFFSET {offset_amount*i}"
            cursor.execute(number_search_sql)
            res = cursor.fetchall()
            for number in res:
                yesterday_url_list.append(data_prefix_url + number[0])
            data = {
                'siteUrl': data_host,
                "urlList":yesterday_url_list
            }
            r = requests.post(url, json=data, headers=headers)
    else:  # 500 rows or fewer: a single submission is enough
        yesterday_url_list =[]
        number_search_sql = f"SELECT number FROM `{table_name}` WHERE `up_date` LIKE '{getYesterday()}%' LIMIT {limit_amount}"
        cursor.execute(number_search_sql)
        res = cursor.fetchall()
        for number in res:
            yesterday_url_list.append(data_prefix_url + number[0])
        data = {
            'siteUrl': data_host,
            "urlList":yesterday_url_list
        }
        r = requests.post(url, json=data, headers=headers)
else:
    print("No rows were updated yesterday!")
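The three branches above differ only in how many batches they send, so the case analysis can be reduced to a single calculation. The following is a minimal sketch of that idea, not the script's actual structure; yesterday_batch_count is a hypothetical name.

```python
import math


def yesterday_batch_count(row_count, batch_size=500, daily_quota=10000):
    """Number of batches to submit for yesterday's rows, capped by the daily quota."""
    if row_count <= 0:
        return 0
    max_batches = daily_quota // batch_size  # 20 with the defaults
    return min(math.ceil(row_count / batch_size), max_batches)


for count in (0, 300, 500, 4522, 10000, 25000):
    print(count, yesterday_batch_count(count))
```

A single loop over range(yesterday_batch_count(len(res))) would then cover all three cases with the same LIMIT and OFFSET query.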

Next, query the remaining quota and convert it into a number of batches, which is then added to the offset

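# Query the remaining quota again and divide by 500 to get the number of batches; the remainder is dropped
# "or 0" guards against search_bing_quota() returning None when the request fails
remain_times = (search_bing_quota() or 0) // 500
print(f"Remaining batches available today: {remain_times}")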

If fewer than 500 submissions remain, they are simply discarded.

# If no batches remain, skip everything below
if remain_times != 0:
    # Read the offset index saved by the previous run
    with open("last_time_offset_index.txt", 'r', encoding="utf-8") as fp:
        x= fp.read().split(",")
        right_index = int(x[1])
        print(f"right_index, {right_index}")
    # Advance by as many batches as the remaining quota allows
    offset_range = range(right_index, right_index + remain_times)
    try:
        for i in offset_range:
            urlList = [] # list used to assemble the URLs
            sql_offset = f"""SELECT number FROM {table_name} LIMIT {limit_amount} OFFSET {offset_amount*i}"""
            cursor.execute(sql_offset)
            res = cursor.fetchall()
            for number in res:
                urlList.append(data_prefix_url + number[0])
            # Assemble the JSON payload
            data = {
                'siteUrl': data_host,
                "urlList":urlList
            }
            # As with a GET request, r is the response object
            r = requests.post(url, json=data, headers=headers)
    except Exception as err:
        # Print the actual error instead of silently swallowing it
        print("something wrong!")
        print(err)
    finally:
        # Close the connection
        cursor.close()
        conn.close()
    # Record the offset reached by this run
    with open("last_time_offset_index.txt", 'w', encoding="utf-8") as fp:
        fp.write(f"{right_index},{right_index + remain_times}")

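Because the script is meant to run unattended on a server, it has to remember how far the offset has advanced each day. My approach is a file named last_time_offset_index.txt that stores the index range from the previous run, for example 79,89.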

After each run, the script overwrites these numbers so that the next run can read them and continue from there.
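The read and write steps for the offset file can be isolated into two small helpers, sketched below. Returning 0 when the file does not exist yet is an assumption added for the first run; the script itself expects the file to be present. read_offset and write_offset are hypothetical names.

```python
import os
import tempfile


def read_offset(path):
    """Return the right-hand index stored as 'left,right', or 0 if the file is missing."""
    if not os.path.exists(path):
        return 0  # assumption: a missing file means nothing has been submitted yet
    with open(path, "r", encoding="utf-8") as fp:
        return int(fp.read().strip().split(",")[1])


def write_offset(path, left, right):
    """Store the index range covered by the current run."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(f"{left},{right}")


# Demonstration in a temporary directory so no real state file is touched
demo_path = os.path.join(tempfile.mkdtemp(), "last_time_offset_index.txt")
first = read_offset(demo_path)
write_offset(demo_path, 79, 89)
second = read_offset(demo_path)
print(first, second)
```

The next run would start at the value returned by read_offset and, after submitting n batches, call write_offset(path, start, start + n).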

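There is still plenty in this program that could be improved. For example: it could be refactored into functions or classes; the path taken when 10,000 or more rows were updated yesterday never closes the cursor or the database connection (I know how the data in my database gets updated, so this case is very unlikely to occur); the last_time_offset_index file really only needs to hold a single number; and the leftover quota of fewer than 500 submissions does not have to be wasted. I am short on time, so I will stop here for now.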
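One way to avoid wasting the leftover quota is to plan the batch sizes from the quota itself, with a final smaller batch for the remainder. This is only a sketch of that improvement and is not implemented in the script above; plan_batch_sizes is a hypothetical helper.

```python
def plan_batch_sizes(quota, batch_size=500):
    """Split a quota into full batches plus one final partial batch."""
    full, remainder = divmod(quota, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes


print(plan_batch_sizes(1230))
print(sum(plan_batch_sizes(9999)))
```

Because the last batch can be shorter than 500, the saved position would then have to be a row offset rather than a count of batches, which fits the idea of storing a single number in the file.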

References:

https://docs.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.submiturlbatch?view=bing-webmaster-dotnet
https://docs.microsoft.com/en-us/bingwebmaster/getting-started

YOLO813
