1) The DeepSeek API — the shortest path, in three steps
Step A. Install and set your API key

```python
# shell commands:
#   pip install openai
#
#   # macOS/Linux
#   export DEEPSEEK_API_KEY="your-key"
#
#   # Windows (PowerShell)
#   setx DEEPSEEK_API_KEY "your-key"
```
Step B. Basic call (use the OpenAI SDK; just change base_url)
Base URL: https://api.deepseek.com
Model names: deepseek-chat (V3) or deepseek-reasoner (R1)
Chat Completions path: /chat/completions
Authentication: HTTP Bearer (Authorization: Bearer <key>)
The official documentation explicitly supports this setup and provides a Python example.
```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("DEEPSEEK_API_KEY"),
                base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "ping"}
    ],
    stream=False
)
print(resp.choices[0].message.content)
```
Step C. Structured output (JSON mode)
Turn on response_format={"type":"json_object"} and state explicitly in the prompt that the model must output valid JSON only; the DeepSeek docs specifically stress that the prompt should contain the word "json" plus an example of the expected output.
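As a minimal sketch of enforcing both requirements in one place (the `build_json_request` helper and its parameter names are my own, not from the DeepSeek docs):

```python
def build_json_request(model: str, system: str, user: str) -> dict:
    """Build kwargs for client.chat.completions.create with JSON mode on.

    Raises early if the prompt never mentions "json", since DeepSeek's
    JSON mode requires the word to appear somewhere in the messages.
    """
    if "json" not in (system + user).lower():
        raise ValueError('JSON mode requires the word "json" in the prompt')
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "response_format": {"type": "json_object"},
    }

kwargs = build_json_request(
    "deepseek-chat",
    'Return strict JSON: {"ok": bool}. No extra text.',
    "ping",
)
# resp = client.chat.completions.create(**kwargs)
```

Centralizing this check means a refactored prompt that drops the word "json" fails fast locally instead of failing at the API.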
2) Wiring DeepSeek into your existing Starter (paste-ready)
In your project, replace/add the llm_classify in core/classify.py with the version below (no UI changes needed):
```python
# add to / replace in core/classify.py
import os, json, time
from typing import List, Dict

from openai import OpenAI

from .taxonomy import TAXONOMY


def _batched(seq, n):
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def llm_classify(rows: List[Dict],
                 model: str = "deepseek-chat",
                 batch_size: int = 40,
                 temperature: float = 0.0,
                 max_tokens: int = 1500,
                 retries: int = 3,
                 sleep_s: float = 2.0) -> List[Dict]:
    """Multi-label classification + negativity detection via DeepSeek.

    Input rows must contain at least: review_text, seller, store_type.
    Output fields align with classify_batch: labels / is_negative /
    (optional) is_official.
    """
    client = OpenAI(api_key=os.getenv("DEEPSEEK_API_KEY"),
                    base_url="https://api.deepseek.com")
    allowed_labels = list(TAXONOMY.keys())
    sys_prompt = f"""
You are a multilingual e-commerce review classifier.
Return STRICT json only.
The final JSON schema:
{{
  "results": [{{"idx": int, "labels": [str], "is_negative": bool, "is_official": bool}}]
}}
Allowed labels (choose zero or more, must be a subset of this list):
{allowed_labels}
Label meanings (zh/brief): {TAXONOMY}
Rules:
- Classify each item independently.
- Detect negativity: true if strong complaint/defect/delay/refund/fake, else false.
- is_official: true if seller/store_type implies official/flagship/self-run; else false.
- If pure meme/noise/not about the product, use ["other_noise"] only.
Output valid JSON. No extra text.
"""
    out = []
    for batch in _batched(rows, batch_size):
        items = []
        for i, r in enumerate(batch):
            items.append({
                "idx": i,
                "text": (r.get("review_text") or "")[:2000],  # guard against overlong input
                "seller": r.get("seller", ""),
                "store_type": r.get("store_type", "")
            })
        user_payload = {"items": items}
        messages = [
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": "json\n" + json.dumps(user_payload, ensure_ascii=False)}
        ]
        for attempt in range(retries):
            try:
                resp = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=temperature,
                    response_format={"type": "json_object"},
                    max_tokens=max_tokens,
                )
                data = json.loads(resp.choices[0].message.content)
                result_map = {e["idx"]: e for e in data.get("results", [])}
                # Merge results back into the original rows
                for i, r in enumerate(batch):
                    e = result_map.get(i, {})
                    merged = dict(r)
                    merged["labels"] = e.get("labels", ["other_noise"])
                    merged["is_negative"] = bool(e.get("is_negative", False))
                    # Keep the rule-based is_official if the LLM did not return one
                    merged["is_official"] = bool(e.get("is_official", r.get("is_official", False)))
                    out.append(merged)
                break
            except Exception:
                if attempt == retries - 1:
                    # Final attempt still failed: fall back to safe defaults
                    for r in batch:
                        rr = dict(r)
                        rr.setdefault("labels", ["other_noise"])
                        rr.setdefault("is_negative", False)
                        rr.setdefault("is_official", r.get("is_official", False))
                        out.append(rr)
                time.sleep(sleep_s)
    return out
```
model="deepseek-chat" (V3) is sufficient for classification; for stronger reasoning, switch to deepseek-reasoner (R1).
response_format={"type":"json_object"} plus an explicit JSON instruction in the prompt maximizes the chance of parseable structured output.
If you hit intermittent empty responses or heavy traffic, retry with backoff, or switch to streaming (stream) and parse the SSE events; the official docs also cover rate limits and error codes.
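A minimal exponential-backoff sketch (the helper name, cap, and jitter choice are my own assumptions, not from the docs):

```python
import random

def backoff_delays(retries: int, base: float = 2.0, cap: float = 60.0):
    """Yield exponentially growing sleep times: base, 2*base, 4*base, ... capped.

    A small random jitter (up to 10% of the delay) avoids synchronized
    retries across parallel workers.
    """
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay + random.uniform(0, 0.1 * delay)

delays = list(backoff_delays(4, base=2.0))
# roughly [2, 4, 8, 16] seconds, each with up to 10% jitter added
```

You would call `time.sleep(d)` on each yielded value between attempts, instead of the fixed `sleep_s` used above.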
Switching the UI to LLM classification
In app/streamlit_app.py, change:
```python
from core.classify import classify_batch
```
temporarily to:
```python
from core.classify import classify_batch, llm_classify
```
Then, after the "Start analysis" button is clicked, pick whichever path you want (rules first with LLM correction, or LLM directly):
```python
# baseline
classified = classify_batch(rows)

# Run the LLM only on borderline / negative / high-value-SKU samples
# (the line below runs the LLM on everything, as an example):
# classified = llm_classify(rows, model="deepseek-chat", batch_size=40)
```
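The "rules first, LLM on the hard cases" path could look like the sketch below (`select_for_llm` and its heuristics are my own illustration, not part of the Starter):

```python
def select_for_llm(rows):
    """Pick row indices worth a second LLM pass after rule-based classification.

    Heuristics (all assumptions): negative rows, rows the rules could not
    label (empty or noise-only), and rows with suspiciously many labels
    are the ones most likely to benefit from LLM correction.
    """
    picked = []
    for i, r in enumerate(rows):
        labels = r.get("labels", [])
        if (r.get("is_negative")
                or not labels
                or labels == ["other_noise"]
                or len(labels) > 3):
            picked.append(i)
    return picked

rows = [
    {"labels": ["price_value"], "is_negative": False},
    {"labels": ["other_noise"], "is_negative": False},
    {"labels": ["product_quality"], "is_negative": True},
]
idx = select_for_llm(rows)   # → [1, 2]
```

You would then call `llm_classify` on just those rows and write the results back, keeping API spend proportional to the hard cases.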
3) Is a 32 KB dictionary enough? How to turn it into a labeled training set
Verdict: it works fine as a weakly supervised "seed dictionary": auto-apply multi-labels to the 9 million reviews in a single pass to build a first training set.
Then train a small model on it (TF-IDF → Linear SVM/LogReg, or fastText).
Afterwards, use DeepSeek to spot-check and correct borderline samples, write the fixes back, and iterate toward a progressively cleaner training set.
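A minimal seed-dictionary labeler to produce that first pass (the `SEED_DICT` contents and the `dict_label` helper are illustrative assumptions, not your actual 32 KB dictionary):

```python
SEED_DICT = {
    "product_quality": ["broken", "defect", "质量", "crack"],
    "logistics": ["late", "delay", "shipping", "物流"],
    "price_value": ["expensive", "cheap", "worth", "性价比"],
}

def dict_label(text: str, seed_dict: dict = SEED_DICT) -> list:
    """Weak labeler: a label fires if any of its seed terms appears in the text."""
    t = text.lower()
    labels = [lab for lab, terms in seed_dict.items()
              if any(term in t for term in terms)]
    return labels or ["other_noise"]

dict_label("Screen arrived broken and shipping was late")
# → ["product_quality", "logistics"]
```

Applied with `df["labels"] = df["review_text"].map(dict_label)`, this yields the multi-label column the training script below expects.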
Minimal training-script example (create scripts/train_svm.py):
```python
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1) Generate the labels column (List[str]) with your dictionary labeler first
df = pd.read_csv("your_clean_reviews.csv")  # needs at least review_text
# Assume the rules have already turned every review into a multi-label list:
# df["labels"] = [["product_quality", "price_value"], ...]
df = df[df["labels"].map(len) > 0]

# 2) Vectorize + multi-label training
X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["labels"], test_size=0.1, random_state=42,
    stratify=df["labels"].map(lambda x: x[0] if x else "none")
)
vec = TfidfVectorizer(max_features=200000, ngram_range=(1, 2), min_df=3)
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

mlb = MultiLabelBinarizer()
Ytr = mlb.fit_transform(y_train)
Yte = mlb.transform(y_test)

clf = OneVsRestClassifier(LogisticRegression(max_iter=200, n_jobs=8))
clf.fit(Xtr, Ytr)
pred = clf.predict(Xte)
print(classification_report(Yte, pred, target_names=mlb.classes_))

# Save vec / clf / mlb for inference later
joblib.dump({"vec": vec, "clf": clf, "mlb": mlb}, "review_clf.joblib")
```
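At inference time you may want `predict_proba` plus a per-label threshold instead of the default 0.5 decision, so sparse labels still fire; a small sketch (the helper name and the 0.35 threshold are assumptions to tune on your data):

```python
def labels_from_scores(scores, classes, threshold=0.35):
    """Turn one row of per-label probabilities into a label list.

    Falls back to the single best-scoring label when nothing clears the
    threshold, so every review receives at least one label.
    """
    picked = [c for c, s in zip(classes, scores) if s >= threshold]
    if not picked:
        best = max(range(len(classes)), key=lambda i: scores[i])
        picked = [classes[best]]
    return picked

labels_from_scores([0.8, 0.4, 0.1],
                   ["product_quality", "logistics", "price_value"])
# → ["product_quality", "logistics"]
```

With the OneVsRest classifier above, `scores` would be one row of `clf.predict_proba(Xte)` and `classes` would be `mlb.classes_`.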
Dictionary-expansion tips (quick wins)
For each label, mine new terms with high PMI / chi-square scores and add the top candidates to the dictionary.
Prepare synonyms, common misspellings, colloquial spellings, and emoji variants for Chinese, English, and Southeast-Asian languages.
Negative-sample mining: pull samples the model scores low on but humans judged as belonging to the class, and backfill terms from them.
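The PMI mining step can be sketched with the standard library alone (the `top_pmi_terms` helper and its parameters are my own illustration):

```python
import math
from collections import Counter

def top_pmi_terms(docs, doc_labels, label, k=5, min_count=2):
    """Rank terms by PMI with `label`: log[ P(term, label) / (P(term) P(label)) ].

    docs: list of token lists; doc_labels: aligned list of label lists.
    Counts document-level co-occurrence; rare terms below min_count are skipped.
    """
    n = len(docs)
    term_cnt, joint_cnt = Counter(), Counter()
    label_docs = 0
    for toks, labs in zip(docs, doc_labels):
        has_label = label in labs
        label_docs += has_label
        for t in set(toks):
            term_cnt[t] += 1
            if has_label:
                joint_cnt[t] += 1
    scores = {}
    for t, c in term_cnt.items():
        if c < min_count or joint_cnt[t] == 0:
            continue
        scores[t] = math.log((joint_cnt[t] / n) / ((c / n) * (label_docs / n)))
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [["broke", "fast"], ["broke", "bad"], ["cheap", "fast"], ["cheap", "good"]]
labs = [["quality"], ["quality"], ["price"], ["price"]]
top_pmi_terms(docs, labs, "quality")
# → ["broke", "fast"]
```

The top-ranked candidates then go to a quick human review before being appended to the seed dictionary.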
4) Packaging and risk-control tips (short)
Concurrency and backoff: batch roughly 40 items per call; on failure, back off exponentially; handle 429/503/500 (the error codes are documented in the DeepSeek API Docs).
JSON mode: declare JSON in both the prompt and response_format, and set max_tokens high enough that the JSON does not get truncated (also covered in the DeepSeek API Docs).
Compliance: for platform reviews, prefer the official API/exports or data you already have; do not bypass access controls.
Official/flagship-only filter: keep stores such as "Apple official flagship store / self-run / Official"; count suspected second-hand/refurbished listings under their own label so the report can flag them.
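The store filter can be a simple pattern match (the patterns and the `store_bucket` helper below are illustrative assumptions; extend them with your marketplace's actual naming conventions):

```python
import re

# Illustrative patterns only; real store names will need more variants.
OFFICIAL_PAT = re.compile(r"官方旗舰店|自营|official|flagship", re.IGNORECASE)
RESALE_PAT = re.compile(r"二手|翻新|refurbish|used|second[- ]hand", re.IGNORECASE)

def store_bucket(seller: str, store_type: str = "") -> str:
    """Bucket a store: 'official', 'suspect_resale', or 'other'.

    Resale markers win over official markers, so a 'refurbished flagship'
    listing still lands in the suspect bucket for separate reporting.
    """
    s = f"{seller} {store_type}"
    if RESALE_PAT.search(s):
        return "suspect_resale"
    if OFFICIAL_PAT.search(s):
        return "official"
    return "other"

store_bucket("Apple官方旗舰店")       # → "official"
store_bucket("XX数码", "二手优品")    # → "suspect_resale"
```

Tagging each row with this bucket lets the report break statistics out by store class instead of silently dropping non-official data.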
5) The minimal deliverables to tackle right now
Run the baseline classifier over your cleaned CSV once (already in the Starter).
Hook up llm_classify above, run the DeepSeek classifier on a presentable data slice, and export a CSV/JSON report.
Auto-generate ≤10 recommendations from the top negative labels (built into the Starter), hand-polish a couple of sentences, and paste evidence samples (original review text) into the report.
If you also want a "local model" version, do weak-supervision labeling → run train_svm.py above, then chart its predictions against the LLM results for comparison.