Vertica 内置 ML 速查表(v12.0.1)¶
Vertica 通过 SQL 接口支持完整机器学习工作流(v12.0.1)。更多信息请参考 Vertica ML 文档 或 Vertica ML 示例库。数据集下载指南见 Vertica 文档,也可直接从 Vertica ML GitHub 下载。另请参阅 VerticaPy —— 与 Vertica 数据库深度集成的 scikit-like 机器学习库。
数据预处理¶
数据摘要¶
对数值列输出 COUNT、MEAN、STDDEV、MIN、PERC25、MEDIAN、PERC75 和 MAX:
对分类列输出 CATEGORY、COUNT 和 PERCENT:
异常值检测¶
Robust Z-Score 方法,将异常值输出至指定表:
=> SELECT detect_outliers('baseball_outliers', 'baseball', 'hr, hits, avg, salary', 'robust_zscore' USING PARAMETERS outlier_threshold=3.0);
Isolation Forest 方法,训练模型:
=> SELECT iforest('baseball_outliers', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7, max_depth=15);
应用 iforest 模型,列出异常行:
=> SELECT * FROM (SELECT first_name, last_name, apply_iforest(hr, hits, salary USING PARAMETERS model_name='baseball_outliers', contamination=0.1) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
相关性分析¶
计算各输入列之间的 Pearson 相关系数:
数据归一化¶
Z-Score 归一化,输出到视图:
计算并存储归一化参数至模型:
应用存储的归一化参数:
逆转归一化变换:
降维¶
PCA — 计算主成分并保存至模型:
应用 PCA 变换,保留 99% 方差:
=> CREATE TABLE worldPCA AS SELECT apply_pca(* USING PARAMETERS model_name='world_pca', exclude_columns='HDI, country', key_columns='HDI, country', cutoff=0.99) OVER() FROM world;
逆转 PCA 变换:
=> SELECT apply_inverse_pca(HDI, country, col1, col2 USING PARAMETERS model_name='world_pca', exclude_columns='HDI, Country', key_columns='HDI, country') OVER() FROM worldPCA;
SVD 分解 — 计算并保存结果:
计算 U 矩阵并存储至表:
=> CREATE TABLE worldSVD AS SELECT apply_svd(* USING PARAMETERS model_name='world_svd', exclude_columns='HDI, country', key_columns='HDI, country', cutoff=0.99) OVER() FROM world;
将 SVD 变换后的数据还原回原始数据:
=> CREATE TABLE inverse_worldSVD AS SELECT apply_inverse_svd(* USING PARAMETERS model_name='world_svd', exclude_columns='HDI, country', key_columns='HDI, country') OVER() FROM worldSVD;
分类特征编码¶
生成 One-Hot 编码并存储至模型:
=> SELECT one_hot_encoder_fit('bTeamEncoder', 'baseball', 'team' USING PARAMETERS extra_levels='{"team" : ["Red Sox"]}');
应用 One-Hot 编码生成编码列视图:
=> CREATE VIEW baseballEncoded AS SELECT apply_one_hot_encoder(* USING PARAMETERS model_name='bTeamEncoder', drop_first=True, ignore_null=False) FROM baseball;
缺失值填充¶
按球队分组,用各组的均值填充 hits 列的缺失值:
=> SELECT impute('baseballImputed', 'baseball', 'hits', 'mean' USING PARAMETERS partition_columns='team');
数据平衡¶
使用混合采样使每个球队的样本数相等:
数据采样¶
创建包含 25% 数据的新表:
模型训练与预测¶
回归¶
线性回归¶
训练线性回归模型(BFGS 优化器):
=> SELECT linear_reg('linear_reg_faithful', 'faithful_training', 'eruptions', 'waiting' USING PARAMETERS optimizer='bfgs');
应用模型预测喷发持续时间:
=> SELECT id, predict_linear_reg(waiting USING PARAMETERS model_name='linear_reg_faithful') FROM faithful_testing;
SVM 回归¶
训练 SVM 回归模型:
=> SELECT svm_regressor('svm_reg_faithful', 'faithful_training', 'eruptions', 'waiting' USING PARAMETERS error_tolerance=0.1);
应用模型预测:
=> SELECT id, predict_svm_regressor(waiting USING PARAMETERS model_name='svm_reg_faithful') FROM faithful_testing;
随机森林回归¶
训练随机森林回归模型:
=> SELECT rf_regressor('rf_reg_cars', 'mtcars_train', 'mpg', 'carb, cyl, hp, drat, wt' USING PARAMETERS ntree=100);
应用模型预测 MPG:
=> SELECT mpg, predict_rf_regressor(carb, cyl, hp, drat, wt USING PARAMETERS model_name='rf_reg_cars') FROM mtcars_test;
XGBoost 回归¶
训练 XGBoost 回归模型:
=> SELECT xgb_regressor('xgb_cars', 'mtcars_train', 'mpg', 'carb, cyl, hp, drat, wt' USING PARAMETERS learning_rate=0.5);
应用模型预测 MPG:
=> SELECT mpg, predict_xgb_regressor(carb, cyl, hp, drat, wt USING PARAMETERS model_name='xgb_cars') FROM mtcars_test;
分类¶
逻辑回归¶
训练逻辑回归模型(BFGS + L2 正则化):
=> SELECT logistic_reg('logistic_cars', 'mtcars_train', 'am', 'mpg, cyl, disp, hp, drat, wt, qsec, gear, carb' USING PARAMETERS optimizer='BFGS', regularization='L2');
应用模型预测自动/手动变速箱:
=> SELECT car_model, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars') FROM mtcars_test;
SVM 分类¶
训练 SVM 分类模型(排除指定列):
=> SELECT svm_classifier('svm_cars', 'mtcars_train', 'am', 'mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb' USING PARAMETERS exclude_columns='hp,drat');
应用模型预测:
=> SELECT car_model, am, predict_svm_classifier(mpg, cyl, disp, wt, qsec, vs, gear, carb USING PARAMETERS model_name='svm_cars') FROM mtcars_test;
朴素贝叶斯¶
训练朴素贝叶斯模型:
=> SELECT naive_bayes('naive_congress', 'house84_train', 'party', '*' USING PARAMETERS exclude_columns='party, id');
应用模型,返回预测结果:
=> SELECT party, predict_naive_bayes(vote1, vote2, vote3 USING PARAMETERS model_name='naive_congress', type='response') AS predicted_party FROM house84_test;
返回预测结果及各类别概率:
=> SELECT predict_naive_bayes_classes(id, vote1, vote2, vote3 USING PARAMETERS model_name='naive_congress', key_columns='id', exclude_columns='id', classes='democrat, republican', match_by_pos='false') OVER() FROM house84_test;
随机森林分类¶
训练随机森林分类模型:
=> SELECT rf_classifier('rf_iris', 'iris1', 'species', 'sepal_length, sepal_width, petal_length, petal_width' USING PARAMETERS ntree=100, sampling_size=0.3);
应用模型预测物种分类:
=> SELECT id, predict_rf_classifier(sepal_length, sepal_width, petal_length, petal_width USING PARAMETERS model_name='rf_iris') FROM iris2;
返回预测分类及概率:
=> SELECT predict_rf_classifier_classes(id, sepal_length, sepal_width, petal_length, petal_width USING PARAMETERS model_name='rf_iris', key_columns='id', exclude_columns='id') OVER() FROM iris2;
XGBoost 分类¶
训练 XGBoost 分类模型:
=> SELECT xgb_classifier('xgb_iris', 'iris1', 'species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width' USING PARAMETERS max_depth=5);
应用模型预测物种分类:
=> SELECT predict_xgb_classifier(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width USING PARAMETERS model_name='xgb_iris') FROM iris2;
返回预测分类及各分类概率(softmax):
=> SELECT predict_xgb_classifier_classes(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width USING PARAMETERS model_name='xgb_iris', classes='virginica, versicolor, setosa', probability_normalization='softmax') OVER() FROM iris2;
聚类¶
K-means¶
训练 K-means 模型,将微生物划分为 5 个聚类:
=> SELECT kmeans('agar_kmeans', 'agar_dish_1', '*', 5 USING PARAMETERS exclude_columns='id', output_view='agar_view', key_columns='id');
应用模型,为新行分配聚类:
Bisecting K-means¶
训练二分 K-means 模型:
=> SELECT bisecting_kmeans('agar_bkmeans', 'agar_dish_1', '*', 5 USING PARAMETERS exclude_columns='id', key_columns='id', output_view='agar_bk_view');
应用模型,指定输出 3 个聚类:
=> SELECT id, apply_bisecting_kmeans(x, y USING PARAMETERS model_name='agar_bkmeans', number_clusters=3) FROM agar_dish_2;
时间序列¶
自回归¶
训练自回归模型(使用前 3 个值预测当前值):
返回 20 个预测值:
=> SELECT predict_autoregressor(Temperature USING PARAMETERS model_name='AR_temp', npredictions=20) OVER(ORDER BY time) FROM temp_data;
移动平均¶
训练移动平均模型(使用前 3 次预测的误差):
返回 20 个预测值:
=> SELECT predict_moving_average(Temperature USING PARAMETERS model_name='MA_temp', npredictions=20) OVER(ORDER BY time) FROM temp_data;
模型评估¶
回归指标¶
均方误差(MSE)¶
=> SELECT mse(obs, pred) OVER() FROM (SELECT eruptions AS obs, predict_linear_reg(waiting USING PARAMETERS model_name='linear_reg_faithful') AS pred FROM faithful_testing) AS prediction_output;
R 方(R-squared)¶
=> SELECT rsquared(obs, pred) OVER() FROM (SELECT eruptions AS obs, predict_linear_reg(waiting USING PARAMETERS model_name='linear_reg_faithful') AS pred FROM faithful_testing) AS prediction_output;
分类指标¶
混淆矩阵¶
=> SELECT confusion_matrix(obs::int, pred::int USING PARAMETERS num_classes=2) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars') AS pred FROM mtcars_test) AS prediction_output;
错误率¶
=> SELECT error_rate(obs::int, pred::int USING PARAMETERS num_classes=2) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars') AS pred FROM mtcars_test) AS prediction_output;
Lift 表¶
=> SELECT lift_table(obs::int, prob USING PARAMETERS num_bins=50) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars', type='probability') AS prob FROM mtcars_test) AS prediction_output;
ROC 曲线¶
=> SELECT roc(obs::int, prob USING PARAMETERS num_bins=50) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars', type='probability') AS prob FROM mtcars_test) AS prediction_output;
交叉验证¶
对 SVM 分类器执行 K 折交叉验证(5 折 x 3 个 C 超参数 = 15 个模型):
=> SELECT cross_validate('svm_cars', 'mtcars_train', 'am', 'mpg,cyl,disp,hp,drat,wt,qsec,vs,gear,carb' USING PARAMETERS cv_fold_count=5, cv_hyperparams='{"C":[0.1,1,5]}', cv_model_name='svm_cv', cv_metrics='error_rate');
PRC(精确率-召回率曲线)¶
=> SELECT prc(obs::int, prob::float USING PARAMETERS num_bins=50, f1_score=true) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars', type='probability') AS prob FROM mtcars_test) AS prediction_output;
决策树指标¶
查看决策树信息(如 is_leaf、node_depth 等):
计算各特征重要性(MDI —— Mean Decrease Impurity):
模型管理¶
列出所有模型:
删除模型:
升级模型至当前数据库版本:
修改模型所有者、模式、名称:
=> ALTER MODEL xgb_cars OWNER TO analyst;
=> ALTER MODEL xgb_cars SET SCHEMA public;
=> ALTER MODEL xgb_cars RENAME TO xgb_autos;
授权 / 撤销模型使用权限:
=> GRANT USAGE ON MODEL linear_reg_faithful TO analyst;
=> REVOKE USAGE ON MODEL linear_reg_faithful FROM analyst;
查看模型摘要:
列出模型所有属性(如正则化类型、迭代次数等):
获取模型特定属性值:
导出/导入模型(跨 Vertica 集群):
=> SELECT export_models('/home/dbadmin/myModels', 'AR_temp');
=> SELECT import_models('/home/newDir/myModels/*' USING PARAMETERS new_schema='analyst');
使用外部模型¶
以下示例依赖外部模型,无法直接开箱运行。更多信息请参考 Vertica 文档 或 TensorFlow 示例。
PMML 模型¶
导入 PMML 模型:
使用 PMML 模型预测:
=> SELECT predict_pmml(* USING PARAMETERS model_name='spark_logistic_reg') AS prediction FROM test_data;
以 PMML 格式导出模型:
=> SELECT export_models('/path/to/export/to', 'spark_logistic_reg' USING PARAMETERS category='pmml');
TensorFlow 模型¶
导入 TensorFlow 模型:
使用 TensorFlow 模型预测:
=> SELECT predict_tensorflow(* USING PARAMETERS model_name='tf_mnist_keras') OVER(PARTITION BEST) FROM tf_mnist_test_images;
以冻结图(frozen graph)格式导出 TensorFlow 模型:
=> SELECT export_models('/path/to/export/to', 'tf_mnist_keras' USING PARAMETERS category='tensorflow');