跳转至

Vertica 内置 ML 速查表(v12.0.1)

原文:Vertica Machine Learning V12.0.1 Cheat Sheet PDF 版:点击下载

Vertica 通过 SQL 接口支持完整机器学习工作流(v12.0.1)。更多信息请参考 Vertica ML 文档Vertica ML 示例库。数据集下载指南见 Vertica 文档,也可直接从 Vertica ML GitHub 下载。另请参阅 VerticaPy —— 与 Vertica 数据库深度集成的 scikit-like 机器学习库。

数据预处理

数据摘要

对数值列输出 COUNT、MEAN、STDDEV、MIN、PERC25、MEDIAN、PERC75 和 MAX:

=> SELECT summarize_numcol(hits, salary) OVER() FROM baseball WHERE dob > '1975-7-1'::DATE;

对分类列输出 CATEGORY、COUNT 和 PERCENT:

=> SELECT summarize_catcol(team USING PARAMETERS topk = 10) OVER() FROM baseball;

异常值检测

Robust Z-Score 方法,将异常值输出至指定表:

=> SELECT detect_outliers('baseball_outliers', 'baseball', 'hr, hits, avg, salary', 'robust_zscore' USING PARAMETERS outlier_threshold=3.0);

Isolation Forest 方法,训练模型:

=> SELECT iforest('baseball_outliers', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7, max_depth=15);

应用 iforest 模型,列出异常行:

=> SELECT * FROM (SELECT first_name, last_name, apply_iforest(hr, hits, salary USING PARAMETERS model_name='baseball_outliers', contamination=0.1) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;

相关性分析

计算各输入列之间的 Pearson 相关系数:

=> SELECT corr_matrix(hr, hits, avg, salary) OVER() FROM baseball;

数据归一化

Z-Score 归一化,输出到视图:

=> SELECT normalize('baseball_normz', 'baseball', 'hr, hits', 'zscore');

计算并存储归一化参数至模型:

=> SELECT normalize_fit('baseball_normfitrz', 'baseball', 'hr,hits', 'robust_zscore');

应用存储的归一化参数:

=> SELECT apply_normalize(* USING PARAMETERS model_name = 'baseball_normfitrz') FROM baseball;

逆转归一化变换:

=> SELECT reverse_normalize(* USING PARAMETERS model_name = 'baseball_normfitrz') FROM baseball;

降维

PCA — 计算主成分并保存至模型:

=> SELECT pca('world_pca', 'world', '*' USING PARAMETERS exclude_columns='HDI, country');

应用 PCA 变换,保留 99% 方差:

=> CREATE TABLE worldPCA AS SELECT apply_pca(* USING PARAMETERS model_name='world_pca', exclude_columns='HDI, country', key_columns='HDI, country', cutoff=0.99) OVER() FROM world;

逆转 PCA 变换:

=> SELECT apply_inverse_pca(HDI, country, col1, col2 USING PARAMETERS model_name='world_pca', exclude_columns='HDI, Country', key_columns='HDI, country') OVER() FROM worldPCA;

SVD 分解 — 计算并保存结果:

=> SELECT svd('world_svd', 'world', '*' USING PARAMETERS exclude_columns='HDI, country');

计算 U 矩阵并存储至表:

=> CREATE TABLE worldSVD AS SELECT apply_svd(* USING PARAMETERS model_name='world_svd', exclude_columns='HDI, country', key_columns='HDI, country', cutoff=0.99) OVER() FROM world;

将 SVD 变换后的数据还原回原始数据:

=> CREATE TABLE inverse_worldSVD AS SELECT apply_inverse_svd(* USING PARAMETERS model_name='world_svd', exclude_columns='HDI, country', key_columns='HDI, country') OVER() FROM worldSVD;

分类特征编码

生成 One-Hot 编码并存储至模型:

=> SELECT one_hot_encoder_fit('bTeamEncoder', 'baseball', 'team' USING PARAMETERS extra_levels='{"team" : ["Red Sox"]}');

应用 One-Hot 编码生成编码列视图:

=> CREATE VIEW baseballEncoded AS SELECT apply_one_hot_encoder(* USING PARAMETERS model_name='bTeamEncoder', drop_first=True, ignore_null=False) FROM baseball;

缺失值填充

按球队分组,用各组的均值填充 hits 列的缺失值:

=> SELECT impute('baseballImputed', 'baseball', 'hits', 'mean' USING PARAMETERS partition_columns='team');

数据平衡

使用混合采样使每个球队的样本数相等:

=> SELECT balance('baseballBalanced', 'baseball', 'team', 'hybrid_sampling');

数据采样

创建包含 25% 数据的新表:

=> CREATE TABLE baseball_sample AS SELECT * FROM baseball TABLESAMPLE(25);

模型训练与预测

回归

线性回归

训练线性回归模型(BFGS 优化器):

=> SELECT linear_reg('linear_reg_faithful', 'faithful_training', 'eruptions', 'waiting' USING PARAMETERS optimizer='bfgs');

应用模型预测喷发持续时间:

=> SELECT id, predict_linear_reg(waiting USING PARAMETERS model_name='linear_reg_faithful') FROM faithful_testing;

SVM 回归

训练 SVM 回归模型:

=> SELECT svm_regressor('svm_reg_faithful', 'faithful_training', 'eruptions', 'waiting' USING PARAMETERS error_tolerance=0.1);

应用模型预测:

=> SELECT id, predict_svm_regressor(waiting USING PARAMETERS model_name='svm_reg_faithful') FROM faithful_testing;

随机森林回归

训练随机森林回归模型:

=> SELECT rf_regressor('rf_reg_cars', 'mtcars_train', 'mpg', 'carb, cyl, hp, drat, wt' USING PARAMETERS ntree=100);

应用模型预测 MPG:

=> SELECT mpg, predict_rf_regressor(carb, cyl, hp, drat, wt USING PARAMETERS model_name='rf_reg_cars') FROM mtcars_test;

XGBoost 回归

训练 XGBoost 回归模型:

=> SELECT xgb_regressor('xgb_cars', 'mtcars_train', 'mpg', 'carb, cyl, hp, drat, wt' USING PARAMETERS learning_rate=0.5);

应用模型预测 MPG:

=> SELECT mpg, predict_xgb_regressor(carb, cyl, hp, drat, wt USING PARAMETERS model_name='xgb_cars') FROM mtcars_test;

分类

逻辑回归

训练逻辑回归模型(BFGS + L2 正则化):

=> SELECT logistic_reg('logistic_cars', 'mtcars_train', 'am', 'mpg, cyl, disp, hp, drat, wt, qsec, gear, carb' USING PARAMETERS optimizer='BFGS', regularization='L2');

应用模型预测自动/手动变速箱:

=> SELECT car_model, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars') FROM mtcars_test;

SVM 分类

训练 SVM 分类模型(排除指定列):

=> SELECT svm_classifier('svm_cars', 'mtcars_train', 'am', 'mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb' USING PARAMETERS exclude_columns='hp,drat');

应用模型预测:

=> SELECT car_model, am, predict_svm_classifier(mpg, cyl, disp, wt, qsec, vs, gear, carb USING PARAMETERS model_name='svm_cars') FROM mtcars_test;

朴素贝叶斯

训练朴素贝叶斯模型:

=> SELECT naive_bayes('naive_congress', 'house84_train', 'party', '*' USING PARAMETERS exclude_columns='party, id');

应用模型,返回预测结果:

=> SELECT party, predict_naive_bayes(vote1, vote2, vote3 USING PARAMETERS model_name='naive_congress', type='response') AS predicted_party FROM house84_test;

返回预测结果及各类别概率:

=> SELECT predict_naive_bayes_classes(id, vote1, vote2, vote3 USING PARAMETERS model_name='naive_congress', key_columns='id', exclude_columns='id', classes='democrat, republican', match_by_pos='false') OVER() FROM house84_test;

随机森林分类

训练随机森林分类模型:

=> SELECT rf_classifier('rf_iris', 'iris1', 'species', 'sepal_length, sepal_width, petal_length, petal_width' USING PARAMETERS ntree=100, sampling_size=0.3);

应用模型预测物种分类:

=> SELECT id, predict_rf_classifier(sepal_length, sepal_width, petal_length, petal_width USING PARAMETERS model_name='rf_iris') FROM iris2;

返回预测分类及概率:

=> SELECT predict_rf_classifier_classes(id, sepal_length, sepal_width, petal_length, petal_width USING PARAMETERS model_name='rf_iris', key_columns='id', exclude_columns='id') OVER() FROM iris2;

XGBoost 分类

训练 XGBoost 分类模型:

=> SELECT xgb_classifier('xgb_iris', 'iris1', 'species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width' USING PARAMETERS max_depth=5);

应用模型预测物种分类:

=> SELECT predict_xgb_classifier(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width USING PARAMETERS model_name='xgb_iris') FROM iris2;

返回预测分类及各分类概率(softmax):

=> SELECT predict_xgb_classifier_classes(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width USING PARAMETERS model_name='xgb_iris', classes='virginica, versicolor, setosa', probability_normalization='softmax') OVER() FROM iris2;

聚类

K-means

训练 K-means 模型,将微生物划分为 5 个聚类:

=> SELECT kmeans('agar_kmeans', 'agar_dish_1', '*', 5 USING PARAMETERS exclude_columns='id', output_view='agar_view', key_columns='id');

应用模型,为新行分配聚类:

=> SELECT id, apply_kmeans(x, y USING PARAMETERS model_name='agar_kmeans') FROM agar_dish_2;

Bisecting K-means

训练二分 K-means 模型:

=> SELECT bisecting_kmeans('agar_bkmeans', 'agar_dish_1', '*', 5 USING PARAMETERS exclude_columns='id', key_columns='id', output_view='agar_bk_view');

应用模型,指定输出 3 个聚类:

=> SELECT id, apply_bisecting_kmeans(x, y USING PARAMETERS model_name='agar_bkmeans', number_clusters=3) FROM agar_dish_2;

时间序列

自回归

训练自回归模型(使用前 3 个值预测当前值):

=> SELECT autoregressor('AR_temp', 'temp_data', 'Temperature', 'time' USING PARAMETERS p=3);

返回 20 个预测值:

=> SELECT predict_autoregressor(Temperature USING PARAMETERS model_name='AR_temp', npredictions=20) OVER(ORDER BY time) FROM temp_data;

移动平均

训练移动平均模型(使用前 3 次预测的误差):

=> SELECT moving_average('MA_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS q=3);

返回 20 个预测值:

=> SELECT predict_moving_average(Temperature USING PARAMETERS model_name='MA_temp', npredictions=20) OVER(ORDER BY time) FROM temp_data;

模型评估

回归指标

均方误差(MSE)

=> SELECT mse(obs, pred) OVER() FROM (SELECT eruptions AS obs, predict_linear_reg(waiting USING PARAMETERS model_name='linear_reg_faithful') AS pred FROM faithful_testing) AS prediction_output;

R 方(R-squared)

=> SELECT rsquared(obs, pred) OVER() FROM (SELECT eruptions AS obs, predict_linear_reg(waiting USING PARAMETERS model_name='linear_reg_faithful') AS pred FROM faithful_testing) AS prediction_output;

分类指标

混淆矩阵

=> SELECT confusion_matrix(obs::int, pred::int USING PARAMETERS num_classes=2) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars') AS pred FROM mtcars_test) AS prediction_output;

错误率

=> SELECT error_rate(obs::int, pred::int USING PARAMETERS num_classes=2) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars') AS pred FROM mtcars_test) AS prediction_output;

Lift 表

=> SELECT lift_table(obs::int, prob USING PARAMETERS num_bins=50) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars', type='probability') AS prob FROM mtcars_test) AS prediction_output;

ROC 曲线

=> SELECT roc(obs::int, prob USING PARAMETERS num_bins=50) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars', type='probability') AS prob FROM mtcars_test) AS prediction_output;

交叉验证

对 SVM 分类器执行 K 折交叉验证(5 折 x 3 个 C 超参数 = 15 个模型):

=> SELECT cross_validate('svm_cars', 'mtcars_train', 'am', 'mpg,cyl,disp,hp,drat,wt,qsec,vs,gear,carb' USING PARAMETERS cv_fold_count=5, cv_hyperparams='{"C":[0.1,1,5]}', cv_model_name='svm_cv', cv_metrics='error_rate');

PRC(精确率-召回率曲线)

=> SELECT prc(obs::int, prob::float USING PARAMETERS num_bins=50, f1_score=true) OVER() FROM (SELECT am AS obs, predict_logistic_reg(mpg, cyl, disp, hp, drat, wt, qsec, gear, carb USING PARAMETERS model_name='logistic_cars', type='probability') AS prob FROM mtcars_test) AS prediction_output;

决策树指标

查看决策树信息(如 is_leaf、node_depth 等):

=> SELECT read_tree(USING PARAMETERS model_name='rf_iris', format='tabular');

计算各特征重要性(MDI —— Mean Decrease Impurity):

=> SELECT rf_predictor_importance(USING PARAMETERS model_name='rf_iris');

模型管理

列出所有模型:

=> SELECT * FROM models;

删除模型:

=> DROP MODEL xgb_iris;

升级模型至当前数据库版本:

=> SELECT upgrade_model(USING PARAMETERS model_name='xgb_cars');

修改模型所有者、模式、名称:

=> ALTER MODEL xgb_cars OWNER TO analyst;
=> ALTER MODEL xgb_cars SET SCHEMA public;
=> ALTER MODEL xgb_cars RENAME TO xgb_autos;

授权 / 撤销模型使用权限:

=> GRANT USAGE ON MODEL linear_reg_faithful TO analyst;
=> REVOKE USAGE ON MODEL linear_reg_faithful FROM analyst;

查看模型摘要:

=> SELECT get_model_summary(USING PARAMETERS model_name='linear_reg_faithful');

列出模型所有属性(如正则化类型、迭代次数等):

=> SELECT get_model_attribute(USING PARAMETERS model_name='linear_reg_faithful');

获取模型特定属性值:

=> SELECT get_model_attribute(USING PARAMETERS model_name='linear_reg_faithful', attr_name='data');

导出/导入模型(跨 Vertica 集群):

=> SELECT export_models('/home/dbadmin/myModels', 'AR_temp');
=> SELECT import_models('/home/newDir/myModels/*' USING PARAMETERS new_schema='analyst');

使用外部模型

以下示例依赖外部模型,无法直接开箱运行。更多信息请参考 Vertica 文档TensorFlow 示例

PMML 模型

导入 PMML 模型:

=> SELECT import_models('/data/username/temp/spark_logistic_reg' USING PARAMETERS category='pmml');

使用 PMML 模型预测:

=> SELECT predict_pmml(* USING PARAMETERS model_name='spark_logistic_reg') AS prediction FROM test_data;

以 PMML 格式导出模型:

=> SELECT export_models('/path/to/export/to', 'spark_logistic_reg' USING PARAMETERS category='pmml');

TensorFlow 模型

导入 TensorFlow 模型:

=> SELECT import_models('/path/tf_models/tf_mnist_keras' USING PARAMETERS category='tensorflow');

使用 TensorFlow 模型预测:

=> SELECT predict_tensorflow(* USING PARAMETERS model_name='tf_mnist_keras') OVER(PARTITION BEST) FROM tf_mnist_test_images;

以冻结图(frozen graph)格式导出 TensorFlow 模型:

=> SELECT export_models('/path/to/export/to', 'tf_mnist_keras' USING PARAMETERS category='tensorflow');

扩展阅读