Scikit-learn Python Library: APIs

Scikit-learn's APIs are well documented and designed to make model building, training, and evaluation easy and efficient. Here is an overview of the main APIs the library provides:

1. Estimator API

This is the core API in scikit-learn, and every machine learning algorithm in the library follows it. Each estimator is a Python class, and the library includes estimators for classification, regression, clustering, and dimensionality reduction. Key methods include:

  • .fit(): Trains the model on the given data.
  • .predict(): Makes predictions for new data (available on estimators that can predict, such as classifiers and regressors).
  • .score(): Evaluates the fitted model on the given data with a default metric (mean accuracy for classifiers, R² for regressors).
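
The three methods above can be sketched as follows. This is a minimal illustration rather than a recommended workflow: the choice of LogisticRegression, the bundled iris dataset, and the max_iter value are assumptions made for the example, and the model is scored on its own training data only to keep the sketch short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Small bundled dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hyperparameters are set in the constructor; max_iter=1000 is an
# illustrative value chosen so the solver converges.
clf = LogisticRegression(max_iter=1000)

clf.fit(X, y)                    # learn model parameters from the data
predictions = clf.predict(X[:5]) # predict class labels for five samples
accuracy = clf.score(X, y)       # mean accuracy for a classifier

print(predictions, accuracy)
```

Swapping in another classifier or regressor would leave the fit, predict, and score calls unchanged, which is the main point of the shared interface.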

2. Transformers API

Transformers are used for data preprocessing and feature extraction. Typical operations include scaling, normalizing, and encoding data so that it can be used effectively by machine learning models. Key methods include:

  • .fit(): Learning the transformation parameters from the training data.
  • .transform(): Applying the transformation to any data using the learned parameters.
  • .fit_transform(): A utility method that combines fit and transform into a single operation.
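
A minimal sketch of the transformer methods, assuming StandardScaler as the transformer and two small hand-written arrays (both are illustrative choices, not taken from the text above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and one new sample.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learn per-feature mean and scale
X_train_scaled = scaler.transform(X_train) # apply to the training data
X_new_scaled = scaler.transform(X_new)     # reuse the same parameters on new data

# fit_transform performs both steps in one call on the same data.
X_train_scaled_again = StandardScaler().fit_transform(X_train)

print(scaler.mean_)
print(X_new_scaled)
```

Note that the new sample is transformed with the statistics learned from the training data; the scaler is not refitted on it.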

3. Pipeline API

Pipelines chain multiple steps into a single estimator, which is useful for building a model that applies a sequence of transformations followed by a final classifier or regressor. Key components include:

  • Pipeline: Class that behaves like a compound estimator.
  • make_pipeline: Helper function to simplify pipeline construction.
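
The two ways of building a pipeline can be sketched as below. The step names "scale" and "clf", the StandardScaler and LogisticRegression steps, and the iris dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline takes explicit (name, estimator) pairs.
named_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# make_pipeline derives step names from the lowercased class names.
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The whole pipeline behaves like a single estimator.
named_pipe.fit(X, y)
auto_pipe.fit(X, y)

pipe_accuracy = named_pipe.score(X, y)
auto_step_names = list(auto_pipe.named_steps)

print(auto_step_names, pipe_accuracy)
```

Explicit names are convenient when individual steps need to be referenced later, for example in a parameter grid; make_pipeline is shorter when the names do not matter.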

4. Model Selection API

This part of the library (the sklearn.model_selection module) provides tools for splitting data, comparing models, and tuning hyperparameters, primarily through cross-validation:

  • train_test_split: Split arrays or matrices into random train and test subsets.
  • cross_val_score: Evaluate a score by cross-validation.
  • GridSearchCV: Exhaustive search over specified parameter values for an estimator.
  • RandomizedSearchCV: Randomized search over parameters, trying a fixed number of candidates sampled from the specified values or distributions.
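
The four tools listed above might be combined as in the following sketch. The KNeighborsClassifier model, the n_neighbors candidate values, the split size, and the random_state values are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out part of the data for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# One score per fold for a single model configuration.
cv_scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)

# Exhaustive search: every value in the grid is cross-validated.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

# Randomized search: only n_iter candidates are sampled and evaluated.
random_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 16))},
    n_iter=5,
    cv=5,
    random_state=0,
)
random_search.fit(X_train, y_train)

# After fitting, the search object predicts and scores with the best model.
test_accuracy = grid.score(X_test, y_test)

print(cv_scores.mean(), grid.best_params_, random_search.best_params_, test_accuracy)
```

Randomized search tends to be the cheaper option when the parameter space is large, since its cost is set by n_iter rather than by the size of the grid.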

5. Metrics API

Scikit-learn provides a broad range of metrics for evaluating model performance, such as accuracy, ROC AUC, and mean squared error, along with tools to compute some of these metrics across cross-validation folds.

  • Classification metrics: accuracy_score, roc_auc_score, confusion_matrix, etc.
  • Regression metrics: mean_squared_error, r2_score, etc.
  • Clustering metrics: silhouette_score, adjusted_rand_score, etc.
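
A brief sketch of one or two functions from each metric family. All labels, scores, and points below are made-up toy values used only to show the call signatures.

```python
from sklearn.metrics import (
    accuracy_score,
    adjusted_rand_score,
    confusion_matrix,
    mean_squared_error,
    r2_score,
    roc_auc_score,
    silhouette_score,
)

# Classification: true labels, predicted labels, and predicted scores.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_score = [0.1, 0.6, 0.7, 0.9]

acc = accuracy_score(y_true, y_pred)   # 3 of 4 labels match
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
auc = roc_auc_score(y_true, y_score)   # uses scores, not hard labels

# Regression: continuous targets and predictions.
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)

# Clustering: compare two labelings, or judge one labeling against the data.
ari = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])  # label names do not matter
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
sil = silhouette_score(points, [0, 0, 1, 1])

print(acc, auc, mse, r2, ari, sil)
print(cm)
```

Most of these functions take the true values first and the predictions second; silhouette_score is the exception here because it needs the data itself rather than ground-truth labels.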

6. Decomposition API

This API (the sklearn.decomposition module) is used for dimensionality reduction, offering several methods that represent a high-dimensional dataset with a smaller number of components while retaining most of the important information:

  • PCA: Principal component analysis.
  • NMF: Non-negative matrix factorization.
  • TruncatedSVD: Dimensionality reduction using truncated SVD.
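
The three classes share the transformer interface, as this sketch suggests. The iris dataset, the choice of two components, and the NMF settings (init, max_iter, random_state) are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import NMF, PCA, TruncatedSVD

# Iris has 4 non-negative features, so it is also valid input for NMF.
X, _ = load_iris(return_X_y=True)

# PCA centers the data and projects it onto the top principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

# TruncatedSVD does not center the data, so it also works on sparse input.
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)

# NMF factorizes X into non-negative factors W (samples) and H (components).
nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

print(X_pca.shape, X_svd.shape, W.shape, H.shape, explained)
```

Each call reduces the four original features to two; explained_variance_ratio_ gives a rough indication of how much information PCA retains.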

7. Ensemble Methods API

Scikit-learn includes several ensemble algorithms that combine the predictions of multiple base estimators to improve generalizability and robustness:

  • RandomForestClassifier and RandomForestRegressor
  • GradientBoostingClassifier and GradientBoostingRegressor
  • AdaBoostClassifier and AdaBoostRegressor
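
Because the ensemble classes follow the estimator API, they can be trained and compared in a loop, as in this sketch. The dataset, split size, n_estimators value, and random_state values are assumptions for illustration; only the classifier variants are shown, and the regressor variants are used the same way on continuous targets.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Each ensemble builds n_estimators base models and combines their outputs.
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Fit every model on the training split and score it on the held-out split.
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)

print(test_scores)
```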