Galaxy Machine Learning Community
The Galaxy Machine Learning workbench is a comprehensive set of data preprocessing, machine learning, deep learning and visualisation tools, together with consolidated workflows for end-to-end machine learning analysis and training materials that showcase how to use these tools. The workbench runs on the Galaxy framework, which provides simple access, easy extension, flexible adaptation to personal and security needs, and sophisticated machine learning analyses that do not require command-line knowledge.
The workbench provides you with a Swiss Army knife of tools built on scikit-learn, Keras (a deep learning library based on TensorFlow) and various other libraries to transform, learn from, predict on and plot your data.
The workbench is currently developed by the Goecks Lab and the European Galaxy project. The German Network for Bioinformatics Infrastructure (de.NBI), which runs the German ELIXIR Node, provides the necessary compute clusters with CPU and GPU resources.
The project is a community effort. Please jump in, ask questions, and contribute to the development of new tools, workflows or training materials!
Training
We are passionate about training, so we work in close collaboration with the Galaxy Training Network (GTN) to develop training materials for data analyses based on Galaxy. These materials are hosted on the GTN GitHub repository and are available online at https://training.galaxyproject.org.
Want to learn more about machine learning? Take one of our guided tours or check out the hands-on tutorials developed together with the GTN community.
Available tools
This section lists the most important tools that have been integrated into the Machine Learning workbench, divided into categories for better readability. Many more tools are available, so please take a closer look at the tool panel at https://ml.usegalaxy.eu. All tools follow the IUC best practice guidelines for Galaxy tool development and are available at https://github.com/bgruening/galaxytools and https://github.com/goeckslab/Galaxy-ML.
Classification
Identifying which category an object belongs to.
Tool | Description | Reference |
---|---|---|
SVM Classifier | Support vector machines (SVMs) for classification | Pedregosa et al. 2011 |
NN Classifier | Nearest Neighbors Classification | Pedregosa et al. 2011 |
Ensemble classification | Ensemble methods for classification and regression | Pedregosa et al. 2011 |
Discriminant Classifier | Linear and Quadratic Discriminant Analysis | Pedregosa et al. 2011 |
Generalized linear | Generalized linear models for classification and regression | Pedregosa et al. 2011 |
CLF Metrics | Calculate metrics for classification performance | Pedregosa et al. 2011 |
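
The classification tools wrap scikit-learn estimators and metrics. As a rough idea of the kind of analysis they perform, here is a minimal sketch that calls scikit-learn directly; it is not the Galaxy tool code, and the synthetic dataset, split ratio and SVM parameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class dataset (assumption: stands in for your own tabular data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a support vector classifier and predict labels for the held-out samples.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Classification metrics computed on the held-out samples.
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```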
Regression
Predicting a continuous-valued attribute associated with an object.
Tool | Description | Reference |
---|---|---|
Ensemble regression | Ensemble methods for classification and regression | Pedregosa et al. 2011 |
Generalized linear | Generalized linear models for classification and regression | Pedregosa et al. 2011 |
Regression metrics | Calculate metrics for regression performance | Pedregosa et al. 2011 |
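
For regression, a comparable sketch in plain scikit-learn might look as follows. The random forest, the synthetic data and the chosen metrics are assumptions made for illustration and do not reflect the defaults of the Galaxy tools.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data with some noise (illustrative only).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An ensemble regressor: an average over many decision trees.
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Regression metrics on the held-out samples.
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.2f} R2={r2:.3f}")
```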
Clustering
Automatic grouping of similar objects into sets.
Tool | Description | Reference |
---|---|---|
Numeric clustering | Different numerical clustering algorithms | Pedregosa et al. 2011 |
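
As an example of what a numerical clustering algorithm does, the following sketch groups synthetic points with k-means in scikit-learn. The choice of k-means, the number of clusters and the generated data are assumptions; the Galaxy tool offers several algorithms.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic point clouds in two dimensions (illustrative only).
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Assign every sample to one of three clusters.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_)
```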
Model building
Building general machine learning models.
Tool | Description | Reference |
---|---|---|
Estimator Attributes | Get all attributes from an estimator or scikit-learn object | Pedregosa et al. 2011 |
Stacking Ensemble Models | Build stacking and voting ensemble models with numerous base options | Pedregosa et al. 2011 |
Search CV | Perform hyperparameter optimization using various SearchCV strategies | Pedregosa et al. 2011 |
Build Pipeline | Pipeline Builder, an all-in-one platform to build a pipeline, single estimator, preprocessor or custom wrapper | Pedregosa et al. 2011 |
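
Building a pipeline and searching its hyperparameters are closely related steps. The sketch below shows the underlying scikit-learn pattern; the step names `scale` and `clf`, the logistic regression estimator and the parameter grid are hypothetical choices for illustration, not settings taken from the Galaxy tools.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A pipeline chains a preprocessor and a final estimator into one model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# Exhaustive search over the grid with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```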
Model evaluation
Evaluating, validating and choosing parameters and models.
Tool | Description | Reference |
---|---|---|
Model validation | Model Validation includes cross_validate, cross_val_predict, learning_curve, and more | Pedregosa et al. 2011 |
Pairwise Metrics | Evaluate pairwise distances or compute affinity or kernel for sets of samples | Pedregosa et al. 2011 |
Train/Test evaluation | Train, Test and Evaluation to fit a model using part of the dataset and evaluate using the rest | Pedregosa et al. 2011 |
Model Prediction | Model Prediction predicts on new data using a prefitted model | Chollet et al. 2015 |
Fitted model evaluation | Evaluate a Fitted Model using a new batch of labeled data | Pedregosa et al. 2011 |
Model fitting | Fit a Pipeline, Ensemble or other models using a labeled dataset | Pedregosa et al. 2011 |
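
The `cross_validate` and `cross_val_predict` functions named in the table are scikit-learn functions. A minimal sketch of how they are typically called follows; the k-nearest-neighbours classifier, the synthetic data and the choice of five folds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)

# Score the model on 5 folds with two metrics at once.
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])

# Out-of-fold predictions: each sample is predicted by a model
# that was trained without it.
oof_pred = cross_val_predict(model, X, y, cv=5)

print("mean accuracy:", scores["test_accuracy"].mean())
```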
Preprocessing and feature selection
Feature selection and preprocessing.
Tool | Description | Reference |
---|---|---|
Data preprocessing | Preprocess raw feature vectors into standardized datasets | Pedregosa et al. 2011 |
Feature selection | Feature selection module, including univariate filter selection methods and the recursive feature elimination algorithm | Pedregosa et al. 2011 |
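
To make the difference between these steps concrete, the following sketch standardizes a feature matrix and then applies both a univariate filter and recursive feature elimination with scikit-learn. The scoring function, the number of selected features and the estimator used for elimination are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Preprocessing: rescale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Univariate filter: keep the 8 features with the highest ANOVA F-score.
X_filtered = SelectKBest(score_func=f_classif, k=8).fit_transform(X_scaled, y)

# Recursive feature elimination: repeatedly fit the estimator and drop
# the weakest feature until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X_scaled, y)

print(X_filtered.shape, X_rfe.shape)
```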
Deep learning
Build and use deep neural networks.
Tool | Description | Reference |
---|---|---|
Batch Models | Build deep learning batch training models with an online data generator for genomic/protein sequences and images | Chollet et al. 2015 |
Model Builder | Create a deep learning model with an optimizer, loss function and fit parameters | Chollet et al. 2015 |
Model Config | Create a deep learning model architecture using Keras | Chollet et al. 2015 |
Train and evaluation | Deep learning training and evaluation either implicitly or explicitly | Chollet et al. 2015 |
Visualization
Plotting and visualization.
Tool | Description | Reference |
---|---|---|
Regression performance plots | Plot actual vs predicted curves and residual plots of tabular data | |
ML performance plots | Plot confusion matrix, precision, recall, and ROC/AUC curves of tabular data | |
Visualization | Machine Learning Visualization Extension includes several types of plotting for machine learning | Chollet et al. 2015 |
Utilities
General data and table manipulation tools.
Tool | Description | Reference |
---|---|---|
Table compute | The power of the pandas data library for manipulating and computing expressions upon tabular data and matrices. | |
Datamash operations | Datamash operations on tabular data | |
Datamash transpose | Transpose rows/columns in a tabular file | |
Sample Generator | Generate random samples with controlled size and complexity | Pedregosa et al. 2011 |
Train/Test splitting | Split a dataset into training and test subsets | Pedregosa et al. 2011 |
Interactive Environments
You have done the heavy lifting and now want to use your coding skills inside Jupyter or RStudio? Work on data with the following:
Tool | Description | Reference |
---|---|---|
Jupyter | JupyterLab | |
RStudio | RStudio | |