Dr. Lukas Pfannschmidt | Lukas Pfannschmidt

Supercharge Your Developer Productivity

Sat, 04 Nov 2023 17:05:04 +0100

Boost Your Macbook’s Productivity with These Power Tools

Maximize your efficiency as a developer on macOS with a few savvy tools and shortcuts designed to speed up your workflow. Here’s a rundown of the tools I’ve integrated into my routine to navigate and manage projects more efficiently.

Oh My Zsh: The Powerhouse Shell

Oh My Zsh is a framework of plugins and themes for Zsh. It comes packed with handy features that enhance your terminal experience. It also supports custom prompts such as the Spaceship prompt, which shows a wealth of information at a glance, including the current directory, Git status, and Python virtual environment.

Venv Display plugin

For developers who work with Python, keeping track of virtual environments is crucial. Spaceship allows you to display your current environment directly in the prompt, ensuring you’re always aware of the context you’re working in.

K8s Display

If you’re juggling multiple Kubernetes contexts, Spaceship can display the current context and namespace, saving you from the confusion of command-line checks.

Git Branch Display

With the branch name always on display, you can keep tabs on your current work branch without running git status over and over.

Integrating these features into your terminal can greatly improve your navigation and productivity within complex development workflows.

Jump to any directory based on substring

zoxide has replaced the cd command for me. It allows you to jump to any directory based on a substring of its name. It learns your habits and uses a ranking algorithm to prioritize the directory you most likely want to jump to.

z foo

will jump to the highest-ranked directory containing foo in its name, with candidates sorted by frequency of use. I almost never have to run z a second time, because it nearly always guesses the directory I want.
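To give a feel for how such a ranking can work, here is a toy Python sketch that combines frequency and recency into a single score. It is an illustration under assumptions, not zoxide's actual implementation: the weights, the `Entry` type, and the `best_match` helper are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY


@dataclass
class Entry:
    path: str
    visits: int         # how often the directory was visited
    last_access: float  # Unix timestamp of the last visit


def score(entry: Entry, now: float) -> float:
    """Weight the visit count by how recently the directory was used."""
    age = now - entry.last_access
    if age < HOUR:
        return entry.visits * 4.0
    if age < DAY:
        return entry.visits * 2.0
    if age < WEEK:
        return entry.visits * 0.5
    return entry.visits * 0.25


def best_match(entries: List[Entry], query: str, now: float) -> Optional[str]:
    """Return the highest-scoring path that contains the query substring."""
    candidates = [e for e in entries if query in e.path]
    if not candidates:
        return None
    return max(candidates, key=lambda e: score(e, now)).path


now = 1_700_000_000.0
entries = [
    Entry("/home/me/projects/foo-api", visits=10, last_access=now - 2 * WEEK),
    Entry("/home/me/projects/foo-web", visits=4, last_access=now - 10 * 60),
    Entry("/home/me/notes", visits=50, last_access=now - 60),
]
print(best_match(entries, "foo", now))
```

In this toy data set the recently used foo-web directory wins over the more frequently visited but stale foo-api directory, which is the kind of behavior that makes a second attempt rarely necessary.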

Jump to any command based on substring

Similarly, I use fzf for my shell command history. With Ctrl-R I can search through the history and jump to any command based on a substring of it.

(Screenshot: searching for docker in my command history with fzf.)

Raycast: Your Search, Supercharged

Raycast replaces the need for multiple apps and tools by consolidating all your search and command needs into one sleek, unified application. It can be accessed with a simple keyboard shortcut, allowing you to search for files, open applications, and execute commands without ever leaving your keyboard.

GitHub Repositories

With Raycast’s powerful plugin capabilities you can add the GitHub extension. It lets you swiftly navigate through all the repositories in your organization, a real time-saver for developers handling multiple projects.

VS Code Project Switching

Forget about sifting through your project directories. Raycast lets you switch between your Visual Studio Code projects without breaking your flow.

Instant Zoom Access

Raycast also provides a direct line to your scheduled Zoom meetings, allowing you to join with just one command—no more digging through emails or calendars.

Wrapping Up

Incorporating Oh My Zsh, zoxide, fzf, and Raycast into your daily routine is like adding superpowers to your development workflow. These tools help minimize friction and maximize productivity, letting you focus on what you do best: creating incredible software. Try them out and see the difference for yourself.

This curated selection of tools is by no means exhaustive, but it represents a personal toolbox that has significantly improved my efficiency on macOS. Hopefully, they’ll do the same for you.

Knative Serving: Streamlining Microservice Deployment on Kubernetes

Sat, 04 Nov 2023 11:09:29 +0100

Kubernetes has revolutionized the way organizations deploy and manage applications at scale. However, its complexity can be daunting for developers who may not be familiar with container orchestration concepts. Enter Knative Serving, a Kubernetes-based platform that simplifies the deployment and scaling of serverless applications.

Knative Serving: Making Kubernetes Accessible to All Developers

Knative Serving builds on Kubernetes to support deploying and serving of applications and functions as serverless, autoscaling services. It abstracts away much of the Kubernetes-specific workflow, allowing developers to focus on writing code. This simplifies the deployment model, where a single configuration file can replace a myriad of Kubernetes objects and commands.

Scaling Microservices with Knative Serving

One of the standout features of Knative Serving is its ability to automatically manage the scale of your applications, including scaling down to zero when services are not in use. This feature is particularly useful for cost-saving and efficient resource utilization.

Knative Serving supports various scaling metrics and parameters, allowing for fine-tuned control over how your applications respond to traffic demands. Developers can specify the number of concurrent requests per pod and control the ramp-up and cool-down behavior of their services.

Example: Knative Serving Configuration

Here’s a look at a Knative Service configuration that showcases the simplicity of getting a service up and running:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/my-app:latest
          ports:
            - containerPort: 8080

Note that the configuration file is significantly shorter than the equivalent plain Kubernetes manifests, which would require additional objects such as a deployment, service, and ingress. Knative has sensible defaults for many of its parameters, allowing developers to get started quickly. Knative will take care of the rest, including creating the necessary Kubernetes objects and managing the scaling of your service.

Health check using the container port.
Deploying the service will create a new revision of the application.
The revision will be scaled to zero if there are no requests for a specified period of time.
A new revision will be created when the service is updated, allowing for seamless rollouts and rollbacks.
Traffic splitting can be configured to allow for canary rollouts and A/B testing.

However, Knative Serving also provides the flexibility to customize many parameters to suit your needs.

Autoscaling

For example, the autoscaling configuration can be modified to specify the minimum and maximum number of pods, the maximum number of concurrent requests per pod, and the target CPU utilization percentage. The default autoscaler in vanilla Kubernetes is the Horizontal Pod Autoscaler (HPA), which scales based on CPU utilization. Knative Serving uses a custom autoscaler that supports scaling based on concurrency, which is more suitable for serverless applications.

The default behavior in Knative Serving is identical to setting these annotations on the service:

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target-utilization-percentage: "70"
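As a rough mental model of what concurrency-based scaling means, the following Python sketch computes how many pods an autoscaler of this kind would ask for. It is a simplification, not Knative's actual algorithm, which also averages measurements over time windows. The function name, the soft target of 100 concurrent requests per pod, and the scale bounds are assumptions for illustration.

```python
import math


def desired_pods(concurrent_requests: float,
                 target_per_pod: int = 100,
                 target_utilization_percentage: int = 70,
                 min_scale: int = 0,
                 max_scale: int = 10) -> int:
    """Pods needed so that each pod stays at or below its utilized target."""
    if concurrent_requests <= 0:
        return min_scale  # no traffic: scale to zero unless min_scale > 0
    # Each pod should only be loaded up to a fraction of its target.
    effective_target = target_per_pod * target_utilization_percentage / 100
    wanted = math.ceil(concurrent_requests / effective_target)
    # Clamp the result to the configured bounds.
    return max(min_scale, min(max_scale, wanted))


for load in (0, 50, 351, 10_000):
    print(load, "concurrent requests ->", desired_pods(load), "pods")
```

With a utilization of 70%, a new pod is requested before the existing ones are completely saturated, which leaves headroom while the new pod starts.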

To switch back to plain CPU-based autoscaling, you can use the following annotations:

        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "cpu"
        autoscaling.knative.dev/target: "100"

which would scale up another pod if the CPU utilization of the current pod is at 100%.

More information on Knative Serving configuration can be found in the official documentation.

Effortless Deployment Pipelines with ArgoCD

ArgoCD can integrate with Knative Serving to create a seamless deployment pipeline. This GitOps tool allows developers to simply merge changes into specific branches, such as the main branch for integration or deployment branches for staging and production environments, to initiate automated deployment processes.

A Continuous Integration (CI) process such as GitHub Actions can be triggered by a merge into the main branch; it builds the container image and tags it with a version. A subsequent merge into a deployment branch can prompt ArgoCD to deploy the tagged image to the respective environment.

Branching Strategy

To visualize the workflow, imagine a branching strategy resembling the following:

[main] ---- [development] ---- [feature branches]
\ /
\-- [staging] -- [QA] -- [production]

The only interface for developers is the GitHub UI; no special tools or knowledge of Kubernetes are required. This allows for a clear separation of concerns, where developers can focus on writing code and leave the deployment and scaling to Knative Serving and ArgoCD.

Knative Serving vs. AWS Lambda

Knative Serving offers a similar proposition to AWS Lambda in that it removes the need for developers to manage the underlying infrastructure. However, unlike the closed AWS Lambda environment, Knative runs on the open-source Kubernetes system, so it can be used across multiple cloud providers or on-premises. It also hooks into the Kubernetes ecosystem, allowing for seamless integration with other tools and services.

In Conclusion

Knative Serving stands as a robust solution for teams seeking the benefits of serverless architectures without the intricate knowledge of Kubernetes. It simplifies application deployment, automates scaling to match demand, and integrates easily with modern development workflows. By providing developers with tools that are easy to use and manage, Knative ensures that the focus remains on creating value through application functionality, not infrastructure complexity.

For organizations already invested in Kubernetes, Knative Serving offers a way to streamline and enhance their deployment strategies without the need for extensive Kubernetes expertise, thus further democratizing the power of container orchestration.

PS: Knative Eventing

Knative not only offers the Serving component but also an event mesh and primitives to control delivery of async events. This allows for a more complete serverless experience, where events can trigger serverless functions and services. This is a topic for another post, but I wanted to mention it here as it is a powerful feature of Knative.

Efficient Machine Learning Model Deployment: Integrating Seldon into MLOps Workflows

Sat, 14 Oct 2023 11:09:29 +0100

Enhancing MLOps with Seldon: Advantages and Practical Deployment with Scikit-Learn

Deploying machine learning models can often be a complex process that extends beyond the model’s development. The ease with which these models transition into production can significantly impact their usefulness and applicability. In this context, Seldon Core offers a suite of features that cater to various aspects of MLOps with a particular emphasis on ease of monitoring, scaling, and deployment. In this article, I’ll outline some of the features I appreciate about Seldon and walk through the deployment of a Scikit-Learn classifier using Seldon’s tools.

Advantages of Using Seldon in MLOps

Easy Monitoring with Prometheus: One of the more tedious aspects of machine learning operations is setting up monitoring for deployed models. Seldon simplifies this by providing out-of-the-box integration with Prometheus, a powerful monitoring system that automatically collects and stores metrics in a time-series database. This integration allows for real-time monitoring of a wide array of model performance metrics, without the need for complex setup procedures.

Automatic Scaling with KEDA: Maintaining the balance between resource allocation and cost-efficiency is key in production environments. Seldon integrates with Kubernetes Event-driven Autoscaling (KEDA) to facilitate automatic scaling of machine learning models. KEDA allows Seldon deployments to scale based on metrics from external sources like Kafka queues, providing a responsive and resource-efficient solution for handling variable workloads. This is especially useful for scaling to zero, which allows for significant cost savings when the model is not in use.

Seamless Deployments: The need for smooth rollouts and updates to machine learning models cannot be overstated. Seldon supports seamless deployments, allowing for blue-green testing, canary rollouts, and phased introductions of new model versions. This results in reduced downtime and improved user experience, as new features or models can be tested and rolled out with minimal disruption to the production service.

Practical Deployment: A Scikit-Learn Classifier Example

To demonstrate the advantages mentioned above, let’s consider the deployment of a simple logistic regression classifier using Scikit-Learn, wrapped with Seldon’s Sklearn server.

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
import joblib
# Load Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X, y)
# Serialize the model to a file
joblib.dump(model, 'model.joblib')

After training and saving the model, we create a SeldonDeployment custom resource (an instance of Seldon’s custom resource definition, CRD) that outlines the deployment specifics:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: iris-classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/iris-model
        envSecretRefName: seldon-init-container-secret

Using kubectl, we apply the manifest to our Kubernetes cluster, which triggers the deployment process orchestrated by Seldon Core. In the end, it will create a new plain Kubernetes deployment with a pod running our model. The pod will be exposed via a Kubernetes service, which can be used to send requests to the model. Depending on the deployment setup, it can also be exposed via an Istio gateway, which allows for more advanced traffic management and monitoring.
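Sending a prediction request could then look roughly like the following Python sketch. It assumes the deployment is reachable through an ingress and speaks Seldon's REST protocol (the `/api/v1.0/predictions` path with an `ndarray` payload). The host `localhost:8003` and the namespace `default` are placeholders, not values from this setup, and the helper names are made up for the example.

```python
import json
import urllib.request


def build_prediction_request(host, namespace, deployment, rows):
    """Build the URL and JSON body for a Seldon REST prediction call."""
    url = f"http://{host}/seldon/{namespace}/{deployment}/api/v1.0/predictions"
    body = json.dumps({"data": {"ndarray": rows}}).encode("utf-8")
    return url, body


def predict(host, namespace, deployment, rows):
    """Send the request and return the decoded JSON response."""
    url, body = build_prediction_request(host, namespace, deployment, rows)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


# One Iris sample: sepal length, sepal width, petal length, petal width.
sample = [[5.1, 3.5, 1.4, 0.2]]
url, body = build_prediction_request("localhost:8003", "default", "iris-model", sample)
print(url)
print(body.decode("utf-8"))

# With a running cluster (placeholder host), the call would be:
# print(predict("localhost:8003", "default", "iris-model", sample))
```

The actual network call is left commented out so the snippet runs without a cluster; only the request construction is executed.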

If we make updates, we should create a new version of the model and update the SeldonDeployment resource to point to the new model version. This triggers a rolling update of the deployment, which ensures that the model is updated without any downtime.

Advanced Deployment Options

For more advanced use cases, Seldon also provides support for other machine learning frameworks like TensorFlow, PyTorch, and XGBoost, as well as integration with other tools like Kubeflow and Kubeflow Pipelines. It is also possible to wrap your own custom model code with Seldon’s Python server. This gives great flexibility in terms of deployment options, covering a wide range of use cases without the need for extensive deployment code.

Monitoring and Predicting with Your Deployed Model

With our model deployed, we can utilize Prometheus to monitor it closely. This setup allows us to keep track of our model’s performance and health with ease. Prometheus can be queried to fetch relevant metrics, which aids in maintaining the robustness of the deployed model. More information can be found in the Seldon documentation.
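Querying Prometheus from a script could be sketched as follows, using the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The base URL `http://localhost:9090`, the metric name, and the `deployment_name` label are assumptions for illustration; check which metrics and labels your Seldon version actually exports.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url, promql):
    """Run the query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return json.loads(resp.read())["data"]["result"]


# Assumed metric name and label: request rate over the last five minutes.
QUERY = (
    "rate(seldon_api_executor_server_requests_seconds_count"
    '{deployment_name="iris-model"}[5m])'
)
query_url = build_query_url("http://localhost:9090", QUERY)
print(query_url)

# With a reachable Prometheus server (placeholder URL), the call would be:
# for series in run_query("http://localhost:9090", QUERY):
#     print(series["metric"], series["value"])
```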

Conclusion

By providing built-in support for monitoring, scaling, and deployment, Seldon Core addresses three critical aspects of MLOps, making the journey from model development to production a lot smoother. This simplifies many of the operational complexities and allows data scientists and ML engineers to focus more on model improvement and less on the intricacies of production environments. As we’ve seen, leveraging Seldon Core with a Scikit-Learn model can be a very straightforward process, illustrating how practical and beneficial Seldon can be in real-world applications.

Reproducible Experiments in Machine Learning

Mon, 04 May 2020 17:10:53 +0200

Replication Crisis

Today, not only the economy but also science is moving at a breakneck pace. Accelerated further by the current pandemic, the iteration time of new scientific research is short, and little time is left for peer review.

Good practice in science (and life in general) is the replication of results, whether to check for correctness or to facilitate understanding. This is crucial in peer review, given the ever-increasing number of scientific papers of questionable quality. One big problem is therefore the lack of replicable results, also known as the replication crisis. The term covers many facets of this problem in different scientific disciplines. In machine learning specifically, many results and comparisons are questionable. A recent study tried to replicate results in scientific papers and succeeded on average only 40% of the time.

Current problems

There are many frustrating aspects I encountered when reading papers in machine learning:

1. No public data used

While I understand the reasons for it, a study should not rely solely on private data to highlight its merits.

2. No source code

While many scientists are not software engineers and are shy about sharing their paper-deadline scripts, journals should at least require authors to share the code. It is not reasonable to expect peers to replicate results by implementing the algorithms themselves.

3. High install/usage barriers

Even if the source code is available, the necessary dependency declarations can be missing, which means the dependencies have to be installed manually (if that is possible at all). This problem gets worse with the age of the publication, as newer versions of programming languages or libraries are not always backwards compatible.

4. No replicable experimental setup

Even if 2. and 3. are fulfilled, sometimes the experimental scripts are missing. While most parameters should be part of the scientific manuscript itself, authors sometimes forget to mention crucial preprocessing steps or parameters. If the complete experimental script that produced the results in the paper were available, this problem could not occur.

(5. Lack of necessary resources to replicate models)

Even if all of the above is provided, another problem, specifically in deep learning, is the amount of resources needed to train a model. Many big players in this area (Google et al.) have nearly unlimited GPU resources available, which is unattainable for many research institutions. This also often raises the question of whether stated improvements come from architectural changes or simply from more training time.

Towards reproducible science and experiments

While aspects 1. and 2. are getting better in my opinion, a solution to 5. is still not clear to me. On the other hand, 3. and 4. can be improved but are still lacking in academia, as they require skills in software engineering and development that are most often found in industry.

In the following I will describe how I made the experiments in my newest scientific preprint reproducible.

The source code of the algorithm is available on GitHub.

In my case I am using Python for the implementation. There exist (too) many approaches for declaring dependencies in Python. A newer alternative, which tries to combine the best ideas of its predecessors, is the tool Poetry. It allows the declaration of general requirements as well as specific versions.

Excerpt of general dependency declaration for Poetry.

In addition, Poetry supports the automatic creation of virtual environments that encapsulate these specific dependencies, even if the user’s global Python environment differs widely from the original creator’s. These environments are defined by hashes, which Poetry derives automatically and stores in the poetry.lock file.

To get another layer of encapsulation, we also use containers, which were made popular by Docker. While Poetry encapsulates Python environments, containers can encapsulate the complete operating system. This makes it possible to run experiments even many years in the future with the same global software stack.

We provide a GitHub repository for all experiments with a Dockerfile included, which is basically a recipe list for all software needed (including the OS).

Excerpt of Dockerfile responsible for installing Poetry and its dependencies.

In short the Dockerfile instructs the container builder to use a specific OS and Poetry to install all Python dependencies and create an image. One can also execute these instructions beforehand to create a container image and upload it to a public repository, which makes building unnecessary.

Now we can provide the potential reviewer or user with the following instructions which automatically perform replication:

Instructions for replication

To replicate the experimental results of the paper (figure and tables)

1. Get container image

Build the image yourself with

docker build -t squamish_experiments .

or pull it from DockerHub

docker pull mirek1337/squamish_experiments

2. Run container

docker run -v ./tmp:/exp/tmp:Z -v ./output:/exp/output:Z -it squamish_experiments make

which calls make inside the container to execute all experiments in the Makefile. After the experiments are done (can take several hours) the output should end up in the ./output folder.

It’s also possible to change the following parameters as environment variables in the docker command via the -e option:

Defaults used in paper

SEED = 123
REPEATS = 10
N_THREADS = 1
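An experiment script could pick up these variables along the following lines. This is a hypothetical sketch, not the code from the experiment repository; only the variable names and their defaults are taken from the list above, while `read_settings` is a made-up helper.

```python
import os


def read_settings(env=os.environ):
    """Read experiment parameters, falling back to the defaults used in the paper."""
    return {
        "seed": int(env.get("SEED", "123")),
        "repeats": int(env.get("REPEATS", "10")),
        "n_threads": int(env.get("N_THREADS", "1")),
    }


# No variables set: the paper's defaults apply.
print(read_settings({}))

# One override, as it would arrive when passing "-e REPEATS=3" to docker run.
print(read_settings({"REPEATS": "3"}))
```

Keeping the defaults in one place means that running the container without any -e option reproduces the setup of the paper.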

Conclusion

While working with containers is still new for many scientists, the advantages are substantial. One cannot expect everybody to know the tools in detail, and more work on usable abstractions is needed. For now, there are project templates and libraries available which make this work easier.

Decentralized Website

Tue, 25 Feb 2020 18:11:19 +0100

The website you are reading can be completely used without a running backend on a server. Such a website is known as static.

Static websites deliver all the content and logic (JavaScript) to the browser. All interaction, such as search or clicking on internal links, happens through the included JS scripts. While this sounds like what a layperson would expect, it is far from the current state of the internet.

In the early days, many websites only consisted of static HTML sites. Today, many modern websites rely on a running centralized backend server. This enables dynamic experiences but also leads to link rot, where specific websites (and their URLs) have a limited lifespan. Many people experienced the sight of dead links at least once and this problem is expected to grow with an ageing internet.

Content-Addressable Storage

A recent push to decentralize the internet again has led to technologies such as content-addressable storage.

Normal URLs on the internet such as https://lpfann.me/ are arbitrarily chosen words and have no relation to the actual content.

Content addressing uses a cryptographic hash function to condense the contents of a website into a short string called a hash. The great thing about such hash functions is that two different inputs are extremely unlikely to produce the same output, so the hash serves as a practically unique address.
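The idea can be sketched in a few lines of Python. This is only an illustration of content addressing with a plain SHA-256 hex digest. IPFS uses its own identifier format, so the addresses printed here are not valid IPFS hashes, and the `content_address` helper is made up for the example.

```python
import hashlib


def content_address(content: bytes) -> str:
    """Derive an address from the content itself (hex SHA-256 digest)."""
    return hashlib.sha256(content).hexdigest()


page_v1 = b"<html><body>Hello, decentralized web!</body></html>"
page_v2 = b"<html><body>Hello, decentralized web?</body></html>"

addr_v1 = content_address(page_v1)
addr_v2 = content_address(page_v2)

print(addr_v1)
print(addr_v2)
print(addr_v1 == content_address(page_v1))  # same content, same address
print(addr_v1 != addr_v2)                   # a one-character change gives a new address
```

Because the address depends only on the bytes, anybody holding a copy of the content can serve it, and the receiver can verify it by hashing what they got.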

This allows use cases where people can serve and exchange content just based on a content hash. An example for an application of this is IPFS (Interplanetary Filesystem).

IPFS introduced an address scheme for content and also the exchange of information using peer-to-peer networking without a central server. People using the IPFS application automatically act as servers for other peers when they have information another node needs.

This enables a more robust and decentralized web without the need of a big central server or a content distribution network.

To host a website using IPFS we need it to be static.

Making a website static

This website is built using Hugo, which already produces static output. The only important setting is relativeURLs, which must be enabled for the site to work with IPFS addressing.

We are also using the Academic theme for Hugo. Academic uses several external font and JavaScript resources to enhance the content presentation. While hosting an IPFS website with references to non-IPFS resources is perfectly possible, it is not completely decentralized.

Luckily the Academic theme also provides a downloader tool, which saves all external assets inside the website folder.

At the time of writing, the main downloader does not support all assets yet, but an open pull request adds support for most of the missing ones. The fonts were also missing; they originally came from Google’s font CDN, so we downloaded them manually.

Now we have a complete website running on local fonts and JavaScript assets¹. As such you could download the website files and kill the internet connection and you would have the same experience.

Hosting an IPFS website

If we use IPFS to hash our website, we get a content hash like this:

/ipfs/QmSPZuY3K1XieH7M9zh4qs9MEGFf4GZdBv3STaiJpBaC6o

Now somebody else could retrieve the website directly with their own IPFS client or with one of the available browser plugins.

Draft of this blog post hashed and pinned to local IPFS node.

For somebody else to retrieve the files of the website, we would have to keep an IPFS node running or ask somebody else to keep them cached (pinned).

There are so-called pinning services (e.g. Pinata) which do exactly that. Another project is Filecoin, which is built on top of IPFS. It uses a type of blockchain to provide a monetary incentive for nodes to keep IPFS files pinned.

#Dynamic folders for pinning and managing data on @ipfsbot: Introducing Textile Buckets. A tool to host your #staticwebsite, app assets, #opensource code and more. #commandline tool, #CI integration, and #web3 gateway. Check'em out: https://t.co/K6RY5e1t2h pic.twitter.com/JyRgvknMAt
— Recall Labs | re/acc (@RecallLabs_) February 24, 2020

In the last few days we looked for ways to automatically pin this website when new content is added to the git repository. Just yesterday Textile announced dynamic buckets working on top of IPFS. While not the main focus of their blogpost, they also presented new GitHub Actions which automatically deploy content to their free bucket hosting. We extended their scripts on the demo site based on Gatsby to work with Hugo.

GitHub Action building and pushing files to Textile bucket

Now after every push and pull request, the GitHub Action compiles Hugo output and pushes it to a Textile bucket which is also pinned and works with IPFS.

Our website content is automatically available under a content hash after every change and push to the repository.

DNS

To let people know that a site is available with IPFS one can use DNSLinks. These are TXT records attached to DNS domains which hint at the IPFS resource available. IPFS browser extensions can detect these records and automatically use IPFS for content retrieval when coming to such a site.
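For illustration, a DNSLink entry is a TXT record on the `_dnslink` subdomain whose value starts with `dnslink=` followed by the content path. The Python sketch below builds and parses such a value; the domain `example.com` and the helper names are placeholders, and the hash is the example hash from earlier in this post.

```python
def build_dnslink(domain: str, ipfs_path: str):
    """Return the (record name, TXT value) pair for a DNSLink entry."""
    return f"_dnslink.{domain}", f"dnslink={ipfs_path}"


def parse_dnslink(txt_value: str):
    """Extract the content path from a TXT value, or None if it is not a DNSLink."""
    prefix = "dnslink="
    if not txt_value.startswith(prefix):
        return None
    return txt_value[len(prefix):]


name, value = build_dnslink(
    "example.com", "/ipfs/QmSPZuY3K1XieH7M9zh4qs9MEGFf4GZdBv3STaiJpBaC6o"
)
print(name, "TXT", value)
print(parse_dnslink(value))
print(parse_dnslink("v=spf1 -all"))  # unrelated TXT record
```

Since the hash changes with every change to the site, this record has to be rewritten after each deployment, which is why automating the update is useful.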

The scripts from Textile also include an updater for DNS records, which posts the IPFS hash to the Cloudflare DNS service. This script updates the DNSLink after every manual release.

Ethereum Name Service (ENS)

For a completely decentralized solution, one can use technologies like ENS, which is an alternative to DNS.

Our website is also available under the ENS domain https://pfannschmidt.eth or via the transition link https://pfannschmidt.eth.link/ which uses the eth.link service to allow browsers without ENS support to visit the site.

For now, we update the IPFS hash stored in ENS manually, but we could automate this in the future.

Backwards Compatibility

IPFS is still in its early stages. Most popular browsers do not yet support the protocol, and such support is necessary to reach the majority of web users.

Until that changes, one additionally needs to host websites the traditional way using web servers and DNS. One can use Cloudflare’s IPFS gateway and DNS solution to automatically serve IPFS content over normal HTTP.

For now, this blog is hosted by Netlify for visitors without IPFS support.

Summary

Overall, this process is still complicated and hard. While IPFS and its ecosystem are steadily improving, there is still a lot to do.

Luckily, new services such as fleek (formerly Terminal.co) are emerging that provide end-to-end decentralized hosting solutions.

Update

We have now tried out fleek, which makes it a lot easier to deploy a static site to IPFS. They automatically build your site from your GitHub repository, pin it on IPFS, and also handle your DNSLinks so that people know you also provide a content hash.

¹ We have one² external reference left, which provides our visitor-counting script. Missing it would not influence the usability negatively for the visitors. (You could argue it would improve the experience 😉)
² After publishing this article, we added a new commenting system. While it is self-hosted, it is not decentralized. Apparently, that is still a non-trivial thing to do.

FRI Quickstart Guide

Thu, 02 May 2019 13:51:40 +0200

Quick start guide

In this guide I am going to describe how to use the FRI Python library to analyse arbitrary datasets.

(This guide is a copy of the official documentation found here)

Installation

Stable

FRI can be installed via the Python Package Index (PyPI).

If you have pip installed just execute the command

pip install fri

to get the newest stable version.

The dependencies should be installed and checked automatically. If you have problems installing, please open an issue on our tracker.

Development

To install a bleeding edge dev version of FRI you can clone the GitHub repository using

git clone git@github.com:lpfann/fri.git

and then check out the dev branch: git checkout dev.

To check if everything works as intended, you can use pytest to run the unit tests. Just run the command

pytest

in the main project folder.

# For the purpose of viewing this notebook online we install the library directly with pip
!pip install fri

Requirement already satisfied: fri in /home/lpfannschmidt/workbench/fri (3.4.0+2.g1eb5429.dirty)
Requirement already satisfied: numpy in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from fri) (1.15.1)
Requirement already satisfied: scipy>=0.19 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from fri) (1.1.0)
Requirement already satisfied: scikit-learn>=0.18 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from fri) (0.19.2)
Requirement already satisfied: cvxpy==1.0.8 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from fri) (1.0.8)
Requirement already satisfied: ecos==2.0.5 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from fri) (2.0.5)
Requirement already satisfied: matplotlib in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from fri) (2.2.3)
Requirement already satisfied: scs>=1.1.3 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from cvxpy==1.0.8->fri) (2.0.2)
Requirement already satisfied: toolz in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from cvxpy==1.0.8->fri) (0.9.0)
Requirement already satisfied: multiprocess in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from cvxpy==1.0.8->fri) (0.70.6.1)
Requirement already satisfied: osqp in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from cvxpy==1.0.8->fri) (0.4.1)
Requirement already satisfied: fastcache in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from cvxpy==1.0.8->fri) (1.0.2)
Requirement already satisfied: six in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from cvxpy==1.0.8->fri) (1.11.0)
Requirement already satisfied: cycler>=0.10 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from matplotlib->fri) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from matplotlib->fri) (2.2.0)
Requirement already satisfied: python-dateutil>=2.1 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from matplotlib->fri) (2.7.3)
Requirement already satisfied: pytz in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from matplotlib->fri) (2018.5)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from matplotlib->fri) (1.0.1)
Requirement already satisfied: dill>=0.2.8.1 in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from multiprocess->cvxpy==1.0.8->fri) (0.2.8.2)
Requirement already satisfied: future in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from osqp->cvxpy==1.0.8->fri) (0.16.0)
Requirement already satisfied: setuptools in /home/lpfannschmidt/anaconda3/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib->fri) (40.2.0)

Using FRI

Now we showcase the workflow of using FRI on a simple classification problem.

Data

To have something to work with, we need some data first. fri includes a generation method for binary classification and regression data.

In our case we need some classification data.

from fri import genClassificationData

We want to create a small set with a few features.

Because we want to showcase the all-relevant feature selection, we generate multiple strongly and weakly relevant features.

n = 100
features = 6
strongly_relevant = 2
weakly_relevant = 2

X, y = genClassificationData(n_samples=n,
                             n_features=features,
                             n_strel=strongly_relevant,
                             n_redundant=weakly_relevant,
                             random_state=123)

Generating dataset with d=6,n=100,strongly=2,weakly=2, partition of weakly=None

The method also prints out the parameters again.

X.shape

(100, 6)

We created a binary classification set with 6 features of which 2 are strongly relevant and 2 weakly relevant.

Preprocess

Because our method expects mean-centered data, we need to standardize it first. This centers each feature around 0 and scales it to unit standard deviation.

from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
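As a rough sketch of what this scaling step computes, the following reproduces column-wise standardization with plain NumPy. The toy matrix is made up for illustration and is not part of the FRI example. It uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# Toy data, made up for illustration: 4 samples, 2 features on different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# Standardize: subtract the mean, then divide by the standard deviation.
X_toy_scaled = (X_toy - mean) / std

print(X_toy_scaled.mean(axis=0))  # close to 0 for every column
print(X_toy_scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns are identical even though they started on very different scales, which is the point of the preprocessing step.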

Model

Now we need to create a model. We use the FRIClassification class.

For regression one would use FRIRegression.

from fri import FRIClassification
fri_model = FRIClassification()

fri_model

FRIClassification(C=None, debug=False, n_resampling=3,
optimum_deviation=0.001, parallel=False, random_state=None)

We used no parameters for creation so the defaults are active.

C=None means that FRI itself chooses the regularization parameter C using cross-validation on a fixed grid.

By default, parallel computation is also disabled but can be enabled using parallel=True.

Fitting to data

Now we can just fit the model to the data using scikit-learn like commands.

fri_model.fit(X_scaled,y)

The resulting feature relevance bounds are saved in the interval_ variable.

fri_model.interval_

array([[0.45993233, 0.46169499],
[0.26954548, 0.27159876],
[0. , 0.25802293],
[0. , 0.25802293],
[0.00516909, 0.00711219],
[0.00446591, 0.00694219]])

fri_model.interval_.shape

(6, 2)

Each row holds the lower and upper relevance bound of one feature.

To access the relevance bounds of the feature at index 2 (the third feature) we would use

fri_model.interval_[2]

array([0. , 0.25802293])

The relevance classes are saved in the corresponding variable relevance_classes_:

fri_model.relevance_classes_

array([2, 2, 1, 1, 0, 0])

2 denotes strongly relevant features, 1 weakly relevant and 0 irrelevant.
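The class codes can be turned into feature index lists with a little NumPy. This is only a sketch that works on a copy of the array printed above. The names class_names, features_by_class and relevant_idx are introduced here for illustration and are not part of fri.

```python
import numpy as np

# Relevance classes copied from the output above.
relevance_classes = np.array([2, 2, 1, 1, 0, 0])

# Human-readable names for the three class codes described in the text.
class_names = {2: "strongly relevant", 1: "weakly relevant", 0: "irrelevant"}

# Zero-based feature indices per class.
features_by_class = {
    name: np.flatnonzero(relevance_classes == code).tolist()
    for code, name in class_names.items()
}

for name, idx in features_by_class.items():
    print(f"{name}: {idx}")

# All-relevant selection: keep every feature that is not irrelevant.
relevant_idx = np.flatnonzero(relevance_classes > 0)
```

An index array like relevant_idx could then be used to select columns, for example X_scaled[:, relevant_idx].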

Plot results

The bounds in numerical form are useful for postprocessing. If we want a human to look at them, we recommend the plot function plot_relevance_bars.

We can also color the bars according to relevance_classes_.

# Import plot function
from fri.plot import plot_relevance_bars
import matplotlib.pyplot as plt
%matplotlib inline
# Create new figure, where we can put an axis on
fig, ax = plt.subplots(1, 1,figsize=(6,3))
# plot the bars on the axis, colored according to fri
out = plot_relevance_bars(ax,fri_model.interval_,classes=fri_model.relevance_classes_)

In the plot we can see that the strongly relevant features 1 and 2 allow little change in their contribution. Features 3 and 4 are highly correlated and therefore show wide intervals. The noise features 5 and 6 show a small required contribution, which can be attributed to numerical instabilities of the solver.
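The same observation can be read off numerically by computing the width of each interval. The sketch below works on a copy of the interval_ values printed earlier, so it runs without fri; the variable names are chosen here for illustration.

```python
import numpy as np

# Relevance bounds copied from the interval_ output above.
intervals = np.array([[0.45993233, 0.46169499],
                      [0.26954548, 0.27159876],
                      [0.0,        0.25802293],
                      [0.0,        0.25802293],
                      [0.00516909, 0.00711219],
                      [0.00446591, 0.00694219]])

# Width of each interval: upper bound minus lower bound.
widths = intervals[:, 1] - intervals[:, 0]

for i, w in enumerate(widths, start=1):
    print(f"feature {i}: width = {w:.5f}")
```

Features 3 and 4 have a width of about 0.258, while all other features stay below 0.003.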

Print internal Parameters

If we want to take a look at the internal parameters, we can use the debug flag in the model creation.

fri_model = FRIClassification(debug=True)

fri_model.fit(X_scaled,y)

loss 0.517120931358002
L1 6.743126681372926
offset 0.32474176019022094
C 1
score 1.0
coef:
[[ 3.10516847]
[-1.82001413]
[ 0.86614471]
[-0.86614471]
[-0.03919911]
[-0.03971916]]

This prints the parameters of the baseline model: loss (sum of slack), L1 ($L_1$ norm of the weight vector) and offset (from the origin). coef shows the coefficients of the baseline model.

One can also see the best C according to gridsearch and the training score of the model in score.

These values can also be accessed by the object variables.

Print out hyperparameter found by GridSearchCV:

fri_model.tuned_C_

or the baseline parameters:

fri_model.optim_L1_

6.743126681372926
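As a small cross-check, the printed coefficients can be summed by hand. The sketch below uses only the numbers from the debug output. The plain $L_1$ norm of coef comes out slightly smaller than the printed L1 value, and the printed value matches the norm multiplied by (1 + optimum_deviation) with the default 0.001. That relation is an observation from these numbers, not something taken from the library documentation, so treat it as an assumption.

```python
import numpy as np

# Values copied from the debug output above.
coef = np.array([3.10516847, -1.82001413, 0.86614471,
                 -0.86614471, -0.03919911, -0.03971916])
printed_l1 = 6.743126681372926
optimum_deviation = 0.001  # default shown in the model repr

# L1 norm of the weight vector: sum of absolute coefficients.
l1_norm = np.abs(coef).sum()

# Observation (an assumption, not from the library docs): the printed value
# matches the norm relaxed by the optimum_deviation factor.
relaxed_l1 = l1_norm * (1 + optimum_deviation)

print(l1_norm)
print(relaxed_l1)
print(abs(relaxed_l1 - printed_l1))
```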

Setting constraints manually

Our model also allows computing relevance bounds when the user constrains the contribution of some features to a given range.

Presets

Presets are encoded as an array with the same shape as the interval_ variable. Each row holds the user-given minimum and maximum contribution of a feature. If both values are set to the same number, the feature is treated as fixed.

Additionally, entries with np.nan are interpreted as not-set or free.

import numpy as np
preset = np.full_like(fri_model.interval_,np.nan,dtype=np.double)

Now we have a preset array without any constraints:

preset

array([[nan, nan],
[nan, nan],
[nan, nan],
[nan, nan],
[nan, nan],
[nan, nan]])

Example

As an example, let us constrain feature 3 from our example to the minimum relevance bound.

Note the zero-based indexing in numpy: feature 3 is at index 2.

preset[2] = fri_model.interval_[2, 0]
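The assignment above relies on NumPy broadcasting: a scalar written to a row fills both the minimum and the maximum column, which is what marks the feature as fixed. The stand-alone sketch below shows this on a separate array called preset_demo (a name made up here so it does not clash with preset); the range values in it are also made up.

```python
import numpy as np

n_features = 6

# Start with every bound unset (NaN means "free").
preset_demo = np.full((n_features, 2), np.nan)

# Assigning a scalar to a row broadcasts it to both columns,
# so minimum and maximum are equal and the feature is fixed.
preset_demo[2] = 0.0

# A range instead of a fixed value (numbers made up for illustration).
preset_demo[0] = [0.4, 0.5]

# Rows that still contain only NaN are unconstrained.
free = np.isnan(preset_demo).all(axis=1)
print(np.flatnonzero(free))
```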

We use the function constrained_intervals_.

Note: we need to fit the model before we can use this function. We already did that, so we are fine.

constrained_interval = fri_model.constrained_intervals_(preset=preset)

constrained_interval

array([[0.45993233, 0.46169499],
[0.26954548, 0.27159876],
[0. , 0. ],
[0.25608488, 0.25802293],
[0.00516909, 0.00711219],
[0.00446591, 0.0069422 ]])

Feature 3 is set to its minimum (at 0).

How does it look visually?

fig, ax = plt.subplots(1, 1,figsize=(6,3))
out = plot_relevance_bars(ax, constrained_interval)

Feature 3 is reduced to its minimum (no contribution).

In turn, its correlated partner feature 4 had to take its maximum contribution.
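To see which features actually moved, the two printed arrays can be compared row by row. This is a minimal sketch on copies of the outputs above; the tolerance of 1e-6 is an arbitrary choice that absorbs differences in the last printed digit.

```python
import numpy as np

# Bounds before constraining, copied from the interval_ output.
original = np.array([[0.45993233, 0.46169499],
                     [0.26954548, 0.27159876],
                     [0.0,        0.25802293],
                     [0.0,        0.25802293],
                     [0.00516909, 0.00711219],
                     [0.00446591, 0.00694219]])

# Bounds after fixing feature 3, copied from the constrained output.
constrained = np.array([[0.45993233, 0.46169499],
                        [0.26954548, 0.27159876],
                        [0.0,        0.0],
                        [0.25608488, 0.25802293],
                        [0.00516909, 0.00711219],
                        [0.00446591, 0.0069422]])

# A row counts as changed if either bound differs by more than the tolerance.
same = np.isclose(original, constrained, rtol=0, atol=1e-6).all(axis=1)
changed_idx = np.flatnonzero(~same)

print(changed_idx + 1)  # one-based feature numbers
```

Only features 3 and 4 are reported: the constrained one and its correlated partner.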