JADSB: Just another data science blog

Programming in C# for People Who Like Vim

As someone working in the mobile gaming industry, I wanted to learn more about the nuts and bolts of game making. To get hands-on experience, I am taking the course C# Programming for Unity Game Development. The first step in learning any programming language is to set up the development environment. The course instructor suggested using either Visual Studio or MonoDevelop, but I am a fan of Vim. In this post, I will document my initial experience programming in C# from the terminal on macOS.

  1. Installing Mono

    In order to compile C# code, you will need Mono, “a free and open source .NET Framework-compatible software framework”. Using Homebrew, you can install Mono with the following command:

    brew install mono
    
  2. Hello World on C#

    Here is my first program in C#:

    using System;
       
    namespace HelloWorld 
    {
        class MainClass
        {
            public static void Main(string[] args)
            {
                Console.WriteLine("Hello, World!");
            }
        }
    }
    

    You can compile it with the csc command:

    csc HelloWorld.cs
    

    Now, you can run the compiled program with the mono command:

    mono HelloWorld.exe
    
  3. Compiling with a DLL

    If the project you are working on uses a dynamic-link library (DLL), you can compile the code with the -reference option:

     csc -reference:foo.DLL Program.cs
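
    For context, here is a minimal sketch of how such a DLL might be produced in the first place; the file and class names below are hypothetical:

     // Foo.cs: a hypothetical class library
     namespace Foo
     {
         public class Greeter
         {
             public static string Greet()
             {
                 return "Hello from Foo!";
             }
         }
     }

    Compiling it with the -target:library option produces Foo.dll, which can then be referenced as above:

     csc -target:library Foo.cs        # produces Foo.dll
     csc -reference:Foo.dll Program.cs # Program.cs can now call Foo.Greeter.Greet()
     mono Program.exe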
    

Final Thoughts

So far, writing C# code on macOS seems straightforward. The course involves Unity, though, so I might end up needing the functionality of Visual Studio later.

Guide to Using Proxmox, Containers, and GPU Pass Through for Machine Learning

Setting up a home lab with GPU support can be helpful for learning machine learning and other technologies. While we could use virtual machines on Proxmox, this guide focuses on containers with GPU pass through. Aside from a few updates, the steps below contain most of the details necessary to set up the home lab properly. I am going to detail my experience setting up a lab with an Nvidia GeForce GTX 1070 Ti and an Intel Core i7.

Installing Proxmox

To install Proxmox, you will need to create a USB key with the installation software. I downloaded Proxmox VE 6.3 and used a Mac to create the installation media.

  1. Download Proxmox

    You can download the latest version of Proxmox directly from the Proxmox downloads page.

  2. Creating the USB key

    I created the key on macOS Catalina. For other OSes, you can find the instructions on the Proxmox wiki.

     hdiutil convert -format UDRW -o proxmox-ve_*.dmg proxmox-ve_*.iso # convert the proxmox image into proper format
    

    After converting the image file, plug the USB key into the Mac and look for its disk.

     # look for the external disk
     diskutil list # list all the disks attached to the computer
    

    From the list of disks, there should be an external disk. We need to unmount it prior to writing the installation media.

     diskutil unmountDisk /dev/diskX # replace X with the number corresponding to the external disk
    

    Now, you can create the installation media using dd:

     sudo dd if=proxmox-ve_*.dmg of=/dev/rdiskX bs=1m
    

  3. Installing Proxmox

    Aside from making a couple of choices about the file system and the address of the home lab, the installation of Proxmox is straightforward.

  4. Logging into Proxmox’s Web Interface

    At this point, you can unplug the home lab box’s monitor and keyboard. The rest of the guide can be done via a terminal or Proxmox’s web interface, which you can reach at:

     https://address.you.chose:8006
    

    Port 8006 is the default port if you did not change it during the installation process. When you visit the web interface in a browser, the server will prompt for a user name and password. The default user name is root, and the password is the one you chose during the install. If the page fails to load, you may have forgotten the s in https. Because Proxmox uses a self-signed certificate by default, your browser may warn that the connection is insecure; you can safely ignore the warning.

Configuring Proxmox for GPU Pass Through

Making GPU pass through work on Proxmox and containers is essentially a two-step process:

1. configure the drivers on the server, and
2. configure the containers. 
  1. Configure the Nvidia Drivers on the Server

    We need command-line access to the server, either through the Proxmox web interface’s shell or by logging in directly via SSH. Since Proxmox is based on Debian, we follow the steps outlined in the Debian wiki to install the drivers on Proxmox. You can find references to the package repositories in the Proxmox wiki.

    We need to add the following lines to /etc/apt/sources.list:

     # security updates
     deb http://security.debian.org buster/updates main contrib
        
     # PVE pve-no-subscription repository provided by proxmox.com,
     # NOT recommended for production use
     deb http://download.proxmox.com/debian buster pve-no-subscription
        
     # buster-backports
     deb http://deb.debian.org/debian buster-backports main contrib non-free
    

    I used Proxmox 6.3 in writing this guide, and it is based on Debian 10 (buster). If you are using a different version of Proxmox, it might be based on a different Debian version; in that case, replace the word buster with the corresponding codename.

    Update all the packages to the latest versions and reboot:

     apt-get update
     apt-get dist-upgrade
     shutdown -r now
    

    We need to check the running kernel version and find the corresponding headers package:

     uname -r
     apt-cache search pve-header
    

    For me, the output was:

     root@machinename:~# uname -r
     5.4.78-2-pve
     root@machinename:~# apt-cache search pve-header
     pve-headers-5.0.12-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.15-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.18-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.21-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.21-2-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.21-3-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.21-4-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.21-5-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.8-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0.8-2-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.0 - Latest Proxmox VE Kernel Headers
     pve-headers-5.3.1-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.10-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.13-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.13-2-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.13-3-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.18-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.18-2-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.18-3-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3.7-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.3 - Latest Proxmox VE Kernel Headers
     pve-headers-5.4.22-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.24-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.27-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.30-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.34-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.41-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.44-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.44-2-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.55-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.60-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.65-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.73-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.78-1-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4.78-2-pve - The Proxmox PVE Kernel Headers
     pve-headers-5.4 - Latest Proxmox VE Kernel Headers
     pve-headers - Default Proxmox VE Kernel Headers
    

    We can install the proper version with

     apt-get install pve-headers-5.4.78-2-pve
    

    and install Nvidia drivers with

     apt-get install -t buster-backports nvidia-driver
    

    As I mentioned before, you may need to change the version numbers if you are installing on a different version of Proxmox.

    You can install some tools to go along with the driver:

     apt-get install i7z nvidia-smi htop iotop
    

    If you check /dev now, there should be some Nvidia-related files:

     root@machinename:~# ls -alh /dev/nvid*
     crw-rw-rw- 1 root root 195, 254 Dec 27 01:16 /dev/nvidia-modeset
     crw-rw-rw- 1 root root 235,   0 Dec 27 01:16 /dev/nvidia-uvm
     crw-rw-rw- 1 root root 235,   1 Dec 27 01:16 /dev/nvidia-uvm-tools
     crw-rw-rw- 1 root root 195,   0 Dec 27 01:16 /dev/nvidia0
     crw-rw-rw- 1 root root 195, 255 Dec 27 01:16 /dev/nvidiactl
    

    To ensure that these drivers are loaded at boot time, you need to edit /etc/modules-load.d/modules.conf with your favourite editor and add

     # /etc/modules: kernel modules to load at boot time.
     #
     # This file contains the names of kernel modules that should be loaded
     # at boot time, one per line. Lines beginning with "#" are ignored.
    
     nvidia
     nvidia_uvm
    

    Because the /dev/nvidia* and /dev/nvidia-uvm* device nodes are not created automatically until an X server or nvidia-smi is run, we need to add the following lines to /etc/udev/rules.d/70-nvidia.rules:

     # /etc/udev/rules.d/70-nvidia.rules
     # Create /dev/nvidia0, /dev/nvidia1 … and /dev/nvidiactl when the nvidia module is loaded
     KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
     # Create the CUDA node when the nvidia_uvm module is loaded
     KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
    

    Now, reboot the server with shutdown -r now and check that everything worked by running nvidia-smi in a new shell.

  2. Configure the Containers

    Find the container ID in Proxmox’s web interface and then edit the corresponding file under /etc/pve/lxc/ (in my case, /etc/pve/lxc/100.conf). I am using an Ubuntu container with ID 100, so here is my config file:

     #Ubuntu 20.04 with GPU passthrough
     arch: amd64
     cores: 10
     hostname: CT100
     memory: 16384
     net0: name=eth0,bridge=vmbr0,hwaddr=AA:98:43:03:D4:41,ip=dhcp,ip6=dhcp,type=veth
     ostype: ubuntu
     rootfs: local-zfs:basevol-100-disk-1,size=30G
     swap: 16384
     template: 1
     unprivileged: 1
    
     # GPU passthrough configs
     lxc.cgroup.devices.allow: c 195:* rwm
     lxc.cgroup.devices.allow: c 243:* rwm
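     # note: 195 is the standard major device number for /dev/nvidia*; the
     # nvidia-uvm major number (243 here) is assigned dynamically, so check
     # ls -l /dev/nvidia-uvm on the host and adjust the line above if yours differs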
     lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
     lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
     lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
     lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
     lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
    

    Reboot the container for the settings to take effect. After the reboot, check whether the configuration worked; the following commands should produce output similar to this:

     # nvidia-smi
     Tue Jan  5 02:57:07 2021
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                               |                      |               MIG M. |
     |===============================+======================+======================|
     |   0  GeForce GTX 107...  Off  | 00000000:01:00.0  On |                  N/A |
     |  0%   45C    P8     8W / 180W |      1MiB /  8116MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
    
     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+
    

    and

     # ls /dev/nvidia* -l
     -rw-r--r-- 1 root root 0 16.01.2017 20:11 /dev/nvidia-modeset
     crw-rw-rw- 1 nobody nobody 243, 0 16.01.2017 20:05 /dev/nvidia-uvm
     -rw-r--r-- 1 root root 0 16.01.2017 20:11 /dev/nvidia-uvm-tools
     crw-rw-rw- 1 nobody nobody 195, 0 16.01.2017 20:05 /dev/nvidia0
     crw-rw-rw- 1 nobody nobody 195, 255 16.01.2017 20:05 /dev/nvidiactl
    

Creating Snapshots and Templates

After all this hard work, you should save your progress on the container by creating a snapshot. If you chose ZFS as the base file system, this will be an option in the web interface under the specific container’s menu.

You could also turn the new GPU-enabled container into a template to serve as a basis for other containers. The option to convert a container into a template can be found by right-clicking the container’s name. Both operations can also be done from the command line, as sketched below.
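
Assuming the container ID from this guide (100) and a snapshot name of your choosing, a minimal sketch of the same operations with Proxmox’s pct command-line tool would look like this (note that converting a container into a template is a one-way operation, and the container should be stopped first):

     pct snapshot 100 gpu-ready   # create a snapshot named "gpu-ready"
     pct listsnapshot 100         # confirm the snapshot was created
     pct stop 100                 # templates are made from stopped containers
     pct template 100             # convert container 100 into a template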

Final Thoughts

That’s it, you are done. You can now move on to the more important work of using the home lab for learning or actual projects. Good luck, and let me know about the cool projects you build with your GPU-enabled Proxmox home lab.

Adding SSH Key to Remote Server

You can set up SSH keys so that a user name and password are not required on every login to a remote server.

  1. Generate a key pair on the local machine

    mkdir ~/.ssh # if ~/.ssh does not already exist
    cd ~/.ssh
    ssh-keygen # follow the on-screen instructions; use no passphrase
    ssh-add -K newly_created_key # add the key to the ssh-agent
    
  2. Put the key on the remote server

    ssh-copy-id -i newly_created_key.pub remote_user_name@remote_host
    

    You may see a warning:

    The authenticity of host ‘203.0.113.1 (203.0.113.1)’ can’t be established. ECDSA key fingerprint is fd:fd:d4:f9:77:fe:73:84:e1:55:00:ad:d6:6d:22:fe. Are you sure you want to continue connecting (yes/no)? yes

    You can safely answer yes.

  3. Test the setup

     ssh remote_user_name@remote_host
    

Finding a Data Set

Once you have picked a couple of topics based on the guidelines provided in my previous post, you need to find data sets to go with these topics. In this post, we will discuss what makes a data set good for data science projects. Then, we will list a couple of ways to find them.

Picking the Data Set

  1. Narrow, a.k.a. Tall and Skinny, Data

    Of the many components that make up data science, employers tend to focus on machine learning more than the other disciplines. Aside from being relevant to the topic of choice, the data set should showcase your machine learning skills. To facilitate machine learning, a data set with many records and relatively few attributes, i.e., a narrow data set, is preferred. If you need some examples of what a narrow data set looks like, I highly recommend downloading some of the data sets on Kaggle. The main data set in each competition is usually in the right form.

  2. Right Size

    While a larger data set is generally preferred for machine learning, a data set can also be too large and impede progress. Available computational power would become an issue, and it would be difficult to iterate on the project. Until you are comfortable using more advanced tools, such as Apache Spark, you should avoid data sets that are too large to fit into memory. Obviously, if you find a great data set that is too large, you can always use only a fraction of the data. When down-sampling the data, you should be careful not to alter the distribution of the original data set; a small sketch of one way to do this follows below.
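
    For illustration, here is a minimal pandas sketch of stratified down-sampling; the file name and the label column are hypothetical, and the idea is simply to sample the same fraction from every class so that the label distribution stays roughly the same:

     import pandas as pd

     # hypothetical data set with a categorical "label" column
     df = pd.read_csv("big_dataset.csv")

     # keep 10% of the rows, sampled within each label so that the
     # class proportions stay close to those of the full data set
     sample = (
         df.groupby("label", group_keys=False)
           .sample(frac=0.10, random_state=42)
     )

     # compare the label distributions of the full and sampled data
     print(df["label"].value_counts(normalize=True))
     print(sample["label"].value_counts(normalize=True))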

Finding a Data Set on the Internet

There is so much data available on the web. I am going to list some standard ways for data scientists to find, scrape, or curate the right data sets for their projects.

  1. Search Engines

    i. Try Google and Google Dataset Search

    Google should be the obvious first tool for finding a data set. It is the most popular search engine in the world because it works well. In fact, Google has a dedicated service for data sets called Dataset Search.

    ii. Go beyond the first couple of pages of search results

    Google and Google Dataset Search should suffice for many data sets, but what if you do not find the data you are looking for? Search engines are very good at common queries because the top results tend to satisfy most users. However, a data set search is not a typical query, so the results you need may not appear readily. In the past, I have found results relevant to me ten, maybe twenty, pages down.

    iii. Try other search engines

    While Google is the dominant search engine in the world, there are other search engines out in the wild. I often find useful information from the alternatives.

  2. Scraping

    Sometimes, data might be formatted for viewing on the web but not readily available for download. While cutting and pasting is a solution, it would likely be too laborious for any data set suitable for machine learning. In some cases, you might be able to scrape the data off the pages directly. There are many great tutorials on web scraping, so I am not going to repeat them here; a minimal sketch of the idea follows below.
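
    For a flavour of what that looks like, here is a minimal sketch using requests and BeautifulSoup; the URL and the table layout are hypothetical, and you should always check a site's terms of service and robots.txt before scraping:

     import csv

     import requests
     from bs4 import BeautifulSoup

     # hypothetical page containing an HTML table of records
     url = "https://example.com/stats/table"
     response = requests.get(url, timeout=30)
     response.raise_for_status()

     # collect every row of every table into a list of cell values
     soup = BeautifulSoup(response.text, "html.parser")
     rows = []
     for tr in soup.select("table tr"):
         cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
         if cells:
             rows.append(cells)

     # save the scraped rows as a CSV file for later analysis
     with open("scraped_data.csv", "w", newline="") as f:
         csv.writer(f).writerows(rows)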

  3. Finding and Calling Public APIs

    Aside from downloading data sets whole or scraping them from websites, you can also find public APIs that allow you to gather the right data. Calling APIs directly is a great way to practice your programming skills as well. You can find many public APIs available using a simple search; a small sketch of a typical call follows below.
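
    Calling a public API usually amounts to an HTTP GET followed by some JSON handling, as in this minimal sketch; the endpoint and parameters are hypothetical placeholders for whichever API you find:

     import json

     import requests

     # hypothetical public API endpoint; substitute the one you found
     url = "https://api.example.com/v1/measurements"
     params = {"city": "Toronto", "limit": 100}

     response = requests.get(url, params=params, timeout=30)
     response.raise_for_status()
     records = response.json()  # most public APIs return JSON

     # save the raw response so the project stays reproducible
     with open("api_data.json", "w") as f:
         json.dump(records, f, indent=2)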

Final Thoughts

I have listed here a couple of ways to find a data set for data science projects. There are many great resources on finding data sets. My hope is that this post will provide you with some ideas to get started.

Picking a Topic

As I mentioned in my previous post, projects are important for aspiring data scientists as a way to gain experience and exhibit their skills. A question naturally arises from the need to do self-directed projects: what topics are appropriate for such projects? If you are looking to enhance your resume and your GitHub profile, you do not have constraints from a non-existent manager or client. While this freedom seems wonderful at first, many of my mentees have found it to be overwhelming.

To guide my mentees, I always advise them to pick a topic that is important to them on a personal level. Compared to the standard topics that can be found on Kaggle and other data-science-focused sites, a personal topic has several advantages:

1. Originality

Anyone who has been around data science long enough would have seen Twitter sentiment analysis or classifying images of cats and dogs a few dozen times. I am personally tired of seeing the same old capstone projects over and over again. While one could always put a new spin on an old topic, this is a difficult task for an inexperienced data scientist.

On the other hand, a topic of personal interest is unlikely to be a retread of the tired old themes in data science. Given the fierce competition in the data science job market, any opportunity to stand out from the crowd is good. As I mentioned in the previous post, the purpose of the capstone project is to hone and show off your data science skills. Without constraints from prior work, one can explore the topic freely using any technique and thus show off one's best self.

2. Passion

In all the data science interviews that I have been involved in, both as an interviewee and as an interviewer, there was always a presentation component. It can be difficult to make these presentations engaging because the content is mostly determined by industry standards. In each presentation, one must include an introduction of the topic and the data set, an exploratory data analysis, a statistical or machine learning model, and finally an analysis of the results from the model. Given the mechanical nature of these presentations, a personal passion for the topic will shine through. Actually caring about the topic injects an excitement that is not commonly found in corporate presentations. This small difference may help you stand out from other candidates.

3. Expertise

As a burgeoning data scientist, one is unlikely to be the expert on the typical capstone topic during the interview process. In fact, the interviewers will likely have more expertise on both the topic and the techniques involved. In my experience, it is difficult to present to people who are more knowledgeable than I am. While you cannot change the fact that the interviewers are technically more proficient, you can at least speak with authority on the topic if it is of personal interest to you.

Final Thoughts

Picking a topic based on personal interest is a way to take advantage of the freedom of not having anyone to answer to. I have shown several advantages of picking a project topic that can show off your passion and knowledge. The next step in a data science project is to find a data set; in my next post, I will discuss how to pick an appropriate one.