Sunday, August 13, 2023

Expectations from an SRE

(In the words of my interviewer - I didn't pass)

Generally, for an SRE-type engineer I expect a T-shaped skillset.

So there is a minimum baseline that has to be met across these categories:

  1. Linux fundamentals
  2. Software Engineering skills
  3. Networking
  4. Security best practices
  5. Containers & container orchestration
  6. Troubleshooting various scenarios related to the above
  7. Distributed systems & scalability & reliability
  8. Understanding of cloud-based providers
  9. Infrastructure-as-code

So that's the baseline.

Beyond that, at higher levels, folks will be expected to have a much more in-depth skillset in one of the above categories (like Cloud, or IaC, or very large scale), and/or depth in a particular domain like build systems, platform building on k8s, or the like.

(So perhaps this blog will pivot towards exploring these concepts as I try to learn them? Who knows!)

Tuesday, December 20, 2022

lambda functions (=anonymous functions)

You can't import them,

so you cannot unit-test them,

so you cannot reason about them.

Perhaps they are small and inconsequential.

But perhaps they are not.
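
To make this concrete, here is a minimal sketch in Python (the file and function names are made up for illustration):

# utils.py -- a named function is importable, hence unit-testable
def double(x):
    return x * 2

# test_utils.py -- the test can import the function by name
from utils import double

def test_double():
    assert double(3) == 6

# The anonymous equivalent, inlined at a call site:
#     doubled = map(lambda x: x * 2, values)
# has no name to import, so no test can target it directly.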

#polkadot

Monday, April 4, 2016

GraphLab Create via PyCharm

People who want to start using GraphLab Create usually ask two questions:

1) Where do I get GraphLab from?
2) What IDE should I use in my Python projects?

The first answer is to use the Dato Launcher. This is a bundle of Python, GraphLab Create and IPython Notebook. IPython Notebook is the best IDE for data scientists, and is also what we use in the Coursera Machine Learning Foundations course.

The second answer is JetBrains' PyCharm, which is becoming the de-facto standard Python IDE. This is what we use in big projects which also require lots of debugging.

To make PyCharm use the Python bundled with the Dato Launcher, follow these instructions, which are detailed and include lots of screenshots.

You can download this PDF version of the instructions, which comes as a single file, or the HTML version, which is zipped - unzip all the files to see the pictures as well.
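
Once PyCharm is configured, a quick sanity check is to run something like the following from within PyCharm (a sketch - the exact path printed depends on where you installed the Dato Launcher):

import sys
print(sys.executable)   # should point at the Python bundled with the Dato Launcher

import graphlab         # should import cleanly if the interpreter is set correctly
print(graphlab.SArray([1, 2, 3]))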

This guide was originally written in Markdown using the Mou editor.


Saturday, October 3, 2015

7 Links To Convince You That Big Data Isn't Your Problem

tl;dr - scroll to the bottom for a list of 7 links you can read instead of this entire post.

I'll spare you the big bulk of my words and get to the point:

From my personal experience with big data projects, I conclude that it's a big (data) fat lie (or a big fat [data] lie, if you'd like). Most people don't have big data to begin with; they use big data technologies so they can say 'we use big data in our product' and win more sales. Below is the story as a set of links: read through these links, bearing in mind the key points I attached to each one of them. I hope that by the time you finish reading, you'll be more convinced of this point.

And yes, big data technologies still have a place in today's world: buzzwords have always been a part of business, and somebody has to purchase all of those disks and RAM and CPUs (packed nicely into "commodity servers"). But if you are a penniless start-up worrying about how you will close the gap and get in the game, rest assured: the game is mostly imaginary - as always!

I dedicate this post to my friend Rony, an economist who asked me for tips on how to incorporate "big data" into her research. A short set of questions showed me that she was not aware of what big data really is, and that it wasn't relevant for her case. She would use computers, programming and databases - but she should not worry that she is missing any knowledge. So to people like her, whose fears are fed by the legends big data salespeople spread to get more cash, I dedicate this post.

The Beginning: Google Gets There First (as always!)

I mark the beginning of the era of 'big data' with two research papers published by Google.

The first paper described the Google File System (GFS), and you can (and should!) read it here:
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
This is a file system that specializes in scalability, in appending incoming data in massive volumes to existing files, and in high availability (through redundancy of data). These were all valid concerns for Google's business - but not for just any company.

The second paper described the MapReduce paradigm ("MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004).
Basically it means that if I have a book and I want to count how many times each word appears in it, I can split it into several chunks, count the words separately in each chunk, and then combine the results. The split+count part is called MAP, the combine part is called REDUCE. This is the nuttiest of all nutshell explanations, but it is enough to get a few points across. First, this paradigm fits batch processing of rather homogeneous data. Second, it doesn't fit non-aggregative batch processing tasks. An example is graph algorithms: looking at a node in a graph and retrieving its friends, its friends' friends, etc., is something you don't normally do using MapReduce.
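
To make MAP and REDUCE concrete, here is a toy single-process sketch in Python (a real framework would distribute the chunks across machines; this only shows the shape of the computation):

import operator
from collections import Counter
from functools import reduce

book = "the cat sat on the mat the end".split()
chunks = [book[:4], book[4:]]   # pretend each chunk lives on a different machine

# MAP: count the words separately in each chunk
partial_counts = [Counter(chunk) for chunk in chunks]

# REDUCE: combine the partial results into one count
total = reduce(operator.add, partial_counts)
print(total)   # Counter({'the': 3, 'cat': 1, 'sat': 1, ...})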

These two concepts combine into a single system like this: the data is stored using a GFS-like filesystem, then processed using the MapReduce paradigm (the more general term is 'lambda architectures'). Google published these papers but did not release any code, so Doug Cutting and Mike Cafarella decided to implement such a system, in work later backed by Yahoo. This became an open-source Apache project named Hadoop. So: Hadoop = open-source GFS + MapReduce. Hadoop became the generic name for a big data processing system and is widely used in academia and industry. But enough about Hadoop at this point.

You might notice that these papers are from 2003 and 2004. Google itself has already moved past that point and is using other technologies which don't yet have open-source implementations. Much has been written about how Hadoop is no longer the cutting-edge big data framework.


The Disillusionment: Where Big Data Fails in General, and Hadoop Fails Specifically

Instead of presenting my own opinion, please look at this KDnuggets poll: when asked about the largest dataset they have analyzed, data scientists (the people who actually analyze data) respond that a laptop is usually enough for their needs.

(I had linked the previous year's poll before noticing that a new version came out. Still, here it is; you can compare each year's results to see how stable they are:
http://www.kdnuggets.com/polls/2014/largest-dataset-analyzed-data-mined-2014.html
)

Why is that? For two reasons. The first is that big data is not always necessary for developing a good data-driven application. Let's say we are trying to create personalized messages for e-commerce websites. Is the entire history of all users necessary? Maybe just the last 3 months will do? How much does that data weigh? Is a sample of the data enough for building a good model (rather than the entire corpus)? Is the entire corpus even tagged (labelled) so that we can use it for building our model? This is not always the situation. Quantity cannot always compensate for quality.
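
One cheap way to check whether a sample is enough: train on growing fractions of the data and watch the validation score. Here is a sketch using scikit-learn (the data is random, just to make it runnable - substitute your own X and y):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.randn(10000, 20)
y = (X[:, 0] + 0.1 * rng.randn(10000) > 0).astype(int)

# train on 10%, 32.5%, ..., 100% of the data and cross-validate each time
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(n, round(score, 3))
# If the curve flattens early, a sample is enough - no cluster required.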

The second reason is that most big data systems are aimed at being scalable. That means that such a system deployed on two machines (working together on the computation) will perform better than the same system deployed on a single machine (and so on: 3 machines are better than 2, 4 are better than 3, etc.). More machines also allow for robustness - if one of the machines in a Hadoop cluster dies, the others can pick up from where it left off and finish the job. So big data systems are usually scalability-oriented. Nice!

But do they perform better than a non-scalability-oriented, single-threaded system?

Frank McSherry et al. checked exactly this and came back with surprising results. Read their paper here:
Frank McSherry, Michael Isard, Derek G. Murray
Scalability! But at what COST?
http://www.frankmcsherry.org/assets/COST.pdf

A shorter read would be this blog post reviewing the paper (from Adrian Colyer's 'the morning paper'):
http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/

And also McSherry's own follow-up post about testing the single laptop on even bigger data:
http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html

For me, this set of publications is what finally took the hot air out of the big data balloon. Most people don't have big data, and won't benefit (performance-wise, not business-sexiness-wise) from using those systems. That's what I see when I read the links posted above.
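
To get a feel for how little machinery the computation itself needs, here is a toy single-threaded PageRank over an edge list in Python (McSherry's measurements used Rust code on much bigger graphs; this sketch only shows the spirit):

from collections import defaultdict

# a tiny directed graph as an edge list
edges = [(0, 1), (1, 2), (2, 0), (2, 1), (3, 2)]

nodes = {u for e in edges for u in e}
out_degree = defaultdict(int)
for src, _ in edges:
    out_degree[src] += 1

# standard PageRank iterations with damping factor 0.85
rank = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(20):
    incoming = defaultdict(float)
    for src, dst in edges:
        incoming[dst] += rank[src] / out_degree[src]
    rank = {n: 0.15 / len(nodes) + 0.85 * incoming[n] for n in nodes}

print(rank)

Swap the toy edge list for a file with billions of edges and you get, in essence, the kind of single-laptop setup McSherry benchmarked against the clusters.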

The Truth: Where Big Data Is Used Successfully

So is it completely useless to analyze large amounts of data? As I mentioned above - that depends on your goals. Here is an opposite opinion - a positive use-case for using big data:

http://www.ft.com/cms/s/0/304b983e-5a44-11e5-a28b-50226830d644.html
Charles Clover, China’s Baidu searches for AI edge, Financial Times (September 14, 2015 4:24 am)

Here is the (probably copyrighted - sorry about that!) most important part in my opinion:
The company has an advantage in deep-learning algorithms for speech recognition in that most video and audio in China is accompanied by text — nearly all news clips, television shows and films are close-captioned and almost all are available to Baidu and Iqiyi, its video affiliate. 
While a typical academic project uses 2,000 hours of audio data to train voice recognition, says Mr Ng, the troves of data available to China’s version of Google mean he is able to use 100,000 hours. 
He declines to specify just how much the extra 98,000 hours improves the accuracy of his project, but insists it is vital. 
“A lot of people underestimate the difference between 95 per cent and 99 per cent accuracy. It’s not an ‘incremental’ improvement of 4 per cent; it’s the difference between using it occasionally versus using it all the time,” he says.
So in this domain (speech recognition), using this algorithm (deep learning), big data of this high quality (closed captions) is successfully used. The same goes for image recognition with neural networks over large corpora of tagged images: Google and Facebook have such datasets and use them successfully. But not every company is Google or Facebook. Are you?

Conclusion

Before you get confused by big data, please focus on the following points:

Big Data systems are usually a combination of scalable file systems and scalable processing architectures. The household name in this field is Hadoop, which is based on these two Google papers (although Google itself doesn't use these technologies nowadays):
1) http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
2) http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Most real data scientists admit they don't really use big data...
3) http://www.kdnuggets.com/polls/2014/largest-dataset-analyzed-data-mined-2014.html

... and new evidence shows that these systems are also not very efficient. The paper by McSherry et al.:
4) http://www.frankmcsherry.org/assets/COST.pdf
And two blog posts about this paper:
5) http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/
6) http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html

Big data is sometimes useful - here's an example. Understand whether you are in such a situation:
7) http://www.ft.com/cms/s/0/304b983e-5a44-11e5-a28b-50226830d644.html
If you are not - don't feel bad. You're in good company :)

The Fine Print

I currently work for Dato, the creator of GraphLab Create. Our product is famous for analyzing massive datasets on a single laptop or a single machine. Nevertheless, these opinions do not represent Dato. I formed them by experimenting with GraphLab Create, and seeing for myself that good data science can be done on less than petabytes, and on a single machine.

Monday, July 27, 2015

Neural Networks and Deep Learning by Michael Nielsen

1) The best deep learning tutorial I've come across in the last few months is here:

Neural Networks and Deep Learning by Michael Nielsen
http://neuralnetworksanddeeplearning.com/index.html
Read it now! This post is just an attempt to add one more link on the web pointing to this great book.


2) The best deep learning Python (what else??) library I came across is Keras.

http://keras.io/
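
As a taste, here is a minimal sketch of a Keras model (written against the current Keras API, which may differ from the version available as I write this; the data is random, just to make it runnable):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# random stand-in data: 1000 samples, 20 features, binary labels
X = np.random.randn(1000, 20)
y = (np.random.rand(1000) > 0.5).astype(int)

# a tiny fully-connected network
model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=3, batch_size=32)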


3) Everyone's favourite deep learning post of the moment is Karpathy's RNN post:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

But imho you should read Yoav Goldberg's response:
http://nbviewer.ipython.org/gist/yoavg/d76121dfde2618422139


And that's it for today.

Tuesday, June 2, 2015

python2.7 + graphlab create on centos

CentOS is notorious for using yum as its package manager (the equivalent of Ubuntu's apt-get). Yum depends on the Python version that ships with CentOS, so upgrading the system Python to 2.7 simply kills yum and breaks system updates. The solution is to install Python 2.7 alongside the system Python rather than replacing it. Here's how.


Long story short: You can follow the instructions here:

https://github.com/h2oai/h2o-2/wiki/Installing-python-2.7-on-centos-6.3.-Follow-this-sequence-exactly-for-centos-machine-only

Or simply use my gist based on this post and my previous post:

https://gist.githubusercontent.com/guy4261/0e9f4081f1c6b078b436/raw/bfded5dd1c53284a83872bc61444511e02d99212/python2.7.9_centos_installation.sh

Monday, May 18, 2015

Learning Torch7 - Part 2: The Basic Tutorial

From this part of the page:
https://github.com/torch/torch7/wiki/Cheatsheet#newbies

I got to the tutorials part:
https://github.com/torch/torch7/wiki/Cheatsheet#tutorials-demos-by-category

And to the first tutorial:
http://code.madbits.com/wiki/doku.php

And to its first part, about the basics:
http://code.madbits.com/wiki/doku.php?id=tutorial_basics

The tutorial is straightforward, but it ends with a request to normalize the MNIST training data.

This is the code I came up with after a very painful realization.

The realization is what you see on line 4:

-- cast the loaded data from "userdata" back to the default tensor type
train.data = train.data:type(torch.getdefaulttensortype())


It appears that the data is loaded by torch not into tables or torch tensors, but into something called "userdata". This is the type assigned to things coming from the dark depths of the C/C++ interop with Lua. This userdata had one particularly interesting feature: its contents were forced (through wrap-around) to be in the range [0, 256). So storing -26, for instance, would result in the value 256 - 26 = 230 appearing in the data. Casting was therefore the first step to regaining my sanity in this case.

Only after casting back to torch tensors can you use :mean(), :std() and the other tensor methods that make this code short and quick.

By the way - this normalization is called a "z-score": http://en.wikipedia.org/wiki/Standard_score
I didn't quite catch the right way to describe the whole act in proper English, but that's what we are doing here: subtracting the mean and dividing by the standard deviation, i.e. z = (x - mean) / std.

More useful things learned along the way:

-- print every key/value pair of o (its fields and methods)
for k,v in pairs(o) do -- or pairs(getmetatable(o))
   print(k, v)
end

is the equivalent of Python's dir(o).

I also started working with this IDE: http://studio.zerobrane.com/

Moving on with the tutorials - supervised learning will be the subject of my next post.