Blog E

April 7th, 2018

[Slides] Consistent Hashing: Algorithmic Tradeoffs

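These slides (first version) are what I put together while reading Damian Gryski's blog post "Consistent Hashing: Algorithmic Tradeoffs" over the past few days. I will probably find time to write a Chinese blog post about it as well.

Slides: Consistent hashing algorithmic tradeoffs, from Evan Lin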

Continue reading

March 27th, 2018

[GTG30] Introduce vgo

Here are the slides from this GTG (Golang Taipei Gathering 30) meetup. They mainly introduce vgo, a new feature that may be included in Go 1.11.

Slides: GTG30: Introduction vgo, from Evan Lin

Continue reading

March 14th, 2018

[Kubernetes] GPU resource names in Kubernetes between Accelerators and DevicePlugin

Two ways to enable GPU in Kubernetes:

If you want to enable GPU resources in Kubernetes and have the kubelet allocate them, configure it in one of the following ways:

- Kubernetes 1.7: use the NVIDIA container and enable the kubelet flag --feature-gates=Accelerators=true
- Kubernetes 1.9: use a Device Plugin with the kubelet flag --feature-gates=DevicePlugins=true

Check whether a node has GPU resources:

Use the command kubectl get node YOUR_NODE_NAME -o json to export all node info in JSON format. You should see something like:

## If you use Kubernetes Accelerator after 1.7

    "allocatable": {
        "cpu": "32",
        "memory": "263933300Ki",
        "alpha.kubernetes.io/nvidia-gpu": "4",
        "pods": "110"
    },

Details are defined in k8s.io/api/core/v1/types.go

## If you use Kubernetes Device Plugin after 1.9

    "allocatable": {
        "cpu": "32",
        "memory": "263933300Ki",
        "nvidia.com/gpu": "4",
        "pods": "110"
    },

## Reference:

- Managing Compute Resources for Containers
- Kubernetes: Device Plugin
- NVIDIA-k8s-device-plugin
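To tell the two setups apart in a script, one option is to look for either resource name in the node JSON. The following is only a minimal sketch: the helper name `allocatable_gpus` is hypothetical, and it assumes the `allocatable` map sits under `status` in the output of `kubectl get node YOUR_NODE_NAME -o json`.

```python
import json

# Resource names copied from the two "allocatable" snippets in this post.
GPU_RESOURCE_NAMES = (
    "nvidia.com/gpu",                  # Device Plugin (Kubernetes 1.9)
    "alpha.kubernetes.io/nvidia-gpu",  # Accelerators feature gate (Kubernetes 1.7)
)


def allocatable_gpus(node):
    """Return (resource_name, count) for the first GPU resource found.

    `node` is the dict parsed from `kubectl get node NAME -o json`.
    Assumption: the allocatable map lives under node["status"].
    Returns (None, 0) when neither resource name is present.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    for name in GPU_RESOURCE_NAMES:
        if name in allocatable:
            return name, int(allocatable[name])
    return None, 0


# Sample input trimmed to the fields shown in the post.
sample_node = json.loads("""
{
  "status": {
    "allocatable": {
      "cpu": "32",
      "memory": "263933300Ki",
      "nvidia.com/gpu": "4",
      "pods": "110"
    }
  }
}
""")

print(allocatable_gpus(sample_node))
```

In practice you would feed it the real command output, for example by piping `kubectl get node YOUR_NODE_NAME -o json` into a script that calls `json.load(sys.stdin)`.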

Continue reading

March 13th, 2018

[Kubernetes] Use activeDeadlineSeconds to automatically terminate (force stop) your jobs

Preface (translated from Chinese):

When using Kubernetes, you can run one-off work as a Job. If you need that work to finish within a specific time so its resources are released, this is the way to do it. While looking into this recently I found a few usage tips, so I am writing them down.

Preface:

If you want to force-terminate a Kubernetes Job when it exceeds a specific time (e.g. run a job for no longer than 2 minutes), you can use a watcher to monitor the Job and terminate it once it exceeds that time. Or you can refer to the K8S doc "Job Termination and Cleanup" and use activeDeadlineSeconds to force-terminate your jobs.

How to use activeDeadlineSeconds:

It is very easy to set up activeDeadlineSeconds in the spec.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      backoffLimit: 5
      activeDeadlineSeconds: 100
      template:
        spec:
          containers:
          - name: myjob
            image: busybox
            command: ["sleep", "300"]
          restartPolicy: Never

In this example, the job will be terminated after 100 seconds (if it works well :p ).

Before you use activeDeadlineSeconds

If you have ever run a job with activeDeadlineSeconds, you will need to delete the job before you run the same job again. The job...
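For comparison, the watcher approach mentioned in the preface needs some logic to decide when a Job has run too long. Below is a rough sketch of that check only, not of a full watcher. The function name `exceeded_deadline` is hypothetical, and it assumes the Job object (as a dict, e.g. from `kubectl get job myjob -o json`) carries a UTC `status.startTime` such as `2018-03-13T08:00:00Z`. Actually deleting the Job is left out.

```python
from datetime import datetime, timezone


def exceeded_deadline(job, deadline_seconds, now=None):
    """Return True if the Job has been running longer than deadline_seconds.

    Assumption: job["status"]["startTime"] is a UTC timestamp formatted
    like "2018-03-13T08:00:00Z". A Job without a start time has not
    started yet, so it cannot have exceeded the deadline.
    """
    start = job.get("status", {}).get("startTime")
    if start is None:
        return False
    started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - started).total_seconds() > deadline_seconds


# Illustrative Job object and a fixed "current time" 150 seconds after start.
job = {"metadata": {"name": "myjob"}, "status": {"startTime": "2018-03-13T08:00:00Z"}}
check_time = datetime(2018, 3, 13, 8, 2, 30, tzinfo=timezone.utc)

print(exceeded_deadline(job, 100, now=check_time))
```

Setting activeDeadlineSeconds in the spec makes this kind of external bookkeeping unnecessary, which is why the post recommends it.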

Continue reading

February 14th, 2018

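[LineBot] Announcement: [Line 台北流浪動物需要你] (Taipei Stray Animals Need You) renamed to [Line 流浪動物需要你] (Stray Animals Need You), with service expanded to all of Taiwan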

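The stray animal data in "Taipei Open Data" suddenly became unreachable, which broke the Line Bot I wrote earlier, "台北流浪動物需要你" (Taipei Stray Animals Need You). So I went looking for government data and found the "動物認領養" (animal adoption) open dataset (https://data.gov.tw/dataset/9842#r0). The bot has been updated, and I hope everyone keeps using it. If you need some company over the Lunar New Year, give it a try, then go adopt a cute furry family member after the holiday.

Quick guide:

- Type any text and a random animal appears.
- Type "狗" (dog) or "貓" (cat) to get adoption data for that kind of animal.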

Continue reading

January 19th, 2018

[Recommended Article] Scaling Kubernetes to 2,500 Nodes

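Original article: Scaling Kubernetes to 2,500 Nodes

Background

I have been busy lately, so only late at night do I get to sit down, read some articles, and keep learning. (My company is so full of experts that I have no choice but to keep working hard lol)

OpenAI recently published a technical article that is well worth reading, in which they share how they manage more than 2,500 nodes. We all know that Kubernetes has claimed to support 5,000 or more nodes since 1.6, but on the way from a few dozen machines to 2,500, surely you run into some problems? The article covers the problems they hit, what they tried and what they suspected, and the real causes they finally found. It is worth a careful read for anyone in DevOps.

Problems and solutions

First problem: after 1 ~ 500 nodes

Symptom: kubectl sometimes timed out. (P.S. kubectl -v=6 shows the details of every API request.)

What they tried: At first they assumed the kube-apiserver was overloaded, so they put it behind a proxy and added replicas to balance the load. But once they had more than 10 replica masters, they concluded the problem could not be that kube-apiserver was unable to cope, because GKE can handle 500 nodes with a single 32-core VM.

Cause: With those ruled out, they began to suspect the remaining services on the master (etcd, kube-proxy), and started by tuning etcd. Using Datadog to look at etcd throughput, they found abnormal latency (latency spiking to ~100 ms). Benchmarking with the Fio tool showed that only 10% of the available IOPS (input/output operations per second) were being used, because write latency (2 ms) was dragging down overall performance. They moved the SSD from a network-attached disk to a local temp drive on each machine (still SSD). As a result, latency went from ~100 ms to 200 us.

Second problem: at ~1,000 nodes

Symptom: kube-apiserver was reading 500 Mb per second from etcd.

What they tried: They used Prometheus (a tool widely used on Kubernetes for collecting and monitoring container data) to inspect network traffic between containers.

Cause: Fluentd (a tool for forwarding data and logs) and Datadog were polling data on every node too frequently. After lowering the polling frequency of both services, network usage immediately dropped from 500 Mb/s to almost nothing…

etcd tip: With --etcd-servers-overrides you can split off the writes of Kubernetes Event data and have separate machines handle them. Example: --etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381...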
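The override value in the example above has a simple shape: a resource path, a `#`, then the etcd server URLs joined with `;`. As a small illustration of that format only, here is a sketch with two hypothetical helpers (`etcd_servers_override` and `parse_override` are names invented for this example, not part of Kubernetes) that build and split such a value, using the URLs from the example.

```python
def etcd_servers_override(resource, servers):
    """Build one --etcd-servers-overrides flag in the form resource#url1;url2;..."""
    if not servers:
        raise ValueError("at least one etcd server URL is required")
    return "--etcd-servers-overrides={}#{}".format(resource, ";".join(servers))


def parse_override(value):
    """Split a 'resource#url1;url2' value into (resource, [url1, url2])."""
    resource, _, server_list = value.partition("#")
    return resource, server_list.split(";")


# URLs taken from the example in the post.
event_servers = [
    "https://0.example.com:2381",
    "https://1.example.com:2381",
    "https://2.example.com:2381",
]

flag = etcd_servers_override("/events", event_servers)
print(flag)
print(parse_override(flag.split("=", 1)[1]))
```

Generating the flag from a list keeps the set of dedicated event etcd servers in one place when the kube-apiserver arguments are templated.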

Continue reading