Group items tagged

Improving Kubernetes reliability: quicker detection of a Node down | Fatal failure - 0 views

fatalfailure.wordpress.com/...icker-detection-of-a-node-down

kubernetes HA

shared by 張旭 on 21 Jul 21

When a node goes down, the pods on the broken node keep running for some time and still receive requests, and those requests fail.
1- The kubelet posts its status to the masters using --node-status-update-frequency=10s.
2- A node dies.
3- The kube-controller-manager is the component that monitors the nodes: using --node-monitor-period=5s, it checks, in the masters, the node status reported by the kubelet.
4- The kube-controller-manager sees that the node is unresponsive and waits for the grace period --node-monitor-grace-period=40s before it considers the node unhealthy.
node-status-update-frequency x (N-1) != node-monitor-grace-period
5- Once the node is marked as unhealthy, the kube-controller-manager removes its pods based on --pod-eviction-timeout=5m0s.
6- kube-proxy has a watcher on the API, so the moment the pods are evicted, the proxy notices and updates the node's iptables, removing the endpoints from the services so the failing pods are no longer reachable.
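The six steps above can be turned into back-of-the-envelope arithmetic for how long a dead node's pods linger. This is a rough model that assumes the steps run strictly one after another and ignores kube-proxy propagation delay. The defaults are the flag values quoted above, and the tuned values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFailureTimings:
    """Timing flags from the steps above, in seconds (defaults as quoted)."""
    node_status_update_frequency: int = 10  # kubelet --node-status-update-frequency
    node_monitor_period: int = 5            # controller manager --node-monitor-period
    node_monitor_grace_period: int = 40     # controller manager --node-monitor-grace-period
    pod_eviction_timeout: int = 300         # controller manager --pod-eviction-timeout (5m0s)


def worst_case_eviction_delay(t: NodeFailureTimings) -> int:
    """Node dies right after posting its status: the full grace period runs,
    the controller may notice up to one monitor period late, then eviction waits."""
    return t.node_monitor_grace_period + t.node_monitor_period + t.pod_eviction_timeout


def best_case_eviction_delay(t: NodeFailureTimings) -> int:
    """Node dies just before its next status post, so part of the grace period
    has already elapsed, and the controller checks exactly when it expires."""
    remaining_grace = max(0, t.node_monitor_grace_period - t.node_status_update_frequency)
    return remaining_grace + t.pod_eviction_timeout


defaults = NodeFailureTimings()
print(best_case_eviction_delay(defaults), worst_case_eviction_delay(defaults))  # 330 345
```

With the defaults, the pods of a dead node can keep receiving traffic for roughly five and a half minutes. With hypothetical tuned values such as `NodeFailureTimings(4, 2, 20, 30)`, the worst case drops to 52 seconds. Almost all of the default delay comes from the eviction timeout.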

HowTo/LDAP - FreeIPA - 0 views

www.freeipa.org/LDAP

ldap authorization authentication auth security

shared by 張旭 on 03 Dec 19

The basedn in an IPA installation consists of a set of domain components (dc) for the initial domain that IPA was configured with.
You will only ever have one basedn, the one defined during installation.
Find your basedn, and other interesting things, in /etc/ipa/default.conf.
IPA uses a flat structure, storing like objects in what we call containers.
Users: cn=users,cn=accounts,$SUFFIX
Groups: cn=groups,cn=accounts,$SUFFIX
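The base DN and the user and group containers above can be derived mechanically. This is a minimal sketch that assumes /etc/ipa/default.conf uses an INI-style `[global]` section with a `basedn` key. The sample file contents and the example.com domain are hypothetical.

```python
import configparser

# Hypothetical contents of /etc/ipa/default.conf for an IPA domain example.com.
SAMPLE_DEFAULT_CONF = """\
[global]
basedn = dc=example,dc=com
realm = EXAMPLE.COM
domain = example.com
"""


def basedn_from_domain(domain: str) -> str:
    """Build the base DN from the initial IPA domain, one dc= per label."""
    return ",".join(f"dc={label}" for label in domain.split("."))


def read_basedn(conf_text: str) -> str:
    """Read the basedn value from default.conf-style text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.get("global", "basedn")


def container_dn(container: str, suffix: str) -> str:
    """Users and groups live under cn=accounts,$SUFFIX (see the paths above)."""
    return f"cn={container},cn=accounts,{suffix}"


suffix = read_basedn(SAMPLE_DEFAULT_CONF)
print(container_dn("users", suffix))   # cn=users,cn=accounts,dc=example,dc=com
print(container_dn("groups", suffix))  # cn=groups,cn=accounts,dc=example,dc=com
```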
Do not use the Directory Manager account to authenticate remote services to the IPA LDAP server. Use a system account.
The reason to use an account like this rather than creating a normal user account in IPA and using that is that the system account exists only for binding to LDAP. It is not a real POSIX user, can't log into any systems and doesn't own any files.
This user also has no special rights and cannot write any data to the IPA LDAP server; it can only read.
When possible, configure your LDAP client to communicate over SSL/TLS.
The IPA CA certificate can be found in /etc/ipa/ca.crt
/etc/openldap/ldap.conf

Full Cycle Developers at Netflix - Operate What You Build - 1 views

netflixtechblog.com/lopers-at-netflix-a08c31f83249

devops system

shared by 張旭 on 27 May 20

Researching issues felt like bouncing a rubber ball between teams: it was hard to catch the root cause and harder yet to stop it from bouncing from one team to another.
In the past, Edge Engineering had ops-focused teams and SRE specialists who owned the deploy+operate+support parts of the software life cycle
hearing about those problems second-hand
devs could push code themselves when needed, and also were responsible for off-hours production issues and support requests
What were we trying to accomplish and why weren’t we being successful?
These specialized roles create efficiencies within each segment while potentially creating inefficiencies across the entire life cycle.
Grouping differing specialists together into one team can reduce silos, but having different people do each role adds communication overhead, introduces bottlenecks, and inhibits the effectiveness of feedback loops.
devops principles
develops a system also be responsible for operating and supporting that system
Each development team owns deployment issues, performance bugs, capacity planning, alerting gaps, partner support, and so on.
Those centralized teams act as force multipliers by turning their specialized knowledge into reusable building blocks.
Communication and alignment are the keys to success.
Full cycle developers are expected to be knowledgeable and effective in all areas of the software life cycle.
ramping up on areas they haven’t focused on before
We run dev bootcamps and other forms of ongoing training to impart this knowledge and build up these skills
“how can I automate what is needed to operate this system?”
“what self-service tool will enable my partners to answer their questions without needing me to be involved?”
A full cycle developer thinks and acts like an SWE, SDET, and SRE. At times they create software that solves business problems, at other times they write test cases for that, and still other times they automate operational aspects of that system.
the need for continuous delivery pipelines, monitoring/observability, and so on.
Tooling and automation help to scale expertise, but no tool will solve every problem in the developer productivity and operations space

How to configure a Kubernetes Multi-Pod Deployment - Stack Overflow - 0 views

stackoverflow.com/...ubernetes-multi-pod-deployment

kubernetes deploy

shared by 張旭 on 02 Oct 21

A Deployment is meant to represent a single group of Pods fulfilling a single purpose together.
Deployments are meant to contain stateless services. If you need to store state, create a StatefulSet instead.

Tagging AWS resources - AWS General Reference - 0 views

docs.aws.amazon.com/...aws_tagging.html

aws cloud

shared by 張旭 on 20 Jan 22

assign metadata to your AWS resources in the form of tags.
a user-defined key and value
Tag keys are case sensitive.
Tag values are case sensitive.
Tags are accessible to many AWS services, including billing.
personally identifiable information (PII)
apply it consistently across all resource types.
Use automated tools to help manage resource tags.
Use too many tags rather than too few tags.
Tag policies let you specify tagging rules that define valid key names and the values that are valid for each key.
Name – Identify individual resources
Environment – Distinguish between development, test, and production resources
Project – Identify projects that the resource supports
Owner – Identify who is responsible for the resource
Each resource can have a maximum of 50 user-created tags.
For each resource, each tag key must be unique, and each tag key can have only one value.
Tag keys and values are case sensitive.
decide on a strategy for capitalizing tags, and consistently implement that strategy across all resource types.
AWS Cost Explorer and detailed billing reports let you break down AWS costs by tag.
An effective tagging strategy uses standardized tags and applies them consistently and programmatically across AWS resources.
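The rules above lend themselves to an automated check, in the spirit of "use automated tools to help manage resource tags". These rules include the limit of at most 50 user-created tags, the requirement that each key be unique, case-sensitive keys, and a consistent capitalization strategy. This is a hypothetical pre-deployment helper, not an AWS API. The recommended keys are the Name/Environment/Project/Owner examples listed above.

```python
from collections import Counter

MAX_USER_TAGS = 50
RECOMMENDED_KEYS = ("Name", "Environment", "Project", "Owner")


def check_tags(tags: list[tuple[str, str]]) -> list[str]:
    """Return human-readable problems for a resource's (key, value) tag pairs."""
    problems = []
    if len(tags) > MAX_USER_TAGS:
        problems.append(f"too many tags: {len(tags)} > {MAX_USER_TAGS}")

    counts = Counter(key for key, _ in tags)
    # Each tag key must be unique and can have only one value.
    problems += [f"duplicate key: {key}" for key, n in counts.items() if n > 1]

    # Keys are case sensitive, so "environment" and "Environment" are two
    # different tags; flag them as a sign of inconsistent capitalization.
    variants: dict[str, list[str]] = {}
    for key in counts:
        variants.setdefault(key.lower(), []).append(key)
    problems += [
        "case-variant keys: " + ", ".join(sorted(keys))
        for keys in variants.values()
        if len(keys) > 1
    ]

    problems += [f"missing recommended key: {k}" for k in RECOMMENDED_KEYS if k not in counts]
    return problems


print(check_tags([("Name", "web-1"), ("Environment", "prod"), ("environment", "dev")]))
```

Running a check like this in CI, before resources are created, is one way to apply tags "consistently and programmatically". AWS tag policies can enforce similar rules on the AWS side.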

The Squeaky Blog | Why we don't use a staging environment - 0 views

squeaky.ai/...dont-use-a-staging-environment

devops

shared by 張旭 on 15 Apr 22

Pre-live environments are never at parity with production
multiple people use staging to validate their changes before release.
Branches are then constantly out of sync with each other, and problems often surface when you merge, rebase, and backfill hotfixes.
Big Bang releases
there is a lengthy suite of tests and checks that run before it is deployed to staging. During this period, which could end up being hours, engineers will likely pick up another task. I’ve seen people merge, and then forget that their changes are on staging, more times than I can count.
only merge code that is ready to go live
written sufficient tests and have validated our changes in development.
All branches are cut from main, and all changes get merged back into main.
If we ever have an issue in production, we always roll forward.
Feature flags can be enabled on a per-user basis so we can monitor performance and gather feedback
Experimental features can be enabled by users in their account settings.
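Per-user feature flags, combined with user opt-in for experimental features from account settings, can be modelled with a small lookup. This in-memory sketch is hypothetical: the class and method names are invented for illustration, and it is not Squeaky's implementation. A real system would persist flags in a database or flag service.

```python
class FeatureFlags:
    """Hypothetical in-memory flags: team-controlled per-user rollout plus
    user opt-in for experimental features from account settings."""

    def __init__(self, experimental: set[str]) -> None:
        self.experimental = set(experimental)     # flags users may enable themselves
        self._rollout: dict[str, set[str]] = {}   # flag -> user ids enabled by the team
        self._opted_in: dict[str, set[str]] = {}  # user id -> flags chosen in settings

    def enable_for_user(self, flag: str, user_id: str) -> None:
        """Turn a flag on for one user, e.g. to monitor performance and gather feedback."""
        self._rollout.setdefault(flag, set()).add(user_id)

    def opt_in(self, user_id: str, flag: str) -> None:
        """A user enables an experimental feature in their account settings."""
        if flag not in self.experimental:
            raise ValueError(f"{flag!r} is not an experimental feature")
        self._opted_in.setdefault(user_id, set()).add(flag)

    def is_enabled(self, flag: str, user_id: str) -> bool:
        return (user_id in self._rollout.get(flag, set())
                or flag in self._opted_in.get(user_id, set()))
```

Because code behind a disabled flag is merged into main but never executed for most users, unfinished work can ship continuously without a staging environment.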
we have monitoring, logging, and alarms around all of our services. We also blue/green deploy, by draining and replacing a percentage of containers.
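Draining and replacing a percentage of containers at a time can be sketched as a batched loop. The article calls this blue/green; the sketch simply models the drain-and-replace step it describes. The callbacks `drain`, `start_replacement` and `healthy` are hypothetical hooks for whatever orchestrator is in use. The rollout halts as soon as a batch fails its health check, which is where the monitoring and alarms mentioned above come in.

```python
import math
from typing import Callable


def replacement_batches(containers: list[str], percent: int) -> list[list[str]]:
    """Split containers into batches of roughly `percent` percent each (at least one)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    size = max(1, math.ceil(len(containers) * percent / 100))
    return [containers[i:i + size] for i in range(0, len(containers), size)]


def roll_out(
    containers: list[str],
    percent: int,
    drain: Callable[[str], None],
    start_replacement: Callable[[str], str],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Replace containers batch by batch; halt if any new container is unhealthy."""
    replaced: list[str] = []
    for batch in replacement_batches(containers, percent):
        for old in batch:
            drain(old)  # stop routing new traffic to the old container
        new_batch = [start_replacement(old) for old in batch]
        unhealthy = [c for c in new_batch if not healthy(c)]
        if unhealthy:
            raise RuntimeError(f"unhealthy replacements {unhealthy}; halting rollout")
        replaced.extend(new_batch)
    return replaced
```

With 10 containers and `percent=25`, the batches have sizes 3, 3, 3 and 1, so at most a quarter of capacity (rounded up) is being replaced at any moment.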
Dropping your staging environment in favour of true continuous integration and deployment can create a different mindset for shipping software.

Configuring a cgroup driver | Kubernetes - 0 views

kubernetes.io/...configure-cgroup-driver

kubernetes docker devops

shared by 張旭 on 25 Jan 23

the systemd driver is recommended for kubeadm based setups instead of the cgroupfs driver, because kubeadm manages the kubelet as a systemd service.

Installing Addons | Kubernetes - 0 views

kubernetes.io/...addons

kubernetes network networking

shared by 張旭 on 25 Jan 23

Calico is a networking and network policy provider. Calico supports a flexible set of networking options so you can choose the most efficient option for your situation, including non-overlay and overlay networks, with or without BGP. Calico uses the same engine to enforce network policy for hosts, pods, and (if using Istio & Envoy) applications at the service mesh layer.
Cilium is a networking, observability, and security solution with an eBPF-based data plane. Cilium provides a simple flat Layer 3 network with the ability to span multiple clusters in either a native routing or overlay/encapsulation mode, and can enforce network policies on L3-L7 using an identity-based security model that is decoupled from network addressing. Cilium can act as a replacement for kube-proxy; it also offers additional, opt-in observability and security features.
CoreDNS is a flexible, extensible DNS server which can be installed as the in-cluster DNS for pods.
The node problem detector runs on Linux nodes and reports system issues as either Events or Node conditions.

chaifeng/ufw-docker: To fix the Docker and UFW security flaw without disabling iptables - 0 views

github.com/ufw-docker

iptables docker

shared by 張旭 on 13 Jul 22

It requires disabling Docker's iptables function first, but this also means giving up Docker's network management function.
As a result, containers cannot access the external network.
such as -A POSTROUTING ! -o docker0 -s 172.17.0.0/16 -j MASQUERADE. But this only allows containers on the 172.17.0.0/16 network to access the outside.
There is no need to disable Docker's iptables; let Docker manage its network.
The public network cannot access ports published by Docker.
Public networks can be conveniently allowed or denied access to container ports, without additional software or extra configuration.
Enable Docker's iptables feature. Remove all changes such as --iptables=false, including those in the configuration file /etc/docker/daemon.json.
Modify the UFW configuration file /etc/ufw/after.rules
For unknown reasons, the UFW rules sometimes do not take effect after restarting UFW; in that case, reboot the server.
If we publish a port with the option -p 8080:80, we should use the container port 80, not the host port 8080.
allow the private networks to be able to visit each other.
The following rules block connection requests initiated by all public networks, but allow internal networks to access external networks.
Since the UDP protocol is stateless, it is not possible to block the handshake that initiates a connection request, as can be done with TCP.
On GNU/Linux, the local port range is in the file /proc/sys/net/ipv4/ip_local_port_range. The default range is 32768 to 60999.
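The filtering behaviour described above can be summarized as a small decision function. This is a simplified model for reasoning about the rules, not the iptables rules that ufw-docker actually installs. The private ranges are assumed to be the RFC 1918 blocks. `routed_ports` is a hypothetical stand-in for explicit UFW allow rules, and it is matched on the container port, as noted above.

```python
import ipaddress

# Assumed private networks (RFC 1918); containers are expected to live in one of them.
PRIVATE_NETS = [
    ipaddress.ip_network(net) for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def parse_port_range(text: str) -> tuple[int, int]:
    """Parse the contents of /proc/sys/net/ipv4/ip_local_port_range, e.g. '32768\\t60999'."""
    low, high = (int(part) for part in text.split())
    return low, high


def is_private(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)


def forward_allowed(
    src: str,
    dst: str,
    proto: str,
    dst_port: int,
    *,
    new_tcp_connection: bool = False,
    routed_ports: frozenset[tuple[str, int]] = frozenset(),
    ephemeral_low: int = 32768,
) -> bool:
    """Simplified model: may a forwarded packet from src reach container dst?"""
    if is_private(src):
        return True  # private networks may visit each other
    if (proto, dst_port) in routed_ports:
        return True  # explicitly allowed; dst_port is the container port, not the host port
    if not is_private(dst):
        return True  # the model only guards private (container) destinations
    if proto == "tcp":
        return not new_tcp_connection  # drop connection requests, allow replies
    if proto == "udp":
        # UDP has no handshake, so only allow traffic to the local (ephemeral)
        # port range, where replies to container-initiated requests arrive.
        return dst_port >= ephemeral_low
    return True
```

On a real host, `ephemeral_low` could come from `parse_port_range(open("/proc/sys/net/ipv4/ip_local_port_range").read())[0]`.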
It not only exposes ports of containers but also exposes ports of the host.
Services running on the host and in containers cannot be exposed at the same time with the same command.

Some Principles I Follow in System Architecture | 酷壳 - CoolShell - 0 views

coolshell.cn/21672.html

system architecture

shared by 張旭 on 03 Jan 22

If you don't talk about the benefits and just pursue technology for technology's sake, it is meaningless.
prepare corresponding solutions for both planned and unplanned downtime
frequent, recurring human errors
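Operations is further divided into infrastructure operations and application operations, while development is divided into core infrastructure development and business development.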
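People in infrastructure operations and development mostly focus on resource utilization and performance, while application operations and business development focus more on applications and services.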
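For some systems it is no longer clear whether they belong to the infrastructure layer or the application layer. Service governance, for example, involves low-level infrastructure technology but also needs cooperation from business developers. Kubernetes is the same: it includes low-level technology such as networking, but also health checks such as readiness and liveness probes that need the business side's cooperation, plus the ConfigMaps that business applications need, and so on.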
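Consider optimizing city traffic: once a city reaches a certain size, you cannot improve overall performance by optimizing a few roads or blocks; you need to plan the whole city as a functional whole to raise overall efficiency.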
As systems become more and more complex, there are more and more cases of users migrating their PHP, Python, .NET, or Node.js architectures entirely to a Java + Go architecture.
more industrial-grade technology
Use a more mature, more industrial-grade technology stack rather than the one you happen to be familiar with.
Don't reinvent the wheel, and even less should you make heavy custom modifications to existing ones.
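It is completely unnecessary. The reason not to reinvent the wheel or heavily modify existing software is not that you lack the technical ability, but that the era of doing everything yourself is long gone.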
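Many companies' architectures have been hijacked by the personal preferences, strengths, and experience of their technical leads; their technology choices are not made from an objective standpoint at all.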
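All of China's e-commerce platforms, hundreds of banks, the three major telecom carriers, all insurance companies, securities brokerage systems, hospital systems, e-government systems, and so on are basically developed in Java, and Java is also the mainstream language at AWS.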
NoSQL databases all perform too poorly on joins
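To avoid joins, people start duplicating data, but then fail to manage the data-consistency problems that the duplication brings, which leads to all kinds of data corruption and loss.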
Always use a relational database with full ACID support.
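The advice to prefer joins over duplicated data and to rely on ACID transactions can be illustrated with Python's built-in sqlite3 module. It is chosen here only because it ships with Python, and the tables and the transfer scenario are hypothetical. User names live in one table and are joined in rather than copied. A CHECK constraint makes an overdrawing update fail, and the transaction then rolls back the earlier update, so no half-applied change is left behind.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Normalized schema: user names live only in `users`, never copied into `accounts`."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE accounts (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO accounts VALUES (1, 1, 100), (2, 2, 50);
        """
    )
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Move money atomically: `with conn` commits on success and rolls back on error."""
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        # Violates the CHECK constraint if src would go negative, undoing the line above.
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))


def balances_by_name(conn: sqlite3.Connection) -> dict[str, int]:
    """Join instead of duplicating names into the accounts table."""
    rows = conn.execute(
        "SELECT u.name, a.balance FROM accounts AS a JOIN users AS u ON u.id = a.user_id"
    )
    return dict(rows.fetchall())
```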
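Performance problems always have solutions, and there are more techniques for them than for anything else; compared with architectural completeness and extensibility, they really are not worth worrying about too much.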
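Many companies' systems neither follow industry standards nor have formed company-wide standards of their own; they feel like a disorganized mob.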
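The most typical example is HTTP status codes. The industry standard says 200 means success, 3xx means redirection, 4xx means a client-side error, and 5xx means a server-side error. I really don't understand why everyone likes to return 200 for both success and failure and then indicate in the body whether there was an error.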
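The status-code convention above can be applied at a single point in a service, so that failures are reported in the status line and not only in the body. In this hypothetical sketch, `BadRequest` and `NotFound` are invented exception types and `handle` is not tied to any web framework. `http.HTTPStatus` is from the Python standard library.

```python
from http import HTTPStatus


class BadRequest(Exception):
    """Hypothetical: the caller sent invalid input (maps to 4xx)."""


class NotFound(Exception):
    """Hypothetical: the requested resource does not exist (maps to 4xx)."""


def status_class(code: int) -> str:
    """Classify a status code by its first digit, following the HTTP convention."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")


def handle(action):
    """Run `action` and report the outcome in the status code, not only in the body."""
    try:
        return HTTPStatus.OK, {"data": action()}
    except BadRequest as exc:
        return HTTPStatus.BAD_REQUEST, {"error": str(exc)}
    except NotFound as exc:
        return HTTPStatus.NOT_FOUND, {"error": str(exc)}
    except Exception:
        # Unexpected failure on our side: 5xx, without leaking internals.
        return HTTPStatus.INTERNAL_SERVER_ERROR, {"error": "internal error"}
```

Clients, proxies and monitoring can then count 4xx and 5xx responses directly instead of parsing every body.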
RESTful API guidelines. I think these are very important; here are two references that I consider the best written: PayPal and Microsoft.
A monitoring system should rather die itself than interfere with the actual application.
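On the application side, the rule that monitoring must never interfere can be honored with a best-effort metrics buffer. Recording a sample never blocks and never raises into the caller; when the buffer is full, samples are dropped. This is a hypothetical sketch. A real agent would ship drained samples to a monitoring backend from a background thread.

```python
import queue


class MetricsBuffer:
    """Best-effort metrics sink: recording never blocks and never raises into the app."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # approximate count of discarded samples

    def record(self, name: str, value: float) -> None:
        try:
            self._queue.put_nowait((name, value))
        except Exception:
            # Full queue or any other failure: drop the sample instead of
            # slowing down or crashing the application.
            self.dropped += 1

    def drain(self) -> list:
        """Take everything currently buffered (e.g. from a background shipper)."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items
```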
A company should review its software version upgrades at least once a year, so that software versions become unified and consistent.
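Architecture and software are not finished once written; they need continuous modification and maintenance, and 80% of software cost goes into maintenance.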
Use service discovery or a service gateway to reduce the operational complexity that service dependencies bring.
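Service discovery lowers dependency complexity because callers depend on a service name rather than on the addresses of individual instances. The hypothetical in-memory registry below shows the idea with round-robin resolution. Production systems use a dedicated registry or a gateway, with health checks and TTLs that this sketch leaves out.

```python
from collections import defaultdict


class ServiceRegistry:
    """Hypothetical in-memory registry: callers resolve a service name,
    not the addresses of individual instances."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = defaultdict(list)
        self._next: dict[str, int] = {}

    def register(self, service: str, address: str) -> None:
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service: str, address: str) -> None:
        if address in self._instances.get(service, []):
            self._instances[service].remove(address)

    def resolve(self, service: str) -> str:
        """Return the next instance address, round-robin."""
        instances = self._instances.get(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        index = self._next.get(service, 0) % len(instances)
        self._next[service] = index + 1
        return instances[index]
```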
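Always apply software design principles: for example, principles such as SOLID (see "Some Software Design Principles"), IoC/DIP, architectural best practices such as SOA or Spring Cloud (see the Service Interface rules in "SteveY's Rant about the Amazon and Google Platforms"), and distributed-system architecture practices (see "Transaction Processing in Distributed Systems", or Microsoft's "Cloud Design Patterns"), and so on.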
no automated tests, no good software documentation, no good-quality code, no standards or conventions
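Technical debt incurred in the past must all be repaid: foundations that were poorly laid must be relaid, and supporting facilities that were never built must be built. If this infrastructure is not built in a correct, scientific way, you cannot possibly have a good system.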
Rather than spending great effort accommodating technical debt, just pay it off directly.
Build a debt-free "new district", and use the "anti-corruption layer" architectural pattern to keep technical debt from invading it.
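The anti-corruption layer mentioned above is a translation boundary between the legacy system and the new code. In this hypothetical sketch, the legacy field names (`CUST_NO`, `EML_ADDR`, `ACT_FLG`, `BAL`) are invented for illustration. Only the translator knows them, so the new district works purely with the clean `Customer` type.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Customer:
    """Clean domain model used by the new code (the "new district")."""
    customer_id: int
    email: str
    active: bool
    balance_cents: int


class LegacyCustomerTranslator:
    """Anti-corruption layer: the only code that knows the legacy record format.
    Field names such as CUST_NO are hypothetical."""

    def to_domain(self, row: dict[str, str]) -> Customer:
        return Customer(
            customer_id=int(row["CUST_NO"]),
            email=row["EML_ADDR"].strip().lower(),
            active=row["ACT_FLG"] == "Y",
            balance_cents=int(row["BAL"]),
        )


class CustomerRepository:
    """New-district entry point; legacy access is injected and translated at the boundary."""

    def __init__(self, fetch_legacy_row: Callable[[int], dict[str, str]]) -> None:
        self._fetch = fetch_legacy_row
        self._translator = LegacyCustomerTranslator()

    def get(self, customer_id: int) -> Customer:
        return self._translator.to_domain(self._fetch(customer_id))
```

When the legacy system is eventually replaced, only the translator and the injected fetch function change; code written against `Customer` stays untouched.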
If one day you start making technical decisions based only on your past experience, you can no longer grow.
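Before making any decision, it is best to spend a little time researching online: related material, technical blogs, articles, papers, and so on. Also look at how various companies and open-source projects do it. Then compare the pros and cons of several options, and finally reach your own decision.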
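As for the X-Y problem: a user wants to solve problem X, thinks Y can solve it, and so asks me how to do Y. In the end it turns out that for the original problem X, the best solution is not Y but Z.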
I like to keep asking why; this kind of questioning makes the customer rethink the problem along with me.
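Being aggressive does not mean messing around blindly or adopting every new technology you see; it means actively embracing the new technologies that will change the future.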
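Not liking something doesn't mean I won't learn it. I study blockchain and Rust just the same, and I know the strengths of these technologies, but I won't use them at scale.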
Progress always comes from exploration; exploration has a cost, but the rewards are greater.
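Not daring to take risks is the biggest risk, not daring to make mistakes is the biggest mistake, and fearing loss will make you lose even more.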
