Sean Busbey
2018-12-06 15:45:38 UTC
This week I've run into two cases where I needed the set of regions in
transition so I could recover them and I ran into what I think is a
gap in our operator tooling. I'm hoping folks will have some ideas
I've missed.
Depending on how this thread goes, I'll follow up on the
***@hbase list for implementing changes and documentation.
Case 1: HBase 1.2-ish RIT following RS crash
Cluster had a handful of region servers fail and for whatever reason a
few regions were stuck in transition. The operator I was helping
is already used to dealing with the occasional manual recovery. Their
normal process looks like this:
1) Go to the Master UI website
2) Scroll down to Regions in Transition list
3) Find a RIT in FAILED_CLOSE / FAILED_OPEN / PENDING_OPEN
4) Confirm in the RS logs that the RS associated with the above is now in
good health and doesn't expect to do anything with said region
5) Run "assign" in the hbase shell for the region
Unfortunately, the cluster's HDFS was under duress and so listing
snapshot information was super slow. This caused the Master UI website
to hang prior to displaying the RIT list.
We ended up looking at the master log file.
Case 2: HBase 2.1-ish RIT following cluster wide crash
AFAICT cluster had experienced a failure of all RS and masters. Upon
coming back up Master was left with ~10% of ~10K regions in a state of
PENDING_OPEN or OPENING all with a RS that had no idea it was involved
with those regions. I'm pretty sure this is a bug; I'm still triaging
it and I don't think it's relevant to the current question.
Once I confirmed the given RS was not currently doing anything for any
of those regions I figured I'd use HBCK2 to run assigns to get
things fixed. However, since there were like 900 RITs, the Master UI
was unusable for getting a complete list. Also with that many all in
the same state I want to be able to automate running against each of
them.
I ended up grepping the master log file and pulling out the WARN
messages about RIT to tease out the list of regions, then passed those
to hbck2.
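
A minimal sketch of that kind of log scraping, in Python. The line shape it
matches (a WARN line carrying `rit=<STATE>` and `region=<encoded name>`
fields) is an assumption, as are the names `RIT_PATTERN` and
`extract_rit_regions`; the real message format differs between HBase
versions, so the pattern would need adjusting to match the actual master log.

```python
import re

# Assumed line shape (hypothetical; verify against your own master log):
#   ... WARN ... rit=OPENING, location=..., table=..., region=<32 hex chars>
# Only 32-character hex encoded names are matched, so regions whose encoded
# name has a different form are skipped by this pattern.
RIT_PATTERN = re.compile(
    r"\bWARN\b.*?\brit=(?P<state>[A-Z_]+),.*?\bregion=(?P<region>[0-9a-f]{32})\b"
)


def extract_rit_regions(lines, states=("PENDING_OPEN", "OPENING")):
    """Return encoded region names from RIT warnings.

    Names are de-duplicated and kept in first-seen order. Only warnings
    whose state is in `states` are included.
    """
    seen = {}  # dict preserves insertion order, used here as an ordered set
    for line in lines:
        match = RIT_PATTERN.search(line)
        if match and match.group("state") in states:
            seen.setdefault(match.group("region"), None)
    return list(seen)
```

Passing an open master log file as `lines` returns one encoded region name
per stuck region in the chosen states, which can then be handed to whatever
runs the assigns.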
----
Am I missing some obvious place where I can use a CLI tool to get a
list of RIT? I don't see anything in the ref guide. I looked through
the help of HBCK 1 and the shell and couldn't find anything.
I think I can use Admin.getClusterStatus() and getClusterMetrics() to
get this info from the Java API. That means there's some way to get it
in the hbase shell, but it'll probably be ugly. If there's not already
an easier way I'll want to wrap that so it's a simple command.