Michał Podsiadłowski
2010-01-29 13:01:09 UTC
Hi all!
I'm in the middle of some performance and stability testing of our small
HBase cluster, to check whether it is suitable for our application.
We want to use it as the web-cache persistence layer for our web app, which
handles quite a large amount of traffic.
Of course, I have lots of problems with it.
The main one is that client applications (web servers) persisting or
retrieving rows fail miserably with exceptions like this:
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region
oldAppWebSingleRowCacheStore,filmMenuCuriosities-not_selected\xC2\xAC150,1264766907002
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)
could not retrieve persisted cache id 'filmRanking' for key '3872'
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 10.0.100.51:60020 for region
oldAppWebSingleRowCacheStore,filmRanking\xC2\xAC3746,1264766860498, row
'filmRanking\xC2\xAC3872', but failed after 2 attempts.
Exceptions:
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException:
oldAppWebSingleRowCacheStore,filmRanking\xC2\xAC3746,1264766860498
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2266)
at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1730)
This happens every time the first region starts to split. As far as I can
see, the table gets set to enabled *false* (in the web admin), the web admin
becomes a little less responsive, and listing the table's regions shows no
regions. After a while I can see 500 or more regions, and some of them, as
the exceptions show, are not fully available. HDFS doesn't seem to be the
main issue: when I run fsck it says the hbase dir is healthy apart from some
under-replicated blocks. Occasionally I saw that some blocks were missing,
but I think this was due to "Too many open files" exceptions (region size
was too small - now it's the default 64 MB).
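For anyone hitting the same "Too many open files" errors: the usual fix is to raise the open-file descriptor limit for the user running the DataNode/RegionServer daemons. A sketch of what I mean (the user names and the 32768 value are illustrative, not taken from my setup):

```shell
# Show the current per-process open-file limit for the logged-in user
ulimit -n

# To raise it persistently, add lines like these to /etc/security/limits.conf
# (example values; adjust user names to whatever runs your daemons):
#   hadoop  -  nofile  32768
#   hbase   -  nofile  32768
# then log the daemon user out and back in before restarting the daemons.
```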
The amount of data is not enormous - the problems start at around 1 GB in
fewer than 100k rows. The request rate is, I think, small - 20-30 per
second.
What else I can say is that I've set the maximum HBase retry count to only 2,
because we can't allow clients to wait any longer for a response.
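For reference, both the retry count and the pause between attempts are client-side settings in hbase-site.xml. The property names below are the standard ones; the values are just what I'm running with:

```xml
<!-- hbase-site.xml on the client (web server) side -->
<property>
  <name>hbase.client.retries.number</name>
  <value>2</value>      <!-- our cap; the shipped default is higher -->
</property>
<property>
  <name>hbase.client.pause</name>
  <value>1000</value>   <!-- ms to wait between retries -->
</property>
```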
What I would like to know is whether the table is always disabled while
region splits are being performed, and whether it is then truly disabled, so
that clients can't do anything. It looks like the status says disabled, yet
requests are still processed, though with mixed results (some like the
above).
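With hbase.client.retries.number capped at 2, one workaround I'm considering is a small application-level retry wrapper around the HTable calls, so that a NotServingRegionException thrown mid-split gets a couple of extra cheap attempts before we give up and serve a cache miss. This is only a sketch of my own; the class and parameter names are hypothetical, not anything from HBase:

```java
import java.util.concurrent.Callable;

// Hypothetical application-side retry helper (my own sketch, not an HBase
// API): retries an operation a bounded number of times with a fixed pause,
// so transient region-split errors don't immediately bubble up to the web
// tier even though the HBase client itself only retries twice.
public class BoundedRetry {
    public static <T> T call(Callable<T> op, int maxAttempts, long pauseMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();                 // the wrapped HTable call
            } catch (Exception e) {
                last = e;                         // e.g. NotServingRegionException
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMs);        // brief pause before retrying
                }
            }
        }
        throw last;                               // all attempts failed
    }
}
```

A get would then be wrapped as `BoundedRetry.call(() -> table.get(get), 3, 200)` (lambda shown for brevity; on a Java 6 cluster it would be an anonymous Callable).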
My cluster setup is probably useful to know:
3 CentOS virtual machines on Xen, each running a DataNode/RegionServer and
ZooKeeper, with one of them also acting as the master and secondary master.
Each has 2 GB of RAM. Currently the Hadoop processes run with -Xmx512m and
HBase with -Xmx256m, but none of them is swapping or running out of memory.
GC logs look normal - no stop-the-world pauses ;)
top says the CPUs are nearly idle on all machines.
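For completeness, those heap sizes are set through the usual env files; a sketch of the relevant lines (values are the ones from this test cluster):

```shell
# hadoop-env.sh excerpt
export HADOOP_HEAPSIZE=512   # MB, applies to the Hadoop daemons

# hbase-env.sh excerpt
export HBASE_HEAPSIZE=256    # MB, applies to the HBase daemons
```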
It's far from ideal, but we need to prove that this can work reliably before
we get more toys.
Maybe next week we'll be able to test on some better machines, but for now
that's all I've got.
Any advice is welcome.
Thanks,
Michal