Hbase fails at moderate load.

Discussion:

Michał Podsiadłowski

2010-01-29 13:01:09 UTC

Hi all!

I'm in the middle of some performance and stability testing of out small
hbase cluster to check if it suitable for out application.
We want to use it as web cache persistence layer for out web app which
handles quite large amount of traffic.
Of course i have lot's of problems with it.

Main one is that client applications (web servers) can persist of retrieve
rows and fail miserably with exceptions like this:

org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region
oldAppWebSingleRowCacheStore,filmMenuCuriosities-not_selected\xC2\xAC150,1264766907002
at
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)

could not retrieve persisted cache id 'filmRanking' for key '3872'
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 10.0.100.51:60020 for region
oldAppWebSingleRowCacheStore,filmRanking\xC2\xAC3746,1264766860498, row
'filmRanking\xC2\xAC3872', but failed after 2 attempts.
Exceptions:
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException:
oldAppWebSingleRowCacheStore,filmRankingï¿œ3746,1264766860498
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2266)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1730)

This happens every time when first region starts to split. As far as i can
see table is set to enabled *false* (web admin), web admin becomes little
bit less responsible - listing table regions shows no regions.
and after a while i can see 500 or more regions. Some of them as exception
shows are not fully available. HDFS doesn't seems to be the main issue. When
i run fsck it says hbase dir is healthy apart from some under replicated
blocks. Occasionaly i saw that some blocks where missing but i think this
was due to "Too many files open" exceptions (to small regions size - now
it's default 64)
Amount of data is not enormous - around 1gb in less then 100k rows then this
problems starts to occur. Request per seconds is i think small - 20-30 per
second.
What else i can say is I've set the max hbase retry to only 2 because we
can't allow clients to wait more for response.

What i would like to know is whether the table is always disabled when
performing region splits? And is it truly disabled then so that clients
can't do anything?
It looks like status says disabled but still requests are processed, though,
with different results (some like above).

My cluster setup can be probably useful -
3 centos virtual machines based on xen running DN/HR and zookeeper + one of
them NodeMaster and Secondary Master.
2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and hbase
with 256 but non of them is swapping nor going out of memory.
GC logs looks normal - stop the world is not occurring ;)
top says cpus are nearly idle on all machines.

It's far from ideal but we need to prove that this can work reliably to get
more toys.
Maybe next week we will be able to test on some better machines but for now
that all what I've got.

Any advices are welcome.

Thanks,
Michal

Tim Robertson

2010-01-29 13:08:27 UTC

Permalink

Hi Michal,

[Disclaimer: I am not well experienced in HBase]

Those seem like very low memory allocations for HBase from what I've
seen / observed on this list. I was told to not consider less than 8G
for those demons. It could be that you need to increase all the lease
times to allow for the split to happen.

Just an idea, but if you needed to, you could consider Amazon EC2 with
XLarge instances for a very small amount to prove the concept.

Cheers,
Tim

Post by MichaÅ PodsiadÅowski
Hi all!
I'm in the middle of some performance and stability testing of out small
hbase cluster to check if it suitable for out application.
We want to use it as web cache persistence layer for out web app which
handles quite large amount of traffic.
Of course i have lot's of problems with it.
Main one is that client applications (web servers) can persist of retrieve
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region
oldAppWebSingleRowCacheStore,filmMenuCuriosities-not_selected\xC2\xAC150,1264766907002
at
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)
could not retrieve persisted cache id 'filmRanking' for key '3872'
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 10.0.100.51:60020 for region
oldAppWebSingleRowCacheStore,filmRanking\xC2\xAC3746,1264766860498, row
'filmRanking\xC2\xAC3872', but failed after 2 attempts.
oldAppWebSingleRowCacheStore,filmRanking�3746,1264766860498
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2266)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1730)
This happens every time when first region starts to split. As far as i can
see table is set to enabled *false* (web admin), web admin becomes little
bit less responsible - listing table regions shows no regions.
and after a while i can see 500 or more regions. Some of them as exception
shows are not fully available. HDFS doesn't seems to be the main issue. When
i run fsck it says hbase dir is healthy apart from some under replicated
blocks. Occasionaly i saw that some blocks where missing but i think this
was due to "Too many files open" exceptions (to small regions size - now
it's default 64)
Amount of data is not enormous - around 1gb in less then 100k rows then this
problems starts to occur. Request per seconds is i think small - 20-30 per
second.
What else i can say is I've set the max hbase retry to only 2 because we
can't allow clients to wait more for response.
What i would like to know is whether the table is always disabled when
performing region splits? And is it truly disabled then so that clients
can't do anything?
It looks like status says disabled but still requests are processed, though,
with different results (some like above).
My cluster setup can be probably useful -
3 centos virtual machines based on xen running DN/HR and zookeeper + one of
them NodeMaster and Secondary Master.
2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and hbase
with 256 but non of them is swapping nor going out of memory.
GC logs looks normal - stop the world is not occurring ;)
top says cpus are nearly idle on all machines.
It's far from ideal but we need to prove that this can work reliably to get
more toys.
Maybe next week we will be able to test on some better machines but for now
that all what I've got.
Any advices are welcome.
Thanks,
Michal

Stack

2010-01-31 00:10:27 UTC

Permalink

What Tim said and then some comments in the below.

What version of hbase?

Post by MichaÅ PodsiadÅowski
This happens every time when first region starts to split. As far as i can
see table is set to enabled *false* (web admin), web admin becomes little
bit less responsible - listing table regions shows no regions.
and after a while i can see 500 or more regions.

You go from zero to 500 regions with nothing showing in between?
Thats pretty impressive. 500 regions in 256M on 3 servers is probably
pushing it

Some of them as exception

Post by MichaÅ PodsiadÅowski
shows are not fully available.

Identify the duff regions by running a full table scan in the shell
with DEBUG enabled on the client. It'll puke when it hits the first
broke region

HDFS doesn't seems to be the main issue. When

Post by MichaÅ PodsiadÅowski
i run fsck it says hbase dir is healthy apart from some under replicated
blocks. Occasionaly i saw that some blocks where missing but i think this
was due to "Too many files open" exceptions (to small regions size - now
it's default 64)

Too many open files is bad. Check out the hbase 'Getting Started'.

Post by MichaÅ PodsiadÅowski
Amount of data is not enormous - around 1gb in less then 100k rows then this
problems starts to occur. Request per seconds is i think small - 20-30 per
second.
What else i can say is I've set the max hbase retry to only 2 because we
can't allow clients to wait more for response.

I would suggest you leave things at default till running smooth then
start in optimizing.

Post by MichaÅ PodsiadÅowski
What i would like to know is whether the table is always disabled when
performing region splits?

No. Region goes offline for some period of time. If machines are
heavily loaded it will take longer for it to come back on line again.

And is it truly disabled then so that clients

Post by MichaÅ PodsiadÅowski
can't do anything?
It looks like status says disabled but still requests are processed, though,
with different results (some like above).

Disabled or 'offline'? Parents of region splits go offline and are
replaced by new daughter splits.

Post by MichaÅ PodsiadÅowski
My cluster setup can be probably useful -
3 centos virtual machines based on xen running DN/HR and zookeeper + one of
them NodeMaster and Secondary Master.
2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and hbase
with 256 but non of them is swapping nor going out of memory.
GC logs looks normal - stop the world is not occurring ;)

Really? No full GCs though only 256 and though about 100 plus regions
per server?

Post by MichaÅ PodsiadÅowski
top says cpus are nearly idle on all machines.
It's far from ideal but we need to prove that this can work reliably to get
more toys.
Maybe next week we will be able to test on some better machines but for now
that all what I've got.

Makes sense. You are starting very small though and virtual machines
have proven a flakey foundation for hbase. Read back over the list
and look for ec2 mentions.

St.Ack

Post by MichaÅ PodsiadÅowski
Any advices are welcome.
Thanks,
Michal

Michał Podsiadłowski

2010-02-01 15:59:44 UTC

Permalink

Hi Stack,
thanks for your last input.

I've started new week with few tweaks of environment. I've put down one of
the web servers so i gained additional node.
I've put there HMaster, both namenodes and zookeeper and requested from our
IT stuff some additional memory to rest of nodes.

Now setup is like this:
Namenode + Secondary Namenode + HMaster @ 1GB + zookeeper @256MB - machine
with 4gb
3 x datanodes/hregions - DataNode @768Mb + HRegion @1GB - machines 2GB
of ram
2 additional zookeepers @256MB on webservers that are uploading to hbase.

Probably more memory for OS cache/buffors on datanodes would be useful but
free -m after quite long upload says:
* total used free shared buffers cached
Mem: 2048 903 1144 0 37 362
-/+ buffers/cache: 503 1544
Swap: 1019 0 1019

*All is based on hadoop 0.20.2 and hbase 0.20.3.

All seems to be much more stable.
Too many open files is no longer a problem (max file size - 16mb was wrong
idea).
But still problem with dividing very first region occured.
For around 1 minute regions were dividing and dividing till they reach total
count around 130.
During that time in .META. some regions were not assigned to servers ( exp.
no address for region in .META.).
But I think i haven't seen problems with hitting wrong regions or not
serving regions.
This is something that really freaks us out, because potential this can
happen every region split
and then whole application can go bananas.
Can someone explain why regions are dividing so rapidly and to such a
quantity?

http://pastebin.com/m73276a36 - here you can find a piece of log from that
moment

Cheers,
Michal

*

*

Post by Stack
What Tim said and then some comments in the below.
What version of hbase?

Post by MichaÅ PodsiadÅowski
This happens every time when first region starts to split. As far as i

can

Post by MichaÅ PodsiadÅowski
see table is set to enabled *false* (web admin), web admin becomes little
bit less responsible - listing table regions shows no regions.
and after a while i can see 500 or more regions.

You go from zero to 500 regions with nothing showing in between?
Thats pretty impressive. 500 regions in 256M on 3 servers is probably
pushing it
Some of them as exception

Post by MichaÅ PodsiadÅowski
shows are not fully available.

Identify the duff regions by running a full table scan in the shell
with DEBUG enabled on the client. It'll puke when it hits the first
broke region
HDFS doesn't seems to be the main issue. When

Too many open files is bad. Check out the hbase 'Getting Started'.

Post by MichaÅ PodsiadÅowski
Amount of data is not enormous - around 1gb in less then 100k rows then

this

Post by MichaÅ PodsiadÅowski
problems starts to occur. Request per seconds is i think small - 20-30

per

Post by MichaÅ PodsiadÅowski
second.
What else i can say is I've set the max hbase retry to only 2 because we
can't allow clients to wait more for response.

I would suggest you leave things at default till running smooth then
start in optimizing.

Post by MichaÅ PodsiadÅowski
What i would like to know is whether the table is always disabled when
performing region splits?

No. Region goes offline for some period of time. If machines are
heavily loaded it will take longer for it to come back on line again.
And is it truly disabled then so that clients

Post by MichaÅ PodsiadÅowski
can't do anything?
It looks like status says disabled but still requests are processed,

though,

Post by MichaÅ PodsiadÅowski
with different results (some like above).

Disabled or 'offline'? Parents of region splits go offline and are
replaced by new daughter splits.

Post by MichaÅ PodsiadÅowski
My cluster setup can be probably useful -
3 centos virtual machines based on xen running DN/HR and zookeeper + one

Post by MichaÅ PodsiadÅowski
them NodeMaster and Secondary Master.
2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and

hbase

Post by MichaÅ PodsiadÅowski
with 256 but non of them is swapping nor going out of memory.
GC logs looks normal - stop the world is not occurring ;)

Really? No full GCs though only 256 and though about 100 plus regions
per server?

Post by MichaÅ PodsiadÅowski
top says cpus are nearly idle on all machines.
It's far from ideal but we need to prove that this can work reliably to

get

Post by MichaÅ PodsiadÅowski
more toys.
Maybe next week we will be able to test on some better machines but for

now

Post by MichaÅ PodsiadÅowski
that all what I've got.

Makes sense. You are starting very small though and virtual machines
have proven a flakey foundation for hbase. Read back over the list
and look for ec2 mentions.
St.Ack

Post by MichaÅ PodsiadÅowski
Any advices are welcome.
Thanks,
Michal

Jean-Daniel Cryans

2010-02-01 18:11:14 UTC

Permalink

2010-02-01 14:05:49,445 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
META updated, and report to master all successful. Old region=REGION
=> {NAME => 'oldWebSingleRowCacheStore,,1265029544146', STARTKEY =>
'', ENDKEY => 'filmMenuEditions-not_selected\xC2\xAC1405', ENCODED =>
1899385768, OFFLINE => true, SPLIT => true, TABLE => {{NAME =>
'oldWebSingleRowCacheStore', MAX_FILESIZE => '64', FAMILIES => [{NAME
=> 'content', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}, {NAME => 'description', COMPRESSION => 'NONE', VERSIONS =>
'3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}}, new regions:
oldWebSingleRowCacheStore,,1265029549167,
oldWebSingleRowCacheStore,filmLastTopics\xC2\xAC1155,1265029549167.
Split took 0sec

I see MAX_FILESIZE => '64' which means you have set that table to
split after 64 _bytes_ so either use the default value of 256MB
(256*1024*1024) or even higher if you wish (I set usually set them to
1GB).

J-D

Post by MichaÅ PodsiadÅowski
Hi Stack,
thanks for your last input.
I've started new week with few tweaks of environment. I've put down one of
the web servers so i gained additional node.
I've put there HMaster, both namenodes and zookeeper and requested from our
IT stuff some additional memory to rest of nodes.
with 4gb
of ram
Probably more memory for OS cache/buffors on datanodes would be useful but
* total used free shared buffers cached
Mem: 2048 903 1144 0 37 362
-/+ buffers/cache: 503 1544
Swap: 1019 0 1019
*All is based on hadoop 0.20.2 and hbase 0.20.3.
All seems to be much more stable.
Too many open files is no longer a problem (max file size - 16mb was wrong
idea).
But still problem with dividing very first region occured.
For around 1 minute regions were dividing and dividing till they reach total
count around 130.
During that time in .META. some regions were not assigned to servers ( exp.
no address for region in .META.).
But I think i haven't seen problems with hitting wrong regions or not
serving regions.
This is something that really freaks us out, because potential this can
happen every region split
and then whole application can go bananas.
Can someone explain why regions are dividing so rapidly and to such a
quantity?
http://pastebin.com/m73276a36 - here you can find a piece of log from that
moment
Cheers,
Michal
*
*

Post by Stack
What Tim said and then some comments in the below.
What version of hbase?

Post by MichaÅ PodsiadÅowski
This happens every time when first region starts to split. As far as i

can

You go from zero to 500 regions with nothing showing in between?
Thats pretty impressive. 500 regions in 256M on 3 servers is probably
pushing it
Some of them as exception

Post by MichaÅ PodsiadÅowski
shows are not fully available.

Identify the duff regions by running a full table scan in the shell
with DEBUG enabled on the client. It'll puke when it hits the first
broke region
HDFS doesn't seems to be the main issue. When

Too many open files is bad. Check out the hbase 'Getting Started'.

Post by MichaÅ PodsiadÅowski
Amount of data is not enormous - around 1gb in less then 100k rows then

this

Post by MichaÅ PodsiadÅowski
problems starts to occur. Request per seconds is i think small - 20-30

per

Post by MichaÅ PodsiadÅowski
second.
What else i can say is I've set the max hbase retry to only 2 because we
can't allow clients to wait more for response.

I would suggest you leave things at default till running smooth then
start in optimizing.

Post by MichaÅ PodsiadÅowski
What i would like to know is whether the table is always disabled when
performing region splits?

No. Region goes offline for some period of time. If machines are
heavily loaded it will take longer for it to come back on line again.
And is it truly disabled then so that clients

Post by MichaÅ PodsiadÅowski
can't do anything?
It looks like status says disabled but still requests are processed,

though,

Post by MichaÅ PodsiadÅowski
with different results (some like above).

Disabled or 'offline'? Parents of region splits go offline and are
replaced by new daughter splits.

Post by MichaÅ PodsiadÅowski
My cluster setup can be probably useful -
3 centos virtual machines based on xen running DN/HR and zookeeper + one

Post by MichaÅ PodsiadÅowski
them NodeMaster and Secondary Master.
2 gigs of ram on each. Currently hadoop processes run with Xmx 512 and

hbase

Post by MichaÅ PodsiadÅowski
with 256 but non of them is swapping nor going out of memory.
GC logs looks normal - stop the world is not occurring ;)

Really? No full GCs though only 256 and though about 100 plus regions
per server?

Post by MichaÅ PodsiadÅowski
top says cpus are nearly idle on all machines.
It's far from ideal but we need to prove that this can work reliably to

get

Post by MichaÅ PodsiadÅowski
more toys.
Maybe next week we will be able to test on some better machines but for

now

Post by MichaÅ PodsiadÅowski
that all what I've got.

Makes sense. You are starting very small though and virtual machines
have proven a flakey foundation for hbase. Read back over the list
and look for ec2 mentions.
St.Ack

Post by MichaÅ PodsiadÅowski
Any advices are welcome.
Thanks,
Michal

Michał Podsiadłowski

2010-02-01 18:54:03 UTC

Permalink

But if I get 130 tables out of 64 megs how many i will get after splinting
1gig? Can you tell me what triggers further splits. First one is triggered
by exceeding "split at" size what is expect but after this i should get 2 x
32 megs. And then after one of them will grow up again above max filesize
limit it should be splited. Am i right?

Post by Jean-Daniel Cryans
2010-02-01 14:05:49,445 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
META updated, and report to master all successful. Old region=REGION
=> {NAME => 'oldWebSingleRowCacheStore,,1265029544146', STARTKEY =>
'', ENDKEY => 'filmMenuEditions-not_selected\xC2\xAC1405', ENCODED =>
1899385768, OFFLINE => true, SPLIT => true, TABLE => {{NAME =>
'oldWebSingleRowCacheStore', MAX_FILESIZE => '64', FAMILIES => [{NAME
=> 'content', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}, {NAME => 'description', COMPRESSION => 'NONE', VERSIONS =>
'3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
oldWebSingleRowCacheStore,,1265029549167,
oldWebSingleRowCacheStore,filmLastTopics\xC2\xAC1155,1265029549167.
Split took 0sec
I see MAX_FILESIZE => '64' which means you have set that table to
split after 64 _bytes_ so either use the default value of 256MB
(256*1024*1024) or even higher if you wish (I set usually set them to
1GB).

Jean-Daniel Cryans

2010-02-01 18:58:51 UTC

Permalink

But if I get 130 tables out of 64 megs how many i will get after splinting
1gig? Can you tell me what triggers further splits.

Correction, you set it to 64 bytes, not megabytes. That's 1024*1024
times smaller! The splits happen when one of the families is bigger
than MAX_FILESIZE.

First one is triggered by exceeding "split at" size what is expect but after this i should get 2 x
32 megs. And then after one of them will grow up again above max filesize
limit it should be splited. Am i right?

Well we don't split exactly in the middle, we can't do that since we
could separate a row in 2. We do a best effort to split in the middle
of the row keys.

J-D

Michał Podsiadłowski

2010-02-02 12:29:50 UTC

Permalink

I swear I've seen there "in megabytes" or maybe it was somewhere else.. -
of course after setting the correct value all seems to work like a charm
Again RTFM style problem, great thanks for help.

Cheers,
Michal

Jean-Daniel Cryans

2010-02-02 18:09:09 UTC

Permalink

Glad to hear it's working for you!

J-D

Michał Podsiadłowski

2010-02-03 16:14:39 UTC

Permalink

Hi,
it's me again having problem - hope this is not another misconfiguration
problem ( or maybe it would be better it it was one).
After loading some moderate amount of data - around 3GB some rows are not
available due to strange exceptions

java.io.IOException: java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831

When trying to scan the table regions server pukes like this

2010-02-03 16:03:39,060 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 7 on 60020, call next(2813423168169765496, 1) from 10.0.100.50:41364:
error: java.io.IOException: java.lang.RuntimeException: java.io.IOException:
Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831
java.io.IOException: java.lang.RuntimeException: java.io.IOException: Cannot
open filename /hbase/filmContributors/1670715971/content/3783592739034234831
at
org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:872)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:862)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1918)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
Caused by: java.lang.RuntimeException: java.io.IOException: Cannot open
filename /hbase/filmContributors/1670715971/content/3783592739034234831
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:61)
at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:79)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:164)
at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:106)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.nextInternal(HRegion.java:1807)
at
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1771)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1894)
... 5 more
Caused by: java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1474)
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1800)
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
at java.io.DataInputStream.read(DataInputStream.java:132)
at
org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:99)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
at
org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1020)
at
org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:971)
at
org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1163)
at
org.apache.hadoop.hbase.io.HalfHFileReader$1.next(HalfHFileReader.java:125)
at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:58)
... 11 more

greping regionserver log for dir name 1670715971 shows this
2010-02-03 15:32:37,082 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/content/3783592739034234831.1670715971,
isReference=true, sequence id=7541774, length=33390929,
majorCompaction=false
2010-02-03 15:32:37,088 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/content/6518523095287027530.1670715971,
isReference=true, sequence id=7542003, length=7890, majorCompaction=false
2010-02-03 15:32:37,095 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/description/2305635563712489918.1670715971,
isReference=true, sequence id=7542003, length=2256, majorCompaction=false
2010-02-03 15:32:37,101 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/314440477/description/6970032752270852156.1670715971,
isReference=true, sequence id=7541774, length=6664268, majorCompaction=false
2010-02-03 15:32:37,129 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/content/3783592739034234831.1670715971,
isReference=true, sequence id=7541773, length=33390929,
majorCompaction=false
2010-02-03 15:32:37,152 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/content/6518523095287027530.1670715971,
isReference=true, sequence id=7542002, length=7890, majorCompaction=false
2010-02-03 15:32:37,165 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/description/2305635563712489918.1670715971,
isReference=true, sequence id=7542002, length=2256, majorCompaction=false
2010-02-03 15:32:37,170 DEBUG org.apache.hadoop.hbase.regionserver.Store:
loaded
/hbase/filmContributors/1836766931/description/6970032752270852156.1670715971,
isReference=true, sequence id=7541773, length=6664268, majorCompaction=false
2010-02-03 15:33:49,943 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831
and many many times java.io.IOException: Cannot open filename
/hbase/filmContributors/1670715971/content/3783592739034234831

*on different one I found this *

2010-02-03 15:32:35,512 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Cleaned up /hbase/filmContributors/1670715971/splits true
2010-02-03 15:32:35,515 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'filmContributor
s,,1265203126633', STARTKEY => '', ENDKEY => '31587', ENCODED => 1670715971,
OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filmContributors',
MAX_FILESIZE => '268435456', FAMILIES => [{NAME =

'content', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647',

BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME =>
'description', COMPRESSION => 'NONE', VERSIONS
=> '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}}, new regions: filmContributors,,1265207555247,
filmContributors,117416,1265207555247. Split took 0s
ec
**
more details here - http://pastebin.com/d7c52f27a

Also sometimes namenode logs i can see messages like this:

2010-02-03 15:32:38,416 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 54310, call
complete(/hbase/filmContributors/compaction.dir/1836766931/2633146516707160051,
DFSClient_-902184734)
from 10.0.100.51:49692: error: java.io.IOException: Could not complete write
to file
/hbase/filmContributors/compaction.dir/1836766931/2633146516707160051 by
DFSClient_-902184734
java.io.IOException: Could not complete write to file
/hbase/filmContributors/compaction.dir/1836766931/2633146516707160051 by
DFSClient_-902184734

Please help.

Cheers,
Michal