Discussion:
HBase random access in HDFS and block indices
William Kang
2010-10-19 02:48:16 UTC
Hi,
Recently I have spent some effort trying to understand the mechanisms
of HBase in order to explore possible performance tuning options. Many
thanks to the folks in this community who helped with my questions; I
have sent out a report. But there are still a few questions left.

1. If an HFile block contains more than one key-value pair, does the
block index in the HFile record the offset of every key-value pair in
that block? Or does the block index only record the key range of each
block, so that you have to scan inside the block until you reach the
key you are looking for?

2. When HBase reads a block to fetch data or scan within it, is the
whole block read into memory?

3. HBase blocks (64 KB, configurable) live inside HDFS blocks (64 MB,
configurable), so to read an HBase block we have to random-access an
HDFS block. Even though HBase can use a positioned read like
in.read(p, buf, 0, x) to read only a small portion of the larger HDFS
block, it is still a random access. Would this be slow?
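
As a side note, here is a minimal sketch of where these two block sizes
are typically configured, assuming a 0.90-era HBase client API and the
classic dfs.block.size Hadoop property (the table and family names are
only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class BlockSizes {
    public static void main(String[] args) {
        // HFile ("HBase") block size: set per column family, default 64 KB.
        HTableDescriptor table = new HTableDescriptor("mytable");   // illustrative table name
        HColumnDescriptor family = new HColumnDescriptor("cf");     // illustrative family name
        family.setBlocksize(64 * 1024);                             // bytes
        table.addFamily(family);

        // HDFS block size: cluster default, commonly 64 MB at the time,
        // normally set via dfs.block.size in hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);
    }
}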

Many thanks. I would be grateful for your answers.


William
Ryan Rawson
2010-10-19 02:58:46 UTC
Post by William Kang
1. If an HFile block contains more than one key-value pair, does the
block index in the HFile record the offset of every key-value pair in
that block? Or does the block index only record the key range of each
block, so that you have to scan inside the block until you reach the
key you are looking for?
The block index contains the first key of every block. It therefore
defines the range of each block in an [a,b) manner. Once a block has
been selected to read from, it is read into memory and then iterated
over until the key in question (or the closest match) has been found.
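
A minimal sketch of that lookup, using simplified stand-ins for the real
HFile structures (an in-memory list of first-key/offset pairs rather than
the actual index format): binary-search the first keys to pick the [a,b)
block, then walk the block's key-values until the key or its closest
match is reached.

import java.util.Comparator;
import java.util.List;

// Simplified stand-in for a block index entry: the block's first key and its file offset.
class BlockIndexEntry {
    final byte[] firstKey;
    final long offset;
    BlockIndexEntry(byte[] firstKey, long offset) {
        this.firstKey = firstKey;
        this.offset = offset;
    }
}

public class BlockLookupSketch {
    // Pick the last block whose first key is <= the search key ([a,b) semantics).
    static int selectBlock(List<BlockIndexEntry> index, byte[] key, Comparator<byte[]> cmp) {
        int lo = 0, hi = index.size() - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (cmp.compare(index.get(mid).firstKey, key) <= 0) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;   // -1 means the key sorts before the first block's first key
    }

    // The chosen block is read into memory, then iterated until the key
    // (or the closest following key) is found.
    static byte[] searchBlock(List<byte[][]> keyValues, byte[] key, Comparator<byte[]> cmp) {
        for (byte[][] kv : keyValues) {           // kv[0] = key, kv[1] = value
            if (cmp.compare(kv[0], key) >= 0) {
                return kv[1];
            }
        }
        return null;                              // key is past the end of the block
    }
}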
Post by William Kang
2. When HBase reads a block to fetch data or scan within it, is the
whole block read into memory?
Yes, the entire block is read into memory in a single read operation.
Post by William Kang
3. HBase blocks (64 KB, configurable) live inside HDFS blocks (64 MB,
configurable), so to read an HBase block we have to random-access an
HDFS block. Even though HBase can use a positioned read like
in.read(p, buf, 0, x) to read only a small portion of the larger HDFS
block, it is still a random access. Would this be slow?
Random access reads are not necessarily slow; they require several things:
- disk seeks to the data in question
- disk seeks to the checksum files in question
- checksum computation and verification

While not particularly slow, this could probably be optimized a bit.

Most of the difficulty with random reads in HDFS is in parallelizing the
reads and doing as much IO pushdown/scheduling as possible without
consuming an excess of sockets and threads. The actual speed can be
excellent, or not, depending on how busy the IO subsystem is.
William Kang
2010-10-19 03:21:21 UTC
Hi JG and Ryan,
Thanks for the excellent answers.

So, I am going to push everything to the extreme without considering
memory first. In theory, if every cell in HBase were the same size as
the HBase block, there would be no traversal within a block; and if
every HBase block were the same size as an HDFS block, no in-file
random access would be necessary. Would this give the best performance?

But the problem is that if the HBase block is too large, memory will
run out, since HBase loads whole blocks into memory; and if the HDFS
block is too small, the DN will run out of memory, since each HDFS
file takes some memory. So it is a trade-off between memory and
performance. Is that right?

And would it make any difference whether you random-read the same-size
portion of a file from a small HDFS block versus a large HDFS block?

Thanks.


William
Ryan Rawson
2010-10-19 03:27:55 UTC
The primary problem is the namenode memory. It contains entries for every
file and block, so setting the HDFS block size small limits your scalability.

There is nothing inherently wrong with in-file random reads; it's just that
the HDFS client was written for a single reader that reads most of a file.
Thus, to achieve high performance you'd need to do tricks, such as pipelining
sockets and socket pool reuse. Right now, for random reads we open a new
socket, read the data, then close it.
Matt Corgan
2010-10-19 03:53:11 UTC
Do you guys ever worry about how big an HFile's index will be? For example,
if you have a 512 MB HFile with an 8 KB block size, you will have 64,000
blocks. If each index entry is 50 bytes, then you have a 3.2 MB index, which
is way out of line with your intention of having a small block size. I
believe the index is read all at once, so it will be slow the first time...
is the index cached somewhere (the block cache?) so that index accesses are
served from memory?

And somewhat related: since the index is stored at the end of the HFile, is
an additional random access required to find its offset? If it was
considered, why was that chosen over putting the index in its own file that
could be accessed directly?

Thanks for all these explanations,
Matt
Jonathan Gray
2010-10-19 04:00:11 UTC
HFiles are generally 256 MB and the default block size is 64 KB, so that's about 4,000 blocks (1/16th of what you said). That would give a more reasonable block index of about 200 KB.

But the block index is kept in memory, so you only read it once, when the file is first opened. So even if you do lower the block size and increase the size of the block index, that mostly slows down the initial open and should not have a major impact on random access performance once the index is cached.

The index's offset is found by reading the headers of the HFile. Again, this only has to be done on open. Indexes used to be kept in separate files, but that doubled the number of open files HBase might need, an issue we are always pushing against.

I'm not sure seeking around a single HFile during its initial load is an especially big issue (when we're talking about just 2-3 areas to read from, not hundreds).

JG
Ryan Rawson
2010-10-19 04:07:54 UTC
Hi,

Since the file is write-once, with no random writes, putting the index at
the end is the only choice. The loading goes like this:
- read the fixed-size file trailer, i.e. at offset filelen - <fixed size>
- read the locations of the additional variable-length sections, e.g. the block index
- read those indexes, including the variable-length 'file-info' section


So to give some background: by default, an InputStream from DFSClient
has a single socket, and positioned reads on it are fairly fast. The DFS
just sees 'read bytes from pos X, length Y' commands on an open socket.
That is fast, but only one thread at a time can use this interface.
So for 'get' requests we use another interface called pread(), which
takes a position + length and returns data. This involves setting up a
single-use socket and tearing it down when we are done, so it is slower
by 2-3 TCP RTTs, thread instantiation, and other misc overhead.


Back to the HFile index: it is indeed stored in RAM, not in the block cache.
Size is generally not an issue; it hasn't been yet. We ship with a
default block size of 64 KB, and I'd recommend sticking with that unless
you have evidence that your data set performs better under a
different size. Since the index grows linearly, at roughly 1/64k of the
data size in bytes, it isn't a huge deal. Also, the indexes are spread
around the cluster, so you are pushing the load to all machines.
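
A minimal sketch of the two access styles described above, written against
the public FSDataInputStream API rather than the actual DFSClient/HFile
internals (the path, trailer size, and offsets are made up): the trailer is
located from the file length, and get-style access uses the positioned
read(pos, buf, off, len) call.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/hbase/mytable/region/cf/somehfile");   // made-up path
        long fileLen = fs.getFileStatus(path).getLen();

        FSDataInputStream in = fs.open(path);
        try {
            // 1. Read the fixed-size trailer at the end of the file (size here is illustrative).
            int trailerSize = 1024;
            byte[] trailer = new byte[trailerSize];
            in.readFully(fileLen - trailerSize, trailer);             // positioned read

            // 2. Stream-style access: seek + read. Fast, but only one thread
            //    can use the stream at a time.
            in.seek(0);
            byte[] header = new byte[128];
            in.readFully(header);

            // 3. Get-style access: pread(pos, buf, off, len). Thread-safe, but in this
            //    era each call set up and tore down its own datanode connection.
            byte[] block = new byte[64 * 1024];
            int n = in.read(12345L, block, 0, block.length);          // made-up offset
            System.out.println("read " + n + " bytes");
        } finally {
            in.close();
        }
    }
}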
Matt Corgan
2010-10-19 04:30:32 UTC
I was envisioning the HFiles being opened and closed more often, but it
sounds like they're held open for long periods and that the indexes are
permanently cached. Is it roughly correct to say that after opening an
HFile and loading its checksums/metadata/index/etc., each random data
block access only requires a single pread, where the pread has some
threading and connection overhead but theoretically requires only one disk
seek? I'm curious because I'm trying to do a lot of random reads, and given
enough application parallelism, the disk seeks should become the bottleneck
much sooner than the network and threading overhead.

Thanks again,
Matt
Sean Bigdatafun
2010-10-29 13:41:50 UTC
I have the same doubt here. Let's say I have a totally random read pattern
(uniformly distributed).

Now let's assume my total data size stored in HBase is 100 TB on 10
machines (not a big deal considering today's disks), and the total size of
my RegionServers' memory is 10 * 6 GB = 60 GB. That translates into roughly
a 60 GB / 100 TB ≈ 0.06% cache hit probability. Under a random read pattern,
each read is bound to experience the "open -> read index -> .... -> read
data block" sequence, which would be expensive.

Any comment?
Alvin C.L Huang
2010-10-29 16:38:38 UTC
@Sean

With such a low expected hit ratio, the cache will not benefit your random
gets. It would just cause many Java major GCs that pause all threads, and
thus bad performance.

Try no cache at all.
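
A minimal sketch of the two usual knobs for that, assuming a 0.90-era client
API (the family name is only illustrative): skip the block cache per scan, or
disable it for the column family altogether.

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.Scan;

public class NoBlockCache {
    public static void main(String[] args) {
        // Per request: tell the scan not to populate the block cache with what it reads.
        Scan scan = new Scan();
        scan.setCacheBlocks(false);

        // Per column family: turn the block cache off for this family entirely.
        HColumnDescriptor family = new HColumnDescriptor("cf");   // illustrative family name
        family.setBlockCacheEnabled(false);
    }
}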

--
Alvin C.-L., Huang
ATC, ICL, ITRI, Taiwan
Stack
2010-10-29 17:01:24 UTC
If the access pattern is totally random then, per Alvin's suggestion, yes,
just turn off block caching since it is doing you no good.

But totally random is unusual in practice, no?

St.Ack
Michael Segel
2010-11-01 16:28:25 UTC
Uhm... not exactly.

One of the benefits of HBase is that it should scale in a *near* linear fashion.

So if we don't know how the data is going to be accessed, or we know that there are a couple of access patterns that are orthogonal to each other, putting the data into the cloud in a 'random' fashion should provide consistent read access times.

So the design of 'random' stored data shouldn't be that unusual. It just means you're going to have a couple of different indexes. ;-)
Tao Xie
2010-11-02 08:07:29 UTC
I read the code, and my understanding is that when a RegionServer starts,
the StoreFiles of each Region are instantiated, and then
HFile.Reader.loadFileInfo() reads the index and file info. So each StoreFile
is opened only once and the block index is cached; the cache misses are for
data blocks. In other words, for a random Get, each read does not need to
open the HFile and read the index again. Is that right?
Post by Sean Bigdatafun
I have the same doubt here. Let's say I have a totally random read pattern
(uniformly distributed).
Now let's assume my total data size stored in HBase is 100TB on 10
machines(not a big deal considering nowaday's disks), and the total size of
my RS' memory is 10 * 6G = 60 GB. That translate into a 60/100*1000 = 0.06%
cache hit probablity. Under random read pattern, each read is bound to
experience the "open-> read index -> .... -> read datablock" sequence, which
would be expensive.
Any comment?
Post by Matt Corgan
I was envisioning the HFiles being opened and closed more often, but it
sounds like they're held open for long periods and that the indexes are
permanently cached. Is it roughly correct to say that after opening an
HFile and loading its checksum/metadata/index/etc then each random data
block access only requires a single pread, where the pread has some
threading and connection overhead, but theoretically only requires one
disk
Post by Matt Corgan
seek. I'm curious because I'm trying to do a lot of random reads, and given
enough application parallelism, the disk seeks should become the
bottleneck
Post by Matt Corgan
much sooner than the network and threading overhead.
Thanks again,
Matt
Post by Ryan Rawson
Hi,
Since the file is write-once, no random writes, putting the index at
- read fixed file trailer, ie: filelen.offset - <fixed size>
- read location of additional variable length sections, eg: block index
- read those indexes, including the variable length 'file-info' section
So to give some background, by default an InputStream from DFSClient
has a single socket and positioned reads are fairly fast. The DFS
just sees 'read bytes from pos X length Y' commands on an open socket.
That is fast. But only 1 thread at a time can use this interface.
So for 'get' requests we use another interface called pread() which
takes a position+length, and returns data. This involves setting up a
1-use socket and tearing it down when we are done. So it is slower by
2-3 tcp RTT, thread instantiation and other misc overhead.
Back to the HFile index, it is indeed stored in ram, not block cache.
Size is generally not an issue, hasn't been yet. We ship with a
default block size of 64k, and I'd recommend sticking with that unless
you have evidential proof your data set performance is better under a
different size. Since the index grows linearly by a factor of 1/64k
with the bytes of the data, it isn't a huge deal. Also the indexes
are spread around the cluster, so you are pushing load to all
machines.
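For reference, a minimal sketch of a positioned read ("pread") against the plain HDFS client API; the file path, offset, and sizes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read one HBase-sized block (64k) out of a much larger HDFS block
// with a positioned read, instead of streaming through the file.
public class PreadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path hfile = new Path("/hbase/mytable/1234567890/cf/somehfile"); // placeholder path
        long blockOffset = 12L * 64 * 1024;  // offset of the HFile block we want
        byte[] buf = new byte[64 * 1024];    // one 64k HFile block

        FSDataInputStream in = fs.open(hfile);
        try {
            // Positioned read: "give me up to 'length' bytes at 'position'"; it does not
            // move the stream's own position, so it is safe for concurrent readers.
            int n = in.read(blockOffset, buf, 0, buf.length);
            System.out.println("read " + n + " bytes");
        } finally {
            in.close();
        }
    }
}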
Post by Matt Corgan
Do you guys ever worry about how big an HFile's index will be? For example, if you have a 512mb HFile with 8k block size, you will have 64,000 blocks. If each index entry is 50b, then you have a 3.2mb index which is way out of line with your intention of having a small block size. I believe that's read all at once so will be slow the first time... is the index cached somewhere (block cache?) so that index accesses are from memory?
And somewhat related - since the index is stored at the end of the HFile, is an additional random access required to find its offset? If it was considered, why was that chosen over putting it in its own file that could be accessed directly?
Thanks for all these explanations,
Matt
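Working out Matt's numbers, and the same file at the 64k default, with his 50-bytes-per-entry estimate carried over:

// Rough index-size estimate: entries ~ fileSize / blockSize, each ~50 bytes.
public class IndexSizeEstimate {
    static long indexBytes(long fileBytes, long blockBytes, long bytesPerEntry) {
        return (fileBytes / blockBytes) * bytesPerEntry;
    }

    public static void main(String[] args) {
        long fileBytes = 512L * 1024 * 1024;  // 512mb HFile
        System.out.println("8k blocks:  " + indexBytes(fileBytes, 8 * 1024, 50) + " bytes");  // ~3.2mb
        System.out.println("64k blocks: " + indexBytes(fileBytes, 64 * 1024, 50) + " bytes"); // ~400kb
    }
}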
Post by Ryan Rawson
The primary problem is the namenode memory. It contains entries for every file and block, so setting the hdfs block size small limits your scalability. There is nothing inherently wrong with in-file random reads, it's just that the hdfs client was written for a single reader to read most of a file. Thus to achieve high performance you'd need to do tricks, such as pipelining sockets and socket pool reuse. Right now for random reads we open a new socket, read data then close it.
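To put rough numbers on the namenode concern: the per-block overhead of ~150 bytes below is a commonly quoted ballpark assumption, not a figure from this thread:

// Rough sketch: number of HDFS blocks (and namenode heap) for 100 TB of data.
// The 150 bytes-per-block figure is an assumed ballpark, not a measured value.
public class NamenodeBlockCount {
    public static void main(String[] args) {
        long dataBytes = 100L * 1024 * 1024 * 1024 * 1024;    // 100 TB
        long[] blockSizes = {64L * 1024, 64L * 1024 * 1024};  // 64k vs 64m HDFS blocks
        for (long bs : blockSizes) {
            long blocks = dataBytes / bs;
            long heapBytes = blocks * 150;  // assumed per-block namenode overhead
            System.out.printf("block size %d: %,d blocks, ~%,d MB of namenode heap%n",
                    bs, blocks, heapBytes / (1024 * 1024));
        }
    }
}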
Post by William Kang
Hi JG and Ryan,
Thanks for the excellent answers. So, I am going to push everything to the extremes without considering the memory first. In theory, if in HBase every cell size equals the HBase block size, then there would not be any in-block traversal. If every HBase block size equals the HDFS block size, there would not be any in-file random access necessary. Would this provide the best performance?
But the problem is that if the block in HBase is too large, the memory will run out since HBase loads blocks into memory; if the block in HDFS is too small, the DN will run out of memory since each HDFS file takes some memory. So it is a trade-off problem between memory and performance. Is that right?
And would it make any difference between random reading the same-size file portion from a small HDFS block and from a large HDFS block?
Thanks.
William
Post by Ryan Rawson
Random access reads are not necessarily slow; they require several things:
- disk seeks to the data in question
- disk seeks to the checksum files in question
- checksum computation and verification
While not particularly slow, this could probably be optimized a bit. Most of the issue with random reads in HDFS is parallelizing the reads and doing as much io-pushdown/scheduling as possible without consuming an excess of sockets and threads. The actual speed can be excellent, or not, depending on how busy the IO subsystem is.
Michael Segel
2010-11-02 16:32:00 UTC
Permalink
Once you instantiate the HFile object, you should be able to issue as many random get() calls against the table as you like until you close the reference.
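In client terms that looks roughly like the following sketch: one table reference opened once and reused for many random gets; the table and row names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: open the table handle once, issue many random gets, close at the end.
public class ReuseTableReference {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // placeholder table name
        try {
            for (int i = 0; i < 1000; i++) {
                Get get = new Get(Bytes.toBytes("row-" + i));
                Result r = table.get(get);
                // each get() is a random read on the server side; the client handle is reused
            }
        } finally {
            table.close();
        }
    }
}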
Stack
2010-10-29 16:59:45 UTC
Permalink
Post by Matt Corgan
I was envisioning the HFiles being opened and closed more often, but it sounds like they're held open for long periods and that the indexes are permanently cached. Is it roughly correct to say that after opening an HFile and loading its checksum/metadata/index/etc then each random data block access only requires a single pread, where the pread has some threading and connection overhead, but theoretically only requires one disk seek. I'm curious because I'm trying to do a lot of random reads, and given enough application parallelism, the disk seeks should become the bottleneck much sooner than the network and threading overhead.
You have it basically right.

On region deploy, all files that comprise a region are opened and thereafter held open. Part of opening is reading in the index and file metadata, so opened files occupy some memory. An optimization would be to let go of unused files and reopen them on access.

St.Ack
Jonathan Gray
2010-10-19 02:59:11 UTC
Permalink
Hi William. Answers inline.
-----Original Message-----
Sent: Monday, October 18, 2010 7:48 PM
To: hbase-user
Subject: HBase random access in HDFS and block indices
Hi,
Recently I have spent some efforts to try to understand the mechanisms
of HBase to exploit possible performance tunning options. And many
thanks to the folks who helped with my questions in this community, I
have sent a report. But, there are still few questions left.
1. If a HFile block contains more than one keyvalue pair, will the
block index in HFile point out the offset for every keyvalue pair in
that block? Or, the block index will just point out the key ranges
inside that block, so you have to traverse inside the block until you
meet the key you are looking for?
It is the latter. The block index points to the start keys of each block, so you effectively have a range for each block.

Lots of work has gone in recently to seek/reseek/early-out when possible and skip unnecessary blocks.
2. When HBase read block to fetching the data or traverse in it, is
this block read into memory?
Yes. And if the block cache is turned on, it will be put into an LRU cache.
3. HBase blocks (64k configurable) are inside HDFS blocks (64m
configurable), to read the HBase blocks, we have to random access the
HDFS blocks. Even HBase can use in(p, buf, 0, x) to read a small
portion of the larger HDFS blocks, it is still a random access. Would
this be slow?
Yes, this is still random access. HBase provides the indexing/retrieval/etc on top of HDFS to make random read access as efficient as possible (and with caching), and it makes random writes possible.

JG
Many thanks. I would be grateful for your answers.
William