OK, in today's "the sky is the limit" world (in terms of storage), do we really care about 1 or 2 bytes for an integer field type for a primary key? I'm redesigning FabMate Classic to use a MySQL (MariaDB) database backend because my client is still experiencing problems with the VFP database (index problems, etc.). His network guy (as usual) says everything on the Windows Server 2016 network is fine and the problems only started when they upgraded the client workstations from Win7 to Win10. Whatever. This thread is about primary key size WHEN YOU KNOW THAT THE MAX VALUE WON'T GO OVER the limit of each field type (255 for unsigned TINYINT, etc.).
The OCD part of me wants to define the smallest type possible, whereas the "let's set it and never worry about it" part of me says "just make everything INT."
Weigh in on your thoughts on which type you'd use.
tia, --Mike
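For reference, the sizes being weighed can be sketched in a few lines. This is just arithmetic on MySQL/MariaDB's documented integer widths (TINYINT through BIGINT), showing what "1 or 2 bytes" actually buys:

```python
# MySQL/MariaDB integer types: storage size vs. maximum key value.
TYPES = {
    "TINYINT": 1,    # unsigned max 255
    "SMALLINT": 2,   # unsigned max 65,535
    "MEDIUMINT": 3,  # unsigned max 16,777,215
    "INT": 4,        # unsigned max ~4.2 billion
    "BIGINT": 8,     # unsigned max ~18 quintillion
}

for name, num_bytes in TYPES.items():
    unsigned_max = 2 ** (8 * num_bytes) - 1
    signed_max = 2 ** (8 * num_bytes - 1) - 1
    print(f"{name:9s} {num_bytes} byte(s)  "
          f"unsigned max {unsigned_max:,}  signed max {signed_max:,}")
```

So the gap between "smallest possible" and "just make everything INT" is at most 3 bytes per key (plus the same again in every index and foreign key that references it).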
That is largely dependent on the engine you use in MariaDB. Have you settled on one?
On 2017-10-28 02:05, Wollenhaupt, Christof wrote:
That is largely dependent on the engine you use in MariaDB. Have you settled on one?
Was just going to use MyISAM for non-transactional tables and InnoDB where I need transactions (end-data tables). Sometimes I'll use a COMPRESSED or COMPACT row format, but if I read the docs right, I'm not sure I'm achieving the intent of either.
Wouldn't you just use whatever the default autoincrementing field is?
On 10/28/2017 12:48 AM, mbsoftwaresolutions@mbsoftwaresolutions.com wrote:
The OCD part of me wants to define the smallest type possible, whereas the "let's set it and never worry about it" part of me says "just make everything INT."
I like to leave wiggle room generally. But really, using 2, 4, 8, or even 16 bytes has minimal impact in my experience. So, for example, if you expect the number never to go over a million, I'd probably use INT, which gives you up to 2 billion. If you're sure it will never go over 255, I'd still probably go with the 4-byte size.
Maybe you could squeeze that down if you know regular maintenance could be done to "reset" values (if feasible). But then, if your system might be integrated with others, IDs should usually NEVER be reset.
Whatever the case, please don't do something idiotic like declare "VARCHAR(4000)" for fields that will never have more than 20 characters. That's the most common, stupid, wasteful, inefficient mistake I've seen in so-called "enterprise" apps.
-Charlie
On 2017-10-28 07:17, Charlie-gm wrote:
Whatever the case, please don't do something idiotic like declare "VARCHAR(4000)" for fields that will never have more than 20 characters. That's the most common, stupid, wasteful, inefficient mistake I've seen in so-called "enterprise" apps.
Hi Charlie,
I don't do that, but I thought varchar(n) fields were supposed to be space-efficient in the sense that they just use an extra 1 or 2 bytes up front at the head of the record to identify the true length. They're not really allocating the space like a CHAR(n) field would...right?
On Oct 28, 2017, at 7:58 AM, mbsoftwaresolutions@mbsoftwaresolutions.com wrote:
I don't do that but I thought using varchar(n) fields were supposed to be space-efficient in a sense that it just used a extra 1 or 2 bytes up front at the head of the record to identify the true length? It's not really allocating the space like a CHAR(n) field would...right?
Yes, it only uses the space needed for the value. Again, the primary concern isn't disk storage space, but indexing. If you have *any* variable-width columns in a table, the table no longer has fixed record length, so you lose the fixed offset for searching. The moral of the story is if you use VARCHAR anywhere in a table, you might as well make all string columns VARCHAR.
-- Ed Leafe
--- StripMime Report -- processed MIME parts --- multipart/signed text/plain (text body -- kept) application/pgp-signature ---
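Ed's point about the length prefix can be put in numbers. A minimal back-of-envelope sketch of CHAR(n) vs. VARCHAR(n) storage, assuming a single-byte character set (actual per-engine row overhead varies):

```python
def char_storage(n):
    # CHAR(n): always allocates n characters, space-padded.
    return n

def varchar_storage(value, n):
    # VARCHAR(n): actual length plus a 1-byte length prefix
    # (2 bytes when the declared max exceeds 255 characters).
    prefix = 1 if n <= 255 else 2
    return len(value) + prefix

# A 20-character value in a VARCHAR(4000) column stores 22 bytes on disk,
# but a CHAR(4000) column would store 4000 regardless.
print(varchar_storage("x" * 20, 4000))  # 22
print(char_storage(4000))               # 4000
```

The disk numbers favor VARCHAR, which is Mike's point; Ed's counterpoint is that any variable-width column costs you fixed record offsets, which disk math doesn't show.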
On 2017-10-28 09:53, Ed Leafe wrote:
Yes, it only uses the space needed for the value. Again, the primary concern isn't disk storage space, but indexing. If you have *any* variable-width columns in a table, the table no longer has fixed record length, so you lose the fixed offset for searching. The moral of the story is if you use VARCHAR anywhere in a table, you might as well make all string columns VARCHAR.
Hi Ed,
Yep...I recall hearing that years ago, too (that any varchar basically makes the whole row variable-length as far as the string fields go).
Would you ever use fixed CHARs with a non-VFP backend for that fixed offset advantage, again with the idea that storage is cheap/available?
tia, --Mike
On Oct 28, 2017, at 9:41 AM, mbsoftwaresolutions@mbsoftwaresolutions.com wrote:
Would you ever use fixed CHARs with a non-VFP backend for that fixed offset advantage, again with the idea that storage is cheap/available?
Sure, but of course it would depend on the details of the data.
-- Ed Leafe
On 10/28/2017 8:58 AM, mbsoftwaresolutions@mbsoftwaresolutions.com wrote:
I don't do that but I thought using varchar(n) fields were supposed to be space-efficient in a sense that it just used a extra 1 or 2 bytes up front at the head of the record to identify the true length? It's not really allocating the space like a CHAR(n) field would...right?
The main problem is that the database engine, and anything that allocates memory for data coming from the database engine, has to constantly assume the field size could jump up to 4000 (or whatever the max length is) at any time.
While space is not immediately allocated on disk, in memory, etc., the engine has to assume it will be at any given instant. Some engines make "guesses" and watch "utilization" of fields, allocating chunks according to those guesses (recall that most DB servers do not allocate disk space on each row/field update; they do it in chunks). But that's a waste of processing power and disk storage compared to knowing your data and doing even just a little data analysis.
I've seen "enterprise" app databases grow to 1TB, with the dev team saying "we need more disk space!!! we're running out!!!"... I took that database design, did a little analysis on it, and got it into about 10GB of VFP tables (including indexes). Of course, that does not account for engine transaction logs, etc., but those were regularly purged (data archives were kept on different servers altogether). In the DB server design, I saw varchar('max <whatever that value is>') all over the place. Now, this was an Oracle database server, so there may be some differences between engines.
So while some claim that varchar will only use disk space (and memory) based on the number of characters actually stored, in practice that is not completely accurate. And while char can also be inefficient if sized too large for what is stored, it will "scale" at a predictable rate and put less "processing" burden on the engine.
And, of course, there is a ton of additional "theory" and discussion about this across the web, so you may want to do more research.
-Charlie
On 2017-10-28 09:58, Charlie-gm wrote:
So while some claim that varchar will only use up disk space (and memory) based on the number of characters used, in practice that is not completely accurate.
Yeah, at my last gig, they used varchar(max) all over the place with SQL Server. I didn't care for that. In the MySQL system I preferred (before they made me switch to the official corporate choice of SQL Server), I imported data into tables with a TEXT field, and then wrote a simple optimization program to scan for the largest length in each column and do an ALTER TABLE to make them exactly the max they needed to be. I couldn't leave them as TEXT (or varchar(max)) because they'd come into VFP as memo fields, which I couldn't easily export to Excel.
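The optimization program Mike describes could be sketched roughly like this (table and column names are hypothetical; in the real thing the observed lengths would come from running SELECT MAX(CHAR_LENGTH(col)) against MySQL):

```python
def build_alter_statements(table, column_max_lengths):
    """Given {column: observed max length}, emit ALTER TABLE statements
    shrinking each TEXT column to an exact-fit VARCHAR."""
    stmts = []
    for col, max_len in column_max_lengths.items():
        stmts.append(f"ALTER TABLE {table} MODIFY {col} VARCHAR({max_len})")
    return stmts

# Hypothetical example: lengths gathered from a prior
#   SELECT MAX(CHAR_LENGTH(ship_city)), MAX(CHAR_LENGTH(notes)) FROM orders
for stmt in build_alter_statements("orders", {"ship_city": 28, "notes": 240}):
    print(stmt)
```

The trade-off, of course, is that the sizing is only "exact" as of the data you scanned; a longer value arriving later would need another ALTER.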
On Oct 27, 2017, at 11:48 PM, mbsoftwaresolutions@mbsoftwaresolutions.com wrote:
The OCD part of me wants to define the smallest type possible, whereas the "let's set it and never worry about it" part of me says "just make everything INT."
It's not just the size of the column in the row stored on the disk; it's the size of any index it's part of, and the size of the memory footprint for the column when filtering, sorting, etc. Generally I don't obsess trying to make it as small as possible, but neither do I just make everything huge so that I don't have to worry about it. Use the sanest size (that's based on your experience being a DBA), and if worse comes to worse, there's always ALTER TABLE. :)
-- Ed Leafe
On 2017-10-28 09:45, Ed Leafe wrote:
It's not just the size of the column in the row stored on the disk; it's the size of any index it's part of, and the size of the memory footprint for the column when filtering, sorting, etc. Generally I don't obsess trying to make it as small as possible, but neither do I just make everything huge so that I don't have to worry about it. Use the sanest size (that's based on your experience being a DBA), and if worse comes to worse, there's always ALTER TABLE. :)
Right...I couldn't care less about disk space b/c as I said, that's basically never-ending for my needs. I was thinking for in-memory index trees, mostly.
Disk space is cheap. Database schema changes are not. Updates require testing as well as validation in all of the UI and UX environments.
Indexes of int are easy and great. Indexes of Char, VarChar, or NVarChar are necessary. How the rdbms does such, and how the dba defines the index to accept insert as well as an update is where the true performance lies.
On 2017-10-28 11:54, Stephen Russell wrote:
Disk space is cheap. Database schema changes are not. Updates require testing as well as validation in all of the UI and UX environments.
Indexes of int are easy and great. Indexes of Char, VarChar, or NVarChar are necessary. How the rdbms does such, and how the dba defines the index to accept insert as well as an update is where the true performance lies.
FabNet was designed with app-generated GUIDs using Craig Boyd's createguid logic. Classic FabMate was using auto-incrementing tables in VFP; I was thinking of just using auto-inc keys again (rather than varchar(40) guid keys).
From a design POV, why would you want to give up control and make the backend responsible for key generation, so that you have to wait for its reply to get the data necessary for the next set of inserts?
PKMasterID you generate via a GUID.
If that same key is required in 2,3,4+ other tables as FKMasterID, why not generate the pointer and then submit ALL the inserts at once within a transaction on the server? Or do you have procedures on the server that do your insert and all you need to do is pass the params for them?
On 2017-10-30 09:28, Stephen Russell wrote:
From a design POV why would you want to give up control and have the
backend responsible for something that takes a reply of, to provide you data necessary for the next set of inserts?
PKMasterID you generate via a GUID.
If that same key is required in 2,3,4+ other tables as FKMasterID, why not generate the pointer and then submit ALL the inserts at once within a transaction on the server? Or do you have procedures on the server that do your insert and all you need to do is pass the params for them?
That's what I liked about the varchar(40) GUID keys I generated in the app--I could do all that and then save the whole series of datasets (parent/child/grandchild) in one transaction, instead of juggling auto-increment integer keys. That comes at a cost, of course -- larger indexes, arguably less efficient.
Like everything, there's trade-offs with each path.
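The pattern Mike describes -- generating keys in the app so a whole parent/child set can go to the server in one transaction -- might look like this sketch (Python's uuid4 standing in for Craig Boyd's createguid logic; the table and column names are made up for illustration):

```python
import uuid

def build_order_inserts(customer, lines):
    # Keys are known before anything touches the database, so parent and
    # child rows can be submitted together in a single transaction -- no
    # waiting for the server to hand back an auto-increment value.
    order_id = str(uuid.uuid4())
    stmts = [("INSERT INTO orders (pk, customer) VALUES (%s, %s)",
              (order_id, customer))]
    for item in lines:
        line_id = str(uuid.uuid4())
        stmts.append(
            ("INSERT INTO orderlines (pk, fk_order, item) VALUES (%s, %s, %s)",
             (line_id, order_id, item)))
    return stmts

stmts = build_order_inserts("Acme", ["widget", "sprocket"])
# All three inserts share the same pre-generated parent key.
```

With auto-increment keys, by contrast, the child inserts can't be built until the parent insert has round-tripped and returned its new ID.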
Less efficient indexes? Are you talking about the space in the db compared to an int pointer, or the time to locate the data in that row because of the pointer's data type? The flip side is data insertion.
Can you tell us why you said "less efficient"?
On 2017-10-30 11:09, Stephen Russell wrote:
Less efficient indexes? Are you talking about space in a db compared to an int for a pointer or are you saying that the time to isolate the data on that row because of the data type of the pointer? The flip side is data insertion.
Can you tell us why you use less efficient?
Not sure of your wording, whether you meant exactly that or not, so let me try to respond:
I like the GUID v(40) keys because if I ever need to combine data, I'm not running into duplicate keys. Plus, I like defining the key ahead of time and having complete control, so I can work with parent/child/grandchild datasets more easily than if I had to contend with auto-inc keys. The negative of this approach, as I understood it, is that since the index is 4x larger than a 4-byte integer key, it would not be as efficient in memory, and the index tree needs rebalancing more often.
Plenty of good articles on the interweb discussing both:
http://www.ovaistariq.net/733/understanding-btree-indexes-and-how-they-impact-performance/#.WfdQDHYpCJA
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
http://web.archive.org/web/20150511162734/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
I think I'll stick with app-generated GUIDs though for the portability and no-collision benefit if I merged/move data. I also want to do replication where their database is stored locally but then replicates to a master database outside their office.
Good URLs you presented.
To me, from a performance POV, the INSERT of the GUID is the only downside with respect to the index. It has to identify the right page and add itself there. If need be, it will split the page and generate two pages with holes to accept new index data going forward. Next, look at the velocity of inserts for the kind of data you're handling. Are your inserts per minute to a table > 10,000? If so, the GUID may be the wrong thing. Think of eBay in the closing seconds of an auction, or a stock trader generating A LOT of transactions as the market moves. Those situations are where next-int is best, because it always goes to the last page of the index.
If you are not in that type of data environment, you can do either with no problem.
M$ loves using GUIDs in their internal systems like CRM and SharePoint. They are massively GUID-driven in all of their tables.
I believe NewSequentialID() in SQL Server (or the UuidCreateSequential API call) avoids the paging problem behind Guids.
Frank.
Frank Cazabon
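To see why a sequential GUID sidesteps the page-split problem, here is a toy stand-in (NOT UuidCreateSequential itself, just a counter prefix with a random tail): new values always sort after earlier ones, so B-tree inserts only ever land at the end of the index.

```python
import itertools
import uuid

_counter = itertools.count()

def sequential_guid():
    # Toy stand-in for NewSequentialID()/UuidCreateSequential: a strictly
    # increasing fixed-width prefix plus a random tail. Real sequential GUIDs
    # derive the prefix from a timestamp and MAC address, but the
    # index-friendly ordering property is the same.
    return f"{next(_counter):012x}-{uuid.uuid4().hex[:20]}"

guids = [sequential_guid() for _ in range(100)]
# Lexicographic order matches insertion order, so an index over these
# values appends to its last page instead of splitting random pages.
assert guids == sorted(guids)
```

Compare that with plain uuid4 values, whose leading characters are random and scatter inserts across the whole index.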
Frank, that puts the logic at the RDBMS engine and no longer in your app, in front of your data.
In reality, you get values all over the place, but this bunch will be close. Pound script this a few times and you will see the leading values are what change, not the last 4-5-6 characters.
CREATE TABLE GUID_Example (
    SeqCol uniqueidentifier DEFAULT NewSequentialID(),
    IDCol uniqueidentifier DEFAULT NEWID()
)
---- Inserting four default values into the table
INSERT INTO GUID_Example DEFAULT VALUES
INSERT INTO GUID_Example DEFAULT VALUES
INSERT INTO GUID_Example DEFAULT VALUES
INSERT INTO GUID_Example DEFAULT VALUES
---------------------------------------------------------
SELECT * FROM GUID_Example
----Clean up database
DROP TABLE GUID_Example
I got this as output:
SeqCol                                IDCol
1E54CB01-A7BD-E711-9C54-D481D71992B4  120C2AD7-ECF1-487F-BE56-0FD36A78237F
1F54CB01-A7BD-E711-9C54-D481D71992B4  E6F5D7B5-61F8-4FD3-988B-F0949A029E29
2054CB01-A7BD-E711-9C54-D481D71992B4  B5C05851-FBA5-4DD7-864A-133EE1BC6C68
2154CB01-A7BD-E711-9C54-D481D71992B4  638865DA-F2E4-4101-9534-E3DB83A0008E
When inserts land all over the index pages and there is a lot of available room, you may not see a performance hit at all. On the flip side, NewSequentialID may keep pounding inserts into a page that was already starting to get tight and is now full.
Please remember, folks, that the FKey index is also a component of the insert event. Indexes you maintain but don't really need will get in your way on any platform.
I use the API call in my c# applications (UI), so the DBMS logic doesn't come into play.
I don't really follow the rest of your message. I'll try to understand it again tomorrow morning with a fresher brain 😀
On 30 October 2017 15:26:08 GMT-04:00, Stephen Russell srussell705@gmail.com wrote:
Frank, that puts the logic at the rdbms engine and no longer in your app, in-front of your data.
In reality, you get data all over the place but this bunch will be close. Pound script this a few times and you will see the leading values are what changes an not the last 4-5-6 characters.
CREATE TABLE GUID_Example ( SeqCol uniqueidentifier DEFAULT NewSequentialID() ,IDCol uniqueidentifier DEFAULT NEWID(),) ----Inserting five default values in table INSERT INTO GUID_Example DEFAULT VALUES INSERT INTO GUID_Example DEFAULT VALUES INSERT INTO GUID_Example DEFAULT VALUES INSERT INTO GUID_Example DEFAULT VALUES
SELECT * FROM GUID_Example
----Clean up database
DROP TABLE GUID_Example
I got this as output : SeqCol IDCol 1E54CB01-A7BD-E711-9C54-D481D71992B4 120C2AD7-ECF1-487F-BE56-0FD36A78237F 1F54CB01-A7BD-E711-9C54-D481D71992B4 E6F5D7B5-61F8-4FD3-988B-F0949A029E29 2054CB01-A7BD-E711-9C54-D481D71992B4 B5C05851-FBA5-4DD7-864A-133EE1BC6C68 2154CB01-A7BD-E711-9C54-D481D71992B4 638865DA-F2E4-4101-9534-E3DB83A0008E
When the performance goes to insert all over the index pages where there is a lot of available room you may not have a performance hit at all. On the flip side using the newsequentialID it may make a compound insert into a page that was starting to get tight and now is tight.
Please remember folks that Fkey index is also a component in the insert event as well. The more indexes you maintain that you really don't need, do get in your way on any platform.
On Mon, Oct 30, 2017 at 1:23 PM, Frank Cazabon frank.cazabon@gmail.com wrote:
I believe NewSequentialID() in SQL Server (or the
UuidCreateSequential API
call) avoids the paging problem behind Guids.
Frank.
Frank Cazabon
On 30/10/2017 02:17 PM, Stephen Russell wrote:
Good URLs you presented.
To me, from a performance POV, the INSERT is the GUID's only downside with respect to the index. The engine has to identify the right page and add the entry there; if need be, it splits ("tears") the page and generates two pages with free space to accept new index data going forward. Next, look at the velocity of inserts for the data you are presenting. Are your inserts per minute to a table > 10,000? If so, the GUID may be the wrong choice. Think of eBay in the closing seconds of an auction, or a stock trader in a fast-moving market generating A LOT of transactions. Those situations are where a next-int key is best, because it always goes to the last page of the index.
If you are not in that type of data environment, you can do either with no problem.
M$ loves using GUIDs in its internal systems like CRM and SharePoint; they are massively GUID-driven in all of their tables.
On Mon, Oct 30, 2017 at 12:42 PM, < mbsoftwaresolutions@mbsoftwaresolutions.com> wrote:
On 2017-10-30 11:09, Stephen Russell wrote:
Less efficient indexes? Are you talking about space in the db compared to an int pointer, or about the time to isolate the data on a row because of the data type of the pointer? The flip side is data insertion.
Can you tell us why you say less efficient?
Not sure of your wording, if you meant exactly that or not, so let me try to respond:
I like the GUID v(40) keys because if I ever need to combine data, I'm not running into duplicate keys. Plus, I like defining the key ahead of time and having complete control, so I can work with parent/child/grandchild datasets more easily than if I had to contend with auto-inc keys. The negative of this approach, as I understood it, is that since the index is 4x larger than a 4-byte integer key, it is not as efficient in memory, and the index tree needs reindexing more often to stay balanced.
Plenty of good articles on the interweb discuss both:
http://www.ovaistariq.net/733/understanding-btree-indexes-and-how-they-impact-performance/#.WfdQDHYpCJA
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
http://web.archive.org/web/20150511162734/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
I think I'll stick with app-generated GUIDs, though, for the portability and no-collision benefit if I merge/move data. I also want to do replication where the database is stored locally but then replicates to a master database outside their office.
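To put a rough number on the "4x larger" point: the cost depends on how the GUID is stored. A back-of-the-envelope sketch in Python (assumptions: a 4-byte INT key, a 16-byte binary uniqueidentifier, and a 36-character textual GUID as held in a v(40) column; real index entries add page and row overhead on top of these raw key sizes):

```python
ROWS = 1_000_000

INT_KEY = 4       # 4-byte integer key
GUID_BINARY = 16  # uniqueidentifier / BINARY(16)
GUID_TEXT = 36    # "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" stored as text

for name, size in [("INT", INT_KEY),
                   ("binary GUID", GUID_BINARY),
                   ("text GUID", GUID_TEXT)]:
    mb = ROWS * size / 1024 / 1024
    print(f"{name}: {size} bytes/key, ~{mb:.1f} MB of key data per million rows")

# A textual GUID key is 9x an INT key (a binary GUID is 4x), and on
# engines with a clustered primary key every secondary index repeats
# the PK value, multiplying the overhead.
print(GUID_TEXT / INT_KEY)  # 9.0
```

So a binary uniqueidentifier is where the 4x figure comes from; storing the GUID as text roughly doubles that again.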
[excessive quoting removed by server]
Morning Stephen,
What do you mean by "pound script this"? Do you mean to run it over and over?
When I did that, I got the SeqCol values coming out in increasing order, as per your example, with the first couple of characters in SeqCol changing and the rest staying the same.
I still don't understand your last two paragraphs. As I understood it when I researched this, each new insert goes to the end of the index, so it is always being added at the end and no paging complexity is involved. Could you clarify this for me, please?
Frank.
Frank Cazabon
On 30/10/2017 03:26 PM, Stephen Russell wrote:
Frank, that puts the logic in the RDBMS engine and no longer in your app, in front of your data.
In reality you get data all over the place, but this bunch will be close. Pound-script this a few times and you will see the leading values are what changes, not the last 4-5-6 characters.
[excessive quoting removed by server]
Here is a good video for B+ tree, and how they operate.
https://www.youtube.com/watch?v=tY9xu3vuZ10
About 5 minutes in, it gets to the demonstration of the science: the creation of leaves and how things can bounce around as you insert. Just think of GUIDs, where consecutive inserts like C7948B8E-9B98-48C0-B2EA-11DB5474A99D, 0639AB12-9B34-4A86-9C76-13A38B587C20, 0D14D7A2-DFBA-41C4-9102-14AAB2F77F6D and A59F2BA2-4CBF-4E70-B5E3-20F4E7C577DA all land in different branches, whether those branches hold 50 entries or 5,000. Adding data to the end of the index costs a little, but adding all over the place can get you into leaf/page tears and resetting of headers on leaves. This is the performance cost I was referring to.
In SQL Server you can define your fill factor, which is the "free space" left in your index as it rebuilds. Some tables do well with a lower fill factor (more free space for future inserts) when you key on any sort of character data; use a higher value for an auto-increment INT column.
When you get an analysis of your queries, Index Seek means the engine used those B+tree headers to find the location, while Index Scan means it scanned the entire index to find the data you want. Even redundant values in a status column can be a seek rather than a scan, because the variety of OPEN/CLOSED values is very limited.
You can seriously spend way too much time attempting to tune these suckers, and as they grow over time you have to come back and tune again. :) What you tuned at 5,000 rows may need revisiting two years later at 1,500,000 rows, to verify the process is still in good shape.
On Tue, Oct 31, 2017 at 6:33 AM, Frank Cazabon frank.cazabon@gmail.com wrote:
[excessive quoting removed by server]