Friday, February 24, 2012

Inserting rows problem

I have a "flat" CSV file that I need to import into a set of 3 relational
tables.
As a "programmer", I thought this would be trivial, so using VB6 and ADO, I
created a recordset of all the import rows and then iterated through the
recordset extracting the data from the fields and putting them into the
relational tables. This worked fine with my test data but rather obviously
takes way too long with real data; for every row, I have a network
round-trip to update the first table and retrieve the primary key's ID, I
then have another round trip to update the second table and get it's ID, and
a third network round trip for the third table. All in all, we're talking >
7 hours to complete.
I need to knock this down to minutes...
I'm guessing that I can do this with a single SQL command sent to the
server, but my SQL skills aren't really up to it. I'm hoping that this is
relatively trivial and that someone could provide a few hints.
If it's any help, I've created some example tables (the real ones are
obviously more complicated) - see end of message.
Thanks in advance
Griff
--
--The table holding the import table:
CREATE TABLE [dbo].[inputData] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[product] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[pack] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[price] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL
) ON [PRIMARY]
-- Data to fill this:
insert into [inputData] (product,pack,price) values ('diary','T','100')
insert into [inputData] (product,pack,price) values ('diary','R','5')
insert into [inputData] (product,pack,price) values ('envelope','T','1000')
insert into [inputData] (product,pack,price) values ('envelope','R','100')
insert into [inputData] (product,pack,price) values ('hatstand','T','5')
insert into [inputData] (product,pack,price) values ('hatstand','R','1')
insert into [inputData] (product,pack,price) values ('lamp','T','10')
insert into [inputData] (product,pack,price) values ('lamp','R','2')
-- The following relational tables
CREATE TABLE [dbo].[Product] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[catalogueProduct] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Product] ADD
CONSTRAINT [PK_Product] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Pack] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[Product_ID] [int] NOT NULL ,
[Pack_Details] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Pack] ADD
CONSTRAINT [PK_Pack] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Price] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[Price_Details] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[Pack_ID] [int] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Price] ADD
CONSTRAINT [PK_Price] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Pack] ADD
CONSTRAINT [FK_Pack_Product] FOREIGN KEY
(
[Product_ID]
) REFERENCES [dbo].[Product] (
[id]
)
GO
ALTER TABLE [dbo].[Price] ADD
CONSTRAINT [FK_Price_Pack] FOREIGN KEY
(
[Pack_ID]
) REFERENCES [dbo].[Pack] (
[id]
)
GOYou could create a DTS package that loads tables - perhaps, work tables -
with data via either a data pump, or more likely, a bulk insert.
You didn't post sample data from the flat file, so it's hard to say for
sure.
Tom
----
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Columnist, SQL Server Professional
Toronto, ON Canada
www.pinpub.com
.
"Griff" <howling@.the.moon> wrote in message
news:ujMIIulJGHA.3064@.TK2MSFTNGP10.phx.gbl...
I have a "flat" CSV file that I need to import into a set of 3 relational
tables.
As a "programmer", I thought this would be trivial, so using VB6 and ADO, I
created a recordset of all the import rows and then iterated through the
recordset extracting the data from the fields and putting them into the
relational tables. This worked fine with my test data but rather obviously
takes way too long with real data; for every row, I have a network
round-trip to update the first table and retrieve the primary key's ID, I
then have another round trip to update the second table and get it's ID, and
a third network round trip for the third table. All in all, we're talking >
7 hours to complete.
I need to knock this down to minutes...
I'm guessing that I can do this with a single SQL command sent to the
server, but my SQL skills aren't really up to it. I'm hoping that this is
relatively trivial and that someone could provide a few hints.
If it's any help, I've created some example tables (the real ones are
obviously more complicated) - see end of message.
Thanks in advance
Griff
--
--The table holding the import table:
CREATE TABLE [dbo].[inputData] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[product] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[pack] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[price] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL
) ON [PRIMARY]
-- Data to fill this:
insert into [inputData] (product,pack,price) values ('diary','T','100')
insert into [inputData] (product,pack,price) values ('diary','R','5')
insert into [inputData] (product,pack,price) values ('envelope','T','1000')
insert into [inputData] (product,pack,price) values ('envelope','R','100')
insert into [inputData] (product,pack,price) values ('hatstand','T','5')
insert into [inputData] (product,pack,price) values ('hatstand','R','1')
insert into [inputData] (product,pack,price) values ('lamp','T','10')
insert into [inputData] (product,pack,price) values ('lamp','R','2')
-- The following relational tables
CREATE TABLE [dbo].[Product] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[catalogueProduct] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Product] ADD
CONSTRAINT [PK_Product] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Pack] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[Product_ID] [int] NOT NULL ,
[Pack_Details] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Pack] ADD
CONSTRAINT [PK_Pack] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Price] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[Price_Details] [varchar] (50) COLLATE Latin1_General_CI_AS NOT NULL ,
[Pack_ID] [int] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Price] ADD
CONSTRAINT [PK_Price] PRIMARY KEY CLUSTERED
(
[id]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Pack] ADD
CONSTRAINT [FK_Pack_Product] FOREIGN KEY
(
[Product_ID]
) REFERENCES [dbo].[Product] (
[id]
)
GO
ALTER TABLE [dbo].[Price] ADD
CONSTRAINT [FK_Price_Pack] FOREIGN KEY
(
[Pack_ID]
) REFERENCES [dbo].[Pack] (
[id]
)
GO|||> You didn't post sample data from the flat file, so it's hard to say for
> sure.
The sample data was in the [inputData] table (see insert statements).
I can see the advantage of DTS, but it's going to be difficult to get that
to work in our current application.
I guess what I really want is a way to say, "insert this information into
these three tables, ensuring that the referential integrity is maintained"
Griff|||Griff wrote:
> I have a "flat" CSV file that I need to import into a set of 3 relational
> tables.
> As a "programmer", I thought this would be trivial, so using VB6 and ADO,
I
> created a recordset of all the import rows and then iterated through the
> recordset extracting the data from the fields and putting them into the
> relational tables. This worked fine with my test data but rather obviousl
y
> takes way too long with real data; for every row, I have a network
> round-trip to update the first table and retrieve the primary key's ID, I
> then have another round trip to update the second table and get it's ID, a
nd
> a third network round trip for the third table. All in all, we're talking
>
> 7 hours to complete.
> I need to knock this down to minutes...
> I'm guessing that I can do this with a single SQL command sent to the
> server, but my SQL skills aren't really up to it. I'm hoping that this is
> relatively trivial and that someone could provide a few hints.
> If it's any help, I've created some example tables (the real ones are
> obviously more complicated) - see end of message.
> Thanks in advance
> Griff
> --
You've made the classic mistake of not declaring any alternate keys.
Leaving them out is not an option if the PK is an IDENTITY - otherwise
redundancy and lack of integritry will rule and your problems will be
much harder to solve. Here's what I suggest:
ALTER TABLE product ADD CONSTRAINT ak1_product UNIQUE
(catalogueproduct);
ALTER TABLE pack ADD CONSTRAINT ak1_pack UNIQUE (product_id,
pack_details);
ALTER TABLE price ADD CONSTRAINT ak1_price UNIQUE (pack_id /* ? */);
Now populate all three tables with three INSERTs:
INSERT INTO product (catalogueproduct)
SELECT DISTINCT product
FROM inputdata ;
INSERT INTO pack (product_id, pack_details)
SELECT DISTINCT P.id, pack
FROM inputdata AS D
JOIN product AS P
ON D.product = P.catalogueproduct ;
INSERT INTO price (price_details, pack_id)
SELECT DISTINCT D.price, A.id
FROM inputdata AS D
JOIN product AS P
ON D.product = P.catalogueproduct
JOIN pack AS A
ON D.pack = A.pack_details
AND A.product_id = P.id ;
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||File layouts can be significant, so posting INSERT statements is not really
helpful. Is what you're doing:
1) Populating inputData
2) Populating the other 3 tables, based on what you've loaded into inputData
If so, then inputData is just a work table. Once loaded - via bulk insert -
you then can kick off 3 SQL queries to insert rows into the other tables.
See David's remarks re FK's.
Tom
----
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Columnist, SQL Server Professional
Toronto, ON Canada
www.pinpub.com
"Griff" <howling@.the.moon> wrote in message
news:O8%232VgmJGHA.668@.TK2MSFTNGP11.phx.gbl...
> You didn't post sample data from the flat file, so it's hard to say for
> sure.
The sample data was in the [inputData] table (see insert statements).
I can see the advantage of DTS, but it's going to be difficult to get that
to work in our current application.
I guess what I really want is a way to say, "insert this information into
these three tables, ensuring that the referential integrity is maintained"
Griff|||Thank you Dave & Tom very much for your help.
The example tables were admittedly rather poor - the real ones have
composite primary keys that ensure that the relevant fields are unique -
sorry for not including that in the example tables.
Regarding Tom's comment - I was going to load the data from the CSV file
into a DB table rather than work directly on the CSV file. The reason for
this is that there are several intervening steps that were seemingly
irrelevant for this exercise.
The approach you have suggested should work fine - thanks for the idea.
Griff|||Looks like the DTS package is the route to go:
1) Create 3 SQL Server connections.
2) Load the table via a bulk insert task.
3) Run 2 Execute SQL tasks in parallel - one on each SQL connection - to
load the other tables. You'll be using SELECT DISTINCT in the queries.
Tom
----
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Columnist, SQL Server Professional
Toronto, ON Canada
www.pinpub.com
"Griff" <howling@.the.moon> wrote in message
news:uc9RMinJGHA.3856@.TK2MSFTNGP12.phx.gbl...
Thank you Dave & Tom very much for your help.
The example tables were admittedly rather poor - the real ones have
composite primary keys that ensure that the relevant fields are unique -
sorry for not including that in the example tables.
Regarding Tom's comment - I was going to load the data from the CSV file
into a DB table rather than work directly on the CSV file. The reason for
this is that there are several intervening steps that were seemingly
irrelevant for this exercise.
The approach you have suggested should work fine - thanks for the idea.
Griff

No comments:

Post a Comment