[jpa-spec users] [jsr338-experts] Re: JPA schema generation

From: Gordon Yorke <gordon.yorke_at_oracle.com>
Date: Tue, 15 May 2012 17:52:23 -0300

Thanks Linda,
Some clarifications below.

On 15/05/2012 5:37 PM, Linda DeMichiel wrote:
> Hi Gordon, all,
>
> Thanks for the feedback. More in-line below....
>
> -Linda
>
>
> On 5/15/2012 12:22 PM, Gordon Yorke wrote:
>> Hello Linda,
>> Please find some feedback below.
>> --Gordon
>>
>> On 09/05/2012 3:36 PM, Linda DeMichiel wrote:
>>>
>>> In conjunction with support for use of JPA in PaaS environments, we
>>> will need to address the issue of schema generation. After we discuss
>>> this area, I'd like to return to a broader discussion of multitenancy
>>> support with this in view.
>>>
>>>
>>> Here's what I think we'll need to address:
>>>
>>> 1. How/when schema generation happens:
>>> a. at a deployment time phase (or predeployment time phase)
>>> b. at EMF creation time
>>>
>>> 2. Providing an option whereby the application can supply DDL for
>>> schema generation as an alternative to the persistence provider
>>> doing schema generation based on the ORM metadata
>>>
>>> 3. How, once schema generation has happened, tables can get loaded with
>>> data. If we enable the application to provide DDL scripts for
>>> schema generation, then supporting the use of SQL bulk insert scripts
>>> would go hand-in-hand with this, but perhaps there are also other
>>> approaches, e.g., bulk insert via JPQL(?).
>>>
>>> 4. The need for additional ORM metadata: e.g. metadata for index
>>> generation; for foreign key constraints; other(?).
>>>
>>> 5. More precise definition of existing metadata that can be used
>>> for schema generation: what it comprises, and how it can/must
>>> be used.
>>>
>>> 6. Additional persistence.xml level metadata and additional metadata
>>> the platform provider passes to the persistence provider to control
>>> or configure the operations outlined above.
>>>
>>>
>>> I am assuming for that Java EE 7 we need to support both the
>>> separate database and shared database (aka separate schema) approaches.
>>>
>>> I am also assuming that it is the platform provider that does the
>>> provisioning of either the database or the schema respectively (i.e.,
>>> in accordance with whether the separate database or shared database
>>> approach is taken). I.e., in the separate database approach, I'm
>>> assuming that it is the persistence provider that creates the schemas
>>> directly in the database or creates the DDL to generate the schemas
>>> needed for the persistence unit. [I'll use the term "schema
>>> generation" generically below to mean either table generation or
>>> schema+table generation with this assumption in view.]
>>>
>>>
>>> Here's a proposal to get the discussion started.
>>>
>>> SCHEMA GENERATION
>>>
>>> I think we should allow schema generation to be done either as a
>>> deployment (or pre-deployment) phase or at createEMF time.
>> Both would be a good. We currently support ddl generation at
>> createEMF time, expanding the new functionality to this
>> point in time seems like a good idea as well
>>>
>>> Deployment-time (or predeployment-time) schema generation could be
>>> done either by the platform provider calling into the
>>> PersistenceProvider implementation to generate the database schema or
>>> by the platform provider running SQL DDL scripts.
>> Or a combination of both. Should the deployer require specific
>> structures not supported by the persistence provider, the
>> provider could have the bulk of the generation completed by the
>> persistence provider then tweak the db structures with
>> scripts.
>>>
>>> SQL DDL scripts might be provided with the application (bundled as
>>> part of the persistence unit) or the platform provider might invoke
>>> the PersistenceProvider implementation to generate them from the
>>> persistence unit metadata. If scripts are provided or generated, it
>>> should be the platform provider's responsibility to run them against
>> Do you mean it is the platform provider's responsibility to invoke
>> the deployment time interfaces of the persistence
>> provider to have the scripts run? If we are going to have the
>> functionality available in the persistence provider we
>> should keep the roles distinct.
>
> What I meant was that the platform provider would run the scripts.
> The scripts
> might have been packaged with the app or the platform provider might
> have called
> into the persistence provider to have generated them.
It would be better if the Persistence provider was the only script
executor. I suspect the persistence provider will need to have this
functionality anyway.
>
>>> the database. If the application has provided DDL scripts, the
>>> application should also have the option to provide SQL scripts to do
>>> a bulk insert of data. It would be the responsibility of the
>>> platform provider to run these as well.
>>>
>>> An alternative to the use of scripts is for the platform provider to
>>> call into the PersistenceProvider to directly generate the tables in
>>> the database. In the case that a database has been provisioned rather
>>> than only a schema, the persistence provider would generate the
>>> schemas and tables.
>> It may be problematic to have persistence providers generate schema.
>> In many databases user security permissions would
>> need to be updated to allow access to the new schema and there is
>> currently no metadata to define users and their
>> security permissions. Defining that metadata may be more than we want
>> the spec to take on. We should require the
>> platform provider to provision users along with the database.
>
> I was proposing that the persistence provider only create the schemas
> if the
> persistence provider "owned" the database -- i.e., had privileges to
> create
> schemas in the database. This would correspond to the SEPARATE_DATABASE
> approach and to the SaaS approach where tables were striped across
> tenants
> and multiple schemas were used. That said, we could take the
> alternative
> approach where the platform provider always created the schemas, but then
> the platform provider would need to be supplied with the metadata as to
> which schemas to create. (I.e., I want to avoid having the platform
> provider analyze the mapping annotations and orm.xml).
> In the case that the database is a shared database and a tenant
> corresponds
> to a schema, I was proposing that the platform provider create the
> database
> user (aka authid) for the tenant -- which would thus result in the
> creation
> of the default schema for the tenant.
So I was not clear. I was talking about login credentials specifically,
not tenants. We should not expect the persistence provider to generate
login credentials from some new metadata. The platform provider would
be responsible for ensuring the login credentials would have access to
the persistence provider generated schema.
>
>>>
>>> For the sake of argument, let's assume that we add a generateSchema()
>>> method to the PersistenceProvider interface, which the platform
>>> provider invokes during the deployment phase if SQL scripts have not
>>> been provided by the application.
>>>
>>> Information that needs to be passed in would include the following:
>>>
>>> (1) Information that specifies how schema generation is to proceed.
>>> This information could take the form of a SchemaGenerationInfo
>>> interface or it could be in the form of additional standard properties.
>>> (2) Connection to the database or schema. [Is this a Connection or a
>>> DataSource ?]
>> We should use the current createEMF APIs where connection info or a
>> datasource can be provided. We may want to put these
>> new APIs on the EMF
>>> (3) PersistenceUnitInfo
>> We should allow for SE type deployments as well where the persistence
>> provider discovers the PUInfo. The platform
>> provider may not be a EE platform and should not be required to
>> implement that role in the SPI.
>>>
>>>
>>>
>>> The SchemaGenerationInfo should include the following:
>>>
>>> ddl-generation-mode: GENERATE_AND_EXECUTE | GENERATE_ONLY |
>>> EXECUTE_ONLY
>>>
>>> GENERATE_AND_EXECUTE => generate ddl scripts and generate schema (if
>>> applicable) and tables directly in the database
>>>
>>> GENERATE_ONLY => generate ddl scripts
>>>
>>> EXECUTE_ONLY => generate schema (if applicable) and tables directly in
>>> the database
>>>
>>> create-ddl-target: a Writer configured for the persistence provider
>>> for outputting of the DDL scripts for the creation of schema/tables.
>>> This should be null if EXECUTE_ONLY ddl-generation-mode is specified.
>>>
>>> drop-ddl-target: a Writer configured for the persistence provider
>>> for outputting of the DDL scripts for the dropping of schema/tables.
>>> The should be null if EXECUTE_ONLY ddl-generation-mode is specified.
>>>
>>> ddl-execution-mode: Used when executing directly against the database,
>>> rather than generating scripts. Options should include:
>>> CREATE_TABLES | DROP_TABLES | DROP_AND_CREATE_TABLES
>>> Do we also need to distinguish CREATE_SCHEMAS_AND_TABLES ?
>>> DROP_SCHEMA ?
>>> Or can these be implicit?
>> Schema generation should be implicit.
>>> Do we also need an ALTER_TABLES capability here?
>> Perhaps optionally.
>>>
>>> properties: These could include any vendor-specific properties
>>>
>>>
>>> ADDITIONAL METADATA (persistence.xml)
>>>
>>> If scripts are provided with the application (i.e., bundled as part
>>> of the persistence unit), we should add entries in the persistence.xml
>>> to identify them:
>>> create-ddl-script
>>> drop-ddl-script
>>> data-load-script(s) [Do we need to support more than one load script?
>>> If so, the ordering may need to be specified as well.]
>>>
>>> Open issue: do we need metadata in the persistence.xml for
>>> ddl-execution-mode as well? For example, this might be used in
>>> standalone scenarios (?)
>> Properties should be sufficient for this metadata.
>
> Just to be sure -- did you mean for just ddl-execution-mode or for the
> scripts as well? I was proposing separate elements for the scripts.
both, I was suggesting new properties instead of elements. Properties
will be required to allow a predefined PU to perform schema generation
for a particular deployment.
>
>>>
>>>
>>> ORM METADATA (annotations and orm.xml)
>>>
>>> (1) Indexes
>>>
>>> I think we need to support indexes.
>>>
>>> Here's a strawman annotation:
>>>
>>> @Target({}) @Retention(RUNTIME)
>>> public @interface Index {
>>> String name() default "";
>>> String columnList();
>>> }
>>>
>>> The columnList syntax could follow that of the OrderBy annotation:
>>>
>>> columnList::= indexColumn [, indexColumn]*
>>> indexColumn::= columnName [ASC | DESC]
>>>
>>> If ASC or DESC is not specified, ASC is assumed
>>>
>>> We'd also need to add
>>> Index[] indexes() default {};
>>> to Table, Secondary Table, CollectionTable, JoinTable, and
>>> TableGenerator
>>>
>>>
>>> (2) Foreign Key Constraints
>> We also have the user supplied scripts to fall back on. If the
>> persistence provider's functionality is insufficient
>> scripts can be deployed as well. FK definition is likely to devolve
>> into a definition string.
>>>
>>> I see two possible approaches:
>>>
>>> (a) Add a string-valued foreignKeyDefinition element to JoinColumn,
>>> JoinColumns, MapKeyJoinColumn, etc. to specify a SQL fragment
>>> (intended along the lines of columnDefinition) for defining a foreign
>>> key constraint and/or for overriding the persistence provider's
>>> default foreign key definition. It might also be helpful to allow a
>>> name for the foreign key constraint to be specified for the case where
>>> the provider is using its default foreign key generation strategy.
>>>
>>> or
>>>
>>> (b) Add a ForeignKey annotation to specify the foreign key
>>> constraint --
>>> for example,
>>>
>>> public @interface ForeignKey {
>>> String name() default "";
>>> ForeignKeyAction updateAction() default NO_ACTION;
>>> ForeignKeyAction deleteAction() default NO_ACTION;
>>> boolean deferred() default false;
>>> }
>>>
>>> public enum ForeignKeyAction {NO_ACTION, CASCADE, SET_NULL};
>>>
>>> I have some misgivings about approach (b), and, given a lack of
>>> database portability here, we might wind up needing a
>>> foreignKeyDefinition
>>> fragment anyway.
>>>
>>>
>>> (3) Other
>>>
>>> What about the ability to distinguish a CHAR rather than VARCHAR
>>> mapping for strings? Or should we just leave this as belonging in a
>>> SQL fragment (with the understanding that we would be intentionally
>>> discouraging the use of CHAR strings)?
>> The fragment should be sufficient.
>>>
>>> (4) Anything else I'm missing?
>>>
>>>
>>> APIs
>>>
>>> I assumed the addition of a PersistenceProvider.generateSchema() method
>>> above.
>> This generateSchema() method should be placed on the EMF. The there
>> is no need to have a separate API on
>> PersistenceProvider.
>>>
>>> If we also support schema generation at EMF creation time, do we want
>>> another createContainerEntityManagerFactory() method that takes a
>>> SchemaGenerationInfo as well as PersistenceUnitInfo argument ?
>> Having the GenerationInfo contents supplied through properties means
>> we do not need new APIs.
>>>
>>> If generation is done directly in the database, at some point the
>>> container (or application) may need to have the schema/tables dropped.
>>> I'm not sure of the best way to support this. A dropAndClose() method
>>> on the EMF??
>>>
>>>
>>>